The sam2bam tool speeds up the time-consuming part of NGS data analysis

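Posted on 2017-3-5 10:49:56
Finally someone is doing some work in this area. Anyone who has done omics analysis knows that alignment and calling variants/peaks are no longer the problem; instead, sorting and compressing the alignment files is what takes especially long.
The tool described in this paper addresses exactly that: http://journals.plos.org/plosone ... ournal.pone.0167100
Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools
I haven't used it yet, but here is a thumbs-up in advance: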
Typical 50x coverage of whole genome sequencing (WGS) can easily generate up to 500-GB FASTQ files. The data processes on modern computers are still very time-consuming for such huge data sets.
An ordinary lab or company needs at least a day to complete a WGS analysis, and the main time-consuming steps are:
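[Shell]
# Convert SAM to BAM, then print mapping statistics
samtools view -bS read.sam > read.bam
samtools flagstat read.bam
# Sort by coordinate using 5 threads
samtools sort -@ 5 -o read.sorted.bam read.bam
# Drop unmapped reads (-F 4) and reads with MAPQ < 5 (-q 5), then remove duplicates
samtools view -b -F 4 -q 5 read.sorted.bam | samtools rmdup - read.filter.rmdup.bam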

The time goes into the steps above.
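To see where the time actually goes on your own data, each command can be wrapped in a small timer. This is only a minimal sketch: `timed` is a hypothetical helper name, not part of samtools, and it relies on `date +%s`, which GNU and BSD `date` both support.

```shell
# Run a command, report its wall-clock time on stderr, keep its exit status.
# "timed" is a hypothetical helper name used for illustration only.
timed() {
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "[$(( end - start ))s] $*" >&2
  return "$status"
}

# Example (file names as in the post above):
#   timed samtools sort -@ 5 -o read.sorted.bam read.bam
```

Because a redirection such as `> read.bam` is not part of the wrapped command, a step that writes through `>` is easier to time when it uses the tool's own output option, for example `samtools view -b -o read.bam read.sam`.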

Pre-processing for the DNA data usually involves five steps.
  • Mapping sequence reads to reference genome: This step is usually done by BWA mem [7] or other reference alignment tools, and SAM files are generated.
  • Sorting sequence reads based on coordinates: This step is usually done by Picard SortSam or samtools sort. Some tools used in the following steps, such as Picard MarkDuplicates, require sorted input files.
  • Marking duplicate alignments: This step is commonly done by the Picard MarkDuplicates tool to remove the alignments of duplicate reads.
  • Performing local realignment around indels: This is usually done by GATK RealignerTargetCreator and IndelRealigner tools to reduce artifacts produced in regions around the indels.
  • Recalibrating the base quality score: This step is usually done by GATK BaseRecalibrator and PrintReads tools to improve the accuracy of base quality scores that the variant calling step relies on.
Preprocessing is very time-consuming; it usually requires tens of hours for a WGS dataset and hours for a whole exome (WEX) dataset.
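The five steps listed above can be sketched as one command per tool. The function below only prints the commands, it does not run them. Everything concrete in it is an assumption for illustration: the file names, the thread counts, `picard.jar`, `known.vcf`, and a GATK 3.x jar named `GenomeAnalysisTK.jar` (the realignment tools named in the quote belong to GATK 3). A real workflow also needs read groups and BAM indexes, which are left out here.

```shell
# Print (not run) one command per preprocessing step.
# All file names and jar locations are hypothetical placeholders.
preprocess_plan() {
  ref=$1   # reference FASTA, e.g. ref.fa
  s=$2     # sample prefix, e.g. sample
  gatk="java -jar GenomeAnalysisTK.jar"   # assumes a GATK 3.x jar
  cat <<EOF
bwa mem -t 8 $ref ${s}_1.fq ${s}_2.fq > $s.sam
samtools sort -@ 5 -o $s.sorted.bam $s.sam
java -jar picard.jar MarkDuplicates I=$s.sorted.bam O=$s.dedup.bam M=$s.metrics.txt
$gatk -T RealignerTargetCreator -R $ref -I $s.dedup.bam -o $s.intervals
$gatk -T IndelRealigner -R $ref -I $s.dedup.bam -targetIntervals $s.intervals -o $s.realigned.bam
$gatk -T BaseRecalibrator -R $ref -I $s.realigned.bam -knownSites known.vcf -o $s.recal.table
$gatk -T PrintReads -R $ref -I $s.realigned.bam -BQSR $s.recal.table -o $s.recal.bam
EOF
}

preprocess_plan ref.fa sample
```

Local realignment and base recalibration each take two commands, so the five steps expand to seven invocations, each reading and writing a full-size file. That repeated I/O is the cost the thread is discussing.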




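Posted on 2017-3-6 14:58:41
This is quite right. In fact there is no need to save the SAM file at all. The common practice now is to use a pipe. Disk reads and writes take up a considerable share of the time, and using a pipe is much more efficient. In English this seems to be called "on the fly".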

[Shell]
# Stream the aligner output straight into samtools; no SAM file is written to disk
hisat2 .........  2>sample.log | samtools view -Sb - | samtools sort -@ 5 -o sample.bam -
samtools index sample.bam

If you need filtering, you can add the filter options directly after samtools view -Sb, for example -F 4.
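For readers unsure what such a filter keeps: `-F 4` drops records whose FLAG field has bit 0x4 (read unmapped) set, and `-q 5`, used in the first post, drops records with MAPQ below 5. The awk sketch below mimics that logic on plain SAM text so it can be tried without samtools. `sam_filter` is a hypothetical name, the three reads are made-up test records, and this is an illustration of the rule, not a replacement for `samtools view`.

```shell
# Emulate `samtools view -h -F 4 -q 5` on SAM text read from stdin.
# "sam_filter" is a hypothetical helper name used for illustration only.
sam_filter() {
  awk -F '\t' '
    /^@/ { print; next }             # header lines pass through (-h)
    int($2 / 4) % 2 == 1 { next }    # FLAG bit 0x4 set: unmapped (-F 4)
    $5 + 0 < 5 { next }              # MAPQ below 5 (-q 5)
    { print }
  '
}

# Made-up input: r1 is mapped with MAPQ 60, r2 is unmapped, r3 has MAPQ 3.
printf '@HD\tVN:1.6\nr1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr3\t16\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII\n' | sam_filter
```

Only the header line and `r1` pass the filter. Inside a pipe, the same filtering happens before sorting, so the sort step has fewer records to handle.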

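Posted on 2018-5-9 00:16:32
There seem to be problems compiling this software. I am not sure whether my build succeeded; some errors seemed to come up during the process.
My Weibo: dulunar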