
Difference between soft and hard clipping

Posted 2017-3-1 10:04:55
soft-clipping of reads may add unwanted alignments to repetitive regions

soft-clipped: bases at the 5' and/or 3' end of the read are NOT part of the alignment.

hard-clipped: bases at the 5' and/or 3' end of the read are NOT part of the alignment AND those bases have been removed from the read sequence in the BAM file. The 'real' sequence length would be length(SEQ) + count-of-hard-clipped-bases.

Hard-clipped bases do not appear in the SEQ string; soft-clipped bases do.

So, if your CIGAR is `10H10M10H` then the SEQ will only be 10 bases long.

If your CIGAR is `10S10M10S` then the SEQ and base qualities will be 30 bases long.
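The two CIGAR examples above can be checked with a small sketch. It assumes only the standard CIGAR operation letters; the function name `cigar_lengths` is made up for illustration and is not part of any library.

```python
import re

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_lengths(cigar):
    """Return (length of SEQ, length of the original read) implied by a CIGAR."""
    seq_len = 0
    hard_clipped = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MIS=X":      # operations whose bases are stored in SEQ
            seq_len += n
        elif op == "H":        # hard-clipped bases are absent from SEQ
            hard_clipped += n
    return seq_len, seq_len + hard_clipped

print(cigar_lengths("10H10M10H"))  # (10, 30)
print(cigar_lengths("10S10M10S"))  # (30, 30)
```

In both cases the original read was 30 bases long; only the hard-clipped record has lost 20 of them from SEQ.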

In the case of soft clipping, even though the clipped bases are present in SEQ, they are generally not used by variant callers and are typically hidden by default when you view your data in a viewer. In either case, clipped bases should not be used in calculating coverage.

Both kinds of clipping are different from deletions. Clipping simply means that part of the read could not be aligned to the genome at this position (simplified, but a reasonable assumption for most cases, I think). A deletion means that a stretch of the genome is not present in the sample and therefore not in the reads.
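To make the difference between clipping and deletions concrete, here is a rough sketch of how coverage blocks could be derived from POS and CIGAR. The helper name `aligned_blocks` is hypothetical; coordinates are 1-based and inclusive, as in the SAM POS field. Clipped bases (S, H) never advance along the reference, while a deletion (D) advances along the reference without contributing read bases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_blocks(pos, cigar):
    """Return reference intervals (1-based, inclusive) covered by aligned bases."""
    blocks = []
    ref_pos = pos
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":                    # aligned bases: count towards coverage
            blocks.append((ref_pos, ref_pos + n - 1))
            ref_pos += n
        elif op in "DN":                   # deletion / skipped region: a gap in the read
            ref_pos += n
        # I, S, H and P do not advance along the reference
    return blocks

print(aligned_blocks(100, "10S10M10S"))  # [(100, 109)]
print(aligned_blocks(100, "5M2D5M"))     # [(100, 104), (107, 111)]
```

The soft-clipped read covers only its 10 matched bases, whereas the read with a deletion spans 12 reference positions but covers 10 of them.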

I'm not sure when H is used instead of the S and vice-versa. I would like to know that.

My understanding of the choice between soft clipping and hard clipping is that hard clipping is applied when the clipped bases align elsewhere in the reference genome, i.e. in chimeric reads. At least in bwa this appears to be when hard clipping is used. I'm not sure about other aligners.

bwa-mem 0.7.5 release notes from http://seqanswers.com/forums/showthread.php?t=31237:

"Changed the way a chimeric alignment is reported (conforming to the upcoming
SAM spec v1.5). With 0.7.5, if the read has a chimeric alignment, the paired
or the top hit uses soft clipping and is marked with neither 0x800 nor 0x100
bits. All the other hits part of the chimeric alignment will use hard
clipping and be marked with 0x800 if option "-M" is not in use, or marked
with 0x100 otherwise."
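As a rough illustration of the release note above, the flag bits it mentions can be tested directly. This is only a sketch: the function name `alignment_kind` is invented here, and the bit meanings (0x4 unmapped, 0x100 secondary, 0x800 supplementary) are taken from the SAM specification.

```python
def alignment_kind(flag):
    """Classify a SAM record by the flag bits named in the release note."""
    if flag & 0x4:
        return "unmapped"
    if flag & 0x800:
        return "supplementary"   # other chimeric hits when -M is not in use
    if flag & 0x100:
        return "secondary"       # other chimeric hits when -M is in use
    return "primary"             # the top hit, which uses soft clipping

print(alignment_kind(163))    # primary
print(alignment_kind(2064))   # supplementary (0x800 + 0x10)
print(alignment_kind(256))    # secondary
```

Under this reading, a hard-clipped CIGAR from bwa-mem would be expected on records classified as supplementary or secondary, and a soft-clipped one on the primary record.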





Next: how do you distinguish soft clipping from unmapped reads?

I'm quite confused about these two terms. I'm reading about Pindel, the split-read algorithm. The author seems to make use of the information in "unmapped" reads. There are also other split-read-based algorithms that use "soft-clipped" reads, i.e. the unaligned parts of reads.

To my eyes, the two look quite similar. Say we have a 100bp read, 50bp of which can be mapped while the other 50bp cannot. How would BWA categorize this read? Would BWA treat it as an "unmapped" read because 50bp cannot be mapped, or as "mapped" but with 50bp of "soft-clipped" sequence?

Or does BWA have a scoring system for mapping that sets a threshold for distinguishing the two?

thx

edit: maybe this is related to "centeredness"? Say the breakpoint is located at 99:1; then the 99bp will be mapped, with 1bp as "soft-clipped" sequence. But for 50:50, BWA may regard the read as "unmapped".

As I understand the terminology, it will be "mapped" but with 50bp of "soft-clipped" sequence. Unmapped reads have no part of their sequence aligned to the reference.

I'm not an expert on read mapping and am also still trying to get to grips with it. But from my experience there are cases in which BWA reports extensively soft-clipped reads as matches. Here's an example from a paired end Illumina sequencing project:

CTCAG_6_1205_14418_171577_2     163     gi|261748867|gb|CM000804.1|     25090342        17      61S20M  =       25090377        116     TGCAGCCCCGCTTTGGTGAAAAAACAAGATAGGAACTGTTGTTGTTCAACTGTACTGTCACCTGCAGCACACACAACCTCC       bbbeeeeegggggiiighhiiiiiiiiiiihiifhiiiiiihiihhhihihihiiiggggggeeeeedddcdccccccccc       RG:Z:FCC0ACBACXX_L6_4   XT:A:M  NM:i:0  SM:i:17 AM:i:17 XM:i:0  XO:i:0  XG:i:0  MD:Z:20
As you can see from the CIGAR string 61S20M, 61bp have been soft-clipped from the beginning of the read. The flag 163 (=128+32+2+1) indicates that the read was mapped (the 0x4 "unmapped" bit is 0), paired, mapped in a proper pair, second in pair, and that its mate mapped to the reverse strand (check out this great site for decoding SAM bit flags).
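The flag arithmetic can also be done in a few lines instead of on a website. A minimal sketch, with bit values taken from the SAM specification; the short names in the table and the function `decode_flag` are my own labels, not official ones.

```python
# (bit value, short label) pairs for the SAM FLAG field.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]

print(decode_flag(163))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'second_in_pair']
```

Because "unmapped" is missing from the result, the record counts as mapped despite its 61 soft-clipped bases.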

So it seems that even with >50% soft-clipping BWA reports reads as mapped. So far I could not figure out how to tell BWA not to do that...which I would actually prefer.
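One possible workaround, if the aligner itself cannot be told to drop such reads, is to filter the SAM output afterwards. The sketch below is not a BWA option; it is a plain-Python post-filter, and the names `soft_clip_fraction` and `keep_sam_line` as well as the 0.5 threshold are assumptions chosen for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clip_fraction(cigar):
    """Fraction of the bases stored in SEQ that are soft-clipped."""
    soft = 0
    seq_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MIS=X":      # operations whose bases are stored in SEQ
            seq_len += n
        if op == "S":
            soft += n
    return soft / seq_len if seq_len else 0.0   # CIGAR "*" gives 0.0

def keep_sam_line(line, max_fraction=0.5):
    """Keep header lines and records whose soft-clipped fraction is small enough."""
    if line.startswith("@"):
        return True
    cigar = line.rstrip("\n").split("\t")[5]    # CIGAR is the 6th SAM column
    return soft_clip_fraction(cigar) <= max_fraction

print(round(soft_clip_fraction("61S20M"), 3))   # 0.753
```

With the default threshold, the 61S20M record shown above would be dropped, since about 75% of its bases are soft-clipped.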


