|
soft-clipping of reads may add unwanted alignments to repetitive regions
Difference between soft and hard clipping
soft-clipped: bases at the 5' and/or 3' end of the read are NOT part of the alignment.
hard-clipped: bases at the 5' and/or 3' end of the read are NOT part of the alignment AND those bases have been removed from the read sequence in the BAM file. The 'real' sequence length would be length(SEQ) + count-of-hard-clipped-bases.
Hard masked bases do not appear in the SEQ string, soft masked bases do.
So, if your cigar is: `10H10M10H` then the SEQ will only be 10 bases long.
If your cigar is `10S10M10S` then the SEQ and base-quals will be 30 bases long.
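The two CIGAR examples above can be checked with a small sketch. The helper names (`parse_cigar`, `seq_length`, `original_read_length`) are made up for illustration; the only assumption is the SAM convention that M, I, S, = and X consume read bases stored in SEQ, while H does not.

```python
import re

# One CIGAR operation: a run length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def seq_length(cigar):
    """Number of bases stored in SEQ/QUAL (hard-clipped bases are absent)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def original_read_length(cigar):
    """Stored SEQ length plus the hard-clipped bases that were removed."""
    hard = sum(n for n, op in parse_cigar(cigar) if op == "H")
    return seq_length(cigar) + hard


print(seq_length("10H10M10H"))            # 10
print(seq_length("10S10M10S"))            # 30
print(original_read_length("10H10M10H"))  # 30
```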
In the case of soft-masking, even though the bases are present in SEQ, they are not part of the alignment, so variant callers generally ignore them and most viewers do not display them by default. In either case, masked bases should not be used in calculating coverage.
Both kinds of masking are different from deletions. Masking simply means that part of the read could not be aligned to the genome at this position (simplified, but a reasonable assumption for most cases, I think). A deletion means that a stretch of the genome is not present in the sample and therefore not in the reads.
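One way to see the difference between clipping and a deletion is to count how many reference bases an alignment covers. This is a rough sketch with a hypothetical helper name (`reference_span`); it assumes the SAM convention that M, D, N, = and X consume reference positions, while S, H, I and P do not.

```python
import re


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")


# Clipped bases add nothing to the span on the reference.
print(reference_span("10S10M10S"))  # 10
print(reference_span("10H10M10H"))  # 10

# A deletion is absent from the read but still spans reference bases.
print(reference_span("10M5D10M"))   # 25
```

So a clipped read contributes coverage only over its aligned part, whereas a deletion widens the stretch of reference the read spans without adding any read bases.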
I'm not sure when H is used instead of the S and vice-versa. I would like to know that.
My understanding of the choice between soft-clipping and hard-clipping is that hard-clipping is applied when the clipped bases align elsewhere in the reference genome, i.e. for chimeric reads. At least in bwa, this appears to be when hard clipping is used. I'm not sure about other aligners.
bwa-mem 0.7.5 release notes from http://seqanswers.com/forums/showthread.php?t=31237:
"Changed the way a chimeric alignment is reported (conforming to the upcoming
SAM spec v1.5). With 0.7.5, if the read has a chimeric alignment, the paired
or the top hit uses soft clipping and is marked with neither 0x800 nor 0x100
bits. All the other hits part of the chimeric alignment will use hard
clipping and be marked with 0x800 if option "-M" is not in use, or marked
with 0x100 otherwise."
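Going by the release note quoted above, the flag bits tell you which record of a chimeric alignment you are looking at. A minimal sketch follows; the function name and the returned labels are chosen here for illustration, and only the two bit values come from the quote.

```python
SECONDARY = 0x100       # set on the extra hits when bwa mem is run with -M
SUPPLEMENTARY = 0x800   # set on the extra hits when -M is not used


def alignment_kind(flag):
    """Classify a SAM record by its 0x100 / 0x800 flag bits."""
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"


print(alignment_kind(163))   # primary
print(alignment_kind(2064))  # supplementary (0x800 + 0x10)
print(alignment_kind(272))   # secondary (0x100 + 0x10)
```

Per the quote, the record with neither bit set is the one expected to carry soft clipping, and the others are the ones expected to carry hard clipping.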
Next question: how do you distinguish soft clipping from unmapped reads?
I'm quite confused about these two terms. I'm reading about Pindel, the split-read algorithm. The author seems to make use of the information in "unmapped" reads. There are also other split-read-based algorithms that use "soft-clipped" reads, i.e. the unaligned parts of reads.
To my eyes, the two look quite similar. Say we have a 100bp read, 50bp of which cannot be mapped while the other 50bp can. How would BWA categorize this read? Will BWA treat it as an "unmapped" read because 50bp cannot be mapped, or as "mapped" but with 50bp of "soft-clipped" sequence?
Or does BWA have a scoring system for mapping that sets a threshold for distinguishing the two?
thx
edit: maybe this is related to "centeredness"? Say the breakpoint splits the read 99:1; then the 99bp will be mapped with 1bp as "soft-clipped" sequence. But for 50:50, BWA may regard the read as "unmapped".
As I understand the terminology, it will be "mapped" but with 50bp of "soft-clipped" sequence. Unmapped reads have no part of their sequence aligned to the reference.
I'm not an expert on read mapping and am also still trying to get to grips with it. But from my experience there are cases in which BWA reports extensively soft-clipped reads as matches. Here's an example from a paired end Illumina sequencing project:
CTCAG_6_1205_14418_171577_2 163 gi|261748867|gb|CM000804.1| 25090342 17 61S20M = 25090377 116 TGCAGCCCCGCTTTGGTGAAAAAACAAGATAGGAACTGTTGTTGTTCAACTGTACTGTCACCTGCAGCACACACAACCTCC bbbeeeeegggggiiighhiiiiiiiiiiihiifhiiiiiihiihhhihihihiiiggggggeeeeedddcdccccccccc RG:Z:FCC0ACBACXX_L6_4 XT:A:M NM:i:0 SM:i:17 AM:i:17 XM:i:0 XO:i:0 XG:i:0 MD:Z:20
As you can see in the CIGAR string 61S20M, 61bp have been soft-clipped from the beginning of the read. The flag 163 (=128+32+2+1) indicates that the read was mapped (the 0x4 bit, i.e. "unmapped", is not set), paired, mapped in proper pair, second in pair and that its mate mapped to the reverse strand (check out this great site for decoding SAM bit flags).
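The flag arithmetic and the amount of clipping in this record can be reproduced with a short sketch. The bit values follow the SAM specification; the short labels and the helper names (`decode_flag`, `soft_clipped_fraction`) are my own, not part of any tool.

```python
import re

# SAM flag bits; the labels are shorthand chosen for this example.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def soft_clipped_fraction(cigar):
    """Fraction of the stored read bases (SEQ) that are soft-clipped."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    stored = sum(n for n, op in ops if op in "MIS=X")
    soft = sum(n for n, op in ops if op == "S")
    return soft / stored if stored else 0.0


print(decode_flag(163))
# ['paired', 'proper_pair', 'mate_reverse', 'second_in_pair']
print(round(soft_clipped_fraction("61S20M"), 2))  # 0.75
```

A fraction like this could serve as a post-alignment filter for heavily clipped reads, with the cutoff left to the user; this is only a suggestion for downstream processing, not a BWA option.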
So it seems that even with >50% soft-clipping BWA reports reads as mapped. So far I have not figured out how to tell BWA not to do that, which is what I would actually prefer. |
|