搜索
查看: 2450|回复: 5

[WGBS] BS-seq分析基本流程

[复制链接]

2

主题

9

帖子

147

积分

注册会员

Rank: 2

积分
147
发表于 2017-12-9 21:46:20 | 显示全部楼层 |阅读模式
相关软件
Linux computer with standard utility applications (including Perl)
R (http://www.r-project.org/; base installation only is needed)
Bismark (http://www.bioinformatics.babraham.ac.uk/projects/bismark/)
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
FASTX Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
SAMtools (http://samtools.sourceforge.net/)
bedtools (https://github.com/arq5x/bedtools2)

步骤
注:本流程中所用到的脚本链接 http://bio-protocol.org/attached ... 1019024350_8508.zip

A:数据的质控
        1. 复制序列到自己创建的文件夹中
        2. 解压
                gunzip file_name.txt.tar.gz
                tar xvfp file_name.txt.tar
        3. 使用factqc进行质控(http://hannonlab.cshl.edu/fastx_toolkit/
                Usage: fastqc file_name.txt
        4. 查看质控结果,从而进一步filter或者trim数据,去除接头和低质量的reads(less than 75% quality scores above 25)
        5. 对质控筛选过的reads再一次进行质控检测
                Usage: fastqc file_name_trimmed.fastq

B:使用bismark进行比对
        使用bismark将数据比对到参考基因组上(Krueger and Andrews, 2011)
        强烈建议在开始后面的数据分析之前好好读一读由bismark作者写的一篇综述(http://www.ncbi.nlm.nih.gov/pubmed/22290186) 。
                在这里,描述了对单端数据的比对,但是双末端也是可以的。
        1. 将参考基因组进行重亚硫酸盐的预处理转换(处理一次同样需求的以后无需再重新转换)
                Usage: bismark_genome_preparation --path_to_bowtie reference_genome/fasta/
        2. 将数据比对到已经转换过了的参考基因组上(对正向和反向的reads分别进行map)
                Usage: bismark --non_directional -o bismark_file_name_trimmed.n1l50 -n 1 -l 50 reference_genome file_name.trimmed.fastq
                        -non_directional: reads是非特异性的(有一个特异性的参数可以选择)
                        -o 输出文件夹目录
                        -n 允许最大的错配数
                        -l seed的长度(在<n的错配数情况下map的第一个nt的数量first number of nt that are mapped with <n mismatches)
                        *fastq: 输入文件
       
        From this point onwards the analysis follows different steps based on the genotype of the sample used to prepare the library.
        Option #1: Libraries from a single inbred strain

C:将bisamark比对得到文件从SAM转换成BAM文件,并进行排序和建立索引
        cd bismark_file_name_trimmed.n1l50
        Usage: SAM_to_BAM_sort_index.pl file_name.fastq_bismark.sam
        Expected output file: file_name.fastq_bismark.sorted.bam

D:将已经排序的BAM文件转换成SAM文件
        Usage: samtools view -h file_name.fastq_bismark.sorted.bam > file_name_sorted.sam Samtools (Li et al., 2009)

E:去除冗余的reads(Eliminate redundant reads)
        利用SAM文件,保持比对到相同的起始和终止稳点的每条链上只有一条序列(keep only one sequence per strand that maps to the same start and end position)
        去除PCR产生的重复序列。A log file (SAM_redundancy_stats.log.txt) of counts for each read is created.
                Usage: ./make_sorted_SAM_non-redundant_fewest_MM.pl file_name_sorted.sam > file_name_sorted_nr.sam
        这一步工作原理:
        The script reads a sorted SAM file and gets all the reads that map to the same position and strand and then
        1. Sorts them by decreasing prevalence(普遍的) and then by increasing number of mismatches (not counting bisulfite conversions) to the genome.
        2. The most prevalent read with the total highest quality string is kept.
        3. In the case of a tie, the read with the fewest number of mismatches is retained.
        4. In the case of another tie, it prints out every unique read.
                The output SAM file also has 3 new fields added on the end:
                CT == count = number of times this read was found in the input file
                MM == mismatches = number of mismatches compared to the genome sequence (not counting conversions)
                QS == quality score = total quality score for this read
F:计算每个胞嘧啶的甲基化值
        bismark_methylation_extractor提取已经排序的SAM文件中的甲基化信息
        Usage: bismark_methylation_extractor –s file_name_sorted_nr.sam -o bismark_output
        产生12个文件,三种甲基化(CG,CHG,CHH)的4种比对的单个胞嘧啶的甲基化信息:
                OT - original top strand   原始的正链
                CTOT - complementary to original top strand  原始的正链的互补链
                OB - original bottom strand 原始的负链
                CTOB - complementary to original bottom strand  原始的负链的互补链
       
        Shorthand for methylation call according to context:
                z: unmethylated C in CpG context
                Z: methylated C in CpG context
                x: unmethylated C in CHG context
                X: methylated C in CHG context
                h: unmethylated C in CHH context
                H: methylated C in CHH context
       
G:计算重亚硫酸盐的转换率
        The mean bisulfite conversion rate for each library is calculated based on the methylation status of each cytosine from reads mapping to the
        chloroplast(叶绿体) genome, which is expected to be unmethylated.
                1. Run Usage: ./methylation_summarizer_1.pl bismark_output > bismark_SampleName.summary_step1.txt
                2. 获取比对到叶绿体上的每个核酸的甲基化信息 Get info about methylation status of each nucleotide from each read mapping to the chloroplast.
                Usage:
                        grep ChrC bismark_*_trim3.n1l50.summary_step_1.txt > bismark_*_trim3.n1l50.summary_step_1.chloroplast.txt
                        ./methylation_summarizer_2_conversion.pl bismark_*_trim3.n1l50.summary_step_1.chloroplast.txt >bismark_*_trim3.n1l50.conversion.txt
                The output file contains the conversion at each position, and the mean C conversion across the chloroplast is printed to the screen.

Option #2: Libraries from F1 crosses
H: Map reads from hybrid libraries
        In the case of F1 hybrid libraries, two parental genomes may be available as reference. You may decide to map reads to both the parental genomes to
        maximize the number of mapped reads using Bismark as described above. To do that, first map all the reads to the best reference genome, then align
        the “unmapped” reads to the other genome.
        Make file of unmapped reads from the fastq file (e.g., reads not in SAM file).
                Usage: ./get_fastq_reads_not_in_SAM.pl fastqFile samFile > reads.unmapped.fastq
        To assign reads to a particular strain and to retain as many unique reads as possible, separate the reads by strand.
                Script: split_bismark_SAM_by_read_strand.pl
                Usage: split_bismark_SAM_by_read_strand.pl file_name_bismark.sam


I: Classify reads by parent-of-origin (allele specific DNA methylation analysis)
        Classify the forward reads according to their parent-of-origin with a bedfile where the C>T SNPs between the two genomes of interest are ignored but
        all otherSNPs are retained, and classify the reverse reads with a bedfile where the G>A SNPs between the two genomes of interest are ignored but all
        other SNPs are retained. There is only one bedfile, but read strand has to be specified (pos or neg).
                Script: split_bismark_SAM_by_read_strand.pl
                Usage: split_bismark_SAM_by_read_strand.pl file_name_bismark.sam
                Expected output files: file_name_bismark.pos.sam, file_name_bismark.neg.sam
                Script: classify_bismark_reads_by_parent.one_strand.pl
                Usage: classify_bismark_reads_by_parent.one_strand.pl file_name_bismark.pos.sam SNPs.bed pos > file_name_bismark.pos_class.sam"
                Usage: classify_bismark_reads_by_parent.one_strand.pl file_name_bismark.neg.sam SNPs.bed neg > file_name_bismark.neg_class.sam"
        Reads are classified based on their sequence at known SNP positions. After classification, redundant reads from each class are eliminated.
                To differentiate between 4 cases:
                ST:Z: maternal
                ST:Z: paternal
                ST:Z: NE means no evidence for either genome
                ST:Z: both means evidence for both (conflicting data)
                        awk -F"\t" '$19 == "ST:Z: maternal " {print $0}' file_name_bismark.pos_class.sam > maternal.sam
                        awk -F"\t" '$19 == "ST:Z: paternal " {print $0}' file_name_bismark.pos_class.sam > paternal.sam
                        awk -F"\t" '$19 == "ST:Z:NE" {print $0}' file_name_bismark.pos_class.sam > NE.sam
                        awk -F"\t" '$19 == "ST:Z:both" {print $0}' file_name_bismark.pos_class.sam > both.sam
                Run similar commands with file_name_bismark.neg_class.sam


J: Combine the SAM files
        Example: cat file_name_neg.sam file_name_pos.sam > file_name_pos_neg.sam
        You may delete the individual pos and neg files after checking the size of the combined files.

K: Prefix the header to the sam file
        cat header file_name_pos_neg.sam > file_name_pos_neg.header.sam

L: Convert SAM to BAM, sort, and index
        SAM_to_BAM_sort_index.pl file_name_pos_neg.header.sam
        Expected outputfile: file_name_pos_neg.sorted.bam

M:Convert sorted BAM file to SAM format
        samtools view -h file_name_pos_neg.sorted.bam > file_name_pos_neg.sorted.samConvert sorted BAM file to SAM format
        samtools view -h file_name_pos_neg.sorted.bam > file_name_pos_neg.sorted.sam

N:Eliminate redundant reads (see step 5)
        Script: make_sorted_SAM_non-redundant_fewest_MM.pl
        Usage: make_sorted_SAM_non-redundant_fewest_MM.pl file_name_pos_neg.sorted.sam > file_name_pos_neg.sorted_nr.sam

O: Calculate a methylation value for each cytosine (see step 6)
        Run methylation extractor for each set of classified reads as well as for all reads combined.
                Usage: methylation_extractor -s file_name_pos_neg.sorted_nr.sam
        Combine all the nr.sam files into an “allreads_nr.sam” file with cat command and run methylation_extractor again.

P:Summarize the methylation status across genomic windows
        For each class, organize the 12 methylation extractor files in the output files folder. Divide the genome into 300 nt
        (windowWidth) windows, overlapping by 100 nt (windowOverlap).
        Script: make_genome_windows_bed.pl (requiring a file of each chromosome and its length 每条染色体的长度信息)
                Usage: make_genome_windows_bed.pl chromInfo.txt windowWidth windowOverlap > genome_300nt_100_windows.bed
       
        Script: Summarize_by_window.sh
                Usage: bash Summarize_by_window.sh methylation_outputfiles genome_300nt_100_windows.bed
       
        Summarize_by_window.sh features:
                This script summarizes methylation status across overlapping genomic windows of defined size by converting
                the processed Bismark methylation extractor output files into a set of bed files and determining weighted
                methylation values as described in Schultz et al. (2012). Bedgraph files for viewing in a genome browser
                are created in the bedgraph folder. Bismark's methylation extractor output by chr position (after sorting)
                is summarized by converting the methylation string into ummethylated counts, methylated counts, and percent
                methylation, producing output in BED-like format (scorePerPos folder). The folder weighted_summaries_by_window
                has three bed files, which merge sites across each window. For each window it provides counts and weighted
                percent methylation.

Q: 鉴定差异甲基化区域(DMR)
        每个位点至少要求五个reads的覆盖。通过计算两个样本之间差异(sample A-sample B of weighted methylation fractions),和置信
        (p-value from Fisher's exact test)在每个窗口所鉴定道德甲基化模式(CG,CHG,CHH)来鉴定差异甲基化。P values are corrected
        with the Benjamini and Hochberg False Discovery Rate (FDR).CG和CHG DMR的鉴定为每个窗口最少有三个informative的胞嘧啶。比如
        一个严重的差异甲基化至少有35倍并且矫正的P值< 0.01。CHH DMR定义为一个窗口最少有10个overlapping informative cytosines,例如,
        一个严重的差异甲基化最少为10倍并且P < 0.01 . 我们也可以按照自己的需求来定义DMRS.
        To compare mC window counts in two samples the following scripts are needed:
                merge_bedgraph_data_counts_etc.pl
                compare_methylation_counts_by_window.R
        1.Make matrix of counts for 2 samples, each in 2 contexts using files in the 'weighted_summaries_by_window' folders.
        2.Example:
                ./merge_bedgraph_data_counts_etc.pl genome_300nt_100_windows.bed
                sample_A_reads/weighted_summaries_by_window/CpG.bed
                sample_B_reads/weighted_summaries_by_window/CpG.bed
                sample_A_reads/weighted_summaries_by_window/CHG.bed
                sample_B_reads/weighted_summaries_by_window/CHG.bed
                sample_A_reads/weighted_summaries_by_window/CHH.bed
                sample_B_reads/weighted_summaries_by_window/CHH.bed
                > sample_A_vs_sample_B_reads.300nt_100_windows.txt

                The resulting file should have window counts in this order:
                        sample_A/weighted_summaries_by_window/CpG.bed
                        sample_B/weighted_summaries_by_window/CpG.bed
                        sample_A/weighted_summaries_by_window/CHG.bed
                        sample_B/weighted_summaries_by_window/CHG.bed
                        sample_A/weighted_summaries_by_window/CHH.bed
                        sample_B/weighted_summaries_by_window/CHH.bed

        3.Compare counts across two samples in the same context.
                compare_methylation_counts_by_window.R inputCountsFile outputStatsFile
        Usage:
                compare_methylation_counts_by_window.R
                sample_A_vs_sample_B_reads.300nt_100_windows.txt
                sample_A_vs_sample_B_reads.300nt_100_windows.stats.txt

        对于每个窗口,For each window, the output file will have (a) the raw(未处理的) Fisher's exact test p-value reflecting whether the
        fraction of meth/unmeth counts is the same for both samples, and (b) the difference (sample A-sample B) in weighted methylation
        fraction. Methylation fractions appear next to the results of each statistical calculation.

R:Overlap genomic features (genes, transposable elements, etc.) with the windows
        1. Open the stats file in Excel and sort based on p value.
        2. Copy the rows representing selected DMRs [(with a FDR below your threshold (0.01) and
        with a methylation difference above your threshold (e.g. 35%)] into another file and delete
        all but the desired 5 columns (chr, start, end, difference, FDR value).
        3. Save the file with the extension “bed” and intersect with genomic features using Bedtools (Quinlan and Hall, 2010).
                Run:
                bedtools intersect -wao -a sample_A_vs_sample_B.CpG.bed -b genome_GFF3_genes.gff > intersect_ sample_A_vs_sample_B_genes.txt
                bedtools intersect -wao -a sample_A_vs_sample_B.CpG.bed –b genome_GFF3_transposable_element.gff > intersect_ sample_A_vs_sample_B.CpG.bed _TE.txt

protocol来源:http://en.bio-protocol.org/cn/se ... d=20171208113149144
水平有限,很多错误的地方请多多指正,DMR和DMC以及DMG还有后面的可视化,欢迎一起交流。。




上一篇:ChAMP加载IDAT数据的一个小问题
下一篇:HLA—肿瘤免疫治疗又一潜在biomarker
回复

使用道具 举报

1

主题

54

帖子

746

积分

高级会员

Rank: 4

积分
746
发表于 2017-12-10 09:32:17 | 显示全部楼层
最近有用bismark,有个疑问,为什么要对BAM进行排序?我在bismark手册上没发现这一步。难道是我没看清楚??
回复 支持 反对

使用道具 举报

2

主题

9

帖子

147

积分

注册会员

Rank: 2

积分
147
 楼主| 发表于 2017-12-10 15:41:12 | 显示全部楼层
生信小小菜鸟 发表于 2017-12-10 09:32
最近有用bismark,有个疑问,为什么要对BAM进行排序?我在bismark手册上没发现这一步。难道是我没看清楚?? ...

不排序你会发现你到时候extra提取出来的信息坐标比较乱
回复 支持 反对

使用道具 举报

0

主题

2

帖子

27

积分

新手上路

Rank: 1

积分
27
发表于 2018-4-4 17:02:11 | 显示全部楼层
楼主用过methylkit么?
回复 支持 反对

使用道具 举报

2

主题

9

帖子

147

积分

注册会员

Rank: 2

积分
147
 楼主| 发表于 2018-4-16 22:33:54 | 显示全部楼层
赌书斗茶6 发表于 2018-4-4 17:02
楼主用过methylkit么?

用过这个R包
回复 支持 反对

使用道具 举报

1

主题

5

帖子

108

积分

注册会员

Rank: 2

积分
108
发表于 2018-4-20 11:39:13 | 显示全部楼层
生信小小菜鸟 发表于 2017-12-10 09:32
最近有用bismark,有个疑问,为什么要对BAM进行排序?我在bismark手册上没发现这一步。难道是我没看清楚?? ...

我看到有的人的流程里排序之后要去重,再往下做
回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树    

GMT+8, 2018-10-24 01:20 , Processed in 0.122321 second(s), 25 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.