|
相关软件
Linux computer with standard utility applications (including Perl)
R (http://www.r-project.org/; base installation only is needed)
Bismark (http://www.bioinformatics.babraham.ac.uk/projects/bismark/)
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
FASTX Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
SAMtools (http://samtools.sourceforge.net/)
bedtools (https://github.com/arq5x/bedtools2)
步骤
注:本流程中所用到的脚本链接 http://bio-protocol.org/attached ... 1019024350_8508.zip
A:数据的质控
1. 复制序列到自己创建的文件夹中
2. 解压
gunzip file_name.txt.tar.gz
tar xvfp file_name.txt.tar
3. 使用factqc进行质控(http://hannonlab.cshl.edu/fastx_toolkit/ )
Usage: fastqc file_name.txt
4. 查看质控结果,从而进一步filter或者trim数据,去除接头和低质量的reads(less than 75% quality scores above 25)
5. 对质控筛选过的reads再一次进行质控检测
Usage: fastqc file_name_trimmed.fastq
B:使用bismark进行比对
使用bismark将数据比对到参考基因组上(Krueger and Andrews, 2011)
强烈建议在开始后面的数据分析之前好好读一读由bismark作者写的一篇综述(http://www.ncbi.nlm.nih.gov/pubmed/22290186) 。
在这里,描述了对单端数据的比对,但是双末端也是可以的。
1. 将参考基因组进行重亚硫酸盐的预处理转换(处理一次同样需求的以后无需再重新转换)
Usage: bismark_genome_preparation --path_to_bowtie reference_genome/fasta/
2. 将数据比对到已经转换过了的参考基因组上(对正向和反向的reads分别进行map)
Usage: bismark --non_directional -o bismark_file_name_trimmed.n1l50 -n 1 -l 50 reference_genome file_name.trimmed.fastq
-non_directional: reads是非特异性的(有一个特异性的参数可以选择)
-o 输出文件夹目录
-n 允许最大的错配数
-l seed的长度(在<n的错配数情况下map的第一个nt的数量first number of nt that are mapped with <n mismatches)
*fastq: 输入文件
From this point onwards the analysis follows different steps based on the genotype of the sample used to prepare the library.
Option #1: Libraries from a single inbred strain
C:将bisamark比对得到文件从SAM转换成BAM文件,并进行排序和建立索引
cd bismark_file_name_trimmed.n1l50
Usage: SAM_to_BAM_sort_index.pl file_name.fastq_bismark.sam
Expected output file: file_name.fastq_bismark.sorted.bam
D:将已经排序的BAM文件转换成SAM文件
Usage: samtools view -h file_name.fastq_bismark.sorted.bam > file_name_sorted.sam Samtools (Li et al., 2009)
E:去除冗余的reads(Eliminate redundant reads)
利用SAM文件,保持比对到相同的起始和终止稳点的每条链上只有一条序列(keep only one sequence per strand that maps to the same start and end position)
去除PCR产生的重复序列。A log file (SAM_redundancy_stats.log.txt) of counts for each read is created.
Usage: ./make_sorted_SAM_non-redundant_fewest_MM.pl file_name_sorted.sam > file_name_sorted_nr.sam
这一步工作原理:
The script reads a sorted SAM file and gets all the reads that map to the same position and strand and then
1. Sorts them by decreasing prevalence(普遍的) and then by increasing number of mismatches (not counting bisulfite conversions) to the genome.
2. The most prevalent read with the total highest quality string is kept.
3. In the case of a tie, the read with the fewest number of mismatches is retained.
4. In the case of another tie, it prints out every unique read.
The output SAM file also has 3 new fields added on the end:
CT == count = number of times this read was found in the input file
MM == mismatches = number of mismatches compared to the genome sequence (not counting conversions)
QS == quality score = total quality score for this read
F:计算每个胞嘧啶的甲基化值
bismark_methylation_extractor提取已经排序的SAM文件中的甲基化信息
Usage: bismark_methylation_extractor –s file_name_sorted_nr.sam -o bismark_output
产生12个文件,三种甲基化(CG,CHG,CHH)的4种比对的单个胞嘧啶的甲基化信息:
OT - original top strand 原始的正链
CTOT - complementary to original top strand 原始的正链的互补链
OB - original bottom strand 原始的负链
CTOB - complementary to original bottom strand 原始的负链的互补链
Shorthand for methylation call according to context:
z: unmethylated C in CpG context
Z: methylated C in CpG context
x: unmethylated C in CHG context
X: methylated C in CHG context
h: unmethylated C in CHH context
H: methylated C in CHH context
G:计算重亚硫酸盐的转换率
The mean bisulfite conversion rate for each library is calculated based on the methylation status of each cytosine from reads mapping to the
chloroplast(叶绿体) genome, which is expected to be unmethylated.
1. Run Usage: ./methylation_summarizer_1.pl bismark_output > bismark_SampleName.summary_step1.txt
2. 获取比对到叶绿体上的每个核酸的甲基化信息 Get info about methylation status of each nucleotide from each read mapping to the chloroplast.
Usage:
grep ChrC bismark_*_trim3.n1l50.summary_step_1.txt > bismark_*_trim3.n1l50.summary_step_1.chloroplast.txt
./methylation_summarizer_2_conversion.pl bismark_*_trim3.n1l50.summary_step_1.chloroplast.txt >bismark_*_trim3.n1l50.conversion.txt
The output file contains the conversion at each position, and the mean C conversion across the chloroplast is printed to the screen.
Option #2: Libraries from F1 crosses
H: Map reads from hybrid libraries
In the case of F1 hybrid libraries, two parental genomes may be available as reference. You may decide to map reads to both the parental genomes to
maximize the number of mapped reads using Bismark as described above. To do that, first map all the reads to the best reference genome, then align
the “unmapped” reads to the other genome.
Make file of unmapped reads from the fastq file (e.g., reads not in SAM file).
Usage: ./get_fastq_reads_not_in_SAM.pl fastqFile samFile > reads.unmapped.fastq
To assign reads to a particular strain and to retain as many unique reads as possible, separate the reads by strand.
Script: split_bismark_SAM_by_read_strand.pl
Usage: split_bismark_SAM_by_read_strand.pl file_name_bismark.sam
I: Classify reads by parent-of-origin (allele specific DNA methylation analysis)
Classify the forward reads according to their parent-of-origin with a bedfile where the C>T SNPs between the two genomes of interest are ignored but
all otherSNPs are retained, and classify the reverse reads with a bedfile where the G>A SNPs between the two genomes of interest are ignored but all
other SNPs are retained. There is only one bedfile, but read strand has to be specified (pos or neg).
Script: split_bismark_SAM_by_read_strand.pl
Usage: split_bismark_SAM_by_read_strand.pl file_name_bismark.sam
Expected output files: file_name_bismark.pos.sam, file_name_bismark.neg.sam
Script: classify_bismark_reads_by_parent.one_strand.pl
Usage: classify_bismark_reads_by_parent.one_strand.pl file_name_bismark.pos.sam SNPs.bed pos > file_name_bismark.pos_class.sam"
Usage: classify_bismark_reads_by_parent.one_strand.pl file_name_bismark.neg.sam SNPs.bed neg > file_name_bismark.neg_class.sam"
Reads are classified based on their sequence at known SNP positions. After classification, redundant reads from each class are eliminated.
To differentiate between 4 cases:
ST:Z: maternal
ST:Z: paternal
ST:Z: NE means no evidence for either genome
ST:Z: both means evidence for both (conflicting data)
awk -F"\t" '$19 == "ST:Z: maternal " {print $0}' file_name_bismark.pos_class.sam > maternal.sam
awk -F"\t" '$19 == "ST:Z: paternal " {print $0}' file_name_bismark.pos_class.sam > paternal.sam
awk -F"\t" '$19 == "ST:Z:NE" {print $0}' file_name_bismark.pos_class.sam > NE.sam
awk -F"\t" '$19 == "ST:Z:both" {print $0}' file_name_bismark.pos_class.sam > both.sam
Run similar commands with file_name_bismark.neg_class.sam
J: Combine the SAM files
Example: cat file_name_neg.sam file_name_pos.sam > file_name_pos_neg.sam
You may delete the individual pos and neg files after checking the size of the combined files.
K: Prefix the header to the sam file
cat header file_name_pos_neg.sam > file_name_pos_neg.header.sam
L: Convert SAM to BAM, sort, and index
SAM_to_BAM_sort_index.pl file_name_pos_neg.header.sam
Expected outputfile: file_name_pos_neg.sorted.bam
M:Convert sorted BAM file to SAM format
samtools view -h file_name_pos_neg.sorted.bam > file_name_pos_neg.sorted.samConvert sorted BAM file to SAM format
samtools view -h file_name_pos_neg.sorted.bam > file_name_pos_neg.sorted.sam
N:Eliminate redundant reads (see step 5)
Script: make_sorted_SAM_non-redundant_fewest_MM.pl
Usage: make_sorted_SAM_non-redundant_fewest_MM.pl file_name_pos_neg.sorted.sam > file_name_pos_neg.sorted_nr.sam
O: Calculate a methylation value for each cytosine (see step 6)
Run methylation extractor for each set of classified reads as well as for all reads combined.
Usage: methylation_extractor -s file_name_pos_neg.sorted_nr.sam
Combine all the nr.sam files into an “allreads_nr.sam” file with cat command and run methylation_extractor again.
P:Summarize the methylation status across genomic windows
For each class, organize the 12 methylation extractor files in the output files folder. Divide the genome into 300 nt
(windowWidth) windows, overlapping by 100 nt (windowOverlap).
Script: make_genome_windows_bed.pl (requiring a file of each chromosome and its length 每条染色体的长度信息)
Usage: make_genome_windows_bed.pl chromInfo.txt windowWidth windowOverlap > genome_300nt_100_windows.bed
Script: Summarize_by_window.sh
Usage: bash Summarize_by_window.sh methylation_outputfiles genome_300nt_100_windows.bed
Summarize_by_window.sh features:
This script summarizes methylation status across overlapping genomic windows of defined size by converting
the processed Bismark methylation extractor output files into a set of bed files and determining weighted
methylation values as described in Schultz et al. (2012). Bedgraph files for viewing in a genome browser
are created in the bedgraph folder. Bismark's methylation extractor output by chr position (after sorting)
is summarized by converting the methylation string into ummethylated counts, methylated counts, and percent
methylation, producing output in BED-like format (scorePerPos folder). The folder weighted_summaries_by_window
has three bed files, which merge sites across each window. For each window it provides counts and weighted
percent methylation.
Q: 鉴定差异甲基化区域(DMR)
每个位点至少要求五个reads的覆盖。通过计算两个样本之间差异(sample A-sample B of weighted methylation fractions),和置信
(p-value from Fisher's exact test)在每个窗口所鉴定道德甲基化模式(CG,CHG,CHH)来鉴定差异甲基化。P values are corrected
with the Benjamini and Hochberg False Discovery Rate (FDR).CG和CHG DMR的鉴定为每个窗口最少有三个informative的胞嘧啶。比如
一个严重的差异甲基化至少有35倍并且矫正的P值< 0.01。CHH DMR定义为一个窗口最少有10个overlapping informative cytosines,例如,
一个严重的差异甲基化最少为10倍并且P < 0.01 . 我们也可以按照自己的需求来定义DMRS.
To compare mC window counts in two samples the following scripts are needed:
merge_bedgraph_data_counts_etc.pl
compare_methylation_counts_by_window.R
1.Make matrix of counts for 2 samples, each in 2 contexts using files in the 'weighted_summaries_by_window' folders.
2.Example:
./merge_bedgraph_data_counts_etc.pl genome_300nt_100_windows.bed
sample_A_reads/weighted_summaries_by_window/CpG.bed
sample_B_reads/weighted_summaries_by_window/CpG.bed
sample_A_reads/weighted_summaries_by_window/CHG.bed
sample_B_reads/weighted_summaries_by_window/CHG.bed
sample_A_reads/weighted_summaries_by_window/CHH.bed
sample_B_reads/weighted_summaries_by_window/CHH.bed
> sample_A_vs_sample_B_reads.300nt_100_windows.txt
The resulting file should have window counts in this order:
sample_A/weighted_summaries_by_window/CpG.bed
sample_B/weighted_summaries_by_window/CpG.bed
sample_A/weighted_summaries_by_window/CHG.bed
sample_B/weighted_summaries_by_window/CHG.bed
sample_A/weighted_summaries_by_window/CHH.bed
sample_B/weighted_summaries_by_window/CHH.bed
3.Compare counts across two samples in the same context.
compare_methylation_counts_by_window.R inputCountsFile outputStatsFile
Usage:
compare_methylation_counts_by_window.R
sample_A_vs_sample_B_reads.300nt_100_windows.txt
sample_A_vs_sample_B_reads.300nt_100_windows.stats.txt
对于每个窗口,For each window, the output file will have (a) the raw(未处理的) Fisher's exact test p-value reflecting whether the
fraction of meth/unmeth counts is the same for both samples, and (b) the difference (sample A-sample B) in weighted methylation
fraction. Methylation fractions appear next to the results of each statistical calculation.
R:Overlap genomic features (genes, transposable elements, etc.) with the windows
1. Open the stats file in Excel and sort based on p value.
2. Copy the rows representing selected DMRs [(with a FDR below your threshold (0.01) and
with a methylation difference above your threshold (e.g. 35%)] into another file and delete
all but the desired 5 columns (chr, start, end, difference, FDR value).
3. Save the file with the extension “bed” and intersect with genomic features using Bedtools (Quinlan and Hall, 2010).
Run:
bedtools intersect -wao -a sample_A_vs_sample_B.CpG.bed -b genome_GFF3_genes.gff > intersect_ sample_A_vs_sample_B_genes.txt
bedtools intersect -wao -a sample_A_vs_sample_B.CpG.bed –b genome_GFF3_transposable_element.gff > intersect_ sample_A_vs_sample_B.CpG.bed _TE.txt
protocol来源:http://en.bio-protocol.org/cn/se ... d=20171208113149144
水平有限,很多错误的地方请多多指正,DMR和DMC以及DMG还有后面的可视化,欢迎一起交流。。
|
|