|
1)airway简介
在该workflow中,所用的数据集来自RNA-seq,气道平滑肌细胞(airway smooth muscle cells )用氟美松(糖皮质激素,抗炎药)处理。例如,哮喘患者使用糖皮质激素来减少呼吸道炎症,在该实验设计中,4种细胞类型(airway smooth muscle cell lines )用1微米地塞米松处理18个小时。每一个cell lines都有对照与处理组(a treated and an untreated sample).关于实验设计的更多信息可以查看PubMed entry 24926665 ,原始数据 GEO entry GSE52778.
2)Preparing count matrices(准备制作表达矩阵)
这个矩阵必须是原始的表达量,不能是任何矫正过的表达量(RPKM等)。因为DESeq2的统计模型在处理原始表达矩阵会展示出强大实力,且会自动处理library size differences。
2.1 Aligning reads to a reference genome
[Python] 纯文本查看 复制代码 for f in 'cat files'
do
STAR --genomeDir ../STAR/ENSEMBL.homo_sapiens.release-75 --readFilesIn fastq/$f\_1.fastq fastq/$f\_2.fastq --runThreadN 12 --outFileNamePrefix aligned/$f.
done #比对
for f in 'cat files'
do
samtools view -bS aligned/$f.Aligned.out.sam -o aligned/$f.bam; done #转bam文件
done
因为按照原文献跑流程,则数据量太大,耗时间,因此在airway中只用了8个样本的一些子集,来减少运行时间及内存。
[Python] 纯文本查看 复制代码 library("airway")
dir <- system.file("extdata", package="airway", mustWork=TRUE)
list.files(dir)
csvfile <- file.path(dir,"sample_table.csv")
(sampleTable <- read.csv(csvfile,row.names=1))
filenames <- file.path(dir, paste0(sampleTable$Run, "_subset.bam"))
file.exists(filenames)
library("Rsamtools")
bamfiles <- BamFileList(filenames, yieldSize=2000000)
seqinfo(bamfiles[1])
2.2 Defining gene models( 最主要的一步是构建TxDb 对象)
Next, we need to read in the gene model that will be used for counting reads/fragments.Here,we want to make a list of exons grouped by gene for counting read/fragments.
如果是自己的数据,首先下载gtf/gff3文件,并用GenomicFeatures 包中的makeTxDbFromGFF()函数构建TxDb object。(TxDb对象是一个包含frange-based objects, such as exons, transcripts, and genes)
[Python] 纯文本查看 复制代码 library("GenomicFeatures")
gtffile <- file.path(dir,"Homo_sapiens.GRCh37.75_subset.gtf")
(txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character()))
(ebg <- exonsBy(txdb, by="gene"))[/size][/font][/color]
[font=comic sans ms, sans-serif][color=Red][size=12px]2.3 Read counting step[/size][/color][/font]
[font=comic sans ms, sans-serif][size=12px]上述准备工作做完之后,统计表达量非常简单。 利用GenomicAlignments包中的summarizeOverlaps()函数,返回一个包含多种信息的SummarizedExperiment 对象。同时由于数据量大,对于一个包含30 million aligned reads的单个文件,summarizeOverlaps()函数至少需要30 minutes。为了节省时间可以利用BiocParallel包中的register()函数进行多核运行。[/size][/font]
[font=comic sans ms, sans-serif][size=12px][mw_shl_code=python,true]register(SerialParam()) #设置多核运行
se <- summarizeOverlaps(features=ebg, reads=bamfiles,
mode="Union", # describes what kind of read overlaps will be counted.
singleEnd=FALSE,
ignore.strand=TRUE,
fragments=TRUE )
save(se,file='count_matrix.Rdata')
2.4 SummarizedExperiment
上述产生的SummarizedExperiment 对象,由3部分组成:表达矩阵(matrix of counts,)储存在assay中,如果是多个矩阵可以储存在assays当中。rowRanges,包含genomic ranges的信息,及gtf中的gene model信息。样本信息(information about the samples)。
[Python] 纯文本查看 复制代码 dim(se) ## [1] 20 8
assayNames(se) ## "counts"
head(assay(se), 3)
colSums(assay(se))
rowRanges(se) #rowRanges信息
str(metadata(rowRanges(se)))
colData(se) #
(colData(se) <- DataFrame(sampleTable))
3、The DESeqDataSet, sample information, and the design formula
有两种方法构建 DESeqDataSet 对象:
3.1 Starting from SummarizedExperiment( 直接将SummarizedExperiment转化为DESeqDataSet )
以airway自带数据为例:
[Python] 纯文本查看 复制代码 data("airway")
se <- airway
se$dex <- relevel(se$dex, "untrt") #specify that untrt is the reference level for the dex variable
se$dex
round( colSums(assay(se)) / 1e6, 1 )
colData(se)
dds <- DESeqDataSet(se, design = ~ cell + dex) #构建DESeqDataSet对象
3.2 Starting from count matrices( 如果你之前没有准备SummarizedExperiment, 只有count matrix 和 sample information ,利用这两个文件及DESeqDataSetFromMatrix()函数构建 DESeqDataSet对象 )
[Python] 纯文本查看 复制代码 countdata <- assay(se)
head(countdata, 3)
coldata <- colData(se)
(ddsMat <- DESeqDataSetFromMatrix(countData = countdata,colData = coldata,design = ~ cell + dex)
4)Exploratory analysis and visualization(数据探索分析及可视化)
4.1 Pre-filtering the dataset
[Python] 纯文本查看 复制代码 nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
4.2 The rlog transformation
rlog()函数返回一个SummarizedExperiment对象,其中 assay slot包含 rlog-转换后的值。
[Python] 纯文本查看 复制代码 rld <- rlog(dds, blind=FALSE)
head(assay(rld), 3)
par( mfrow = c( 1, 2 ) )
dds <- estimateSizeFactors(dds)
plot(log2(counts(dds, normalized=TRUE)[,1:2] + 1),pch=16, cex=0.3)
plot(assay(rld)[,1:2],pch=16, cex=0.3)
1
4.3 Sample distances
[Python] 纯文本查看 复制代码 sampleDists <- dist( t( assay(rld) ) )
sampleDists
library("pheatmap") ##用热图展示样品间关系
library("RColorBrewer")
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
3
4.4 PCA plot(另一个展示样品间关系的图形)
[Python] 纯文本查看 复制代码 (data <- plotPCA(rld, intgroup = c( "dex", "cell"), returnData=TRUE)) #调用plotPCA生成要画图的data
percentVar <- round(100 * attr(data, "percentVar")) #PC1,pC2
library("ggplot2") #画图
ggplot(data, aes(PC1, PC2, color=dex, shape=cell)) + geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))
5
4.5 MDS plot(另一种PCA图,multidimensional scaling (MDS) function )
[Python] 纯文本查看 复制代码 ##用归一化后的数据产生的距离矩阵(calculated from the rlog transformed)
mdsData <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mdsData, as.data.frame(colData(rld)))
ggplot(mds, aes(X1,X2,color=dex,shape=cell)) + geom_point(size=3)
[attach]3220[/attach]
###用原始数据产生的泊松距离矩阵(calculated from the raw)
mdsPoisData <- data.frame(cmdscale(samplePoisDistMatrix))
mdsPois <- cbind(mdsPoisData, as.data.frame(colData(dds)))
ggplot(mdsPois, aes(X1,X2,color=dex,shape=cell)) + geom_point(size=3)
7
5)Differential expression analysis(差异分析)
5,1)Running the differential expression pipeline用DESeq()函数对原始表达矩阵进行差异表达分析
[Python] 纯文本查看 复制代码 dds <- DESeq(dds)
5.2)Building the results table
利用result函数(),将返回一个数据框对象,包含有 metadata information(每一列的意义)。
[Python] 纯文本查看 复制代码 (res <- results(dds)) #产生log2 fold change 及p值
mcols(res, use.names=TRUE) #查看meta信息
summary(res) #产看其他信息
可以通过FDR,foldchange进行筛选
[Python] 纯文本查看 复制代码 #如果要降低false discovery rate threshold,将要设置的值传递给results(),使用该阈值进行独立的过滤:
res.05 <- results(dds, alpha=.05) #设置阈值FDR=0.05
table(res.05$padj < .05) #产看结果
#如果提升log2 fold change threshold show more substantial changes due to treatment, we simply supply a value on the log2 scale.
resLFC1 <- results(dds, lfcThreshold=1)
table(resLFC1$padj < 0.1)
5.3)Other comparisons及Multiple testing
可以利用result函数中的contrast参数来选择任何2组数据进行比较。
[Python] 纯文本查看 复制代码 results(dds, contrast=c("cell", "N061011", "N61311"))
sum(res$pvalue < 0.05, na.rm=TRUE)
sum(!is.na(res$pvalue))
sum(res$padj < 0.1, na.rm=TRUE)
resSig <- subset(res, padj < 0.1)
head(resSig[ order(resSig$log2FoldChange), ])
head(resSig[ order(resSig$log2FoldChange, decreasing=TRUE), ])
topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene=topGene, intgroup=c("dex"))
8
用ggplot2画传统的散点图
[Python] 纯文本查看 复制代码 data <- plotCounts(dds, gene=topGene, intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, color=cell)) +
scale_y_log10() +
geom_point(position=position_jitter(width=.1,height=0), size=3)
ggplot(data, aes(x=dex, y=count, fill=dex)) +
scale_y_log10() +
geom_dotplot(binaxis="y", stackdir="center")
ggplot(data, aes(x=dex, y=count, color=cell, group=cell)) +
scale_y_log10() + geom_point(size=3) + geom_line()
9
10
画 MA-plot图:
[Python] 纯文本查看 复制代码 plotMA(res, ylim=c(-5,5))
11
用修改过的log2 fold change threshold,并标记特定基因:
[Python] 纯文本查看 复制代码 plotMA(resLFC1, ylim=c(-5,5))
topGene <- rownames(resLFC1)[which.min(resLFC1$padj)]
with(resLFC1[topGene, ], {
points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})
12
5.5)Plotting results提取前20个highest variance across samples.基因进行聚类
[Python] 纯文本查看 复制代码 library("genefilter")
topVarGenes <- head(order(rowVars(assay(rld)),decreasing=TRUE),20)
mat <- assay(rld)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld)[,c("cell","dex")])
pheatmap(mat, annotation_col=df)
13
5.6)Annotating and exporting results
因为这里只有Ensembl gene IDs,而其它的名称也非常有用。用 mapIds()函数, 其中column 参数表示需要哪种信息, multiVals参数表示如果有多个对应值的时候如何选择.,这里选择第一个. 调用两次函数 add the gene symbol and Entrez ID.
[Python] 纯文本查看 复制代码 library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)
res$symbol <- mapIds(org.Hs.eg.db,
keys=row.names(res),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
res$entrez <- mapIds(org.Hs.eg.db,
keys=row.names(res),
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
resOrdered <- res[order(res$padj),]
head(resOrdered)
5.7)Plotting fold changes in genomic space
[Python] 纯文本查看 复制代码 (resGR <- results(dds, lfcThreshold=1, format="GRanges"))
resGR$symbol <- mapIds(org.Hs.eg.db, names(resGR), "SYMBOL", "ENSEMBL") #加基因名
library("Gviz") #用于可视化for plotting the GRanges and associated metadata: the log fold changes due to dexam- ethasone treatment.
window <- resGR[topGene] + 1e6
strand(window) <- "*"
resGRsub <- resGR[resGR %over% window]
naOrDup <- is.na(resGRsub$symbol) | duplicated(resGRsub$symbol)
resGRsub$group <- ifelse(naOrDup, names(resGRsub), resGRsub$symbol)
sig <- factor(ifelse(resGRsub$padj < .1 & !is.na(resGRsub$padj),"sig","notsig"))
options(ucscChromosomeNames=FALSE)
g <- GenomeAxisTrack()
a <- AnnotationTrack(resGRsub, name="gene ranges", feature=sig)
d <- DataTrack(resGRsub, data="log2FoldChange", baseline=0,
type="h", name="log2 fold change", strand="+")
plotTracks(list(g,d,a), groupAnnotation="group", notsig="grey", sig="hotpink")
14
5.8)Removing hidden batch effects
[Python] 纯文本查看 复制代码 library("sva")
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv
par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$cell,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$cell,vertical=TRUE,main="SV2")
abline(h=0)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex
ddssva <- DESeq(ddssva)
5.9)Time course experiments
[Python] 纯文本查看 复制代码 library("fission")
data("fission")
ddsTC <- DESeqDataSet(fission, ~ strain + minute + strain:minute)
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ strain + minute)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)
data <- plotCounts(ddsTC, which.min(resTC$padj),
intgroup=c("minute","strain"), returnData=TRUE)
ggplot(data, aes(x=minute, y=count, color=strain, group=strain)) +
geom_point() + stat_smooth(se=FALSE,method="loess") + scale_y_log10()
resultsNames(ddsTC)
res30 <- results(ddsTC, name="strainmut.minute30", test="Wald")
res30[which.min(resTC$padj),]
15
对显著表达基因进行聚类
[Python] 纯文本查看 复制代码 betas <- coef(ddsTC)
colnames(betas)
library("pheatmap")
topGenes <- head(order(resTC$padj),20)
mat <- betas[topGenes, -c(1,2)]
thr <- 3
mat[mat < -thr] <- -thr
mat[mat > thr] <- thr
pheatmap(mat, breaks=seq(from=-thr, to=thr, length=101),
cluster_col=FALSE)
16
|
|