搜索
查看: 7314|回复: 0

[workflow] airway包

[复制链接]

29

主题

29

帖子

149

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
149
发表于 2018-9-19 17:34:56 | 显示全部楼层 |阅读模式
1)airway简介
在该workflow中,所用的数据集来自RNA-seq,气道平滑肌细胞(airway  smooth muscle cells )用氟美松(糖皮质激素,抗炎药)处理。例如,哮喘患者使用糖皮质激素来减少呼吸道炎症,在该实验设计中,4种细胞类型(airway smooth muscle cell lines )用1微米地塞米松处理18个小时。每一个cell lines都有对照与处理组(a treated and an untreated sample).关于实验设计的更多信息可以查看PubMed entry 24926665 ,原始数据 GEO entry GSE52778.
2)Preparing count matrices(准备制作表达矩阵)
这个矩阵必须是原始的表达量,不能是任何矫正过的表达量(RPKM等)。因为DESeq2的统计模型在处理原始表达矩阵会展示出强大实力,且会自动处理library size differences。
2.1 Aligning reads to a reference genome
[Python] 纯文本查看 复制代码
for f in 'cat files'
do
    STAR --genomeDir ../STAR/ENSEMBL.homo_sapiens.release-75 --readFilesIn fastq/$f\_1.fastq fastq/$f\_2.fastq --runThreadN 12 --outFileNamePrefix aligned/$f.
done   #比对
 
for f in 'cat files'
do
    samtools view -bS aligned/$f.Aligned.out.sam -o aligned/$f.bam; done  #转bam文件
done

因为按照原文献跑流程,则数据量太大,耗时间,因此在airway中只用了8个样本的一些子集,来减少运行时间及内存。
[Python] 纯文本查看 复制代码
library("airway")
dir <- system.file("extdata", package="airway", mustWork=TRUE)  
list.files(dir)
csvfile <- file.path(dir,"sample_table.csv") 
(sampleTable <- read.csv(csvfile,row.names=1))      
filenames <- file.path(dir, paste0(sampleTable$Run, "_subset.bam")) 
file.exists(filenames) 
library("Rsamtools")  
bamfiles <- BamFileList(filenames, yieldSize=2000000) 
seqinfo(bamfiles[1])                                  

2.2 Defining gene models( 最主要的一步是构建TxDb 对象)
Next, we need to read in the gene model that will be used for counting reads/fragments.Here,we want to make a list of exons grouped  by gene for counting read/fragments.

如果是自己的数据,首先下载gtf/gff3文件,并用GenomicFeatures 包中的makeTxDbFromGFF()函数构建TxDb object。(TxDb对象是一个包含frange-based objects, such as exons, transcripts, and genes)
[Python] 纯文本查看 复制代码
library("GenomicFeatures")
gtffile <- file.path(dir,"Homo_sapiens.GRCh37.75_subset.gtf") 
(txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character())) 
(ebg <- exonsBy(txdb, by="gene"))[/size][/font][/color]
[font=comic sans ms, sans-serif][color=Red][size=12px]2.3 Read counting step[/size][/color][/font]
[font=comic sans ms, sans-serif][size=12px]上述准备工作做完之后,统计表达量非常简单。 利用GenomicAlignments包中的summarizeOverlaps()函数,返回一个包含多种信息的SummarizedExperiment 对象。同时由于数据量大,对于一个包含30 million aligned reads的单个文件,summarizeOverlaps()函数至少需要30 minutes。为了节省时间可以利用BiocParallel包中的register()函数进行多核运行。[/size][/font]
[font=comic sans ms, sans-serif][size=12px][mw_shl_code=python,true]register(SerialParam()) #设置多核运行
se <- summarizeOverlaps(features=ebg, reads=bamfiles, 
                        mode="Union",            # describes what kind of read overlaps will be counted.
                        singleEnd=FALSE,
                        ignore.strand=TRUE,
                        fragments=TRUE ) 
save(se,file='count_matrix.Rdata')

2.4 SummarizedExperiment
上述产生的SummarizedExperiment 对象,由3部分组成:表达矩阵(matrix of counts,)储存在assay中,如果是多个矩阵可以储存在assays当中。rowRanges,包含genomic ranges的信息,及gtf中的gene model信息。样本信息(information about the samples)。
[Python] 纯文本查看 复制代码
dim(se)               ## [1] 20 8
assayNames(se)        ## "counts"
head(assay(se), 3) 
colSums(assay(se))   
rowRanges(se)         #rowRanges信息
str(metadata(rowRanges(se))) 
colData(se)         #
(colData(se) <- DataFrame(sampleTable)) 

3、The DESeqDataSet, sample information, and the design formula
有两种方法构建 DESeqDataSet 对象:
3.1  Starting from SummarizedExperiment( 直接将SummarizedExperiment转化为DESeqDataSet )
以airway自带数据为例:
[Python] 纯文本查看 复制代码
data("airway")
se <- airway
se$dex <- relevel(se$dex, "untrt")  #specify that untrt is the reference level for the dex variable
se$dex
round( colSums(assay(se)) / 1e6, 1 ) 
colData(se) 
dds <- DESeqDataSet(se, design = ~ cell + dex)  #构建DESeqDataSet对象

3.2  Starting from  count matrices(  如果你之前没有准备SummarizedExperiment, 只有count matrix 和 sample information ,利用这两个文件及DESeqDataSetFromMatrix()函数构建 DESeqDataSet对象 )
[Python] 纯文本查看 复制代码
countdata <- assay(se) 
head(countdata, 3)
coldata <- colData(se) 
(ddsMat <- DESeqDataSetFromMatrix(countData = countdata,colData = coldata,design = ~ cell + dex) 

4)Exploratory analysis and visualization(数据探索分析及可视化)
4.1 Pre-filtering the dataset
[Python] 纯文本查看 复制代码
nrow(dds) 
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds) 

4.2 The rlog transformation
rlog()函数返回一个SummarizedExperiment对象,其中 assay slot包含 rlog-转换后的值。
[Python] 纯文本查看 复制代码
rld <- rlog(dds, blind=FALSE)
head(assay(rld), 3)
par( mfrow = c( 1, 2 ) )
dds <- estimateSizeFactors(dds) 
plot(log2(counts(dds, normalized=TRUE)[,1:2] + 1),pch=16, cex=0.3) 
plot(assay(rld)[,1:2],pch=16, cex=0.3) 

1

1

4.3 Sample distances
[Python] 纯文本查看 复制代码
sampleDists <- dist( t( assay(rld) ) ) 
sampleDists
library("pheatmap")      ##用热图展示样品间关系
library("RColorBrewer")
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
        col=colors)

3

3

4.4 PCA plot(另一个展示样品间关系的图形)
[Python] 纯文本查看 复制代码
(data <- plotPCA(rld, intgroup = c( "dex", "cell"), returnData=TRUE))  #调用plotPCA生成要画图的data
percentVar <- round(100 * attr(data, "percentVar"))      #PC1,pC2
library("ggplot2")                                    #画图
ggplot(data, aes(PC1, PC2, color=dex, shape=cell)) + geom_point(size=3) +
     xlab(paste0("PC1: ",percentVar[1],"% variance")) +
     ylab(paste0("PC2: ",percentVar[2],"% variance"))     

5

5

4.5 MDS plot(另一种PCA图,multidimensional scaling (MDS) function )
[Python] 纯文本查看 复制代码
##用归一化后的数据产生的距离矩阵(calculated from the rlog transformed)
mdsData <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mdsData, as.data.frame(colData(rld)))
ggplot(mds, aes(X1,X2,color=dex,shape=cell)) + geom_point(size=3)
 [attach]3220[/attach]
###用原始数据产生的泊松距离矩阵(calculated from the raw)
mdsPoisData <- data.frame(cmdscale(samplePoisDistMatrix))
mdsPois <- cbind(mdsPoisData, as.data.frame(colData(dds)))
ggplot(mdsPois, aes(X1,X2,color=dex,shape=cell)) + geom_point(size=3)

7

7

5)Differential expression analysis(差异分析)
5,1)Running the differential expression pipeline用DESeq()函数对原始表达矩阵进行差异表达分析
[Python] 纯文本查看 复制代码
dds <- DESeq(dds)  

5.2)Building the results table
利用result函数(),将返回一个数据框对象,包含有 metadata information(每一列的意义)。
[Python] 纯文本查看 复制代码
(res <- results(dds)) #产生log2 fold change 及p值
mcols(res, use.names=TRUE)  #查看meta信息
summary(res)  #产看其他信息

可以通过FDR,foldchange进行筛选
[Python] 纯文本查看 复制代码
#如果要降低false discovery rate threshold,将要设置的值传递给results(),使用该阈值进行独立的过滤:
res.05 <- results(dds, alpha=.05)  #设置阈值FDR=0.05
table(res.05$padj < .05)          #产看结果 
#如果提升log2 fold change threshold show more substantial changes due to  treatment, we simply supply a value on the log2 scale.
resLFC1 <- results(dds, lfcThreshold=1)
table(resLFC1$padj < 0.1)

5.3)Other comparisons及Multiple testing
可以利用result函数中的contrast参数来选择任何2组数据进行比较。
[Python] 纯文本查看 复制代码
results(dds, contrast=c("cell", "N061011", "N61311"))
sum(res$pvalue < 0.05, na.rm=TRUE)
sum(!is.na(res$pvalue))
sum(res$padj < 0.1, na.rm=TRUE)
resSig <- subset(res, padj < 0.1)
head(resSig[ order(resSig$log2FoldChange), ])
head(resSig[ order(resSig$log2FoldChange, decreasing=TRUE), ])
topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene=topGene, intgroup=c("dex"))

8

8

用ggplot2画传统的散点图
[Python] 纯文本查看 复制代码
data <- plotCounts(dds, gene=topGene, intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, color=cell)) +
 scale_y_log10() +
 geom_point(position=position_jitter(width=.1,height=0), size=3)
ggplot(data, aes(x=dex, y=count, fill=dex)) +
 scale_y_log10() +
 geom_dotplot(binaxis="y", stackdir="center")
ggplot(data, aes(x=dex, y=count, color=cell, group=cell)) +
 scale_y_log10() + geom_point(size=3) + geom_line()

9

9

10

10

画 MA-plot图:
[Python] 纯文本查看 复制代码
plotMA(res, ylim=c(-5,5))

11

11

修改过的log2 fold change threshold,并标记特定基因:
[Python] 纯文本查看 复制代码
plotMA(resLFC1, ylim=c(-5,5))
topGene <- rownames(resLFC1)[which.min(resLFC1$padj)]
with(resLFC1[topGene, ], {
 points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
 text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})

12

12

5.5)Plotting results提取前20个highest variance across samples.基因进行聚类
[Python] 纯文本查看 复制代码
library("genefilter")
topVarGenes <- head(order(rowVars(assay(rld)),decreasing=TRUE),20)
mat <- assay(rld)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld)[,c("cell","dex")])
pheatmap(mat, annotation_col=df)

13

13

5.6)Annotating and exporting results
因为这里只有Ensembl gene IDs,而其它的名称也非常有用。用 mapIds()函数, 其中column 参数表示需要哪种信息, multiVals参数表示如果有多个对应值的时候如何选择.,这里选择第一个. 调用两次函数 add the gene symbol  and Entrez ID.
[Python] 纯文本查看 复制代码
library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)
res$symbol <- mapIds(org.Hs.eg.db,
          keys=row.names(res),
          column="SYMBOL", 
          keytype="ENSEMBL",
          multiVals="first") 
res$entrez <- mapIds(org.Hs.eg.db,
        keys=row.names(res),
        column="ENTREZID",
        keytype="ENSEMBL",
         multiVals="first")
resOrdered <- res[order(res$padj),]
head(resOrdered)

5.7)Plotting fold changes in genomic space
[Python] 纯文本查看 复制代码
(resGR <- results(dds, lfcThreshold=1, format="GRanges"))
resGR$symbol <- mapIds(org.Hs.eg.db, names(resGR), "SYMBOL", "ENSEMBL")  #加基因名
library("Gviz")  #用于可视化for plotting the GRanges and associated metadata: the log fold changes due to dexam- ethasone treatment.
window <- resGR[topGene] + 1e6
strand(window) <- "*"
resGRsub <- resGR[resGR %over% window]
naOrDup <- is.na(resGRsub$symbol) | duplicated(resGRsub$symbol)
resGRsub$group <- ifelse(naOrDup, names(resGRsub), resGRsub$symbol)
sig <- factor(ifelse(resGRsub$padj < .1 & !is.na(resGRsub$padj),"sig","notsig"))
options(ucscChromosomeNames=FALSE)
g <- GenomeAxisTrack()
a <- AnnotationTrack(resGRsub, name="gene ranges", feature=sig)
d <- DataTrack(resGRsub, data="log2FoldChange", baseline=0,
 type="h", name="log2 fold change", strand="+")
plotTracks(list(g,d,a), groupAnnotation="group", notsig="grey", sig="hotpink")

14

14

5.8)Removing hidden batch effects
[Python] 纯文本查看 复制代码
library("sva")
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv
par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$cell,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$cell,vertical=TRUE,main="SV2")
abline(h=0)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex
ddssva <- DESeq(ddssva)

5.9)Time course experiments
[Python] 纯文本查看 复制代码
library("fission")
data("fission")
ddsTC <- DESeqDataSet(fission, ~ strain + minute + strain:minute)
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ strain + minute)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)
data <- plotCounts(ddsTC, which.min(resTC$padj),
intgroup=c("minute","strain"), returnData=TRUE)
ggplot(data, aes(x=minute, y=count, color=strain, group=strain)) +
     geom_point() + stat_smooth(se=FALSE,method="loess") + scale_y_log10()
resultsNames(ddsTC)
res30 <- results(ddsTC, name="strainmut.minute30", test="Wald")
res30[which.min(resTC$padj),]

15

15

对显著表达基因进行聚类
[Python] 纯文本查看 复制代码
betas <- coef(ddsTC)
colnames(betas)
library("pheatmap")
topGenes <- head(order(resTC$padj),20)
mat <- betas[topGenes, -c(1,2)]
thr <- 3
mat[mat < -thr] <- -thr
mat[mat > thr] <- thr
pheatmap(mat, breaks=seq(from=-thr, to=thr, length=101),
        cluster_col=FALSE)

16

16









回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2023-3-20 17:31 , Processed in 0.109071 second(s), 35 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.