搜索
查看: 66|回复: 0

[basic] edgeR包

[复制链接]

25

主题

25

帖子

127

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
127
发表于 2018-9-14 17:26:03 | 显示全部楼层 |阅读模式
1)简介edgeR作用对象是count文件,rows 代表基因,行代表文库,count代表的是比对到每个基因的reads数目。它主要关注的是差异表达分析,而不是定量基因表达水平。
edgeR works on a table of integer read counts, with rows corresponding to genes and columns to independent libraries. The counts represent the total number of reads aligning to each gene (or other genomic locus).edgeR is concerned with differential expression analysis rather than with the quantification of expression levels. It is concerned with relative changes in expression levels between conditions,but not directly with estimating absolute expression levels.Note that edgeR is designed to work with actual read counts. We not recommend that predicted transcript abundances are input the edgeR in place of actual counts.
归一化原因:
技术原因影响差异表达分析:
1)Sequencing depth:统计测序深度(即代表的是library size);
2)RNA composition:个别异常高表达基因导致其它基因采样不足
3)GC content: sample-specific effects for GC-content can be detected
4)sample-specific effects for gene length have been detected
注意:edgeR必须是原始表达量,而不能是rpkm等矫正过的。
2)安装
[Python] 纯文本查看 复制代码
if("edgeR" %in% rownames(installed.packages()) == FALSE) {source("http://bioconductor.org/biocLite.R");biocLite("edgeR")}
suppressMessages(library(edgeR))
ls('package:edgeR')

3)矩阵构建及差异分析
这里所用的数据是文献Tumor Transcriptome Sequencing Reveals Allelic Expression Imbalances Associated with Copy Number Alterations补充材料中的table1数据
需要构建2个矩阵:1、表达矩阵;2、分组矩阵( 实验设计);
-------------------------------------------------------表达矩阵-----------------------------------------
3.1、读取表达矩阵文件(Reading in the data)
[Python] 纯文本查看 复制代码
#读取文件
rawdata <- read.delim("E:/software/R/R-3.5.0/library/edgeR/Meta/TableS1.txt", check.names=FALSE, stringsAsFactors=FALSE)
head(rawdata)

3.2 、构建DGEList对象
这里因为已经有rawdata的count文件,因此直接用DGEList()函数就行了,否则要用readDGE()函数
[Python] 纯文本查看 复制代码
y <- DGEList(counts=rawdata[,4:9], genes=rawdata[,1:3])##构建DGEList对象

3.3)数据注释( Annotation)
这里主要是因为该文章数据是前好多年的,因此需要过滤,symbol更新等。
1)The study  was undertaken a few years ago, so not all of the RefSeq IDs provided by match RefSeq IDs currently in use. We retain only those transcripts with IDs in the current NCBI annotation, which is provided by the org.HS.eg.db package
2)因为edgeR默认使用NCBI中refSeq的ID,所以通过refseq Id 找到entrezID,然后通过entrezID对symbol更新
[Python] 纯文本查看 复制代码
#######retain only those transcripts with IDs in the current NCBI annotation provided by the org.HS.eg.db######
library(org.Hs.eg.db)
idfound <- y$genes$RefSeqID %in% mappedRkeys(org.Hs.egREFSEQ)
y <- y[idfound,]
dim(y)  ##15550 6
###################### 在注释中加入  Entrez Gene IDs #########################
egREFSEQ <- toTable(org.Hs.egREFSEQ) 
m <- match(y$genes$RefSeqID, egREFSEQ$accession)
y$genes$EntrezGene <- egREFSEQ$gene_id[m]
#####################用Entrez Gene IDs更新gene symbols##########################
egSYMBOL <- toTable(org.Hs.egSYMBOL)
m <- match(y$genes$EntrezGene, egSYMBOL$gene_id)
y$genes$Symbol <- egSYMBOL$symbol[m]
head(y$genes)

3.4) 过滤和归一化(Filtering and normalization)
过滤一:Different RefSeq transcripts for the same gene symbol count predominantly the same reads. So we keep one transcript for each gene symbol. We choose the transcript with highest overall count(选择最高表达量):
[Python] 纯文本查看 复制代码
o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes$Symbol)
y <- y[!d,]
nrow(y)

过滤二:Normally we would also filter lowly expressed genes.For this data, all transcripts already have at least 50 reads for all samples of at least one of the tissues types.
[Python] 纯文本查看 复制代码
y$samples$lib.size <- colSums(y$counts)  #Recompute the library sizes
###############################Use Entrez Gene IDs as row names:#####################
rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
y$genes$EntrezGene <- NULL

归一化:TMM normalization is applied to this dataset to account for compositional difference between the libraries.
[Python] 纯文本查看 复制代码
y <- calcNormFactors(y)
y$samples

3.5) 数据的探索(Data exploration)
样本间关系(samples for outliers and for other relationships)
[Python] 纯文本查看 复制代码
plotMDS(y)


-----------------------------------分组矩阵(根据实验设计、目的)--------------------------------------
Here we want to test for differential expression between tumour and normal tissues within patients, i.e. adjusting for differences between patients.
[Python] 纯文本查看 复制代码
Patient <- factor(c(8,8,33,33,51,51))
Tissue <- factor(c("N","T","N","T","N","T"))
data.frame(Sample=colnames(y),Patient,Tissue)
design <- model.matrix(~Patient+Tissue)
rownames(design) <- colnames(y)
design

3.4)估计离散(estimate the NB dispersion for the dataset.)
[Python] 纯文本查看 复制代码
y <- estimateDisp(y, design, robust=TRUE)
y$common.dispersion    #0.1594505
plotBCV(y)



---------------------------------------差异分析-------------------------------------------------------
3.5) 差异分析(Differential expression)
[Python] 纯文本查看 复制代码
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
topTags(lrt)
summary(decideTests(lrt))
plotMD(lrt)
abline(h=c(-1, 1), col="blue")


--------------------------------------- Gene ontology analysis--------------------------------

[Python] 纯文本查看 复制代码
go <- goana(lrt)
topGO(go, ont="BP", sort="Up", n=30)


本帖子中包含更多资源

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树    

GMT+8, 2018-9-22 04:32 , Processed in 0.120473 second(s), 25 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.