搜索
查看: 7263|回复: 7

生信编程直播第七题:写超几何分布检验!

[复制链接]

634

主题

1182

帖子

4030

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
4030
发表于 2017-2-1 09:15:36 | 显示全部楼层 |阅读模式
这个是主流的GO/KEGG的富集分析的原理,作为一个合格的生信工程师,不能总是用一堆网页工具或者别人的软件和包,这个非常简单的统计学原理的实现完全可以用自己的代码!
前提是你已经掌握了第六题,如何下载KEGG数据库信息,并且格式化,或者你自己明白数据库是什么!

为了检验自己写的超几何分布检验对不对,请大家统一用 tmp=toTable(org.Hs.egPATH) 这个kegg数据库,不要用最新的!!!然后差异基因list和背景基因list也统一,用下面的R代码即可获取:
[AppleScript] 纯文本查看 复制代码
library(org.Hs.eg.db)
library(KEGG.db)
tmp=toTable(org.Hs.egPATH)
write.table(tmp,'kegg2gene.list.txt',quote = F,row.names = F)


library("hgu95av2.db")
ls('package:hgu95av2.db')
universeGeneIds <- unique(mappedRkeys(hgu95av2ENTREZID))
write.table(universeGeneIds,'universeGeneIds.txt',quote = F,row.names = F,col.names = F)


set.seed('123456.789')
diff_gene = sample(universeGeneIds,300)
write.table(diff_gene,'diff_gene.txt',quote = F,row.names = F,col.names = F)

library(GOstats)
annotationPKG='org.Hs.eg.db'
hyperG.params = new("KEGGHyperGParams", geneIds=diff_gene, universeGeneIds=universeGeneIds, annotation=annotationPKG, 
                    categoryName="KEGG", pvalueCutoff=1, testDirection = "over")
KEGG.hyperG.results = hyperGTest(hyperG.params); 
htmlReport(KEGG.hyperG.results, file="kegg.enrichment.html", summary.args=list("htmlLinks"=TRUE))      
library(clusterProfiler)
kk <- enrichKEGG(gene         = diff_gene,
                 universe=universeGeneIds,
                 organism     = 'hsa',
                 pvalueCutoff = 1,qvalueCutoff=1)
y_results=summary(kk)



  




把自己用python或者perl或者R写好的超几何分布检验结果跟kegg.enrichment.html比较即可!

计算P值的代码也可以自己写:C(k,M)*C(n-k,N-M)/C(n,N)

[AppleScript] 纯文本查看 复制代码
[/align][align=left][align=left][align=left]library("hgu95av2.db")[/align][align=left]ls('package:hgu95av2.db')[/align][align=left]universeGeneIds <- unique(mappedRkeys(hgu95av2ENTREZID))[/align][align=left]set.seed('123456.789')[/align][align=left]diff_gene = sample(universeGeneIds,300)[/align][/align]
library(org.Hs.eg.db)
library(KEGG.db)
tmp=toTable(org.Hs.egPATH)
GeneID2kegg<<- tapply(tmp[,2],as.factor(tmp[,1]),function(x) x)
kegg2GeneID<<- tapply(tmp[,1],as.factor(tmp[,2]),function(x) x)
#results <- hyperGtest_jimmy(GeneID2kegg,kegg2GeneID,diff_gene,universeGeneIds)

GeneID2Path=GeneID2kegg
Path2GeneID=kegg2GeneID

diff_gene_has_path=intersect(diff_gene,names(GeneID2Path))
n=length(diff_gene) #306
N=length(universeGeneIds) #5870
results=c()

for (i in names(Path2GeneID)){
  #i="04672"
  M=length(intersect(Path2GeneID[[i]],universeGeneIds))
  #print(M)
 
  if(M<5)
    next
  exp_count=n*M/N
  #print(paste(n,N,M,sep="\t"))
  k=0
  for (j in diff_gene_has_path){
    if (i %in% GeneID2Path[[j]]) k=k+1
  }
  OddsRatio=k/exp_count
  if (k==0) next
  p=phyper(k-1,M, N-M, n, lower.tail=F)
  #print(paste(i,p,OddsRatio,exp_count,k,M,sep="    "))
  results=rbind(results,c(i,p,OddsRatio,exp_count,k,M))
}
colnames(results)=c("PathwayID","Pvalue","OddsRatio","ExpCount","Count","Size")
results=as.data.frame(results,stringsAsFactors = F)
results$p.adjust = p.adjust(results$Pvalue,method = 'BH')
results=results[order(results$Pvalue),]
rownames(results)=1:nrow(results)

我的perl代码是:
呀,找不到了,你们自己写吧!







上一篇:2015年的320个非洲人的全基因组数据
下一篇:Linux shell tricks for bioinformatics 系列文章之一
你这个问题很复杂,需要打赏,请点击 http://www.bio-info-trainee.com/donate 进行打赏,谢谢
回复

使用道具 举报

2

主题

21

帖子

241

积分

中级会员

Rank: 3Rank: 3

积分
241
发表于 2017-3-18 22:33:39 | 显示全部楼层
本帖最后由 AdaWon 于 2017-3-19 10:10 编辑

147-python
解题思路:
1.首先从kegg网站下载物种的信息,所谓raw data;
2.再经过解析整理,得到如下数据结构如下:<具体见题6>

Class1Class2pathway(species_pathwayID_pathwayNAME)geneSymbolgeneID
Metabolism               Overview sly01200:Carbon metabolism gene1;gene2;gene3;... geneID1;geneID2;geneID3,...
...  ...  ...  ...  ...



3. 厘清差异基因与KEGG内基因之间的关系
[Python] 纯文本查看 复制代码
import os
import re
from collections import OrderedDict

### scipy,numpy,pandas,matplotlib ...
from scipy.stats import hypergeom
import numpy as np
import pandas as pd

### rpy2 
from rpy2.robjects.packages import importr
from rpy2.robjects.vectors import FloatVector
from rpy2.robjects import pandas2ri
stats = importr('stats')

os.chdir("/pathtofile/KEGG/")
#构建kegg字典,其中key >>> pathway name;value >>> 对应pathway的所有的基因
#再构建一个含有所有基因的列表,即Population
kegg = OrderedDict()
pop = []
with open('sly00001.cleaned.keg') as f:
    for line in f:
        line = line.rstrip()
        lst = line.split('\t')
        if len(lst) > 3:
            geneList = lst[-2].split(';') #eg: gene1;gene2;gene3;...
            kegg[lst[2]] = geneList #lst[2]  hsa01200:Carbon metabolism 
            pop = pop + [gene for gene in geneList if gene not in pop]

#然后将差异基因也存入列表中,即Sample            
DEGs = []
with open('deg.file.txt') as f:  #自己随意制作的一份表
    for line in f:
        line = line.rstrip()
        lst = line.split('\t')
        if lst[0] == '':
            continue

        elif lst[0] in pop:   #lst[0]  pathway eg:Metabolism 
            DEGs.append(lst[0])

popTotal = len(pop)
listTotal = len(DEGs)

print('Number of genes in the background list: %d' %popTotal)
print('Number of DE genes involved in any pathways: %d' %listTotal)



运行结果:Number of genes in the background list: 5183
Number of DE genes involved in any pathways: 538

再进行超几何分布检验;
#超几何分布检验是统计学上一种离散概率分布。它描述了由有限个物件中抽出n个物件,成功抽出指定种类的物件的个数(不归还)。
[Python] 纯文本查看 复制代码
from scipy.stats import hypergeom

keggEnrich = OrderedDict()
for ke,val in kegg.items():
    hits = [gene for gene in val if gene in DEGs] #差异基因与pathway中包含的基因的交集
    hitCount = len(hits)
    popHits = len(val)

    if hitCount == 0:  #无任何差异基因参与的pathway中,则打印出来
        print(ke)
    else:
        #rv = hypergeom.sf(k,M,n,N)
        pVal = hypergeom.sf(hitCount-1,popTotal,popHits,listTotal) ### p(X >= hitCounts)
        keggEnrich[ke] = [hitCount,listTotal,popHits,popTotal,pVal,';'.join(hits)]  #将信息存入字典


for ke,val in keggEnrich.items():
    print(ke)
    print(val)


#对输出文件进行整理,以便查阅,该操作与R十分类似
keggOutput = pd.DataFrame.from_dict(keggEnrich,orient='columns',dtype=None) #将上述字典结构转变成数据框结构
keggOutput = pd.DataFrame.transpose(keggOutput)  #转置
keggOutput.columns = ['Count','List Total','pop Hits','popTotal','pVal','Genes']  
keggOutput = keggOutput.sort_values(by='pVal',axis=0)   #排序


###FDR 校正P值
from rpy2.robjects.packages import importr
from rpy2.robjects.vectors import FloatVector
from rpy2.robjects import pandas2ri
stats = importr('stats')

pVal = keggOutput['pVal']
fdr = stats.p_adjust(FloatVector(pVal),method='fdr')  ##校正方法有多种
fdrPD = pandas2ri.ri2py(fdr)

keggOutput.insert(5,'FDR',fdrPD)

##存为excel 表形式
keggOutput.to_excel('kegg_pathway_enrichment.xlsx',sheet_name='kegg') 




本帖子中包含更多资源

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
回复 支持 1 反对 0

使用道具 举报

13

主题

33

帖子

243

积分

版主

Rank: 7Rank: 7Rank: 7

积分
243
发表于 2017-3-4 17:52:20 | 显示全部楼层

发现是一楼,心有点虚!!!
运行了下群主的代码,发现代码是最原始的根据原理而来的。。代码看懂了就等原理了

代码如下:
[AppleScript] 纯文本查看 复制代码
library("hgu95av2.db")
ls('package:hgu95av2.db')
universeGeneIds <- unique(mappedRkeys(hgu95av2ENTREZID))#
set.seed('123456.789')
diff_gene = sample(universeGeneIds,300)

library(org.Hs.eg.db)
library(KEGG.db)
tmp=toTable(org.Hs.egPATH)
GeneID2kegg<<- tapply(tmp[,2],as.factor(tmp[,1]),function(x) x)  #pathid,geneid
head(GeneID2kegg)

kegg2GeneID<<- tapply(tmp[,1],as.factor(tmp[,2]),function(x) x)
??tapply
#results <- hyperGtest_jimmy(GeneID2kegg,kegg2GeneID,diff_gene,universeGeneIds)

GeneID2Path=GeneID2kegg
Path2GeneID=kegg2GeneID

diff_gene_has_path=intersect(diff_gene,names(GeneID2Path))
??intersect  ###取交集。。。
length(diff_gene_has_path) ##130
length(diff_gene) ##300
length(names(GeneID2Path))##5869


n=length(diff_gene) #306
N=length(universeGeneIds) #5870
results=c()

for (i in names(Path2GeneID)){
  #i="04672"
  M=length(intersect(Path2GeneID[[i]],universeGeneIds))
  #print(M)
  
  if(M<5)
    next
  exp_count=n*M/N
  #print(paste(n,N,M,sep="\t"))
  k=0
  for (j in diff_gene_has_path){
    if (i %in% GeneID2Path[[j]]) k=k+1
  }
  OddsRatio=k/exp_count
  if (k==0) next
  p=phyper(k-1,M, N-M, n, lower.tail=F)
  #print(paste(i,p,OddsRatio,exp_count,k,M,sep="    "))
  results=rbind(results,c(i,p,OddsRatio,exp_count,k,M))
}

colnames(results)=c("PathwayID","Pvalue","OddsRatio","ExpCount","Count","Size")
results=as.data.frame(results,stringsAsFactors = F)
results$p.adjust = p.adjust(results$Pvalue,method = 'BH')  ##怎么计算的?
results=results[order(results$Pvalue),]
rownames(results)=1:nrow(results)


还贴一个之前的代码吧:
[AppleScript] 纯文本查看 复制代码
rm(list=ls())
setwd("C:/Users/lixia/Desktop/XYL/single cell/single cell/ding/")
DEG=read.table("single cell DGE P C.csv",header = T, sep=",",stringsAsFactors = F,row.names = 1)
probeset=rownames(DEG[abs(DEG[,2])>2 & DEG[,5]<0.01,])
length(probeset)
##加载ID转换的包

library(annotate) # lookUp函数是属于annotate这个包的

## **转换时选用何种数据库需要查下!!!**###转换ID
library(biomaRt)  ###useMart函数
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
tmp<-getBM(attributes=c("entrezgene","hgnc_symbol","ensembl_gene_id"), filters = "hgnc_symbol", 
            values =probeset, mart=ensembl)
###不太明白以下缺少了这么多的genne。。。
###获取c中attributes的信息,用filters进行过滤,values是输入的值,mart是指定数据库。。。
head(tmp)
write.table(tmp, file="DGE C VS P.xls", sep="\t",quote=F)
###entrezgene 为 NCBIgene ID,gene的ID号,一般是entrez ID ,ensemble_gene_idw为enemble数据库ID;
##################
EGID <-tmp$entrezgene
length(unique(EGID))
diff_gene_list <- unique(EGID)

### Go analysis BP CC and MF 分析
library(GOstats)
#####BP
paramsBP<- new("GOHyperGParams", geneIds=diff_gene_list, universeGeneIds=NULL, annotation="org.Hs.eg.db",
             
            , pvalueCutoff=0.01, testDirection = "over")
GO.BP = hyperGTest(paramsBP)
BP<-summary(GO.BP)
htmlReport(GO.BP,file="P C_go BP pvaluecutoff 0.01.html")
head(BP)

####CC
paramsCC<- new("GOHyperGParams", geneIds=diff_gene_list, universeGeneIds=NULL, annotation="org.Hs.eg.db",
              , pvalueCutoff=0.01, testDirection = "over")
GO.cc = hyperGTest(paramsCC)
CC<-summary(GO.cc)
htmlReport(GO.cc,file="P C_go CC pvaluecutoff 0.01.html")
head(CC)
###MF
paramsMF<- new("GOHyperGParams", geneIds=diff_gene_list, universeGeneIds=NULL, annotation="org.Hs.eg.db",
               
              , pvalueCutoff=0.01, testDirection = "over")
GO.MF = hyperGTest(paramsMF)
MF<-summary(GO.MF)
htmlReport(GO.MF,file="P C_go MF pvaluecutoff 0.01.html")
head(MF)

#then do kegg pathway enrichment !
library(org.Hs.eg.db)
hyperG.params = new("KEGGHyperGParams", geneIds=diff_gene_list, universeGeneIds=NULL, annotation="org.Hs.eg.db",
                    categoryName="KEGG", pvalueCutoff=1, testDirection = "over")
KEGG.hyperG.results = hyperGTest(hyperG.params);
htmlReport(KEGG.hyperG.results, file="kegg.enrichment.html", summary.args=list("htmlLinks"=TRUE))


回复 支持 反对

使用道具 举报

29

主题

131

帖子

1208

积分

金牌会员

Rank: 6Rank: 6

积分
1208
发表于 2017-3-4 20:44:32 | 显示全部楼层
本帖最后由 anlan 于 2017-3-4 20:48 编辑

034-perl-shell

perl的视频挺赞的,我虽然用perl,但都没想过去写个富集分析脚本,因为我习惯性用perl去调用R了。。

我觉得用这个公式f(k,N,M,n) = C(k,M)*C(n-k,N-M)/C(n,N)算出来的结果跟R的结果不一样是因为:
这个公式只是算出某一点的概率值,而富集应该还包括大于这个点所有的概率值之和,因此这个公式导致结果比R算出来的小,不知道我理解的对不对。。。。。

结果用html格式展示数据好新奇,是用软件转化的吗?
回复 支持 反对

使用道具 举报

10

主题

52

帖子

559

积分

版主

Rank: 7Rank: 7Rank: 7

积分
559
QQ
发表于 2017-3-6 23:01:34 | 显示全部楼层
本帖最后由 旭日早升 于 2017-3-7 17:13 编辑

201-python-无名

问题解析:问题说明:用超几何分布来做KEGG的enrichment。文件格式:使用R导出了diff_gene, universe_gene和kegg2gene。问题解析:需要了解超几何分布的知识http://stattrek.com/probability-distributions/hypergeometric.aspx。数据结构:diff_gene[g1,g2,...], universe_gene[g1,g2,...], kegg2gene{p1[g1,g2,...],p2[g1,g2,...],...}。提高效率:暂时未知。
目前我对于超几何分布的理解还是很有限的,虽然我们经常把这个问题理解为黑白球问题,但是对于超几何分布做富集分析的p值还是不太理解。另外想利用楼主给的公式求p值,但是在使用python的itertools计算排列组合时一直算不出来,只能求助视频了。
我的尝试代码:
[Python] 纯文本查看 复制代码
###import module
import re, os
from collections import OrderedDict
#from itertools import combinations
###change directory if necessary
os.chdir("C:/Users/wangkangli/Documents")
###initialization variable
pathway = OrderedDict()
diff_gene = []
universe_gene = []
###read diff_gene as a list
with open("diff_gene.txt",'rt') as f:
        for line in f:
                line = line.rstrip()
                diff_gene.append(line)
###read universe gene as a list
with open("universeGeneIds.txt",'rt') as f:
        for line in f:
                line = line.rstrip()
                universe_gene.append(line)
###read kegg-pathway as a dictinary
with open("kegg2gene.list.txt",'rt') as f:
        for line in f:
                if line.startswith("g"):
                        continue
                line = line.rstrip()
                lst = line.split()
                if lst[1] not in pathway:
                        pathway[lst[1]] = [lst[0]]
                else:
                        pathway[lst[1]].append(lst[0])
n = len(diff_gene)
N = len(universe_gene)
for ptwy, genes in pathway.items():
        k = len(list(set(diff_gene).intersection(set(genes))))
        M = len(list(set(universe_gene).intersection(set(genes))))
        exp_count = n * M / N
        OddsRatio = k / exp_count
        # p = len(list(combinations(range(M),k))) * len(list(combinations(range(N-M),n-k))) / len(list(combinations(range(N),n)))
        print("\t".join([str(ptwy),str(OddsRatio),str(exp_count),str(k),str(M)]))

结果展示:




学习了东老师的视频,东老师导入了很多有用的包,让我收获不少。用scipy,numpy,pandas和rpy2可以在python方便的使用R的功能,很神奇,我在windows安装的这几个包遇到了一些小问题,如果有人和我一样安装不成功的,推荐看一下知乎的这个帖子https://www.zhihu.com/question/30188492。另外在学习的时候发现超几何的思路和老师的不一样(老师没用到universe-list),所以又重新看了楼主的github代码,发现老师用的是pathway所有gene作为background的。所以这里我详细解释一下我的理解,我们知道超几何需要用到几个数值N,n,M,k,其中N就是背景了,如果按照东老师的理解,N应该是参与pathway的所有gene,n就其中某个pathway的所有gene,M就是所有的差异gene(并且在N里的),k就是在n中的所有的差异gene了,而如果按照楼主的理解,N是所有的gene或者用于计算差异的所有gene了,n就是计算出来差异gene,M就是某个pathway的gene(并且在N里的),k就是在n里和M那个pathway的交叉的gene。总的来说,感觉就是用到的背景不同,一个是从pathway出发的,一个是从差异gene出发的,最后的结果可能略有不同。参考老师的代码,重新跑了一下数据。代码:
[Python] 纯文本查看 复制代码
import os, re 
from collections import OrderedDict

### scipy, numpy, pandas, matplotlib ...
from scipy.stats import hypergeom
import numpy as np 
import pandas as pd 
### rpy2
from rpy2.robjects.packages import importr 
from rpy2.robjects.vectors import FloatVector
from rpy2.robjects import pandas2ri
stats = importr("stats")

os.chdir("./")
###read kegg pathway file
kegg = OrderedDict()
pop = []
with open("kegg2gene.list.txt",'rt') as f:
        for line in f:
                line = line.rstrip()
                lst = line.split()
                if lst[1] not in kegg:
                        kegg[lst[1]] = [lst[0]]
                else:
                        kegg[lst[1]].append(lst[0])
                if lst[0] not in pop:
                        pop.append(lst[0])
###read diff-gene file
DEGs = []
with open("diff_gene.txt",'rt') as f:
        for line in f:
                line = line.rstrip()
                if line != "" and line not in DEGs and line in pop:
                        DEGs.append(line)
popTotal = len(pop)
listTotal = len(DEGs)

###enrichment use hypergeometric test
keggEnrich = OrderedDict()
for ke, val in kegg.items():
        hits = [gene for gene in val if gene in DEGs]
        hitCount = len(hits)
        popHits = len(val)
        if hitCount == 0:
                print(ke)
        else:
                #rv = hypergeom.sf(k,M.n,N)
                pVal = hypergeom.sf(hitCount-1, popTotal, popHits, listTotal) ### P(X >= hitCount)
                keggEnrich[ke] = [hitCount, listTotal, popHits, popTotal, pVal, ';'.join(hits)]

for ke, val in keggEnrich.items():
        print(ke)
        print(val)

###organize output to dataframe
keggOutput = pd.DataFrame.from_dict(keggEnrich, orient="columns", dtype=None)
keggOutput = pd.DataFrame.transpose(keggOutput)
keggOutput.columns = ["Count", "List Total", "pop Hits", "pop Total", "pVal", "Genes"]
keggOutput = keggOutput.sort_values(by="pVal",axis=0)

###calculate p value
pVal = keggOutput["pVal"]
fdr = stats.p_adjust(FloatVector(pVal), method = "fdr")
fdrPD = pandas2ri.ri2py(fdr)

keggOutput.insert(5, "FDR", fdrPD)
###output to excel file
keggOutput.to_excel("kegg_pathway_enrichment.xlsx", sheet_name="kegg")


python借助R来实现功能,挺好的收获。谢谢老师们。

本帖子中包含更多资源

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
回复 支持 反对

使用道具 举报

6

主题

23

帖子

201

积分

中级会员

Rank: 3Rank: 3

积分
201
发表于 2017-3-6 23:20:00 | 显示全部楼层
123-python-qin
[AppleScript] 纯文本查看 复制代码
#安装rpy2
#conda install --channel [url=https://conda.anaconda.org/Richarizardd]https://conda.anaconda.org/Richarizardd[/url] rpy2
##KEGG enrichment analysis
import os
import re
from collections import OrderedDict
import numpy as np
import pandas as pd
#os.chdir(C:\yangqin\biotree\python\lession_7)
kegg = OrderedDict()
pop = []
with open('hsa00001.cleaned.keg') as f:
        for line in f:
                line = line.rstrip()
                lst = line.split('\t')
                if len(lst) > 3:
                        geneList = lst[-2].split(';')
                        kegg[lst[2]] = geneList
                        pop = pop + [gene for gene in geneList if gene not in pop] #计算出所有通路的基因
DEGs = []
with open('limma_RNAseq.txt') as f:
        for line in f:
                line = line.strip()
                lst = line.split('\t')
                if lst[0] == '':
                        continue
                elif lst[0] in pop: #如果差异基因在背景基因里面,则取出放到DGEs里
                        DEGs.append(lst[0])                   
popTotal = len(pop)
#6990
listTotal = len(DEGs)
#15
print ('Number of genes in background list: %d' %popTotal)
print ('Number of DE genes involved in any pathways: %d' %listTotal)
from scipy.stats import hypergeom
keggEnrich = OrderedDict()
for ke,val in kegg.items(): 
        hits = [gene for gene in val if gene in DEGs]
        hitCount = len(hits)
        popHits = len(val)
        if hitCount == 0:
                print(ke)
        else:
                pVal = hypergeom.sf(hitCount-1,popTotal,popHits,listTotal)
                #Scipy中的stats.hypergeom.sf(x, m+n, m, k)相当于R中的phyper(x, m, n, k, lower.tail=FALSE),其中当lower.tail的值为真时,phyper返回一个当抽取到的白球数小于等于q值的概率,当lower.tail值为假时,phyper返回一个当抽取到的白球数大于q值的概率。phyper的返回值是一个累积的概率值。
                keggEnrich[ke] = [hitCount,listTotal,popHits,popTotal,pVal,';'.join(hits)]   
keggOutput = pd.DataFrame.from_dict(keggEnrich,orient='columns',dtype=None)
keggOutput = pd.DataFrame.transpose(keggOutput)
keggOutput.columns = ['Count','List Total','pop Hits','pop Total','pval','Genes']
keggOutput=keggOutput.sort_values(by='pval',axis=0)
from rpy2.robjects.packages import importr
from rpy2.robjects.vectors import FloatVector
from rpy2.robjects import pandas2ri
stats = importr('stats')
pVal = keggOutput['pval']
fdr = stats.p_adjust(FloatVector(pVal),method='fdr')
fdrPD = pandas2ri.ri2py(fdr)
keggOutput.insert(5,'FDR',fdrPD)
keggOutput.to_excel('kegg_pathway_enrichment.xlsx',sheet_name='kegg')
回复 支持 反对

使用道具 举报

2

主题

34

帖子

773

积分

高级会员

Rank: 4

积分
773
发表于 2017-3-14 01:02:22 | 显示全部楼层
本帖最后由 x2yline 于 2018-1-13 18:40 编辑

077 R-python-shell-x2yline

按照老师的python视频操作时,操作到最后一步算FDR时发现rpy2在我机子上总是出现错误,至今未解决:
Traceback (most recent call last):
    from rpy2.robjects.robject import RObjectMixin, RObject
  File "d:\soft\python35\lib\site-packages\rpy2\robjects\robject.py", line 6, in <module>
    rpy2.rinterface.initr()
RuntimeError: R_USER not defined.

查了一下,我的环境变量都没有问题,不知道是什么原因出错,因此这个作业拖了很久了。
我只好绕过python调用R,直接在python上实现进行FDR的计算,计算FDR的方法参考了这里:https://mp.weixin.qq.com/s?__biz ... KAjlHxacl6RQHFM6#rd


python代码如下:
[Python] 纯文本查看 复制代码
import os
os.chdir(r'E:\r\biotrainee_demo\class7')
import pandas as pd

#模拟差异基因
diff_gene_list = []
with open(r'E:\r\biotrainee_demo\class7\GDS4511.soft\GDS4511.soft','r') as f:
    for line in f:
        if not line[0] in ['#','!','^']:
            gene = line.split()[1]
            diff_gene_list.append(gene)
diff_gene_list = diff_gene_list[0:800]


def analysis_kegg(file_path):
    kegg_dict={}
    with open(file_path,'r') as f:
        for line in f:
            if line.startswith('C'):
                try:
                    path_num = line.split()[1]
#                    if len(path_num) > 10:
#                        print(line)
                    kegg_dict[path_num]= []
                except:
                    print('Warning: '+line)
            elif line.startswith('D'):
                try:
                    gene_name = line[:line.find(';')].split()[-1]
                    kegg_dict[path_num].append(gene_name)
                except:
                    print('Warning: '+line)
                    
    non_empty_path = []
    all_gene = []
    for keys in kegg_dict.keys():
        if kegg_dict[keys]:
           non_empty_path.append(keys)
           all_gene += kegg_dict[keys]
    return(kegg_dict, all_gene)
#kegg文件解析
kegg_dict, all_kegg_gene = analysis_kegg(r'E:\r\biotrainee_demo\class7\hsa00001.keg')

# 计算p值
DEGs = []
for i in diff_gene_list:
    if i in all_kegg_gene:
        DEGs.append(i)

DEGs=list(set(DEGs))
popTotal = len(set(all_kegg_gene))
listTotal = len(DEGs)


from scipy.stats import hypergeom

keggEnrich = {}
for ke, val in kegg_dict.items():
    hits = [gene for gene in val if gene in DEGs]
    hitCount =len(hits)
    popHits = len(val)
    
    if hitCount == 0:
        print(ke,end=' is not in hits\n')
    else:
        pVal = hypergeom.sf(hitCount-1, popTotal, popHits, listTotal)
        keggEnrich[ke] = [hitCount, listTotal, popHits, popTotal, pVal, ':'.join(hits)]
        
keggOutput = pd.DataFrame.from_dict(keggEnrich, orient='columns',dtype=None)
keggOutput = pd.DataFrame.transpose(keggOutput)
keggOutput.columns=['Count','List Total', 'pop Hits', 'pop Total', 'p-Val','genes']

# 计算FDR值
p = keggOutput['p-Val']
plist = [i for i in p]
length = len(plist)
psort = [i for i in plist]
psort.sort()
FDR = []
for i in plist:
    q = i*length/(psort.index(i)+1)
    FDR.append(q)
keggOutput['FDR']=FDR


结果如下:
CountList Totalpop Hitspop Totalp-ValgenesFDR
46301014615672170.00119IFNL2:IFNL3:IFNL1:IL11RA:IL12RB1:IL23R:CSF3R:JAK1:SOCS4TPN110.00119
496641462772170.001963654ATP6V1C2:ATP6V1E2:ATP6V0D2:SLC4A10.00197254
46211014617072170.002265905CCL5:TAB3:MAPK1:GPRC6A:CASP8:CARD16:NLRP6:NLRC4:JAK1:TLR40.002286504
472151466372170.0086753SLC18A2:RIMS1:ATP6V1C2:ATP6V1E2:ATP6V0D20.008794139
532361469072170.009563738TLR4:ATP6V1C2:ATP6V1E2:ATP6V0D2:CCL5:ANGPT10.00973922
512051466872170.011875718PTPN11:CCL5:ATP6V1C2:ATP6V1E2:ATP6V0D20.012149352
497731462472170.012024413SCARB1LB1:SLC46A10.012358424
474431462872170.018353967GRK7:GUCA1A:CALML60.018951539
40801214630672170.020875857HTR6:TAAR9:BRS3:SSTR3:UTS2R:RXFP1:RXFP2:UCN3:GPR156:GABRG1:GRIK5:THRA0.021656263
513441465572170.024708755CASP8:HSPA6:NLRC4:TLR40.025752787
4150714615272170.034176019ATP6V1C2:ATP6V1E2:FLCN:WNT9A:WNT9B:MAPK1:RICTOR0.035788095
2541146272170.040053518ACACB0.042141616
406451469572170.042986276TLR4:TIRAP:TAB3:TNFRSF13C:ERC10.045442635
40601014627372170.049650466CCL5:IFNL2:IFNL3:IFNL1:IL11RA:CSF3R:IL12RB1:IL23R:TNFRSF10A:TNFRSF13C0.052738773


回复 支持 反对

使用道具 举报

0

主题

1

帖子

61

积分

注册会员

Rank: 2

积分
61
发表于 2019-7-15 23:02:28 | 显示全部楼层
本帖最后由 liu9827885 于 2019-7-16 09:39 编辑
AdaWon 发表于 2017-3-18 22:33
147-python
解题思路:
1.首先从kegg网站下载物种的信息,所谓raw data;

大佬,你好,想请教下stats.hypergeom.sf中这些参数的具体信息
k,为某一通路与差异基因的交集
M,为所有通路背景基因
N,某一通路基因总数
n,为差异基因集合
这样是否正确呢?
我是参考https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.hypergeom.html
回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2019-9-16 15:41 , Processed in 0.046227 second(s), 28 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.