搜索
查看: 3282|回复: 1

[other] 转录组入门 (2) - (8)

[复制链接]

2

主题

41

帖子

385

积分

中级会员

Rank: 3Rank: 3

积分
385
发表于 2017-8-25 22:52:58 | 显示全部楼层 |阅读模式
本帖最后由 yojoy123 于 2017-8-25 22:52 编辑

转录组入门(2)下载数据
因为是菜鸟,本次学习基本上都是参考群主以及群里各位大神的代码来进行轻度修改实现的。主要目标是实现生物信息学入门。
为了不掉队,所使用的数据就是Jimmy提到的GSE81916数据集中的RNAseq数据。使用GEO提供的SRAtools中的prefetch功能下载数据,默认下载路径为:~/ncbi/public/sra代码如下:
[Bash shell] 纯文本查看 复制代码
#Download data
for ((i=56;i<=62;i=i++)); do prefetch SRR35899$i; done

Aspera 安装参考:
http://www.biotrainee.com/thread-280-1-1.html
Aspera高速下载客户端安装
[Bash shell] 纯文本查看 复制代码
mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget [url=http://download.asperasoft.com/download/sw/ascp-client/3.5.4/ascp-install-3.5.4.102989-linux-64.sh]http://download.asperasoft.com/d ... .102989-linux-64.sh[/url]
sudo sh ascp-install-3.5.4.102989-linux-64.sh
# 查看帮助文件
ascp -help
#查看ascp版本,生成Key文件
ascp -A

[Bash shell] 纯文本查看 复制代码
#Use ascp to download data (recommand)
for i in $(seq 56 62); do ascp -T -l 200M -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh [url=mailto:anonftp@ftp-private.ncbi.nlm.nih.gov]anonftp@ftp-private.ncbi.nlm.nih.gov[/url]:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP075/SRP075747/SRR35899$i/SRR35899$i.sra /home/jshi//data/project/RNAseq; done


转录组入门(3)了解fastq测序数据
使用安装好的SRAtools将SRA文件转为fastq格式,用gzip进行压缩。
[Bash shell] 纯文本查看 复制代码
#批量分析sra测序数据质量
#首先需要将sra数据转换为fastq数据格式==
# 方法1:
cd ~/ncbi/public/sra
ls *.sra| while read id; do (fastq-dump --gzip --split-3 $id -O ~/data/project/RNAseq); done
# 方法2:
for ((i=56;i<=62;i++)); do fastq-dump --gzip --split-3 -A ~/ncbi/public/sra/SRR35899$i.sra -O ~/data/project/RNAseq; done

cd ~/data/project/RNAseq
#由于文件名过长,所以用到批量重命名命令
rename "s/_home_jshi_ncbi_public_sra_//" *
rename "s/.sra_//" *

#用fastQC查看测序结果质量
ls *.fastq.gz|xargs ~/biosoft/fastqc/FastQC/fastqc -o ~/data/project/RNAseq/qc -t 40

#利用anaconda安装MultiQC
conda install -c bioconda multiqc
# 测试
multiqc --help
#用multiqc对fastqc结果进行汇总
multiqc ~/data/project/RNAseq//human/qc/*fastqc.zip --ignore *.html


fastQC软件中文教程还请参考其他小伙伴们的笔记。

转录组入门(4)了解参考基因组及基因注释
群主发的很多资料都涉及到参考基因组和GTF文件下载,再次不再赘述。特别提示:要注意hg19(GRCh37)和hg38(GRCh38)的区别, 在同一工作流程中要使用同一版本的参考基因组数据资源。
[Bash shell] 纯文本查看 复制代码
mkdir -p ~/reference/human/gtf/genecode/mm10
wget [url=ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M14/gencode.vM14.annotation.gtf.gz]ftp://ftp.sanger.ac.uk/pub/genco ... 4.annotation.gtf.gz[/url]
mkdir -p ~/reference/human/gtf/genecode/human
wget [url=ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh37_mapping/gencode.v26lift37.annotation.gtf.gz]ftp://ftp.sanger.ac.uk/pub/genco ... 7.annotation.gtf.gz[/url]

mkdir -p ~/reference/human/gtf/ensembl
wget [url=ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz]ftp://ftp.ensembl.org/pub/grch37 ... ns.GRCh37.87.gtf.gz[/url]



mkdir -p ~/reference/human/gtf/ncbi
转录组入门(5)序列比对
#序列比对,参考Jimmy两小时搞定RNAseq帖子和生信媛发布的转录组作业。
[Bash shell] 纯文本查看 复制代码
reference=/home/jshi/reference/index/hisat/mm10/genome
~/biosoft/hisat/current/hisat2  -p 5 -x $reference -1 SRR3589959_1.fastq.gz  -2  SRR3589959_2.fastq.gz -S control_1.sam  2>control_1.log
~/biosoft/hisat/current/hisat2  -p 5 -x $reference -1 SRR3589960_1.fastq.gz  -2  SRR3589960_2.fastq.gz -S Akap95_1.sam  2>Akap95_1.log
~/biosoft/hisat/current/hisat2  -p 5 -x $reference -1 SRR3589961_1.fastq.gz  -2  SRR3589961_2.fastq.gz -S control_2.sam  2>control_2.log
~/biosoft/hisat/current/hisat2  -p 5 -x $reference -1 SRR3589962_1.fastq.gz  -2  SRR3589962_2.fastq.gz -S Akap95_2.sam  2>Akap95_2.log

##然后把sam文件根据reads name来排序并且转换为bam文件节省空间
ls *sam |while read id;do (nohup samtools sort -n -@ 10 -o ${id%%.*}.Nsort.bam $id   & );done
##用samtools来对bam文件进行sort和index
ls *Nsort.bam |while read id;do (nohup samtools sort -n -@ 10 $id -o ${id%%.*}.sorted.bam  & );done
ls *sorted.bam | while read id; do (nohup samtools index -@40 $id &); done


#参考生信媛发布的作业,个人认为生信媛同学发布的作业程序看着更舒服。
#! usr/bin/bash
set -u
set -e
set -o pipefail
hg19_ref=~/reference/index/hisat/hg19/genome
mm10_ref=~/reference/index/hisat/mm10/genome
data_path=~/data/project/RNAseq
NUM_THREADS=25

ls --color=never Homo*1.fastq.gz|while read id;do(~/biosoft/hisat2-2.1.0/hisat2 -t -p $NUM_THREADS -x $hg19_ref -1 $data_path/${id%_*}_1.fastq.gz -2 $data_path/${id%_*}_2.fastq.gz 2>${id%_*}_map.log | samtools view -Sb - >${id%_*}.bam);done
ls --color=never Mus*1.fastq.gz|while read id;do(~/biosoft/hisat2-2.1.0/hisat2 -t -p $NUM_THREADS -x $mm10_ref -1 $data_path/${id%_*}_1.fastq.gz -2 $data_path/${id%_*}_2.fastq.gz 2>${id%_*}_map.log | samtools view -Sb - >${id%_*}.bam);done 
ls --color=never *.bam|while read id;do(samtools sort --threads $NUM_THREADS $id -o ${id%.*}_sorted.bam);done
ls --color=never *_sorted.bam | while read id;do(samtools index $id);done


[Bash shell] 纯文本查看 复制代码
#picard进行质控
java -jar ~/biosoft/picard_2.8.0/picard.jar CollectRnaSeqMetrics I=Homo_sapiens_Control_293_cell_sorted.bam O=Homo_sapiens_Control_293_cell_RNA_Metrics REF_FLAT=~/rna_seq/data/reference/genome/hg19/refFlat.txt STRAND_SPECIFICITY=NONE

#启动IGV查看基因视图
bash ~/biosoft/igv/igv_3.0_beta/igv.sh


转录组入门(6) reads计数
[Bash shell] 纯文本查看 复制代码
#HTseqCount 基因表达计数(M14小鼠,基因id)
ls *.Nsort.bam |while read id;do (nohup samtools view $id | ~/.local/bin/htseq-count -f bam -i gene_id -s no $id ~/reference/gtf/gencode/mm10/gencode.vM14.annotation.gtf 1>${id%%.*}.geneCounts 2>${id%%.*}.HTseq.log&);done

#HTseqCount 基因表达计数(M14小鼠,基因name)
ls *.Nsort.bam |while read id;do (nohup samtools view $id | ~/.local/bin/htseq-count -f bam -i gene_name -s no $id ~/reference/gtf/gencode/mm10/gencode.vM14.annotation.gtf 1>${id%%.*}.geneCounts 2>${id%%.*}.HTseq.log&);done

转录组入门(7): 差异基因分析
参考http://www.biotrainee.com/thread-2018-1-1.html
http://www.jianshu.com/p/3bfb21d24b74
这两个帖子讲的都很好,我照葫芦画瓢,先囫囵吞枣的学。
该步骤在Rsutdio中实现
[Bash shell] 纯文本查看 复制代码
# DESeq2
library(DESeq2)
#将gene_count数据读入相应表格
control1 <- read.table("~/tmp/matrix/control_1.geneidCounts", sep="\t", col.names = c("gene_id","control1"))
control2 <- read.table("~/tmp/matrix/control_2.geneidCounts", sep="\t", col.names = c("gene_id","control2")) 
rep1 <- read.table("~/tmp/matrix/Akap95_1.geneidCounts", sep="\t", col.names = c("gene_id","akap951")) 
rep2 <- read.table("~/tmp/matrix/Akap95_2.geneidCounts", sep="\t",col.names = c("gene_id","akap952"))
#合并数据
raw_count <- merge(merge(control1, control2,by="gene_id"),merge(rep1,rep2, by="gene_id"))
raw_count_filt <- raw_count[-48823:-48825,]
raw_count_filter <- raw_count_filt[-1:-2,]
ENSEMBL <- gsub("\\.\\d*", "", raw_count_filter$gene_id) 
row.names(raw_count_filter) <- ENSEMBL
raw_count_filter <- raw_count_filter[ ,-1]
#设置分组信息,构建dds对象
condition <- factor(c(rep("control",2),rep("akap95",2)), levels = c("control","akap95"))
countData <- raw_count_filter[,1:4]
colData <- data.frame(row.names=colnames(raw_count_filter), condition)
dds <- DESeqDataSetFromMatrix(countData, colData, design= ~ condition)
head(dds)
#DESeq函数估计离散度,差异分析获得res对象
dds2 <- DESeq(dds)
resultsNames(dds2)
res <- results(dds2)
summary(res)

#设定阈值,筛选差异基因,导出数据。
table(res$padj<0.05)
res <- res[order(res$padj),]
diff_gene_deseq2 <-subset(res,padj < 0.05 & (log2FoldChange > 1 | log2FoldChange < -1))
diff_gene_deseq2 <- row.names(diff_gene_deseq2)
resdata <-  merge(as.data.frame(res),as.data.frame(counts(dds2,normalize=TRUE)),by="row.names",sort=FALSE)
#将结果写入csv文件保存
write.csv(resdata,file= "~/tmp/matrix/control_vs_akap95.csv",row.names = F)

# edgeR寻找差异表达基因

[Bash shell] 纯文本查看 复制代码
# edgeR
#将gene_count数据读入相应表格
control1 <- read.table("~/tmp/matrix/control_1.geneidCounts", sep="\t", col.names = c("gene_id","control1"))
control2 <- read.table("~/tmp/matrix/control_2.geneidCounts", sep="\t", col.names = c("gene_id","control2")) 
rep1 <- read.table("~/tmp/matrix/Akap95_1.geneidCounts", sep="\t", col.names = c("gene_id","akap951")) 
rep2 <- read.table("~/tmp/matrix/Akap95_2.geneidCounts", sep="\t",col.names = c("gene_id","akap952"))
#合并数据
raw_count <- merge(merge(control1, control2,by="gene_id"),merge(rep1,rep2, by="gene_id"))
raw_count_filt <- raw_count[-48823:-48825,]
raw_count_filter <- raw_count_filt[-1:-2,]
ENSEMBL <- gsub("\\.\\d*", "", raw_count_filter$gene_id) 
row.names(raw_count_filter) <- ENSEMBL
raw_count_filter <- raw_count_filter[ ,-1]
#设置分组信息,并做TMM标准化
condition <- factor(c(rep("control",2),rep("akap95",2)), levels = c("control","akap95"))
countData <- raw_count_filter[,1:4]
colData <- data.frame(row.names=colnames(raw_count_filter), condition)

exprSet <- DGEList(counts = countData, group = condition)
exprSet <- calcNormFactors(exprSet)
#使用qCML(quantile-adjusted conditional maximum likelihood)估计离散度(只针对单因素实验设计)
exprSet <- estimateCommonDisp(exprSet)
exprSet <- estimateTagwiseDisp(exprSet)
#寻找差异gene(这里的exactTest函数还是基于qCML并且只针对单因素实验设计),然后按照阈值进行筛选即可
et <- exactTest(exprSet)
tTag <- topTags(et, n=nrow(exprSet))
diff_gene_edgeR <- subset(tTag$table, FDR < 0.05 & (logFC > 1 | logFC < -1))
diff_gene_edgeR <- row.names(diff_gene_edgeR)
write.csv(tTag$table,file = "control_vs_Akap95_edgeR.csv")


[Bash shell] 纯文本查看 复制代码
# limma(未找到差异表达基因,不知道是不是哪里设置的不对)
#参考:[url=http://www.bioinfo-scrounger.com/archives/115]http://www.bioinfo-scrounger.com/archives/115[/url]
rm(list=ls())
library(limma)
#limma包需要对count进行标准化以及进行log2转化,需要用到edgeR里的calcNormFactors函数
library(edgeR)
#将gene_count数据读入相应表格
control1 <- read.table("~/tmp/matrix/control_1.geneidCounts", sep="\t", col.names = c("gene_id","control1"))
control2 <- read.table("~/tmp/matrix/control_2.geneidCounts", sep="\t", col.names = c("gene_id","control2")) 
rep1 <- read.table("~/tmp/matrix/Akap95_1.geneidCounts", sep="\t", col.names = c("gene_id","akap951")) 
rep2 <- read.table("~/tmp/matrix/Akap95_2.geneidCounts", sep="\t",col.names = c("gene_id","akap952"))
#合并数据
raw_count <- merge(merge(control1, control2,by="gene_id"),merge(rep1,rep2, by="gene_id"))
raw_count_filt <- raw_count[-48823:-48825,]
raw_count_filter <- raw_count_filt[-1:-2,]
ENSEMBL <- gsub("\\.\\d*", "", raw_count_filter$gene_id) 
row.names(raw_count_filter) <- ENSEMBL
raw_count_filter <- raw_count_filter[ ,-1]
#设置分组信息
condition <- factor(c(rep("control",2),rep("akap95",2)), levels = c("control","akap95"))
countData <- raw_count_filter[,1:4]
colData <- data.frame(row.names=colnames(raw_count_filter), condition)
design <- model.matrix(~condition)
#对count进行标准化以及转化为log2的值,这里标准化的方法为TMM,使用edgeR里面的calcNormFactors函数即可
exprSet <- DGEList(counts = countData, group = condition)
exprSet <- calcNormFactors(exprSet)
logCPM <- cpm(exprSet, log=TRUE, prior.count=3)
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
output <- topTable(fit, coef=2,n=Inf)
sum(output$adj.P.Val<0.05)

#limma voom寻找差异基因
v <- voom(exprSet, design, plot=TRUE)
fit <- lmFit(v, design)
fit <- eBayes(fit)
output_voom <- topTable(fit, coef=2,n=Inf)
sum(output_voom$adj.P.Val<0.05)


#GO和KEGG富集分析
[Bash shell] 纯文本查看 复制代码
rm(list=ls())
#获取差异表达基因
library(DESeq2)
#将gene_count数据读入相应表格
control1 <- read.table("~/tmp/matrix/control_1.geneidCounts", sep="\t", col.names = c("gene_id","control1"))
control2 <- read.table("~/tmp/matrix/control_2.geneidCounts", sep="\t", col.names = c("gene_id","control2")) 
rep1 <- read.table("~/tmp/matrix/Akap95_1.geneidCounts", sep="\t", col.names = c("gene_id","akap951")) 
rep2 <- read.table("~/tmp/matrix/Akap95_2.geneidCounts", sep="\t",col.names = c("gene_id","akap952"))
#合并数据
raw_count <- merge(merge(control1, control2,by="gene_id"),merge(rep1,rep2, by="gene_id"))
#head(raw_count)
#tail(raw_count)
raw_count_filter <- raw_count[-c(1,2,51829:51831),]
ENSEMBL <- gsub("\\.\\d*", "", raw_count_filter$gene_id) 
row.names(raw_count_filter) <- ENSEMBL
raw_count_filter <- raw_count_filter[ ,-1]
#设置分组信息,构建dds对象
condition <- factor(c(rep("control",2),rep("akap95",2)), levels = c("control","akap95"))
countData <- raw_count_filter[,1:4]
colData <- data.frame(row.names=colnames(raw_count_filter), condition)
dds <- DESeqDataSetFromMatrix(countData, colData, design= ~ condition)
head(dds)
#DESeq函数估计离散度,差异分析获得res对象
#Normalization
dds2 <- DESeq(dds)
#查看结果名称,本次实验中是“intercept”,“condition_akap95_vs_control"
resultsNames(dds2)
## 将结果用results()函数来获取,赋值给res变量
res <- results(dds2)
## summary一下,看一下结果的概要信息
summary(res)
#设定阈值,筛选差异基因
table(res$padj<0.05)
res <- res[order(res$padj),]
genelist <- res
diff_gene_deseq2 <-subset(res,padj < 0.05 & abs(log2FoldChange) > 1)

##构造GSEA分析需要用到的数据
#按照log2FC大小来排序基因列表
genelist<- diff_gene_deseq2$log2FoldChange
summary(genelist)
names(genelist)<- rownames(diff_gene_deseq2)
genelist <- sort(genelist,decreasing =TRUE )
head(genelist)

#GO/KEGG分析以及GSEA分析,加载需要的包(继续研究输入数据的问题)
library(clusterProfiler)
require(DOSE)
library(DO.db)
library(org.Mm.eg.db)
require(Rgraphviz)
# 看一下数据库的ID类型
#keytypes(org.Mm.eg.db)
#ID转换(使用Jimmy推荐的select()函数进行ID的转换)
gene_ensembl <- names(genelist)
anyid <- bitr(gene_ensembl,fromType = "ENSEMBL",toType = c("GENENAME","SYMBOL","ENTREZID"),OrgDb = org.Mm.eg.db)
#transid <- select(org.Mm.eg.db, keys=gene_ensembl,columns = c("GENENAME","SYMBOL","ENTREZID"),keytype = "ENSEMBL")
gene <- anyid[,4]
#也可以使用bitr()函数进行ID转换

ego <- enrichGO(
  gene =gene_ensembl,
  OrgDb = org.Mm.eg.db,
  keytype = "ENSEMBL",
  ont = "MF"
)
#气泡图
dotplot(ego,font.size=8)
#柱状图
barplot(ego,font.size = 8)
#网络图
enrichMap(ego, vertex.label.cex=1.2, layout=igraph::layout.kamada.kawai)
plotGOgraph(ego)

##KEGG通路分析
kk <- enrichKEGG(gene = gene,
                 organism = 'mmu',
                 pvalueCutoff = 0.05)
head(kk)
dotplot(kk)
barplot(kk)
enrichMap(kk)
##KEGG进行GSEA分析
# 获取按照log2FC大小来排序的基因列表
genelist <- diff_gene_deseq2$log2FoldChange
names(genelist) <- rownames(diff_gene_deseq2)
genelist <- sort(genelist, decreasing = TRUE)
# GSEA分析(具体参数参考:[url=https://mp.weixin.qq.com/s/p-n5jq5Rx2TqDBStS2nzoQ]https://mp.weixin.qq.com/s/p-n5jq5Rx2TqDBStS2nzoQ[/url])
gsemf <- gseGO(genelist,
               OrgDb = org.Mm.eg.db,
               keyType = "ENSEMBL",
               ont="MF"
)
# 查看大致信息
head(gsemf)
# 画出GSEA图
gseaplot(gsemf, geneSetID="GO:0001077")


[Bash shell] 纯文本查看 复制代码
##对人类基因进行GSEA基因富集分析,需要先载入gmt文件,并用R读取。

gene<-na.omit(gene)
class(gene)
gmtfile <- system.file("extdata", "h.all.v6.0.entrez.gmt", package="clusterProfiler")
h_all <- read.gmt(gmtfile)
egmt <- enricher(gene, TERM2GENE=h_all)
head(egmt)


总结:整体来说,本次实战都是参考群里各位大神以及网上相关帖子内容,一步一步摸索实现的。通过这个过程,对Linux操作以及NGS数据分析有了更深一步的认识,希望在不久的将来,能够独立完成作业。
欢迎各位批评指正~








上一篇:GO.db/topGO/goseq R包
下一篇:gene symbol 与gene ID 批量转换?
回复

使用道具 举报

0

主题

3

帖子

133

积分

注册会员

Rank: 2

积分
133
发表于 2017-9-17 23:28:59 | 显示全部楼层
学习一下~
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2019-7-24 02:40 , Processed in 0.033048 second(s), 25 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.