|
发表于 2017-1-15 21:58:56
|
显示全部楼层
本帖最后由 x2yline 于 2017-4-10 13:29 编辑
077-R-python-shell-x2yline
Ensemble( ensembl.org网站是常用真核生物参考基因组来源之一 )能够对人类基因自动进行注释,包括人类,小鼠,斑马鱼,猪和大鼠等,也包括来自HAVANA的人工注释信息。
Ensembl是一项生物信息学研究计划,旨在开发种能够对真核生物基因组进行自动注释(automatic annotation)并加以维护的软件系统。该计划由英国Sanger研究所Wellcome基金会及欧洲分子生物学实验室所属分部欧洲生物信息学研究所共同协作运营。
Ensembl与NCBI的NCBI Map Viewer和UCSC是最为常用基因组检索数据库。
Ensembl 与NCBI Map Viewer和UCSC最大区别表现在以下5点:
a.Ensembl的基因数据集是依据mRNA和蛋内序列的数据信息白动注释的。数据来源为新的基因组数据,UniProt/SwissProt和UniProt/TrEMBL的蛋白序列,NCBI的RefSeq里的DNA和蛋白序列和EMBL的cDNA序列。
b.Ensembl是一个开源(Perl API )的全自动的基因注释软件系统,很多网站都采用Ensembl这套软件系统。
c.Ensembl拥存其特有的BioMart功能。BioMart可以依据设定的要求对基 因组进行条件性检索,检索的结果吋以以图表的形式给出。
d.与其它数据库相整合,比如DAS。
e.基因组间的比较分析。
基因注释机构
目前从事基因注释的机构组织有很多,这里列出的只是较为常用的几个。
1. Ensembl:目的是做出最好的基因注释集。
2.Havana (VEGA):是桑格中心的一个基因注释组织,它的目标和Eiisembl—致,因此,结合得也最紧密。
3. HGNC -给出人类基因唯一的名字和符号。
4. UniProt 主要集中于蛋白质的信息注释。
Ensembl的通用基因注释有两种,一是Ensembl GeneBuild,它是自动化注释,速度快,实时更新,在不同物种上均适用;另一种是Wellcome基金会的 Havana (VEGA)小组的注释,它是手工注释,速度慢,但是准确,它依据的都是已经验证过的mRNA和蛋白序列来注释,比较费时。因此Ensembl基因组数据库 中,会有两种注释。
Havana (VEGA)小组的注释常有以下几种类型:
详细信息:http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html
Protein coding: 包括开放阅读框 (ORF).
Processed transcript:没有开放阅读框(ORF)
Pseudogene:假基因,是指脱氧核糖核酸(DNA)的碱基序列中,一段与其他生物体内已知的基因序列非常相似的片段。但是这个片段由于移码突变或者无义突变破坏了ORF,无法发挥原有的基因功能,也就是无法制造出蛋白质
IG gene:免疫球蛋白家族基因
TR Gene:T细胞受体基因
TEC (To be Experimentally Confirmed)
人类和小鼠基因组的GTF文件与GENCODE计划发布的gene set文件相同。
The GENCODE project 的目标为对人类和小鼠基因组提供高质量的注释信息和实验确证。
The GENCODE gene sets被其他项目作为参考而广泛使用(如 1000 Genomes).
详细内容:https://www.gencodegenes.org/about.html
带有abinitio扩展名的文件为用Genescan和abinitio基因预测工具生成的
预测基因的注释文件
------------------
GTF文件输出示例
------------------
#!genome-build GRCh38
11 ensembl_havana gene 5422111 5423206 . + . gene_id "ENSG00000167360"; gene_version "4"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
11 ensembl_havana transcript 5422111 5423206 . + . gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
11 ensembl_havana exon 5422111 5423206 . + . gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; exon_number "1"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381"; exon_id "ENSE00001276439"; exon_version "4";
11 ensembl_havana CDS 5422201 5423151 . + 0 gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; exon_number "1"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381"; protein_id "ENSP00000300778"; protein_version "4";
11 ensembl_havana start_codon 5422201 5422203 . + 0 gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; exon_number "1"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
11 ensembl_havana stop_codon 5423152 5423154 . + 0 gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; exon_number "1"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
11 ensembl_havana UTR 5422111 5422200 . + . gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
11 ensembl_havana UTR 5423155 5423206 . + . gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
1.python实现:
[AppleScript] 纯文本查看 复制代码 # -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
import time
import re
from collections import OrderedDict
start = time.clock()
def count_gtf_gene(file_path, buffer_size=1024*1024):
print(file_path.split('.')[-1])
gtf_count = OrderedDict()
with open(file_path, 'r') as f:
for i in f:
if not i.startswith('#'):
i_list = i.split('\t')
CHR = i_list[0]
feature = i_list[2]
if CHR not in gtf_count.keys():
gtf_count[CHR] = OrderedDict()
gtf_count[CHR]['all_gene'] = 0
if feature == 'gene':
gtf_count[CHR]['all_gene'] += 1
TYPE = re.search('gene_biotype "(.*?)";',i_list[-1]).group(1)
try:
gtf_count[CHR][TYPE] += 1
except:
gtf_count[CHR][TYPE] = 1
return gtf_count
def write_to_csv(gtf_gene_dict,file_save):
c = []
for a,b in gtf_gene_dict.items():
for i in b.keys():
c.append(i)
c = set(c)
file_raw = 'Chromsome'
file_raw += ',' + 'all_gene' + ',' + 'protein_coding'
for i in c:
if i not in ['all_gene', 'protein_coding']:
file_raw += ',' + i
for chr_id in list(range(1,23))+['X','Y','MT']:
chr_id = str(chr_id)
file_raw += '\nchr_'+ chr_id
file_raw += ',' + str(gtf_gene_dict[chr_id]['all_gene']) +',' + str(gtf_gene_dict[chr_id]['protein_coding'])
for i in c:
if i not in ['all_gene', 'protein_coding']:
i = str(i)
try:
file_raw += ',' + str(gtf_gene_dict[chr_id])
except:
file_raw += ',' + str(0)
file_raw_list = file_raw.split('\n')[1:]
i_list = file_raw_list[1].split(',')[1:]
SUM = list(range(len(i_list)))
for j in range(len(i_list)):
SUM[j] = 0
for i in file_raw_list:
i_list = i.split(',')[1:]
for j in range(len(i_list)):
SUM[j] += int(i_list[j])
file_raw += '\nSUM'
for i in SUM:
file_raw += ',' + str(i)
with open(file_save,'w') as f:
f.write(file_raw)
return re.sub(',', '\t', file_raw)
def count_transcript_and_exon(file_path):
print(file_path)
transcript = []
exon = []
with open(file_path, 'r') as f:
for i in f:
if not i.startswith('#'):
i_list = i.split('\t')
feature = i_list[2]
if feature == 'gene':
try:
transcript.append(transcript_num)
except:
transcript_num = 0
transcript_num = 0
elif feature == 'transcript':
try:
exon.append(exon_num)
except:
exon_num = 0
transcript_num += 1
exon_num = 0
elif feature == 'exon':
exon_num += 1
try:
transcript.append(transcipt_num)
except:
pass
try:
exon.apend(exon_num)
except:
pass
return (transcript, exon)
file_path_human = r'F:\genome\Homo_sapiens.GRCh38.87.chr.gtf\Homo_sapiens.GRCh38.87.chr.gtf'
#gtf_gene_dict1 = count_gtf_gene(file_path1)
#file_to_write1 = write_to_csv(gtf_gene_dict, file_save = 'gtf_analysis.csv')
file_path_cat = r'F:\genome\Felis_catus.Felis_catus_6.2.87.chr.gtf\Felis_catus.Felis_catus_6.2.87.chr.gtf'
#print('Used %.4f s'%(time.clock()-start))
file_path_gorilla = r'F:\genome\Gorilla_gorilla.gorGor3.1.87.chr.gtf\Gorilla_gorilla.gorGor3.1.87.chr.gtf'
file_path_mus = r'F:\genome\Mus_musculus.GRCm38.87.chr.gtf\Mus_musculus.GRCm38.87.chr.gtf'
file_path_zeb = r'F:\genome\Danio_rerio.GRCz10.87.chr.gtf\Danio_rerio.GRCz10.87.chr.gtf'
file_path_chicken = r'F:\genome\Gallus_gallus.Gallus_gallus-5.0.87.chr.gtf\Gallus_gallus.Gallus_gallus-5.0.87.chr.gtf'
file_path_elephant = r'F:\genome\Caenorhabditis_elegans.WBcel235.87.gtf\Caenorhabditis_elegans.WBcel235.87.gtf'
file_path_fluitfly = r'F:\genome\Drosophila_melanogaster.BDGP6.87.chr.gtf\Drosophila_melanogaster.BDGP6.87.chr.gtf'
file_path_pig = r'F:\genome\Sus_scrofa.Sscrofa10.2.87.chr.gtf\Sus_scrofa.Sscrofa10.2.87.chr.gtf'
file_path_dog = r'F:\genome\Canis_familiaris.CanFam3.1.87.chr.gtf\Canis_familiaris.CanFam3.1.87.chr.gtf'
file_path_horse = r'F:\genome\Equus_caballus.EquCab2.87.chr.gtf\Equus_caballus.EquCab2.87.chr.gtf'
transcript_human, exon_human = count_transcript_and_exon(file_path_human)
transcript_cat, exon_cat = count_transcript_and_exon(file_path_cat)
transcript_zeb, exon_zeb = count_transcript_and_exon(file_path_zeb)
transcript_gorilla, exon_gorilla = count_transcript_and_exon(file_path_gorilla)
transcript_mus, exon_mus = count_transcript_and_exon(file_path_mus)
transcript_pig, exon_pig = count_transcript_and_exon(file_path_pig)
transcript_fluitfly, exon_fluitfly = count_transcript_and_exon(file_path_fluitfly)
transcript_dog, exon_dog = count_transcript_and_exon(file_path_dog)
transcript_chicken, exon_chicken = count_transcript_and_exon(file_path_chicken)
transcript_horse, exon_horse = count_transcript_and_exon(file_path_horse)
transcript_elephant, exon_elephant = count_transcript_and_exon(file_path_elephant)
exon_hub = [exon_human, exon_gorilla, exon_elephant, exon_horse, exon_pig, exon_dog, exon_cat, exon_mus, exon_zeb, exon_fluitfly]
def list_to_freq(exon):
exon_range = [0]
exon_freq = [0]
for i in set(exon):
exon_range.append(i)
exon_freq.append(exon.count(i)/len(exon)*100)
return (exon_range, exon_freq)
import numpy as np
import matplotlib
from pylab import *
import matplotlib.pyplot as plt
myfont = matplotlib.font_manager.FontProperties(fname='C:/Windows/Fonts/msyh.ttf')
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(1)
plt.style.use('seaborn-white')
plt.xlim(0,15)
plt.ylim(0, 50)
plt.ylabel('百分比(%)',fontproperties=myfont)
plt.xlabel(u'外显子数目',fontproperties=myfont)
plt.title(u'不同物种基因外显子数目频率统计图', fontproperties=myfont)
i = -1
colors = ['#00FF00','#006400','blue','#9400D3','#800000','#191970','#CD853F','#FF0000','#FFFF00','#FFA500']
for exon_item in exon_hub:
i += 1
labels = ['Human','gorilla','elephant', 'horse', 'pig', 'dog', 'cat', 'mus', 'zebrafish', 'fluitfly' ]
#colors = ['r', 'g', 'k', 'b','r', 'g', 'k', 'b', 'r', 'b']
exon_range, exon_freq = list_to_freq(exon_item)
plt.plot(exon_range, exon_freq, color=colors, alpha=.61, label=labels)
#plt.xticks(exon_range[0:15])
plt.plot((median(exon_item)+i/100,median(exon_item)+i/100), (0,50), colors,ls = '--', alpha=1)
#plt.text(median(exon_item)-0.5+i/10,2,u'中位数', ha='left', color= colors, fontproperties=myfont, alpha=0.8)
plt.legend(loc='upper right')
plt.savefig('F:\\1.png', dpi=600)
plt.figure(2)
plt.style.use('ggplot')
plt.boxplot(exon_hub)
plt.ylim(0, 35)
plt.yticks(range(1,36,2))
plt.xticks(range(1,11),labels, fontsize=9)
plt.title('exons of different species')
plt.ylabel('exon number')
plt.savefig('F:\\2.png', dpi=600)
plt.show()

贴上human gtf文件跑出来的部分结果
Chromsome all_gene protein_coding polymorphic_pseudogene processed_pseudogene IG_C_pseudogene scRNA unitary_pseudogene TR_J_gene sense_overlapping IG_D_gene misc_RNA
chr_1 5194 2052 6 897 0 0 26 0 17 0 192
chr_2 3971 1255 2 751 0 1 7 0 10 0 176
chr_3 3010 1076 2 616 0 0 9 0 7 0 134
chr_4 2505 751 1 561 0 0 2 0 9 0 104
chr_5 2868 876 0 625 0 0 4 0 10 0 119
chr_6 2863 1045 3 639 0 0 10 0 7 0 105
chr_7 2867 906 2 577 0 0 13 19 12 0 143
chr_8 2353 676 1 468 0 0 4 0 8 0 82
chr_9 2242 781 2 461 1 0 5 0 9 0 96
chr_10 2204 732 1 422 0 0 5 0 9 0 89
chr_11 3235 1276 22 473 0 0 20 0 17 0 97
chr_12 2940 1034 0 490 0 0 6 0 5 0 115
chr_13 1304 327 0 295 0 0 1 0 0 0 75
chr_14 2224 623 3 324 2 0 2 60 11 27 79
chr_15 2152 597 0 287 0 0 1 0 8 10 93
chr_16 2511 866 1 252 0 0 12 0 12 0 51
chr_17 2995 1197 0 310 0 0 6 0 3 0 99
chr_18 1170 270 0 207 1 0 3 0 1 0 41
chr_19 2926 1470 2 263 0 0 7 0 6 0 61
chr_20 1386 544 0 198 0 0 5 0 3 0 68
chr_21 835 233 0 143 0 0 0 0 8 0 24
chr_22 1318 438 2 156 5 0 3 0 9 0 62
chr_X 2359 841 1 723 0 0 3 0 6 0 100
chr_Y 523 53 0 130 0 0 0 0 0 0 7
chr_MT 37 13 0 0 0 0 0 0 0 0 0
SUM 57992 19932 51 10268 9 1 154 79 187 37 2212
最近正好看到有关染色体与基因关系的一页课件,以下步骤只能说是对课件中的结论进行验证
'探索代码如下:
[Python] 纯文本查看 复制代码 import time
import re
from collections import OrderedDict
start = time.clock()
def count_gtf_gene(file_path, buffer_size=1024*1024):
print(file_path.split('.')[-1])
gtf_count = OrderedDict()
with open(file_path, 'r') as f:
for i in f:
if not i.startswith('#'):
i_list = i.split('\t')
CHR = i_list[0]
feature = i_list[2]
if CHR not in gtf_count.keys():
gtf_count[CHR] = OrderedDict()
gtf_count[CHR]['all_gene'] = 0
if feature == 'gene':
gtf_count[CHR]['all_gene'] += 1
TYPE = re.search('gene_biotype "(.*?)";',i_list[-1]).group(1)
try:
gtf_count[CHR][TYPE] += 1
except:
gtf_count[CHR][TYPE] = 1
return gtf_count
file_path_human = r'F:\genome\Homo_sapiens.GRCh38.87.chr.gtf\Homo_sapiens.GRCh38.87.chr.gtf'
hg38_gene = count_gtf_gene(file_path_human)
chr_id_list = list(range(1,23)) + ['X','Y','MT']
gene_num_list = []
for i in chr_id_list:
gene_num_list.append(hg38_gene[str(i)]['all_gene'])
import matplotlib.pyplot as plt
plt.figure(1)
plt.style.use('seaborn-white')
plt.bar(left = range(25), height = gene_num_list, color='k')
for i in range(len(gene_num_list)):
plt.text( x= i , y=gene_num_list+35,s=str(round(gene_num_list)), fontsize = 5)
plt.title('Gene distribution of hg38 genome')
plt.ylabel('Gene number of chromosome')
pos = []
for i in range(len(chr_id_list)):
pos.append(i + 0.35)
plt.xticks(pos, list(range(1,23)) + ['X','Y','MT'], fontsize=8)
plt.xlim(-0.2, )
plt.savefig('F:\gene_distribution.png',dpi=600)
plt.show()
plt.figure(2)
plt.style.use('seaborn-white')
import os
from collections import OrderedDict
def count_fasta_atcgn(file_path, buffer_size=1024*1024):
bases = ['N', 'A', 'T', 'C', 'G']
ATCG_analysis = OrderedDict()
with open(file_path, 'r') as f:
line1 = f.readline().upper()
chr_i = re.split('\s', line1)[0][1:]
print(chr_i)
ATCG_analysis[chr_i] = OrderedDict()
for base in bases:
ATCG_analysis[chr_i][base] = 0
while True:
chunk = f.read(buffer_size).upper()
if '>' in chunk:
chromsome = re.split('>',chunk)
if chromsome[0]:
for base in bases:
ATCG_analysis[chr_i][base] += chromsome[0].count(base)
for i in chromsome[1:]:
if i:
chr_i = re.split('\s', i[0:i.index('\n')])[0]
print(chr_i)
strings_i = i[i.index('\n'):]
ATCG_analysis[chr_i] = OrderedDict()
for base in bases:
ATCG_analysis[chr_i][base] = strings_i.count(base)
else:
for base in bases:
ATCG_analysis[chr_i][base] += chunk.count(base)
if not chunk:
break
return ATCG_analysis
file_path = r'F:\genome\hg38.chromFa\chroms\hg38.fa'
ATCG_analysis = count_fasta_atcgn(file_path, buffer_size=1024*1024)
cg_list = []
chr_id_list = list(range(1,23)) + ['X','Y','M']
for i in chr_id_list:
cg_list.append((ATCG_analysis['CHR'+str(i)]['G']+ATCG_analysis['CHR'+str(i)]['C'])/(ATCG_analysis['CHR'+str(i)]['A']+ATCG_analysis['CHR'+str(i)]['T']+ATCG_analysis['CHR'+str(i)]['C']+ATCG_analysis['CHR'+str(i)]['G'])*100)
plt.bar(left = range(25), height = cg_list, color='k')
for i in range(len(cg_list)):
plt.text( x=i -.1, y=cg_list+.35,s=str(round(cg_list)))
plt.title('GC content for hg38 genome')
plt.ylabel('GC content (%)')
pos = []
for i in range(len(chr_id_list)):
pos.append(i + 0.35)
plt.xticks(pos, list(range(1,23)) + ['X','Y','MT'], fontsize=8)
plt.xlim(-0.2, )
plt.ylim(0, 100)
plt.savefig('F:\hg38_gc.png',dpi=600)
plt.show()
结果(图)为:


验证的结论:
13、21和18号、X、Y染色体基因分布比较少,这些染色体可以发生染色体数目变异而不发生早期胚胎死亡,在三体综合征中能够存活的是21、18、13三体(其他的均完全致死)并且疾病严重程度依次递增(与基因分布成正比),除此之外还有部分存活的22三体,不过所有报告22三体的病例中都有一条22号染色体带有缺陷。
|
|