|
发表于 2017-2-5 18:24:28
|
显示全部楼层
python-031
思路:
1. 按行读取数据,跳过#开头行
2. 提取每一行的信息至列表l_list
3. 构建数据结构:{chr_id:{feature:number}};
4. 输出统计结果
代码:
[Python] 纯文本查看 复制代码 #! usr/bin/envpython
# _*_coding:utf-8_*_
from collections import OrderedDict
import sys
import time
start = time.clock()
gtfcount = OrderedDict()
featurelist = ['gene','transcript','CDS','exon']
with open(sys.argv[1], 'r') as f:
for line in f:
if not line.startswith('#'):
line = line.strip()
l_list = line.split("\t")
Chr = l_list[0]
feature = l_list[2]
if Chr not in gtfcount.keys():
gtfcount[Chr] = {}
for i in featurelist:
if not feature == i:
continue
if i in gtfcount[Chr]:
gtfcount[Chr][i] += 1
if i not in gtfcount[Chr]:
gtfcount[Chr][i] = 1
with open("result.txt", 'w') as fp_out:
fp_out.write("\t".join(["Chr","gtfcount"]) + "\n")
print "\t".join(["Chr","gtfcount"]) + "\n"
for Chr, gtfcount in gtfcount.items():
fp_out.write("\t".join([Chr, str(gtfcount)]) + "\n")
print "\t".join([Chr,str(gtfcount)]) + "\n"
结果:
Chr gtfcount
1 {'exon': 108417, 'gene': 5194, 'transcript': 17296, 'CDS': 64975}
2 {'exon': 91110, 'gene': 3971, 'transcript': 14390, 'CDS': 53480}
3 {'exon': 74562, 'gene': 3010, 'transcript': 12064, 'CDS': 43036}
4 {'exon': 45520, 'gene': 2505, 'transcript': 7732, 'CDS': 27178}
5 {'exon': 52715, 'gene': 2868, 'transcript': 9200, 'CDS': 30758}
6 {'exon': 52060, 'gene': 2863, 'transcript': 8540, 'CDS': 33020}
7 {'exon': 56882, 'gene': 2867, 'transcript': 9437, 'CDS': 32904}
X {'exon': 35309, 'gene': 2359, 'transcript': 6037, 'CDS': 22767}
8 {'exon': 42730, 'gene': 2353, 'transcript': 7717, 'CDS': 24440}
9 {'exon': 42213, 'gene': 2242, 'transcript': 6581, 'CDS': 26270}
11 {'exon': 69915, 'gene': 3235, 'transcript': 12151, 'CDS': 42363}
10 {'exon': 41978, 'gene': 2204, 'transcript': 6507, 'CDS': 26710}
12 {'exon': 69686, 'gene': 2940, 'transcript': 11419, 'CDS': 42166}
13 {'exon': 18266, 'gene': 1304, 'transcript': 3236, 'CDS': 10435}
14 {'exon': 41897, 'gene': 2224, 'transcript': 7476, 'CDS': 24528}
15 {'exon': 45517, 'gene': 2152, 'transcript': 7444, 'CDS': 26068}
16 {'exon': 57261, 'gene': 2511, 'transcript': 9896, 'CDS': 32673}
17 {'exon': 74785, 'gene': 2995, 'transcript': 12573, 'CDS': 44869}
18 {'exon': 20309, 'gene': 1170, 'transcript': 3640, 'CDS': 11850}
20 {'exon': 25502, 'gene': 1386, 'transcript': 4242, 'CDS': 15499}
19 {'exon': 71424, 'gene': 2926, 'transcript': 12695, 'CDS': 43907}
Y {'exon': 3784, 'gene': 523, 'transcript': 740, 'CDS': 1538}
22 {'exon': 26063, 'gene': 1318, 'transcript': 4472, 'CDS': 14888}
21 {'exon': 13966, 'gene': 835, 'transcript': 2413, 'CDS': 7396}
MT {'exon': 37, 'gene': 37, 'transcript': 37, 'CDS': 13}
小结:看了东哥的视频很受启发,了解了写一个可复用的程序的思路,也从这次视频中熟悉了面向对象编程,虽然自己目前还写不出来这样的程序,但是仍然给了我今后处理问题的很好的思路。
下一步:学习怎样用python及R作图 |
|