搜索
查看: 2605|回复: 1

Python [Rosalind]-16[MPRT]Finding a Protein Motif

[复制链接]

5

主题

21

帖子

898

积分

高级会员

Rank: 4

积分
898
QQ
发表于 2017-7-4 09:31:13 | 显示全部楼层 |阅读模式
第16题,查找一个蛋白模体,参考了https://github.com/jdp/rosalindhttps://github.com/jdp/rosalind。采用正则表达式的做法更简洁些。和标准答案好像有点出入,不知道有没有错误。

[Python] 纯文本查看 复制代码
'''
Problem Title: Finding a Protein Motif
Rosalind ID: MPRT
Rosalind #: 016
URL: [url]http://rosalind.info/problems/mprt/[/url]
Motif Implies Functionclick to expand
Problem:
To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
You can see the complete description and features of a particular protein by its access ID "uniprot_id" in the UniProt database, by inserting the ID number into
[url]http://www.uniprot.org/uniprot/uniprot_id[/url]
Alternatively, you can obtain a protein sequence in FASTA format by following
[url]http://www.uniprot.org/uniprot/uniprot_id.fasta[/url]
For example, the data for protein B5ZC00 can be found at [url]http://www.uniprot.org/uniprot/B5ZC00.[/url]
Given: At most 15 UniProt Protein Database access IDs.
Return: For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.
Sample Dataset:
B5ZC00
Sample Output
B5ZC00
85 118 142 306 395
'''
# To allow for the presence of its varying forms,
# a protein motif is represented by a shorthand as follows: 
# [XY] means "either X or Y" and {X} means "any amino acid except X." 
# For example, the N-glycosylation motif is written as N{P}[ST]{P}.
from urllib import request
import re

#此处还可以直接用Biopython中的ExPASy解决,代码见最后,参考自《Python生物信息学数据管理(Manaing your Biological Data with Python)》
def fetch_fasta_file(uniprot_id):
	url = 'http://www.uniprot.org/uniprot/' + uniprot_id + '.fasta'
	response = request.urlopen(url)
	file = response.readlines()
	#print(file)
	seq, seq_name = parse_fasta(file)
	return seq, seq_name

seq = ''
seq_name = ''
def parse_fasta(file):
	seq = ''
	seq_name = ''
	outf = ('%s.fasta' % uniprot_id, 'w')
	for a in file:
		#outf.write(a)
		#print(str(a)[2])
		if str(a)[2] == '>':
			seq_name = str(a)[3:len(a) - 2]
			#print(seq_name + '\n\n\n\n')
			
		else:
			seq += str(a)[2:len(a) - 2]
			#print(seq)
			#break
	#print(seq_name + '\n' + seq)
	return seq, seq_name


def search_motif(seq_name, sequence):
	#其中一个作者的方法
	#print(seq_name)
	'''
	for i in range(len(sequence) - 3):
		if sequence[i] != 'N':
			continue
		if sequence[i+1] == 'P':
			continue
		if sequence[i+2] not in ['S', 'T']:
			continue
		if sequence[i+3] == 'P':
			continue
		print(i, '\t')
		#yield i
		
'''
	#彩正则表达式
	pattern = re.compile('N[^P][ST][^P]')
	ite = pattern.finditer(sequence)

	if ite:
		print(seq_name)
		#print('\n')
		for s in ite:
			print(s.start(), end="")
			print('\t', end="")
			#print(s.end(), end="")
			#print('\t')
	else:
		print(seq_name, 'Not Founded the motif')




inf = open('in-016-rosalind_mprt.txt')
for line in inf:
	uniprot_id = line.strip()
	seq1, seq_name1 = fetch_fasta_file(uniprot_id)
	search_motif(seq_name1, seq1)
	break


#biopython

'''
from Bio import ExPASy
from Bio import SeqIO
handle = ExPASy.get_sprot_raw(uniprot_id)
seq_record = SeqIO.read(handle, "swiss")
out = open(uniprot_id + '.fasta','w')
fasta = SeqIO.write(seq_record, out, "fasta")
out.close()
'''




上一篇:我在biostar上面回答了一个问题Expand a DNA sequence with IUPAC code...
下一篇:HOPTOP转录组入门(一)布置运行环境
初学菜鸟,欢迎交流!刚学习了《Python生物信息学数据管理》。
https://github.com/zd200572/
http://www.zd200572.com
回复

使用道具 举报

0

主题

4

帖子

77

积分

注册会员

Rank: 2

积分
77
发表于 2017-7-11 08:54:45 | 显示全部楼层
昨天刚做了这题,我来贴一个perl代码
```
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
die "usage: script.pl <UniProt Protein Database access ID file>\n" if @ARGV == 0;
my @seq;
while(<>){
        chomp;
        my $id=$_;
        my $gene= get ("http://www.uniprot.org/uniprot/$id.fasta");
        my ($head,$seq)=split /\n/,$gene,2;
        $seq=~s/\n//g;
        print $head,"\n";
        for (my $i=0;$i<length($seq);$i++){
                my $motif=substr($seq,$i,4);
                if ($motif =~m/N[^P][ST][^P]/){
                        print "\t",$i+1;
                }
        }
        print "\n";
}
```
回复 支持 1 反对 0

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2019-11-12 20:21 , Processed in 0.036966 second(s), 26 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.