搜索
查看: 581|回复: 6

指定基因的某个变异查找它在基因组的坐标

[复制链接]

518

主题

910

帖子

3065

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
3065
发表于 2016-11-11 17:48:07 | 显示全部楼层 |阅读模式
目前虽然有6种基因的变异表现形式:
"c." for a coding DNA sequence (like  c.76A>T)
"g." for a genomic sequence (like g.476A>T)
"m." for a mitochondrial sequence (like m.8993T>C, see Reference Sequence)
"n" for a non-coding RNA reference sequence (gene producing an RNA transcript but not a protein, see Community consultation 002)
"r." for an RNA sequence (like r.76a>u)
"p." for a protein sequence (like  p.Lys76Asn)

参考: http://www.hgvs.org/mutnomen/recs.html
但是我们常见的就是c和p的表现形式!
如果我给你一个 BRAF.p.V600E:或者BRAF.c.1799T>A,你能通过自己下载数据,解析,然后写脚本告诉这个变异发生在基因组的哪个坐标吗?hg19或者hg38均给出吧!或者SMARCC1:p.R1064K  等等!


你需要知道各个基因在基因组的坐标,然后需要知道每个基因有多少个转录本,多少个蛋白,需要知道每个转录本的外显子排列方式,总体来说,是需要理解gtf文件!
加油吧,少年!



上一篇:R语言绘图热图
下一篇:R绘图之k-means聚类示例图
回复

使用道具 举报

10

主题

50

帖子

217

积分

中级会员

Rank: 3Rank: 3

积分
217
QQ
发表于 2016-12-12 13:30:05 | 显示全部楼层
本帖最后由 惠真-市二医院 于 2016-12-12 13:31 编辑

贴我的答案了,写的很乱,我之前没考虑到基因所在的正负链问题,现在考虑到了!同时打印出突变所在cds的前后各25bp!
[AppleScript] 纯文本查看 复制代码
#!/home/huizhen/bin/perl/bin
      2
      3 use strict;
      4 use warnings;
      5 #use Smart::Comments;
      6 #use List::Util qw(sum max min);
      7
      8 die"perl $0 <gene> <snp_ccds_position>:$!" unless @ARGV == 2;
      9 my $gene = shift @ARGV;
     10 my $pos = shift @ARGV;
     11
     12 open GTF,"<","/share/Data01/huizhen/reference/NCBI/Homo_sapiens.GRCh38.86.gtf" or die "cant open:$!";
     13 if(-e "$gene.info.txt"){
     14                 die "The output file exists:$!";
     15 }else{
     16                 open OUT,">","$gene.info.txt";
     17 }
     18 my %cds;
     19 my $strand;
     20 my ($chr_of_snp,$exon_of_snp,$pos_of_snp);
     21 my (@gene,@trans,@exon,@cds,@start_end,@utr);
     22 push @gene,"The info of gene\n";
     23 push @trans,"The info of transcript\n";
     24 push @exon,"The info of exon\n";
     25 push @cds,"The info of CDS\n";
     26 push @start_end,"The info of Start_end code\n";
     27 push @utr,"The info of UTR\n";
     28 my @temp;
     29 while(<GTF>){
     30                 if(/^#/){
     31                                 print OUT $_;
     32                 }else{
     33                                 chomp;
     34                                 if(/\tgene\t.*gene_name "$gene"/){
     35                                                 my @info = split;
     36                                                 my $len = $info[4] - $info[3] + 1;
     37                                                 push @gene,"The gnen:$gene\tThe gene_id:$1\n" if ($_ =~ m/gene_id "(\w+)";/);
     38                                                 push @gene,"The seqname:$info[0]\tThe source:$info[1]\tThe feature:$info[2]\n";
     39                                                 push @gene,"The start:$info[3]\tThe end:$info[4]\tThe strand:$info[6]\tThe length:$len\n";
     40                                                 $chr_of_snp = $info[0];
     41                                                 $strand = $info[6];
     42                                 }elsif(/\ttranscript\t.*transcript_name "$gene-001";/){
     43                                                 my @info = split;
     44                                                 my $len = $info[4] - $info[3] + 1;
     45                                                 push @trans,"The transcript_id:$1\tThe transcript_name:$2\n" if($_ =~ m/transcript_id "(\w+)";.*transcript_name "(\w+\-?\w+)";/);
     46                                                 push @trans,"The feature:$info[2]\tThe start:$info[3]\tThe end:$info[4]\tThe length:$len\n";
     47                                 }elsif(/\texon\t.*transcript_name "$gene-001"/){
     48                                                 my @info = split;
     49                                                 my $len = $info[4] - $info[3] + 1;
     50                                                 push @exon,"The exon_number:$1\tThe exon_id:$2\n" if($_=~ m/exon_number "(\w+)";.*exon_id "(\w+)";/);
     51                                                 push @exon,"The feature:$info[2]\tThe start:$info[3]\tThe end:$info[4]\tThe length:$len\n";
     52                                 }elsif(/\tCDS\t.*transcript_name "$gene-001"/){
     53                                                 my @info = split;
     54                                                 my $len = $info[4] - $info[3] + 1;
     55                                                 push @cds,"The ccds_id:$2\tThe exon_number:$1\n" if($_ =~ m/exon_number "(\w+)";.*ccds_id "(\w+)";/);
     56                                                 push @cds,"The feateur:$info[2]\tThe start:$info[3]\tThe end:$info[4]\tThe frame:$info[7]\tThe length:$len\n";
     57                                                 $cds{$1} = [] if not exists $cds{$1};
     58                                                 ######$info[3] += $info[7]; ===>>> NCBI web include the frame and start code , but not the stop code
 59                                                 ######==========================>> The gtf file include the start and stop codes !
     60                                                 push @{$cds{$1}},($info[3]..$info[4]);
     61                                                 push @temp,$len;
     62                                 }elsif(/\tstart_codon\t.*transcript_name "$gene-001"/){
     63                                                 my @info = split;
     64                                                 my $len = $info[4] - $info[3] + 1;
     65                                                 push @start_end,"The transcript_id:$1\tThe exon_number:$2\n" if($_ =~ m/transcript_id "(\w+)";.*exon_number "(\w+)";/);
     66                                                 push @start_end,"The feature:$info[2]\tThe start:$info[3]\tThe end:$info[4]\tThe frame:$info[7]\tThe length:$len\n";
     67                                 }elsif(/\tstop_codon\t.*transcript_name "$gene-001"/){
     68                                                 my @info = split;
     69                                                 my $len = $info[4] - $info[3] + 1;
     70                                                 push @start_end,"The exon-number:$1\n" if($_ =~ m/exon_number "(\w+)";/);
     71                                                 push @start_end,"The feature:$info[2]\tThe start:$info[3]\tThe end:$info[4]\tThe frame:$info[7]\tThe length:$len\n";
     72                                 }elsif(/\tfive_prime_utr\t.*transcript_name "$gene-001"/){
     73                                                 my @info = split;
     74                                                 my $len = $info[4] - $info[3] + 1;
     75                                                 push @utr,"The gene_id:$1\tThe transcript_id:$2\n" if($_ =~ m/gene_id "(\w+)";.*transcript_id "(\w+)";/);
     76                                                 push @utr,"The feature:$info[2]\tThe start:$info[3]\tThe end:$info[4]\tThe length:$len\n";
     77                                 }elsif(/\tthree_prime_utr\t.*transcript_name "$gene-001"/){
     78                                                 my @info = split;
     79                                                 my $len = $info[4] - $info[3] + 1;
     80                                                 push @utr,"The feature:$info[2]\tThe start:$info[3]\tThe end:$info[4]\tThe length:$len\n";
     81                                 }
     82                 }
     83 }
     84 close GTF;
     85 my @total = (@gene,@trans,@start_end,@exon,@cds,@utr);
     86 for(@total){
     87                 print OUT $_;
     88 }
     89 close OUT;
     90
     91 my %seq;
     92 my ($position_f,$position_r)=(0,0);
     93 for my $num(sort {$a<=>$b}keys %cds){
     94                 if($strand eq "+"){
     95                                 for(@{$cds{$num}}){
     96                                     $seq{$position_f} = $_;
     97                                     $position_f ++;
     98                                 }
     99                 }else{
    100                           for(reverse @{$cds{$num}}){
    101                                                 $seq{$position_r} = $_;
    102                                                 $position_r ++;
    103                                 }
    104                 }
    105 }
    106
    107 $pos_of_snp = $seq{$pos};
    108 print"The snp location of Genome:chr$chr_of_snp\tThe strand:$strand\n";
    109
    110 for my $num(sort {$a<=>$b} keys %cds){
    111                 for(@{$cds{$num}}){
    112                                 $exon_of_snp = $num if ($pos_of_snp == $_);
    113                 }
    114 }
    115 print"The snp locaton of Exon:$exon_of_snp\n";
    116 print"The snp position of Chromosome:$pos_of_snp\n";
 117
    118 open BED,"<","/share/Data01/huizhen/reference/NCBI/bed_chr_all.bed" or die "cant open:$!";
    119 my $rs ||= "None";
    120 while(<BED>){
    121                 if(/$chr_of_snp\t\d+\t\b$pos_of_snp\b/i){
    122                     $rs = (split /\t/,$_)[3];
    123                     last;
    124                 }
    125 }
    126 close BED;
    127 print"The snp of rs:$rs\n";
    128 my $genome_chr = join "",("chr_","$chr_of_snp");
    129 open SEQ,"<","/share/Data01/huizhen/reference/NCBI/GRCh38.p7/$genome_chr" or die"cant open:$!";
    130 my $seq_of_chr;
    131 while(<SEQ>){
    132                 chomp;
    133                 $seq_of_chr .= $_ unless ($_ =~ m/^>/);
    134 }
    135 my $seq_before = substr($seq_of_chr,$pos_of_snp - 26,25);
    136 my $seq_mut = substr($seq_of_chr,$pos_of_snp,1);
    137 my $seq_after = substr($seq_of_chr,$pos_of_snp,25);
    138 print"The bases before and after the mutation:\n";
    139 if($strand eq "+"){
    140     print"$seq_before     $seq_mut     $seq_after\n";
    141 }else{
    142                 $seq_before =~ tr/atcgATCG/tagcTAGC/;
    143                 $seq_before = reverse $seq_before;
    144                 $seq_mut =~ tr/atcgATCG/tagcTAGC/;
    145                 $seq_mut = reverse $seq_mut;
    146                 $seq_after =~ tr/atcgATCG/tagcTAGC/;
    147                 $seq_after = reverse $seq_after;
    148                 print"$seq_after     $seq_mut     $seq_before\n";
    149 }


苛求远离完美
回复 支持 1 反对 0

使用道具 举报

10

主题

50

帖子

217

积分

中级会员

Rank: 3Rank: 3

积分
217
QQ
发表于 2016-11-11 18:11:06 | 显示全部楼层
楼主,这个题目真的真的很有作用!
苛求远离完美
回复 支持 反对

使用道具 举报

10

主题

50

帖子

217

积分

中级会员

Rank: 3Rank: 3

积分
217
QQ
发表于 2016-11-14 14:17:42 | 显示全部楼层
楼主,我发现有一个问题:
gtf都是ensemble的文件,这里面的基因和NCBI的基因的起始位点还是有些不同,例如:
ensembl_havana  gene    11785723        11806920    (genome-build GRCh38.p7)

NCBI GRCh38.p7 (GCF_000001405.33)        1        NC_000001.11 (11785730..11806103, complement)
我想通过这个查询来查找任意位置突变的rs号码,但是查到的位置和NCBI不一样,怎么解决额?
苛求远离完美
回复 支持 反对

使用道具 举报

0

主题

3

帖子

50

积分

注册会员

Rank: 2

积分
50
发表于 2017-1-11 16:07:33 | 显示全部楼层
惠真-市二医院 发表于 2016-12-12 13:30
贴我的答案了,写的很乱,我之前没考虑到基因所在的正负链问题,现在考虑到了!同时打印出突变所在cds的前 ...

多谢,借走一用
回复 支持 反对

使用道具 举报

0

主题

11

帖子

65

积分

注册会员

Rank: 2

积分
65
发表于 2017-1-14 13:12:31 | 显示全部楼层
寻找 BRAF.c.1799T>A,的位置。寻找其他的突变位点,可以跟这个是一样的方法。
[AppleScript] 纯文本查看 复制代码
import time
import re
start = time.clock()
lichr = []
li_gene = []
with open("F:\\Python34\\zbioinfo\\Homo_sapiens.GRCh38.87.chr.gtf","r") as f:
    for line in f:
        if line.startswith('#'):
            continue
        line = line.rstrip()
        chr_id = line.split('\t')[0]
        if chr_id not in lichr:
            lichr.append(chr_id)
        if re.search(r'"ENST00000288602"',line):       
            if  re.search(r'\tCDS\t',line):
                result =re.findall(r'\t(\d{9})\t(\d{9})',line)
                li_gene.append(result)

summ = []
for li_cds in li_gene:
    num_start = li_cds[0][0]
    num_end = li_cds[0][1]
    for id_loc in range (int(num_start),(int(num_end)+1)):
        summ.append(id_loc)
print (len(summ))
回复 支持 反对

使用道具 举报

0

主题

11

帖子

65

积分

注册会员

Rank: 2

积分
65
发表于 2017-1-14 13:13:50 | 显示全部楼层
贴少了一部分,给补上。
[AppleScript] 纯文本查看 复制代码
print (len(list(set(summ))))
summ_order =sorted(summ)
print (summ_order[-1799])
end = time.clock()
print("The function run time is : %.03f seconds" %(end-start))    
回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|关于我们|手机版|小黑屋|生信技能树    

GMT+8, 2017-4-28 01:10 , Processed in 0.034798 second(s), 33 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.