搜索
查看: 5768|回复: 7

生信编程直播第10题:根据指定染色体及坐标得到位置信息

[复制链接]

633

主题

1182

帖子

4030

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
4030
发表于 2017-2-12 15:42:52 | 显示全部楼层 |阅读模式
这个题目是来源于生信编程直播学员的真实需求!
当然,我也遇到过这个需求:http://www.bio-info-trainee.com/1001.html
任意给定基因组的 chr:pos, 判断它在哪个基因上面?这个程序难吗?

基因的chr,start,end都是已知的 (这个文件需要下载,可以是UCSC,或者NCBI)

学术一点讲述这个问题:已知CNV数据在染色体上的position如chr1:2075000-2930999,怎样批量获取其对应的Gene Symbol呢(批量)

而且还可以更复杂,我不仅想知道自己指定的坐标是否在某个基因上:
如果在基因上,我还想知道在内含子还是外显子,进一步说是第几个外显子或者内含子!
如果不在基因上,我想知道我指定的坐标距离哪一个基因的TSS(转录起始位点)最近呢?

就是对指定位点或者区间做genomic features的注释!

当然,这个大把的已经造好的轮子可以做到,比如annovar,snpEFF,VEP就是做单个位点注释genomic features的,其它CHIP-seq的软件,都可以做单个区间的批量注释到genomic features!

既然有现成的轮子,我推荐大家去学习一下就好了。

我们的练习题还是简单一点好!

就是指定位置注释是否在基因上面吧!
作业:
[AppleScript] 纯文本查看 复制代码
chr7        148697841        148698941
chr7        148698942        148699029
chr7        148699911        148701053
chr7        148701109        148701307
chr7        148701354        148702694
chr7        148703100        148703520
chr7        148703831        148704175
chr7        148704484        148704734
chr7        148704857        148705937
chr7        148706271        148706671

首先,拿到上面文件里面每一个区间的序列,然后告诉我该区间是什么基因!
坐标是hg38系统的。
如果你不想写程序,那么熟练使用bedtools就可以了,bedtools这些简单的功能都可以实现。
hg38的基因的起始坐标是:http://www.biotrainee.com/jmzeng/tmp/hg38.tss
因为这个比较复杂,所以我就专门写了一个帖子来做这件事:http://www.biotrainee.com/thread-1321-1-1.html










上一篇:ID转换大全
下一篇:mac下安装mysql与后期学习
你这个问题很复杂,需要打赏,请点击 http://www.bio-info-trainee.com/donate 进行打赏,谢谢
回复

使用道具 举报

2

主题

34

帖子

777

积分

高级会员

Rank: 4

积分
777
发表于 2017-4-11 23:09:17 | 显示全部楼层
本帖最后由 x2yline 于 2017-4-12 14:11 编辑

077-x2yline
尝试着用第五题的思路根据gtf寻找位点覆盖的基因, 当位点只覆盖一个基因时作出基因的结构图,当覆盖多个基因时,只作出基因与目标区域的位置关系。
gtf文件为:Homo_sapiens.GRCh38.87.chr.gtf
测试数据为chr1 2588000-2590000

代码为:
[Python] 纯文本查看 复制代码
import os
import re

def get_chr_info(file, chromosome):
    with open(file, 'r') as f:
        head = ''
        chr_info = ''
        get = 0
        while(True):
            buffer = head + f.read(2048*2048)
            print(buffer[buffer.find('\n')+1:buffer.find('\n')+3],end='\r')
            head = buffer[buffer.rfind('\n'):]
            buffer_handle = buffer[:buffer.rfind('\n')]
            if buffer == head:
                buffer_handle = head
            if ('\n'+ str(chromosome)+'\t') in buffer_handle:
                get = 1
                chr_info += buffer_handle[buffer_handle.find(str(chromosome)+'\t'):]
            else:
                if get == 1:
                    break
    return(chr_info)


def find_target_data(gene_name, chr_info):
        target_data = {} 
        header = ['chr','db','record','start','end','tmp','strand','tmp','info']
        buffer_list = chr_info.split('\n')
        for line in buffer_list:
            if  ('gene_name "'+ gene_name.upper() + '"') in line:
                line_list = line.split('\t')
                for i in range(9):
                    try:
[i]                        target_data[header].append(line_list)
                    except:
                        target_data[header] = [line_list]

        if not target_data:
            print('\n\n There is some wrong with your gene name!\n')
            raise NameError('your gene_name is not exit')
        print ("\nHave got the gene information!")
        return(target_data)
        
def draw_gene_structure(gene_name,target_data, pos_start, pos_end, png_path='',line_width=5):
    gene_symbol = gene_name.upper()
    if not png_path:
        png_path = gene_symbol+'.png'
    #定义颜色的字典
    tmp_colors = ['lime','red', 'blue', 'yellow','yellow','w','lightgray']
    names_tmp_colors = [ 'gene','CDS','exon','three_prime_utr','five_prime_utr','stop_codon']
    colors_legend_name = ['gene','CDS_exon', 'non_CDS_exon', 'UTR_exon',str(pos_start)+'-'+str(pos_end)]
    color_dict = dict(zip(names_tmp_colors,tmp_colors))
    # 提取转录本名称
    import re
    transcript_list = []
    for i in target_data['info']:
        try:
            transcript_name = re.findall('transcript_name "(.*?)"',i)[0]
            if transcript_name not in transcript_list:
                transcript_list.append(transcript_name)
        except:
            pass
    # 计算转录本数目
    transcript_num = 0
    for i in target_data['record']:
        if i =='transcript':
           transcript_num += 1
 
    import numpy as np
    import matplotlib.patches as mpatches
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches
    import matplotlib.lines as lines
    import matplotlib.transforms as transforms
    fig = plt.figure(1)
    # 设置不透明度,默认为1
    fig.patch.set_alpha(1)
    fig.patch.set_facecolor('w')
    num = 0#当前转录本数目标志
    warnings = []
    for i in range(len(target_data['record'])):
        if target_data['record'] == 'gene':
            #判断正反链
            if target_data['strand'] == '+':
                arr = '->'
            else:
                arr = '<-'
            #图的第一个区域
            # add_axes 是在一张图上指定特定区域作图,第一个数字为从左边%20处,下面20%处开始,宽50%,高60%区域作图
            ax = fig.add_axes([0.2,0.2,0.5,0.6])
            # 定义基因方向箭头
            arrow= mpatches.FancyArrowPatch(
            (int(target_data['start']), 0.1),
            (int(target_data['end']), 0.1),
            arrowstyle= arr,
            mutation_scale=25, lw=1, color='lime',antialiased=True)#antialiased默认为True,边缘平滑处理
            # 画箭头
            ax.add_patch(arrow)
            # 坐标轴标签
            ax.set_xlim([int(target_data['start']), int(target_data['end'])])
            ax.set_ylim([-0.5, transcript_num+1])
            ax.set_xticks(np.linspace(int(target_data['start']),int(target_data['end']),5))
            ax.set_yticks([0.1]+list(range(1,transcript_num +1)))
            ax.set_yticklabels(['gene']+transcript_list)
            ax.set_xticklabels([int(i) for i in np.linspace(int(target_data['start']),int(target_data['end']),5)])
            # 坐标轴显示
            ax.spines['top'].set_visible(False)
            ax.spines['left'].set_visible(False)
            ax.spines['right'].set_visible(False)
            ax.get_xaxis().tick_bottom()
            ax.get_yaxis().tick_left()
            ax.get_xaxis().set_tick_params(direction='out')
            ax.tick_params(axis=u'y', which=u'both',length=0)
            # 坐标轴字体大小
            for tick in ax.xaxis.get_major_ticks():
                tick.label.set_fontsize(6) 
            for tick in ax.yaxis.get_major_ticks():
                tick.label.set_fontsize(6) 
        elif target_data['record'] == 'transcript':
            num += 1 # 转录本所有区域计数作图
            line1 = [(int(target_data['start']),num), (int(target_data['end']),num)]
            (line1_xs, line1_ys) = zip(*line1)
            ax.add_line(lines.Line2D(line1_xs, line1_ys, linewidth=0.2,
                                     solid_capstyle = 'butt',solid_joinstyle='miter',
                                     antialiased=False,color='black'))
        elif target_data['record'] in color_dict.keys():
            # 添加结构图
            line2 = [(int(target_data['start'])-0.5,num), (int(target_data['end'])+0.5,num)]
            (line2_xs, line2_ys) = zip(*line2)
            ax.add_line(lines.Line2D(line2_xs, line2_ys, 
                                solid_capstyle = 'butt',solid_joinstyle='miter',
                                linewidth=int(line_width), alpha = 1,
                                color=color_dict[target_data['record']],
                                antialiased=False))
        else:
            warnings.append(target_data['record'])
    trans = transforms.blended_transform_factory(
        ax.transData, ax.transAxes)
    rect = patches.Rectangle((pos_start,0), width=pos_end-pos_start, height=1,
                              transform=trans, color='gray', alpha=0.2)
    ax.add_patch(rect)
    if warnings:
        print('\nTips: ')
        print(' and '.join([i for i in set(warnings)]) + ' is not in our consideration!!!!!!')
 
    
    # 做图例
    # add_axes 是在一张图上指定特定区域作图,第一个数字为从左边%74处,下面20%处开始,宽20%,高60%区域作图
    ax_legend = fig.add_axes([0.76,0.2,0.2,0.6])
    #ax_legend.set_xticks([])
    #ax_legend.set_yticks([])
    tmp_colors = ['lime','red', 'blue', 'yellow','lightgray','w','lightgray']
    names_tmp_colors = [ 'gene','CDS','exon','three_prime_utr','five_prime_utr','stop_codon']
    colors_legend_name = ['gene','CDS_exon', 'non_CDS_exon', 'UTR_exon',str(pos_start)+'-'+str(pos_end)]
    color_dict = dict(zip(names_tmp_colors,tmp_colors))
    for i in range(len(colors_legend_name)):
        line3 = [(0, (9-i)*0.1),(0.1, (9-i)*0.1)]
        (line3_xs, line3_ys) = zip(*line3)
        ax_legend.add_line(lines.Line2D(line3_xs, line3_ys, linewidth=5,
                                color=color_dict[names_tmp_colors],
                                solid_capstyle = 'butt',solid_joinstyle='miter',
                                antialiased=False))
        ax_legend.text(0.2, (8.9-i)*0.1,colors_legend_name , fontsize=6)
    ax_legend.set_axis_off()
    # 加标题
    fig.suptitle('\n\n\nchr' + str(target_data['chr'][0])+ ': ' + gene_symbol, fontsize=10)
    # 保存图片
    fig.savefig(png_path, dpi=150)
    plt.show()
    print('\nThe picture file is completed: ' + png_path)
    print("All transcripts of " + gene_name + ':\n' + " ".join(sorted(transcript_list)))
    return(png_path)

def find_overlap_gene_list(chr_info, chromosome, pos_start, pos_end):
    gene_list = re.findall(str(chromosome)+'\t.*?\t(.*?)\t(.*?)\t(.*?)\t.*?\t(.*?)\t.*?gene_name "(.*?)"',chr_info)
    target_list = []
    target_gene_list = []
    for i in gene_list:
        if (int(i[1])>=pos_start and int(i[1]) <= pos_end) or (int(i[2])>=pos_start and int(i[2])<=pos_end) or (int(i[1])<=pos_start and int(i[2])>=pos_end):
            target_list.append(i)
            target_gene_list.append(i[-1])
    target_gene_list = list(set(target_gene_list))
    if len(target_gene_list) == 0:
        print('\nThese is no genes in the positon\n')
    return(target_gene_list)


def draw_pos_gene(overlap_genes, chr_info, pos_start, pos_end, chromosome):
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches
    import re
    import numpy as np
    import matplotlib.patches as patches
    import matplotlib.transforms as transforms
    if len(overlap_genes) > 0:
        gene_info_list = []
        for i in overlap_genes:
            print(i)
            gene_i = re.search(str(chromosome)+'\t.*?\tgene\t(.*?)\t(.*?)\t.*?\t(.*?)\t.*?gene_name "'+i+'";',chr_info)
            gene_info_list.append(chr_info[slice(gene_i.span()[0], gene_i.span()[1])].split('\t')[3:5]+chr_info[slice(gene_i.span()[0], gene_i.span()[1])].split('\t')[6:7]+)
        max_x = max([int(i[1]) for i in gene_info_list]+[int(i[0]) for i in gene_info_list]+[pos_start, pos_end])
        min_x = min([int(i[1]) for i in gene_info_list]+[int(i[0]) for i in gene_info_list]+[pos_start, pos_end])
        fig = plt.figure(1)
        fig.patch.set_alpha(1)
        fig.patch.set_facecolor('w')
        ax = fig.add_axes([0.2,0.2,0.6,0.6])
        y = 0
        ax.set_ylim([0, (len(overlap_genes)+1)*0.1])
        ax.spines['top'].set_visible(False)
        ax.spines['left'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.get_yaxis().set_visible(False)
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        ax.set_xlim([min_x,max_x])
        ax.set_xticks([int(k) for k in np.linspace(min_x,max_x, 4)])
        ax.set_xticklabels([int(k) for k in np.linspace(min_x,max_x, 4)])
        ax.get_xaxis().set_tick_params(direction='out')
        ax.tick_params(axis=u'y', which=u'both',length=0)
        trans = transforms.blended_transform_factory(
        ax.transData, ax.transAxes)
        rect = patches.Rectangle((pos_start,0), width=pos_end-pos_start, height=1,
                              transform=trans, color='yellow', alpha=0.5)
        ax.add_patch(rect)
        ax.tick_params(axis='x', colors='gray')
        ax.spines['bottom'].set_color('gray')
        for i in gene_info_list:
            if i[2] == '+':
                arr = '->'
                text_x = i[0]
                ha = 'right'
            else:
                arr = '<-'
                text_x = i[1]
                ha = 'left'
                
            y += 0.1
            # define arrow for genes
            arrow= mpatches.FancyArrowPatch(
                (int(i[0]), y),
                (int(i[1]), y),
                arrowstyle= arr,
                mutation_scale=25, lw=1, color='blue',antialiased=True, alpha=0.6)
            ax.add_patch(arrow)
            ax.text(text_x,y, i[-1] ,style='italic', ha=ha, va='center', color='blue', fontsize=200*6/(200+len(overlap_genes)*0.1))

    fig.suptitle('\n\n\nchr' + str(chromosome)+ ': ' + str(pos_start) +'-'+str(pos_end), fontsize=10, color='red')
    png_path = str(pos_start)+'-'+str(pos_end)+'.png'
    fig.savefig(png_path, dpi=150)
    return(png_path)








def main(chromosome, pos_start, pos_end):
    os.chdir(r'E:\r\biotrainee_demo\class10')
    file = 'Homo_sapiens.GRCh38.87.chr.gtf'
    chr_info = get_chr_info(file, chromosome)
    overlap_genes = find_overlap_gene_list(chr_info, chromosome, pos_start, pos_end)
    if len(overlap_genes) == 1:
        for i in overlap_genes:
            draw_gene_structure(i,find_target_data(i, chr_info),pos_start,pos_end)
    else:
        draw_pos_gene(overlap_genes, chr_info, int(pos_start), int(pos_end), int(chromosome))
    print('\nThere are %d genes in the position\n'%len(overlap_genes))

if __name__ == '__main__':
    chromosome = input("Please enter the Chromsome number:\n").strip()
    pos_start = input("Please enter the start position:\n").strip()
    pos_end = input("Please enter the stop position:\n").strip()
    if not chromosome:
        chromosome = 1
    if not pos_start:
        pos_start = 2075000
    if not pos_end:
        pos_end = 2930999
    main(int(chromosome), int(pos_start), int(pos_end))

位点为题目中的chr1:2075000-2930999时得出的图为:

回复 支持 2 反对 0

使用道具 举报

0

主题

5

帖子

105

积分

注册会员

Rank: 2

积分
105
发表于 2017-4-24 21:05:10 | 显示全部楼层
本帖最后由 王土土哈 于 2017-4-24 21:08 编辑

[Bash shell] 纯文本查看 复制代码
awk 'BEGIN{FS="\t";OFS="\t"}{if($3>0 && $4>0)print $2,$3,$4,$1,$5}' hg38.tss|bedtools sort -i - |bedtools closest -a input.bed -b - -D "b"


结果:
chr7        148697841        148698941        chr7        148696841        148700841        NM_003592        0        0
chr7        148698942        148699029        chr7        148696841        148700841        NM_003592        0        0
chr7        148699911        148701053        chr7        148696841        148700841        NM_003592        0        0
chr7        148701109        148701307        chr7        148696841        148700841        NM_003592        0        -269
chr7        148701354        148702694        chr7        148696841        148700841        NM_003592        0        -514
chr7        148703100        148703520        chr7        148696841        148700841        NM_003592        0        -2260
chr7        148703831        148704175        chr7        148696841        148700841        NM_003592        0        -2991
chr7        148704484        148704734        chr7        148696841        148700841        NM_003592        0        -3644
chr7        148704857        148705937        chr7        148696841        148700841        NM_003592        0        -4017
chr7        148706271        148706671        chr7        148696841        148700841        NM_003592        0        -5431

前三个坐标找到的是同一个基因,后面几个都没有找到。前三列是输入坐标的信息,第四到第六列是距离最近的基因的坐标信息,第七列是基因名称,最后一列是坐标与最近基因的距离
回复 支持 反对

使用道具 举报

29

主题

131

帖子

1208

积分

金牌会员

Rank: 6Rank: 6

积分
1208
发表于 2017-4-27 23:58:46 | 显示全部楼层
034-perl-shell

简单测试了下,输入chr1 27371  输出 0 27370(index和起始位点)

代码如下:
[Perl] 纯文本查看 复制代码
#!/usr/bin/perl -w
use strict;

#my ($chr, $pos) = @ARGV;
my $chr = "chr1";
my $pos = 27371;

my %hash;
open my $fh, "hg38.tss.txt" or die "Cannot open txt";
while(<$fh>){
	my @array = split /\t/, $_;
	$hash{$array[1]} -> {$array[2]} = [($array[0], $array[3])];
}
close $fh;

my %all_pos;
map{
	$all_pos{$_} = [sort {$a <=> $b} keys %{$hash{$_}}];
}keys %hash;

my $max_all = $#{$all_pos{$chr}};
my $min_all = 0;

my $near = &search($all_pos{$chr},$pos,$max_all,$min_all);
my $near_pos = $all_pos{$chr} -> [$near];
print "$near\t$near_pos\n";

sub search{
	my ($tmp,$in,$max,$min) = @_;
	if ($in < $tmp -> [$min]){
		if ($in < ($tmp -> [$min-1] + $tmp -> [$min])/2){
			return $min-1;
		}else{
			return $min;
		}
	}elsif($in > $tmp -> [$max]){
		if ($in < ($tmp -> [$max] + $tmp -> [$max+1])/2){
			return $max;
		}else{
			return $max + 1;
		}
	}
	
	my $mid = int(($min+$max)/2);
	
	if ($tmp -> [$mid] == $in){
		return $mid;
	}elsif($tmp -> [$mid] < $in){
		&search($tmp, $in, $max, $mid+1);
	}else{
		&search($tmp, $in, $mid-1, $min);
	}
}
回复 支持 反对

使用道具 举报

0

主题

4

帖子

89

积分

注册会员

Rank: 2

积分
89
QQ
发表于 2017-4-30 21:48:39 | 显示全部楼层
本帖最后由 xzmcxjb 于 2017-5-8 15:37 编辑

这一题首先要将TSS文件用制表符排列好,然后由于文件较大,可能发生内存读不进来而报错,我选择了前一千个基因片段进行分析。
[Python] 纯文本查看 复制代码
#!/user/bin/env python
# coding=utf-8

import sys#引入sys模块
args = sys.argv#调用命令行参数

class Genome_info:#创建类Genome_info
    def __init__(self):
        self.chr = ""
        self.start = 0
        self.end = 0#初始化属性

class Gene(Genome_info):#创建父类Genome_info之下的类Gene
    def __init__(self):
        Genome_info.__init__(self)
        self.orientation = ""
        self.id = ""#初始化属性

def main(args):
    list_chr = {}#定义染色体列表
    with open(args[1]) as fp_gene:#导入参数1,即TSS.txt
        for line in fp_gene:
            if line.startswith("#"):#如果某行以#开头则越过
                continue

            lines = line.strip("\n").split("\t")#每行去除换行,以制表符分割
            id = lines[0]#第一栏为基因id
            chr = lines[1]#第二栏为染色体号
            start = int(lines[2])#第三栏转为整数型
            end = int(lines[3])#第四栏转为整数型
            orientation = lines[4]#第五栏为基因方向

            if not chr in list_chr:#如果染色体号在列表里不存在就初始化一下
                list_chr[chr] = {}

            gene = Gene()#初始化基因
            gene.chr= chr#初始化染色体
            gene.start = start#初始化基因起始位点
            gene.end = end#初始化基因结束位点
            gene.id = id#初始化ID
            gene.orientation = orientation#初始化基因方向
            list_chr[chr][id] = gene#将基因键、值存入list_chr字典

    with open(args[2]) as fp_pos:#导入参数2,即pos.txt
        for line in fp_pos:
            gene_list = []#初始化gene_list
            lines = line.strip('\n').split('\t')#每行去除换行,用制表符分割
            (chr, start, end) = (lines[0], int(lines[1]), int(lines[2]))#取出染色体号,起始坐标和结束坐标,后两者均转为整数型
            for gene_id, gene in list_chr[chr].items():#判断pos.txt中基因位置与TSS.txt中是否有重叠
                if gene.start <= start <= gene.end or gene.start <= end <= gene.end or start <= gene.start <= end or start <= gene.end <= end:
                    gene_list.append(gene.id)#如有则将基因ID添加至列表gene_list
            print gene_list#输出gene_list

if __name__ == '__main__':#直接运行当前代码
    main(args)

运行结果:
['NM_001318199', 'NR_049893', 'NM_207163', 'NM_030570', 'NM_007155', 'NR_015381', 'NM_006989', 'NM_000168', 'NM_001293079', 'NR_104148', 'NM_001128854', 'NM_001287055', 'NM_001287054']
[]
[]
[]
[]
[]
[]
[]
[]
[]
['NR_046018', 'NR_024540']


回复 支持 反对

使用道具 举报

64

主题

138

帖子

681

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
681
发表于 2017-5-4 12:21:06 | 显示全部楼层
王土土哈 发表于 2017-4-24 21:05
[mw_shl_code=bash,true]awk 'BEGIN{FS="\t";OFS="\t"}{if($3>0 && $4>0)print $2,$3,$4,$1,$5}' hg38.tss| ...

注意文件的间隔应该事tab(制表符)
awk 'BEGIN{FS="\t";OFS="\t"}(指定分隔符为制表符){if($3>0 && $4>0)print $2,$3,$4,$1,$5}'#行匹配,
hg38.tss|bedtools sort -i - |bedtools closest -a input.bed -b - -D "b"
awk 'BEGIN{FS="\t";OFS="\t"}{if($3>0 && $4>0)print $2,$3,$4,$1,$5}’hg38.tss匹配行
$1 是传递传递给shell脚本的第1个参数,如果第3和第4个参数(起始点和终止点)大于0,则输出1,2,3,4,5个参数(输出hg38.tss)
“|”则表示是一个管道的,可以理解为一个管道从一边流向另外一边。
bedtools sort -i -对文件进行排序(染色体为标准)
closest表示找到两个文件的重叠部分,-D "b"表示以b文件为参考。

我这样的解释对不对? -i-后面的-是代表文件hg38.tss?“-b-”也是相同的用法么?可以回答“-“的用法么?谢谢。
最后一个“b”双引号什么用法?
回复 支持 反对

使用道具 举报

0

主题

16

帖子

145

积分

注册会员

Rank: 2

积分
145
发表于 2017-5-15 15:57:30 | 显示全部楼层
本帖最后由 azazelcc 于 2017-5-15 15:59 编辑

补作业3,脑子有点不清楚,感觉写很一般。

主要思路都是照抄dongye老师,建立双层字典进行查询,底层value用一个类来初始化,分别存储更多的数据。


[AppleScript] 纯文本查看 复制代码
#!/usr/bin/env python
# encoding=utf-8
import sys
args = sys.argv




class Gene:
    def __init__(self):
        self.chr = ''
        self.start = 0
        self.end = 0
        self.orientation = 0
        self.id = ''


def main(args):


    chr_list = {}
    fi = open(args[1], 'r')
    for line in fi:
        lines = line.strip('\n').split()

        id = lines[0]
        chrs = lines[1]
        start = lines[2]
        end = lines[3]
        orientation = lines[4]

        if not chrs in chr_list:
            chr_list[chrs] = {}

        gene = Gene()
        gene.start = int(start)
        gene.end = int(end)
        gene.chr = chrs
        gene.orientation = int(orientation)
        gene.id = id
        chr_list[chrs][id] = gene

    fi2 = open(args[2], 'r')
    gene_list = []
    for line in fi2:
        lines = line.strip('\n').split()
        chrs = lines[0]
        start = int(lines[1])
        end = int(lines[2])
        for id, gene in chr_list[chrs].items():
            if (gene.start <= start <= gene.end) or (gene.start <= end <= gene.end) or (start <= gene.start <= end) or (
                            start <= gene.end <= end):
                gene_list.append(id)
    print(gene_list, sep='\n')


if __name__ == '__main__':
    main(args)

回复 支持 反对

使用道具 举报

0

主题

15

帖子

151

积分

注册会员

Rank: 2

积分
151
发表于 2017-6-6 12:01:15 | 显示全部楼层
等了一个周,终于做出来这道题了,原来下的那个hg19文件有问题,困惑了我好久



代码如下:

[Perl] 纯文本查看 复制代码
#!/usr/bin/perl

use strict;
use warnings;

open FH,$ARGV[0] or die $!;
my %hash;

while(<FH>){
	chomp;
	next if /^#/;
	my @info = split /\t/,$_;
	if ($info[1] eq "chr1"){
	$hash{$info[0]}{'start'} = $info[2];
	$hash{$info[0]}{'end'} = $info[3];
	}else{
		next;
	}
}
close FH;

my $chr = "chr1";
my $start = 2075000;
my $end = 2930999;

foreach my $k(sort keys %hash){
	my $s = $hash{$k}{'start'};
	my $e = $hash{$k}{'end'};
	if ( $s >= $start && $e <= $end ){
		print "chr1\t$k\t$s\t$e\n";
	}else{
		next;
	}
}

#Gene   Chr     start   end     -
#NR_046018       chr1    9874    13874   0
#NR_024540       chr1    27370   31370   1
#NR_104148       chr7    64664083        64668083        0
#NR_111960       chrX    44871175        44875175        0
#NR_028458       chr14   92104621        92108621        1
#NR_028459       chr14   92104621        92108621        1
#NR_026818       chr1    34081   38081   1
#NR_026820       chr1    34081   38081   1




结果示意:



共筛选出38个 ,部分展示如图

共筛选出38个 ,部分展示如图
回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2019-11-21 17:45 , Processed in 0.044098 second(s), 29 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.