搜索
查看: 114|回复: 0

生成genefuse 的fusion.csv

[复制链接]

2

主题

2

帖子

65

积分

注册会员

Rank: 2

积分
65
发表于 2019-9-16 20:54:59 | 显示全部楼层 |阅读模式
最近分析一批panel测序数据,Panel中包含一些针对Fusion的捕获探针,使用delly等call SV的软件效果并不好。所以尝试使用genefuse这个软件。
genefuse这个软件是根据一些先验的可能形成Fusion的基因对,直接从fastq文件检测Fusion,我觉得可能会被没有先验知识的caller的性能更好一些。但是,我做的是血液肿瘤,而官方提供的先验Fusion文件中没有包括血液肿瘤中常见的Fusion基因(如RUNX1-RUNX1T1)。官方给了Fusion文件的方法也总是报错(主要是哪个脚本是julia写的,我不会……)。所以就自己动手丰衣足食了。

根据作者提供的信息以及观察fusion文件格式,感觉可以从GTF文件中直接抽取相关信息。代码如下:

[Python] 纯文本查看 复制代码
#!/usr/bin/env python3

from collections import OrderedDict
import logging
logging.basicConfig(level=logging.WARNING)


def main(gtf, genelist,mode):

    # read genelist
    with open(genelist, 'r') as li:
        target_genelist = [i.strip() for i in li]
    # initialize geneblock
    gene_block = {
        "empty": True,
        "gene": {},
        "transcript": OrderedDict()
    }
    # read lines
    with open(gtf, 'r') as g:
        for record in g:

            # the data structure looks like:
            # {
            # empty = False
            #
            # gene:"DDX41"
            #
            # uniq_gene_id = ENSGxxxxxx
            #
            # transcript(){
            #
            #   ENSTxxxxxxxx.3.xxxx.xxxx----exon {
            #       trasncription_information{}
            #       exon{}
            #       cds{}
            #   }
            # }

            if record.strip().startswith("#"):
                continue

            a = parse_oneline(record)
            if a['feature'] == "gene":
                if gene_block['empty'] == True:
                    try:
                        gene_block['gene'] = a
                        gene_block['uniq_gene_id'] = ".".join(
                            [a['attr']['gene_id'], a['attr']['gene_version']])
                        gene_block['gene_name'] = a['attr']['gene_name']
                        gene_block['empty'] = False
                    except:
                        raise Exception(
                            "The Freature line could not be inserted into geneblock")
                else:
                    """
                    something to process geneblock
                    """
                    process_gene_block(gene_block, target_genelist,mode)
                    # initialize gene_block (load new gene record)
                    gene_block['empty'] = False
                    gene_block['uniq_gene_id'] = ".".join(
                        [a['attr']['gene_id'], a['attr']['gene_version']])
                    gene_block['gene_name'] = a['attr']['gene_name']
                    gene_block['gene'] = a
                    gene_block['transcript'] = OrderedDict()
            else:
                uniq_trans_id = ".".join(
                    [a['attr']['transcript_id'], a['attr']['transcript_version']])
                if a['feature'] == 'transcript':
                    # check if the new trancript id is exits, if not insert into gene_block
                    if uniq_trans_id not in gene_block['transcript'].keys():
                        gene_block['transcript'][uniq_trans_id] = {}
                        gene_block['transcript'][uniq_trans_id]['transcript_information'] = a
                else:
                    for k in gene_block['transcript'].keys():
                        if uniq_trans_id == k:
                            general_uniq_id = uniq_trans_id + "." + \
                                a['start'] + "." + a['end'] + \
                                "---" + a['feature']
                            gene_block['transcript'][k][general_uniq_id] = a

        """
        do someting for the last gene_block
        """
        process_gene_block(gene_block, target_genelist,mode)


def process_gene_block(gene_block, target_genelist,mode):
    # build in mode: 
    # 1. all[print all transcripts,the gene name is transcript_name ] 
    # 2. longest
    # 3. first transcript 
    

    select_transcript = ['None', 0]
    genename = gene_block['gene_name'].strip()

    if genename in target_genelist:
        logging.info(
            "[notice] found {0} in GTF file, start to extract information...".format(genename))
        if mode == "longest":
            for k, v in gene_block['transcript'].items():
                if v['transcript_information']['attr']['gene_biotype'] == 'protein_coding':
                    current_transcript_length = int(
                        v['transcript_information']['end']) - int(v['transcript_information']['start'])

                    if current_transcript_length > select_transcript[1]:
                        select_transcript[0] = k
                        select_transcript[1] = current_transcript_length

            if select_transcript[0] == 'None':
                logging.warning(
                    "[Waring] do not found protein_coding transcript in {0}".format(genename))
            else:
                print_transcript(gene_block['transcript'][select_transcript[0]])
        elif mode == "all":
            for k, v in gene_block['transcript'].items():
                if v['transcript_information']['attr']['gene_biotype'] == 'protein_coding':
                    print_transcript(gene_block['transcript'][k])
        

def print_transcript(trans):
    """
            >EML4_ENST00000318522.5,chr2:42396490-42559688
            1,42396490,42396776
            2,42472645,42472827
    """
    out = []

    header_genename = trans['transcript_information']['attr']['gene_name']
    header_ENST = trans['transcript_information']['attr']['transcript_id']
    header_ENST_ver = trans['transcript_information']['attr']['transcript_version']
    header_chr = trans['transcript_information']['chr']
    header_start = trans['transcript_information']['start']
    header_end = trans['transcript_information']['end']

    out.append(">"+header_genename + "_"+header_ENST+"."+header_ENST_ver +
               ",chr"+header_chr + ":" + header_start + "-" + header_end)

    for k, i in enumerate(trans.keys()):
        if i != 'transcript_information':
            sp = i.split("---")[0].split(".")
            out.append(",".join([str(k+1), sp[2], sp[3]]))
    for j in out:
        print(j)
    print("")


def parse_oneline(line):
    r = [i.strip() for i in line.strip().split("\t")]
    chro, sourceo, featureo, st, ed, scoreo, strando, frameo, attributeo = r
    attr_dict = parse_attr(attributeo)
    tmp_dict = {}
    tmp_dict['chr'] = chro
    tmp_dict['source'] = sourceo
    tmp_dict['feature'] = featureo
    tmp_dict['start'] = st
    tmp_dict['end'] = ed
    tmp_dict['score'] = scoreo
    tmp_dict['strand'] = strando
    tmp_dict['frame'] = frameo
    tmp_dict['attr'] = attr_dict
    return(tmp_dict)


def parse_attr(attr):
    a = {i.split(" ")[0].strip(): i.split(" ")[1].rstrip(
        ";").strip('"') for i in attr.split("; ")}
    return(a)


if __name__ == "__main__":
    # config
    mode = "longest"
    gtf = "/Disk384G/biodatabase/Homo_sapiens.GRCh37.87.gtf"
    genelist = './genelist.txt'
    main(gtf, genelist,mode)


GTF格式很不方便,因为每一个行并不是独立的,一个gene行下包含多个transcript行,每个transcript 行下包含多个exon和CDS还有其他一堆乱七八糟的行。所以处理GTF文件我觉得最好是将文件解析成一个一个的gene_block。这个代码对GTF文件实现了简单的解析,按行读取,根据gene -- transcript -- [exon/CDS....] 这种结构返回一个字典对象,实现了我的目的了,其他细节没太考虑。



上一篇:生信分析云平台产品开发 - 3 生信分析流程的进化
下一篇:请问大家非模式生物如何进行基因功能注释
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2019-10-14 22:21 , Processed in 0.037755 second(s), 26 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.