
Machine Learning & Deep Learning Project in Practice (Part 3)

Posted 4 days ago
3. Defining the functions

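These are the functions defined by the project's author. get_threshold_metrics computes the ROC and precision-recall curves together with their summary scores (AUROC and AUPR); integrate_copy_number merges the processed copy number variation events into the mutation status matrix; shuffle_columns shuffles (permutes) gene values.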
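To make concrete what get_threshold_metrics reports, here is a minimal sketch of an ROC curve computed by hand with NumPy. The helper names roc_points and auroc_from_points are hypothetical and not part of the project, the labels and scores are toy values, and the sketch assumes there are no tied scores. The project itself relies on scikit-learn's roc_curve and roc_auc_score.

```python
import numpy as np

def roc_points(y_true, y_score):
    """Sweep the threshold from high to low; return (fpr, tpr) arrays.

    Hypothetical helper for illustration; assumes no tied scores.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Sort samples from the highest score to the lowest
    order = np.argsort(-y_score, kind="stable")
    y_sorted = y_true[order]
    # Running counts of true and false positives as the threshold drops
    tps = np.cumsum(y_sorted == 1)
    fps = np.cumsum(y_sorted == 0)
    # Prepend the (0, 0) point, then normalise to rates
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def auroc_from_points(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy data: 1 = mutated, 0 = wild type
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_score)
print(auroc_from_points(fpr, tpr))  # 0.75
```

Each step along the sorted scores moves the curve up (a true positive) or to the right (a false positive); the AUROC is the area under the resulting staircase.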

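A word on why the genes are perturbed. In an expression matrix, rows are genes and columns are samples. In routine R-package analyses of TCGA or GEO data we never care where a gene or a sample sits in the matrix, as long as each one has a unique identifier. Machine learning is less forgiving: for some models the arrangement of the features can change the result, and tree models are the usual example. Note that shuffle_columns itself does not reorder the features. It permutes the values inside whichever vector `apply` hands it, which breaks the link between those values and the sample labels; a model fitted to data shuffled this way can serve as a baseline for comparison.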
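A minimal sketch of how shuffle_columns might be applied. It assumes a samples-by-genes DataFrame; the toy matrix expr and its values are made up, and the exact `apply` call used in the project is not shown in this post. `result_type="broadcast"` is chosen here so that the original index and column labels are kept.

```python
import numpy as np
import pandas as pd

def shuffle_columns(gene):
    # Same body as the function defined in this post
    return np.random.permutation(gene.tolist())

np.random.seed(0)  # fixed seed so the sketch is repeatable

# Toy expression matrix: rows are samples, columns are genes (assumption)
expr = pd.DataFrame(
    {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [10.0, 20.0, 30.0, 40.0]},
    index=["s1", "s2", "s3", "s4"],
)

# axis=0 passes one gene column at a time, so each gene's values are
# permuted across samples independently of the other genes
shuffled = expr.apply(shuffle_columns, axis=0, result_type="broadcast")
print(shuffled)
```

Every gene keeps exactly the same set of values, only assigned to different samples, so per-gene distributions are preserved while any relationship with the labels is destroyed.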

### Define the functions
# Compute ROC and precision-recall metrics
def get_threshold_metrics(y_true, y_pred, drop_intermediate=False,
                          disease='all'):
    """
    Retrieve true/false positive rates and auroc/aupr for class predictions

    Arguments:
    y_true - an array of gold standard mutation status
    y_pred - an array of predicted mutation status
    disease - a string that includes the corresponding TCGA study acronym

    Output:
    dict of AUROC, AUPR, pandas dataframes of ROC and PR data, and cancer-type
    """
    import pandas as pd
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.metrics import precision_recall_curve, average_precision_score

    roc_columns = ['fpr', 'tpr', 'threshold']
    pr_columns = ['precision', 'recall', 'threshold']

    # zip pairs each column name with the matching array returned by
    # roc_curve, which is a tuple of (fpr, tpr, thresholds).
    # In Python 3, zip returns a lazy iterator; it is turned into a dict
    # below before being handed to pandas.
    # The caller's drop_intermediate flag is forwarded to roc_curve.
    roc_items = zip(roc_columns,
                    roc_curve(y_true, y_pred,
                              drop_intermediate=drop_intermediate))
        
    roc_df = pd.DataFrame.from_dict(dict(roc_items))
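    prec, rec, thresh = precision_recall_curve(y_true, y_pred)  # precision-recall curve
    # prec: precision, rec: recall, thresh: decision thresholds
    # thresh has one element fewer than prec and rec (the final point,
    # precision=1 and recall=0, has no threshold), so it is appended in a
    # separate step: concat aligns on the index and leaves NaN in the last row.
    pr_df = pd.DataFrame.from_records([prec, rec]).T
    pr_df = pd.concat([pr_df, pd.Series(thresh)], ignore_index=True, axis=1)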
    pr_df.columns = pr_columns

    auroc = roc_auc_score(y_true, y_pred, average='weighted')  # area under the ROC curve
    aupr = average_precision_score(y_true, y_pred, average='weighted')  # average precision (PR summary)

    return {'auroc': auroc, 'aupr': aupr, 'roc_df': roc_df,
            'pr_df': pr_df, 'disease': disease}

# Integrate copy number variation information
def integrate_copy_number(y, cancer_genes_df, genes, loss_df, gain_df,
                          include_mutation=True):
    """
    Function to integrate copy number data to define gene activation or gene
    inactivation events. Copy number loss results in gene inactivation events
    and is important for tumor suppressor genes while copy number gain results
    in gene activation events and is important for oncogenes.

    Arguments:
    y - pandas dataframe samples by genes where a 1 indicates event
    cancer_genes_df - a dataframe listing bona fide cancer genes as defined by
                      the 20/20 rule in Vogelstein et al. 2013
    genes - the input list of genes to build the classifier for
    loss_df - a sample by gene dataframe listing copy number loss events
    gain_df - a sample by gene dataframe listing copy number gain events
    include_mutation - boolean to decide to include mutation status
    """

    # Find if the input genes are in this master list
    # Keep only the rows for the requested genes
    genes_sub = cancer_genes_df[cancer_genes_df['Gene Symbol'].isin(genes)]

    # Add status to the Y matrix depending on if the gene is a tumor suppressor
    # or an oncogene. An oncogene can be activated with copy number gains, but
    # a tumor suppressor is inactivated with copy number loss
    # Split the genes into tumor suppressors (TSG) and oncogenes
    tumor_suppressor = genes_sub[genes_sub['Classification*'] == 'TSG']
    oncogene = genes_sub[genes_sub['Classification*'] == 'Oncogene']

    copy_loss_sub = loss_df[tumor_suppressor['Gene Symbol']]
    copy_gain_sub = gain_df[oncogene['Gene Symbol']]

    # Append to column names for visualization
    # Tag each column with its event type (_loss or _gain)
    copy_loss_sub.columns = [col + '_loss' for col in copy_loss_sub.columns]
    copy_gain_sub.columns = [col + '_gain' for col in copy_gain_sub.columns]

    # Add columns to y matrix
    y = y.join(copy_loss_sub)
    y = y.join(copy_gain_sub)

    # Fill missing data with zero (measured mutation but not copy number)
    y = y.fillna(0)
    y = y.astype(int)
   
    # Optionally drop the mutation columns, keeping only copy number events
    if not include_mutation:
        y = y.drop(genes, axis=1)
    return y

# Shuffle (permute) gene values
def shuffle_columns(gene):
    """
    To be used in an `apply` pandas func to shuffle columns around a dataframe
    Import only
    """
    import numpy as np
    return np.random.permutation(gene.tolist())
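
The two-step construction of pr_df in get_threshold_metrics can look redundant, so here is a small sketch of what it does. The three arrays are hand-written stand-ins shaped like the output of scikit-learn's precision_recall_curve, where precision and recall carry one more element than the thresholds; they are not results from the project's data.

```python
import numpy as np
import pandas as pd

# Hand-written stand-ins for precision_recall_curve output (assumption):
# three precision/recall points but only two thresholds
prec = np.array([0.5, 1.0, 1.0])
rec = np.array([1.0, 1.0, 0.0])
thresh = np.array([0.4, 0.8])

# Step 1: stack the two equal-length arrays and transpose to one row per point
pr_df = pd.DataFrame.from_records([prec, rec]).T

# Step 2: append the shorter array; concat aligns on the row index,
# so the last row gets NaN in the threshold column
pr_df = pd.concat([pr_df, pd.Series(thresh)], ignore_index=True, axis=1)
pr_df.columns = ["precision", "recall", "threshold"]
print(pr_df)
```

The last row, which has no threshold of its own, ends up with NaN in the threshold column while the precision and recall values stay intact.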
