
I thought I had already introduced UCSC's cancer genome browser

Posted on 2017-3-6 15:25:45
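In essence it is no different from cBioPortal and Firehose: all of them explore TCGA level 3 data. It may be slightly more convenient to use, though of course it takes some time to get familiar with this web tool.
Old version: https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/
New version: https://xenabrowser.net/heatmap/#
For any cancer type, you can split the samples into sub-groups by any clinical variable and then run any kind of survival analysis, comparison analysis, or correlation analysis.

The original level 3 data can also be downloaded: https://xenabrowser.net/datapages/
Open each dataset and choose the data type; each one comes with a description: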


Below is the FAQ. If your English is not good, you can skip it; you would not be able to follow it anyway.
FAQ on data hosted in the UCSC public Xena Hub



1.  How can I download un-normalized data?
There are two types of normalization that we provide: on-the-fly mean normalization, and dataset normalization (varies by dataset, e.g. log2(x+1)).
If you want to download data with both types of normalization, then use the Xena visualization to create a column with the data you want to download and then use the caret menu to download it.
If you want to download data with just the dataset normalization, then download the data via our Explore Data pages.
If you want to download the data without any normalization (i.e. the raw data), then we recommend downloading the ‘raw’ data directly from the data providers. We provide a link to where we obtained the data in the Explore Data pages.
2.  What are the data values in the download file for TCGA data?
| Example filename | Values in file |
| --- | --- |
| TCGA_KIRC_exp_HiSeqV2 | log2(x+1), where x is the RSEM value |
| TCGA_KIRC_exp_HiSeqV2_PANCAN | log2(x+1) values mean-normalized per gene across all TCGA samples; only the converted values belonging to this cohort are extracted. x is the RSEM value |
| TCGA_KIRC_exp_HiSeqV2_percentile | Percentile ranking of the RSEM value per sample; values range from 0 to 100, with lower values representing lower expression |
| TCGA_KIRC_gistic2 | GISTIC2 value from Broad Firehose |
| TCGA_KIRC_gistic2thd | GISTIC2 value discretized to -2, -1, 0, 1, 2 by Broad Firehose |
| TCGA_KIRC_hMethyl27 | beta values |
| TCGA_KIRC_hMethyl450 | beta values |
| TCGA_KIRC_miRNA | log2(x+1), where x is the RPKM value |
| TCGA_KIRC_mutation | PANCAN AWG somatic mutation calls |
| TCGA_KIRC_PDMRNAseq | Pathway inference score derived using RNAseq data alone (generated at Firehose) |
| TCGA_KIRC_PDMRNAseqCNV | Pathway inference score derived using RNAseq and copy number data (generated at Firehose) |
| TCGA_KIRC_RPPA | RPPA value |
| TCGA_KIRC_RPPA_RBN | RBN-normalized RPPA value |
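
Several of the files above store log2(x+1) rather than x itself. As a minimal sketch (the function names and the numbers are illustrative and not part of Xena), the transform and its inverse could look like this. Inverting it only gets you back to x as defined above (e.g. the RSEM value), not to the data provider's raw files.

```python
import math


def to_log2p1(x):
    """Forward transform described in the FAQ: log2(x + 1)."""
    return math.log2(x + 1)


def from_log2p1(v):
    """Invert log2(x + 1) to get back to the original scale."""
    return 2 ** v - 1


# Made-up RSEM values, chosen so that the results are exact powers of two.
rsem = [0.0, 1.0, 3.0, 1023.0]
stored = [to_log2p1(x) for x in rsem]
recovered = [from_log2p1(v) for v in stored]
```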


3.  For TCGA, which gene expression RNAseq dataset should I use for my analysis?
gene expression RNAseq (IlluminaHiSeq)
For comparison within a single TCGA cohort, you can use the "gene expression RNAseq" data. Values in this dataset are log2(x+1), where x is the RSEM value.

gene expression RNAseq (IlluminaHiSeq pancan normalized)
For questions regarding the gene expression of a particular cohort in relation to other tumor types, you can use the pancan normalized version of the "gene expression RNAseq" data. Values in this dataset are generated at UCSC by combining the "gene expression RNAseq" values (above) of all TCGA cohorts, mean-centering the values per gene, and then extracting only the converted data that belong to the cohort of interest (e.g. TCGA breast cancer). Since there are 30-40 cancer types with RNAseq data, the TCGA pancan data can serve as a proxy for the background distribution of gene expression.
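
The pancan normalization described here (pool all cohorts, mean-center per gene, keep only the cohort of interest) can be sketched as follows. This is not UCSC's actual pipeline; the nested-dict layout, the function name, and the toy values are assumptions made for illustration.

```python
from statistics import mean


def pancan_normalize(cohorts, cohort_of_interest):
    """Mean-center each gene across ALL cohorts, then return one cohort.

    cohorts: {cohort name: {sample id: {gene: log2(x+1) value}}}
    (a hypothetical in-memory layout, not a Xena file format).
    """
    # Pool the expression profiles of every sample from every cohort.
    pooled = [p for samples in cohorts.values() for p in samples.values()]
    genes = list(pooled[0].keys())
    # Per-gene mean over the pooled samples.
    gene_mean = {g: mean(p[g] for p in pooled) for g in genes}
    # Subtract the pancan mean, but only emit the cohort of interest.
    return {
        sample: {g: profile[g] - gene_mean[g] for g in genes}
        for sample, profile in cohorts[cohort_of_interest].items()
    }


toy = {
    "BRCA": {"s1": {"TP53": 2.0, "EGFR": 4.0},
             "s2": {"TP53": 4.0, "EGFR": 6.0}},
    "KIRC": {"s3": {"TP53": 6.0, "EGFR": 2.0}},
}
brca = pancan_normalize(toy, "BRCA")
```

Note that the BRCA values are centered on the mean of all three samples, not on the BRCA mean, which is what makes the result comparable across tumor types.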

gene expression RNAseq (IlluminaHiSeq percentile)
For comparing with data outside TCGA, you can use the percentile version if your non-TCGA RNAseq data is normalized by percentile ranking. Values in this dataset are generated at UCSC by ranking RSEM values per sample. The values are percentile ranks ranging from 0 to 100, with lower values representing lower expression. You can also combine the TCGA RNAseq data with your RNAseq data, perform normalization across the combined dataset using whatever method you choose, then analyze the combined dataset further.
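
If you want to put your own RNAseq data on a comparable 0 to 100 scale, a per-sample percentile ranking could be sketched like this. The FAQ does not say which ranking convention UCSC uses (tie handling, exact scaling), so the formula below, 100 * (number of smaller values) / (n - 1), is only one reasonable assumption, and the gene names and values are made up.

```python
from bisect import bisect_left


def percentile_rank(sample):
    """Rank the genes of ONE sample on a 0-100 scale.

    sample: {gene: expression value}. Tied values share the same rank.
    """
    values = sorted(sample.values())
    n = len(values)
    if n < 2:
        return {gene: 0.0 for gene in sample}
    # bisect_left gives the number of values strictly smaller than v.
    return {
        gene: 100.0 * bisect_left(values, v) / (n - 1)
        for gene, v in sample.items()
    }


toy_sample = {"A": 5.0, "B": 0.0, "C": 10.0, "D": 20.0, "E": 15.0}
ranks = percentile_rank(toy_sample)
```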

TCGA Pan-Cancer gene expression
For comparison across multiple or all TCGA cohorts. Dataset is generated at UCSC by combining "gene expression RNAseq (IlluminaHiSeq) data" (see above) from all TCGA cohorts. No further normalization is performed.
4.  What is the difference between RPPA data and RPPA_RBN data? And why does the number of features vary between them?
The TCGA RPPA data are generated at MD Anderson. The RPPA data are values generated using the method described at http://bioinformatics.mdanderson.org/main/TCPA:Overview. We downloaded the RPPA values from TCGA DCC.

The RPPA_RBN data are normalized values generated using the RBN (replicate-based normalization) method developed by MDACC. For more information: http://bioinformatics.mdanderson.org/main/TCPA:Overview. We downloaded the RBN values from Synapse at https://www.synapse.org/#!Synapse:syn1750330.
5.  Can I combine data from the methylation 450k and 27k datasets?
The methylation 450k dataset has 90% of the probes from the 27k dataset. However, we have found that the range of the data differs slightly between the two datasets. As such, we recommend applying some sort of normalization, and looking in the literature to see what methods people have used.
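
Before any normalization, the two datasets have to be restricted to the probes they share. A minimal sketch of that first step, with made-up probe IDs and a hypothetical helper name (the normalization itself is deliberately left out, since the FAQ does not prescribe a method):

```python
def shared_probes(probes_27k, probes_450k):
    """Return the probes present on both platforms and the fraction
    of the 27k probes that are covered by the 450k platform."""
    set_27k = set(probes_27k)
    shared = set_27k & set(probes_450k)
    fraction = len(shared) / len(set_27k) if set_27k else 0.0
    return shared, fraction


# Toy probe IDs for illustration only.
toy_27k = ["cg01", "cg02", "cg03", "cg04"]
toy_450k = ["cg02", "cg03", "cg04", "cg05", "cg06"]
shared, fraction = shared_probes(toy_27k, toy_450k)
```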
6.  What is the normalization procedure for the methylation 450k and 27k datasets?
From the dataset details pages:

450k: DNA methylation profile was measured experimentally using the Illumina Infinium HumanMethylation450 platform. DNA methylation values, described as beta values, are recorded for each array probe in each sample via BeadStudio software. DNA methylation beta values are continuous variables between 0 and 1, representing the ratio of the intensity of the methylated bead type to the combined locus intensity. Thus higher beta values represent a higher level of DNA methylation (hypermethylation) and lower beta values represent a lower level of DNA methylation (hypomethylation). We observed a bimodal distribution of the beta values from both methylation27 and methylation450 platforms, with two peaks around 0.1 and 0.9 and a relatively flat valley around 0.2-0.8. The bimodal distribution is far more pronounced and balanced in the methylation450 than in the methylation27 platform. In the methylation27 platform, the lower beta peak is much stronger than the higher beta peak, while the two peaks are of similar height in the methylation450 platform. Microarray probes are mapped onto the human genome coordinates using the xena probeMap derived from the GEO GPL13534 record. Here is a reference to Illumina Infinium BeadChip DNA methylation platform beta value.

27k: DNA methylation profile was measured experimentally using the Illumina Infinium HumanMethylation27 platform. DNA methylation values, described as beta values, are recorded for each array probe in each sample via BeadStudio software. DNA methylation beta values are continuous variables between 0 and 1, representing the ratio of the intensity of the methylated bead type to the combined locus intensity. Thus higher beta values represent a higher level of DNA methylation (hypermethylation) and lower beta values represent a lower level of DNA methylation (hypomethylation). We observed a bimodal distribution of the beta values from both methylation27 and methylation450 platforms, with two peaks around 0.1 and 0.9 and a relatively flat valley around 0.2-0.8. The bimodal distribution is far more pronounced and balanced in the methylation450 than in the methylation27 platform. In the methylation27 platform, the lower beta peak is much stronger than the higher beta peak, while the two peaks are of similar height in the methylation450 platform. The average of the beta values of this dataset is 0.261794241419, thus much of the heatmap appears hypomethylated (blue). Microarray probes are mapped onto the human genome coordinates using the xena probeMap derived from the GEO GPL8490 and GPL13534 records. Here is a reference to Illumina Infinium BeadChip DNA methylation platform beta value.
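
The beta value described in both entries is a simple ratio, which can be sketched as below. The function name and the intensities are made up. The optional `offset` is an assumption added here: array software commonly adds a small constant (often 100) to the denominator to stabilize low-intensity probes, but the text above only describes the plain ratio, so the default is 0.

```python
def beta_value(methylated, unmethylated, offset=0.0):
    """Beta = methylated intensity / combined locus intensity.

    offset: optional stabilizing constant in the denominator
    (an assumption; not mentioned in the dataset description).
    """
    total = methylated + unmethylated + offset
    if total <= 0:
        raise ValueError("combined intensity must be positive")
    return methylated / total


hyper = beta_value(900, 100)   # mostly methylated signal
hypo = beta_value(100, 900)    # mostly unmethylated signal
```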
7.  What is the difference between GISTIC 2 and GISTIC 2 thresholded datasets?
Many copy number estimation algorithms estimate copy number variation on a continuous scale even though they are measuring something discrete (i.e. the number of copies of a piece of a chromosome or of a gene in the cell). The GISTIC 2 thresholded data attempts to assign discrete numbers to these fragments by thresholding the data. The estimated values -2, -1, 0, 1, 2 represent homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively. More information can be found in the GISTIC 2 paper and at the Broad Institute, which is the group that processed this data.
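
When working with a downloaded gistic2thd file, it can help to translate the five codes back into the categories listed above. A small sketch, where the constant name, the helper, and the toy calls are illustrative only:

```python
from collections import Counter

# Meaning of each thresholded value, as listed in the FAQ answer.
GISTIC2_THRESHOLD_LABELS = {
    -2: "homozygous deletion",
    -1: "single copy deletion",
    0: "diploid normal copy",
    1: "low-level copy number amplification",
    2: "high-level copy number amplification",
}


def summarize_calls(calls):
    """Count how many samples fall into each copy number category."""
    return Counter(GISTIC2_THRESHOLD_LABELS[c] for c in calls)


# Made-up thresholded calls for one gene across seven samples.
toy_calls = [0, 0, 1, -1, 2, 0, -2]
summary = summarize_calls(toy_calls)
```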

8.  For TCGA data, in the cohort.json file, there is the default dataset for somatic mutation data. How is the "default"  for each cohort selected?
Those are set by the UCSC xena team manually for each cohort. We made our call based on information at: https://wiki.nci.nih.gov/display/TCGA/TCGA+MAF+Files

We look for:
1. "curated" in the file name
2. the sample number in the data file

Based on these criteria, we make a call:
1. go with the curated file if its sample number is the highest or close to the highest
2. take the non-curated file only if its sample number is significantly higher
3. if no file has the keyword "curated" in its file name, go with the file with the highest number of samples
4. sometimes two different sequencing centers made calls on different subsets of the same cohort; in that case we include both in our default to maximize the sample number.
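
The first three rules can be sketched as a small function. Keep in mind that the Xena team makes these calls manually, so this is only an illustration: the FAQ does not define "significantly more", so the `significant_ratio` threshold is a made-up parameter, rule 4 (merging calls from two centers) is not modeled, and the dataset names and sample counts are toy values.

```python
def pick_default(files, significant_ratio=1.5):
    """Pick a default mutation dataset from {dataset name: sample count}.

    significant_ratio is a hypothetical stand-in for "significantly more".
    """
    if not files:
        return []
    curated = {n: c for n, c in files.items() if "curated" in n}
    other = {n: c for n, c in files.items() if "curated" not in n}
    best_other = max(other, key=other.get) if other else None
    # Rule 3: no curated file, so take the file with the most samples.
    if not curated:
        return [best_other]
    best_curated = max(curated, key=curated.get)
    # Rule 2: a non-curated file wins only with far more samples.
    if best_other and other[best_other] > significant_ratio * curated[best_curated]:
        return [best_other]
    # Rule 1: otherwise prefer the curated file.
    return [best_curated]


close = pick_default({"TOY_mutation_curated_broad": 200, "TOY_mutation_broad": 210})
far = pick_default({"TOY_mutation_curated_broad": 100, "TOY_mutation_broad": 400})
none_curated = pick_default({"TOY_mutation_bcm": 50, "TOY_mutation_wustl": 80})
```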

Our current (version 10-30-2015) default mutation datasets are:

default_mutation = {
    "ACC": ["TCGA_ACC_mutation_curated_broad"],
    "BLCA": ["TCGA_BLCA_mutation_broad"],  # "TCGA_BLCA_mutation_curated_broad"
    "BRCA": ["TCGA_BRCA_mutation_curated_wustl"],
    "CESC": ["TCGA_CESC_mutation_curated_wustl"],
    "CHOL": ["TCGA_CHOL_mutation_broad"],
    "COAD": ["TCGA_COAD_mutation_bcm", "TCGA_COAD_mutation_bcm_solid"],
    "DLBC": ["TCGA_DLBC_mutation_bcm"],
    "ESCA": ["TCGA_ESCA_mutation_broad"],
    "FPPP": [],  # need hgsc mixed data
    "GBM": ["TCGA_GBM_mutation_ucsc_maf"],
    "HNSC": ["TCGA_HNSC_mutation_broad"],  # "TCGA_HNSC_mutation_curated_broad"
    "KICH": ["TCGA_KICH_mutation_broad"],
    "KIRC": ["TCGA_KIRC_mutation_broad"],  # need hgsc mixed
    "KIRP": ["TCGA_KIRP_mutation_curated_broad"],
    "LAML": ["TCGA_LAML_mutation_wustl"],
    "LGG": ["TCGA_LGG_mutation_ucsc_vcf"],  # "TCGA_LGG_mutation_curated_broad"
    "LIHC": ["TCGA_LIHC_mutation_bcm"],
    "LUAD": ["TCGA_LUAD_mutation_broad"],  # "TCGA_LUAD_mutation_curated_broad"
    "LUSC": ["TCGA_LUSC_mutation_broad"],
    "LNNH": [],
    "OV": ["TCGA_OV_mutation_broad", "TCGA_OV_mutation_wustl", "TCGA_OV_mutation_curated_bcm_solid"],
    "PAAD": ["TCGA_PAAD_mutation_curated_broad"],
    "PCPG": ["TCGA_PCPG_mutation_broad"],
    "PRAD": ["TCGA_PRAD_mutation_curated_broad"],
    "READ": ["TCGA_READ_mutation_bcm", "TCGA_READ_mutation_bcm_solid"],
    "SARC": ["TCGA_SARC_mutation_broad"],
    "SKCM": ["TCGA_SKCM_mutation_broad"],
    "STAD": ["TCGA_STAD_mutation_curated_broad"],
    "TGCT": ["TCGA_TGCT_mutation_broad"],
    "THCA": ["TCGA_THCA_mutation_broad"],
    "THYM": ["TCGA_THYM_mutation_broad"],
    "UCEC": ["TCGA_UCEC_mutation_wustl"],
    "LCLL": [],
    "LCML": [],
    "MESO": ["TCGA_MESO_mutation_bcgsc"],
    "UCS": ["TCGA_UCS_mutation_curated_broad"],
    "UVM": ["TCGA_UVM_mutation_curated_broad"],
}

** We require a minimum variant allele frequency of 4% for Broad automated mutation calls, which removed 50% of the mutation calls from Broad's MAF files.
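
A variant allele frequency cutoff like this one could be applied as in the sketch below. It is not UCSC's actual filter: it assumes each call carries tumor read counts under the MAF-style keys `t_alt_count` and `t_ref_count`, it treats the 4% threshold as inclusive (the FAQ does not say), and the calls are made up.

```python
def variant_allele_frequency(alt_count, ref_count):
    """Fraction of reads at the site that support the variant allele."""
    depth = alt_count + ref_count
    return alt_count / depth if depth else 0.0


def filter_by_vaf(calls, min_vaf=0.04):
    """Keep only the calls whose VAF reaches the minimum threshold."""
    return [
        call for call in calls
        if variant_allele_frequency(call["t_alt_count"], call["t_ref_count"]) >= min_vaf
    ]


# Toy mutation calls: only the read counts matter for this example.
toy_calls = [
    {"gene": "G1", "t_alt_count": 10, "t_ref_count": 90},  # VAF 0.10, kept
    {"gene": "G2", "t_alt_count": 1, "t_ref_count": 99},   # VAF 0.01, dropped
    {"gene": "G3", "t_alt_count": 50, "t_ref_count": 50},  # VAF 0.50, kept
]
kept = filter_by_vaf(toy_calls)
```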





