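UCSC Xena is not fundamentally different from cBioPortal or Firehose: all three are ways to explore TCGA level 3 data. Xena may be slightly more convenient to use, although this web tool also takes a little time to get familiar with.
Old version: https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/
New version: https://xenabrowser.net/heatmap/#
For any cancer type, you can split the samples into subgroups by any clinical variable and then run survival analysis, group comparisons, and correlation analysis.
The original level 3 data can also be downloaded: https://xenabrowser.net/datapages/
Open a dataset and choose a data type to see its description: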
Below is the site's FAQ, kept in the original English; feel free to skip it if reading English is a struggle.
1. How can I download un-normalized data?
There are two types of normalization that we provide: on-the-fly mean normalization, and dataset normalization (varies by dataset, e.g. log2(x+1)). If you want to download data with both types of normalization, then use the Xena visualization to create a column with the data you want to download and then use the caret menu to download it. If you want to download data with just the dataset normalization, then download the data via our Explore Data pages. If you want to download the data without any normalization (i.e. the raw data), then we recommend downloading the 'raw' data directly from the data providers. We provide a link to where we obtained the data in the Explore Data pages.
2. What are the data values in the download file for TCGA data?
| Example filename | Values in file |
| --- | --- |
| TCGA_KIRC_exp_HiSeqV2 | Log2(x+1), x is the RSEM value |
| TCGA_KIRC_exp_HiSeqV2_PANCAN | Log2(x+1) value mean-normalized per gene across all TCGA samples; only the converted values belonging to this cohort are extracted. x is the RSEM value |
| TCGA_KIRC_exp_HiSeqV2_percentile | Percentile ranking of RSEM value per sample; values range from 0 to 100, lower values representing lower expression |
| TCGA_KIRC_gistic2 | Gistic2 value from Broad Firehose |
| TCGA_KIRC_gistic2thd | Gistic2 value discretized to -2,-1,0,1,2 by Broad Firehose |
| TCGA_KIRC_hMethyl27 | beta values |
| TCGA_KIRC_hMethyl450 | beta values |
| TCGA_KIRC_miRNA | Log2(x+1), x is RPKM value |
| TCGA_KIRC_mutation | PANCAN AWG somatic mutation calls |
| TCGA_KIRC_PDMRNAseq | Pathway inference score derived using RNAseq data alone (generated at Firehose) |
| TCGA_KIRC_PDMRNAseqCNV | Pathway inference score derived using RNAseq and copy number data (generated at Firehose) |
| TCGA_KIRC_RPPA | RPPA value |
| TCGA_KIRC_RPPA_RBN | RBN-normalized RPPA value |
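Several of the files above store log2(x+1) rather than x itself. If a downstream tool expects values on the linear scale, the transform can be inverted as x = 2^v - 1. The following is a minimal sketch of that arithmetic; the function names are my own and are not part of Xena.

```python
import math


def linear_to_log2p1(x: float) -> float:
    """Forward transform used by the dataset: v = log2(x + 1)."""
    return math.log2(x + 1.0)


def log2p1_to_linear(v: float) -> float:
    """Inverse transform: recover x (e.g. an RSEM value) from v = log2(x + 1)."""
    return 2.0 ** v - 1.0


if __name__ == "__main__":
    # A stored value of 3.0 corresponds to a linear-scale value of 7.0.
    print(log2p1_to_linear(3.0))
```

Note that this only undoes the dataset normalization; values from the PANCAN file have also been mean-centered, so they cannot be mapped back this way.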
3. For TCGA, which gene expression RNAseq dataset should I use for my analysis?
gene expression RNAseq (IlluminaHiSeq)
For comparison within a single TCGA cohort, you can use the "gene expression RNAseq" data. Values in this dataset are log2(x+1), where x is the RSEM value.
For questions regarding the gene expression of a particular cohort in relation to other tumor types, you can use the pancan normalized version of the "gene expression RNAseq" data. Values in this dataset are generated at UCSC by combining the "gene expression RNAseq" values (above) of all TCGA cohorts; the values are then mean-centered per gene, and only the converted values belonging to the cohort of interest (e.g. TCGA breast cancer) are extracted. Since there are 30-40 cancer types with RNAseq data, the TCGA pancan data can serve as a proxy for the background distribution of gene expression.
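The pancan normalization described above (pool all cohorts, mean-center each gene, keep only the cohort of interest) can be sketched as follows. This is an illustration of the described steps under my own assumptions, not UCSC's actual pipeline: the nested-dict layout, the function name, and the toy values are all hypothetical.

```python
from statistics import mean


def pancan_normalize(cohorts, cohort_of_interest):
    """Mean-center each gene across ALL cohorts, then return one cohort.

    cohorts: {cohort_name: {sample_id: {gene: log2(x+1) value}}}
    (hypothetical layout chosen for this sketch)
    """
    # Step 1: pool every sample from every cohort, gene by gene.
    pooled = {}
    for samples in cohorts.values():
        for profile in samples.values():
            for gene, value in profile.items():
                pooled.setdefault(gene, []).append(value)

    # Step 2: per-gene mean over the pooled samples.
    gene_mean = {gene: mean(values) for gene, values in pooled.items()}

    # Step 3: subtract the pooled mean, keeping only the cohort of interest.
    return {
        sample_id: {gene: value - gene_mean[gene] for gene, value in profile.items()}
        for sample_id, profile in cohorts[cohort_of_interest].items()
    }


if __name__ == "__main__":
    toy = {  # made-up numbers, for illustration only
        "BRCA": {"s1": {"TP53": 2.0}, "s2": {"TP53": 4.0}},
        "KIRC": {"s3": {"TP53": 6.0}},
    }
    print(pancan_normalize(toy, "BRCA"))
```

A positive value then means the gene is expressed above the all-cohort average in that sample, which is what makes the pancan version usable as a background reference.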
gene expression RNAseq (IlluminaHiSeq percentile)
For comparing with data outside TCGA, you can use the percentile version if your non-TCGA RNAseq data is normalized by percentile ranking. Values in this dataset are generated at UCSC by ranking the RSEM values per sample. The values are percentile ranks ranging from 0 to 100; lower values represent lower expression. You can also combine the TCGA RNAseq data with your own RNAseq data, normalize across the combined dataset using whatever method you choose, and then analyze the combined dataset further.
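To put non-TCGA data on a comparable 0 to 100 scale, each sample's genes can be ranked within that sample. The FAQ does not say exactly how UCSC handles ties or scales the ranks, so the sketch below makes its own assumptions: tied values share the average rank, the lowest value maps to 0 and the highest to 100. The function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(sample):
    """Rank one sample's expression values onto a 0-100 scale.

    sample: {gene: expression value} for a single sample.
    Assumption of this sketch: ties receive the average of their ranks.
    """
    ordered = sorted(sample.values())
    n = len(ordered)
    if n < 2:
        # A single value has no meaningful rank; return 0.0 by convention.
        return {gene: 0.0 for gene in sample}

    ranks = {}
    for gene, value in sample.items():
        first = bisect_left(ordered, value)       # index of first equal value
        last = bisect_right(ordered, value) - 1   # index of last equal value
        ranks[gene] = 100.0 * ((first + last) / 2) / (n - 1)
    return ranks


if __name__ == "__main__":
    print(percentile_ranks({"A": 0.0, "B": 10.0, "C": 5.0}))
```

Because ranking is monotonic, it gives the same result whether it is applied to raw RSEM values or to their log2(x+1) transform.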
TCGA Pan-Cancer gene expression
For comparison across multiple or all TCGA cohorts. The dataset is generated at UCSC by combining the "gene expression RNAseq (IlluminaHiSeq)" data (see above) from all TCGA cohorts. No further normalization is performed.
4. What is the difference between RPPA data and RPPA_RBN data? And why does the number of features differ between the two?
5. Can I combine data from the methylation 450k and 27k datasets?
The methylation 450k dataset has 90% of the probes from the 27k dataset. However, we have discovered the range of data for each dataset to be slightly different. As such, we recommend applying some sort of normalization. I would recommend looking in the literature to see what methods people have used.
6. What is the normalization procedure for the methylation 450k and 27k datasets?
From the dataset details pages:
450k: DNA methylation profile was measured experimentally using the Illumina Infinium HumanMethylation450 platform. DNA methylation values, described as beta values, are recorded for each array probe in each sample via BeadStudio software. DNA methylation beta values are continuous variables between 0 and 1, representing the ratio of the intensity of the methylated bead type to the combined locus intensity. Thus higher beta values represent a higher level of DNA methylation (hypermethylation) and lower beta values represent a lower level of DNA methylation (hypomethylation). We observed a bimodal distribution of the beta values from both methylation27 and methylation450 platforms, with two peaks around 0.1 and 0.9 and a relatively flat valley around 0.2-0.8. The bimodal distribution is far more pronounced and balanced in the methylation450 than in the methylation27 platform. In the methylation27 platform, the lower beta peak is much stronger than the higher beta peak, while the two peaks are of similar height in the methylation450 platform. Microarray probes are mapped onto the human genome coordinates using xena probeMap derived from GEO GPL13534 record. Here is a reference to Illumina Infinium BeadChip DNA methylation platform beta value.
27k: DNA methylation profile was measured experimentally using the Illumina Infinium HumanMethylation27 platform. DNA methylation values, described as beta values, are recorded for each array probe in each sample via BeadStudio software. DNA methylation beta values are continuous variables between 0 and 1, representing the ratio of the intensity of the methylated bead type to the combined locus intensity. Thus higher beta values represent a higher level of DNA methylation (hypermethylation) and lower beta values represent a lower level of DNA methylation (hypomethylation). We observed a bimodal distribution of the beta values from both methylation27 and methylation450 platforms, with two peaks around 0.1 and 0.9 and a relatively flat valley around 0.2-0.8. The bimodal distribution is far more pronounced and balanced in the methylation450 than in the methylation27 platform. In the methylation27 platform, the lower beta peak is much stronger than the higher beta peak, while the two peaks are of similar height in the methylation450 platform. The average of the beta values of this dataset is 0.261794241419, thus much of the heatmap appears hypomethylated (blue). Microarray probes are mapped onto the human genome coordinates using xena probeMap derived from GEO GPL8490 and GPL13534 records. Here is a reference to Illumina Infinium BeadChip DNA methylation platform beta value.
7. What is the difference between GISTIC 2 and GISTIC 2 thresholded datasets?
Many copy number estimation algorithms estimate copy number variation on a continuous scale even though they are measuring something discrete (i.e. the number of copies of a piece of chromosome or a gene in the cell). The GISTIC 2 thresholded data attempts to assign discrete numbers to these fragments by thresholding the data. The estimated values -2, -1, 0, 1, 2 represent homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively.
More information can be found in the GISTIC 2 paper and at the Broad Institute, which is the group that processed this data.
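When working with a downloaded GISTIC 2 thresholded file, it can help to turn the five codes back into the labels listed in the answer above. This is a small sketch only; the helper names and the example calls are hypothetical, and the code-to-label mapping is taken directly from the FAQ text.

```python
from collections import Counter

# Code-to-label mapping as stated in the FAQ answer.
GISTIC2_THRESHOLDED_LABELS = {
    -2: "homozygous deletion",
    -1: "single copy deletion",
    0: "diploid normal copy",
    1: "low-level copy number amplification",
    2: "high-level copy number amplification",
}


def describe_call(value):
    """Translate one thresholded value (int, float, or numeric string) to its label."""
    return GISTIC2_THRESHOLDED_LABELS[int(float(value))]


def summarize_calls(values):
    """Count how many samples fall into each category for one gene."""
    return dict(Counter(describe_call(v) for v in values))


if __name__ == "__main__":
    # Made-up calls for one gene across four samples.
    print(summarize_calls([2, 0, "0", -1.0]))
```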
8. For TCGA data, in the cohort.json file, there is the default dataset for somatic mutation data. How is the "default" for each cohort selected?
We are looking for:
1. "curated" in the file name
2. sample number in the data file
Based on these criteria, we make a call:
1. We go with the curated file, provided its sample number is the highest or close to the highest.
2. We take the non-curated file only if its sample number is significantly higher.
3. If no file has the keyword "curated" in its file name, we go with the file with the highest number of samples.
4. Sometimes two different sequencing centers made calls on different subsets of the same cohort; in that case we include both in our default to maximize the sample number.
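The selection rules above can be written down as a small decision function. The FAQ does not quantify "close to the highest" or "significantly more", so the `close_fraction` cutoff below is purely my own placeholder, as are the function name and the demo dataset names and counts. Rule 4 (merging calls from two sequencing centers) is a manual judgment and is not modeled.

```python
def pick_default(datasets, close_fraction=0.9):
    """Pick a default mutation dataset for one cohort.

    datasets: {dataset_name: number_of_samples}
    close_fraction: hypothetical cutoff; a curated file counts as "close to
    the highest" if it has at least this fraction of the largest file's samples.
    """
    if not datasets:
        return None

    largest = max(datasets, key=datasets.get)
    curated = {name: n for name, n in datasets.items() if "curated" in name}

    # Rule 3: no curated file at all, so take the largest file.
    if not curated:
        return largest

    best_curated = max(curated, key=curated.get)

    # Rule 1: curated wins if it is the largest or close to it.
    if curated[best_curated] >= close_fraction * datasets[largest]:
        return best_curated

    # Rule 2: otherwise the non-curated file has significantly more samples.
    return largest


if __name__ == "__main__":
    demo = {  # made-up names and counts, for illustration only
        "TCGA_DEMO_mutation_curated_broad": 90,
        "TCGA_DEMO_mutation_broad": 95,
    }
    print(pick_default(demo))
```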
Our current (version 10-30-2015) default mutation datasets are:
default_mutation = {
    "ACC": ["TCGA_ACC_mutation_curated_broad"],
    "BLCA": ["TCGA_BLCA_mutation_broad"],  # "TCGA_BLCA_mutation_curated_broad"
    "BRCA": ["TCGA_BRCA_mutation_curated_wustl"],
    "CESC": ["TCGA_CESC_mutation_curated_wustl"],
    "CHOL": ["TCGA_CHOL_mutation_broad"],
    "COAD": ["TCGA_COAD_mutation_bcm", "TCGA_COAD_mutation_bcm_solid"],
    "DLBC": ["TCGA_DLBC_mutation_bcm"],
    "ESCA": ["TCGA_ESCA_mutation_broad"],
    "FPPP": [],  # need hgsc mixed data
    "GBM": ["TCGA_GBM_mutation_ucsc_maf"],
    "HNSC": ["TCGA_HNSC_mutation_broad"],  # "TCGA_HNSC_mutation_curated_broad"
    "KICH": ["TCGA_KICH_mutation_broad"],
    "KIRC": ["TCGA_KIRC_mutation_broad"],  # need hgsc mixed
    "KIRP": ["TCGA_KIRP_mutation_curated_broad"],
    "LAML": ["TCGA_LAML_mutation_wustl"],
    "LGG": ["TCGA_LGG_mutation_ucsc_vcf"],  # "TCGA_LGG_mutation_curated_broad"
    "LIHC": ["TCGA_LIHC_mutation_bcm"],
    "LUAD": ["TCGA_LUAD_mutation_broad"],  # "TCGA_LUAD_mutation_curated_broad"
    "LUSC": ["TCGA_LUSC_mutation_broad"],
    "LNNH": [],
    "OV": ["TCGA_OV_mutation_broad", "TCGA_OV_mutation_wustl", "TCGA_OV_mutation_curated_bcm_solid"],
    "PAAD": ["TCGA_PAAD_mutation_curated_broad"],
    "PCPG": ["TCGA_PCPG_mutation_broad"],
    "PRAD": ["TCGA_PRAD_mutation_curated_broad"],
    "READ": ["TCGA_READ_mutation_bcm", "TCGA_READ_mutation_bcm_solid"],
    "SARC": ["TCGA_SARC_mutation_broad"],
    "SKCM": ["TCGA_SKCM_mutation_broad"],
    "STAD": ["TCGA_STAD_mutation_curated_broad"],
    "TGCT": ["TCGA_TGCT_mutation_broad"],
    "THCA": ["TCGA_THCA_mutation_broad"],
    "THYM": ["TCGA_THYM_mutation_broad"],
    "UCEC": ["TCGA_UCEC_mutation_wustl"],
    "LCLL": [],
    "LCML": [],
    "MESO": ["TCGA_MESO_mutation_bcgsc"],
    "UCS": ["TCGA_UCS_mutation_curated_broad"],
    "UVM": ["TCGA_UVM_mutation_curated_broad"],
}
** We require a minimum variant allele frequency of 4% for Broad automated mutation calls, which removed 50% of the mutation calls from Broad's MAF files.
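A variant allele frequency (VAF) cutoff like the 4% one mentioned above can be sketched as below. This is not Xena's actual filtering code. It assumes each call carries tumor read counts under the keys `t_ref_count` and `t_alt_count` (an assumption about the input, not something the FAQ states) and computes VAF as alt / (ref + alt).

```python
def variant_allele_frequency(ref_count, alt_count):
    """VAF = alt reads / (ref reads + alt reads); 0.0 when there is no coverage."""
    depth = ref_count + alt_count
    return alt_count / depth if depth else 0.0


def filter_by_vaf(calls, min_vaf=0.04):
    """Keep only calls whose VAF is at least min_vaf (4% by default).

    calls: list of dicts with 't_ref_count' and 't_alt_count' keys
    (assumed field names for this sketch).
    """
    return [
        call
        for call in calls
        if variant_allele_frequency(call["t_ref_count"], call["t_alt_count"]) >= min_vaf
    ]


if __name__ == "__main__":
    demo_calls = [  # made-up read counts, for illustration only
        {"gene": "GENE_A", "t_ref_count": 97, "t_alt_count": 3},   # VAF 3%
        {"gene": "GENE_B", "t_ref_count": 90, "t_alt_count": 10},  # VAF 10%
    ]
    print([c["gene"] for c in filter_by_vaf(demo_calls)])
```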