Wolfram Language Paclet Repository

Community-contributed installable additions to the Wolfram Language

TCGADataTool

Custom Entities
GDCProject entity type
ColumnHeader entity type
CDE entity type
The paclet includes custom entity types containing information on various GDC projects and data properties.
This loads the paclet.
In[23]:=
Needs["JaneShenGunther`TCGADataTool`"]
GDCProject entity type
The GDCProject entity type is defined when the paclet loads. It stores information about GDC projects, with each entity corresponding to a project available on the GDC data portal.
Get the full list of GDCProject entities.
In[38]:=
EntityList["GDCProject"]
Out[38]=
{GENIE-GRCC, GENIE-DFCI, GENIE-NKI, GENIE-VICC, GENIE-UHN, GENIE-MDA, GENIE-MSK, GENIE-JHU,
 FM-AD, VAREPOP-APOLLO, CGCI-BLGSP, BEATAML1.0-CRENOLANIB, TRIO-CRU, REBC-THYR, TARGET-CCSK,
 MP2PRT-WT, NCICCR-DLBCL, OHSU-CNL, WCDT-MCRPC, ORGANOID-PANCREATIC, CTSP-DLBCL1, CMI-ASC,
 CPTAC-3, MMRF-COMMPASS, CMI-MBC, CPTAC-2, EXCEPTIONAL_RESPONDERS-ER, BEATAML1.0-COHORT,
 CGCI-HTMCP-CC, HCMI-CMDC, TARGET-ALL-P3, TARGET-ALL-P2, TARGET-ALL-P1, TARGET-AML, TARGET-WT,
 TCGA-CHOL, TARGET-OS, TARGET-RT, TCGA-LIHC, TCGA-DLBC, TCGA-BLCA, TCGA-ACC, TCGA-CESC,
 TCGA-PCPG, TCGA-PAAD, TCGA-MESO, TCGA-TGCT, TCGA-KIRP, TCGA-UVM, TCGA-UCS, MATCH-Z1D,
 TCGA-THYM, TCGA-COAD, TCGA-ESCA, CDDP_EAGLE-1, CMI-MPC, TCGA-GBM, TCGA-KICH, TCGA-HNSC,
 TCGA-PRAD, TCGA-OV, TCGA-LUSC, TCGA-LAML, TCGA-LGG, TARGET-NBL, TCGA-SARC, TCGA-BRCA,
 TCGA-READ, TCGA-LUAD, TCGA-STAD, TCGA-THCA, TCGA-KIRC, TCGA-SKCM, TCGA-UCEC}
There are a total of 74 entities.
In[24]:=
EntityList["GDCProject"]//Length
Out[24]=
74
For each project, information such as the number of cases and the file size is stored. Most of the information is derived from the GDC Projects API endpoint.
Inspect all property values available for the TCGA-CESC entity.
In[25]:=
Entity["GDCProject","TCGA-CESC"]["PropertyAssociation"]
Out[25]=
<|Case Count -> <|GDC -> 307, TCIA -> 54|>,
 Created Date -> 2023-02-27,
 Data Categories -> {
   <|FileCount -> 2384, CaseCount -> 304, DataCategory -> Copy Number Variation|>,
   <|FileCount -> 1988, CaseCount -> 307, DataCategory -> Sequencing Reads|>,
   <|FileCount -> 4262, CaseCount -> 305, DataCategory -> Simple Nucleotide Variation|>,
   <|FileCount -> 936, CaseCount -> 307, DataCategory -> DNA Methylation|>,
   <|FileCount -> 632, CaseCount -> 307, DataCategory -> Clinical|>,
   <|FileCount -> 1242, CaseCount -> 307, DataCategory -> Transcriptome Profiling|>,
   <|FileCount -> 1536, CaseCount -> 307, DataCategory -> Biospecimen|>,
   <|FileCount -> 172, CaseCount -> 172, DataCategory -> Proteome Profiling|>,
   <|FileCount -> 1236, CaseCount -> 304, DataCategory -> Structural Variation|>},
 Disease Type -> {Squamous Cell Neoplasms,Cystic, Mucinous and Serous Neoplasms,Adenomas and Adenocarcinomas,Complex Epithelial Neoplasms},
 File Count -> 14388,
 File Size -> <|TotalFileSize -> 326.267 GB, ImageTotalFileSize -> 314.511 GB, TextTotalFileSize -> 11.7565 GB|>,
 Primary Site -> {Cervix uteri},
 Program Name -> TCGA,
 Project ID -> TCGA-CESC,
 Project Name -> Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma,
 Released Q -> True,
 State -> open,
 TCIAData Q -> True|>
Get the number of patients available on the GDC and TCIA data portals for the TCGA-CESC project.
In[48]:=
EntityValue[Entity["GDCProject","TCGA-CESC"],"CaseCount"]
Out[48]=
<|GDC -> 307, TCIA -> 54|>
Get breakdown of project file size from the GDC portal.
In[48]:=
EntityValue[Entity["GDCProject","TCGA-CESC"],"FileSize"]
Out[48]=
<|TotalFileSize -> 326.267 GB, ImageTotalFileSize -> 314.511 GB, TextTotalFileSize -> 11.7565 GB|>
Entity classes group GDC projects by program and are named after the program ID.
There is an EntityClass for each program.
In[41]:=
EntityValue["GDCProject","EntityClasses"]
Out[41]=
{GENIE, FM, VAREPOP, CGCI, BEATAML1.0, TRIO, REBC, TARGET, MP2PRT, NCICCR, OHSU, WCDT,
 ORGANOID, CTSP, CMI, CPTAC, MMRF, EXCEPTIONAL_RESPONDERS, HCMI, TCGA, MATCH, CDDP_EAGLE}
Inspect entities included in the TCGA entity class.
In[42]:=
EntityClass["GDCProject","TCGA"]
Out[42]=
TCGA
In[44]:=
EntityList@EntityClass["GDCProject","TCGA"]
Length[%]
Out[44]=
{TCGA-CHOL, TCGA-LIHC, TCGA-DLBC, TCGA-BLCA, TCGA-ACC, TCGA-CESC, TCGA-PCPG, TCGA-PAAD,
 TCGA-MESO, TCGA-TGCT, TCGA-KIRP, TCGA-UVM, TCGA-UCS, TCGA-THYM, TCGA-COAD, TCGA-ESCA,
 TCGA-GBM, TCGA-KICH, TCGA-HNSC, TCGA-PRAD, TCGA-OV, TCGA-LUSC, TCGA-LAML, TCGA-LGG,
 TCGA-SARC, TCGA-BRCA, TCGA-READ, TCGA-LUAD, TCGA-STAD, TCGA-THCA, TCGA-KIRC, TCGA-SKCM,
 TCGA-UCEC}
Out[45]=
33
CDE entity type
The CDE entity type is defined when the paclet loads. It stores information about Common Data Elements (CDEs) derived from the CDE browser, such as the data element description.
There are tens of thousands of CDE entities.
In[3]:=
EntityList["CDE"]//Length
Out[3]=
71886
CDE entities are identified using their CDE ID.
In[6]:=
Entity["CDE","2001822"]
Out[6]=
2001822
CDE entities only store a subset of information available from the CDE browser.
In[7]:=
Entity["CDE","2001822"]["PropertyAssociation"]
Out[7]=
<|Data Element Long Name -> Patient Hospitalization Ind-2,
 Preferred Definition -> the yes/no indicator that asks whether the patient was hospitalized.,
 Data Element Public ID -> 2001822,
 Documentation Version -> xml_cde_20227105054,
 Value Domain Datatype -> CHARACTER,
 Value Domain Unit Of Measure -> Missing[],
 Version -> 3|>
In[8]:=
Entity["CDE","649"]["PropertyAssociation"]
Out[8]=
<|Data Element Long Name -> Patient Height Measurement,
 Preferred Definition -> the height of the patient in centimeters.,
 Data Element Public ID -> 649,
 Documentation Version -> xml_cde_20227105054,
 Value Domain Datatype -> NUMBER,
 Value Domain Unit Of Measure -> Centimeters,
 Version -> 4.1|>
Version refers to the CDE definition version from the CDE browser. Only the most recent version is stored for each CDE ID.
In[26]:=
SeedRandom[123];
EntityValue[RandomEntity["CDE",100],"Version"]
Out[27]=
{1,2,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
Documentation version refers to the file used to scrape data from the CDE browser. The file name records the file type and the file date. All CDE entities use CDE data from July 2022.
In[30]:=
SeedRandom[123];
Union@EntityValue[RandomEntity["CDE",100],"DocumentationVersion"]
Out[31]=
{xml_cde_20227105054}
ColumnHeader entity type
The ColumnHeader entity type is loaded by the TCGA Data Tool User Interface every time data is loaded or downloaded. It stores information about properties available in the data currently used by the user interface. For example, if only clinical patient data is in scope, then only ColumnHeader entities for clinical patient properties will be defined.
ColumnHeader entities are defined using EntityStore. Every time data is downloaded using the TCGA Data Tool User Interface, it is exported together with the corresponding EntityStore for ColumnHeader entities.
Get example data EntityStore.
In[32]:=
exampleDataTCGA[{"TCGAProjectData","TCGACESCExceptGenomicDataAllPatients"},"EntityStore"]
Out[32]=
EntityStore
Type: ColumnHeader
Entities: {Clinical::Patient::bcr_patient_uuid,Clinical::Patient::bcr_patient_barcode,Clinical::Patient::form_completion_date,…,ScrapedData::HistologicalImages::slide_id,ScrapedData::HistologicalImages::state,ScrapedData::HistologicalImages::updated_datetime} (551)
Properties: {AlternativeHeaderRawLabel,Category,CDEID,…,LongLabel,Subcategory,Unit} (8)
Entity classes: none
Property classes: none

Define ColumnHeader entities from example EntityStore.
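A minimal sketch of this step, assuming the store returned by exampleDataTCGA can be registered with the built-in EntityRegister function. The variable name store is only illustrative.

```wolfram
(* Load the paclet and fetch the example EntityStore shown above *)
Needs["JaneShenGunther`TCGADataTool`"]
store = exampleDataTCGA[
   {"TCGAProjectData", "TCGACESCExceptGenomicDataAllPatients"},
   "EntityStore"];

(* Assumption: registering the store with the built-in EntityRegister
   makes the "ColumnHeader" type available in the current session *)
EntityRegister[store]

(* Count the ColumnHeader entities that are now defined *)
Length[EntityList["ColumnHeader"]]
```

Once the store is registered, the entities can be queried with Entity, EntityValue and EntityList in the same way as the GDCProject and CDE types.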
The ColumnHeader entity type is now defined in the current session and can be accessed directly using built-in Entity functionality.
ColumnHeader entities store information about properties that appear in their associated data structure, and they are identified by the property standard name. Many properties that appear in GDC data files are associated with a specific CDE through a CDE ID. In those cases, additional information about the property can be retrieved through the associated CDE entity.
Property standard names are used as canonical names for ColumnHeader entities.
Get all property values for a ColumnHeader entity.
Get the preferred definition for property "Clinical::Patient::vital_status" from the associated CDE entity.
Not all properties have an associated CDE entity. Generally, properties under the ScrapedData category don't have an associated CDE entity but they're not the only example.
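The lookups described above could be sketched as follows. This assumes that the example EntityStore registers via EntityRegister, that "CDEID" (listed among the store's properties) holds the identifier of the associated CDE entity, and that "PreferredDefinition" is the canonical name behind the "Preferred Definition" label. The variable names header and cdeID are illustrative, and the property names should be checked against the loaded data.

```wolfram
Needs["JaneShenGunther`TCGADataTool`"]

(* Register the example ColumnHeader entities (assumed registration step) *)
store = exampleDataTCGA[
   {"TCGAProjectData", "TCGACESCExceptGenomicDataAllPatients"},
   "EntityStore"];
EntityRegister[store];

(* The property standard name serves as the canonical entity name *)
header = Entity["ColumnHeader", "Clinical::Patient::vital_status"];

(* All property values stored for this ColumnHeader entity *)
header["PropertyAssociation"]

(* Follow the CDE ID to the CDE entity, guarding against properties
   that have no associated CDE (for example, most ScrapedData ones) *)
cdeID = header["CDEID"];
If[MissingQ[cdeID],
 Missing["NoAssociatedCDE"],
 Entity["CDE", ToString[cdeID]]["PreferredDefinition"]]
```

The MissingQ guard reflects the note above that some properties carry no CDE ID. ToString is used because CDE entities are identified by string IDs such as "2001822", while the stored ID might be an integer.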

© 2025 Wolfram. All rights reserved.
