JaneShenGunther/TCGADataTool | Paclet Repository

Genomic Data

Genomic Data walkthrough

DNA methylation download

Example workflow for using genomic data from the TCGA Data Tool.

This loads the package.

Needs["JaneShenGunther`TCGADataTool`"]

Genomic Data walkthrough

Summary of the genomic data available for TCGA-CESC project.

Load TCGA-CESC example data structure and its description.

In[84]:=

exampleDataTCGA[{"TCGAProjectData","TCGACESCFullDataScopePatientSample"},"Description"]

Out[84]=

Example data structure for project TCGA-CESC, including 10 randomly sampled patients and full data scope. This example shows how data is stored and organized under the hood by the TCGADataToolUserInterface[]. Files exported in the .m format from the TCGADataToolUserInterface[] will adhere to this format.

In[85]:=

dataStructure=exampleDataTCGA[{"TCGAProjectData","TCGACESCFullDataScopePatientSample"}];

Summary

For each patient, the genomic data is structured as a List of Associations.

In[112]:=

Short[#,10]&@dataStructure〚1〛["GenomicData"]["SimpleNucleotideVariation_MaskedSomaticMutation"]

Out[112]//Short=

{HugoGeneSymbolCAMTA1,EntrezGeneID23261,CenterWUGSC,NCBIBuildGRCh38,Chromosomechr1,StartPosition7736540,EndPosition7736540,Strand+,VariantClassificationMissense_Mutation,VariantTypeSNP,ReferenceAlleleG,TumorSeqAllele1G,TumorSeqAllele2A,dbSNP_RSrs1285071931,dbSNP_Val_StatusMissing[],TumorSampleBarcodeTCGA-VS-A8EH-01A-11D-A36J-09,109,HGVSOffsetMissing[],Phenotype0;1,GenePhenotype1,ContextATGGCGGTAAG,TumorBAMUUID9994b31f-bde3-408a-bfb8-626faa375aac,normal_bam_uuid32be9198-0aad-4fd7-ae06-f19cbd44868e,bcr_patient_uuid05026179-b1da-411e-a286-89727b1ae380,GDCFilterMissing[],COSMICMissing[],HotspotN,RNASupportUnknown,RNADepthMissing[],RNARefCountMissing[],RNAAltCountMissing[],Callersmuse;mutect2;varscan2,65,1}

Columns available for the masked somatic mutations:

In[87]:=

dataStructure〚1〛["GenomicData"]["SimpleNucleotideVariation_MaskedSomaticMutation"]〚1〛//Keys//Multicolumn[#,3]&

Out[87]=

HugoGeneSymbol	Gene	gnomAD_NFE_AF
EntrezGeneID	Feature	gnomAD_OTH_AF
Center	FeatureType	gnomAD_SAS_AF
NCBIBuild	OneConsequence	MAX_AF
Chromosome	Consequence	MAX_AF_POPS
StartPosition	cDNAPosition	gnomAD_non_cancer_AF
EndPosition	CDSPosition	gnomAD_non_cancer_AFR_AF
Strand	ProteinPosition	gnomAD_non_cancer_AMI_AF
VariantClassification	AminoAcids	gnomAD_non_cancer_AMR_AF
VariantType	Codons	gnomAD_non_cancer_ASJ_AF
ReferenceAllele	ExistingVariation	gnomAD_non_cancer_EAS_AF
TumorSeqAllele1	Distance	gnomAD_non_cancer_FIN_AF
TumorSeqAllele2	TranscriptStrand	gnomAD_non_cancer_MID_AF
dbSNP_RS	GeneSymbol	gnomAD_non_cancer_NFE_AF
dbSNP_Val_Status	SymbolSource	gnomAD_non_cancer_OTH_AF
TumorSampleBarcode	HGNCGeneID	gnomAD_non_cancer_SAS_AF
MatchedNormSampleBarcode	Biotype	gnomAD_non_cancer_MAX_AF_adj
MatchNormSeqAllele1	Canonical	gnomAD_non_cancer_MAX_AF_POPS_adj
MatchNormSeqAllele2	CCDS	ClinicalSignificance
TumorValidationAllele1	ENSP	Somatic
TumorValidationAllele2	SwissProt	PubmedID
MatchNormValidationAllele1	TrEMBL	TranscriptionFactors
MatchNormValidationAllele2	UniParc	MotifName
VerificationStatus	UniProtIsoform	MotifPosition
ValidationStatus	RefSeq	HighInformationPositionFlag
MutationStatus	Mane	MotifScoreChange
SequencingPhase	APPRIS	miRNA
SequenceSource	Flags	Impact
ValidationMethod	SIFT	Pick
Score	PolyPhen	VariantClass
BAMFile	EXON	TranscriptSupportLevel
Sequencer	Intron	HGVSOffset
TumorSampleUUID	Domains	Phenotype
MatchedNormSampleUUID	1000G_AF	GenePhenotype
HGVSc	1000G_AFR_AF	Context
HGVSp	1000G_AMR_AF	TumorBAMUUID
HGVSpShort	1000G_EAS_AF	normal_bam_uuid
TranscriptID	1000G_EUR_AF	bcr_patient_uuid
ExonNumber	1000G_SAS_AF	GDCFilter
t_depth	ESP_AA_AF	COSMIC
t_ref_count	ESP_EA_AF	Hotspot
t_alt_count	gnomAD_AF	RNASupport
n_depth	gnomAD_AFR_AF	RNADepth
n_ref_count	gnomAD_AMR_AF	RNARefCount
n_alt_count	gnomAD_ASJ_AF	RNAAltCount
AllEffects	gnomAD_EAS_AF	Callers
Allele	gnomAD_FIN_AF

To demonstrate selecting genomic data for different samples, we first make sure the example contains patients with data from multiple aliquots. Such patients have more than one distinct “TumorSampleBarcode”, because their records come from multiple genomic files.

Define a variable holding the patients with multiple samples tested for genomic data

In[88]:=

dataPatientwithmultiplegenomicfiles=Select[Length[Union[Query[All,"TumorSampleBarcode"]@(#["GenomicData"]["SimpleNucleotideVariation_MaskedSomaticMutation"])]]>1&]@dataStructure;
Length[dataPatientwithmultiplegenomicfiles]

Out[89]=

Extract data based on sample type

Define example data structure

In[90]:=

dataWithMultiplegenomics=Union[dataStructure〚;;3〛,(* adding a patient known to have multiple files *)dataPatientwithmultiplegenomicfiles];

Show how the output differs when restricting to a single sample type:

Define sample type

In[91]:=

$sampletype="Primary Tumor";

Example showing how to restrict to a sample type and how the results differ

In[120]:=

Query[All, <|
  "bcr_patient_uuid" -> #["bcr_patient_uuid"],
  "Clinical::Patient::weight" -> Query[First, "weight"]@#["Clinical", "Patient"],
  "Clinical::Patient::height" -> Query[First, "height"]@#["Clinical", "Patient"],
  (* here there is no constraint on the sample type *)
  "Impact" -> (Query[Counts, "Impact"]@#["GenomicData", "SimpleNucleotideVariation_MaskedSomaticMutation"]),
  "Impact_high" -> (Query[Counts, "Impact"]@#["GenomicData", "SimpleNucleotideVariation_MaskedSomaticMutation"])["HIGH"],
  (* here we add a constraint on the sample type with Select[#["sample_type"] == $sampletype &] *)
  "Impact_in_selected_sampletype" -> Query[Counts, "Impact"]@Query[Select[#["sample_type"] == $sampletype &], All]@Query[All, {"Impact", "sample_type"}]@JoinAcross[
    (Query[All, {"TumorSampleBarcode", "Impact"}]@#["GenomicData", "SimpleNucleotideVariation_MaskedSomaticMutation"]),
    JoinAcross[
      (* a further JoinAcross is needed because TumorSampleBarcode is equivalent to "bcr_aliquot_barcode" and not simply to "bcr_sample_barcode" *)
      (Query[All, {"bcr_sample_barcode", "sample_type"}]@#["Biospecimen", "Sample"]),
      (Query[All, {"bcr_aliquot_barcode", "bcr_sample_barcode"}]@#["Biospecimen", "Aliquot"]),
      "bcr_sample_barcode", "Outer"],
    "TumorSampleBarcode" -> "bcr_aliquot_barcode"],
  "Impact_high_in_selected_sampletype" -> (Query[Counts, "Impact"]@Query[Select[#["sample_type"] == $sampletype &], All]@Query[All, {"Impact", "sample_type"}]@JoinAcross[
    (Query[All, {"TumorSampleBarcode", "Impact"}]@#["GenomicData", "SimpleNucleotideVariation_MaskedSomaticMutation"]),
    JoinAcross[
      (Query[All, {"bcr_sample_barcode", "sample_type"}]@#["Biospecimen", "Sample"]),
      (Query[All, {"bcr_aliquot_barcode", "bcr_sample_barcode"}]@#["Biospecimen", "Aliquot"]),
      "bcr_sample_barcode", "Outer"],
    "TumorSampleBarcode" -> "bcr_aliquot_barcode"])["HIGH"]
|> &]@dataWithMultiplegenomics
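The two-step join in the query above can be easier to follow in isolation. The sketch below reproduces it on three small toy tables; every barcode and value is invented for illustration, and only the key names are taken from the example above.

```wolfram
(* Toy tables; barcodes and values are invented for illustration *)
mutations = {
   <|"TumorSampleBarcode" -> "P1-01A-11D", "Impact" -> "HIGH"|>,
   <|"TumorSampleBarcode" -> "P1-06A-11D", "Impact" -> "HIGH"|>,
   <|"TumorSampleBarcode" -> "P1-01A-11D", "Impact" -> "LOW"|>
   };
samples = {
   <|"bcr_sample_barcode" -> "P1-01A", "sample_type" -> "Primary Tumor"|>,
   <|"bcr_sample_barcode" -> "P1-06A", "sample_type" -> "Metastatic"|>
   };
aliquots = {
   <|"bcr_aliquot_barcode" -> "P1-01A-11D", "bcr_sample_barcode" -> "P1-01A"|>,
   <|"bcr_aliquot_barcode" -> "P1-06A-11D", "bcr_sample_barcode" -> "P1-06A"|>
   };

(* Step 1: attach the sample type to every aliquot *)
aliquotTypes = JoinAcross[samples, aliquots, "bcr_sample_barcode"];

(* Step 2: attach the sample type to every mutation via the aliquot barcode *)
annotated = JoinAcross[mutations, aliquotTypes,
   "TumorSampleBarcode" -> "bcr_aliquot_barcode"];

(* Step 3: count impacts within one sample type *)
primaryImpactCounts =
 Counts[Query[Select[#["sample_type"] == "Primary Tumor" &], "Impact"]@annotated]
```

The sketch uses the default inner join in step 1, whereas the query above uses "Outer"; with toy tables in which every sample has an aliquot, the two give the same rows.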

Create example data to extend existing design matrix

Define sample type

Example 1: Computation based on all columns for a given patient, determining the total count of high impact mutations.

Determine total high impact
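As a rough illustration of this step, the sketch below counts high-impact records per patient on toy data. The patient identifiers and records are invented; only the nesting ("GenomicData", then "SimpleNucleotideVariation_MaskedSomaticMutation") and the "Impact" key follow the structure shown earlier. It is not necessarily the code used in the original notebook.

```wolfram
(* Toy patients; identifiers and impacts are invented for illustration *)
toyPatients = {
   <|"bcr_patient_uuid" -> "patient-A",
    "GenomicData" -> <|"SimpleNucleotideVariation_MaskedSomaticMutation" -> {
        <|"Impact" -> "HIGH"|>, <|"Impact" -> "MODERATE"|>, <|"Impact" -> "HIGH"|>}|>|>,
   <|"bcr_patient_uuid" -> "patient-B",
    "GenomicData" -> <|"SimpleNucleotideVariation_MaskedSomaticMutation" -> {
        <|"Impact" -> "LOW"|>}|>|>
   };

(* Count records whose Impact is "HIGH"; Count returns 0 when there are none *)
highImpactTotals = Query[All, <|
     "bcr_patient_uuid" -> #["bcr_patient_uuid"],
     "TotalHighImpact" -> Count[
       #["GenomicData", "SimpleNucleotideVariation_MaskedSomaticMutation"],
       KeyValuePattern["Impact" -> "HIGH"]]
     |> &]@toyPatients
```

Using Count rather than looking up "HIGH" in a Counts result yields 0 for patients without high-impact mutations, where the lookup would return Missing["KeyAbsent", "HIGH"]. A numeric 0 is usually more convenient in a design matrix column.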

Example 2: Computation based on single mutation

Define the mutation of interest

Determine total high impact mutations
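One way to sketch this step is shown below. It assumes the mutation of interest is identified by its gene symbol, which may differ from the criterion used in the original notebook; the variable geneOfInterest and all records are hypothetical.

```wolfram
(* Hypothetical choice: a mutation of interest identified by its gene symbol *)
geneOfInterest = "GENE1";

(* Toy mutation table for one patient; values are invented for illustration *)
toyMutations = {
   <|"HugoGeneSymbol" -> "GENE1", "Impact" -> "HIGH"|>,
   <|"HugoGeneSymbol" -> "GENE1", "Impact" -> "LOW"|>,
   <|"HugoGeneSymbol" -> "GENE2", "Impact" -> "HIGH"|>
   };

(* Count records that are both in the gene of interest and high impact *)
highImpactInGene = Count[toyMutations,
  KeyValuePattern[{"HugoGeneSymbol" -> geneOfInterest, "Impact" -> "HIGH"}]]
```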

Extend a design matrix

Define the variables for design matrix creation

Create a design matrix

DNA methylation download

Load TCGA-CESC example data structure and its description.

Get methylation data

Workflow functions for methylation data import.

Select the project and patient UUIDs for the example

Select the download folder

Download methylation raw data

Inspect the data

Get human methylation genomic coordinates and merge data

Get data for human methylation genomic coordinates from Wolfram Data Repository.

Merge methylation data with genomic coordinates
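A merge of this kind can be sketched with JoinAcross on a shared probe identifier. The key names "ProbeID", "Beta", "Chromosome" and "GeneSymbol" and all values below are assumptions made for illustration; the actual column names in the downloaded files and in the Wolfram Data Repository resource may differ.

```wolfram
(* Toy beta values for one patient; probe IDs and values are invented *)
toyMethylation = {
   <|"ProbeID" -> "cg0001", "Beta" -> 0.82|>,
   <|"ProbeID" -> "cg0002", "Beta" -> 0.10|>
   };

(* Toy coordinate table standing in for the genomic coordinate resource *)
toyCoordinates = {
   <|"ProbeID" -> "cg0001", "Chromosome" -> "chr1", "GeneSymbol" -> "GENE1"|>,
   <|"ProbeID" -> "cg0002", "Chromosome" -> "chr2", "GeneSymbol" -> "GENE2"|>,
   <|"ProbeID" -> "cg0003", "Chromosome" -> "chr3", "GeneSymbol" -> "GENE3"|>
   };

(* Inner join keeps only probes present in both tables *)
merged = JoinAcross[toyMethylation, toyCoordinates, "ProbeID"]
```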

Brief data exploration

Create an overlapped histogram of beta values from the two patients
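A minimal sketch of such a plot, using simulated values rather than real methylation data: the two samples below are drawn from beta distributions only so that they lie in [0, 1], and the patient labels are placeholders.

```wolfram
(* Simulated beta values in [0, 1] for two hypothetical patients *)
SeedRandom[1];
betaPatient1 = RandomVariate[BetaDistribution[0.5, 0.5], 1000];
betaPatient2 = RandomVariate[BetaDistribution[2, 5], 1000];

(* Passing a list of datasets overlays the histograms with translucent bars *)
plot = Histogram[{betaPatient1, betaPatient2}, {0, 1, 0.05},
  ChartLegends -> {"Patient 1", "Patient 2"},
  AxesLabel -> {"Beta value", "Count"}]
```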

Compare distributions for a specific gene name