JaneShenGunther/TCGADataTool | Paclet Repository

User Interface

Overview	Data browser
Data download	Radiological images

The paclet includes the function TCGADataToolUserInterface, which opens an interface in a new notebook to assist with downloading, inspecting, preparing and exporting the TCGA data.

This loads the paclet.

In[1]:=

Needs["JaneShenGunther`TCGADataTool`"]

Overview

The User Interface (UI) is implemented as a separate Mathematica notebook that allows data download, inspection and export. Note that only one UI notebook may be open at a time, and it is only functional while the paclet is loaded in the kernel.

TCGADataToolUserInterface[]	open the user interface notebook

Load the TCGA user interface.

In[73]:=

TCGADataToolUserInterface[]

Out[73]=

NotebookObject[TCGA Data Tool]

This will open the UI at the “Download parameters” pane shown below. The UI has two main “sections”:

◼ Data retrieval / processing

◼ Data browser

The following sections describe these two UI functionalities in detail. The UI layout remains the same from pane to pane: the bar below the window title bar shows some useful information and buttons, and the menu on the left shows which functionalities are accessible and highlights the current UI pane.

Screenshot of the default view when the UI is loaded.

Clean cache button

The “Clean cache” button, below the window title bar, can be used to clear downloaded data that are no longer needed from a temporary folder. Raw data files are downloaded, then cleaned and processed before being saved in a second location. These locations are set in the advanced options during download.

Current / available memory

Current and available memory are displayed on either side of the “Clean cache” button. “Current memory used” indicates the amount of memory currently being used by the Wolfram Language kernel. “Memory available” is the amount of memory available for storing additional data in the current Mathematica session.

If the available memory is low, this can cause various issues, such as sluggish response of the kernel. We recommend quitting the kernel and reloading the package if the memory in use is high and you wish to download new data.
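The same kind of figures can be checked directly in the kernel. The following is a minimal sketch, assuming the values shown in the UI are comparable to the built-in MemoryInUse and MemoryAvailable functions (the documentation does not state which functions the UI calls); formatMiB is a hypothetical helper introduced only for this example.

```wolfram
(* Sketch: report kernel memory figures similar to those shown in the UI bar. *)
(* Assumption: the UI values are comparable to MemoryInUse[] and MemoryAvailable[]. *)

(* Hypothetical helper: format a byte count as whole mebibytes *)
formatMiB[bytes_Integer] := ToString[Round[bytes/2.^20]] <> " MiB"

Print["Current memory used: ", formatMiB[MemoryInUse[]]]
Print["Memory available: ", formatMiB[MemoryAvailable[]]]

(* If memory in use is high before a new download, restart and reload: *)
(* Quit[] *)
(* Needs["JaneShenGunther`TCGADataTool`"] *)
```

The last two lines are left commented out because Quit[] ends the kernel session, which also makes any open UI notebook non-functional until the paclet is loaded again.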

Data download

This section covers UI panes for data download and processing.

Parameters setting

The first pane in the data download and processing section is the “Download parameters” pane. Here the user decides what data to download from GDC/TCIA portals. There are three main parameters that should be set: primary site, project and data scope.

The automatic initial selection of the download parameters pane

Primary site

The user must select a primary site of interest, which will then cause the projects with data in this primary site to be displayed under the “Project selection” column. The primary sites are derived from the GDC data portal.

The site may be selected either by clicking the one of interest in the list, or searching via the input field at the top, which will autocomplete to valid sites. Once a primary site is selected, the relevant projects and their summaries will be displayed in the "Project selection" column.

Note that although by default all patients will be downloaded for a project regardless of the selected primary site, you can restrict the patients to the selected primary site via the “Patient selection” popup menu on the right-hand side of the pane. This menu will only display additional options for projects with more than one primary site.

Project

The user must select a project from the list under the column “Project selection”. The list of projects is derived from the GDC portal. The user can also type a project name directly in the input field and autocompletion will suggest valid projects. Once a project has been selected, a grid with information about that project is displayed below the list of available projects. The information shown about the project is derived from the GDC portal and can be accessed in the code as custom entities of type “Project”.

Data scope

In the data scope selection column, the user can select what type of data should be downloaded and processed. Data is organized in categories and subcategories. It is possible to select a whole category (Clinical/Biospecimen/Scraped data/Genomic data) or only a few subcategories within different categories.

Clinical, Biospecimen and Genomic data subcategories are derived from GDC data. Scraped data instead refers to data not directly available as files on the GDC portal. This includes:

◼ Radiological Images: metadata about radiological images, derived from the TCIA portal

◼ Histological Images: metadata about histological images, derived from the GDC portal; this expands the information from Biospecimen Diagnostic Slide / Slide

◼ FollowUp / NewTumorEventFollowUp: these collapse whatever follow-up versions are available under the clinical data category

Clinical Patient data must always be included because it is used to get the complete list of patient UUIDs and the mapping between patient UUIDs and barcodes, and thus cannot be unselected.

Patient selection

Screenshot of the patient selection tool.

The patient selection dropdown menu allows toggling between downloading all patients for the given project, or only those that are relevant for the selected primary site. The second option to restrict the downloaded data will only appear for projects with multiple primary sites.

Advanced options

Advanced options can be accessed by clicking the little arrow to the left of the text:

Screenshot of the expanded advanced options settings.

These options allow the user to choose:

◼ which tool should be used for data download (GDC API or GDC Data Transfer Tool)

◼ where to save raw data

◼ where to save processed data

◼ the file name of the exported data file

Directories can be selected by typing directly into the blue boxes, or by clicking the “select directory” button. The exported file name is based on the date and time when the pane is initialized and, unless changed, will be automatically updated after data is downloaded in order to avoid overwriting previous downloads.

Load saved data

By clicking “Load saved data”, the user can choose to import previously downloaded data into the interface, rather than having to download new data. On selecting a file, the pane will be set to the “Progress report” pane, and summaries of the data scope and coverage will be displayed in the progress report, as with the “Get data” button.

Note: When the data is downloaded, it is automatically saved into the locations set in the advanced options and it can be imported afterwards.

Get data

When the user clicks on the “Get Data” button, the UI will automatically transition to the “Download/processing pane” and start downloading and processing the data.

Progress pane

This pane shows the progress of any ongoing data download, and a summary of the downloaded data once the process is complete. Please note that any messages starting with “WARNING” or “ERROR” should be read. There are three main areas: the progress pane at the top, the settings summaries in the middle and the patient coverage summary at the bottom.

Progress after a download has been completed.

Progress

The progress pane appears at the top of the window, and will initially appear blank. Once the user starts to get data, information on the progress will be printed into the pane. While sometimes messages appear rapidly, depending on the size of the download files, it can take a while for a single download to complete. Please pay special attention to any messages starting “WARNING” or “ERROR” as these should not routinely appear in the message window.

New messages are printed at the top of the window, so the messages visible will always be up to date with current progress unless the user scrolls down.

Progress pane if selected without starting data download from the download parameters pane.

The pane while data is being downloaded.

Settings summaries

There are two sets of summaries in this area. To the left are the data retrieval settings, which summarise the project and scope of the data selected in the “Download parameters” pane. To the right, the advanced options are summarised, even if they were left unchanged during the data selection. These summarise the tool and the download directories used.

Patient coverage

At the bottom of the pane is an area that displays the patient coverage of the currently downloaded data. Each category has its own labeled table, with the subcategories appearing below in separate rows. There are two columns for each subcategory: the first indicates the percentage of downloaded patients that have data for the subcategory, the second indicates the overall count.

Examples of the patient coverage from a set of subcategories selected from TCGA-CESC.

Each row is individually color coded in order to indicate the amount of data available:

◼ Light blue indicates a high percentage of patients with data in that subcategory;

◼ Medium blue indicates between 10% and 80% of patients;

◼ Light red indicates a low percentage.
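The coverage figures and color bands described above can be sketched as follows. This is an illustration only: it assumes “high” means above 80% and “low” means below 10% (the text only gives the 10% to 80% band for medium blue), the patient IDs are made up, and the color values are stand-ins rather than the ones the UI actually uses.

```wolfram
(* Sketch: coverage percentage, count and row color for one subcategory. *)
(* Assumptions: "high" is above 80 and "low" is below 10; colors are illustrative. *)
coverageColor[pct_?NumericQ] := Which[
  pct > 80, LightBlue,            (* high coverage *)
  pct >= 10, Lighter[Blue, 0.5],  (* between 10% and 80% *)
  True, LightRed                  (* low coverage *)
]

(* Hypothetical data: all downloaded patients, and those with data in the subcategory *)
allPatients = {"P1", "P2", "P3", "P4", "P5"};
withData = {"P1", "P2", "P3"};

count = Length[Intersection[allPatients, withData]];
pct = 100. * count/Length[allPatients];

{pct, count, coverageColor[pct]}
```

With three of five patients covered, the percentage is 60, which falls in the medium band under the thresholds assumed here.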

The subcategories are arranged in alphabetical order, and the display area will scroll vertically to display subcategories if there are more than will fit within the display.

Open data folder

The “Open data folder” button will open a window in the default file browser, showing the currently selected “Processed data folder export” location.

Radiological Images

The “Radiological images” button will take the user to the first pane in the radiological images download and inspection tools. Note that although this button will always work, the radiological images panes will only be populated if the user has downloaded the “Radiological images” subcategory data from the “Scraped data” category.

Data filtering

The “Data filtering” button will move the user over to the first pane of the data browser, in order to inspect and select data filtering.

Data browser

The data browser includes a set of panes that allows the user to inspect, refine and export the downloaded data. In the “Column filtering” stage, columns in each subcategory can be selected to refine the data and filters can be applied to remove patients outside of certain ranges or categories. Details of individual columns can be inspected in the “Data inspection” pane and the user can switch between these two panes at will to iterate and refine what they need. The “Data export” pane allows the user to export the final range of data in a variety of formats.

Column filtering

Filtering enables the user to select which columns of data they are interested in, and gives them the option to restrict the data to only patients which match certain criteria. The pane is broken down into four major sections: “Subcategory selection”, “Column selection”, the “Filter column” inspector in the top right and the summary of selected filters in the bottom left “selected columns/filters”. The “Apply filters” button must be clicked to select the columns and filters for inspection and export.

Filters are applied using logical AND. Only patients that match ALL filters will be included in the result, so it is possible that no patient matches all selected criteria.
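The effect of combining filters with logical AND can be sketched on a small made-up data set. The records, column names and the applyFilters helper below are hypothetical and are not part of the paclet; they only illustrate how a range filter and a category filter narrow the patient list together.

```wolfram
(* Hypothetical patient records; column names are illustrative only *)
patients = {
  <|"id" -> "P1", "age" -> 45, "stage" -> "I"|>,
  <|"id" -> "P2", "age" -> 62, "stage" -> "II"|>,
  <|"id" -> "P3", "age" -> 70, "stage" -> "I"|>
};

(* Each filter is a predicate acting on one record *)
filters = {
  Between[#["age"], {40, 65}] &,   (* numerical range filter *)
  MemberQ[{"I"}, #["stage"]] &     (* categorical filter *)
};

(* Keep only the records for which every filter returns True (logical AND) *)
applyFilters[data_List, fs_List] :=
  Select[data, Function[rec, AllTrue[fs, #[rec] &]]]

Lookup[applyFilters[patients, filters], "id"]   (* {"P1"} *)
```

Adding a further filter can only shrink the result, which is why a combination of individually reasonable filters may leave no patients at all.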

Initially when entering the pane after the download pane there will be no data selected, with the subcategory and column selection sections available. Getting additional data or switching projects will preserve any selected filters if possible, but projects vary in available columns, which will result in some filters being dropped.

The initial layout of the column filtering pane after data has been downloaded.

Subcategory selection

Subcategory selection allows the user to navigate between different subcategories, displaying the subcategory's columns in the column selection section. Subcategories are displayed under each of the four categories. Clicking on the text will cause the column selection to display the columns in that subcategory; clicking on the checkbox will select all columns in the subcategory for filtering. Once checked, the checkbox can be clicked again to deselect all columns in the subcategory. Note that if some columns were already selected within a subcategory, this information will be lost if the checkbox for the entire subcategory is clicked.

Column selection

Column selection enables the user to select specific columns from the currently selected subcategory for filtering, inspection and export. Clicking either the checkbox or the text will result in toggling between having the column included and excluded, with all columns starting as excluded by default.

Column selection is displayed only for the subcategory selected in the left-hand section; changing subcategories will change the range of columns displayed. Some subcategories have columns with the same display name that contain potentially different data. These must be selected separately if the user wants to include them. For example, the Follow Up subcategory appears for both Clinical and Scraped data, and some calculations such as survival rates are better studied by combining the two (this is the default functionality in the included survival rate functions, assuming that the data is all included).

Filter column

The filter column section consists of a popup menu at the top, where the user can select which column they are interested in, and a summary of the column’s data below, with options to select different filtering methods. The filter column will default to showing the most recently selected column, while the popup menu allows selection of other columns.

“Update” must be clicked to activate a filter for the column. The “Apply filters” button must be clicked to set filters for the data inspection and export.

Column filtering with a numerical column selected.

All columns have a summary table displayed at the top, which displays information about the column’s metadata. Below this, there is a small table that gives the percentage and count of missing values in the data series. Note that with some columns, such as “days_to_death”, a missing value indicates that the patient had not died at the time the data snapshot was taken. The row labeled “column name” can be clicked in order to copy the standard name of the column for use outside of the interface.
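The missing-value figures in that table amount to a simple count and ratio. A minimal sketch, assuming missing entries are represented as Missing[...] expressions (the documentation does not state the representation used by the paclet) and using made-up values:

```wolfram
(* Hypothetical column of values; Missing[...] marks an absent entry *)
column = {120, Missing["NotAvailable"], 340, Missing["NotAvailable"], 95};

missingCount = Count[column, _Missing];
missingPercent = 100. * missingCount/Length[column];

{missingCount, missingPercent}   (* {2, 40.} *)
```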

The rest of the section has two different appearances, depending on whether the data is on a continuous numerical scale or is categorical.

The column inspection section for continuous numerical data shows details of the column, controls to restrict the data to within a certain range and a small plot.

For numerical data, a histogram will be displayed showing the numerical distribution of the data in the column. Below it, there are controls that are used to select only patients within a certain range. By default all patients are included, but clicking the radio button next to “Select values from” enables the slider, allowing the user to select only patients in a certain range. There are two controls on the slider: one that sets the minimum value required for the patient to be included in the dataset, and one that sets the maximum. These can be controlled by clicking and dragging them, by clicking on the range, or by typing numbers into the two boxes below that indicate the currently selected minimum and maximum values.

Note that although the histogram shows data in certain bins, the slider allows the minimum and maximum to be set to any valid value.

For categorical or qualitative data, the column inspection instead allows selection of specific values.

For categorical data, a list of the categories is displayed at the bottom of the sections. In order to restrict patients to certain categories, the user must select to restrict the data, then the checkboxes below will be enabled. Only patients with the category set to the selected checkboxes will be available after filtering. If there are too many different categories, the checkboxes will be disabled with a message.

To activate filtering, the “Update” button must be clicked. Only patients within the selected range or ticked checkboxes will be available in the data included in the inspection and export panes.

Selected columns/filters

A screenshot showing the section with a range of filters selected, including data with filters applied.

This section shows a summary of the currently selected columns and their filtering across all subcategories. Each selected column is displayed with a summary of the filters selected and the buttons “modify” and “delete”. Clicking “modify” shows the column and its currently selected filter in the “Filter column” section of the pane, while “delete” will instantly remove the column and its selected filters from the current selection.

To the right of the title, there is a “Clear selection” button. This will reset all filters for all columns and subcategories.

Apply filters

Click the "Apply filters" button to implement the filters and move to the data browser inspection pane.

If multiple filters are selected and there is no data available in the data inspection pane, then there may be no patients that match all selected filters.

Data inspection

The data inspection pane is used to inspect in more detail the columns selected in the “Column filtering” pane, displaying only patients that satisfy all filters imposed in the previous pane. This pane has the subcategory selector on the left-hand side, and on the right a popup menu to select columns, with the details of the column, its filters and its data below. This pane is only a data inspector; changes to the data must be made in the “Column filtering” pane.

Filtered data used in the inspection pane is stored separately from the downloaded data. Changing projects or data will not be reflected in the data inspection pane until the “Apply filters” button is clicked in the column filtering pane.

An example of the data inspection pane when inspecting a numerical column of data.

Subcategory selection

Column selection

The column selection popup menu is displayed at the top, and is used to switch between different columns in the selected subcategory. Only columns for the currently selected subcategory will be available. To switch between subcategories, the subcategory selection pane to the left must be used.

To the right of the popup menu is a “Save” button. When clicked, it will export the currently displayed column, complete with the current settings for the graph.

Column inspection

The column inspection pane shows more detailed information on the data included in the selected column. This pane shows only data for the patients that satisfied all the filters in the previous pane, therefore the graphs may look different to those in the “column filtering” pane, even when looking at columns without filters applied.

For all columns, there is a small table at the top summarising the selected column, its data type and the range of values selected in the “Column filtering” step. At the bottom is a detailed table on the column’s metadata, including a clickable label with a blue background that, when clicked, will copy the column’s standard name into the user’s clipboard. The visualisations in the middle will vary depending on whether the column contains numerical or categorical data.

The inspection pane for numerical data showing a histogram of the values, brief statistical analysis and a table of information about the selected column.

For numerical data, there will be a histogram, showing the data within the range selected by the current filtering. To the right of the graph, there is a summary of key statistics of the filtered column data, and below a toggle to switch the y-axis of the graph between displaying probability or count.

For multinomial data, a pie chart is displayed by default instead of a histogram, with a breakdown of the percentage of each possible category (in this case, alleles)

For categorical data, the graphic will be in the form of a pie chart by default, assuming there is not too much data. On the right, there is a table that breaks down the different categories into the percentage and number of entries, sorted from most to least common. Below the chart, there is an option that allows the user to switch between a percentage and a count. If the count is selected, then a bar chart will be displayed, with the categories displayed in descending order, as in the table.

The count bar chart options for multinomial data - bars are sorted from left to right in the same order as the breakdown table to the right of the graph.
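A breakdown of this kind can be reproduced outside the interface from an exported column. The sketch below uses made-up categorical values and only built-in functions; it is not the code the UI runs, just one way to obtain a table sorted from most to least common and matching charts.

```wolfram
(* Hypothetical categorical column *)
values = {"A", "B", "A", "C", "A", "B"};

(* Count each category and sort from most to least common *)
counts = Reverse[Sort[Counts[values]]];

(* Rows of {category, count, percentage}, in the same order *)
breakdown = KeyValueMap[{#1, #2, 100. * #2/Length[values]} &, counts]

(* Percentage view as a pie chart, count view as a bar chart *)
PieChart[Values[counts], ChartLabels -> Keys[counts]]
BarChart[Values[counts], ChartLabels -> Keys[counts]]
```

Because the bar chart is built from the already sorted counts, its bars appear in the same order as the rows of the breakdown.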

Data filtering

This button takes the user back to the “Column filtering” interface, in order to further refine the options.

Data export

This button takes the user to the “Data export” pane.

Export

The export pane is for exporting data restricted to the columns and filters set during column filtering; the whole data structure is automatically exported during “Data retrieval”. This pane allows you to select the export format. By default, data is exported as CSV and, because CSV does not support multiple sheets, one file is exported for each subcategory. There is also a short preview of the exported data structure.

Filtered data used in the export pane is stored separately from the downloaded data. Changing projects or data will not be reflected in the data export pane until the “Apply filters” button is clicked in the column filtering pane.

Export pane after data download and filtering has been completed.

Export options

The controls to customise export format, path and file name.

In this section, the user can select the export format, path and file name. The file name should not include the file extension, as this is set via the “Export format” button bar, which also controls how the data is formatted before export. For .csv files, multiple files will be exported, one for each subcategory. .xls files will use a different sheet for each subcategory, while .m (Mathematica) files will be exported as a list, with the data for each patient in an association.

The export path can be set either by typing into the export path box, or by hitting “Select directory” which will open a system dialogue to select the desired directory. If a non-existent directory is typed in the export path, one will be created automatically prior to export.

“Export file name” sets the name of the exported file, and is by default based on the date and time. It will automatically be updated to avoid overwriting previously saved data, but the user can also change it to a name of their choice.
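The CSV behaviour described in this section (one file per subcategory, a date-and-time based name, and a directory created when missing) can be sketched with built-in functions. Everything specific here is an assumption made for illustration: the sample tables, the directory name and the exact file name pattern are hypothetical, and the UI's real naming scheme is not documented in this section.

```wolfram
(* Hypothetical processed data: one table (header row plus rows) per subcategory *)
data = <|
  "Patient" -> {{"id", "age"}, {"P1", 45}, {"P2", 62}},
  "Sample" -> {{"id", "type"}, {"P1", "Tumor"}}
|>;

(* Create the export directory if it does not exist yet *)
exportDir = FileNameJoin[{$TemporaryDirectory, "tcga_export_sketch"}];
If[! DirectoryQ[exportDir], CreateDirectory[exportDir]];

(* Base name built from the current date and time (illustrative pattern) *)
baseName = DateString[{"Year", "Month", "Day", "_", "Hour", "Minute", "Second"}];

(* CSV has no sheets, so write one file per subcategory *)
files = KeyValueMap[
  Export[FileNameJoin[{exportDir, baseName <> "_" <> #1 <> ".csv"}], #2, "CSV"] &,
  data
]
```

The result is the list of written file paths, one per subcategory key.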

Export and associated buttons

Text will appear after export to indicate if the export was successful.

The export button causes the interface to export the file(s) according to the settings above. During export, a wheel will be displayed to indicate progress, but it can stop while Mathematica is still processing data. Please be patient during export: if a large amount of data is selected, it can take a long time. On successful export, the text “Export successful” will appear to the right of the buttons. In the case of a failure, the words “Export failed” will appear in red text.

The “Open data folder” button will open the location of the current export path in a system file browser.

File data preview

The file data preview shows a preview of the data structure as it will be exported, in a relevant structure to the selected export format. This has no operational effects on the export itself, and is only intended to give an idea of the data that is being exported. By default, only 10 patients will be shown.

Preview table of a csv or xls file.

When exporting a csv or xls file, the data preview will be in the format of a table for the specific subcategory being exported, with a dropdown menu to switch between subcategories. This popup menu represents either the content of a single file (.csv) or a single sheet (.xls).

Preview table of a Mathematica file export.

When exporting as a Mathematica (.m) file, the data structure will be displayed using the function Dataset’s default formatting, with the content of each row corresponding to the data for that patient.

Data is exported in a compressed format. To load it, import the file into Mathematica and uncompress it, placing the path of the exported file in place of "exportFilePath".
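The import step can be sketched as a round trip. This is a minimal illustration that assumes the exported .m file holds a single string produced by Compress; the temporary file and sample data below stand in for a real export, so with an actual file only the Import and Uncompress line is needed, with exportFilePath set to that file's path.

```wolfram
(* Stand-in for an exported file: write a compressed expression to a temporary .m file *)
exportFilePath = FileNameJoin[{$TemporaryDirectory, "tcga_sketch_export.m"}];
sample = {<|"id" -> "P1", "age" -> 45|>, <|"id" -> "P2", "age" -> 62|>};
Export[exportFilePath, Compress[sample]];

(* Import returns the compressed string; Uncompress restores the expression *)
data = Uncompress[Import[exportFilePath]];

(* A list of per-patient associations can be viewed as a Dataset *)
Dataset[data]
```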

Radiological images

Due to the size of the image database, radiological image files are not automatically downloaded during the initial data import. Instead, details and metadata referring to the images, as well as the images’ download paths, are obtained. The radiological images workflow consists of three panes: “Download images” downloads all images for a patient, “Progress report” tracks the import, and “Image viewer” imports and displays the patient’s images.

Download images

The “Download images” pane allows the user to select a single patient in order to download all their radiological images. If the user has not downloaded the “Radiological images” subcategory from the “Download parameters” pane, then this pane will not display any available patients.

The pane is broken into three major sections: on the left hand side, a patient selection with the option of inputting or selecting the patient, in the middle the summary of the currently selected patient, and on the right a short advanced options section to control the download location of the images.

Radiological images download pane when radiological image data has been downloaded via the download parameters.

Patient selection

At the top of the patient selection column there is an input field which will autocomplete to match any patients. This can be used to either select a patient, or search and browse a short list of options.

Below the input field is a table with selectable rows. Each row corresponds to one patient, and contains a summary of the number of studies the patient participated in and the number of series that are included for that patient. Only one patient may be selected at a time.

Patient summary

At the top of the patient summary section is the text “Patient summary for:”, followed by the patient currently selected on the left-hand side. Below this is a table summarising all the different series available for that patient, with the number of images in the series, the provided series description, the modality and the manufacturer of the machine used for the study.

Below the table, a “Total file size” text indicates how big the entire download of all images will be.

Progress report

The progress report is similar to the progress report for the data processing. At the top, a summary of the download progress is printed, with most recent messages on top. Some series contain a considerable amount of images, and therefore will take a while to download. Please pay attention to any messages starting “WARNING” or “ERROR”.

Below the progress section are the summary of data retrieval settings and the advanced options summary, which record the settings used when the download was started.

Screenshot showing in-progress download of radiological images.

Image viewer

The image viewer contains several different sections for selecting, browsing and inspecting different series of images. In the top left, there is a selection section that allows the user to switch between downloaded patients and their different series of images. In the bottom left, a grid of images will be displayed, allowing the user to browse through a side-by-side overview of the images in the selected series and click any for enlargement. On the right, a detailed inspection pane shows a large version of the currently selected image, with a detailed summary table below containing information about the metadata of the currently selected image. Scrolling up and down in this right hand panel will scroll between images.

Initially, the image viewer will only have the tables visible, and the user must select a patient and series and hit the button “Inspect series images” to import images.

Initial view of the image viewer after radiological images import is successful - only the patient and series inspector in the top left have any content.

Once the “Inspect series images” button has been pressed, then the images will be imported into Mathematica.

Image viewer after the images have been imported, displaying the current selected image series.

Patient and series selection

In the top left of the “Image viewer” pane is a pair of selectors for controlling which patient and series are being inspected. The left-hand column is the patient selection, with an input field that autocompletes only to valid patients, that is, those previously imported from the “Download images” pane. Each patient must be imported individually. Below the input field is a grid with selectable rows, which can also be used to select a patient.

On the right-hand side is a second table for the selection of the individual series to be inspected. Each row is selectable, and contains a summary of the series in question.

Once the patient and series to be inspected are selected, then the user should hit the “Import series images” button to import the images from the download folder into Mathematica. Depending on the number and size of the images, a spinning wheel will appear to indicate that importing is ongoing.

Image summary grid

The image summary grid contains a title at the top indicating the patient and series currently displayed, which may not match the selection in the top left. Below is a grid of all images in the selected patient’s series. These images can be scrolled up and down inside the pane. Clicking an image will cause it to become selected in the large view to the right.

Closeup of the image grid, showing the patient and series selected at the top and a series of interactable images that can be clicked to inspect in detail.

Detailed image view

On the right-hand side of the pane is a tall section showing the detail of a single image. At the top is a summary of the patient, series and the exact image file that is selected. Below this, the current image and its details are displayed. Images can be scrolled through using normal scrolling methods, automatically switching between different images. The table below displays various details about the image being displayed, and is automatically updated when switching images.

Below the large image is a “Save” button, which when clicked causes a system dialogue to be opened for saving the image to an output format of choice, by default, a .png.

The enlarged image with summary table and text in the detail view.