Wolfram Language Paclet Repository

Community-contributed installable additions to the Wolfram Language


DataReshapers

Guides

  • Data reshaping functions

Tech Notes

  • Data transformation workflows
  • Long form data transformation
  • Wide form data transformation

Symbols

  • CrossTabulate
  • CrossTabulationMatrixQ
  • CrossTensorate
  • CrossTensorateSplit
  • DatasetToMatrix
  • GridTableForm
  • LongFormDataset
  • PivotLonger
  • RecordsSummary
  • RecordsToLongFormDataset
  • RecordsToWideFormDataset
  • SeparateColumn
  • ToAutomaticKeysAssociation
  • TypeOfDataToBeReshaped
  • WideFormDataset
Data transformation workflows

Contents

  • Introduction
  • All questions
  • What kind of data?
  • What workflows considered?
  • Illustrating example
  • Illustrating example 2
  • References
Introduction
In this notebook we introduce (Tabular) Data Transformation Workflows through Questions and Answers.
Remark: This notebook is an expanded version of the notebook used in the Wolfram U session [AAv1].
Remark: We use several Wolfram Function Repository (WFR) functions: ExampleDataset and RandomTabularDataset.
Load the paclet
In[7]:=
Needs["AntonAntonov`DataReshapers`"]
All questions

  • How exciting is this?
    ◼ Not, but YMMV...
  • What are the simplifying assumptions for the family of workflows considered in this tutorial?
    ◼ We do not consider hard-to-generalize or overly specialized transformations.
    ◼ For example, we do not consider:
      ◼ Data gathering/harvesting
      ◼ Data ingestion by parsing of, say, XML structures
      ◼ Date-time-specific transformations
      ◼ Natural-text-specific transformations
      ◼ Etc.
  • What are the simplifying assumptions about the targeted data and transformations?
    ◼ Tabular data and collections of tabular data (e.g. lists of datasets).
    ◼ Transformation workflows that can be expressed with a certain "standard" or "well-known" subset of SQL.
    ◼ The flow chart given below illustrates the targeted data transformations well.
  • Do these workflows apply to other programming languages and/or data systems?
    ◼ Yes; with the presented data transformation know-how we target multiple "data science" programming languages.
  • Are additional packages needed to run the code?
    ◼ Yes and no: it depends on the target data transformation system.
    ◼ Sometimes the WL code uses Wolfram Function Repository functions.
  • Are there similar tutorials dedicated to other programming languages or packages?
    ◼ Yes, many, both for WL and for the rest (Julia/Python/R).
    ◼ That said, in this tutorial we use a simplified and streamlined data-and-transformations model that allows the development of cross-system, transferable know-how and workflows.
  • What are the most important concepts for a newcomer to data wrangling?
    1. Cross tabulation (or contingency matrices)
    2. (Inner) joins
    3. The so-called split-apply-combine pattern
    4. Long form and wide form (or data pivoting)
  • How do the considered data transformation workflows relate to well-known Machine Learning (ML) workflows?
    ◼ Generally speaking, data has to be transformed into formats convenient for the application of ML algorithms.
    ◼ How much transformation is needed depends a lot on the host system or the targeted ML package(s).
    ◼ For example, WL has extensive ML data on-boarding functions ("encoders" and "decoders" that make the use of neural networks more streamlined).
    ◼ Hence, less data transformation is needed when using ML in WL.
  • How much is WL used?
    ◼ Approximately 80% of the time we use WL; for illustration purposes we show some of the workflows in other languages, run within the same Mathematica presentation notebooks.
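The four concepts listed above can be tried out on a tiny table. The following is a minimal sketch using this paclet's CrossTabulate and LongFormDataset together with built-in dataset queries; the toy dataset ds, its column names, and the second join table are invented for illustration:

```wolfram
(* Toy dataset; the columns "group", "type", "value" are made up *)
ds = Dataset[{
    <|"group" -> "a", "type" -> "x", "value" -> 1|>,
    <|"group" -> "a", "type" -> "y", "value" -> 2|>,
    <|"group" -> "b", "type" -> "x", "value" -> 3|>,
    <|"group" -> "b", "type" -> "x", "value" -> 4|>}];

(* 1. Cross tabulation: contingency counts of "group" vs "type" *)
CrossTabulate[ds[All, {"group", "type"}]]

(* 2. Inner join on the "group" column (built-in JoinAcross) *)
JoinAcross[Normal[ds], {<|"group" -> "a", "label" -> "A"|>}, "group"]

(* 3. Split-apply-combine: split by "group", apply Total to "value" *)
ds[GroupBy["group"], Total, "value"]

(* 4. Wide form to long form (data pivoting) *)
LongFormDataset[ds]
```

The dataset-query form ds[GroupBy["group"], Total, "value"] is the standard WL idiom for the split-apply-combine pattern.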
What kind of data?

We concentrate on using data frames (datasets). More precisely, tabular data and simple collections of tabular data, i.e. in WL: datasets, or lists or associations of datasets.

What workflows considered?

Here is a flow chart that shows the targeted workflows:
Out[8]=
[Flow chart: the tabular data transformation workflows targeted in this tutorial]
Illustrating example

Get the Titanic dataset:
In[9]:=
dfTitanic = ResourceFunction["ExampleDataset"][{"MachineLearning", "Titanic"}]
Out[9]=

passenger class | passenger age | passenger sex | passenger survival
1st             | 29.0          | female        | survived
1st             | 0.9167        | male          | survived
1st             | 2.0           | female        | died
1st             | 30.0          | male          | died
1st             | 25.0          | female        | died
1st             | 48.0          | male          | survived
1st             | 63.0          | female        | survived
1st             | 39.0          | male          | died
1st             | 53.0          | female        | survived
1st             | 71.0          | male          | died
1st             | 47.0          | male          | died
1st             | 18.0          | female        | survived
1st             | 24.0          | female        | survived
1st             | 26.0          | female        | survived
1st             | 80.0          | male          | survived
1st             | —             | male          | died
1st             | 24.0          | male          | died
1st             | 50.0          | female        | survived
1st             | 32.0          | female        | survived
1st             | 36.0          | male          | died

rows 1–20 of 1309
Here is the summary of the dataset:
In[25]:=
RecordsSummary[dfTitanic]
Out[25]=

1 passenger class: 3rd 709, 1st 323, 2nd 277
2 passenger age: Min 0.1667, 1st Qu 21., Median 28., Mean 29.8811, 3rd Qu 39., Max 80., Missing[___] 263
3 passenger sex: male 843, female 466
4 passenger survival: died 809, survived 500
  • Group by passenger class and find the number of records in each group
  • Group by passenger class and cross tabulate the records of each group
  • Cross tabulate passenger class vs passenger sex
  • Group by passenger class, then for each class group, group by passenger sex and find the number of records
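The four tasks above can be sketched as follows, assuming dfTitanic as loaded earlier; CrossTabulate comes from this paclet, and the per-group cross tabulation shown here (sex vs survival within each class) is one illustrative choice of columns:

```wolfram
(* Group by passenger class and count the records in each group *)
dfTitanic[GroupBy["passenger class"], Length]

(* Group by passenger class and cross tabulate each group's records *)
GroupBy[Normal@dfTitanic, #["passenger class"] &,
 CrossTabulate[Dataset[#][All, {"passenger sex", "passenger survival"}]] &]

(* Cross tabulate passenger class vs passenger sex *)
CrossTabulate[dfTitanic[All, {"passenger class", "passenger sex"}]]

(* Group by class, then within each class group by sex and count *)
dfTitanic[GroupBy["passenger class"], GroupBy["passenger sex"], Length]
```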
Illustrating example 2

Column names as data

Take a dataset with time series specified through separate columns and convert it into a dataset that is easier to query:
  • Get the Lake Mead levels dataset
  • Convert to long form
  • Modify month from string value to integer value
  • For each year find the corresponding time series (of lake levels)
  • Show a sample of the time series found above
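The steps above can be sketched as follows. This is not the session's exact code: the variable dsLakeMead, the "Year" column, the month-name columns, and the "Variable"/"Value" output columns of PivotLonger are all assumptions about the data's shape and the paclet's long-form convention:

```wolfram
(* Assumed shape: one "Year" column plus one column per month name *)
monthNames = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};

(* Convert to long form: month columns become (Variable, Value) pairs *)
dsLong = PivotLonger[dsLakeMead, monthNames];

(* Modify month from string value to integer value *)
dsLong2 = dsLong[All,
   Append[#, "Month" -> First@FirstPosition[monthNames, #Variable]] &];

(* For each year, the corresponding time series of lake levels *)
tsLevels = dsLong2[GroupBy["Year"],
   TimeSeries[{#Month, #Value} & /@ SortBy[Normal[#], #Month &]] &];
```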

Alternative solution without long form

(This subsection was not discussed in the Wolfram U recording of the "Q&A Introduction" session; it is included in this notebook for more convenient reference.)
We can apply the Split-Transform-Combine pattern directly, without using long form.
Note that:
    ◼ In the transformation above we implicitly rely on the sorted order of the month-name columns.
    ◼ Using Rest and First in the third argument given to GroupBy corresponds to the transformations used to get the long form.
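A sketch of that direct Split-Transform-Combine approach, again under the assumed "Year" plus month-name column shape and the hypothetical dsLakeMead variable; First picks out each record's year value and Rest its twelve month levels:

```wolfram
(* Split the records by year, transform each record's month columns
   into a time series, and combine into an association of TimeSeries *)
tsLevels = GroupBy[Normal[dsLakeMead],
   First,                                   (* split: the "Year" value  *)
   TimeSeries[Transpose[{Range[12], Values[Rest[First[#]]]}]] &];
```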
References

    © 2025 Wolfram. All rights reserved.
