AntonAntonov/DataReshapers | Paclet Repository

Data transformation workflows

Introduction	Illustrating example
All questions	Illustrating example 2
What kind of data?	References
What workflows considered?

Introduction

In this notebook we introduce (Tabular) Data Transformation Workflows through Questions and Answers.

Remark: This notebook is an expanded version of the notebook used in the Wolfram U session [AAv1].

Remark: We use several Wolfram Function Repository (WFR) functions:

ExampleDataset

and

RandomTabularDataset

Load the paclet

In[7]:=

Needs["AntonAntonov`DataReshapers`"]

All questions

How exciting is this?

◼

Not, but YMMV...

What are the simplifying assumptions for the family of workflows considered in this tutorial?

◼

We do not consider hard to generalize- or too specialized transformations.

◼

For example, we do not consider:

◼

Data gathering/harvesting

◼

Data ingestion by parsing of, say, XML structures

◼

Date-time specific transformations

◼

Natural text specific transformations

◼

Etc.

What are the simplifying assumptions about the targeted data and transformations?

◼

Tabular data and collections of tabular data. (E.g. lists of datasets.)

◼

Transformation workflows that can be expressed with a certain "standard" or "well-known" subset of SQL.

◼

The flow chart given below illustrates well the targeted DT.

Do these workflows apply to other programming languages and/or data systems?

◼

Yes, with the presented DT know-how we target multiple "data science" programming languages.

Are additional packages needed to run the codes?

◼

Yes and no: depends on the target data transformation system.

◼

Sometimes the WL code uses Wolfram Function Repository functions.

Are there similar tutorials dedicated to other programming languages or packages?

◼

Yes, many. Both for WL and the rest (Julia/Python/R.)

◼

That said, in this tutorial we use a certain simplified and streamlined data-and-transformations model
that allows the development of cross-system, transferable know-how and workflows.

What are the most important concepts for a newcomer to data wrangling?

Cross tabulation (or contingency matrices)

(Inner) joins

The so called Split-apply-combine pattern

Long form and wide form (or data pivoting)

How do the considered DT workflows relate to well known Machine Learning (ML) workflows?

◼

Generally speaking, data has to be transformed into formats convenient for the application of ML algorithms.

◼

How much transformation is needed depends a lot on the host system or targeted ML package(s).

◼

For example, WL has extensive ML data on-boarding functions.

◼

("Encoders" and "decoders" that make the use of Neural Networks more streamlined.)

◼

Hence, for using ML in WL less DT is needed.

How much is WL used?

◼

Approximately 80% we use WL; for illustration purposes we show some of the workflows in other languages.

◼

Run within the same Mathematica presentation notebooks.

What kind of data?

We concentrate on using data frames (datasets).

More precisely tabular data and simple collections of tabular data.

I.e. in WL -- datasets or lists or associations of datasets.

What workflows considered?

Here is a flow chart that shows the targeted workflows:

Out[8]=

Illustrating example

Get the Titanic dataset

In[9]:=

dfTitanic=ResourceFunction["ExampleDataset"][{"MachineLearning","Titanic"}]

Out[9]=

passenger class	passenger age	passenger sex	passenger survival
1st	29.0	female	survived
1st	0.9167	male	survived
1st	2.0	female	died
1st	30.0	male	died
1st	25.0	female	died
1st	48.0	male	survived
1st	63.0	female	survived
1st	39.0	male	died
1st	53.0	female	survived
1st	71.0	male	died
1st	47.0	male	died
1st	18.0	female	survived
1st	24.0	female	survived
1st	26.0	female	survived
1st	80.0	male	survived
1st	—	male	died
1st	24.0	male	died
1st	50.0	female	survived
1st	32.0	female	survived
1st	36.0	male	died
rows 1–20 of 1309

Here is the summary of the dataset

In[25]:=

RecordsSummary

[dfTitanic]

Out[25]=



1 passenger class

3rd	709
1st	323
2nd	277

2 passenger age

Min	0.1667
1st Qu	21.
Median	28.
Mean	29.8811
3rd Qu	39.
Max	80.
Missing[___]	263

3 passenger sex

male	843
female	466

4 passenger survival

died	809
survived	500



Group by passenger class and find the number of records of each group

Group by passenger class and cross tabulate the records of each group

Cross tabulate passenger class vs passenger sex

Group by passenger class then for each class-group group by passenger sex and find number of records

Illustrating example 2

In[1]:=

Column names as data

Take a dataset with time series specified through separate columns and convert into easier to query dataset:

Get the Lake Mead levels data

Get the Lake Mead levels dataset

Convert to long form

Modify month from string value to integer value

For each year find the corresponding time series (of lake levels)

Show a sample of the time series found above

Alternative solution without long form

(This sub-section was not discussed in Wolfram U recording of the “Q&A Introduction” session. It is given in this notebook in order to make a more convenient reference.)

We can apply the Split-Transform-Combine pattern directly without using long form:

Note that:

◼

In the transformation above we implicitly rely on the sorted order of the month name columns.

◼

Using Rest and First in the third argument given to GroupBy corresponds to the transformations to get the long form.

References