What are the simplifying assumptions for the family of workflows considered in this tutorial?
◼ We do not consider transformations that are hard to generalize or that are too specialized.
◼ For example, we do not consider:
◼ Data gathering/harvesting
◼ Data ingestion via parsing of, say, XML structures
◼ Date-time-specific transformations
◼ Natural-language-text-specific transformations
◼ Etc.
What are the simplifying assumptions about the targeted data and transformations?
◼ Tabular data and collections of tabular data (e.g. lists of datasets).
◼ Transformation workflows that can be expressed with a certain "standard" or "well-known" subset of SQL.
◼ The flow chart given below illustrates the targeted Data Transformation (DT) workflows.
Do these workflows apply to other programming languages and/or data systems?
◼ Yes; with the presented DT know-how we target multiple "data science" programming languages.
Are additional packages needed to run the code?
◼ Yes and no: it depends on the target data transformation system.
◼ Sometimes the WL code uses Wolfram Function Repository (WFR) functions.
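For instance, here is a minimal sketch of fetching and applying a WFR function; the resource name "RecordsSummary" is used as an assumed example:

(* Fetch a function from the Wolfram Function Repository and apply it *)
(* to a toy dataset; "RecordsSummary" is an assumed example resource. *)
ResourceFunction["RecordsSummary"][
  Dataset[{<|"a" -> 1, "b" -> 2|>, <|"a" -> 3, "b" -> 4|>}]]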
Are there similar tutorials dedicated to other programming languages or packages?
◼ Yes, many, both for WL and for other languages (Julia, Python, R).
◼ That said, in this tutorial we use a certain simplified and streamlined data-and-transformations model that allows the development of cross-system, transferable know-how and workflows.
What are the most important concepts for a newcomer to data wrangling?
1. Cross tabulation (or contingency matrices)
2. (Inner) joins
3. The so-called Split-apply-combine pattern
4. Long form and wide form (or data pivoting)
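For instance, here is a minimal sketch of concept 2 (an inner join) using WL's JoinAcross on made-up toy datasets; the other concepts are illustrated in the examples further below:

dsA = Dataset[{<|"id" -> 1, "city" -> "Boston"|>, <|"id" -> 2, "city" -> "Austin"|>}];
dsB = Dataset[{<|"id" -> 1, "sales" -> 100|>, <|"id" -> 3, "sales" -> 50|>}];
(* JoinAcross does an inner join by default: only "id" 1 occurs in both. *)
JoinAcross[dsA, dsB, "id"]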
How do the considered DT workflows relate to well known Machine Learning (ML) workflows?
◼ Generally speaking, data has to be transformed into formats convenient for the application of ML algorithms.
◼ How much transformation is needed depends a lot on the host system or the targeted ML package(s).
◼ For example, WL has extensive ML data on-boarding functions.
◼ ("Encoders" and "decoders" that make the use of neural networks more streamlined; see the sketch below.)
◼ Hence, less DT is needed when using ML in WL.
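For example, a minimal sketch with NetEncoder and NetDecoder on made-up class labels:

(* An "encoder": class label -> integer index. *)
enc = NetEncoder[{"Class", {"setosa", "versicolor", "virginica"}}];
enc["versicolor"]  (* 2 *)
(* A "decoder": probability vector -> most probable class label. *)
dec = NetDecoder[{"Class", {"setosa", "versicolor", "virginica"}}];
dec[{0.1, 0.7, 0.2}]  (* "versicolor" *)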
How much is WL used?
◼ We use WL approximately 80% of the time; for illustration purposes we show some of the workflows in other languages.
◼ Those are run within the same Mathematica presentation notebooks.
What kind of data?
We concentrate on using data frames (datasets). More precisely, tabular data and simple collections of tabular data; i.e., in WL, datasets or lists or associations of datasets.
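For example, a toy association of datasets (keys and values made up for illustration):

<|"train" -> Dataset[{<|"x" -> 1, "y" -> 2|>}],
  "test" -> Dataset[{<|"x" -> 3, "y" -> 4|>}]|>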
What workflows are considered?
Here is a flow chart that shows the targeted workflows:
Group by passenger class and find the number of records of each group
Group by passenger class and cross tabulate the records of each group
Cross tabulate passenger class vs passenger sex
Group by passenger class then for each class-group group by passenger sex and find number of records
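Here is a minimal sketch of the four tasks above, assuming the built-in Titanic example dataset with columns "class", "sex", and "survived" (the variable names are ours):

titanic = ExampleData[{"Dataset", "Titanic"}];
(* 1. Number of records per passenger class. *)
titanic[GroupBy["class"], Length]
(* 2. Per class-group, cross tabulate two of the remaining columns. *)
titanic[GroupBy["class"], Counts, {"sex", "survived"}]
(* 3. Cross tabulate passenger class vs passenger sex. *)
titanic[GroupBy["class"], Counts, "sex"]
(* 4. Group by class, then by sex, and count the records (nested association). *)
GroupBy[Normal[titanic], {Key["class"], Key["sex"]}, Length]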
Illustrating example 2
Column names as data
Take a dataset with time series specified through separate columns and convert it into an easier-to-query dataset:
Get the Lake Mead levels data
Get the Lake Mead levels dataset
Convert to long form
Modify month from string value to integer value
For each year find the corresponding time series (of lake levels)
Show a sample of the time series found above
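Since the actual Lake Mead data acquisition is not reproduced here, the following is a minimal sketch of the steps above on a made-up stand-in dataset (two years, two months; all names and values are assumptions):

(* Toy stand-in: one row per year, one column per month name. *)
dsMead = Dataset[{
    <|"Year" -> 2019, "Jan" -> 1084.2, "Feb" -> 1086.5|>,
    <|"Year" -> 2020, "Jan" -> 1090.7, "Feb" -> 1092.1|>}];
(* Convert to long form: one row per (year, month, level) triple. *)
dsLong = Dataset[Flatten @ Normal @ dsMead[All,
     Function[row, KeyValueMap[
       <|"Year" -> row["Year"], "Month" -> #1, "Level" -> #2|> &,
       KeyDrop[row, "Year"]]]]];
(* Modify month from string value to integer value. *)
monthIndex = <|"Jan" -> 1, "Feb" -> 2|>;
dsLong = dsLong[All, Append[#, "Month" -> monthIndex[#Month]] &];
(* For each year find the corresponding time series of lake levels. *)
tsByYear = dsLong[GroupBy["Year"], TimeSeries[{#Month, #Level} & /@ #] &];
(* Show a sample of the time series found above. *)
tsByYear[2019]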
Alternative solution without long form
(This sub-section was not discussed in the Wolfram U recording of the “Q&A Introduction” session. It is given in this notebook in order to provide a more convenient reference.)
We can apply the Split-apply-combine pattern directly without using long form:
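Here is a minimal sketch of this direct approach, reusing the toy dsMead dataset from above:

(* Split the rows by year (the first column value), then turn the *)
(* remaining month columns into a time series; the month order is *)
(* taken as given, i.e. assumed to be sorted. *)
GroupBy[Normal[dsMead], First, TimeSeries[Rest[Values[First[#]]]] &]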
Note that:
◼ In the transformation above we implicitly rely on the sorted order of the month-name columns.
◼ Using Rest and First in the third argument given to GroupBy corresponds to the transformations used to obtain the long form.