AntonAntonov/DataReshapers | Paclet Repository

Long form data transformation

Introduction	Long form applications
What kind of data?	Long form vs sparse array
What workflows considered?	References
Long form in more detail

Introduction

In this notebook we consider in more detail the (tabular) data transformation operations Long form.

Remark: This notebook is an expanded version of the notebook used in the Wolfram U session [AAv2].

Remark: We use several Wolfram Function Repository (WFR) functions:

ExampleDataset

and

RandomTabularDataset

Load the paclet

In[34]:=

Needs["AntonAntonov`DataReshapers`"]

What kind of data?

We concentrate on using data frames (datasets).

More precisely tabular data and simple collections of tabular data.

I.e. in WL -- datasets or lists or associations of datasets.

What workflows considered?

Here is a flow chart that shows the targeted workflows:

In[35]:=

plWorkflows=ImageCrop@Import["https://github.com/antononcube/ConversationalAgents/raw/master/ConceptualDiagrams/Tabular-data-transformation-workflows.jpg"]

Out[35]=

Long form in more detail

Remark: This “Illustrating example 2” from the tech note

Data transformation workflows

. (Also, session [AAv1].)

In[1]:=

Column names as data

Take a dataset with time series specified through separate columns and convert into easier to query dataset:

Get the Lake Mead levels data

In[36]:=

ExampleData[{"Statistics","LakeMeadLevels"},"LongDescription"]

Out[36]=

Elevation of Lake Mead at Hoover Dam, in feet, in years 1935-2009. The observation on January of 1935 was missing.

Get the Lake Mead levels dataset

In[37]:=

dsLakeData=ResourceFunction["ExampleDataset"][{"Statistics","LakeMeadLevels"}]

Out[37]=

Year	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep
1935	0.0	708.7	701.7	752.4	806.6	909.1	928.4	925.9	920.8
1936	907.9	908.4	906.9	922.2	982.4	1015.5	1020.4	1024.4	1024.6
1937	1022.2	1026.2	1031.0	1044.6	1078.7	1096.6	1102.8	1099.6	1097.6
1938	1095.0	1094.85	1100.2	1109.2	1134.4	1165.6	1173.5	1171.95	1173.8
1939	1165.4	1157.4	1158.2	1162.9	1176.2	1183.45	1181.3	1177.35	1178.95
1940	1166.9	1166.0	1164.65	1164.85	1175.35	1182.15	1179.8	1176.1	1174.45
1941	1167.1	1167.55	1170.35	1175.85	1201.25	1215.55	1220.4	1216.5	1209.85
1942	1184.8	1176.75	1171.25	1182.9	1195.5	1213.15	1213.2	1210.4	1205.0
1943	1182.45	1179.6	1177.0	1180.6	1188.75	1199.4	1201.55	1199.65	1195.7
1944	1170.9	1164.2	1158.4	1158.1	1173.2	1194.85	1199.55	1193.9	1187.1
1945	1161.25	1155.75	1149.6	1147.4	1161.3	1174.55	1180.7	1177.95	1177.95
1946	1157.55	1151.65	1147.45	1148.2	1155.15	1163.95	1163.3	1161.25	1157.65
1947	1142.6	1133.1	1135.38	1134.49	1151.93	1169.64	1177.18	1180.13	1178.02
1948	1164.52	1158.8	1154.51	1158.77	1175.95	1192.15	1190.32	1186.95	1180.82
1949	1158.04	1150.99	1147.17	1148.28	1164.51	1189.07	1196.61	1193.17	1186.86
1950	1163.4	1156.7	1151.75	1152.13	1158.57	1173.95	1177.15	1173.36	1168.77
1951	1151.47	1168.02	1144.07	1141.34	1146.16	1163.53	1168.19	1167.0	1163.75
1952	1149.06	1143.0	1133.99	1141.96	1171.53	1198.06	1199.88	1195.74	1190.0
1953	1162.56	1157.12	1152.42	1148.4	1147.33	1165.11	1165.87	1163.38	1157.67
1954	1141.44	1138.22	1134.08	1130.26	1130.76	1129.88	1126.25	1120.82	1115.39
rows 1–20 of 75 columns 1–10 of 13

Convert to long form

In[38]:=

dsLong=

LongFormDataset

[dsLakeData,"Year","VariablesTo""Month","ValuesTo""Elevation"]

Out[38]=

Year	Month	Elevation
1935	Apr	752.4
1935	Aug	925.9
1935	Dec	908.4
1935	Feb	708.7
1935	Jan	0.0
1935	Jul	928.4
1935	Jun	909.1
1935	Mar	701.7
1935	May	806.6
1935	Nov	908.3
1935	Oct	914.9
1935	Sep	920.8
1936	Apr	922.2
1936	Aug	1024.4
1936	Dec	1023.5
1936	Feb	908.4
1936	Jan	907.9
1936	Jul	1020.4
1936	Jun	1015.5
1936	Mar	906.9
rows 1–20 of 900

Modify month from string value to integer value

In[39]:=

dsLong=dsLong[All,Join[#,"Month"DateList[{#Month,{"MonthShort"}}]〚2〛]&]

Out[39]=

Year	Month	Elevation
1935	4	752.4
1935	8	925.9
1935	12	908.4
1935	2	708.7
1935	1	0.0
1935	7	928.4
1935	6	909.1
1935	3	701.7
1935	5	806.6
1935	11	908.3
1935	10	914.9
1935	9	920.8
1936	4	922.2
1936	8	1024.4
1936	12	1023.5
1936	2	908.4
1936	1	907.9
1936	7	1020.4
1936	6	1015.5
1936	3	906.9
rows 1–20 of 900

Group by year and show short form of the result

For each year find the corresponding time series (of lake levels)

Show a sample of the time series found above

Calendar time series

The previous time series are relative in time. What if we want to do calendar time series?

First we have to modify the month column to be "ObservationTime":

Split the long form dataset by year and make time series with the columns "ObservationTime" and "Elevation":

Plot the obtained time series:

Alternative solution without long form

(This sub-section was not discussed in Wolfram U recording of the “Q&A Introduction” session. It is given in this notebook in order to make a more convenient reference.)

We can apply the Split-Transform-Combine pattern directly without using long form:

Note that:

◼

In the transformation above we implicitly rely on the sorted order of the month name columns.

◼

Using Rest and First in the third argument given to GroupBy corresponds to the transformations to get the long form.

Long form applications

Many applications of Long form transformation come from the fact that metadata is converted to data

In[6]:=

Combinations of the heterogenous data

Using long forms makes easier the programmatic manipulation of heterogenous data.

Get datasets with different number of rows and columns that correspond to items and variables of different kinds:

Convert all datasets into long form datasets:

For each long form dataset change the automatic key column to include dataset's name:

Join all long form datasets into one dataset and show a sample:

Long form vs sparse array

Sparse matrices (sparse arrays) have representation very similar to that of long form. Here is a random sparse matrix:

Generate random (sparse) matrix

Matrix form of the random matrix

Matrix plot of the random matrix

Compare the array rules of that sparse matrix with its long form representation.

Here are the array rules:

Here is tabulation of the array rules:

Here is the long form representation:

Sort the long form dataset:

Cross tabulate the long form

Matrix form of the corresponding sparse matrix:

Here is comparison of the representations of sparse arrays vs long form:

References