Function Repository Resource:

LongFormDataset

Source Notebook

Converts tabular data into a long-form dataset

Contributed by: Anton Antonov

ResourceFunction["LongFormDataset"][data]

converts the argument data into a long-form dataset with columns "AutomaticKey","Variable" and "Value".

ResourceFunction["LongFormDataset"][data, idcols]

converts the argument data into a long-form dataset with columns idcols and columns "Variable" and "Value" derived from the columns of data that are not in idcols.

ResourceFunction["LongFormDataset"][data, idcols, valcols]

converts the argument data into a long-form dataset with columns idcols and columns "Variable" and "Value" derived from valcols.

Details and Options

The so called "long form" and "wide form" conversions of tabular data are fundamental "data wrangling" operations.
In long form conversion, the columns of the original dataset are seen in three groups: identifier columns, variable columns and ignored columns.
The conversion into long form converts the set of specified variable columns into two columns: one with variable names and the other with variable values.
The first argument of ResourceFunction["LongFormDataset"] can be a two-dimensional (non-hierarchical) dataset, a two dimensional full array (matrix), a list of associations or an association of associations.
The conversion of a dataset into a long form allows column names (of variables) to be treated as data.
If no identifier columns are given, ResourceFunction["LongFormDataset"] uses the order indexes of the rows as identifiers.
ResourceFunction["LongFormDataset"] takes the following options:
"AutomaticKeysTo""AutomaticKey"the column name if automatic keys are used
"VariablesTo""Variable"the column name of the column that has the variable names as values
"ValuesTo""Value"the column name of the column that has the variable values as values

Examples

Basic Examples (4) 

Here is a simple dataset:

In[1]:=
dataset = Block[{ds, k}, ds = Dataset[{
Association["a" -> "x", "b" -> 5], 
Association["a" -> "y", "b" -> 6], 
Association["a" -> "x", "b" -> 10], 
Association["a" -> "y", "b" -> 100], 
Association["a" -> "z", "b" -> Missing[]]}]; k = 1; ds[
   All, Prepend[#, "ID" -> Increment[k]]& ]]
Out[1]=

Here is a long form conversion with a specified identifier column and variable columns:

In[2]:=
ResourceFunction["LongFormDataset"][dataset, "ID", {"a", "b"}]
Out[2]=

Here is another long form derived with an automatically determined identifier column and variable columns:

In[3]:=
ResourceFunction["LongFormDataset"][dataset]
Out[3]=

Here is a conversion to long form with identifier columns "a" and "b":

In[4]:=
ResourceFunction["LongFormDataset"][dataset, {"a", "b"}]
Out[4]=

Scope (5) 

The first argument can be a matrix:

In[5]:=
mat = RandomReal[{100, 200}, {3, 6}];
ResourceFunction["LongFormDataset"][mat]
Out[2]=

The first argument can be a list of associations:

In[6]:=
alist = AssociationThread[Range[Length[#]], #] & /@ mat;
ResourceFunction["LongFormDataset"][alist]
Out[2]=

The second and third arguments can be Automatic:

In[7]:=
ResourceFunction["LongFormDataset"][dataset, Automatic, {"a", "b"}]
Out[7]=
In[8]:=
ResourceFunction["LongFormDataset"][dataset, "a", Automatic]
Out[8]=

The options "IdentifierColumns" and "VariableColumns" can be used instead of a second argument and third argument respectively:

In[9]:=
ResourceFunction["LongFormDataset"][dataset, "IdentifierColumns" -> "ID", "VariableColumns" -> {"a", "b"}]
Out[9]=

The second and third arguments can be column indexes:

In[10]:=
ResourceFunction["LongFormDataset"][dataset, 2, {1, 3}]
Out[10]=

Automatic keys are derived if the dataset argument has row keys:

In[11]:=
SeedRandom[12];
dataset2 = Dataset[AssociationThread[RandomWord[Length[#]], Normal[#]] &@
   dataset]
Out[8]=

Here is an automatic conversion to long form:

In[12]:=
ResourceFunction["LongFormDataset"][dataset2]
Out[12]=

The row keys are ignored if the identifier column is specified:

In[13]:=
ResourceFunction["LongFormDataset"][dataset2, "ID"]
Out[13]=

Options (3) 

AutomaticKeysTo (1) 

The option "AutomaticKeysTo" can be used to specify the name of the column that corresponds to the automatically determined identifier:

In[14]:=
dataset = Block[{ds, k}, ds = Dataset[{
Association["a" -> "x", "b" -> 5], 
Association["a" -> "y", "b" -> 6], 
Association["a" -> "x", "b" -> 10], 
Association["a" -> "y", "b" -> 100], 
Association["a" -> "z", "b" -> Missing[]]}]; k = 1; ds[
    All, Prepend[#, "ID" -> Increment[k]]& ]];
ResourceFunction["LongFormDataset"][dataset, "AutomaticKeysTo" -> "SpecialID"]
Out[8]=

VariablesTo (1) 

The option "VariablesTo" specifies the name of long form's column that has, as values, the names of the variable columns:

In[15]:=
dataset = Block[{ds, k}, ds = Dataset[{
Association["a" -> "x", "b" -> 5], 
Association["a" -> "y", "b" -> 6], 
Association["a" -> "x", "b" -> 10], 
Association["a" -> "y", "b" -> 100], 
Association["a" -> "z", "b" -> Missing[]]}]; k = 1; ds[
    All, Prepend[#, "ID" -> Increment[k]]& ]];
ResourceFunction["LongFormDataset"][dataset, "VariablesTo" -> "VAR"]
Out[8]=

ValuesTo (1) 

The option "ValuesTo" specifies the name of the column that has the values in the variable columns:

In[16]:=
dataset = Block[{ds, k}, ds = Dataset[{
Association["a" -> "x", "b" -> 5], 
Association["a" -> "y", "b" -> 6], 
Association["a" -> "x", "b" -> 10], 
Association["a" -> "y", "b" -> 100], 
Association["a" -> "z", "b" -> Missing[]]}]; k = 1; ds[
    All, Prepend[#, "ID" -> Increment[k]]& ]];
ResourceFunction["LongFormDataset"][dataset, "ValuesTo" -> "VAL"]
Out[8]=

Applications (9) 

Column names as data (5) 

The main advantage of the long form conversion is that variable column names become data, that is, values in a certain column. This can be demonstrated by making an association of time series objects for the rows of multiple time series data given in wide form.

Here is a randomly generated table (dataset) with multiple time series data:

In[17]:=
dsTSWide = Dataset[
Association[
  2018 -> Association[
    "Apr" -> 83.49, "Aug" -> 0, "Dec" -> 84.59, "Feb" -> 0, "Jan" -> 0, "Jul" -> 0, "Jun" -> 93.32000000000001, "Mar" -> 215.34, "May" -> 89.65, "Nov" -> 168.51999999999998`, "Oct" -> 0, "Sep" -> 116.71000000000001`], 2019 -> Association[
    "Apr" -> 23.61, "Aug" -> 0, "Dec" -> 133.88, "Feb" -> 12.74, "Jan" -> 152.55, "Jul" -> 0, "Jun" -> 93.34, "Mar" -> 59.69, "May" -> 126.05000000000001`, "Nov" -> 94.46000000000001, "Oct" -> 69.95, "Sep" -> 37.27], 2020 -> Association[
    "Apr" -> 94.48, "Aug" -> 38.480000000000004`, "Dec" -> 77.44, "Feb" -> 12.64, "Jan" -> 33.86, "Jul" -> 141.04000000000002`, "Jun" -> 50.36, "Mar" -> 0, "May" -> 0, "Nov" -> 12.65, "Oct" -> 0, "Sep" -> 26.13]]]
Out[17]=

Each row of that random time series dataset corresponds to a yearly sequence of monthly values. The interpretation of the dataset reveals heterogeneous semantics of its values: (1) each row corresponds to a year specified as a key and (2) the columns have short month names.

Convert the dataset into long form using the row keys as identifiers:

In[18]:=
dsLong = ResourceFunction["LongFormDataset"][dsTSWide, "AutomaticKeysTo" -> "Year", "VariablesTo" -> "Month"]
Out[18]=

Add the column "ObservationTime" derived from the columns "Year" and "Month":

In[19]:=
dsLong = dsLong[All, Join[#, <|"ObservationTime" -> AbsoluteTime[ToString[#Year] <> "-" <> #Month]|>] &]
Out[19]=

Split the long form dataset by year and make time series with the columns "ObservationTime" and "Value":

In[20]:=
timeSeriesCollection = GroupBy[Normal@dsLong, #Year &, TimeSeries[Values /@ #[[All, {"ObservationTime", "Value"}]]] &]
Out[20]=

Plot the obtained time series:

In[21]:=
DateListPlot /@ timeSeriesCollection
Out[21]=

Combinations of the heterogenous data (4) 

Using long forms simplifies the programmatic manipulation of heterogenous data.

Get datasets with different numbers of rows and columns that correspond to items and variables of different kinds:

In[22]:=
datasets = Association@
   Map[# -> ResourceFunction[
        "ExampleDataSpecToDataset"][{"Statistics", #}] &, \
{"AnimalWeights", "EmployeeAttitude", "OrangeTreeGrowth"}];
Magnify[#, 0.6] & /@ datasets
Out[18]=

Convert all datasets into long form datasets:

In[23]:=
datasets = ResourceFunction["LongFormDataset"] /@ datasets;
Magnify[#, 0.6] & /@ datasets
Out[20]=

For each long form dataset, change the automatic key column to include the name of that dataset:

In[24]:=
datasets = KeyValueMap[
   Function[{k, v}, v[All, Prepend[#, "AutomaticKey" -> k <> "-" <> ToString[#AutomaticKey]] &]], datasets];
Magnify[#, 0.6] & /@ datasets
Out[25]=

Join all long form datasets into one dataset and show a sample:

In[26]:=
SeedRandom[321];
Sort@RandomSample[Join @@ datasets, 8]
Out[27]=

Properties and Relations (2) 

Here is a random sparse matrix:

In[28]:=
SeedRandom[12];
smat = SparseArray[
  RandomReal[{0, 1}, {4, 5}] /. (x_?NumberQ /; x < 0.5) -> 0]
Out[29]=

Compare the array rules of this sparse matrix with its long-form representation:

In[30]:=
ArrayRules[smat]
Out[30]=

Here is the long-form representation:

In[31]:=
ResourceFunction["LongFormDataset"][Normal@smat][Select[#Value > 0 &]]
Out[31]=

Note that since using Normal on the sparse matrix makes a dense matrix with zeroes, those zeroes are filtered out from the displayed long form.


Cross tabulation can be seen under certain conditions as conversion to wide form. Therefore, certain long form conversions can be reverted into the original forms using cross tabulation.

Here is an example using the resource function over a subset of the Lake Mead elevation levels data:

In[32]:=
dsLakeData = ResourceFunction["ExampleDataSpecToDataset"][{"Statistics", "LakeMeadLevels"}][[1 ;; 4, 1 ;; 4]];
dsLong = ResourceFunction["LongFormDataset"][dsLakeData, "Year", "VariablesTo" -> "Month", "ValuesTo" -> "Elevation"];
dsCrossTbl = ResourceFunction["CrossTabulate"][dsLong];
In[33]:=
AssociationThread[{"Orignal", "Long form", "Cross tabulation"}, Magnify[#, 0.8] & /@ {dsLakeData, dsLong, dsCrossTbl}]
Out[33]=

Possible Issues (2) 

If the first argument is a dataset without column names, the second and third arguments are expected to consist of column indices.

Here is a correct specification:

In[34]:=
ResourceFunction["LongFormDataset"][dataset[Values], 2, {1, 3}]
Out[34]=

Here are some incorrect specifications:

In[35]:=
ResourceFunction["LongFormDataset"][dataset[Values], 2, {1, 5}]
Out[35]=
In[36]:=
ResourceFunction["LongFormDataset"][dataset[Values], "a"]
Out[36]=

If the identifier columns produce multiple corresponding rows for a given combination of identifier values, then only the last row is put in the long-form result:

In[37]:=
dataset
Out[37]=

The row <|"ID"1, "a" x, "b" 5|> is lost:

In[38]:=
ResourceFunction["LongFormDataset"][dataset, "a"]
Out[38]=

Publisher

Anton Antonov

Version History

  • 1.0.0 – 02 December 2020

Related Resources

Author Notes

Conversions to long format and wide format are fundamental data transformation techniques.

License Information