Function Repository Resource:

RecordsSummary

Summarizes datasets, lists, or associations that can be transformed into full two-dimensional arrays

Contributed by: Anton Antonov

ResourceFunction["RecordsSummary"][data]

summarizes the argument data.

ResourceFunction["RecordsSummary"][data,cols]

summarizes data using the specified column names cols.

Details and Options

ResourceFunction["RecordsSummary"] works on datasets that are 2D tables of atomic objects, full 2D arrays (matrices), lists of atomic objects, and associations whose values are vectors or full 2D arrays.
Missing values are summarized separately.
The number of summarized categorical values shown in the summary can be changed with the option setting "MaxTallies" -> n, where n is an integer.
ResourceFunction["RecordsSummary"] threads over a list of rules or an association with the option setting Thread -> True.
For 2D full arrays by default ResourceFunction["RecordsSummary"] automatically names the columns.
For 2D full arrays, lists and associations, a second argument can be provided specifying the column names.
By default the summarized columns are numbered.
Automatic numbering can be prevented with the option setting "NumberedColumns" -> False.
If v is a vector (that is, has only one dimension) then ResourceFunction["RecordsSummary"][v] is equivalent to ResourceFunction["RecordsSummary"][List/@v].
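
The last note can be checked directly. The following is a minimal sketch, assuming the resource function can be loaded from the Wolfram Function Repository; the variable name v is only illustrative, and the comparison is expected to hold per the note above rather than guaranteed for every input type:

```mathematica
(* A sample vector; the name v is illustrative. SeedRandom makes it reproducible. *)
SeedRandom[1];
v = RandomReal[{-10, 10}, 100];

(* Summarize the vector directly and as a one-column matrix, then compare the results. *)
ResourceFunction["RecordsSummary"][v] === ResourceFunction["RecordsSummary"][List /@ v]
```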

Examples

Basic Examples (5) 

Summarize a vector of numbers:

In[1]:=
ResourceFunction["RecordsSummary"][RandomReal[{-10, 10}, 100]]
Out[1]=

Summarize a matrix of strings and specify the column names:

In[2]:=
sarr = Transpose[{RandomChoice[CharacterRange["A", "Z"], 20], RandomWord["CommonWords", 20]}];
ResourceFunction[
 "RecordsSummary"][sarr, {"random letter", "random word"}]
Out[2]=

Summarize a vector of numbers with missing values:

In[3]:=
ResourceFunction["RecordsSummary"][
 RandomSample[Join[RandomReal[{-10, 10}, 100], Table[Missing[], 4]]]]
Out[3]=

Summarize a full 2D array with numerical and categorical columns (numbers, strings, and symbols):

In[4]:=
Block[{n = 200},
  arr = Flatten /@ Transpose[{RandomReal[{-10, 10}, {n, 2}], MapAt[ToLowerCase, RandomChoice[CharacterRange["A", "Z"], {n, 2}], {All, 2}], RandomChoice[{E, I, \[CapitalGamma]}, n]}]
  ];
In[5]:=
ResourceFunction["RecordsSummary"][arr]
Out[5]=

Summarize a dataset:

In[6]:=
ResourceFunction["RecordsSummary"][Dataset[arr]]
Out[6]=

Summarize a dataset with column names:

In[7]:=
ds = Dataset[arr][All, AssociationThread[{"num1", "num2", "char1", "char2", "symb"}, #] &];
In[8]:=
ResourceFunction["RecordsSummary"][ds]
Out[8]=

Summarize an association of vectors:

In[9]:=
asc = AssociationThread[
   Range[10] -> Table[Append[RandomReal[1, 2], RandomWord[]], 10]];
In[10]:=
ResourceFunction["RecordsSummary"][asc, Thread -> True]
Out[10]=

Scope (4) 

Define a dataset:

In[11]:=
Block[{n = 200},
  arr = Flatten /@ Transpose[{RandomReal[{-10, 10}, {n, 2}], MapAt[ToLowerCase, RandomChoice[CharacterRange["A", "Z"], {n, 2}], {All, 2}], RandomChoice[{E, I, \[CapitalGamma]}, n]}]
  ];
In[12]:=
ds = Dataset[arr][All, AssociationThread[{"num1", "num2", "char1", "char2", "symb"}, #] &];

A larger number of categorical values can be seen using the option "MaxTallies":

In[13]:=
ResourceFunction["RecordsSummary"][ds, "MaxTallies" -> 12]
Out[13]=

The function works with missing values and summarizes them separately from the rest of the values in a column:

In[14]:=
ResourceFunction["RecordsSummary"][
 ds[All, {"num2" -> (If[# > 2, Missing[], #] &), "char1" -> (If[ToCharacterCode[#][[1]] > 76, Missing[], #] &)}]]
Out[14]=

Here we make a list of date objects with missing values:

In[15]:=
dateObjs = RandomSample[
   Join[DateObject /@ DateRange[{2011, 1, 1}, {2019, 12, 31}, Quantity[3, "Months"]], Table[Missing[], {4}]]];

Here is the summary of the list of date objects, with a specified column name:

In[16]:=
ResourceFunction["RecordsSummary"][dateObjs, "date object"]
Out[16]=

Here we make an association of random images:

In[17]:=
asc = AssociationThread[
   Range[40] -> RandomChoice[Table[RandomImage[4, {40, 12}], 5], 40]];

This summarizes the list of rules in the association:

In[18]:=
ResourceFunction["RecordsSummary"][asc]
Out[18]=

We can summarize the association's keys and values separately using the option setting Thread -> True:

In[19]:=
ResourceFunction["RecordsSummary"][asc, Thread -> True]
Out[19]=

A dataset does not have to have named columns:

In[20]:=
Block[{n = 200},
  arr = Flatten /@ Transpose[{RandomReal[{-10, 10}, {n, 2}], MapAt[ToLowerCase, RandomChoice[CharacterRange["A", "Z"], {n, 2}], {All, 2}], RandomChoice[{E, I, \[CapitalGamma]}, n]}]
  ];
In[21]:=
ds = Dataset[arr][All, AssociationThread[{"num1", "num2", "char1", "char2", "symb"}, #] &];
In[22]:=
ResourceFunction["RecordsSummary"][ds[Values]]
Out[22]=

Options (7) 

MaxTallies (1) 

With the option "MaxTallies" we specify how many summarized items we want to see for each column (variable):

In[23]:=
Block[{n = 200},
  arr = Flatten /@ Transpose[{RandomReal[{-10, 10}, {n, 2}], MapAt[ToLowerCase, RandomChoice[CharacterRange["A", "Z"], {n, 2}], {All, 2}], RandomChoice[{E, I, \[CapitalGamma]}, n]}]
  ];
In[24]:=
ds = Dataset[arr][All, AssociationThread[{"num1", "num2", "char1", "char2", "symb"}, #] &];
In[25]:=
ResourceFunction["RecordsSummary"][ds, "MaxTallies" -> 12]
Out[25]=
In[26]:=
ResourceFunction["RecordsSummary"][ds, "MaxTallies" -> 1]
Out[26]=
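
To get a feel for what a capped tally of a categorical column looks like, the following sketch uses only built-in functions. It illustrates the idea and is not the implementation of RecordsSummary; the names col and maxTallies are hypothetical:

```mathematica
(* A hypothetical categorical column of 200 random capital letters. *)
SeedRandom[2];
col = RandomChoice[CharacterRange["A", "Z"], 200];

(* Hypothetical cap on the number of tallies to show. *)
maxTallies = 5;

(* Count the occurrences of each value and keep at most maxTallies of the most frequent ones. *)
TakeLargestBy[Tally[col], Last, UpTo[maxTallies]]
```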

NumberedColumns (2) 

By default the summarized columns (variables) are automatically numbered:

In[27]:=
Block[{n = 200},
  arr = Flatten /@ Transpose[{RandomReal[{-10, 10}, {n, 2}], MapAt[ToLowerCase, RandomChoice[CharacterRange["A", "Z"], {n, 2}], {All, 2}], RandomChoice[{E, I, \[CapitalGamma]}, n]}]
  ];
In[28]:=
ds = Dataset[arr][All, AssociationThread[{"num1", "num2", "char1", "char2", "symb"}, #] &];
In[29]:=
ResourceFunction["RecordsSummary"][ds]
Out[29]=

With the option "NumberedColumns" the automatic numbering can be prevented:

In[30]:=
ResourceFunction["RecordsSummary"][ds, "NumberedColumns" -> False]
Out[30]=

Thread (4) 

The option Thread specifies whether the summarization should be "threaded" when the data to be summarized is an association or a list of rules.

Here is an association of 3D points:

In[31]:=
asc = AssociationThread[Range[40] -> RandomReal[10, {40, 3}]];
Short[asc]
Out[31]=

Summarizing without threading:

In[32]:=
ResourceFunction["RecordsSummary"][asc, Thread -> False]
Out[32]=

Summarizing with threading:

In[33]:=
ResourceFunction["RecordsSummary"][asc, Thread -> True]
Out[33]=

Optionally column names can be added:

In[34]:=
ResourceFunction["RecordsSummary"][asc, "Key" -> {"X", "Y", "Z"}, "NumberedColumns" -> False, Thread -> True]
Out[34]=

Applications (3) 

Summarize Classify-ready data (2) 

Here we summarize the Titanic data:

In[35]:=
ResourceFunction["RecordsSummary"][
 ExampleData[{"MachineLearning", "Titanic"}, "Data"],
 ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"],
 Thread -> True]
Out[35]=

Here we summarize the Mushroom data:

In[36]:=
ResourceFunction["RecordsSummary"][
 ExampleData[{"MachineLearning", "Mushroom"}, "Data"],
 ExampleData[{"MachineLearning", "Mushroom"}, "VariableDescriptions"],
 Thread -> True]
Out[36]=

Summaries browser (1) 

If we have a set of datasets we can easily build an interactive interface that allows browsing of dataset summaries:

In[37]:=
dataNames = ExampleData["Statistics"];
Manipulate[
 Column[{
   Grid[{{"Dataset name:", name},
     {"Dimensions:", ExampleData[name, "Dimensions"]}},
    Alignment -> Left
    ],
   Multicolumn[
    ResourceFunction["RecordsSummary"][ExampleData[name], ExampleData[name, "ColumnDescriptions"]], 4, Alignment -> Top]
   }],
 {{name, dataNames[[29]], "Dataset name"}, dataNames, ControlType -> PopupMenu}]
Out[37]=

Possible Issues (7) 

The first argument of RecordsSummary is expected to be an object that can be converted to a full array of atomic objects:

In[38]:=
dataset = Dataset[{
   <|"a" -> 1, "b" -> "x", "c" -> {1}|>,
   <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>,
   <|"a" -> 3, "b" -> "z", "c" -> {3}|>,
   <|"a" -> 4, "b" -> "x", "c" -> {4, 5}|>,
   <|"a" -> 5, "b" -> "y", "c" -> {5, 6, 7}|>,
   <|"a" -> 6, "b" -> "z", "c" -> {}|>}]
Out[38]=

This fails because dataset cannot be converted to a full 2D array:

In[39]:=
ResourceFunction["RecordsSummary"][dataset]
Out[39]=

A workaround is to use HoldForm for the columns that are not vectors of atomic objects:

In[40]:=
ResourceFunction["RecordsSummary"][dataset[All, {"c" -> HoldForm}]]
Out[40]=

If numerical columns have Quantity values, those columns are treated as categorical:

In[41]:=
ResourceFunction["RecordsSummary"][
 ExampleData[{"Dataset", "Planets"}][[All, {1, 2}]]]
Out[41]=

A summary of numerical values can be obtained by using QuantityMagnitude:

In[42]:=
ResourceFunction["RecordsSummary"][
 ExampleData[{"Dataset", "Planets"}][[All, {1, 2}]][
  All, {"Mass" -> QuantityMagnitude, "Radius" -> QuantityMagnitude}]]
Out[42]=

For associations whose values are not full arrays, using the option setting Thread -> True produces a failure:

In[43]:=
ResourceFunction["RecordsSummary"][<|1 -> Range[2], 2 -> Range[3]|>, Thread -> True]
Out[43]=

This works though:

In[44]:=
ResourceFunction["RecordsSummary"][<|1 -> Range[2], 2 -> Range[3]|>, Thread -> False]
Out[44]=

Neat Examples (1) 

Summarize subsets of Titanic data that correspond to each passenger class:

In[45]:=
titanic = ExampleData[{"Dataset", "Titanic"}];
In[46]:=
ColumnForm@
 Normal@Map[Grid[{ResourceFunction["RecordsSummary"][Dataset[#]]}] &, Normal[titanic[GroupBy["class"]]]]
Out[46]=

Publisher

Anton Antonov

Version History

  • 1.0.0 – 02 October 2019

Author Notes

This function, RecordsSummary, corresponds to R’s fundamental function summary.