Wolfram Research

Function Repository Resource:

RecordsSummary

Summarizes datasets, lists, or associations that can be transformed into full two-dimensional arrays.

Contributed by: Anton Antonov

ResourceFunction["RecordsSummary"][data]

summarizes the argument data.

ResourceFunction["RecordsSummary"][data,cols]

summarizes data using the specified column names cols.

Details and Options

ResourceFunction["RecordsSummary"] works on datasets that are 2D tables of atomic objects, full 2D arrays (matrices), lists of atomic objects, and associations whose values are vectors or full 2D arrays.
Missing values are summarized separately.
The number of categorical values shown in the summary can be changed with the option setting "MaxTallies" -> n, where n is an integer.
ResourceFunction["RecordsSummary"] threads over a list of rules or an association with the option setting Thread -> True.
For full 2D arrays, ResourceFunction["RecordsSummary"] names the columns automatically by default.
For 2D full arrays, lists, and associations a second argument can be provided specifying the column names.
By default the summarized columns are numbered.
Automatic numbering can be prevented with the option setting "NumberedColumns" -> False.
If v is a vector (that is, has only one dimension) then ResourceFunction["RecordsSummary"][v] is equivalent to ResourceFunction["RecordsSummary"][List/@v].
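
The vector equivalence in the last item can be checked with built-in functions alone: List /@ v turns a vector into a one-column matrix, which is a full 2D array. A minimal sketch, where the names v and m are hypothetical example values:

```wolfram
(* A hypothetical example vector *)
v = {3, -1, 7};

(* Wrap each element in a list to get a one-column matrix *)
m = List /@ v
(* {{3}, {-1}, {7}} *)

(* The result is a full 2D array with Length[v] rows and 1 column *)
Dimensions[m]
(* {3, 1} *)
```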

Examples

Basic Examples

Summarize a vector of numbers:

In[1]:=
ResourceFunction["RecordsSummary"][RandomReal[{-10, 10}, 100]]
Out[1]=

Summarize a matrix of strings and specify the column names:

In[2]:=
sarr = Transpose[{RandomChoice[CharacterRange["A", "Z"], 20], RandomWord["CommonWords", 20]}];
In[3]:=
ResourceFunction["RecordsSummary"][sarr, {"random letter", "random word"}]
Out[3]=

Summarize a vector of numbers with missing values:

In[4]:=
ResourceFunction["RecordsSummary"][
 RandomSample[Join[RandomReal[{-10, 10}, 100], Table[Missing[], 4]]]]
Out[4]=

Summarize a full 2D array with numerical and categorical columns (numbers, strings, and symbols):

In[5]:=
Block[{n = 200},
  arr = Flatten /@ Transpose[{RandomReal[{-10, 10}, {n, 2}], MapAt[ToLowerCase, RandomChoice[CharacterRange["A", "Z"], {n, 2}], {All, 2}], RandomChoice[{E, I, \[CapitalGamma]}, n]}]
  ];
In[6]:=
ResourceFunction["RecordsSummary"][arr]
Out[6]=

Summarize a dataset:

In[7]:=
ResourceFunction["RecordsSummary"][Dataset[arr]]
Out[7]=

Summarize a dataset with column names:

In[8]:=
ds = Dataset[arr][All, AssociationThread[{"num1", "num2", "char1", "char2", "symb"}, #] &];
In[9]:=
ResourceFunction["RecordsSummary"][ds]
Out[9]=

Summarize an association of vectors:

In[10]:=
asc = AssociationThread[
   Range[10] -> Table[Append[RandomReal[1, 2], RandomWord[]], 10]];
In[11]:=
ResourceFunction["RecordsSummary"][asc, Thread -> True]
Out[11]=

Scope

A larger number of categorical values can be seen using the option "MaxTallies":

In[12]:=
ResourceFunction["RecordsSummary"][ds, "MaxTallies" -> 12]
Out[12]=

The function works with missing values and summarizes them separately from the rest of the values in a column:

In[13]:=
ResourceFunction["RecordsSummary"][
 ds[All, {"num2" -> (If[# > 2, Missing[], #] &), "char1" -> (If[ToCharacterCode[#][[1]] > 76, Missing[], #] &)}]]
Out[13]=

Here we make a list of date objects with missing values:

In[14]:=
dateObjs = RandomSample[
   Join[DateObject /@ DateRange[{2011, 1, 1}, {2019, 12, 31}, Quantity[3, "Months"]], Table[Missing[], {4}]]];

Here is the summary of the list of date objects, with a specified column name:

In[15]:=
ResourceFunction["RecordsSummary"][dateObjs, "date object"]
Out[15]=

Here we make an association of random images:

In[16]:=
asc = AssociationThread[
   Range[40] -> RandomChoice[Table[RandomImage[4, {40, 12}], 5], 40]];

This summarizes the list of rules in the association:

In[17]:=
ResourceFunction["RecordsSummary"][asc]
Out[17]=

We can summarize the association’s keys and values separately using the option setting Thread -> True:

In[18]:=
ResourceFunction["RecordsSummary"][asc, Thread -> True]
Out[18]=

A dataset does not have to have named columns:

In[19]:=
ResourceFunction["RecordsSummary"][ds[Values]]
Out[19]=

Options

MaxTallies

With the option "MaxTallies" we specify how many summarized items to show for each column (variable):

In[20]:=
ResourceFunction["RecordsSummary"][ds, "MaxTallies" -> 12]
Out[20]=
In[21]:=
ResourceFunction["RecordsSummary"][ds, "MaxTallies" -> 1]
Out[21]=
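
The source does not show how the tallies are computed, so the following only illustrates the idea behind a tally limit, using the built-in functions Counts and TakeLargest. The list letters and the limit maxTallies are hypothetical, and the actual summary may order or present the counts differently:

```wolfram
(* Hypothetical categorical column *)
letters = {"a", "b", "a", "c", "a", "b", "d"};

(* Hypothetical limit on the number of tallies to show *)
maxTallies = 2;

(* Count each distinct value, then keep at most maxTallies of the most frequent ones *)
top = TakeLargest[Counts[letters], UpTo[maxTallies]]
(* <|"a" -> 3, "b" -> 2|> *)
```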

NumberedColumns

By default the summarized columns (variables) are automatically numbered:

In[22]:=
ResourceFunction["RecordsSummary"][ds]
Out[22]=

With the option "NumberedColumns" the automatic numbering can be prevented:

In[23]:=
ResourceFunction["RecordsSummary"][ds, "NumberedColumns" -> False]
Out[23]=

Thread

The option Thread specifies whether the summarization should be “threaded” when the data to be summarized is an association or a list of rules.

Here is an association of 3D points:

In[24]:=
asc = AssociationThread[Range[40] -> RandomReal[10, {40, 3}]];
Short[asc]
Out[24]=

Summarizing without threading:

In[25]:=
ResourceFunction["RecordsSummary"][asc, Thread -> False]
Out[25]=

Summarizing with threading:

In[26]:=
ResourceFunction["RecordsSummary"][asc, Thread -> True]
Out[26]=

Optionally, column names can be added:

In[27]:=
ResourceFunction["RecordsSummary"][asc, "Key" -> {"X", "Y", "Z"}, "NumberedColumns" -> False, Thread -> True]
Out[27]=

Applications

Summarize Classify-ready data

Here we summarize the Titanic data:

In[28]:=
ResourceFunction["RecordsSummary"][
 ExampleData[{"MachineLearning", "Titanic"}, "Data"],
 ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"],
 Thread -> True]
Out[28]=

Here we summarize the Mushroom data:

In[29]:=
ResourceFunction["RecordsSummary"][
 ExampleData[{"MachineLearning", "Mushroom"}, "Data"],
 ExampleData[{"MachineLearning", "Mushroom"}, "VariableDescriptions"],
 Thread -> True]
Out[29]=

Summaries browser

If we have a set of datasets we can easily build an interactive interface that allows browsing of dataset summaries:

In[30]:=
dataNames = ExampleData["Statistics"];
Manipulate[
 Column[{
   Grid[{{"Dataset name:", name},
     {"Dimensions:", ExampleData[name, "Dimensions"]}},
    Alignment -> Left
    ],
   Multicolumn[
    ResourceFunction["RecordsSummary"][ExampleData[name], ExampleData[name, "ColumnDescriptions"]], 4, Alignment -> Top]
   }],
 {{name, dataNames[[29]], "Dataset name"}, dataNames, ControlType -> PopupMenu}]
Out[30]=

Possible Issues

The first argument of RecordsSummary is expected to be an object that can be converted to a full array of atomic objects:

In[31]:=
dataset = Dataset[{
   <|"a" -> 1, "b" -> "x", "c" -> {1}|>,
   <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>,
   <|"a" -> 3, "b" -> "z", "c" -> {3}|>,
   <|"a" -> 4, "b" -> "x", "c" -> {4, 5}|>,
   <|"a" -> 5, "b" -> "y", "c" -> {5, 6, 7}|>,
   <|"a" -> 6, "b" -> "z", "c" -> {}|>}]
Out[31]=

This fails because dataset cannot be converted to a full 2D array:

In[32]:=
ResourceFunction["RecordsSummary"][dataset]
Out[32]=

A work-around is to use HoldForm for the columns that are not vectors:

In[33]:=
ResourceFunction["RecordsSummary"][dataset[All, {"c" -> HoldForm}]]
Out[33]=
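
One way to anticipate this failure is to test beforehand whether the rows form a full 2D array whose elements are not themselves lists. A minimal sketch using the built-in MatrixQ on plain lists; the names ragged and flat are hypothetical stand-ins for dataset rows, and the check only approximates what the function accepts:

```wolfram
(* Rows with a list-valued third element, like column "c" above *)
ragged = {{1, "x", {1}}, {2, "y", {2, 3}}, {3, "z", {3}}};

(* Rows where every element is atomic *)
flat = {{1, "x", 10}, {2, "y", 20}, {3, "z", 30}};

(* MatrixQ gives True only for equal-length rows that contain no list elements *)
MatrixQ[ragged]
(* False *)

MatrixQ[flat]
(* True *)
```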

If numerical columns contain Quantity values, those columns are treated as categorical:

In[34]:=
ResourceFunction["RecordsSummary"][
 ExampleData[{"Dataset", "Planets"}][[All, {1, 2}]]]
Out[34]=

A summary of numerical values can be obtained by using QuantityMagnitude:

In[35]:=
ResourceFunction["RecordsSummary"][
 ExampleData[{"Dataset", "Planets"}][[All, {1, 2}]][
  All, {"Mass" -> QuantityMagnitude, "Radius" -> QuantityMagnitude}]]
Out[35]=
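
The effect of QuantityMagnitude can be checked on a small list: it strips the units and leaves plain numbers. The list masses below is a hypothetical example, not taken from the Planets dataset:

```wolfram
(* Hypothetical quantities with units *)
masses = {Quantity[2, "Kilograms"], Quantity[5, "Kilograms"]};

(* Quantity values are not plain numbers *)
NumberQ /@ masses
(* {False, False} *)

(* QuantityMagnitude strips the units and returns the numeric parts *)
magnitudes = QuantityMagnitude /@ masses
(* {2, 5} *)
```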

For associations whose values are not full arrays, using the option setting Thread -> True produces $Failure:

In[36]:=
ResourceFunction["RecordsSummary"][<|1 -> Range[2], 2 -> Range[3]|>, Thread -> True]
Out[36]=

This works though:

In[37]:=
ResourceFunction["RecordsSummary"][<|1 -> Range[2], 2 -> Range[3]|>, Thread -> False]
Out[37]=

Neat Examples

Summarize subsets of Titanic data that correspond to each passenger class:

In[38]:=
titanic = ExampleData[{"Dataset", "Titanic"}];
In[39]:=
ColumnForm@
 Normal@Map[Grid[{ResourceFunction["RecordsSummary"][Dataset[#]]}] &, Normal[titanic[GroupBy["class"]]]]
Out[39]=
