Wolfram Research

Function Repository Resource:

CrossTabulate

Source Notebook

Compute the contingency table for a two- or three- column dataset or array

Contributed by: Anton Antonov

ResourceFunction["CrossTabulate"][data]

finds the contingency table for the Dataset or array data.

Details and Options

ResourceFunction["CrossTabulate"] works on two dimensional full arrays with two or three columns, or on datasets that can be represented in that way.
If present, the third column is expected to be numerical.
If the argument has two columns, the computed contingency values are co-occurance counts for each unique pair of values of the first and second columns.
If the argument has three columns, the computed contingency values are sums of the third column values for each unique pair of values of the first and second columns.

Examples

Basic Examples

Here is an array of random integer-word pairs:

In[1]:=
SeedRandom[4];
iwPairs = Transpose[{RandomInteger[5, 200], RandomChoice[RandomWord[5], 200]}];
Short[iwPairs]
Out[3]=

Compute the contingency table:

In[4]:=
ResourceFunction["CrossTabulate"][iwPairs]
Out[4]=

Here is a Dataset, the first two columns of which are categorical columns and the third column which is numeric:

In[5]:=
dataset = Dataset[{
   <|"a" -> 1, "b" -> "x", "c" -> 5|>,
   <|"a" -> 2, "b" -> "y", "c" -> 6|>,
   <|"a" -> 3, "b" -> "z", "c" -> 4.5|>,
   <|"a" -> 1, "b" -> "x", "c" -> 10|>,
   <|"a" -> 2, "b" -> "y", "c" -> 100|>,
   <|"a" -> 3, "b" -> "z", "c" -> Missing[]|>}]
Out[5]=

Compute the contingency table:

In[6]:=
ResourceFunction["CrossTabulate"][dataset]
Out[6]=

Scope

Result representation

For large contingency tables instead of using Dataset it is faster and more convenient to use sparse arrays. That is specified with the option “Sparse”:

In[7]:=
Block[{n = 30},
 SeedRandom[32];
 sarr = Transpose[{RandomChoice[CharacterRange["A", "D"], n], RandomChoice[RandomWord["CommonWords", 5], n], RandomReal[100, n]}]
 ]
Out[7]=
In[8]:=
ResourceFunction["CrossTabulate"][sarr, "Sparse" -> True]
Out[8]=

Using a third, numerical column

Here is a full array with three columns:

In[9]:=
Block[{n = 30},
 SeedRandom[32];
 sarr = Transpose[{RandomChoice[CharacterRange["A", "D"], n], RandomChoice[RandomWord["CommonWords", 5], n], RandomReal[100, n]}]
 ]
Out[9]=

Compute the contingency table of the co-occurrences of each letter and with each word found by cross tabulating over the first two columns only:

In[10]:=
ResourceFunction["CrossTabulate"][sarr[[All, {1, 2}]]]
Out[10]=

Here the cross tabulation uses the third column -- for each unique letter-word pair the corresponding values of the third column are added:

In[11]:=
ResourceFunction["CrossTabulate"][sarr]
Out[11]=

Missing values

If any of the columns have missing values they are shown in the contingency table:

In[12]:=
dataset2 = Dataset[{
   <|"a" -> 1, "b" -> "x", "c" -> 5|>,
   <|"a" -> Missing["first"], "b" -> "x", "c" -> 6|>,
   <|"a" -> 3, "b" -> "z", "c" -> 4.5|>,
   <|"a" -> 1, "b" -> "x", "c" -> 10|>,
   <|"a" -> 2, "b" -> "y", "c" -> 100|>,
   <|"a" -> 3, "b" -> "z", "c" -> Missing[]|>}]
Out[12]=
In[13]:=
ResourceFunction["CrossTabulate"][dataset2]
Out[13]=

Options

Sparse

The result of CrossTabulate is a Dataset by default. With the option setting "Sparse"True the result is an Association with three elements: a sparse matrix with the contingency values, row names, and column names.

Here is an example:

In[14]:=
Block[{n = 40},
  data = Transpose[{ToString /@ RandomInteger[{10, 20}, n], ToString /@ RandomInteger[{1, 6}, n]}]
  ];
In[15]:=
res = ResourceFunction["CrossTabulate"][data, "Sparse" -> True]
Out[15]=

Using MatrixForm we can visualize the result:

In[16]:=
MatrixForm[res["SparseMatrix"], TableHeadings -> {res["RowNames"], res["ColumnNames"]}]
Out[16]=

Applications

Data study

Take the Titanic dataset:

In[17]:=
titanic = ExampleData[{"Dataset", "Titanic"}];

Find how many males and females are in each passenger class:

In[18]:=
ResourceFunction["CrossTabulate"][titanic[All, {"sex", "class"}]]
Out[18]=

Find how many males and females survived:

In[19]:=
ResourceFunction["CrossTabulate"][titanic[All, {"survived", "sex"}]]
Out[19]=

Find the aggregated ages of the class-sex breakdown:

In[20]:=
ResourceFunction["CrossTabulate"][
 titanic[All, {"class", "sex", "age"}]]
Out[20]=

Here is a function to plot sparse contingency tables:

In[21]:=
CTMatrixPlot[x_Association /; KeyExistsQ[x, "SparseMatrix"], opts___] := MatrixPlot[#1, Append[{opts}, FrameLabel -> {{Keys[x][[2]], None}, {Keys[x][[3]], None}}]] & @@
    x;
In[22]:=
CTMatrixPlot[
 ResourceFunction["CrossTabulate"][data, "Sparse" -> True]]
Out[22]=

Word-sentiment analysis of movie reviews

Start with movie review data:

In[23]:=
movieReviewData = Flatten@*List @@@ ExampleData[{"MachineLearning", "MovieReview"}, "Data"];
Dimensions[movieReviewData]
Out[22]=

For each movie review we make a list of word-sentiment pairs and then join them into one big list:

In[24]:=
movieReviewData = Join @@ Map[
    Thread[{DeleteStopwords[StringSplit[#[[1]]]], #[[2]]}] &, movieReviewData];
Dimensions[movieReviewData]
Out[20]=

Here is a sample:

In[25]:=
RandomSample[movieReviewData, 12]
Out[25]=

Here we find the word-sentiment contingency table as a sparse matrix in order to plot it below:

In[26]:=
cm = ResourceFunction["CrossTabulate"][movieReviewData, "Sparse" -> True];

Here is a function to plot sparse contingency tables:

In[27]:=
CTMatrixPlot[x_Association /; KeyExistsQ[x, "SparseMatrix"], opts___] := MatrixPlot[#1, Append[{opts}, FrameLabel -> {{Keys[x][[2]], None}, {Keys[x][[3]], None}}]] & @@
    x;

Plot the contingency table:

In[28]:=
CTMatrixPlot[cm, AspectRatio -> 2]
Out[28]=

Find the contingency table Dataset:

In[29]:=
ct = ResourceFunction["CrossTabulate"][movieReviewData, "Sparse" -> False];

Show the most prominent words for negative reviews:

In[30]:=
ct[ReverseSortBy[#negative - #positive &]][1 ;; 200, All]
Out[30]=

Properties and Relations

Tally and GroupBy

The functionality of CrossTabulate can be emulated with Tally or GroupBy.

Here is a contingency matrix of a two column array:

In[31]:=
Block[{n = 30},
  SeedRandom[32];
  sarr = Transpose[{RandomChoice[CharacterRange["A", "D"], n], RandomChoice[RandomWord["CommonWords", 5], n], RandomReal[100, n]}]
  ];
ResourceFunction["CrossTabulate"][sarr[[All, 1 ;; 2]]]
Out[31]=

Obtain the contingency value triplets using Tally:

In[32]:=
Tally[sarr[[All, 1 ;; 2]]]
Out[32]=

Obtain the contingency values rules using GroupBy:

In[33]:=
GroupBy[sarr, {#[[1]], #[[2]]} &, Length]
Out[33]=

GroupBy generalizes better than Tally -- we can use GroupBy to get the contingency values for three column data:

In[34]:=
GroupBy[sarr, #[[1 ;; 2]] &, Total[#[[All, 3]]] &]
Out[34]=

Find the corresponding result of CrossTabulate:

In[35]:=
ResourceFunction["CrossTabulate"][sarr]
Out[35]=

Conversion sparse results into datasets

Convert the Association obtained with the option setting "Sparse"True into a Dataset:

In[36]:=
Block[{n = 30},
  SeedRandom[32];
  sarr = Transpose[{RandomChoice[CharacterRange["A", "D"], n], RandomChoice[RandomWord["CommonWords", 5], n], RandomReal[100, n]}]
  ];
res = ResourceFunction["CrossTabulate"][sarr, "Sparse" -> True]
Out[36]=
In[37]:=
Dataset@AssociationThread[res["RowNames"], AssociationThread[res["ColumnNames"], #] & /@ Normal[res["SparseMatrix"]]]
Out[37]=

Possible Issues

If the second variable is numerical or has missing values the resulting Dataset would not have a tabular form:

In[38]:=
sarr2 = iwPairs;
sarr2[[2, 1]] = Missing[];
sarr2[[4, 2]] = Missing[\[Infinity]];
ResourceFunction["CrossTabulate"][sarr2]
Out[38]=

One way to get a tabular form is to replace Missing[___] with a string:

In[39]:=
ResourceFunction["CrossTabulate"][
 sarr2 /. Missing[x___] :> ToString[Missing[x]]]
Out[39]=

Find the co-occurrence of the integers [1,3] in a list of random integer pairs:

In[40]:=
iiPairs = RandomInteger[3, {200, 2}];
ResourceFunction["CrossTabulate"][iiPairs]
Out[40]=

Again, replacing the integer values with strings produces tabular form:

In[41]:=
ResourceFunction["CrossTabulate"][iiPairs /. x_Integer :> ToString[x]]
Out[41]=

Neat Examples

Here is a grid of contingency tables showing various breakdown perspectives of the Titanic data:

In[42]:=
titanic = ExampleData[{"Dataset", "Titanic"}];
columnNames = {"sex", "class", "survived"};
res = Outer[
   If[#1 == #2, "", ResourceFunction["CrossTabulate"][titanic[All, {#1, #2}]]] &, columnNames, columnNames];
Grid[Prepend[MapThread[Prepend, {res, columnNames}], Prepend[columnNames, ""]], Dividers -> All]
Out[38]=

Resource History

Source Metadata

Related Resources