Function Repository Resource:

CrossTabulate

Compute the contingency table for a two- or three-column dataset or array

Contributed by: Anton Antonov

ResourceFunction["CrossTabulate"][data]

finds the contingency table for the Dataset or array data.

Details and Options

ResourceFunction["CrossTabulate"] works on two-dimensional full arrays with two or three columns, or on datasets that can be represented as such arrays.

If present, the third column is expected to be numerical.

If the argument has two columns, the computed contingency values are co-occurrence counts for each unique pair of values of the first and second columns.

If the argument has three columns, the computed contingency values are sums of the third column values for each unique pair of values of the first and second columns.
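
The two aggregation rules above can be imitated with built-in functions. The following is only a minimal sketch on hypothetical sample rows (the name rows and its values are invented for illustration). It is not the implementation of ResourceFunction["CrossTabulate"], and it produces flat associations keyed by pairs rather than a table:

```wolfram
(* Hypothetical sample rows: {first column, second column, numeric third column} *)
rows = {{"a", "x", 5}, {"a", "y", 6}, {"b", "x", 4.5}, {"a", "x", 10}};

(* Two columns: co-occurrence counts for each unique pair *)
pairCounts = Counts[rows[[All, {1, 2}]]]
(* <|{"a", "x"} -> 2, {"a", "y"} -> 1, {"b", "x"} -> 1|> *)

(* Three columns: sums of the third column for each unique pair *)
pairSums = GroupBy[rows, (#[[{1, 2}]] &) -> Last, Total]
(* <|{"a", "x"} -> 15, {"a", "y"} -> 6, {"b", "x"} -> 4.5|> *)
```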

Examples

Basic Examples (2)

Here is an array of random integer-word pairs:

In[1]:=

SeedRandom[4];
iwPairs = Transpose[{RandomInteger[5, 200], RandomChoice[RandomWord[5], 200]}];
Short[iwPairs]

Out[3]=

Compute the contingency table:

In[4]:=

Out[4]=
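
To give a feel for the shape of such a table, a row-by-column layout can be assembled by hand with built-in functions. This is a sketch under stated assumptions: the name pairs and its values are hypothetical, and the padding step is one possible way to fill absent pairs with zeros, not necessarily what the resource function does internally:

```wolfram
(* Hypothetical integer-word pairs, invented for illustration *)
pairs = {{1, "x"}, {2, "y"}, {3, "z"}, {1, "x"}, {2, "z"}};

(* Row value -> <|column value -> count|> *)
nested = GroupBy[pairs, First -> Last, Counts];

(* Pad every row with zero counts so that all rows share the same columns *)
columns = Union[pairs[[All, 2]]];
padded = Map[Join[AssociationMap[0 &, columns], #] &, nested];

(* An association of associations displays as a table with row and column headers *)
Dataset[padded]
```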

Here is a Dataset whose first two columns are categorical and whose third column is numeric:

In[5]:=

dataset = Dataset[{
<|"a" -> 1, "b" -> "x", "c" -> 5|>,
<|"a" -> 2, "b" -> "y", "c" -> 6|>,
<|"a" -> 3, "b" -> "z", "c" -> 4.5|>,
<|"a" -> 1, "b" -> "x", "c" -> 10|>,
<|"a" -> 2, "b" -> "y", "c" -> 100|>,
<|"a" -> 3, "b" -> "z", "c" -> Missing[]|>}]

Out[5]=

Compute the contingency table:

In[6]:=

Out[6]=

Scope (5)

Result representation (1)

For large contingency tables, sparse arrays are faster and more convenient than a Dataset. This representation is requested with the option "Sparse":

In[7]:=

Block[{n = 30},
SeedRandom[32];
sarr = Transpose[{RandomChoice[CharacterRange["A", "D"], n], RandomChoice[RandomWord["CommonWords", 5], n], RandomReal[100, n]}]
]

Out[7]=

In[8]:=

Out[8]=

Using a third, numerical column (3)

Here is a full array with three columns:

In[9]:=

Out[9]=

Compute the contingency table of the co-occurrences of each letter with each word by cross-tabulating over the first two columns only:

In[10]:=

Out[10]=

Here the cross-tabulation uses the third column, so that for each unique letter-word pair the corresponding third-column values are summed:

In[11]:=

Out[11]=

Missing values (1)

If any of the columns have missing values, they are shown in the contingency table:

In[12]:=

dataset2 = Dataset[{
<|"a" -> 1, "b" -> "x", "c" -> 5|>,
<|"a" -> Missing["first"], "b" -> "x", "c" -> 6|>,
<|"a" -> 3, "b" -> "z", "c" -> 4.5|>,
<|"a" -> 1, "b" -> "x", "c" -> 10|>,
<|"a" -> 2, "b" -> "y", "c" -> 100|>,
<|"a" -> 3, "b" -> "z", "c" -> Missing[]|>}]

Out[12]=

In[13]:=

Out[13]=
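
One way to see why a missing value can get its own row or column: Missing[...] is an ordinary expression, so built-in tallying treats it like any other key. The sketch below uses hypothetical pairs and only illustrates that behavior of Counts; how the resource function treats a missing value in the numeric third column is not shown here:

```wolfram
(* Hypothetical pairs in which one first-column value is missing *)
pairs = {{1, "x"}, {Missing["first"], "x"}, {3, "z"}, {1, "x"}};

(* The missing value is tallied as a key of its own *)
Counts[pairs]
(* <|{1, "x"} -> 2, {Missing["first"], "x"} -> 1, {3, "z"} -> 1|> *)
```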

Options (2)

Sparse (2)

By default, the result of CrossTabulate is a Dataset. With the option setting "Sparse"→True, the result is instead an Association with three elements: a sparse matrix of the contingency values, the row names, and the column names.

Here is an example:

In[14]:=

Block[{n = 40},
data = Transpose[{ToString /@ RandomInteger[{10, 20}, n], ToString /@ RandomInteger[{1, 6}, n]}]
];

In[15]:=

Out[15]=

Using MatrixForm we can visualize the result:

In[16]:=

Out[16]=
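
The three-element structure described above can be imitated with built-in functions. In this sketch the key "SparseMatrix" is the one that appears elsewhere on this page, while the keys "RowNames" and "ColumnNames", the variable names, and the sample pairs are assumptions made for illustration:

```wolfram
(* Hypothetical string pairs, invented for illustration *)
pairs = {{"10", "1"}, {"11", "2"}, {"10", "1"}, {"12", "2"}};

(* Sorted unique values become row and column names *)
rowNames = Union[pairs[[All, 1]]];
colNames = Union[pairs[[All, 2]]];

(* Map each name to its matrix index *)
rowIndex = AssociationThread[rowNames, Range[Length[rowNames]]];
colIndex = AssociationThread[colNames, Range[Length[colNames]]];

(* One rule {i, j} -> count per unique pair *)
rules = KeyValueMap[{rowIndex[#1[[1]]], colIndex[#1[[2]]]} -> #2 &, Counts[pairs]];
smat = SparseArray[rules, {Length[rowNames], Length[colNames]}];

res = <|"SparseMatrix" -> smat, "RowNames" -> rowNames, "ColumnNames" -> colNames|>;
Normal[res["SparseMatrix"]]
(* {{2, 0}, {0, 1}, {0, 1}} *)
```

Only the nonzero pair counts are stored, which is why this representation stays compact when most row-column combinations never occur.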

Applications (13)

Data study (5)

Take the Titanic dataset:

In[17]:=

Find how many males and females are in each passenger class:

In[18]:=

Out[18]=

Find how many males and females survived:

In[19]:=

Out[19]=

Find the sum of the ages for each class-sex combination:

In[20]:=

Out[20]=

Here is a function to plot sparse contingency tables:

In[21]:=

CTMatrixPlot[x_Association /; KeyExistsQ[x, "SparseMatrix"], opts___] :=
  MatrixPlot[#1, Append[{opts}, FrameLabel -> {{Keys[x][[2]], None}, {Keys[x][[3]], None}}]] & @@ x;

In[22]:=

Out[22]=

Word-sentiment analysis of movie reviews (8)

Start with movie review data:

In[23]:=

movieReviewData = Flatten@*List @@@ ExampleData[{"MachineLearning", "MovieReview"}, "Data"];
Dimensions[movieReviewData]

Out[22]=

For each movie review we make a list of word-sentiment pairs and then join them into one big list:

In[24]:=

movieReviewData = Join @@ Map[
Thread[{DeleteStopwords[StringSplit[#[[1]]]], #[[2]]}] &, movieReviewData];
Dimensions[movieReviewData]

Out[20]=
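
The pairing step relies on Thread repeating the single sentiment label across the list of words. A minimal sketch with a hypothetical word list and label (both invented for illustration, with stop words assumed to be already removed):

```wolfram
(* Hypothetical review reduced to its words, plus one sentiment label *)
words = {"great", "moving", "film"};

(* Thread pairs every word with the same label *)
Thread[{words, "positive"}]
(* {{"great", "positive"}, {"moving", "positive"}, {"film", "positive"}} *)
```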

Here is a sample:

In[25]:=

Out[25]=

Here we find the word-sentiment contingency table as a sparse matrix in order to plot it below:

In[26]:=

Here is a function to plot sparse contingency tables:

In[27]:=

CTMatrixPlot[x_Association /; KeyExistsQ[x, "SparseMatrix"], opts___] :=
  MatrixPlot[#1, Append[{opts}, FrameLabel -> {{Keys[x][[2]], None}, {Keys[x][[3]], None}}]] & @@ x;