Function Repository Resource:

MapReduceOperator

An operator form of GroupBy that also takes a reducer function to apply to each group

Contributed by: Seth J. Chandler

ResourceFunction["MapReduceOperator"][f->g,h]

creates an operator that groups the top-level elements of its argument by the value of f, applies g to each element, and then applies the reducer function h to each resulting group.

Details and Options

ResourceFunction["MapReduceOperator"] takes the option "Parallelize". Its default value is False. If set to True, ParallelMap is used instead of Map.
The function works by using Curry to create a variant of GroupBy. It thus works similarly to Function[x,GroupBy[x,f->g,h]].
The reducer function is referred to in the documentation for GroupBy as the combiner function.
ResourceFunction["MapReduceOperator"] accepts any functions in its arguments that GroupBy would accept.
ResourceFunction["MapReduceOperator"], like GroupBy, will not work where one wants successive groupings.
ParallelMap does not work on all functions. ResourceFunction["MapReduceOperator"] does not address this issue.
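
The similarity to GroupBy noted above can be sketched on a small list. The names mapReduceSketch and data are hypothetical and introduced only for illustration; this is a minimal stand-in built on the built-in GroupBy, not the resource function's actual source:

```wolfram
(* Hypothetical stand-in for the operator, built directly on GroupBy *)
mapReduceSketch[fg_, h_] := Function[x, GroupBy[x, fg, h]]

(* Illustrative data: {key, value} pairs *)
data = {{"a", 1}, {"b", 10}, {"a", 2}, {"a", 3}, {"b", 20}};

(* Group by the first element, keep the last element, take the Mean of each group *)
mapReduceSketch[First -> Last, Mean][data]
(* <|"a" -> 2, "b" -> 15|> *)
```

The rule First -> Last carries both the grouping function and the per-element transformation, so the operator needs only two arguments before it is applied to the data.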

Examples

Basic Examples (2) 

Group the data by the first element in each list, then get the last element of each list, and then compute the Mean of each grouping:

In[1]:=
ResourceFunction["MapReduceOperator"][First -> Last, Mean][{{a, x}, {b, v}, {a, y}, {a, z}, {b, w}}]
Out[1]=

Group a list of integers according to whether they are positive, compute their square roots, and display each group as a column vector:

In[2]:=
ResourceFunction["MapReduceOperator"][Positive -> Sqrt, MatrixForm][
 Range[-4, 4]]
Out[2]=

Scope (2) 

Use MapReduceOperator with slot functions to create an operator that works on lists of associations (optionally wrapped in Dataset) and then apply it to the Titanic dataset:

In[3]:=
ResourceFunction["MapReduceOperator"][(#sex &) -> (#survived &), Counts][ExampleData[{"Dataset", "Titanic"}]]
Out[3]=

The same query as above, but with the reducer function Counts run in parallel on the Normal form of the dataset:

In[4]:=
ResourceFunction["MapReduceOperator"][(#sex &) -> (#survived &), Counts, "Parallelize" -> True][
 Normal@ExampleData[{"Dataset", "Titanic"}]]
Out[4]=

Options (2) 

"Parallelize" can take on the value True or False. If it is set to anything else, it is treated as if it were False. "Parallelize" will produce error messages if the data is wrapped in Dataset:

In[5]:=
ResourceFunction["MapReduceOperator"][(#sex &) -> (#survived &), Counts, "Parallelize" -> True][ExampleData[{"Dataset", "Titanic"}]]
Out[5]=

Applying Normal to the dataset will fix this problem:

In[6]:=
ResourceFunction["MapReduceOperator"][(#sex &) -> (#survived &), Counts, "Parallelize" -> True][
 Normal@ExampleData[{"Dataset", "Titanic"}]]
Out[6]=
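
One way the "Parallelize" behavior could be realized on plain lists is to group and transform first, then apply the reducer to each group with ParallelMap. The name parallelReduceSketch is hypothetical, and the approach is an assumption about the mechanism, not the resource function's actual implementation:

```wolfram
(* Hypothetical sketch: group and transform with GroupBy, then reduce the groups in parallel *)
parallelReduceSketch[fg_, h_][x_List] :=
 Module[{grouped = GroupBy[x, fg]},
  (* Apply the reducer h to each group's values on the parallel kernels *)
  AssociationThread[Keys[grouped], ParallelMap[h, Values[grouped]]]
 ]

(* Illustrative data: sum the values recorded under each key *)
parallelReduceSketch[First -> Last, Total][{{"a", 1}, {"b", 10}, {"a", 2}}]
(* <|"a" -> 3, "b" -> 10|> *)
```

Because the sketch only accepts a List, it makes the same requirement explicit that the examples above satisfy by applying Normal to the dataset first.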

Applications (4) 

Get histograms of the number of days between the diagnosis of AIDS and the death of the patient, broken down by Australian state:

In[7]:=
With[{died = Select[ExampleData[{"Statistics", "AustraliaAIDS"}], #[[5]] === "D" &]}, ResourceFunction[
   "MapReduceOperator"][(#[[1]] &) -> (#[[4]] - #[[3]] &), Histogram[#, Automatic, "PDF", PlotRange -> {{0, 2500}, Automatic}] &][died]
 ]
Out[7]=

Group the planets according to the number of moons and compute for each grouping the median volume of the planets (assuming they are spherical):

In[8]:=
ResourceFunction[
  "MapReduceOperator"][(Length[#Moons] &) -> (4/3 \[Pi] #Radius^3 &), Median][ExampleData[{"Dataset", "Planets"}]]
Out[8]=

Compute the fractions of Titanic passengers who survived (True) and who died (False) by cabin class and sex, in parallel, using a composition of the resource function Proportions and the built-in function N:

In[9]:=
ResourceFunction[
  "MapReduceOperator"][({#class, #sex} &) -> (#survived &), ResourceFunction["Proportions"]/*N, "Parallelize" -> True][
 Normal@ExampleData[{"Dataset", "Titanic"}]]
Out[9]=

Compute the Mean age of passengers on the Titanic broken down by class, dropping entries for which the age is Missing:

In[10]:=
ResourceFunction[
  "MapReduceOperator"][(#class &) -> (#age &), (Select[
     FreeQ[#, _Missing] &]/*Mean)][
 ExampleData[{"Dataset", "Titanic"}]]
Out[10]=
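
The reducer in this example is a right composition: Select drops the Missing entries and Mean then averages what remains. A small sketch on made-up ages (illustrative values, not taken from the Titanic data; the names ages and reducer are hypothetical) shows the reducer acting on its own:

```wolfram
(* Made-up ages for illustration; one entry is Missing *)
ages = {20, Missing["NotAvailable"], 40};

(* Drop Missing entries, then average the rest *)
reducer = Select[FreeQ[#, _Missing] &] /* Mean;

reducer[ages]
(* 30 *)
```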

Possible Issues (1) 

To omit the transformation of the grouped data, use Identity as follows:

In[11]:=
ResourceFunction["MapReduceOperator"][First -> Identity, Counts][{{"x", c}, {"x", b}, {"y", d}, {"y", d}, {"z", g}, {"z", e}, {"x", a}, {"z", f}}]
Out[11]=

Publisher

Seth J. Chandler

Version History

  • 1.0.0 – 14 June 2019
