Function Repository Resource:

RandomTabularDataset

Generate a random tabular dataset

Contributed by: Anton Antonov

ResourceFunction["RandomTabularDataset"][{m,n}]

generates a random tabular dataset with m rows and n columns.

ResourceFunction["RandomTabularDataset"][m]

generates a random tabular dataset with m rows and a random number of columns.

ResourceFunction["RandomTabularDataset"][]

generates a random tabular dataset with a random number of rows and columns.

Details and Options

ResourceFunction["RandomTabularDataset"][] is the same as ResourceFunction["RandomTabularDataset"][{Automatic,Automatic}].
ResourceFunction["RandomTabularDataset"][m] is the same as ResourceFunction["RandomTabularDataset"][{m,Automatic}].
If the number of rows is Automatic, then a random integer is generated with PoissonDistribution[20].
If the number of columns is not specified or it is Automatic, then a random integer is generated with PoissonDistribution[7].
It is possible to specify a fixed number of rows, a fixed number of columns or both.
It is possible to specify concrete column names.
ResourceFunction["RandomTabularDataset"] takes the following options:
"ColumnNamesGenerator"    Automatic    generator of column names
"Form"                    "Wide"       the form of the generated dataset (long or wide)
"Generators"              Automatic    generators for the values in each column
"MaxNumberOfValues"       Automatic    max number of non-missing values
"MinNumberOfValues"       Automatic    min number of non-missing values
"RowKeys"                 False        should the rows have keys or not
"PointwiseGeneration"     False        should the generators be applied in pointwise or vectorwise manner
The option "Generators" can be used to specify how the values in the columns are generated.
If the option "Generators" is given the value Automatic, then the column value generators are derived through a random choice of functions that produce random reals, random integers and random words.
If the option value is a function, G, then all values are generated with the function G.
If the option value is a list, {G1,G2,...,Gi,...}, then the generator Gi is applied to the ith column. The list of generators is repeated if its length is smaller than the number of columns.
If the option value is an association, <|...,ki->Gi,...|>, then the generator Gi is applied to the kith column. Columns without an assigned generator get their values from the default (Automatic) generators.
If vectorwise generation is used and the number of columns is ncols, then with Automatic the generators are generated with the expression: RandomChoice[{RandomReal[{-10,10},#]&,RandomInteger[{-100,100},#]&,RandomWord[#]&},ncols]
If pointwise generation is used and the number of columns is ncols, then with Automatic the generators are generated with the expression: RandomChoice[{RandomReal[{-10,10}]&,RandomInteger[{-100,100}]&,RandomWord[]&},ncols]
The generated datasets can have row keys and can be in long form or wide form.
If the value of "MaxNumberOfValues" is Automatic or All for an m×n dataset, then "MaxNumberOfValues" is interpreted as m×n.
If the value of "MinNumberOfValues" is Automatic or All, then it is interpreted to be the same as "MaxNumberOfValues".
The column names generator function can be either pointwise or vectorwise.
A pointwise generator is considered to be a one-argument function; the value passed to it is the column index.
A vectorwise generator is seen as a two-argument function; the values it accepts are the current column index and the list of all column indexes.
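
The Automatic rules for the numbers of rows and columns described above can be sketched with built-in functions only. The helper name resolveDimensions is hypothetical and is not part of the resource function; the sketch merely mirrors the documented Poisson parameters and does not cover a list of column names as the second element.

```wolfram
(* Hypothetical helper: replace Automatic with Poisson draws, following the notes above *)
resolveDimensions[{m_, n_}] := {
   If[m === Automatic, RandomVariate[PoissonDistribution[20]], m],
   If[n === Automatic, RandomVariate[PoissonDistribution[7]], n]};
resolveDimensions[m : (_Integer | Automatic)] := resolveDimensions[{m, Automatic}];
resolveDimensions[] := resolveDimensions[{Automatic, Automatic}];

SeedRandom[2];
resolveDimensions[{4, Automatic}]  (* a pair {4, k}, where k is a Poisson draw with mean 7 *)
```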

Examples

Basic Examples (3) 

Generate a random tabular dataset:

In[1]:=
SeedRandom[2];
ResourceFunction["RandomTabularDataset"][]
Out[2]=

Generate a random tabular dataset with a specified number of rows:

In[3]:=
SeedRandom[2];
ResourceFunction["RandomTabularDataset"][4]
Out[4]=

Generate a random tabular dataset with specified random value generators for certain columns:

In[5]:=
SeedRandom[4];
ResourceFunction["RandomTabularDataset"][{5, Automatic}, "Generators" -> <|1 -> NormalDistribution[100, 3], 3 -> RandomColor|>]
Out[6]=

Scope (4) 

The generated dataset can be produced in long form or wide form and can have row keys. Here is a wide form dataset with row keys:

In[7]:=
SeedRandom[32];
ResourceFunction["RandomTabularDataset"][{4, 3}, "Form" -> "Wide", "RowKeys" -> True]
Out[8]=

Here is the corresponding long form with row keys:

In[9]:=
SeedRandom[32];
ResourceFunction["RandomTabularDataset"][{4, 3}, "Form" -> "Long", "RowKeys" -> True]
Out[10]=
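
The relation between the two forms can be illustrated with built-in functions. The sketch below assumes that a long form holds one record per (row key, column name, value) triple; the helper toLongForm and the column names "ID", "Variable" and "Value" are illustrative assumptions, not necessarily what the resource function uses.

```wolfram
(* A small wide-form table: an association of rows keyed by row ID *)
wide = <|1 -> <|"a" -> 10, "b" -> "x"|>, 2 -> <|"a" -> 20, "b" -> "y"|>|>;

(* Hypothetical helper: emit one record per (row key, column, value) triple *)
toLongForm[tbl_Association] := Catenate[
   KeyValueMap[
    Function[{id, row},
     KeyValueMap[<|"ID" -> id, "Variable" -> #1, "Value" -> #2|> &, row]],
    tbl]];

long = toLongForm[wide];
Dataset[long]  (* four records, one per cell of the wide table *)
```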

Generate a random tabular dataset with specified column names:

In[11]:=
SeedRandom[32];
ResourceFunction["RandomTabularDataset"][{4, {"A", "B", "C"}}]
Out[12]=

Using Identity or symbols without down values to specify the column name generation or column value generation gives insight about how the random generator functions are called. Here is an example with "pointwise" generators:

In[13]:=
Clear[H, V];
ResourceFunction["RandomTabularDataset"][{4, 5}, "ColumnNamesGenerator" -> (ToString@*H), "Generators" -> V, "PointwiseGeneration" -> True]
Out[14]=

Here is an example with "vectorwise" generators:

In[15]:=
Clear[H, V];
ResourceFunction["RandomTabularDataset"][{4, 5}, "ColumnNamesGenerator" -> (ToString@*H /@ #2 &), "Generators" -> Table[V /@ #2 &, {5}], "PointwiseGeneration" -> False]
Out[16]=

Options (18) 

ColumnNamesGenerator (3) 

The option "ColumnNamesGenerator" specifies a function that generates the column names:

In[17]:=
SeedRandom[116];
ResourceFunction["RandomTabularDataset"][{5, 6}, "ColumnNamesGenerator" -> (RandomWord["Stopwords", #] &)]
Out[18]=

The column names generator function application adheres to the value given to the option "PointwiseGeneration". Here is an example with the pointwise generator (ToString[k++]&):

In[19]:=
SeedRandom[116];
k = -2;
ResourceFunction["RandomTabularDataset"][{5, 6}, "ColumnNamesGenerator" -> (ToString[k++] &), "PointwiseGeneration" -> True]
Out[21]=

If the column names generator is None, the dataset will not have column names:

In[22]:=
SeedRandom[12];
ResourceFunction["RandomTabularDataset"][5, "ColumnNamesGenerator" -> None]
Out[23]=

Generate a random dataset in which column i has the name F[i], using pointwise generation:

In[24]:=
SeedRandom[11];
ResourceFunction["RandomTabularDataset"][{3, 5}, "ColumnNamesGenerator" -> (ToString@*F), "PointwiseGeneration" -> True]
Out[25]=

Here is the vectorwise generation:

In[26]:=
SeedRandom[11];
ResourceFunction["RandomTabularDataset"][{3, 5}, "ColumnNamesGenerator" -> (ToString@*F /@ #2 &), "PointwiseGeneration" -> False]
Out[27]=

Form (1) 

The option "Form" specifies the form (format) of the generated dataset; it takes the values Automatic, RandomChoice, "Long" or "Wide":

In[28]:=
Table[Labeled[
  BlockRandom[
   ResourceFunction["RandomTabularDataset"][{2, 4}, "Form" -> f], RandomSeeding -> 4], f], {f, {Automatic, RandomChoice, "Long", "Wide"}}]
Out[28]=

Generators (6) 

If the option "Generators" is given the value Automatic, then the column value generators are derived through a random choice of functions that produce random reals, random integers and random words. The following two examples show the generated datasets have columns with corresponding types:

In[29]:=
SeedRandom[32];
ncols = 7;
lsColNames = Take[CharacterRange["A", "Z"], ncols];
ResourceFunction["RandomTabularDataset"][{3, lsColNames}, "Generators" -> Automatic]
Out[30]=
In[31]:=
SeedRandom[32];
ResourceFunction["RandomTabularDataset"][{3, lsColNames}, "Generators" -> AssociationThread[Range[ncols], RandomChoice[{RandomReal[{-10, 10}, #] &, RandomInteger[{-100, 100}, #] &, RandomWord[#] &}, ncols]]]
Out[32]=

Here is a table that shows which generator is used for which column:

In[33]:=
SeedRandom[32];
ResourceFunction["GridTableForm"]@
 AssociationThread[lsColNames, RandomChoice[{RandomReal[{-10, 10}, #] &, RandomInteger[{-100, 100}, #] &, RandomWord[#] &}, ncols]]
Out[34]=

Specify all values to be generated by RandomInteger:

In[35]:=
SeedRandom[32];
ResourceFunction["RandomTabularDataset"][{5, 3}, "Generators" -> (RandomInteger[{-3, 7}, #] &)]
Out[36]=

If the generators are given in a list, then that list is repeated to match all columns:

In[37]:=
SeedRandom[32];
ResourceFunction["RandomTabularDataset"][{3, 5}, "Generators" -> {Table[RandomImage[], #] &, RandomReal[100, #] &}]
Out[38]=
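
Repeating a short list of generators so that it covers every column can be sketched as a cyclic index lookup. The helper cycleGenerators is hypothetical; it only illustrates the cyclic assignment described above and is not taken from the resource function's source.

```wolfram
(* Hypothetical helper: assign generators to columns 1..ncols cyclically *)
cycleGenerators[gens_List, ncols_Integer] :=
  AssociationThread[Range[ncols], gens[[Mod[Range[ncols], Length[gens], 1]]]];

Clear[g1, g2];
cycleGenerators[{g1, g2}, 5]
(* <|1 -> g1, 2 -> g2, 3 -> g1, 4 -> g2, 5 -> g1|> *)
```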

Specify the values of the first column to be generated with RandomColor and the values of the second column to be generated with PoissonDistribution. The third column has values derived from the default generator:

In[39]:=
SeedRandom[32];
ResourceFunction["RandomTabularDataset"][{10, 3}, "Generators" -> <|1 -> (RandomColor[#] &), 2 -> (RandomVariate[PoissonDistribution[300], #] &)|>]
Out[40]=

Generators using built-in symbolic distributions can be specified in a short form. Instead of specifying column value generation with RandomVariate, just the symbolic distributions can be used.

Use NormalDistribution for both columns, first with the standard specification and next with the short form:

In[41]:=
SeedRandom[12];
\[ScriptCapitalD]1 = NormalDistribution[100, 2];
ResourceFunction["RandomTabularDataset"][{6, 2}, "ColumnNamesGenerator" -> ({"RandomVariate[\[ScriptCapitalD]1,#]&", "\[ScriptCapitalD]1"} &), "Generators" -> <|1 -> (RandomVariate[\[ScriptCapitalD]1, #] &), 2 -> \[ScriptCapitalD]1|>]
Out[26]=

Here is another example using a derived, mixture distribution:

In[42]:=
SeedRandom[12];
\[ScriptCapitalD]2 = MixtureDistribution[{1, 1}, {PoissonDistribution[13], BinomialDistribution[200, 0.4]}];
ResourceFunction["RandomTabularDataset"][{6, 2}, "ColumnNamesGenerator" -> ({"RandomVariate[\[ScriptCapitalD]2,#]&", "\[ScriptCapitalD]2"} &), "Generators" -> <|1 -> (RandomVariate[\[ScriptCapitalD]2, #] &), 2 -> \[ScriptCapitalD]2|>]
Out[26]=
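
One way to think about the short form is as a normalization step that wraps a symbolic distribution in RandomVariate. The sketch below is an assumption about how such a step could look; the helper toVectorwiseGenerator is hypothetical and is not the resource function's actual code.

```wolfram
(* Hypothetical helper: turn a symbolic distribution into a vectorwise generator *)
toVectorwiseGenerator[d_?DistributionParameterQ] := (RandomVariate[d, #] &);
toVectorwiseGenerator[g_] := g;  (* anything else is assumed to be a generator already *)

gen = toVectorwiseGenerator[NormalDistribution[100, 2]];
SeedRandom[12];
gen[6]  (* a list of six normal variates *)
```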

MaxNumberOfValues (1) 

Use the option "MaxNumberOfValues" to specify the maximum number of (non-missing) values in the generated random dataset:

In[43]:=
Table[SeedRandom[22]; Labeled[ResourceFunction["RandomTabularDataset"][{3, 4}, "MaxNumberOfValues" -> n], n, Top], {n, {Automatic, 6}}]
Out[43]=

MinNumberOfValues (2) 

Use the option "MinNumberOfValues" to specify the minimum number of (non-missing) values in the generated random dataset:

In[44]:=
Grid[Table[SeedRandom[12]; Labeled[ResourceFunction["RandomTabularDataset"][{3, 4}, "MinNumberOfValues" -> n1, "MaxNumberOfValues" -> n2], {n1, n2}, Top], {n1, {Automatic, 3}}, {n2, {Automatic, 6}}]]
Out[44]=

The value of "MinNumberOfValues" is ignored if it is greater than "MaxNumberOfValues":

In[45]:=
SeedRandom[44];
ResourceFunction["RandomTabularDataset"][{4, 4}, "MinNumberOfValues" -> 12, "MaxNumberOfValues" -> 6]
Out[46]=
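
The effect of the two options can be imitated on a plain matrix. The sketch below assumes non-negative bounds and treats a minimum that exceeds the maximum as equal to the maximum; that reading of "ignored", the helper name sparsify, and the uniform choice of the kept positions are all assumptions rather than documented behavior.

```wolfram
(* Hypothetical helper: keep k randomly chosen entries of a matrix and replace
   the rest with Missing[]; k is drawn between the clipped bounds *)
sparsify[mat_?MatrixQ, minVals_Integer, maxVals_Integer] :=
  Module[{total = Times @@ Dimensions[mat], hi, lo, keep, flat},
   hi = Min[maxVals, total];   (* cannot keep more entries than exist *)
   lo = Min[minVals, hi];      (* a minimum above the maximum is clipped *)
   keep = RandomSample[Range[total], RandomInteger[{lo, hi}]];
   flat = ConstantArray[Missing[], total];
   flat[[keep]] = Flatten[mat][[keep]];
   Partition[flat, Last[Dimensions[mat]]]];

SeedRandom[22];
sparsify[RandomInteger[{-100, 100}, {3, 4}], 3, 6]  (* 3 x 4, with 3 to 6 non-missing entries *)
```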

RowKeys (2) 

The option "RowKeys" specifies whether the generated dataset has row keys:

In[47]:=
SeedRandom[32];
Table[r -> ResourceFunction["RandomTabularDataset"][{3, 4}, "RowKeys" -> r], {r, {Automatic, False, True}}]
Out[48]=

If the option value is Automatic then a random choice between False and True is made; False is chosen more often:

In[49]:=
SeedRandom[2];
Table[ResourceFunction["RandomTabularDataset"][{2, 2}, "RowKeys" -> Automatic], 4]
Out[50]=

PointwiseGeneration (3) 

The generators can be pointwise or vectorwise; in general, pointwise generation is much slower:

In[51]:=
SeedRandom[81];
AbsoluteTiming[
 Table[ResourceFunction["RandomTabularDataset"][{20, 5}, "Generators" -> RandomWord, "PointwiseGeneration" -> True], 10];]
Out[52]=
In[53]:=
SeedRandom[81];
AbsoluteTiming[
 Table[ResourceFunction["RandomTabularDataset"][{20, 5}, "Generators" -> RandomWord, "PointwiseGeneration" -> False], 10];]
Out[54]=

A single call to a pointwise generator produces a single value:

In[55]:=
SeedRandom[99];
k = 0;
ResourceFunction["RandomTabularDataset"][{3, 5}, "Generators" -> {(k++ &)}, "PointwiseGeneration" -> True]
Out[53]=

A pointwise generator takes entry coordinates as a single argument:

In[56]:=
SeedRandom[99];
k = 0;
ResourceFunction["RandomTabularDataset"][{3, 5}, "Generators" -> {(F[#] &)}, "PointwiseGeneration" -> True]
Out[57]=

A single call to a vectorwise generator produces a vector of values with length corresponding to the number of rows:

In[58]:=
SeedRandom[99];
ResourceFunction["RandomTabularDataset"][{3, 5}, "Generators" -> {Range[#] &}, "PointwiseGeneration" -> False]
Out[59]=

A vectorwise generator is a two-argument function; its arguments are the vector length and a list of entry coordinates:

In[60]:=
SeedRandom[99];
ResourceFunction["RandomTabularDataset"][{3, 5}, "Generators" -> {F /@ #1 &}, "PointwiseGeneration" -> False]
Out[53]=
In[61]:=
SeedRandom[99];
ResourceFunction["RandomTabularDataset"][{3, 5}, "Generators" -> {F /@ #2 &}, "PointwiseGeneration" -> False]
Out[58]=
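
The two calling conventions can be compared side by side without the resource function. In the sketch below the helper makeColumn is hypothetical, and the representation of entry coordinates as {row, column} pairs is an assumption.

```wolfram
(* Hypothetical helper: build column j of an m-row table under either convention *)
makeColumn[gen_, m_Integer, j_Integer, pointwise : (True | False)] :=
  With[{coords = Table[{i, j}, {i, m}]},
   If[pointwise,
    gen /@ coords,      (* one call per entry; the argument is the entry coordinates *)
    gen[m, coords]]];   (* one call per column; the arguments are length and coordinates *)

Clear[F];
makeColumn[F, 3, 2, True]
(* {F[{1, 2}], F[{2, 2}], F[{3, 2}]} *)
makeColumn[F /@ #2 &, 3, 2, False]
(* {F[{1, 2}], F[{2, 2}], F[{3, 2}]} *)
makeColumn[Range[#] &, 3, 2, False]
(* {1, 2, 3} *)
```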

Applications (1) 

The ability to generate random datasets (tabular or hierarchical) is very useful for developing and testing data wrangling, data science and machine learning algorithms.

Here we use the resource functions RecordsSummary and ParallelCoordinatesPlot:

In[62]:=
SeedRandom[83];
dsRand1 = ResourceFunction["RandomTabularDataset"][{120, 3}, "Generators" -> <|
     1 -> With[{ws = RandomWord[4]}, RandomChoice[ws, #] &]|>];
Row[{dsRand1, Spacer[3], ResourceFunction["RecordsSummary"][dsRand1], Spacer[3], ResourceFunction["ParallelCoordinatesPlot"][
   Values /@ Normal[GroupBy[dsRand1, #[[1]] &, #[[All, {2, 3}]] &]], ImageSize -> Medium]}]
Out[61]=

Properties and Relations (4) 

Here is an association of random tabular datasets:

In[63]:=
aTbls = Association[
   Table[i -> ResourceFunction[
      "RandomTabularDataset"][{Automatic, RandomInteger[{2, 6}]}], {i,
      3}]];
Magnify[#, 0.6] & /@ aTbls
Out[60]=

The generated datasets can be summarized with the resource function RecordsSummary:

In[64]:=
SeedRandom[26];
<|"Dimensions" -> Dimensions[#], "Summary" -> ResourceFunction["RecordsSummary"][#]|> & /@ aTbls
Out[56]=

Here is a randomly generated tabular dataset in wide form:

In[65]:=
SeedRandom[2423];
dsWide = ResourceFunction["RandomTabularDataset"][{4, 3}, "Form" -> "Wide", "RowKeys" -> True]
Out[60]=

Here is the same dataset in long form:

In[66]:=
SeedRandom[2423];
dsLong = ResourceFunction["RandomTabularDataset"][{4, 3}, "Form" -> "Long"]
Out[56]=

The resource function CrossTabulate can be used to convert from long form to wide form:

In[67]:=
dsCTbl = ResourceFunction["CrossTabulate"][dsLong]
Out[67]=

Here we verify that the result from CrossTabulate is the same as the generated wide form (by sorting the keys in the wide form first):

In[68]:=
dsCTbl == dsWide[All, KeySort]
Out[68]=
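
For comparison, a long-to-wide conversion can also be sketched with GroupBy alone. The record keys "ID", "Variable" and "Value" are assumed names for the long-form columns and the data is made up for illustration; this is not how CrossTabulate is implemented.

```wolfram
(* Illustrative long-form records with assumed column names *)
longRecords = {
   <|"ID" -> 1, "Variable" -> "a", "Value" -> 10|>,
   <|"ID" -> 1, "Variable" -> "b", "Value" -> "x"|>,
   <|"ID" -> 2, "Variable" -> "a", "Value" -> 20|>,
   <|"ID" -> 2, "Variable" -> "b", "Value" -> "y"|>};

(* Group the records by row ID, then turn each group into a column -> value association *)
wideRecords = GroupBy[longRecords, #ID &,
   Function[rows, Association[(#Variable -> #Value) & /@ rows]]]
(* <|1 -> <|"a" -> 10, "b" -> "x"|>, 2 -> <|"a" -> 20, "b" -> "y"|>|> *)
```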

RandomTabularDataset can be seen as a dataset version of the results from ProductDistribution. Here is a ProductDistribution of two independent variables:

In[69]:=
SeedRandom[3];
\[ScriptCapitalD] = ProductDistribution[SkewNormalDistribution[0, 2, 0.1], PoissonDistribution[10]];
lsVals = RandomVariate[\[ScriptCapitalD], 9000];
Row[{Framed@ColumnForm@RandomSample[lsVals, 4], Spacer[10], Histogram3D[lsVals, 12, ImageSize -> Medium]}]
Out[56]=

Generate a random tabular dataset with 9000 rows and generators that correspond to the distributions given to ProductDistribution above:

In[70]:=
SeedRandom[3];
dsRProd = ResourceFunction["RandomTabularDataset"][{9000, 2}, "Generators" -> {SkewNormalDistribution[0, 2, 0.1], PoissonDistribution[10]}];
Row[{RandomSample[dsRProd, 4], Spacer[10], Histogram3D[Normal[dsRProd[Values]], 12, ImageSize -> Medium]}]
Out[33]=

The resource function ExampleDataset makes datasets from ExampleData. Here is an example dataset:

In[71]:=
dsAW = ResourceFunction["ExampleDataset"][{"Statistics", "AnimalWeights"}]
Out[71]=

Here is a similar random dataset:

In[72]:=
SeedRandom[23];
dsCW = ResourceFunction["RandomTabularDataset"][
   {60, {"Creature", "BodyWeight", "BrainWeight"}},
   "Generators" -> <| 1 -> (Table[
         StringJoin[
          RandomChoice[CharacterRange["a", "z"], 5]], #] &),
     2 -> FindDistribution[Normal@dsAW[All, "BodyWeight"]],
     3 -> FindDistribution[Normal@dsAW[All, "BrainWeight"]]|>];
IQB = Interval[
   Quartiles[N@Normal[dsAW[All, #BrainWeight/#BodyWeight &]]][[{1, 3}]]];
dsCW[Select[IntervalMemberQ[IQB, #BrainWeight/ #BodyWeight] &]]
Out[70]=

Possible Issues (4) 

If the generated (unique) column names are too few, then additional column names are generated as string forms of integers:

In[73]:=
SeedRandom[32];
ResourceFunction["RandomTabularDataset"][{5, 6}, "ColumnNamesGenerator" -> (RandomChoice[Characters["abc"], #] &)]
Out[72]=

Using pointwise generators with "PointwiseGeneration" set to False produces constant value columns:

In[74]:=
SeedRandom[5];
ResourceFunction["RandomTabularDataset"][{3, 4}, "Generators" -> {RandomReal[100] &, RandomInteger[12] &}]
Out[72]=
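
A zero-argument pointwise generator can be adapted for vectorwise use by calling it once per row, in the same way that Table[RandomImage[], #] & is used in the "Generators" examples. The adapter name asVectorwise is hypothetical.

```wolfram
(* Hypothetical adapter: call a zero-argument pointwise generator once per requested value *)
asVectorwise[g_] := (Table[g[], #] &);

SeedRandom[5];
asVectorwise[RandomReal[100] &][3]  (* three independent draws instead of one repeated value *)
```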

If the value of the option "MaxNumberOfValues" is zero or if the value of the option "Generators" is None, then the generated dataset has only Missing values:

In[75]:=
SeedRandom[45];
{ResourceFunction["RandomTabularDataset"][{3, 4}, "MaxNumberOfValues" -> 0], ResourceFunction["RandomTabularDataset"][{3, 2}, "Generators" -> None]}
Out[72]=

If the number of rows and the number of columns are both equal to one, then the dataset has a one-dimensional form:

In[76]:=
SeedRandom[3];
ResourceFunction["RandomTabularDataset"][{1, 1}]
Out[72]=

Neat Examples (2) 

A table of random tabular datasets:

In[77]:=
SeedRandom[44];
Multicolumn[
 Table[Magnify[
   ResourceFunction["RandomTabularDataset"][4, "RowKeys" -> RandomChoice], 0.5], 8], 2]
Out[72]=

Here is a random dataset with values produced by resource functions that generate random objects:

In[78]:=
SeedRandom[3];
ResourceFunction[
 "RandomTabularDataset"][{5, {"Mondrian", "Mandala", "Haiku", "Scribble", "Maze", "Fortune"}},
 "Generators" ->
  <|
   1 -> (ResourceFunction["RandomMondrian"][] &),
   2 -> (ResourceFunction["RandomMandala"][] &),
   3 -> (ResourceFunction["RandomEnglishHaiku"][] &),
   4 -> (ResourceFunction["RandomScribble"][] &),
   5 -> (ResourceFunction["RandomMaze"][12] &),
   6 -> (ResourceFunction["RandomFortune"][] &)|>,
 "PointwiseGeneration" -> True]
Out[72]=

Publisher

Anton Antonov

Version History

  • 1.0.0 – 19 January 2021
