Wolfram Function Repository
Function Repository Resource: RandomTabularDataset
Generate a random tabular dataset
ResourceFunction["RandomTabularDataset"][{m,n}] generates a random tabular dataset with m rows and n columns.
ResourceFunction["RandomTabularDataset"][m] generates a random tabular dataset with m rows and a random number of columns.
ResourceFunction["RandomTabularDataset"][] generates a random tabular dataset with a random number of rows and columns.
"ColumnNamesGenerator" | Automatic | generator of the column names |
"Form" | "Wide" | form of the generated dataset ("Long" or "Wide") |
"Generators" | Automatic | generators of the values in each column |
"MaxNumberOfValues" | Automatic | maximum number of non-missing values |
"MinNumberOfValues" | Automatic | minimum number of non-missing values |
"RowKeys" | False | whether the rows have keys |
"PointwiseGeneration" | False | whether the generators are applied pointwise rather than vectorwise |
Generate a random tabular dataset:
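A minimal sketch of such calls. The seed and the dimensions {5, 3} are arbitrary choices made here, not values from the original notebook:

```wolfram
SeedRandom[12];  (* arbitrary seed, used only to make the result repeatable *)

(* random number of rows and columns *)
ResourceFunction["RandomTabularDataset"][]

(* 5 rows and 3 columns *)
ResourceFunction["RandomTabularDataset"][{5, 3}]
```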
Generate a random tabular dataset with specified number of rows:
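For instance, a call of the following shape fixes the number of rows and leaves the number of columns random. The row count 4 and the variable name ds are illustrative assumptions:

```wolfram
ds = ResourceFunction["RandomTabularDataset"][4]

(* number of rows of the generated dataset *)
Length[ds]
```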
Generate a random tabular dataset with specified random value generators for certain columns:
The generated dataset can be produced in long form or wide form and can have row keys. Here is a wide form dataset with row keys:
Here is the corresponding long form with row keys:
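A sketch of the wide and long calls, using the documented options "Form" and "RowKeys". The seed, the dimensions and the names wide and long are assumptions; the seed is reset before the second call so that both calls start from the same random state:

```wolfram
SeedRandom[7];  (* illustrative seed *)
wide = ResourceFunction["RandomTabularDataset"][{4, 3},
   "Form" -> "Wide", "RowKeys" -> True]

SeedRandom[7];  (* same seed as above *)
long = ResourceFunction["RandomTabularDataset"][{4, 3},
   "Form" -> "Long", "RowKeys" -> True]
```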
Generate a random tabular dataset with specified column names:
Using Identity or symbols without down values as the column name generator or as column value generators shows how the random generator functions are called. Here is an example with "pointwise" generators:
Here is an example with "vector-wise" generators:
The option "ColumnNamesGenerator" specifies a function that generates the column names:
The column names generator is applied according to the value of the option "PointwiseGeneration". Here is an example with the pointwise generator (ToString[k++]&):
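A sketch of how the generator (ToString[k++]&) might be used. The initial value of k and the dimensions are assumptions made for illustration:

```wolfram
k = 1;  (* counter that is incremented each time the names generator is called *)

ResourceFunction["RandomTabularDataset"][{3, 4},
  "ColumnNamesGenerator" -> (ToString[k++] &),
  "PointwiseGeneration" -> True]
```

Because the generator has a side effect on k, evaluating the call again without resetting k is expected to continue the numbering rather than restart it.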
If the column names generator is None, the dataset will not have column names:
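A one-line sketch of this setting, with illustrative dimensions:

```wolfram
(* no column names are generated *)
ResourceFunction["RandomTabularDataset"][{4, 3}, "ColumnNamesGenerator" -> None]
```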
Generate random datasets for which column i has the name F[i], using pointwise generation:
Here is the vectorwise generation:
The option "Form" specifies the form (format) of the generated dataset; it takes the values Automatic, RandomChoice, "Long" or "Wide":
If the option "Generators" is given the value Automatic, then the column value generators are chosen at random from functions that produce random reals, random integers and random words. The following two examples show that the generated datasets have columns of the corresponding types:
Here is a table that shows which generator is used for which column:
Specify all values to be generated by RandomInteger:
If the generators are given in a list, then that list is repeated to match all columns:
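A sketch with a two-element list of vectorwise generators applied to four columns. The particular generators and ranges are assumptions; each pure function uses only its first argument, the number of values to produce:

```wolfram
ResourceFunction["RandomTabularDataset"][{5, 4},
  "Generators" -> {
    RandomReal[100, #] &,        (* reals between 0 and 100 *)
    RandomInteger[{1, 6}, #] &   (* integers between 1 and 6 *)
  }]
```

With the list repeated as described, the third and fourth columns would be expected to reuse the first and second generators.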
Specify the values of the first column to be generated with RandomColor and the values of the second column to be generated with PoissonDistribution. The third column has values derived from the default generator:
Generators using built-in symbolic distributions can be specified in a short form. Instead of specifying column value generation with RandomVariate, just the symbolic distributions can be used.
Use NormalDistribution for both columns, first with the standard specification and next with the short form:
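The two specifications might be written as follows. The distribution parameters 100 and 5 and the dimensions are illustrative assumptions:

```wolfram
ResourceFunction["RandomTabularDataset"][{5, 2},
  "Generators" -> {
    RandomVariate[NormalDistribution[100, 5], #] &,  (* standard specification *)
    NormalDistribution[100, 5]                       (* short form *)
  }]
```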
Here is another example using a derived, mixture distribution:
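A sketch with a hypothetical mixture of two normal distributions. The weights, the component parameters and the name mix are assumptions, as is the expectation that the short form accepts derived distributions in the same way as the built-in ones:

```wolfram
(* mixture with weights 1 and 2 of two normal components *)
mix = MixtureDistribution[{1, 2},
   {NormalDistribution[0, 1], NormalDistribution[10, 1]}];

ResourceFunction["RandomTabularDataset"][{8, 2}, "Generators" -> {mix}]
```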
Use the option "MaxNumberOfValues" to specify the maximum number of (non-missing) values in the generated random dataset:
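A sketch that caps the number of non-missing values and then counts them. The cap of 20, the dimensions and the name ds are assumptions; the count relies on the default wide form with column names, where each row is an association:

```wolfram
ds = ResourceFunction["RandomTabularDataset"][{10, 5}, "MaxNumberOfValues" -> 20];

(* collect all entries and count those that are not Missing[...] *)
Count[Flatten[Values /@ Normal[ds], 1], Except[_Missing]]
```

If the option behaves as described, the count should not exceed 20.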
Use the option "MinNumberOfValues" to specify the minimum number of (non-missing) values in the generated random dataset:
The value of "MinNumberOfValues" is ignored if it is greater than "MaxNumberOfValues":
The option "RowKeys" specifies whether the generated dataset has row keys:
If the option value is Automatic then a random choice between False and True is made; False is chosen more often:
The generators can be pointwise or vectorwise; in general, pointwise generation is much slower:
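One way to compare the two modes is to time equivalent calls. The dimensions and the generators below are assumptions chosen for illustration, and actual timings depend on the machine:

```wolfram
(* vectorwise: one generator call per column returns all 2000 values *)
AbsoluteTiming[
  ResourceFunction["RandomTabularDataset"][{2000, 4},
    "Generators" -> {RandomReal[10, #] &},
    "PointwiseGeneration" -> False];]

(* pointwise: one generator call per entry returns a single value *)
AbsoluteTiming[
  ResourceFunction["RandomTabularDataset"][{2000, 4},
    "Generators" -> {RandomReal[10] &},
    "PointwiseGeneration" -> True];]
```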
A single call to a pointwise generator produces a single value:
A pointwise generator takes entry coordinates as a single argument:
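A sketch that makes the argument visible by using a symbol without definitions as the generator. The symbol f and the dimensions are assumptions:

```wolfram
ClearAll[f];  (* f has no definitions, so calls to it stay unevaluated *)

ResourceFunction["RandomTabularDataset"][{3, 2},
  "Generators" -> {f},
  "PointwiseGeneration" -> True]
```

Each entry of the result is expected to show f applied to the coordinates of that entry.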
A single call to a vectorwise generator produces a vector of values with length corresponding to the number of rows:
A vectorwise generator is a function of two arguments: the vector length and a list of entry coordinates:
The ability to generate random datasets (tabular or hierarchical) is very useful for developing and testing data wrangling, data science and machine learning algorithms.
Here we use the resource functions RecordsSummary and ParallelCoordinatesPlot:
Here is an association of random tabular datasets:
The generated datasets can be summarized with the resource function RecordsSummary:
Here is a randomly generated tabular dataset in wide form:
Here is the same dataset in long form:
The resource function CrossTabulate can be used to convert from long form to wide form:
Here we verify that the result from CrossTabulate is the same as the generated wide form (by sorting the keys in the wide form first):
RandomTabularDataset can be seen as a dataset version of the results from ProductDistribution. Here is a ProductDistribution of two independent variables:
Generate a random tabular dataset with 9000 rows and generators that correspond to the distributions given to ProductDistribution above:
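A sketch of the comparison with a hypothetical pair of independent distributions. The distributions and the names dists, sample and ds are assumptions, not the ones used in the original notebook:

```wolfram
(* hypothetical independent component distributions *)
dists = {NormalDistribution[10, 2], PoissonDistribution[5]};

(* 9000 pairs sampled from the product distribution *)
sample = RandomVariate[ProductDistribution @@ dists, 9000];

(* dataset with 9000 rows whose columns use the same distributions, in short form *)
ds = ResourceFunction["RandomTabularDataset"][{9000, 2}, "Generators" -> dists];
```

Histograms of the columns of sample and of ds could then be compared column by column.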
The resource function ExampleDataset makes datasets from ExampleData. Here is an example dataset:
Here is a similar random dataset:
If the generated (unique) column names are too few, then additional column names are generated as string forms of integers:
Using pointwise generators with "PointwiseGeneration" set to False produces constant-value columns:
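A sketch of this pitfall. The generator returns a single number per call although vectorwise generation is in effect; the generator and the dimensions are illustrative assumptions:

```wolfram
(* RandomReal[10] & yields one value, not a vector of length 4 *)
ResourceFunction["RandomTabularDataset"][{4, 3},
  "Generators" -> {RandomReal[10] &},
  "PointwiseGeneration" -> False]
```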
If the value of the option "MaxNumberOfValues" is zero or if the value of the option "Generators" is None, then the generated dataset has only Missing values:
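Both settings can be sketched as follows, with illustrative dimensions:

```wolfram
(* no non-missing values are allowed *)
ResourceFunction["RandomTabularDataset"][{3, 2}, "MaxNumberOfValues" -> 0]

(* no value generators are given *)
ResourceFunction["RandomTabularDataset"][{3, 2}, "Generators" -> None]
```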
If the number of rows and the number of columns are both equal to one, then the dataset has a one-dimensional form:
A table of random tabular datasets:
Here is a random dataset with values produced by resource functions that generate random objects:
This work is licensed under a Creative Commons Attribution 4.0 International License