# Wolfram Function Repository

Instant-use add-on functions for the Wolfram Language

Function Repository Resource: SampleRebalance

Resample data to correct for sampling bias

Contributed by:
Jon McLoone

ResourceFunction["SampleRebalance"][ takes | |

ResourceFunction["SampleRebalance"][ takes | |

ResourceFunction["SampleRebalance"][ takes |

ResourceFunction["SampleRebalance"] can be used to correct for sampling bias by resampling the data so that a specific column follows a presumed target distribution.

ResourceFunction["SampleRebalance"] assigns a weight to each record according to the PDF of the distribution evaluated at the value of the target column.
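The weighting step can be sketched in a few lines. This Python sketch is for illustration only and is not the function's source: it reads the description literally (weight equals the target PDF at the record's column value, then draw with replacement in proportion to the weights). The helper names `exponential_pdf` and `resample_by_pdf` and the toy records are hypothetical.

```python
import math
import random


def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def resample_by_pdf(records, column, pdf, n, seed=None):
    """Draw n records with replacement, weighting each by pdf(record[column])."""
    rng = random.Random(seed)
    weights = [pdf(record[column]) for record in records]
    return rng.choices(records, weights=weights, k=n)


# Hypothetical toy data: several 10-year-olds and one outlier.
records = [{"Age": a, "Score": 50 + a} for a in (5, 8, 10, 10, 10, 12, 40)]

# Favor small ages, as an exponential target with rate 0.1 would.
sample = resample_by_pdf(
    records, "Age", lambda x: exponential_pdf(x, 0.1), n=1000, seed=0
)
```

Because sampling is with replacement, the same record can appear many times in the result, and records with near-zero weight may not appear at all.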

If the target is given as a parametric distribution with some parameters left unspecified, those parameters are estimated from the target column data.
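As a rough illustration of estimating only the missing parameters, the Python sketch below fits a normal distribution while keeping any parameter that was supplied. The function name `fit_normal`, the choice of a normal family, and the simple moment-based estimates are assumptions made for the example; the source does not say which estimator the resource function uses.

```python
import math
import statistics


def fit_normal(data, mu=None, sigma=None):
    """Return a NormalDist, estimating whichever of mu/sigma is None."""
    if mu is None:
        mu = statistics.fmean(data)
    if sigma is None:
        # Root-mean-square deviation around the (possibly fixed) mean.
        sigma = math.sqrt(statistics.fmean([(x - mu) ** 2 for x in data]))
    return statistics.NormalDist(mu, sigma)


ages = [5, 8, 10, 10, 10, 12, 40]  # hypothetical target column

# Spread fixed by the user, center estimated from the data.
partly_fixed = fit_normal(ages, sigma=5.0)

# Both parameters estimated from the data.
fully_estimated = fit_normal(ages)
```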

If data is given for the target distribution, that data is used to create a SmoothKernelDistribution.
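To show what a smooth kernel estimate built from reference data looks like, here is a minimal hand-rolled Gaussian kernel density estimate in Python. It is a sketch of the general idea only: the Gaussian kernel, the rule-of-thumb bandwidth, the name `gaussian_kde`, and the reference values are assumptions, and the settings SmoothKernelDistribution actually uses may differ.

```python
import math
import statistics


def gaussian_kde(reference, bandwidth=None):
    """Return a density function estimated from the reference values."""
    n = len(reference)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an assumption for this sketch).
        bandwidth = 1.06 * statistics.stdev(reference) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def pdf(x):
        # Average of Gaussian bumps centered on each reference value.
        return norm * sum(
            math.exp(-0.5 * ((x - r) / bandwidth) ** 2) for r in reference
        )

    return pdf


reference_ages = [20, 25, 30, 35, 40]  # hypothetical representative sample
age_pdf = gaussian_kde(reference_ages)

# Records near the bulk of the reference data get larger weights.
weight_typical = age_pdf(30)
weight_unusual = age_pdf(60)
```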

A dataset:

In[1]:= |

Out[1]= |

Resample the data so that the "Age" column is uniformly distributed:

In[2]:= |

Out[2]= |

Resample the data so that the "Age" column follows ExponentialDistribution[0.1]:

In[3]:= |

Out[3]= |

A dataset:

In[4]:= |

Out[4]= |

Compute the mean of the data in the "Score" column:

In[5]:= |

Out[5]= |

The target distribution for "Age" can be given as a fully specified distribution:

In[6]:= |

Out[6]= |

One or more parameters can also be estimated from the "Age" data:

In[7]:= |

Out[7]= |

A representative sample of "Age" values can also be provided:

In[8]:= |

Out[8]= |

The column being used for rebalancing can be specified by position instead of name:

In[9]:= |

Out[9]= |

Data can be provided as a Dataset of associations, a Dataset of lists or, as here, a List of lists:

In[10]:= |

Out[10]= |

If a column contains any non-numeric data, then it can be balanced against a CategoricalDistribution:

In[11]:= |

Out[11]= |

Non-numeric reference data will be converted into a CategoricalDistribution:

In[12]:= |

Out[12]= |
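Turning non-numeric reference data into category probabilities amounts to counting relative frequencies. The Python sketch below illustrates that idea; the name `categorical_pmf` and the sample labels are hypothetical, and it does not reproduce how CategoricalDistribution is constructed internally.

```python
from collections import Counter


def categorical_pmf(reference):
    """Map each category in the reference data to its relative frequency."""
    counts = Counter(reference)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


reference_labels = ["a", "a", "b", "c"]  # hypothetical reference data
pmf = categorical_pmf(reference_labels)

# A category absent from the reference data gets probability zero.
weight_for_unseen = pmf.get("z", 0.0)
```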

In this synthetic dataset of children's test scores, it would appear that the scores are normally distributed:

In[13]:= |

Out[13]= |

However, this is because the data sample was biased toward 10-year-olds. If we assume that, across the whole population, all ages are equally represented, then the "Score" distribution appears more uniform:

In[14]:= |

Out[14]= |

If we assume that the ages of the population follow a certain ExponentialDistribution, then we see a different result:

In[15]:= |

Out[15]= |

After resampling, the values in the column used for rebalancing should tend toward the target distribution:

In[16]:= |

Out[16]= |

In[17]:= |

Out[17]= |

SampleRebalance only uses values from the original data and cannot always achieve the target reference distribution. In this example, the data contains no negative values, so the attempt to rebalance to NormalDistribution[0,5] yields data distributed more like TruncatedDistribution[{0,∞},NormalDistribution[0,5]]:

In[18]:= |

Out[18]= |
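The truncation effect follows directly from resampling existing values: no weighting can produce a value that is not in the data. The Python sketch below illustrates this with hypothetical non-negative data and a normal target centered at zero; it demonstrates the general limitation and is not the resource function's implementation.

```python
import random
from statistics import NormalDist

target = NormalDist(mu=0, sigma=5)

# Hypothetical data with no negative values.
values = list(range(0, 21))

# Weight each value by the target density, then resample with replacement.
weights = [target.pdf(v) for v in values]
rng = random.Random(1)
resampled = rng.choices(values, weights=weights, k=2000)

smallest = min(resampled)
share_negative = sum(v < 0 for v in resampled) / len(resampled)
```

A true normal distribution centered at zero places half of its mass below zero, but here `share_negative` is zero because only the original non-negative values can be drawn.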

- 1.0.0 – 11 October 2021

This work is licensed under a Creative Commons Attribution 4.0 International License