Function Repository Resource:

SampleRebalance

Resample data to correct for sampling bias

Contributed by: Jon McLoone

ResourceFunction["SampleRebalance"][data,n]

takes n samples from data weighted so that the first column of the result is uniformly distributed.

ResourceFunction["SampleRebalance"][data,n,c]

takes n samples from data weighted so that column c is uniformly distributed.

ResourceFunction["SampleRebalance"][data,n,c,dist]

takes n samples from data weighted so that column c is distributed according to dist.

Details

ResourceFunction["SampleRebalance"] can be used to correct for sampling bias by resampling the data so that a specific column of the data follows a presumed target distribution.

ResourceFunction["SampleRebalance"] assigns a weight to each record according to the PDF of the distribution evaluated at the value of the target column.
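The weighting idea can be sketched with built-in functions. The following is a minimal illustration, not the published source of SampleRebalance: the helper name importanceResample is hypothetical, and dividing the target density by a kernel estimate of the column's observed density is an assumption (a standard importance-resampling choice) rather than something this page states.

```wolfram
(* Hypothetical helper: a sketch, not the published SampleRebalance source. *)
importanceResample[rows_List, n_Integer?Positive, col_Integer, target_] :=
  Module[{values, observed, weights},
    values = rows[[All, col]];
    (* Assumption: estimate the density that the column currently follows. *)
    observed = SmoothKernelDistribution[values];
    (* Weight each record by target density over observed density. *)
    weights = (PDF[target, #]/PDF[observed, #]) & /@ values;
    (* Draw n records with replacement, in proportion to the weights. *)
    RandomChoice[weights -> rows, n]
  ]

(* Synthetic list-of-lists data: {age, score} pairs with ages clustered near 10. *)
SeedRandom[1];
rows = Table[
   With[{age = RandomVariate[NormalDistribution[10, 2]]},
    {age, 2 age + RandomReal[]}], {1000}];
resampled = importanceResample[rows, 5000, 1, UniformDistribution[{6, 14}]];
Histogram[resampled[[All, 1]]]
```

Records whose target-column value lies outside the support of the target distribution receive weight zero and are never drawn.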

If the target distribution is given as a parametric distribution with unspecified parameters, those parameters are estimated from the target column data.
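Parameter estimation of this kind can be illustrated with the built-in EstimatedDistribution. This is a sketch of the general technique. Whether SampleRebalance calls EstimatedDistribution internally is an assumption, and the variable names ages, fitted and fittedSigma are hypothetical.

```wolfram
(* Synthetic stand-in for the values of a target column. *)
SeedRandom[2];
ages = RandomVariate[NormalDistribution[10, 2], 1000];

(* Estimate both parameters from the data. *)
fitted = EstimatedDistribution[ages, NormalDistribution[\[Mu], \[Sigma]]]

(* Fix the mean and estimate only the standard deviation. *)
fittedSigma = EstimatedDistribution[ages, NormalDistribution[10, \[Sigma]]]
```

Both results are fully specified NormalDistribution objects that can be passed wherever a target distribution is expected.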

If data is given for the target distribution, that data is used to create a SmoothKernelDistribution.
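As a minimal sketch of that conversion, a reference sample can be turned into a smooth density with the built-in SmoothKernelDistribution and then queried like any other distribution. The sample below is synthetic, and the names reference and targetDist are hypothetical.

```wolfram
(* Synthetic reference sample standing in for representative target values. *)
SeedRandom[3];
reference = RandomVariate[GammaDistribution[4, 2], 500];

(* Build a kernel density estimate from the reference sample. *)
targetDist = SmoothKernelDistribution[reference];

(* The estimate supports the usual distribution functions. *)
PDF[targetDist, 8.]
```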

Examples

Basic Examples (3)

A dataset:

In[1]:=

Out[1]=

Resample the data so that the "Age" column is uniformly distributed:

In[2]:=

Out[2]=

Resample the data so that the "Age" column follows ExponentialDistribution[0.1]:

In[3]:=

Out[3]=

Scope (2)

A dataset:

In[4]:=

Out[4]=

Compute the mean of the data in the "Score" column:

In[5]:=

Out[5]=

The target distribution for "Age" can be given as a fully specified distribution:

In[6]:=

Out[6]=

One or more parameters can also be estimated from the "Age" data:

In[7]:=

ResourceFunction["SampleRebalance"][data, 1000, "Age", NormalDistribution[\[Mu], \[Sigma]]][Mean, "Score"]

Out[7]=

A representative sample of "Age" values can also be provided:

In[8]:=

Out[8]=

The column being used for rebalancing can be specified by position instead of name:

In[9]:=

Out[9]=

Data can be provided as a Dataset of associations, a Dataset of lists or, as here, a List of lists:

In[10]:=

Out[10]=

If a column contains any non-numeric data, then it can be balanced against a CategoricalDistribution:

In[11]:=

data2 = ResourceFunction["SampleRebalance"][
   {{"A", 1}, {"A", 2}, {"B", 3}, {"C", 4}}, 10000, 1,
   CategoricalDistribution[{"A", "B", "C"}]];
Counts[data2[[All, 1]]]

Out[11]=

Non-numeric reference data will be converted into a CategoricalDistribution:

In[12]:=

Out[12]=
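Rebalancing against categorical reference data can be sketched with plain counts. This is an illustration under stated assumptions, not the published implementation: the helper name categoricalResample is hypothetical, and weighting each record by the ratio of reference proportion to observed proportion is an assumed scheme.

```wolfram
(* Hypothetical helper: a sketch, not the published SampleRebalance source. *)
categoricalResample[rows_List, n_Integer?Positive, col_Integer, reference_List] :=
  Module[{labels, targetProb, observedProb, weights},
    labels = rows[[All, col]];
    (* Proportion of each category in the reference data and in the input. *)
    targetProb = N[Counts[reference]/Length[reference]];
    observedProb = N[Counts[labels]/Length[labels]];
    (* Categories missing from the reference data get weight zero. *)
    weights = Lookup[targetProb, labels, 0.]/Lookup[observedProb, labels];
    RandomChoice[weights -> rows, n]
  ]

SeedRandom[5];
rows = {{"A", 1}, {"A", 2}, {"B", 3}, {"C", 4}};
resampled = categoricalResample[rows, 10000, 1, {"A", "B", "B", "C"}];
Counts[resampled[[All, 1]]]
```

With this reference list, about half of the resampled records should fall in category "B", even though "B" makes up only a quarter of the input rows.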

Applications (3)

In this synthetic dataset of children's test scores, it would appear that the scores are normally distributed:

In[13]:=

Out[13]=

However, this is because the data sample was biased toward 10-year-olds. If we assume that, across the whole population, all ages are equally represented, then the "Score" distribution appears more uniform:

In[14]:=

Out[14]=

If we assume that the ages of the population follow a certain ExponentialDistribution, then we see a different result:

In[15]:=

Out[15]=

Properties and Relations (1)

After resampling, the column used for rebalancing should tend toward the target distribution:

In[16]:=

Out[16]=

In[17]:=

Out[17]=

Possible Issues (1)

SampleRebalance only reuses records from the original data, so it cannot always achieve the target distribution. In this example, there are no negative values, so rebalancing to NormalDistribution[0,5] yields data distributed more like TruncatedDistribution[{0,∞},NormalDistribution[0,5]]:

In[18]:=

data = Dataset[
   Table[
    age = Abs@RandomVariate[NormalDistribution[10, 2]];
    <|"Age" -> age, "Score" -> 2 age + RandomReal[]|>,
    {1000}],
   MaxItems -> 5];
ResourceFunction["SampleRebalance"][data, 10000, "Age", NormalDistribution[0, 5]][Histogram, "Age"]

Out[18]=
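One way to anticipate this limitation is to check how much of the target distribution's probability mass falls inside the range of the observed values. The helper coveredMass below is hypothetical and is only a rough diagnostic sketch; it is not part of SampleRebalance.

```wolfram
(* Hypothetical diagnostic: target probability mass within the observed range. *)
coveredMass[values_List, target_] :=
  With[{range = MinMax[values]},
    N[CDF[target, range[[2]]] - CDF[target, range[[1]]]]]

(* Ages generated as in the example above: all values are non-negative. *)
SeedRandom[4];
ages = Abs[RandomVariate[NormalDistribution[10, 2], 1000]];
coveredMass[ages, NormalDistribution[0, 5]]
```

Because every age is non-negative, the result here cannot exceed 0.5, which signals that at least half of the target distribution is unreachable by resampling alone.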

Publisher

Jon McLoone

Version History

1.0.0 – 11 October 2021

License Information

This work is licensed under a Creative Commons Attribution 4.0 International License