Function Repository Resource:

SupportSizeEstimate

Estimate the full size of a set given the number of distinct results in a sample

Contributed by: Ed Pegg Jr

ResourceFunction["SupportSizeEstimate"][samples,distincts]

estimates the size of the full population from the number of distinct values, distincts, observed among samples random draws.
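
One common way to arrive at such an estimate is the method of moments. The following is a minimal sketch under that assumption, not necessarily the method this resource function uses. If s items are drawn uniformly with replacement from n equally likely values, the expected number of distinct values is n (1 - (1 - 1/n)^s); the sketch solves that expression for n given the observed count. The name momentSupportEstimate is hypothetical.

```wolfram
(* Hypothetical helper: a method-of-moments sketch, not the repository implementation. *)
(* Assumes uniform sampling with replacement and 1 < distincts < samples. *)
momentSupportEstimate[samples_Integer, distincts_Integer] /; 1 < distincts < samples :=
 Module[{n},
  (* Expected distinct count for support size n is n (1 - (1 - 1/n)^samples). *)
  (* Start the root search at the observed count, which is a lower bound on n. *)
  n /. FindRoot[n (1 - (1 - 1/n)^samples) == distincts, {n, N[distincts]}]
 ]

(* Example call with 500 draws and 264 distinct values. *)
momentSupportEstimate[500, 264]
```

The value returned should satisfy n (1 - (1 - 1/n)^samples) ≈ distincts up to numerical tolerance. The definition deliberately does not apply when distincts equals samples, since the equation then has no finite solution.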

Examples

Basic Examples (2) 

Simulate asking five hundred people for their birthday and count the number of distinct results:

In[1]:=
Length[Union[RandomInteger[{1, 365}, {500}]]]
Out[1]=

Based on one such result (here, 264 distinct birthdays), estimate the number of days in a year:

In[2]:=
ResourceFunction["SupportSizeEstimate"][500, 264]
Out[2]=

Calculate the number of possible birthdays on Saturn (Saturn days per Saturn year), suppressing the output to keep the number secret:

In[3]:=
saturnbirthdays = Ceiling[Entity["Planet", "Saturn"][
     EntityProperty["Planet", "OrbitPeriod"]] / Entity["Planet", "Saturn"][
     EntityProperty["Planet", "RotationPeriod"]]];

Count the number of distinct results in fifty thousand random birthdays on Saturn:

In[4]:=
Length[Union[RandomInteger[{1, saturnbirthdays}, {50000}]]]
Out[4]=

With a sample size of 50,000 and 21,265 distinct results, estimate how many days per year there are on Saturn:

In[5]:=
ResourceFunction["SupportSizeEstimate"][50000, 21265]
Out[5]=

Applications (2) 

Sample sorted 4-element subsets of Range[20] and use the number of distinct results to estimate the full support size:

In[6]:=
sample = 4000;
distinct = Length[Union[Table[Sort[RandomSample[Range[20], 4]], {sample}]]];
ResourceFunction["SupportSizeEstimate"][sample, distinct]
Out[6]=

The actual answer:

In[7]:=
Binomial[20, 4]
Out[7]=

Possible Issues (2) 

Sample sorted 4-tuples and use the number of distinct results to estimate the full support size:

In[8]:=
sample = 4000;
distinct = Length[Union[Table[Sort[RandomInteger[{1, 20}, {4}]], {sample}]]];
ResourceFunction["SupportSizeEstimate"][sample, distinct]
Out[8]=

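Sorted tuples drawn this way are not uniformly distributed (tuples with repeated entries occur less often than tuples with all entries different), so the support size estimate undercounts the actual number: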

In[9]:=
Length[Union[Sort /@ Tuples[Range[20], {4}]]]
Out[9]=
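
The non-uniformity can be seen on a small case. As an illustrative sketch (the 3-value, 2-tuple case is chosen only for brevity), tallying all sorted pairs shows that a pair with two different entries arises from two orderings, while a pair with a repeated entry arises from only one:

```wolfram
(* Count how many ordered pairs collapse onto each sorted pair. *)
Tally[Sort /@ Tuples[Range[3], 2]]
(* {{{1, 1}, 1}, {{1, 2}, 2}, {{1, 3}, 2}, {{2, 2}, 1}, {{2, 3}, 2}, {{3, 3}, 1}} *)
```

Because the rarer outcomes show up less often in a sample, fewer distinct values are observed than uniform sampling would produce.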

If the number of distinct items is the same as the sample size, the sample contains no repeats and gives no basis for an upper bound on the support size, so you will need a larger sample.
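
One way to see why, assuming uniform sampling with replacement: the expected number of distinct values in s draws from n equally likely values is n (1 - (1 - 1/n)^s), which stays below s for every finite n and approaches s only as n grows without bound. A short sketch with s = 500 (an arbitrary choice for illustration; expectedDistinct is a hypothetical helper name):

```wolfram
(* Hypothetical helper: expected distinct count for 500 uniform draws from n values. *)
expectedDistinct[n_] := n (1 - (1 - 1/n)^500)

(* The expected count approaches the sample size only as n grows without bound. *)
Limit[expectedDistinct[n], n -> Infinity]
(* 500 *)

(* For finite n the expected count is strictly smaller than 500. *)
N[expectedDistinct /@ {365, 10^4, 10^6}]
```

A sample with no repeats is therefore consistent with arbitrarily large support sizes.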

Version History

  • 1.0.0 – 11 November 2019
