Function Repository Resource:

PrincipalAxisClustering

Source Notebook

Quickly cluster a point cloud by recursive separation

Contributed by: Richard Hennigan (Wolfram Research)

ResourceFunction["PrincipalAxisClustering"][{{p11,p12,},{p21,p22,},}]

recursively partitions the given points into approximately equal-sized clusters along their principal axis.

ResourceFunction["PrincipalAxisClustering"][points,n]

partitions points into at most n clusters.

Details and Options

ResourceFunction["PrincipalAxisClustering"][points] is equivalent to ResourceFunction["PrincipalAxisClustering"][points,Automatic].
If n is a power of 2, the clusters will be approximately equal-sized.
ResourceFunction["PrincipalAxisClustering"] accepts a Method option, which decides how to separate points according to their projected values on the principal axis.
The value for Method can be one of the following:
Medianseparate points into approximately equal-sized clusters
Meanseparate points at the center of mass

Examples

Basic Examples (3) 

Cluster one-dimensional data:

In[1]:=
ResourceFunction[
 "PrincipalAxisClustering"][{{1}, {2}, {10}, {12}, {3}, {1}, {13}, {25}}]
Out[1]=

Find exactly four clusters:

In[2]:=
ResourceFunction[
 "PrincipalAxisClustering"][{{1}, {2}, {10}, {12}, {3}, {1}, {13}, {25}}, 4]
Out[2]=

Cluster vectors of real values:

In[3]:=
ResourceFunction[
 "PrincipalAxisClustering"][{{2.5, 3.1}, {5.9, 3.4}, {10, 15}, {2.2, 1.5}, {100, 7.5}}]
Out[3]=

Scope (2) 

Partition a 3D point cloud:

In[4]:=
mesh = ResourceData["Stanford Bunny", "MeshRegion"]
Out[4]=
In[5]:=
ListPointPlot3D[
 ResourceFunction["PrincipalAxisClustering"][
  RandomPoint[mesh, 10000]], Sequence[
 BoxRatios -> 1, Axes -> False, Boxed -> False, PlotStyle -> PointSize[0.025]]]
Out[5]=

Cluster high-dimensional data:

In[6]:=
points = RandomReal[1, {10000, 100}];
First[AbsoluteTiming[
  ResourceFunction["PrincipalAxisClustering"][points]]]
Out[6]=

Options (3) 

Method (3) 

With the default setting of MethodMedian, clusters are likely to span across gaps:

In[7]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/b0e2a141-50ff-4385-8885-aeaf7a9fe444"]
Out[7]=
In[8]:=
c1 = ResourceFunction["PrincipalAxisClustering"][data, Method -> Median];
Show[RegionUnion @@ ConvexHullMesh /@ c1, ListPlot[c1]]
Out[8]=

Using MethodMean can improve separation:

In[9]:=
c2 = ResourceFunction["PrincipalAxisClustering"][data, Method -> Mean];
Show[RegionUnion @@ ConvexHullMesh /@ c2, ListPlot[c2]]
Out[9]=

However, Mean will tend to produce unbalanced cluster sizes compared to Median:

In[10]:=
Length /@ c1
Out[10]=
In[11]:=
Length /@ c2
Out[11]=

Applications (2) 

Downsample a large point cloud by choosing nicely spaced representative points:

In[12]:=
points = RandomPoint[ResourceData["Stanford Bunny", "MeshRegion"], 100000];
In[13]:=
Graphics3D[
 Point[Median /@ ResourceFunction["PrincipalAxisClustering"][points, 1000]]]
Out[13]=

Compare to random sampling:

In[14]:=
Graphics3D[Point[RandomSample[points, 1000]]]
Out[14]=

Properties and Relations (4) 

The axis representing each cluster separation corresponds to the first component in PrincipalComponents:

In[15]:=
points = N[Position[ImageData[\!\(\*
GraphicsBox[
TagBox[RasterBox[CompressedData["
1:eJzd2UFOw0AMBdCYFUuuwC3YsmTbigO0IlRsgkiRELcnLCmZsf//zoQQqZWq
jv8bO60StbfH193zVdd15+vpaXf4uB/Hw+f+ZnrxOJxfTkP/9DC896d+vDt+
L9tPj7fuDxxmrT1rS5om4sUJIFRtpolwdQ4IlIugwfWrgaS4Hhiu10DbEkiJ
GwCJLWrVJon/HmSKFdBUEBVbg6aDmJgBUvcKUKWtDTI3J0jhL4+5kIrgsme/
NTjnsaef98jpCB5VrXmRhHTQy4BB13NiUDDkVYKwccS9YhQIAl4hDAPnMiEz
tilvd3HS31ENrOaEF0ZBb1LzqTToNV/KRcHajyueaBwYab9Mlltfiiy23lZc
dq7Z5Iaa5MUVPjwkqYgUKYEEKXo4qYMYmeEhZJIXJfO4qJhqthaDXh4ZB9t/
NXQS1GSR8DSSA7ckkp7+/1IbMZCYKsY6SBPr3I+4DNHjLuJkMOBd5GliAXDi
aLDcjxtHgbXZBNJQsKJdvF9PiFjzXmlJJSOKzYDlVUhoVEzLjJGtuNzjCyxr
GaM=
"], {{0, 83}, {113, 0}}, {0, 1},
ColorFunction->GrayLevel],
BoxForm`ImageTag["Bit", ColorSpace -> Automatic, Interleaving -> None],
Selectable->False],
BaseStyle->"ImageGraphics",
ImageSizeRaw->{113, 83},
PlotRange->{{0, 113}, {0, 83}}]\)], 1, {2}]];
In[16]:=
ListPlot[ResourceFunction["PrincipalAxisClustering"][points, 2]]
Out[16]=
In[17]:=
ListPlot[SplitBy[SortBy[PrincipalComponents[points], First], Negative@*First]]
Out[17]=

The principal axis can be obtained from the Eigenvectors of the Covariance matrix:

In[18]:=
pa = First[Eigenvectors[Covariance[points], UpTo[1]]]
Out[18]=

Visualize the axis over the original data:

In[19]:=
ListPlot[points, Epilog -> Style[InfiniteLine[Mean[points], pa], Red]]
Out[19]=

Projecting the standardized points onto the principal axis gives scalar values that indicate which cluster they belong to:

In[20]:=
ListPlot[GatherBy[Standardize[points], Negative[# . pa] &]]
Out[20]=

PrincipalAxisClustering finds clusters very quickly:

In[21]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/8164a169-5170-4549-9667-939576c8db03"]
Out[21]=
In[22]:=
First@RepeatedTiming[
  c1 = ResourceFunction["PrincipalAxisClustering"][points, 16]]
Out[22]=

Compare to FindClusters:

In[23]:=
First@RepeatedTiming[c2 = FindClusters[points, 16]]
Out[23]=

PrincipalAxisClustering gives clusters that better represent the local point cloud density:

In[24]:=
ListPlot[c1]
Out[24]=

FindClusters represents better separation of the data:

In[25]:=
ListPlot[c2]
Out[25]=

PrincipalAxisClustering partitions points into non-overlapping convex regions:

In[26]:=
clusters = ResourceFunction["PrincipalAxisClustering"][
   RandomVariate[BinormalDistribution[0], 20000]];
In[27]:=
Show[RegionUnion @@ ConvexHullMesh /@ clusters, ListPlot[clusters], PlotRange -> 2]
Out[27]=

The number of clusters locally scales with the point cloud density:

In[28]:=
Graphics[Point[points = (circlePts[
Pattern[r, 
Blank[]], 
Pattern[d, 
Blank[]], 
Pattern[n, 
Blank[]]] := RandomPoint[
Circle[{0, 0}, r], n] + RandomVariate[
NormalDistribution[0, d], {n, 2}]; Join[
circlePts[0.5, 0.1, 5000], 
circlePts[2, 0.1, 1000]])]]
Out[28]=
In[29]:=
Graphics[Point[
  Median /@ ResourceFunction["PrincipalAxisClustering"][points, 500]]]
Out[29]=

Possible Issues (1) 

If the number of requested clusters is not a power of 2, then cluster sizes will not be well balanced:

In[30]:=
points = RandomReal[1, {100, 2}];
In[31]:=
Length /@ ResourceFunction["PrincipalAxisClustering"][points, 5]
Out[31]=
In[32]:=
Length /@ ResourceFunction["PrincipalAxisClustering"][points, 8]
Out[32]=

Neat Examples (2) 

Visualize the recursive nature of the clustering:

In[33]:=
data = RandomReal[{-1, 1}, {10^4, 2}];
In[34]:=
plots = Table[
   ListPlot[ResourceFunction["PrincipalAxisClustering"][data, 2^n], Sequence[
    AspectRatio -> 1, Frame -> True, Axes -> False, FrameTicks -> None]], {n, 0, 5}];
In[35]:=
GraphicsGrid[Partition[plots, 3]]
Out[35]=

Partition a point cloud into clusters:

In[36]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/4b531bfe-a45c-49ef-95d0-bf0b530cdd54"]
Out[36]=

Remove some outliers from each cluster:

In[37]:=
filterOutliers[cluster_, q_] :=
  Module[{m, d, t},
   m = Mean[cluster];
   d = Norm[# - m] & /@ cluster;
   t = Quantile[d, 1 - q];
   Pick[cluster, # < t & /@ d]
   ];
In[38]:=
filtered = filterOutliers[#, .3] & /@ clusters;
In[39]:=
expand[cluster_, s_] :=
  With[{m = Mean[cluster]},
   m + s (# - m) & /@ cluster
   ];

Construct a parameterized topological representation of the point cloud:

In[40]:=
Manipulate[
 Show[RegionUnion @@ (ConvexHullMesh[expand[#, e]] & /@ filtered), ListPlot[points]], {{e, 1}, 0.01, 5}]
Out[40]=

Version History

  • 1.0.0 – 03 January 2022

Related Resources

License Information