Wolfram Research

Function Repository Resource:

MultisetJaccardDissimilarity

Compute the Jaccard dissimilarity of two multisets

Contributed by: Robert B. Nachbar (Wolfram Solutions)

ResourceFunction["MultisetJaccardDissimilarity"][list1,list2]

gives the Jaccard dissimilarity between multisets list1 and list2.

ResourceFunction["MultisetJaccardDissimilarity"][assoc1,assoc2]

gives the Jaccard dissimilarity between multisets assoc1 and assoc2.

Details and Options

If list1 and list2 are considered as multisets, MultisetJaccardDissimilarity gives their dissimilarity.
list1 and list2 must have the same head, but it need not be List.
The values of assoc1 and assoc2 must be counts, that is, non-negative integers.
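ResourceFunction["MultisetJaccardDissimilarity"][A,B] is equivalent to 1 - |A ∩ B| / |A ∪ B|, where the multiset intersection and union take the minimum and maximum count of each element, respectively.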
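
As an illustration of how such a measure can be computed, here is a minimal sketch. It assumes the common multiset definition, one minus the ratio of summed minimum counts to summed maximum counts. The name multisetJaccardSketch is hypothetical, and this is not the repository implementation:

```wolfram
(* Hypothetical helper for illustration only; not the repository code. *)
(* Assumes dissimilarity = 1 - Sum[Min counts]/Sum[Max counts].        *)
multisetJaccardSketch[a_Association, b_Association] :=
 Module[{keys, ca, cb},
  keys = Union[Keys[a], Keys[b]];   (* every distinct element *)
  ca = Lookup[a, keys, 0];          (* missing elements count as 0 *)
  cb = Lookup[b, keys, 0];
  1 - Total[MapThread[Min, {ca, cb}]]/Total[MapThread[Max, {ca, cb}]]]

(* Lists are first tallied into counts. *)
multisetJaccardSketch[a_List, b_List] :=
 multisetJaccardSketch[Counts[a], Counts[b]]

multisetJaccardSketch[{"a", "c"}, {"a", "b"}]
(* 2/3 *)
```

The sketch does not validate its input or handle two empty multisets, for which the ratio is undefined.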

Examples

Basic Examples

Jaccard dissimilarity between two List multisets:

In[1]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][{"a", "c"}, {"a", "b"}]
Out[1]=

Jaccard dissimilarity between two Association multisets:

In[2]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][<|"a" -> 1, "c" -> 1, "d" -> 1|>, <|
  "a" -> 1, "b" -> 1, "d" -> 1, "e" -> 1|>]
Out[2]=

Scope

The number of elements of each distinct kind affects the result:

In[3]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][{"a", "a", "b", "c", "c", "c"}, {"a",
   "b", "c"}]
Out[3]=
In[4]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][{"a", "a", "b", "c", "c", "c"}, {"a",
   "b", "b", "b", "c"}]
Out[4]=
In[5]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][<|"a" -> 1, "b" -> 2, "c" -> 3|>, <|
  "a" -> 2, "b" -> 3, "c" -> 1|>]
Out[5]=
In[6]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][<|"a" -> 1, "b" -> 2, "c" -> 3|>, <|
  "a" -> 2, "b" -> 2, "c" -> 2|>]
Out[6]=
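
The effect of the counts can be followed by hand. The following worked sketch assumes the usual multiset definition: the intersection size is the sum of the smaller counts, the union size is the sum of the larger counts, and the dissimilarity is one minus their ratio. The values are hand computations under that assumption, not recorded outputs. Counts are listed in the order (a, b, c):

```latex
\begin{align*}
\text{In[3]: } (2,1,3) \text{ vs } (1,1,1): \quad & 1 - \frac{1+1+1}{2+1+3} = 1 - \frac{3}{6} = \frac{1}{2} \\
\text{In[4]: } (2,1,3) \text{ vs } (1,3,1): \quad & 1 - \frac{1+1+1}{2+3+3} = 1 - \frac{3}{8} = \frac{5}{8} \\
\text{In[5]: } (1,2,3) \text{ vs } (2,3,1): \quad & 1 - \frac{1+2+1}{2+3+3} = 1 - \frac{4}{8} = \frac{1}{2} \\
\text{In[6]: } (1,2,3) \text{ vs } (2,2,2): \quad & 1 - \frac{1+2+2}{2+2+3} = 1 - \frac{5}{7} = \frac{2}{7}
\end{align*}
```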

Applications

The Jaccard dissimilarity measure, sometimes called Tanimoto dissimilarity, has its origins in ecology. It was developed by Paul Jaccard as a measure of similarity of plant distribution in different regions.

The measure can be used in a number of fields, as shown by the following examples. The role of object and attribute can also be reversed, and the first application demonstrates this duality.

Ecology

Here are some ground-based animal index counts:

In[7]:=
blockData = Dataset@<|"Elephant" -> <|"Block 1" -> 16, "Block 2" -> 3, "Block 3" -> 19, "Block 4" -> 26|>, "Buffalo" -> <|"Block 1" -> 36, "Block 2" -> 15, "Block 3" -> 63, "Block 4" -> 30|>, "Sable" -> <|"Block 1" -> 8, "Block 2" -> 2, "Block 3" -> 9, "Block 4" -> 7|>, "Zebra" -> <|"Block 1" -> 22, "Block 2" -> 2, "Block 3" -> 35, "Block 4" -> 29|>, "Impala" -> <|"Block 1" -> 57, "Block 2" -> 15, "Block 3" -> 67, "Block 4" -> 89|>|>
Out[7]=

Here the different blocks (habitats) are compared using the animals present as characters. The arrangement of the table in this manner originated in the field of psychology and was later adopted by numerical taxonomy.

The MultisetJaccardDissimilarity distance matrix of the blocks:

In[8]:=
DistanceMatrix[Values@Transpose@blockData, DistanceFunction -> ResourceFunction[
   "MultisetJaccardDissimilarity"]] // Row[{MatrixForm@#, MatrixForm@N@#}, Spacer[18]] &
Out[8]=
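
A single entry of this matrix can be checked by hand. The sketch below assumes the min/max multiset definition (sum of smaller counts over sum of larger counts) and compares Block 1 with Block 2, with animals in the order Elephant, Buffalo, Sable, Zebra, Impala. Every Block 2 count is no larger than the corresponding Block 1 count, so the intersection is the Block 2 total and the union is the Block 1 total:

```latex
\begin{align*}
|B_1 \cap B_2| &= 3 + 15 + 2 + 2 + 15 = 37 \\
|B_1 \cup B_2| &= 16 + 36 + 8 + 22 + 57 = 139 \\
d(B_1, B_2) &= 1 - \frac{37}{139} = \frac{102}{139} \approx 0.734
\end{align*}
```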

A Dendrogram showing how the blocks cluster:

In[9]:=
Dendrogram[Normal@Transpose@blockData, DistanceFunction -> ResourceFunction["MultisetJaccardDissimilarity"]]
Out[9]=

The preceding analysis, which compares the columns of the table, is known as a Q-type analysis (the name derives from early factor analysis studies in psychology). Conversely, comparing the rows is known as an R-type analysis, which in this case compares the animals. Begin by transposing the data table:

In[10]:=
speciesData = Dataset@Normal@Transpose@blockData
Out[10]=

A Dendrogram shows that elephants and zebras, for example, are distributed similarly:

In[11]:=
Dendrogram[Normal@Transpose@speciesData, DistanceFunction -> ResourceFunction["MultisetJaccardDissimilarity"]]
Out[11]=

Sociology

This example compares households on a single city block using the composition of household members (head, wife, daughter, brother-in-law, etc.). The data was compiled from the 1920 US Census for the 300 block of Wyoming Ave., Buffalo, NY. The households are labeled by street number.

Load the data:

In[12]:=
data = Association[
  "306" -> {"head", "wife", "step-son"}, "312a" -> {"head", "wife", "son", "son"}, "312b" -> {"head", "wife", "son"}, "314" -> {"head", "wife", "daughter", "son", "son", "son", "son", "daughter", "son"}, "316" -> {"head", "wife", "son", "daughter", "son"}, "318" -> {"head", "wife", "son", "mother-in-law"}, "322" -> {"head", "wife", "daughter", "daughter", "niece", "niece"}, "328a" -> {"head", "son", "son", "daughter", "son"}, "328b" -> {"head", "wife", "son"}, "332" -> {"head", "wife", "daughter", "daughter", "daughter", "daughter", "daughter", "brother-in-law", "mother-in-law"}, "334" -> {"head", "wife", "daughter", "son", "daughter", "daughter", "grandson", "grandson", "grandson"}, "338" -> {"head", "wife", "son", "son", "son"}, "340" -> {"head", "wife", "son", "son", "daughter"}, "346a" -> {"head", "wife", "son", "daughter", "daughter"}, "346b" -> {"head", "wife", "son", "son", "daughter", "daughter", "son", "sister"}, "352" -> {"head", "wife", "son", "son", "son", "son", "son"}, "358" -> {"head", "wife", "daughter", "son", "daughter", "son", "daughter", "daughter", "son", "son", "son"}, "360" -> {"head", "wife", "daughter", "daughter"}, "364" -> {"head", "wife", "daughter", "mother-in-law"}, "370a" -> {"head", "wife"}, "370b" -> {"head", "wife", "son"}, "372" -> {"head", "wife", "daughter", "daughter", "daughter", "son", "son"}, "376" -> {"head", "wife", "son", "son", "daughter"}, "380" -> {"head", "wife"}];
RandomSample[data, 3]
Out[8]=
In[13]:=
Join @@ Values@data // Counts // KeySort
Out[13]=

The distance matrix using MultisetJaccardDissimilarity can be displayed with an ArrayPlot (note the presence of off-diagonal 0s colored yellow):

In[14]:=
With[{labels = Keys@data}, (\[ScriptCapitalD] = DistanceMatrix[Values@data, DistanceFunction -> ResourceFunction[
      "MultisetJaccardDissimilarity"]]) // ArrayPlot[#, FrameTicks -> {Thread[{Range@Length@labels, labels}], Thread[{Range@Length@labels, Rotate[#, 90 \[Degree]] & /@ labels}]}, ColorRules -> {0 -> Yellow}] &]
Out[14]=

The clustering of the households as shown by a Dendrogram:

In[15]:=
Dendrogram[Association@data, Right, DistanceFunction -> ResourceFunction["MultisetJaccardDissimilarity"],
  ClusterDissimilarityFunction -> "Average"]
Out[15]=

The 14 yellow-colored, off-diagonal 0s of the preceding distance matrix correspond to the 7 pairs of identical households (each pair appears twice because the matrix is symmetric). They also appear in the preceding dendrogram as leaves with no initial height. Here they are extracted as clusters:

In[16]:=
clusters = ClusteringTree[data, 0, DistanceFunction -> ResourceFunction[
     "MultisetJaccardDissimilarity"], ClusterDissimilarityFunction -> "Average"] // PropertyValue[#, "LeafLabels"] &;
In[17]:=
identicalHouseholds = Select[clusters, Length[#] > 1 &] // Values
Out[17]=

Mapping back to the original household data gives:

In[18]:=
Function[cluster, KeySelect[data, MemberQ[cluster, #] &]] /@ identicalHouseholds
Out[18]=

Because the MultisetJaccardDissimilarity is a distance metric, the distance matrix can be used to generate a set of 3D coordinates with distance geometry methods:

In[19]:=
coords = Module[{d = N@\[ScriptCapitalD], n, dSqr, sumSqr, c, g, vals,
     vecs, nDim = 3},
   n = Length@d;
   dSqr = d^2;
   sumSqr = Map[Total, LowerTriangularize[dSqr], {0, 1}];
   c = ConstantArray[Mean[dSqr] - sumSqr/n^2, n];
   g = (Transpose[c] + c - dSqr)/2;
   {vals, vecs} = Eigensystem[g];
   Transpose[
    DiagonalMatrix[Take[Sqrt[Abs[vals]], nDim]].Take[vecs, nDim]]
   ];
In[20]:=
ListPointPlot3D[MapThread[Callout, {coords, Keys@data}], PlotRangePadding -> Scaled[.10]]
Out[20]=
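
The coordinate computation in the Module above can be read as the following sequence of formulas. This is a reading of the code as written, not a statement from the original author: d_{ij} is an entry of the distance matrix, n is the number of households, and the subscript 0 denotes the centroid of the embedded points.

```latex
\begin{align*}
d_{i0}^2 &= \frac{1}{n}\sum_{j=1}^{n} d_{ij}^2 \;-\; \frac{1}{n^2}\sum_{j<k} d_{jk}^2
  && \text{(squared distance of point } i \text{ to the centroid)} \\
g_{ij} &= \tfrac{1}{2}\left(d_{i0}^2 + d_{j0}^2 - d_{ij}^2\right)
  && \text{(metric matrix, the code's } g\text{)} \\
x_{ik} &= \sqrt{|\lambda_k|}\; v_{k,i}, \quad k = 1, 2, 3
  && \text{(coordinates from the three leading eigenpairs of } g\text{)}
\end{align*}
```

Here \lambda_k and v_k are the eigenvalues and eigenvectors returned by Eigensystem. The absolute value guards against small negative eigenvalues, which can arise when the distances are not exactly embeddable in Euclidean space.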

Economics

Understanding market forces and characteristics of the competition are essential components of profitable business decisions. This example shows how shopping malls might be compared using the numbers of kinds of businesses in each. The data was collected from the websites of each mall using the self-reported number of businesses in each category. These malls are all managed by the same organization, so the categories can be assumed to be consistent.

Load the data:

In[21]:=
data = Dataset@<|
   "Men's" -> <|"The Mills at Jersey Gardens" -> 80, "Rockaway Townsquare" -> 36, "Menlo Park Mall" -> 38, "Newport Centre" -> 31, "Quaker Bridge Mall" -> 29, "Livingston Mall" -> 28|>,
   "Women's" -> <|"The Mills at Jersey Gardens" -> 90, "Rockaway Townsquare" -> 44, "Menlo Park Mall" -> 50, "Newport Centre" -> 34, "Quaker Bridge Mall" -> 41, "Livingston Mall" -> 35|>,
   "Shoes" -> <|"The Mills at Jersey Gardens" -> 85, "Rockaway Townsquare" -> 44, "Menlo Park Mall" -> 52, "Newport Centre" -> 35, "Quaker Bridge Mall" -> 41, "Livingston Mall" -> 36|>,
   "Entertainment" -> <|"The Mills at Jersey Gardens" -> 1, "Rockaway Townsquare" -> 3, "Menlo Park Mall" -> 4, "Newport Centre" -> 2, "Quaker Bridge Mall" -> 4, "Livingston Mall" -> 2|>,
   "Food" -> <|"The Mills at Jersey Gardens" -> 22, "Rockaway Townsquare" -> 23, "Menlo Park Mall" -> 35, "Newport Centre" -> 32, "Quaker Bridge Mall" -> 20, "Livingston Mall" -> 12|>|>
Out[21]=

The distance matrix using MultisetJaccardDissimilarity can be displayed with an ArrayPlot:

In[22]:=
With[{labels = Normal@Keys@Transpose@data}, DistanceMatrix[Values@Transpose@data, DistanceFunction -> ResourceFunction[
    "MultisetJaccardDissimilarity"]] // ArrayPlot[#, FrameTicks -> {Thread[{Range@Length@labels, labels}], Thread[{Range@Length@labels, Rotate[#, 90 \[Degree]] & /@ labels}]}] &]
Out[22]=

The Dendrogram shows that The Mills at Jersey Gardens is very different from the others:

In[23]:=
Dendrogram[Normal@Transpose@data, Right, DistanceFunction -> ResourceFunction["MultisetJaccardDissimilarity"],
  ClusterDissimilarityFunction -> "Average"]
Out[23]=

Chemistry

Similarity analysis of chemical structures goes back to the original work of Carhart et al. at Lederle Laboratories. It is used extensively to search through chemical databases to find compounds of interest and to cluster chemical structures into similar groups.

Here is a small set of entities from the ChemicalData collection:

In[24]:=
molNames = ChemicalData[EntityClass["Chemical", "Steroids"]]
Out[24]=

They can be converted into computable Molecule objects:

In[25]:=
mols = # -> Molecule@# & /@ molNames // Association;
Short@%
Out[23]=

These are the topological features that will characterize each chemical structure (they are not the same as those used by Carhart et al., but are easily computable with MoleculeValue):

In[26]:=
properties = {"FullAtomCount", "FullBondCount", "AliphaticCarbocycleCount", "AliphaticHeterocycleCount", "AliphaticRingCount", "AmideBondCount", "AromaticCarbocycleCount", "AromaticHeterocycleCount", "AromaticRingCount", "BridgeheadAtomCount", "HBondAcceptorCount", "HBondDonorCount", "HeteroatomCount", "HeterocycleCount", "LipinskiHBondAcceptorCount", "LipinskiHBondDonorCount", "RingCount", "RotatableBondCount", "SaturatedCarbocycleCount", "SaturatedHeterocycleCount", "SaturatedRingCount", "SpiroAtomCount", "StereocenterCount", "UnspecifiedStereocenterCount"};

Using Dataset, a searchable database can be made:

In[27]:=
database = Dataset@(DeleteCases[#, 0] &@
      AssociationThread[
       properties -> MoleculeValue[#, properties]] & /@ mols)
Out[27]=

Estradiol, for example, can be taken as the query molecule:

In[28]:=
queryMolecule = Normal@database[Entity["Chemical", "Estradiol"]]
Out[28]=
In[29]:=
hits = Query[
    Select[ResourceFunction["MultisetJaccardDissimilarity"][
         queryMolecule, #] <= 0.10 &] /* Keys]@database // Normal
Out[29]=
In[30]:=
MoleculePlot /@ (hits /. mols)
Out[30]=

Properties and Relations

The multiset Jaccard dissimilarity is bounded by 0 (identical multisets) and 1 (multisets with no elements in common):

In[31]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][{"a", "c", "c", "d"}, {"a", "c", "c",
   "d"}]
Out[31]=
In[32]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][{"a", "c", "c", "d"}, {"X", "Y", "Z"}]
Out[32]=

The result is the same as JaccardDissimilarity when the multisets are sets:

In[33]:=
ResourceFunction[
 "MultisetJaccardDissimilarity"][<|"a" -> 1, "c" -> 1, "d" -> 1|>, <|
  "a" -> 1, "b" -> 1, "d" -> 1, "e" -> 1|>]
Out[33]=
In[34]:=
JaccardDissimilarity[{1, 0, 1, 1, 0}, {1, 1, 0, 1, 1}]
Out[34]=
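
The Boolean vectors passed to JaccardDissimilarity are the membership indicators of the two sets over the combined keys. A minimal sketch of that conversion follows. The helper name indicator and the key ordering "a" through "e" are assumptions chosen to match the vectors above:

```wolfram
(* Illustrative only: key order "a".."e" is assumed to match the vectors above. *)
keys = {"a", "b", "c", "d", "e"};

(* 1 if the key is present in the association, 0 otherwise. *)
indicator[assoc_Association] := Map[Boole[KeyExistsQ[assoc, #]] &, keys]

indicator[<|"a" -> 1, "c" -> 1, "d" -> 1|>]
(* {1, 0, 1, 1, 0} *)

indicator[<|"a" -> 1, "b" -> 1, "d" -> 1, "e" -> 1|>]
(* {1, 1, 0, 1, 1} *)
```

The two sets share 2 elements ("a" and "d") out of 5 distinct elements, so the set Jaccard dissimilarity works out by hand to 1 - 2/5 = 3/5.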

The MultisetJaccardDissimilarity is a true distance metric: it obeys the triangle inequality, as the following checks illustrate:

In[35]:=
U = {"a"};
V = {"b"};
W = {"a", "b"};
ResourceFunction["MultisetJaccardDissimilarity"][U, V] <= ResourceFunction["MultisetJaccardDissimilarity"][U, W] + ResourceFunction["MultisetJaccardDissimilarity"][V, W]
Out[30]=
In[36]:=
U = RandomChoice[CharacterRange["a", "f"], 10] // Echo;
V = RandomChoice[CharacterRange["a", "f"], 10] // Echo;
W = RandomChoice[CharacterRange["a", "f"], 10] // Echo;
ResourceFunction["MultisetJaccardDissimilarity"][U, V] <= ResourceFunction["MultisetJaccardDissimilarity"][U, W] + ResourceFunction["MultisetJaccardDissimilarity"][V, W]
Out[37]=
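
The first check, with U = {"a"}, V = {"b"} and W = {"a", "b"}, can be worked by hand. This assumes the min/max multiset definition of the dissimilarity, and it shows that the triangle inequality holds with equality in this case:

```latex
\begin{align*}
d(U, V) &= 1 - \frac{0}{2} = 1 \\
d(U, W) &= 1 - \frac{1}{2} = \frac{1}{2} \\
d(V, W) &= 1 - \frac{1}{2} = \frac{1}{2} \\
d(U, V) = 1 &\le \frac{1}{2} + \frac{1}{2} = d(U, W) + d(V, W)
\end{align*}
```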

The same is true for JaccardDissimilarity:

In[38]:=
u = {1, 0};
v = {0, 1};
w = {1, 1};
JaccardDissimilarity[u, v] <= JaccardDissimilarity[u, w] + JaccardDissimilarity[v, w]
Out[41]=
In[42]:=
u = RandomInteger[1, 10] // Echo;
v = RandomInteger[1, 10] // Echo;
w = RandomInteger[1, 10] // Echo;
JaccardDissimilarity[u, v] <= JaccardDissimilarity[u, w] + JaccardDissimilarity[v, w]
Out[45]=
