Function Repository Resource:

MultisetDiceDissimilarity

Compute the Dice dissimilarity of two multisets

Contributed by: Robert B. Nachbar (Wolfram Solutions)

ResourceFunction["MultisetDiceDissimilarity"][list1,list2]

gives the Dice dissimilarity between multisets list1 and list2.

ResourceFunction["MultisetDiceDissimilarity"][assoc1,assoc2]

gives the Dice dissimilarity between multisets assoc1 and assoc2.

Details and Options

ResourceFunction["MultisetDiceDissimilarity"] treats list1 and list2 as multisets and gives their Dice dissimilarity.
list1 and list2 must have the same head, but it need not be List.
The values of assoc1 and assoc2 must be counts, that is, non-negative Integer values.
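ResourceFunction["MultisetDiceDissimilarity"][A,B] is equivalent to $1 - \frac{2\,|A \cap B|}{|A| + |B|}$, where $|A \cap B|$ sums, over the distinct elements, the smaller of the two counts.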
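
The definition can be illustrated with a minimal sketch. It assumes the usual multiset form of the Sørensen–Dice measure, 1 - 2|A ∩ B|/(|A| + |B|), with the intersection taken as the smaller count of each shared element. The name multisetDiceSketch is hypothetical and is not the repository function's source; it handles only List and Association inputs and does not guard against two empty multisets.

```wolfram
(* Hypothetical helper for illustration; not the repository function's source. *)
(* Associations map each element to its count. *)
multisetDiceSketch[a_Association, b_Association] :=
  With[{shared = Total[Merge[KeyIntersection[{a, b}], Min]]},
   1 - 2 shared/(Total[a] + Total[b])]

(* Lists are tallied first; only the List head is handled here. *)
multisetDiceSketch[a_List, b_List] :=
  multisetDiceSketch[Counts[a], Counts[b]]

multisetDiceSketch[{"a", "c"}, {"a", "b"}]
(* 1/2: one shared element, multiset sizes 2 and 2 *)
```

Converting lists to counts first keeps a single code path, so both input forms are guaranteed to agree.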

Examples

Basic Examples (2) 

Dice dissimilarity between two List multisets:

In[1]:=
ResourceFunction["MultisetDiceDissimilarity"][{"a", "c"}, {"a", "b"}]
Out[1]=

Dice dissimilarity between two Association multisets:

In[2]:=
ResourceFunction[
 "MultisetDiceDissimilarity"][<|"a" -> 1, "c" -> 1, "d" -> 1|>, <|
  "a" -> 1, "b" -> 1, "d" -> 1, "e" -> 1|>]
Out[2]=

Scope (1) 

The number of elements of each distinct kind affects the result:

In[3]:=
ResourceFunction[
 "MultisetDiceDissimilarity"][{"a", "a", "b", "c", "c", "c"}, {"a", "b", "c"}]
Out[3]=
In[4]:=
ResourceFunction[
 "MultisetDiceDissimilarity"][{"a", "a", "b", "c", "c", "c"}, {"a", "b", "b", "b", "c"}]
Out[4]=
In[5]:=
ResourceFunction[
 "MultisetDiceDissimilarity"][<|"a" -> 1, "b" -> 2, "c" -> 3|>, <|
  "a" -> 2, "b" -> 3, "c" -> 1|>]
Out[5]=
In[6]:=
ResourceFunction[
 "MultisetDiceDissimilarity"][<|"a" -> 1, "b" -> 2, "c" -> 3|>, <|
  "a" -> 2, "b" -> 2, "c" -> 2|>]
Out[6]=
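
As a worked check of the first two inputs, assuming the standard multiset form of the measure (intersection size is the sum of the smaller counts), the arithmetic goes as follows. The symbols d, A, B and B' are introduced here only for illustration.

```latex
\begin{align*}
A &= \{a,a,b,c,c,c\}, \quad B = \{a,b,c\}, \quad B' = \{a,b,b,b,c\} \\
|A \cap B| &= \min(2,1) + \min(1,1) + \min(3,1) = 3, &
d(A,B) &= 1 - \frac{2 \cdot 3}{6 + 3} = \frac{1}{3} \\
|A \cap B'| &= \min(2,1) + \min(1,3) + \min(3,1) = 3, &
d(A,B') &= 1 - \frac{2 \cdot 3}{6 + 5} = \frac{5}{11}
\end{align*}
```

The shared part is the same in both cases; only the total size changes, which is why the repeated "b" elements in the second input increase the dissimilarity.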

Applications (19) 

The Dice dissimilarity measure, sometimes called the Sørensen–Dice dissimilarity, has its origins in ecology. It was developed independently by Thorvald Sørensen and Lee R. Dice as a measure of association or similarity of plant species given their distribution in different locales.

The measure can be used in a number of fields, as shown by the following examples. The role of object and attribute can also be reversed, and the first application demonstrates this duality.

Ecology (5) 

Here are some ground-based animal index counts:

In[7]:=
blockData = Dataset@<|"Elephant" -> <|"Block 1" -> 16, "Block 2" -> 3, "Block 3" -> 19, "Block 4" -> 26|>, "Buffalo" -> <|"Block 1" -> 36, "Block 2" -> 15, "Block 3" -> 63, "Block 4" -> 30|>, "Sable" -> <|"Block 1" -> 8, "Block 2" -> 2, "Block 3" -> 9, "Block 4" -> 7|>, "Zebra" -> <|"Block 1" -> 22, "Block 2" -> 2, "Block 3" -> 35, "Block 4" -> 29|>, "Impala" -> <|"Block 1" -> 57, "Block 2" -> 15, "Block 3" -> 67, "Block 4" -> 89|>|>
Out[7]=

Here, the different blocks (habitats) are compared using the animals present as characters. The arrangement of the table in this manner originated in the field of psychology and was later adopted by numerical taxonomy.

Apply MultisetDiceDissimilarity to construct a distance matrix of the blocks:

In[8]:=
DistanceMatrix[Values@Transpose@blockData, DistanceFunction -> ResourceFunction["MultisetDiceDissimilarity"]] //
  Row[{MatrixForm@#, MatrixForm@N@#}, Spacer[18]] &
Out[8]=

A Dendrogram showing how the blocks cluster:

In[9]:=
Dendrogram[Normal@Transpose@blockData, DistanceFunction -> ResourceFunction["MultisetDiceDissimilarity"]]
Out[9]=

The preceding analysis, which compares the columns of the table, is known as a Q-type analysis (the name derives from early factor analysis studies in psychology). Conversely, comparing the rows is known as an R-type analysis, and in this case it compares the animals. Begin by transposing the data table:

In[10]:=
speciesData = Dataset@Normal@Transpose@blockData
Out[10]=

A Dendrogram shows that elephants and zebras, for example, are distributed similarly:

In[11]:=
Dendrogram[Normal@Transpose@speciesData, DistanceFunction -> ResourceFunction["MultisetDiceDissimilarity"]]
Out[11]=

Sociology (6) 

This example compares households on a single city block using the composition of household members (head, wife, daughter, brother-in-law, etc.). The data was compiled from the 1920 US Census for the 300 block of Wyoming Ave., Buffalo, NY. The households are labeled by street number.

Load the data:

In[12]:=
data = <|"306" -> {"head", "wife", "step-son"}, "312a" -> {"head", "wife", "son", "son"}, "312b" -> {"head", "wife", "son"}, "314" -> {"head", "wife", "daughter", "son", "son", "son", "son", "daughter", "son"}, "316" -> {"head", "wife", "son", "daughter", "son"}, "318" -> {"head", "wife", "son", "mother-in-law"}, "322" -> {"head", "wife", "daughter", "daughter", "niece", "niece"},
    "328a" -> {"head", "son", "son", "daughter", "son"}, "328b" -> {"head", "wife", "son"}, "332" -> {"head", "wife", "daughter", "daughter", "daughter", "daughter", "daughter", "brother-in-law", "mother-in-law"}, "334" -> {"head", "wife", "daughter", "son", "daughter", "daughter", "grandson", "grandson", "grandson"}, "338" -> {"head", "wife", "son", "son", "son"}, "340" -> {"head", "wife", "son", "son", "daughter"}, "346a" -> {"head", "wife", "son", "daughter", "daughter"}, "346b" -> {"head", "wife", "son", "son", "daughter", "daughter", "son", "sister"}, "352" -> {"head", "wife", "son", "son", "son", "son", "son"}, "358" -> {"head", "wife", "daughter", "son", "daughter", "son", "daughter", "daughter", "son", "son", "son"}, "360" -> {"head", "wife", "daughter", "daughter"}, "364" -> {"head", "wife", "daughter", "mother-in-law"}, "370a" -> {"head", "wife"}, "370b" -> {"head", "wife", "son"}, "372" -> {"head", "wife", "daughter", "daughter", "daughter", "son", "son"}, "376" -> {"head", "wife", "son", "son", "daughter"}, "380" -> {"head", "wife"}|>;
RandomSample[%, 3]
Out[8]=

The aggregate composition of the neighborhood:

In[13]:=
Join @@ Values@data // Counts // KeySort
Out[13]=

The distance matrix using MultisetDiceDissimilarity can be displayed with an ArrayPlot (note the presence of off-diagonal 0s colored yellow):

In[14]:=
With[{labels = Keys@data}, DistanceMatrix[Values@data, DistanceFunction -> ResourceFunction[
    "MultisetDiceDissimilarity"]] // ArrayPlot[#, FrameTicks -> {Thread[{Range@Length@labels, labels}], Thread[{Range@Length@labels, Rotate[#, 90 °] & /@ labels}]}, ColorRules -> {0 -> Yellow}] &]
Out[14]=

The clustering of the households as shown by a Dendrogram:

In[15]:=
clusteringTree = ClusteringTree[data, DistanceFunction -> ResourceFunction["MultisetDiceDissimilarity"], ClusterDissimilarityFunction -> "Average"];
In[16]:=
Dendrogram[clusteringTree, Right]
Out[16]=

The 14 yellow-colored, off-diagonal 0s of the preceding distance matrix represent pairs of identical households. They also appear in the preceding dendrogram as leaves with no initial height. Here they are extracted as clusters:

In[17]:=
clusters = ClusteringTree[data, 0, DistanceFunction -> ResourceFunction["MultisetDiceDissimilarity"],
     ClusterDissimilarityFunction -> "Average"] // PropertyValue[#, "LeafLabels"] &;
In[18]:=
identicalHouseholds = Select[clusters, Length[#] > 1 &] // Values
Out[18]=

Finally, mapping back to the original household data gives:

In[19]:=
Function[cluster, KeySelect[data, MemberQ[cluster, #] &]] /@ identicalHouseholds
Out[19]=
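
A dissimilarity of 0 should occur only when two households have identical counts of every member type, so the same groups can be cross-checked without any clustering. The following is a minimal sketch on a five-household subset of the data above; the name sample is introduced here for illustration only.

```wolfram
(* A five-household subset of the census data above *)
sample = <|"306" -> {"head", "wife", "step-son"},
   "312b" -> {"head", "wife", "son"},
   "328b" -> {"head", "wife", "son"},
   "370a" -> {"head", "wife"},
   "380" -> {"head", "wife"}|>;

(* Group street numbers by their sorted member counts and keep groups of two or more *)
Select[GatherBy[Keys[sample], KeySort[Counts[sample[#]]] &], Length[#] > 1 &]
(* {{"312b", "328b"}, {"370a", "380"}} *)
```

KeySort makes the comparison independent of the order in which members were listed, which matters for households such as 316 and 340 in the full data.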

Economics (3) 

Understanding market forces and the characteristics of the competition is essential to profitable business decisions. This example shows how shopping malls might be compared using the number of businesses of each kind in each mall. The data was collected from each mall's website, using the self-reported number of businesses in each category. These malls are all managed by the same organization, so the categories can be assumed to be consistent.

Load the data:

In[20]:=
data = Dataset[<|
   "Men's" -> <|"The Mills at Jersey Gardens" -> 80, "Rockaway Townsquare" -> 36, "Menlo Park Mall" -> 38, "Newport Centre" -> 31, "Quaker Bridge Mall" -> 29, "Livingston Mall" -> 28|>,
   "Women's" -> <|"The Mills at Jersey Gardens" -> 90, "Rockaway Townsquare" -> 44, "Menlo Park Mall" -> 50, "Newport Centre" -> 34, "Quaker Bridge Mall" -> 41, "Livingston Mall" -> 35|>,
   "Shoes" -> <|"The Mills at Jersey Gardens" -> 85, "Rockaway Townsquare" -> 44, "Menlo Park Mall" -> 52, "Newport Centre" -> 35, "Quaker Bridge Mall" -> 41, "Livingston Mall" -> 36|>,
   "Entertainment" -> <|"The Mills at Jersey Gardens" -> 1, "Rockaway Townsquare" -> 3, "Menlo Park Mall" -> 4, "Newport Centre" -> 2, "Quaker Bridge Mall" -> 4, "Livingston Mall" -> 2|>,
   "Food" -> <|"The Mills at Jersey Gardens" -> 22, "Rockaway Townsquare" -> 23, "Menlo Park Mall" -> 35, "Newport Centre" -> 32, "Quaker Bridge Mall" -> 20, "Livingston Mall" -> 12|>|>]
Out[20]=

The distance matrix using MultisetDiceDissimilarity can be displayed with an ArrayPlot:

In[21]:=
With[{labels = Normal@Keys@Transpose@data}, DistanceMatrix[Values@Transpose@data, DistanceFunction -> ResourceFunction[
    "MultisetDiceDissimilarity"]] // ArrayPlot[#, FrameTicks -> {Thread[{Range@Length@labels, labels}], Thread[{Range@Length@labels, Rotate[#, 90 °] & /@ labels}]}] &]
Out[21]=

The Dendrogram shows that The Mills at Jersey Gardens is very different from the others:

In[22]:=
Dendrogram[Normal@Transpose@data, Right, DistanceFunction -> ResourceFunction["MultisetDiceDissimilarity"], ClusterDissimilarityFunction -> "Average"]
Out[22]=

Chemistry (5) 

Similarity analysis of chemical structures goes back to the original work of Carhart et al. at Lederle Laboratories. It is used extensively to search through chemical databases to find compounds of interest and to cluster chemical structures into similar groups. Carhart et al. devised their own similarity measure, which is the same as 1-MultisetDiceDissimilarity.
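
Written out, and assuming the standard multiset form of the Dice measure, the corresponding similarity is shown below. The symbols S, d, a_i and b_i are introduced here for illustration, with a_i and b_i the counts of feature i in structures A and B.

```latex
S(A,B) \;=\; 1 - d(A,B) \;=\; \frac{2 \sum_i \min(a_i, b_i)}{\sum_i a_i + \sum_i b_i}
```

Under this reading, a dissimilarity cutoff such as d ≤ 0.05 corresponds to a similarity of at least 0.95.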

Here is a small set of entities from the ChemicalData collection:

In[23]:=
molNames = ChemicalData[EntityClass["Chemical", "Steroids"]]
Out[23]=

They can be converted into computable Molecule objects:

In[24]:=
mols = # -> Molecule@# & /@ molNames // Association;
Short@%
Out[22]=

These are the topological features that will characterize each chemical structure (they are not the same as those used by Carhart et al., but are easily computable with MoleculeValue):

In[25]:=
properties = {"FullAtomCount", "FullBondCount", "AliphaticCarbocycleCount", "AliphaticHeterocycleCount", "AliphaticRingCount", "AmideBondCount", "AromaticCarbocycleCount", "AromaticHeterocycleCount", "AromaticRingCount", "BridgeheadAtomCount", "HBondAcceptorCount", "HBondDonorCount", "HeteroatomCount", "HeterocycleCount", "LipinskiHBondAcceptorCount", "LipinskiHBondDonorCount", "RingCount", "RotatableBondCount", "SaturatedCarbocycleCount", "SaturatedHeterocycleCount", "SaturatedRingCount", "SpiroAtomCount", "StereocenterCount", "UnspecifiedStereocenterCount"};

Using Dataset, a searchable database can be made:

In[26]:=
database = Dataset@(DeleteCases[#, 0] &@
      AssociationThread[
       properties -> MoleculeValue[#, properties]] & /@ mols)
Out[26]=

Cortisone, for example, can be taken as the query molecule:

In[27]:=
queryMolecule = Normal@database[Entity["Chemical", "Cortisone"]]
Out[27]=
In[28]:=
hits = Query[
    Select[ResourceFunction["MultisetDiceDissimilarity"][
         queryMolecule, #] <= 0.05 &]/*Keys]@database // Normal
Out[28]=
In[29]:=
MoleculePlot /@ (hits /. mols)
Out[29]=

Properties and Relations (3) 

Dice dissimilarity is bounded by 0 and 1:

In[30]:=
ResourceFunction[
 "MultisetDiceDissimilarity"][{"a", "c", "c", "d"}, {"a", "c", "c", "d"}]
Out[30]=
In[31]:=
ResourceFunction[
 "MultisetDiceDissimilarity"][{"a", "c", "c", "d"}, {"X", "Y", "Z"}]
Out[31]=

The result is the same as DiceDissimilarity when the multisets are sets:

In[32]:=
ResourceFunction[
 "MultisetDiceDissimilarity"][<|"a" -> 1, "c" -> 1, "d" -> 1|>, <|
  "a" -> 1, "b" -> 1, "d" -> 1, "e" -> 1|>]
Out[32]=
In[33]:=
DiceDissimilarity[{1, 0, 1, 1, 0}, {1, 1, 0, 1, 1}]
Out[33]=
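
The binary vectors passed to DiceDissimilarity are membership indicators over the union of the two sets. A minimal sketch of that conversion follows; the names a, b, universe and indicator are introduced here for illustration.

```wolfram
a = {"a", "c", "d"};
b = {"a", "b", "d", "e"};

(* Sorted union of all elements: {"a", "b", "c", "d", "e"} *)
universe = Union[a, b];

(* 1 where the element of the universe belongs to the set, 0 otherwise *)
indicator[s_] := Boole[MemberQ[s, #]] & /@ universe

indicator[a]  (* {1, 0, 1, 1, 0} *)
indicator[b]  (* {1, 1, 0, 1, 1} *)

DiceDissimilarity[indicator[a], indicator[b]]
(* 3/7 *)
```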

The MultisetDiceDissimilarity is not a true distance measure, as it does not obey the triangle inequality:

In[34]:=
U = {"a"};
V = {"b"};
W = {"a", "b"};
ResourceFunction["MultisetDiceDissimilarity"][U, V] <= ResourceFunction["MultisetDiceDissimilarity"][U, W] + ResourceFunction["MultisetDiceDissimilarity"][V, W]
Out[29]=
In[35]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/b641d234-4a14-4012-8f1c-d4f3abcb342b"]
Out[38]=
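
For the explicit counterexample with U, V and W, the arithmetic can be followed by hand, assuming the standard multiset form of the measure (d denotes the dissimilarity and is notation introduced here):

```latex
\begin{align*}
d(U,V) &= 1 - \frac{2 \cdot 0}{1 + 1} = 1 \\
d(U,W) &= 1 - \frac{2 \cdot 1}{1 + 2} = \frac{1}{3}, \qquad
d(V,W) = 1 - \frac{2 \cdot 1}{1 + 2} = \frac{1}{3} \\
d(U,V) &= 1 \;>\; \frac{2}{3} = d(U,W) + d(V,W)
\end{align*}
```

Going through W is therefore "shorter" than the direct comparison of U and V, which a true metric would not allow.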

The same is true for DiceDissimilarity:

In[39]:=
u = {1, 0};
v = {0, 1};
w = {1, 1};
DiceDissimilarity[u, v] <= DiceDissimilarity[u, w] + DiceDissimilarity[v, w]
Out[42]=
In[43]:=
u = RandomInteger[1, 10] // Echo;
v = RandomInteger[1, 10] // Echo;
w = RandomInteger[1, 10] // Echo;
DiceDissimilarity[u, v] <= DiceDissimilarity[u, w] + DiceDissimilarity[v, w]
Out[46]=

Publisher

Robert Nachbar

Version History

  • 1.0.0 – 01 July 2019
