# Wolfram Function Repository

Instant-use add-on functions for the Wolfram Language

Function Repository Resource:

Compute the Sokal-Sneath dissimilarity of two multisets

Contributed by:
Robert B. Nachbar (Wolfram Solutions)

ResourceFunction["MultisetSokalSneathDissimilarity"][*list*_{1},*list*_{2}] gives the Sokal–Sneath dissimilarity between multisets *list*_{1} and *list*_{2}.

ResourceFunction["MultisetSokalSneathDissimilarity"][*assoc*_{1},*assoc*_{2}] gives the Sokal–Sneath dissimilarity between multisets *assoc*_{1} and *assoc*_{2}.

If the *list*_{i} are considered as multisets, ResourceFunction["MultisetSokalSneathDissimilarity"] gives their dissimilarity.

The *list*_{i} must have the same head, but it need not be List.

The values of *assoc*_{i} must be counts—that is, non-negative Integer values.
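A multiset written as a list can be put into count form with Counts, and a candidate association can be checked against the requirement above before it is passed in. This is a minimal sketch; `validMultisetQ` is a hypothetical helper name introduced here for illustration and is not part of the resource function.

```wl
(* Counts turns a list into an association of element -> multiplicity *)
Counts[{"a", "a", "b", "c"}]
(* <|"a" -> 2, "b" -> 1, "c" -> 1|> *)

(* Hypothetical helper: True when every value is a non-negative integer *)
validMultisetQ[assoc_Association] :=
 AllTrue[Values[assoc], IntegerQ[#] && # >= 0 &]

validMultisetQ[<|"a" -> 2, "b" -> 0|>]    (* True *)
validMultisetQ[<|"a" -> 1.5, "b" -> 1|>]  (* False *)
```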

ResourceFunction["MultisetSokalSneathDissimilarity"][*A*,*B*] is equivalent to .
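One common way to extend the binary Sokal–Sneath dissimilarity to multisets is to take the multiset intersection as the minimum count of each element and the multiset union as the maximum count, which gives 2(|A ∪ B| − |A ∩ B|)/(2|A ∪ B| − |A ∩ B|). That convention is an assumption here and is not taken from the resource function's source. The sketch below implements it under the hypothetical name `sketchSokalSneath`.

```wl
(* Hypothetical helper, not the repository implementation. *)
(* Assumption: intersection = minimum count, union = maximum count. *)
sketchSokalSneath[a_Association, b_Association] :=
 Module[{keys, ca, cb, inter, union},
  keys = Union[Keys[a], Keys[b]];
  ca = Lookup[a, keys, 0];  (* counts in a, 0 where an element is absent *)
  cb = Lookup[b, keys, 0];
  inter = Total[MapThread[Min, {ca, cb}]];  (* size of the multiset intersection *)
  union = Total[MapThread[Max, {ca, cb}]];  (* size of the multiset union *)
  2 (union - inter)/(2 union - inter)
  ]

(* List multisets are converted to counts first *)
sketchSokalSneath[a_List, b_List] := sketchSokalSneath[Counts[a], Counts[b]]

sketchSokalSneath[{"a", "a", "b", "c"}, {"a", "b", "b", "d"}]
(* intersection size 2, union size 6, so the result is 4/5 *)
```

The sketch handles only the List head and does not guard against two empty multisets, for which the ratio is undefined.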

Sokal-Sneath dissimilarity between two List multisets:

In[1]:= |

Out[1]= |

Sokal-Sneath dissimilarity between two Association multisets:

In[2]:= |

Out[2]= |

The number of elements of each distinct kind affects the result:

In[3]:= |

Out[3]= |

In[4]:= |

Out[4]= |

In[5]:= |

Out[5]= |

In[6]:= |

Out[6]= |
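To see the effect of multiplicity, one can hold the distinct elements fixed and vary only the counts. The inputs below are made-up illustrative data, not the inputs of the cells above, and outputs are deliberately not shown.

```wl
f = ResourceFunction["MultisetSokalSneathDissimilarity"];

(* Same distinct elements in both calls; only the multiplicity of "a" differs *)
f[{"a", "b", "c"}, {"a", "b", "d"}]
f[{"a", "a", "a", "b", "c"}, {"a", "b", "d"}]

(* The second comparison written with explicit counts *)
f[<|"a" -> 3, "b" -> 1, "c" -> 1|>, <|"a" -> 1, "b" -> 1, "d" -> 1|>]
```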

The Sokal-Sneath dissimilarity measure has its origins in numerical taxonomy. It was proposed by Robert R. Sokal and Peter H. A. Sneath as a measure of similarity in the same class as the Jaccard and Dice measures but with an alternative weighting of unmatched features.

The measure can be used in a number of fields, as shown by the following examples. The role of object and attribute can also be reversed, and the first application demonstrates this duality.

Here are some ground-based animal index counts:

In[7]:= |

Out[7]= |

Here the different blocks (habitats) are compared using the animals present as characters. The arrangement of the table in this manner originated in the field of psychology, and was later adopted by numerical taxonomy.

The MultisetSokalSneathDissimilarity distance matrix of the blocks:

In[8]:= |

Out[8]= |
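A distance matrix of this kind can be sketched by applying the function to every pair of columns. The block names and counts below are hypothetical placeholders, not the data used above, and the explicit Table is just one straightforward way to build the matrix.

```wl
f = ResourceFunction["MultisetSokalSneathDissimilarity"];

(* Hypothetical index counts: one association of animal -> count per block *)
blocks = <|
   "Block1" -> <|"elephant" -> 3, "zebra" -> 5, "giraffe" -> 0|>,
   "Block2" -> <|"elephant" -> 1, "zebra" -> 4, "giraffe" -> 2|>,
   "Block3" -> <|"elephant" -> 0, "zebra" -> 0, "giraffe" -> 6|>
   |>;

(* Pairwise dissimilarities between all blocks *)
dm = Table[f[x, y], {x, Values[blocks]}, {y, Values[blocks]}];

(* Label rows and columns with the block names *)
TableForm[dm, TableHeadings -> {Keys[blocks], Keys[blocks]}]
```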

A Dendrogram showing how the blocks cluster:

In[9]:= |

Out[9]= |

The preceding analysis, which compares the columns of the table, is known as a Q-type analysis (the term derives from early factor analysis studies in psychology). Conversely, comparing the rows is known as an R-type analysis, which in this case compares the animals. Begin by transposing the data table:

In[10]:= |

Out[10]= |

A Dendrogram shows that elephants and zebras, for example, are distributed similarly:

In[11]:= |

Out[11]= |

This application uses the chemical composition of flower parts from Hawaiian anthurium plants to classify different species and commercial cultivars. Load the data:

In[12]:= |

In[13]:= |

Out[13]= |

In[14]:= |

Out[14]= |

A Dendrogram shows the clustering of the species:

In[15]:= |

Out[15]= |

A feature space plot can be made from the PrincipalComponents of the distance matrix:

In[16]:= |

Out[16]= |

This example compares households on a single city block using the composition of household members (head, wife, daughter, brother-in-law, etc.). The data was compiled from the 1920 US Census for the 300 block of Wyoming Ave., Buffalo, NY. The households are labeled by street number.

Load the data:

In[17]:= |

Out[17]= |

The aggregate composition of the neighborhood:

In[18]:= |

Out[18]= |
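When each household is stored as an association of counts, an aggregate composition like this can be obtained by summing the counts per role. The households below are hypothetical placeholders rather than the census data, and Merge with Total is one possible way to do the aggregation.

```wl
(* Hypothetical households keyed by street number; values count members by role *)
households = <|
   "301" -> <|"head" -> 1, "wife" -> 1, "daughter" -> 2|>,
   "305" -> <|"head" -> 1, "wife" -> 1, "son" -> 1|>,
   "309" -> <|"head" -> 1, "brother-in-law" -> 1|>
   |>;

(* Add up the counts of each role across all households *)
Merge[Values[households], Total]
(* <|"head" -> 3, "wife" -> 2, "daughter" -> 2, "son" -> 1, "brother-in-law" -> 1|> *)
```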

The distance matrix using MultisetSokalSneathDissimilarity can be displayed with an ArrayPlot (note the presence of off-diagonal 0s colored yellow):

In[19]:= |

Out[19]= |

The clustering of the households as shown by a Dendrogram:

In[20]:= |

Out[20]= |

The 14 yellow-colored, off-diagonal 0s of the preceding distance matrix represent pairs of identical households. They also appear in the preceding dendrogram as leaves with no initial height. Here they are extracted as clusters:

In[21]:= |

In[22]:= |

Out[22]= |
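One possible way to pull out such zero-distance pairs, which is not necessarily the approach used in the cells above, is to scan the index pairs of the distance matrix and then merge the pairs into clusters. The matrix below is a hypothetical stand-in.

```wl
(* Hypothetical 4x4 distance matrix; items 1 and 3 are identical, as are 2 and 4 *)
dm = {{0, 2/3, 0, 2/3}, {2/3, 0, 2/3, 0}, {0, 2/3, 0, 2/3}, {2/3, 0, 2/3, 0}};

(* Index pairs i < j whose distance is exactly 0 *)
zeroPairs = Select[Subsets[Range[Length[dm]], {2}], Extract[dm, #] == 0 &]
(* {{1, 3}, {2, 4}} *)

(* Merge the pairs into clusters of mutually identical items *)
ConnectedComponents[Graph[UndirectedEdge @@@ zeroPairs]]
```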

Finally, mapping back to the original household data gives:

In[23]:= |

Out[23]= |

Understanding market forces and the characteristics of the competition is essential to making profitable business decisions. This example shows how shopping malls might be compared using the number of businesses of each kind in each mall. The data was collected from each mall's website, using the self-reported number of businesses in each category. These malls are all managed by the same organization, so the categories can be assumed to be consistent.

Load the data:

In[24]:= |

Out[24]= |

The distance matrix using MultisetSokalSneathDissimilarity can be displayed with an ArrayPlot:

In[25]:= |

Out[25]= |

The Dendrogram shows that The Mills at Jersey Gardens is very different from the others:

In[26]:= |

Out[26]= |

Similarity analysis of chemical structures goes back to the original work of Carhart *et al.* at Lederle Laboratories. It is used extensively to search through chemical databases to find compounds of interest and to cluster chemical structures into similar groups.

Here is a small set of entities from the ChemicalData collection:

In[27]:= |

Out[27]= |

They can be converted into computable Molecule objects:

In[28]:= |

Out[28]= |

These are the topological features that will characterize each chemical structure (they are not the same as those used by Carhart et al., but are easily computable with MoleculeValue):

In[29]:= |

Using Dataset, a searchable database can be made:

In[30]:= |

Out[30]= |

Testosterone, for example, can be taken as the query molecule:

In[31]:= |

Out[31]= |

In[32]:= |

Out[32]= |

In[33]:= |

Out[33]= |

Sokal-Sneath dissimilarity is bounded by 0 and 1:

In[34]:= |

Out[34]= |

In[35]:= |

Out[35]= |
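The two extremes can be probed with identical multisets and with multisets that share no elements. The inputs below are illustrative and are not the inputs of the cells above; the comments state what the bounds lead one to expect rather than recorded output.

```wl
f = ResourceFunction["MultisetSokalSneathDissimilarity"];

(* Identical multisets: expected to sit at the lower bound, 0 *)
f[{"a", "a", "b"}, {"a", "a", "b"}]

(* No element in common: expected to sit at the upper bound, 1 *)
f[{"a", "a", "b"}, {"c", "d", "d"}]
```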

The result is the same as SokalSneathDissimilarity when the multisets are sets:

In[36]:= |

Out[36]= |

In[37]:= |

Out[37]= |
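The built-in SokalSneathDissimilarity works on Boolean vectors, so comparing it with the multiset version requires encoding each set as a membership vector over a shared universe. The sets and the encoding below are assumptions made for illustration and are not the inputs of the cells above.

```wl
s1 = {"a", "b", "c"};
s2 = {"b", "c", "d"};

(* Encode each set as a Boolean membership vector over the combined universe *)
universe = Union[s1, s2];
u = Map[Boole[MemberQ[s1, #]] &, universe];  (* {1, 1, 1, 0} *)
v = Map[Boole[MemberQ[s2, #]] &, universe];  (* {0, 1, 1, 1} *)

(* Built-in binary measure: 2 (n10 + n01)/(n11 + 2 (n10 + n01)) *)
SokalSneathDissimilarity[u, v]

(* Multiset version applied directly to the two sets *)
ResourceFunction["MultisetSokalSneathDissimilarity"][s1, s2]
```

Here n11 = 2 and n10 + n01 = 2, so the binary formula gives 4/6 = 2/3, which the multiset version should reproduce if the equivalence stated above holds.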

- 1.0.0 – 01 July 2019

This work is licensed under a Creative Commons Attribution 4.0 International License