Function Repository Resource:

MostFrequentKStringDistance

Calculate a distance metric between two strings based on the occurrences of their k most frequent characters

Contributed by: Haomin Yang

ResourceFunction["MostFrequentKStringDistance"][str₁, str₂, k]

gives the distance between strings str₁ and str₂ based on the k most frequent characters of each string.

ResourceFunction["MostFrequentKStringDistance"][str₁, str₂, k, max]

uses max as the base distance from which similarity is subtracted.

Details

ResourceFunction["MostFrequentKStringDistance"] computes the distance by identifying the k most frequent characters in each string.

ResourceFunction["MostFrequentKStringDistance"] sums the counts of characters that appear in the top k list of both strings.

This similarity sum is subtracted from max (default 100) to return the final distance.

If k is larger than the number of distinct characters in a string, all of that string's distinct characters are used.

The comparison is case-sensitive.
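
The steps above can be sketched in a few lines of Wolfram Language. This is an illustrative sketch only, not the repository implementation: the names topK and sketchDistance are hypothetical, and the sketch assumes that the similarity sum adds, for every character shared by both top-k lists, its count in the first string and its count in the second. It also makes no attempt to match how the resource function breaks ties between equally frequent characters.

```wolfram
(* Hypothetical helper: counts of the k most frequent characters of a string. *)
(* UpTo[k] keeps every distinct character when fewer than k are available. *)
topK[str_String, k_Integer?Positive] :=
  TakeLargest[Counts[Characters[str]], UpTo[k]]

(* Hypothetical sketch of the distance: max minus the similarity sum. *)
(* Assumption: a shared character contributes its counts from both strings. *)
sketchDistance[s1_String, s2_String, k_Integer?Positive, max_ : 100] :=
  Module[{t1 = topK[s1, k], t2 = topK[s2, k], shared},
    (* Characters are compared as-is, so "A" and "a" are different keys. *)
    shared = Intersection[Keys[t1], Keys[t2]];
    max - Total[(t1[#] + t2[#]) & /@ shared]
  ]

(* "e" is the only character in both top-2 lists, and it occurs twice in each string. *)
sketchDistance["research", "seeking", 2]
(* 96 *)
```

Under the same assumption, two identical strings of ten "a" characters with k = 1 and max = 10 have a similarity sum of 20, so the sketch returns -10. This shows how the distance can become negative when the similarity sum exceeds max.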

Compute the distance between two protein-like sequences:

In[1]:=

Out[3]=

Changing k affects the calculated distance:

In[4]:=

Out[4]=

In[5]:=

Out[5]=

Comparing identical strings with high character counts gives a low distance, which becomes negative when the similarity sum exceeds max:

In[6]:=

Out[6]=

Find the "closest" string in a list based on top-1 frequency:

In[7]:=

Out[8]=

Wolfram Language 14.0 (January 2024) or above