Function Repository Resource:

StringAlign

Source Notebook

Find interleaved, common and different substrings in a pair of strings

Contributed by: Daniel Lichtblau

ResourceFunction["StringAlign"][s1,s2]

aligns characters in strings s1 and s2 so that long common consecutive substrings are preserved.

Details

ResourceFunction["StringAlign"] uses LongestCommonSubsequence to split its inputs, recursively calling itself on substring pairs that appear before and after the common substring.
String differences are kept as a list of two disjoint substrings, one of which might be an empty string.
ResourceFunction["StringAlign"] is similar to SequenceAlignment in how it presents its result. It uses a greedy algorithm rather than optimizing a string distance function. As a result it tends to be faster but can give a result with string differences that are not minimized. In practice it seems to come close to the quality of SequenceAlignment while requiring significantly less time.

Examples

Basic Examples (1) 

Find the common substrings and difference pairs for two strings:

In[1]:=
str1 = "The quick brown fox jumped over the lazy dog.";
str2 = "the slow brown dog jumped over the lazy fox??!!";
diffsimp = ResourceFunction["StringAlign"][str1, str2]
Out[3]=

Scope (3) 

Create a pseudorandom string with several hundred characters and make random substitutions to form a second string:

In[4]:=
SeedRandom[1234];
atoz = Flatten@{ToCharacterCode["a"], ToCharacterCode["z"]};
str1 = StringJoin[FromCharacterCode[RandomInteger[atoz, 400]]];
rndposns = RandomSample[Range[397], 6];
rndsubstrings = Map[StringJoin[FromCharacterCode[#]] &, RandomInteger[atoz, {6, 4}]];
str2 = str1;
Do[rpos = rndposns[[k]];
  str2 = StringReplacePart[str2, rndsubstrings[[k]],
    {rpos, rpos + 3}];
  , {k, 6}];

Compute the alignment of these two strings:

In[5]:=
ResourceFunction["StringAlign"][str1, str2]
Out[5]=

SequenceAlignment gives a similar but not identical result:

In[6]:=
SequenceAlignment[str1, str2]
Out[6]=

Properties and Relations (2) 

Create a pseudorandom string with two hundred characters and make random substitutions to form a second string:

In[7]:=
SeedRandom[1234];
atoz = Flatten@{ToCharacterCode["a"], ToCharacterCode["z"]};
str1 = StringJoin[FromCharacterCode[RandomInteger[atoz, 200]]];
rndposns = RandomSample[Range[197], 6];
rndsubstrings = Map[StringJoin[FromCharacterCode[#]] &, RandomInteger[atoz, {6, 4}]];
str2 = str1;
Do[rpos = rndposns[[k]];
  str2 = StringReplacePart[str2, rndsubstrings[[k]],
    {rpos, rpos + 3}];
  , {k, 6}];

Compute the alignment of these two strings:

In[8]:=
ResourceFunction["StringAlign"][str1, str2]
Out[8]=

SequenceAlignment gives a result that is similar but not identical:

In[9]:=
SequenceAlignment[str1, str2]
Out[9]=

Create a pair of strings of 30000 characters, with 10 alterations of four characters each:

In[10]:=
SeedRandom[1234];
atoz = Flatten@{ToCharacterCode["a"], ToCharacterCode["z"]};
len = 30000;
str1 = StringJoin[FromCharacterCode[RandomInteger[atoz, len]]];
changes = 10;
clen = 4;
rndposns = RandomSample[Range[len - clen + 1], changes];
rndsubstrings = Map[StringJoin[FromCharacterCode[#]] &, RandomInteger[atoz, {changes, clen}]];
str2 = str1;
Do[rpos = rndposns[[k]];
  str2 = StringReplacePart[str2, rndsubstrings[[k]],
    {rpos, rpos + clen - 1}];
  , {k, changes}];

SequenceAlignment takes several seconds to find the optimal alignments:

In[11]:=
Timing[align1 = SequenceAlignment[str1, str2];]
Out[11]=

StringAlign is several times faster:

In[12]:=
Timing[align2 = ResourceFunction["StringAlign"][str1, str2];]
Out[12]=

The byte sizes of the results are similar and both are near the common string length, indicating both do a good job of compressing the string differences:

In[13]:=
N@{ByteCount[align1], ByteCount[align2]}/StringLength[str1]
Out[13]=

Requirements

Wolfram Language 12.3 (May 2021) or above

Version History

  • 1.0.0 – 20 December 2023

Related Resources

License Information