Function Repository Resource:

AlignNearlyIdenticalSequences

Align sequences known to be nearly identical

Contributed by: John Cassel, Wolfram|Alpha Scientific Content

ResourceFunction["AlignNearlyIdenticalSequences"][seq1,seq2]

creates an alignment between two strings or biomolecular sequences seq1 and seq2.

Details and Options

The "UniqueSubsequenceLength" option specifies the length at which every contiguous subsequence is assumed to differ from all other contiguous subsequences of that same length. The default value is 1000.
The result of this function is an alignment given as a list of successive matching and differing sequences. See SequenceAlignment for further documentation and examples.
The alignment generated by this method may not be optimal, though it should always be correct if the subsequence length parameter is set appropriately.
This function is intended for highly similar sequences, such as those typically found in organisms of the same species.
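
The uniqueness assumption behind "UniqueSubsequenceLength" can be sanity-checked on small inputs. The following is a minimal sketch and not part of the resource function: the helper name uniqueAtLengthQ is hypothetical, and it reads the option as requiring that no window of the given length occurs twice within a sequence, which is one interpretation of the description above. It uses only the built-in StringPartition and DuplicateFreeQ on plain strings.

```wolfram
(* Hypothetical helper (an assumption, not the function's internals): *)
(* gives True if no contiguous substring of length k occurs twice in s *)
uniqueAtLengthQ[s_String, k_Integer?Positive] :=
  DuplicateFreeQ[StringPartition[s, k, 1]]

(* Check both inputs of the basic example at window length 5 *)
uniqueAtLengthQ[#, 5] & /@ {"axxxxxbyyyyy", "cxxxxdyyyy"}

(* The same check at window length 2 *)
uniqueAtLengthQ[#, 2] & /@ {"axxxxxbyyyyy", "cxxxxdyyyy"}
```

The first check evaluates to {True, True} and the second to {False, False}, since "xx" repeats in both strings. This sketch materializes every window, so it is meant for short inputs rather than genome-sized sequences.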

Examples

Basic Examples (1) 

Find an alignment for nearly identical sequences:

In[1]:=
ResourceFunction["AlignNearlyIdenticalSequences"][
 "axxxxxbyyyyy", "cxxxxdyyyy", "UniqueSubsequenceLength" -> 5]
Out[1]=

Scope (1) 

This function is suitable for aligning biomolecular sequences:

In[2]:=
ResourceFunction["AlignNearlyIdenticalSequences"][
 BioSequence["DNA", "GATCGC"], BioSequence["DNA", "GATAGC"]]
Out[2]=

Properties and Relations (1) 

This function can be used with the AlignmentToPositionDifferences resource function to produce a manageable list of differences between very large but nearly identical sequences:

In[3]:=
originalSeq = RandomInstance[BioSequence["DNA", 100000]];
alteredSequence = StringReplacePart[
   originalSeq, {"A", "C", "G", "T"}, {{19998, 19998}, {24997, 24997}, {50003, 50003}, {75001, 75001}}];
ResourceFunction["AlignmentToPositionDifferences"][
 ResourceFunction["AlignNearlyIdenticalSequences"][originalSeq, alteredSequence]]
Out[3]=

Neat Examples (2) 

Use this alignment to compare a variant of the SARS-CoV-2 coronavirus with the reference sequence:

In[4]:=
referenceSARSCoV2Seq = ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus", "ReferenceBioSequence"];
otherSARSCoV2Seq = BioSequence["DNA", ResourceFunction["ImportFASTA"]["MW850352"][[2, 1]]];
AbsoluteTiming[
  alignment = ResourceFunction["AlignNearlyIdenticalSequences"][
    referenceSARSCoV2Seq, otherSARSCoV2Seq]][[1]]
Out[4]=
In[5]:=
ResourceFunction["AlignmentToPositionDifferences"][alignment]
Out[5]=

When we cannot assume that these sequences are nearly identical, we must spend more time to ensure an optimal alignment:

In[6]:=
AbsoluteTiming[
  alignment2 = SequenceAlignment[referenceSARSCoV2Seq, otherSARSCoV2Seq]][[1]]
Out[6]=
In[7]:=
ResourceFunction["AlignmentToPositionDifferences"][alignment2]
Out[7]=
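
Whichever method produced it, an alignment in the SequenceAlignment format should reproduce both inputs when its segments are joined back together. The following is a minimal sketch of that check on plain strings, using the built-in SequenceAlignment for illustration. The helper name rebuildInputs is hypothetical, and the sketch assumes the alignment consists of strings and pairs of strings; alignments of BioSequence objects may need to be converted to strings first.

```wolfram
(* Hypothetical helper: rebuild both original strings from an alignment *)
(* given as a list of shared strings and {first, second} pairs of differing strings *)
rebuildInputs[alignment_List] := {
  StringJoin[Replace[alignment, {a_String, _String} :> a, {1}]],
  StringJoin[Replace[alignment, {_String, b_String} :> b, {1}]]
 }

s1 = "GATCGC";
s2 = "GATAGC";

(* True when the alignment is consistent with the two inputs *)
rebuildInputs[SequenceAlignment[s1, s2]] === {s1, s2}
```

Replace is restricted to level 1 so that only the top-level pairs of differing segments are rewritten; shared segments pass through unchanged.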

Version History

  • 1.0.0 – 13 April 2021
