Function Repository Resource:

SequenceOverlapFraction

Compute the overlap fraction between two strings or biosequences

Contributed by: Soutick Saha

ResourceFunction["SequenceOverlapFraction"][seq1, seq2]

returns a list of two fractions: the total length of the overlap between seq1 and seq2, divided by the length of seq1 and by the length of seq2, respectively.

Details and Options

seq1 and seq2 must both be either BioSequence or String expressions.
ResourceFunction["SequenceOverlapFraction"] performs a sequence alignment to calculate the total length of the elements common to seq1 and seq2. That total is then divided by the length of seq1 and by the length of seq2 to obtain the two overlap fractions.
The first and second elements of the returned list correspond to seq1 and seq2, respectively.
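
The calculation described above can be sketched for plain strings as follows. This is only an illustration under stated assumptions: overlapFractionSketch is a hypothetical helper name, not the source of the resource function, and it assumes both inputs are non-empty strings aligned with the default settings of SequenceAlignment.

```wolfram
(* Hypothetical helper for illustration only; not the resource function's actual source. *)
(* Assumes s1 and s2 are non-empty strings. *)
overlapFractionSketch[s1_String, s2_String] :=
 Module[{alignment, common, overlap},
  (* Align the strings; shared segments appear as strings, differing segments as pairs *)
  alignment = SequenceAlignment[s1, s2];
  (* Keep only the top-level strings, which are the segments common to both inputs *)
  common = Cases[alignment, _String];
  (* Total length of the shared segments *)
  overlap = Total[StringLength /@ common];
  (* Normalize by each input length: first entry for s1, second for s2 *)
  N[overlap/{StringLength[s1], StringLength[s2]}]
 ]

overlapFractionSketch["abcXabcXabc", "abcYabcYabcKXK"]
```

In this call the first entry of the result is normalized by the length of the first string (11 characters) and the second entry by the length of the second string (14 characters), so the two fractions generally differ when the inputs have different lengths.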

Examples

Basic Examples (2) 

Compute the overlap fraction between two strings:

In[1]:=
ResourceFunction[
 "SequenceOverlapFraction"]["abcXabcXabc", "abcYabcYabcKXK"]
Out[1]=

Get the overlap fraction between two BioSequences:

In[2]:=
ResourceFunction["SequenceOverlapFraction"][BioSequence[
 "Peptide", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS", {}], BioSequence[
 "Peptide", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLR", {}]]
Out[2]=

Properties and Relations (7) 

The two overlap fractions differ when the sequences have different lengths. The following steps reproduce the calculation by hand:

Align the two sequences with SequenceAlignment:

In[3]:=
seqOverlap = SequenceAlignment[BioSequence[
  "Peptide", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS", {}], BioSequence[
  "Peptide", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLR", {}]]
Out[3]=

Extract the common elements of the alignment:

In[4]:=
commonElements = Cases[seqOverlap, _String]
Out[4]=

Find the total length of the common elements:

In[5]:=
overlapLength = Total@Map[StringLength, commonElements]
Out[5]=

Find the lengths of the input sequences:

In[6]:=
seqLengths = {BioSequence[
   "Peptide", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS", {}][
   "SequenceLength"], BioSequence[
   "Peptide", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLR", {}]["SequenceLength"]}
Out[6]=

Divide the overlap length by the sequence lengths to obtain the overlap fractions:

In[7]:=
N@(overlapLength/#) & /@ seqLengths
Out[7]=

SequenceOverlapFraction gives the same result directly:

In[8]:=
ResourceFunction["SequenceOverlapFraction"][BioSequence[
 "Peptide", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS", {}], BioSequence[
 "Peptide", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLR", {}]]

Publisher

WolframChemistry

Requirements

Wolfram Language 14.0 (January 2024) or above

Version History

  • 1.0.0 – 28 August 2024
