Function Repository Resource:

BioSequenceToRegularExpression


Construct a regular expression from a BioSequence with degenerate letters

Contributed by: Jan Mangaldan

ResourceFunction["BioSequenceToRegularExpression"][bioseq]

constructs a RegularExpression that is functionally equivalent to the biomolecular sequence bioseq, for purposes of pattern matching.

Details

With the default option setting "IncludeDegenerateLetters" -> True, ResourceFunction["BioSequenceToRegularExpression"] includes degenerate letters in the regular expression returned.
Setting the option "IncludeDegenerateLetters" -> False makes ResourceFunction["BioSequenceToRegularExpression"] include only unambiguous letters.
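One way to picture the conversion is as a letter-by-letter substitution in which each IUPAC degenerate nucleotide code becomes a character class. The following is a minimal sketch under that assumption, not the resource function's actual source. The names iupacDNA, letterToClass and sketchDNARegex are hypothetical, the sketch handles plain DNA strings only, and each class contains only unambiguous letters, in the spirit of "IncludeDegenerateLetters" -> False.

```wolfram
(* Hypothetical sketch; not the resource function's actual source. *)
(* IUPAC nucleotide codes mapped to the unambiguous bases they stand for. *)
iupacDNA = <|
   "R" -> "AG", "Y" -> "CT", "K" -> "GT", "M" -> "AC",
   "S" -> "CG", "W" -> "AT", "B" -> "CGT", "D" -> "AGT",
   "H" -> "ACT", "V" -> "ACG", "N" -> "ACGT"|>;

(* A degenerate code becomes a character class; any other letter is kept as is. *)
letterToClass[ch_String] :=
  If[KeyExistsQ[iupacDNA, ch], "[" <> iupacDNA[ch] <> "]", ch];

(* sketchDNARegex is a made-up name used only for this illustration. *)
sketchDNARegex[seq_String] :=
  RegularExpression[StringJoin[letterToClass /@ Characters[seq]]];

sketchDNARegex["ACWTMAN"]
(* RegularExpression["AC[AT]T[AC]A[ACGT]"] *)
```

The resource function itself takes a BioSequence rather than a string and also covers other sequence types, so this sketch is only meant to make the idea of "equivalent for pattern matching" concrete.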

Examples

Basic Examples (1) 

Convert a DNA strand with degenerate letters to an equivalent regular expression:

In[1]:=
ResourceFunction["BioSequenceToRegularExpression"][
 BioSequence["DNA", "ACWTMAN"]]
Out[1]=

Scope (2) 

A biomolecular sequence with degenerate letters:

In[2]:=
bs = BioSequence["DNA", "GGY"]
Out[2]=

Convert it to a regular expression:

In[3]:=
regex = ResourceFunction["BioSequenceToRegularExpression"][bs]
Out[3]=

The regular expression can be used in string functions like StringCases:

In[4]:=
StringCases["CTGGCGGTGGYGG", regex]
Out[4]=

This is equivalent to using StringCases directly on BioSequence objects:

In[5]:=
StringCases[BioSequence["DNA", "CTGGCGGTGGYGG"], bs]
Out[5]=
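Any RegularExpression object can also be passed to other built-in string functions such as StringPosition, StringCount and StringMatchQ. The sketch below uses a hand-written expression, handRegex, as a stand-in; it is an assumption for illustration and not necessarily the exact expression the resource function returns for "GGY".

```wolfram
(* handRegex is a hand-written stand-in, not the output of the resource function. *)
handRegex = RegularExpression["GG[CT]"];

(* Positions of the matches in the target string. *)
StringPosition["CTGGCGGTGGYGG", handRegex]
(* {{3, 5}, {6, 8}} *)

(* Number of matches. *)
StringCount["CTGGCGGTGGYGG", handRegex]
(* 2 *)

(* Test whether a whole string matches. *)
StringMatchQ["GGT", handRegex]
(* True *)
```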

Convert a peptide sequence with degenerate letters:

In[6]:=
ResourceFunction["BioSequenceToRegularExpression"][
 BioSequence["Peptide", "CVWXKPRJSTBEGHZ"]]
Out[6]=

Options (2) 

IncludeDegenerateLetters (2) 

With the setting "IncludeDegenerateLetters" -> True, BioSequenceToRegularExpression includes degenerate letters in the regular expression returned:

In[7]:=
ResourceFunction["BioSequenceToRegularExpression"][
 BioSequence["DNA", "ACGTB"], "IncludeDegenerateLetters" -> True]
Out[7]=

With the setting "IncludeDegenerateLetters" -> False, degenerate letters are excluded:

In[8]:=
ResourceFunction["BioSequenceToRegularExpression"][
 BioSequence["DNA", "ACGTB"], "IncludeDegenerateLetters" -> False]
Out[8]=
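The practical difference between the two settings shows up when the text being searched itself contains degenerate letters. The sketch below uses two hand-written expressions, strict and lenient, whose character classes are assumptions chosen for illustration; the classes actually produced by the resource function may contain additional letters.

```wolfram
(* Hand-written stand-ins; the real classes may differ. *)
strict = RegularExpression["ACGT[CGT]"];    (* only unambiguous letters in the class *)
lenient = RegularExpression["ACGT[CGTB]"];  (* class also admits the degenerate letter B *)

(* A target ending in the literal letter B matches only the lenient form. *)
StringMatchQ[{"ACGTC", "ACGTB"}, strict]
(* {True, False} *)

StringMatchQ[{"ACGTC", "ACGTB"}, lenient]
(* {True, True} *)
```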

Possible Issues (2) 

If the input biomolecular sequence does not contain degenerate letters, BioSequenceToRegularExpression returns a trivial regular expression that matches only the literal sequence:

In[9]:=
ResourceFunction["BioSequenceToRegularExpression"][
 BioSequence["DNA", "GATC"]]
Out[9]=
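To tell in advance whether a DNA string contains any degenerate letters, a simple check along the following lines may be enough. The helper name degenerateDNAQ is hypothetical, and the letter set assumes the standard IUPAC nucleotide codes.

```wolfram
(* Hypothetical helper: True if the string contains any IUPAC degenerate DNA code. *)
degenerateDNAQ[seq_String] :=
  ! StringFreeQ[seq, RegularExpression["[RYKMSWBDHVN]"]];

degenerateDNAQ /@ {"GATC", "GGY"}
(* {False, True} *)
```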

Hybrid strands and sequence collections are not supported:

In[10]:=
ResourceFunction["BioSequenceToRegularExpression"][
 BioSequence[{BioSequence["DNA", "ACCT"], BioSequence["RNA", "AGGUC"]}]]
Out[10]=

Version History

  • 1.0.0 – 23 February 2022

Author Notes

This is a Mathematica implementation of the MATLAB function seq2regexp.
