Function Repository Resource:

BLASTSearch

Analyze biological sequence similarity using the Basic Local Alignment Search Tool (BLAST)

Contributed by: Keiko Hirayama

ResourceFunction["BLASTSearch"][query]

performs a genomic sequence similarity search for the given sequence query.

Details and Options

BLAST (Basic Local Alignment Search Tool) is a tool provided by the NCBI (National Center for Biotechnology Information) for aligning query sequences against those present in a selected target database.
The query can be a raw nucleotide or protein sequence, a FASTA-formatted sequence, a GI (GenInfo) identifier or an accession number for a nucleotide or protein sequence.
The following options can be given:
"Program"    "blastn"    BLAST program to access, including "blastn", "blastp", "blastx", "tblastn", "tblastx" and "megablast"
"Database"    "core_nt"    BLAST database to access, such as "core_nt" and "swissprot"
"Filter"    "mL"    masking of regions of low compositional complexity that may cause spurious or misleading results; "F" to disable; "T" or "L" to enable; prepend "m" to mask the query while producing seeds used to scan the database, but not for extensions (e.g. "mL"); defaults: blastn: "mL"; blastp, tblastn, blastx: "F"; tblastx: "L"
"ExpectThreshold"    10    expected number of chance matches in a random model
"RewardPenaltyScores"    {2, -3}    pair of reward and penalty scores for matching and mismatching bases; applicable to blastn and megablast only; allowed combinations: {1, -2}, {1, -3}, {2, -3}, {1, -4}, {4, -5}, {1, -1}; defaults: blastn: {2, -3}; megablast: {1, -2}
"GapCosts"    {5, 2}    pair of positive integers giving the costs to create and extend a gap in an alignment; applicable to blastn, blastp, blastx and tblastn only; allowed pairs: blastn: {4, 4}, {2, 4}, {0, 4}, {3, 3}, {6, 2}, {5, 2}, {4, 2}, {2, 2}; blastp, blastx, tblastn: {11, 2}, {10, 2}, {9, 2}, {8, 2}, {7, 2}, {6, 2}, {13, 1}, {12, 1}, {11, 1}, {10, 1}, {9, 1}; defaults: blastn: {5, 2}; blastp, tblastn, blastx: {11, 2}
"WordSize"    11    length of the seed for initial matches; allowed values: blastn: 7, 11, 15; megablast: 16, 20, 24, 28, 32, 48, 64; blastp: 3, 5, 6; tblastn, blastx: 2, 3, 5, 6; tblastx: 2, 3; defaults: blastn: 11; megablast: 28; blastp, tblastx: 3; tblastn, blastx: 5
"Matrix"    "BLOSUM62"    scoring matrix name; applicable to blastp, blastx, tblastn and tblastx only; allowed values: "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90", "PAM250", "PAM30" or "PAM70"
"CompositionBasedStatistics"    2    composition-based statistics algorithm to use; applicable to blastp, blastx, tblastn and tblastx only; allowed values: 0, 1, 2 or 3
"ShortQueryAdjust"    False    whether to automatically adjust parameters for input sequences shorter than 30 bases/residues to improve results; applicable to blastn and blastp only
"Species"    All    taxon to include in the search
MaxItems    100    maximum number of aligned sequences to keep
TimeConstraint    Infinity    maximum computation time in seconds
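
As a rough sketch of how the protein-specific options above fit together, the following runs a "blastp" search against "swissprot". The amino-acid string and the name proteinQuery are illustrative assumptions, not taken from the original examples; the option values are ones listed above as allowed for blastp. The call contacts the NCBI service, so no output is shown here.

```wolfram
(* Illustrative amino-acid sequence (an assumption); any raw protein sequence could be used *)
proteinQuery = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV";

(* Protein-protein search; options are passed as a list, as in the examples on this page *)
ResourceFunction["BLASTSearch"][proteinQuery,
 {"Program" -> "blastp",        (* protein query against a protein database *)
  "Database" -> "swissprot",
  "Matrix" -> "BLOSUM62",       (* applies to blastp, blastx, tblastn and tblastx only *)
  "GapCosts" -> {11, 1},        (* one of the pairs listed for blastp *)
  MaxItems -> 5}]
```

"RewardPenaltyScores" is left out because it is documented as applying to blastn and megablast only.
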
The query result is a Dataset containing details of closely aligned nucleotide or protein sequences with the following properties:
Description    short description of the database sequence
RefSeqAccession    unique accession number assigned to the database sequence
TaxonID    NCBI taxonomy identifier associated with the database sequence
ScientificName    scientific name of the organism associated with the database sequence
Length    length of the database sequence
StartPosition    start position of the aligned sequence
EndPosition    end position of the aligned sequence
NumberOfMatches    total number of sequence overlaps
NumberOfGapOpenings    total number of gap openings
Identity    percentage of nucleotides or amino acids that are identical between the aligned query and database sequence
Score    alignment score
EValue    number of hits or alignments expected to be seen by random chance with the same score or better
Sequence    aligned database sequence
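
Because the result is a Dataset, the properties listed above can be used with ordinary Dataset queries. The sketch below filters and sorts hits by "EValue"; the names hits and strong and the 10^-5 cutoff are hypothetical, and it assumes the "EValue" column holds numeric values. No output is shown, since the result depends on the current state of the NCBI database.

```wolfram
(* Run a search; the query reuses the sequence from the first basic example *)
hits = ResourceFunction["BLASTSearch"][
   "GCTAGGCCTGAGTCAGCATAGGTTGCTGGCCTTGGTGGGTGTTCTGAGGCTCTACCTGCTCCCCTCGGAA",
   {MaxItems -> 10}];

(* Keep rows below an arbitrary E-value cutoff; assumes "EValue" is numeric *)
strong = hits[Select[#EValue < 10^-5 &]];

(* Sort the remaining rows by E-value and keep a few of the documented columns *)
strong[SortBy[#EValue &],
 {"ScientificName", "RefSeqAccession", "Identity", "Score", "EValue"}]
```

If "EValue" turns out to be stored as a string, the comparison would not evaluate to True and the filter would return no rows, so the column type is worth checking first.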

Examples

Basic Examples (2) 

Perform a sequence similarity search for a nucleotide sequence:

In[1]:=
ResourceFunction[
 "BLASTSearch"]["GCTAGGCCTGAGTCAGCATAGGTTGCTGGCCTTGGTGGGTGTTCTGAGGCTCTACCTGCTCCCCTCGGAA", {MaxItems -> 10}]
Out[1]=

Specify the program, gap costs, reward/penalty scores and species for the sequence similarity search:

In[2]:=
ResourceFunction["BLASTSearch"][
 "TGAGTTTTTCTTAGGCAAGTAAGTGGCTTGGGACTTCGGGAGACAACCTTGTCAAGCACCTAATTGTGCC",
 {"Program" -> "megablast",
  "GapCosts" -> {0, 4},
  "RewardPenaltyScores" -> {2, -3},
  "Species" -> Entity["TaxonomicSpecies", "MusMusculus::y84t7"]}]
Out[2]=

Scope (2) 

Perform a sequence similarity search for a nucleotide sequence:

In[3]:=
query = "CTCAAAAGTCTAGAAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCA";
seqset = ResourceFunction["BLASTSearch"][query, {MaxItems -> 10}]
Out[4]=

Use the resource function DNAAlignmentPlot to visualize the alignment:

In[5]:=
ResourceFunction["DNAAlignmentPlot"][query, Normal@seqset[1, StringReplace[#Sequence, "-" -> ""] &]]
Out[5]=

Requirements

Wolfram Language 13.0 (December 2021) or above

Version History

  • 1.0.1 – 05 February 2025
  • 1.0.0 – 19 December 2024
