Function Repository Resource:

ImportFASTA

Import FASTA data from the NCBI

Contributed by: Brendan Elli and Keiko Hirayama

ResourceFunction["ImportFASTA"][seqref,database]

imports FASTA data for the specified seqref from the NCBI database and returns its "Header" and "Sequence" elements combined into a list.

ResourceFunction["ImportFASTA"][seqref,database,format]

imports FASTA data for the specified seqref from the NCBI database and converts it to the specified format.

Details

"Nucleotide" and "Protein" are the supported values for database.
FASTA-formatted nucleotide and protein sequences are retrieved from the databases provided by the NCBI (National Center for Biotechnology Information).
Nucleotide sequences can be queried by their NCBI Nucleotide Reference Sequence accession number, GenBank database nucleotide sequence accession number or RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank) accession number.
Protein sequences can be queried by their NCBI Protein Reference Sequence accession number, GenBank database protein sequence accession number, RCSB PDB accession number, or UniProt (Universal Protein Resource) name or accession number.
ResourceFunction["ImportFASTA"][seqref] is equivalent to ResourceFunction["ImportFASTA"][seqref,"Nucleotide"].
The value for format can be any of the following:
"Data"           "Header" and "Sequence" elements combined in a list
"LabeledData"    labeled sequence converted to a rule
"FASTA"          FASTA format
"BioSequence"    BioSequence format
ResourceFunction["ImportFASTA"][seqref,database] is equivalent to ResourceFunction["ImportFASTA"][seqref,database,"Data"].

Examples

Basic Examples (2) 

Import a simple NCBI Reference Sequence and give the raw header and sequence:

In[1]:=
Short[mitoc = ResourceFunction["ImportFASTA"]["NC_013993", "Nucleotide"], 5]
Out[1]=

Use the chaos game representation to visualize this genome:

In[2]:=
srules = {"U" -> "T", Except[Characters["ACGT"]] -> ""};
In[3]:=
ResourceFunction["FCGRImage"][StringReplace[mitoc[[2, 1]], srules], 7]
Out[3]=
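
The replacement rules above clean the sequence before it is plotted: StringReplace tries the rules in order at each position, so "U" is mapped to "T" and any other character outside A, C, G and T is deleted. A minimal sketch on a made-up string (the input is hypothetical and was not retrieved from the NCBI):

```wolfram
(* Hypothetical raw sequence containing an RNA base (U), an ambiguity code (N) and a gap (-) *)
raw = "ACGUNNAC-GT";

(* Same rules as above: map U to T, then delete anything that is not A, C, G or T *)
srules = {"U" -> "T", Except[Characters["ACGT"]] -> ""};

StringReplace[raw, srules]
```

The result should contain only the characters A, C, G and T, which is the four-letter alphabet the chaos game representation in this example works with.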

Retrieve a protein sequence for a UniProt protein:

In[4]:=
ResourceFunction["ImportFASTA"]["TP53B_HUMAN", "Protein"]
Out[4]=

Scope (3) 

Get a result in FASTA format:

In[5]:=
ResourceFunction["ImportFASTA"]["JX869132.1", "Nucleotide", "FASTA"]
Out[5]=

Get a result as labeled data:

In[6]:=
ResourceFunction["ImportFASTA"]["DQ926868.1", "Nucleotide", "LabeledData"]
Out[6]=

Get a result as a BioSequence object:

In[7]:=
ResourceFunction["ImportFASTA"]["S53156.1", "Nucleotide", "BioSequence"]
Out[7]=

Applications (2) 

Retrieve protein sequences for cytochrome c from various organisms:

In[8]:=
prot = ResourceFunction["ImportFASTA"][#, "Protein"] & /@ {"NP_001039526.1", "NP_001385227.1", "NP_001123442.1", "XP_069973776.1", "XP_064766840.1", "WP_276305918.1", "XP_059855190.1", "XP_068969445.1"}
Out[8]=

Use the PhylogeneticTreePlot resource function to generate the phylogenetic tree:

In[9]:=
ResourceFunction["PhylogeneticTreePlot"][prot[[All, 2, 1]],
 Flatten@StringCases[prot[[All, 1, 1]], "[" ~~ sp__ ~~ "]" :> sp]]
Out[9]=
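
The labels passed as the second argument are the bracketed organism names that NCBI protein headers typically end with. A small sketch of that extraction on a hypothetical header string (the accession and description below are invented for illustration), assuming one bracketed name per header:

```wolfram
(* Hypothetical FASTA header written in the usual NCBI protein style *)
header = "NP_000000.1 cytochrome c [Homo sapiens]";

(* Capture the text between "[" and "]" *)
StringCases[header, "[" ~~ sp__ ~~ "]" :> sp]
```

Because sp__ matches as much text as it can, a header with several bracketed groups would come back as one long match; writing Shortest[sp__] in place of sp__ is one way to handle such headers.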

Version History

  • 2.1.0 – 17 January 2025
  • 2.0.0 – 18 December 2024
  • 1.0.0 – 10 July 2019
