Function Repository Resource:

NCBIVirusGenomeData

Retrieve virus genome data, including the associated sequence and metadata

Contributed by: Keiko Hirayama

ResourceFunction["NCBIVirusGenomeData"][species, "Dataset"]

returns the viral sequence dataset for the specified species.

ResourceFunction["NCBIVirusGenomeData"][species, "Tabular"]

returns the viral sequence dataset for the specified species in Tabular format.

ResourceFunction["NCBIVirusGenomeData"][species, "Summary"]

returns summary information for the viral dataset of the specified species.

Details and Options

ResourceFunction["NCBIVirusGenomeData"] retrieves virus data provided by the NCBI (National Center for Biotechnology Information).
The species argument can be a "TaxonomicSpecies" entity or an "NCBITaxonomyID" given as an ExternalIdentifier.
NCBI recommends that users perform no more than three search requests per second and limit large data retrieval to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays.
By including an API key provided by NCBI, up to 10 requests per second are permitted by default.
The following options can be given:
"APIKey"    None    an API key provided by NCBI; by including a key, up to 10 requests per second are permitted by default
"CompleteOnly"    True    limiting to genomes designated as complete, as defined by the submitter
"GeoLocation"    None    limiting to genomes collected from the specified geographic location; entities of the types "Country" and "GeographicRegion", as well as US states of the type "AdministrativeDivision", are allowed
"Host"    None    limiting to genomes isolated from the specified host species; a "TaxonomicSpecies" entity, an "NCBITaxonomyID" in an ExternalIdentifier format, or the common or scientific name of the species is allowed
"IncludeSequence"    None    including specified sequences formatted as BioSequence objects; allowed sequence types include "Genome", "Protein" and "CDS"
"PangolinClassification"    None    limiting to SARS-CoV-2 genomes from the specified Pango lineage
"RefSeqOnly"    False    limiting results to RefSeq genomes
"ReleasedSince"    None    limiting to genomes released on or after the specified date
"UpdatedSince"    None    limiting to genomes updated on or after the specified date
ResourceFunction["NCBIVirusGenomeData"][species] is equivalent to ResourceFunction["NCBIVirusGenomeData"][species, "Dataset"].
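
The request-rate notes above can be respected by spacing out separate calls. The following is a minimal sketch, not part of the resource function: the helper throttledMap and the credential name "NCBI_API_KEY" are hypothetical, and the pause lengths are simply chosen to stay under the stated limits of 3 and 10 requests per second. It only spaces out the calls themselves; how many requests a single call issues internally is not documented here.

```wolfram
(* Hypothetical helper: apply f to each item, pausing between successive calls *)
throttledMap[f_, items_List, delay_] := 
 Table[(If[i > 1, Pause[delay]]; f[items[[i]]]), {i, Length[items]}]

(* Assumption: an NCBI API key was stored earlier under this hypothetical name *)
apiKey = SystemCredential["NCBI_API_KEY"];

(* Without a key stay under 3 calls per second; with a key, under 10 *)
delay = If[StringQ[apiKey], 0.1, 0.34];

species = {
   Entity["TaxonomicSpecies", "ZikaVirus::y7m74"],
   Entity["TaxonomicSpecies", "EbolaVirus::5q9c8"]};

(* Fall back to the default "APIKey" -> None when no key is stored *)
summaries = throttledMap[
  ResourceFunction["NCBIVirusGenomeData"][#, "Summary", 
    "APIKey" -> If[StringQ[apiKey], apiKey, None]] &, species, delay]
```

Keeping the key in SystemCredential rather than in the notebook avoids sharing it by accident when the notebook is published.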

Examples

Basic Examples (2) 

Retrieve viral genome data for the Zika virus:

In[1]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "ZikaVirus::y7m74"], "Dataset"]
Out[1]=

Obtain a Tabular form for genome data and protein sequences of the Ebola virus:

In[2]:=
ebola = ResourceFunction["NCBIVirusGenomeData"][
  Entity["TaxonomicSpecies", "EbolaVirus::5q9c8"], "Tabular", "IncludeSequence" -> "Protein"]
Out[2]=

Scope (2) 

Find the size of the dataset for the SARS-CoV-2 genomes:

In[3]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "SevereAcuteRespiratorySyndromeCoronavirus2::f6fc3"], "Summary", "IncludeSequence" -> "Genome"]
Out[3]=

Selectively retrieve the SARS-CoV-2 genomes collected in Connecticut and released in the past two weeks:

In[4]:=
sarscov = ResourceFunction["NCBIVirusGenomeData"][
  Entity["TaxonomicSpecies", "SevereAcuteRespiratorySyndromeCoronavirus2::f6fc3"],
  "ReleasedSince" -> DatePlus[Today, -Quantity[14, "Days"]],
  "IncludeSequence" -> "Genome",
  "GeoLocation" -> Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}]]
Out[4]=

Use the "PhylogeneticTreePlot" function to plot a dendrogram for a set of retrieved genome sequences:

In[5]:=
ResourceFunction["PhylogeneticTreePlot"][
 Normal[sarscov[All, #Sequence["SequenceString"] &]], Normal[sarscov[All, "Accession"]]]
Out[5]=

Retrieve the Dengue virus genomes:

In[6]:=
dengue = ResourceFunction["NCBIVirusGenomeData"][
  ExternalIdentifier["NCBITaxonomyID", "12637", <|"Label" -> "Dengue virus"|>]]
Out[6]=

Color countries where genome samples were collected:

In[7]:=
GeoRegionValuePlot[dengue[All, "GeographicLocation"] // Tally]
Out[7]=

Options (8) 

CompleteOnly (1) 

Retrieve only the complete genomes:

In[8]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "YellowFeverVirus::48p2c"], "CompleteOnly" -> True]
Out[8]=

GeoLocation (1) 

Retrieve the genomes collected in Asia:

In[9]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "MeaslesMorbillivirus::45qq7"], "GeoLocation" -> Entity["GeographicRegion", "Asia"]]
Out[9]=
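
The "GeoLocation" option is also documented as accepting "Country" entities. A minimal sketch, assuming Brazil as an arbitrary example country and reusing the Zika virus entity from the basic examples:

```wolfram
(* Restrict results to genomes whose collection location is the given country *)
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "ZikaVirus::y7m74"], 
 "GeoLocation" -> Entity["Country", "Brazil"]]
```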

Host (1) 

Retrieve the Tomato yellow leaf curl virus genomes isolated from eggplants:

In[10]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "TomatoYellowLeafCurlVirus::c36dt"], "Host" -> Entity["TaxonomicSpecies", "SolanumMelongena::9g9tf"]]
Out[10]=
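
The option table also allows the host to be given as a common or scientific name instead of an entity. A sketch, assuming the scientific name "Homo sapiens" is resolved as the host; the exact spellings the function accepts have not been verified here:

```wolfram
(* Give the host as a plain scientific name rather than a "TaxonomicSpecies" entity *)
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "ZikaVirus::y7m74"], 
 "Host" -> "Homo sapiens"]
```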

IncludeSequence (1) 

Include coding DNA sequences:

In[11]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "EbolaVirus::5q9c8"], "IncludeSequence" -> "CDS"]
Out[11]=

PangolinClassification (1) 

Retrieve SARS-CoV-2 genomes from the selected Pango lineage:

In[12]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "SevereAcuteRespiratorySyndromeCoronavirus2::f6fc3"], "PangolinClassification" -> "LP.8.1"]
Out[12]=

RefSeqOnly (1) 

Retrieve the RefSeq genomes:

In[13]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "HumanAlphaherpesvirus3::6yh3h"], "RefSeqOnly" -> True, "IncludeSequence" -> "Genome"]
Out[13]=

ReleasedSince (1) 

Retrieve genomes released in the past year:

In[14]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "MonkeypoxVirus::9y6ry"], "ReleasedSince" -> DatePlus[Today, -Quantity[1, "Years"]]]
Out[14]=

UpdatedSince (1) 

Retrieve genomes updated in the past month:

In[15]:=
ResourceFunction["NCBIVirusGenomeData"][
 Entity["TaxonomicSpecies", "HepatitisBVirus::rc8p8"], "UpdatedSince" -> DatePlus[Today, -Quantity[1, "Months"]]]
Out[15]=

Properties and Relations (2) 

Retrieve RefSeq genome data for the Zika virus:

In[16]:=
zika = ResourceFunction["NCBIVirusGenomeData"][
  Entity["TaxonomicSpecies", "ZikaVirus::y7m74"], "RefSeqOnly" -> True]
Out[16]=

Use the "ImportFASTA" function to retrieve the reference sequence:

In[17]:=
ResourceFunction["ImportFASTA"][zika[1, "Accession"], "BioSequence"]
Out[17]=

Version History

  • 1.0.0 – 22 December 2025
