Function Repository Resource:

NCBIEntrezData

Access biomedical data in the NCBI Entrez system

Contributed by: Keiko Hirayama

ResourceFunction["NCBIEntrezData"]["AllDatabases"]

provides a list of the names of all valid Entrez databases.

ResourceFunction["NCBIEntrezData"]["Version"]

returns the version supported for Entrez databases.

ResourceFunction["NCBIEntrezData"][database, "EInfo"]

provides information about the specified Entrez database.

ResourceFunction["NCBIEntrezData"][text, "ESearch"]

provides a list of Entrez unique identifiers (UIDs) matching a text query.

ResourceFunction["NCBIEntrezData"][uid, "ESummary"]

retrieves summary data for the specified uid.

ResourceFunction["NCBIEntrezData"][uid, "EFetch"]

retrieves full data records for the specified uid.

ResourceFunction["NCBIEntrezData"][uid, "ELink"]

retrieves UIDs linked to an input uid in either the same or a different Entrez database.

ResourceFunction["NCBIEntrezData"][uid, "EPost"]

uploads the specified uid to the Entrez History server.

ResourceFunction["NCBIEntrezData"][cite, "ECitMatch"]

retrieves PubMed IDs (PMIDs) that correspond to a citation cite.

ResourceFunction["NCBIEntrezData"][term, "ESpell"]

provides spelling suggestions for a term within a single text query in a given database.

Details

ResourceFunction["NCBIEntrezData"] retrieves biology data accessible via the Entrez utilities (E-utilities) provided by the NCBI (National Center for Biotechnology Information).
NCBI recommends that users perform no more than three search requests per second and limit large data retrieval to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays.
By including an API key provided by NCBI, up to 10 requests per second are permitted by default.
Entrez utilities include:
"EInfo": provides information about the Entrez databases, including lists of indexing fields and available link names
"ESearch": provides a list of Entrez unique identifiers (UIDs) matching a text query
"ESummary": provides summaries of data associated with the input UIDs
"EFetch": provides full data associated with the input UIDs
"ELink": provides linked or related records associated with the input UIDs
"EPost": uploads a list of UIDs to the History server
"ECitMatch": provides PubMed IDs (PMIDs) that correspond to a set of input citation strings
"ESpell": provides spelling suggestions for terms within a single text query in a given database
In ResourceFunction["NCBIEntrezData"][uid,prop], uid may be a GenInfo Identifier (GI) number, an accession identifier or a mixed list of these forms. In ResourceFunction["NCBIEntrezData"][cite,"ECitMatch"], cite can be a single citation or a list.
The following options can be given:
"Database" (default: "pubmed"): target database from which to retrieve data; options for "ESearch", "ESummary", "EFetch", "ELink", "EPost", "ESpell"; allowed values include: "pubmed", "protein", "nuccore", "ipg", "nucleotide", "structure", "genome", "annotinfo", "assembly", "bioproject", "biosample", "blastdbinfo", "books", "cdd", "clinvar", "gap", "gapplus", "grasp", "dbvar", "gene", "gds", "geoprofiles", "medgen", "mesh", "nlmcatalog", "omim", "orgtrack", "pmc", "proteinclusters", "pcassay", "protfam", "pccompound", "pcsubstance", "seqannot", "snp", "sra", "taxonomy", "biocollections", "gtr"
"RetMode" (default: "Dataset" or "Text"): format of the returned output; options for "ESummary", "EFetch", "ELink"; allowed values include: "Dataset", "Text", "XML"; for "EFetch", some databases, including "bioproject", "clinvar", "taxonomy", "snp", "sra", and "gtr", return "Dataset" instead of "Text" by default
"RetType" (default: None): type of retrieved data; options for "EFetch"; allowed values vary by database:
  all databases: "docsum", "uilist"
  gene: "gene_table"
  nuccore, nucleotide, protein: "acc", "fasta", "seqid", "BioSequence"
  pmc: "medline"
  pubmed: "medline", "abstract"
"UseHistory" (default: False): whether to post the UIDs resulting from the search operation onto the NCBI History server; options for "ESearch"; "UseHistory" must be set to True for "ESearch" to accept a WebEnv as input
"WebEnv" (default: None): web environment string returned from a previous "ESearch", "EPost" or "ELink" call; when provided, NCBIEntrezData will append the results to the existing environment; "UseHistory" must be set to True for "ESearch" to accept a WebEnv as input; options for "ESearch", "ESummary", "EFetch"
"QueryKey" (default: None): query key returned by a previous "ESearch", "EPost" or "ELink" call; when provided, NCBIEntrezData will find the intersection of the set specified by QueryKey and the set retrieved by the query; options for "ESearch", "ESummary", "EFetch"
"RetStart" (default: 0): sequential index of the first UID in the retrieved set to be shown in the output (0 corresponds to the first record of the entire set); options for "ESearch"
"RetMax" (default: 20): total number of UIDs from the retrieved set to be shown in the output; options for "ESearch"
"Sort" (default: None): method used to sort UIDs in the output; options for "ESearch"; allowed values vary by database and include:
  gene: "relevance", "name"
  pubmed: "pub_date", "Author", "JournalName", "relevance"
"Field" (default: None): search field used to limit the result of "ESearch"; allowed values vary by database and are found in the result of NCBIEntrezData[database, "EInfo"]
"IDType" (default: None): type of identifier to return for sequence databases (nuccore, protein); if not specified, GenInfo Identifier (GI) numbers are returned; other values include "acc", which returns sequence accession numbers instead; options for "ESearch", "ELink"
"DateType" (default: None): type of date used to limit a search; options for "ESearch", "ELink"; allowed values include: "mdat" (modification date), "pdat" (publication date), "edat" (Entrez date)
"RelDate" (default: None): integer n that limits the result to records whose date, of the type specified by "DateType", falls within the last n days; options for "ESearch", "ELink"
"MinDate" (default: None): start of the date range used to limit a search result of "ESearch" or "ELink" by the date specified by "DateType"; "MinDate" and "MaxDate" must be used together to specify an arbitrary date range; the general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM
"MaxDate" (default: None): end of the date range used to limit a search result of "ESearch" or "ELink" by the date specified by "DateType"; "MinDate" and "MaxDate" must be used together to specify an arbitrary date range; the general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM
"Strand" (default: "plus"): strand of DNA to retrieve for "EFetch"; allowed values are "plus" or "minus"
"SeqStart" (default: None): integer coordinate of the first sequence base to retrieve; options for "EFetch"
"SeqStop" (default: None): integer coordinate of the last sequence base to retrieve; options for "EFetch"
"Complexity" (default: None): data content to return, specified by an integer; many sequence records are part of a larger data structure or "blob", and the complexity parameter determines how much of that blob to return; options for "EFetch"; allowed values include: 0 (entire blob), 1 (bioseq), 2 (minimal bioseq-set), 3 (minimal nuc-prot), 4 (minimal pub-set)
"DatabaseFrom" (default: "pubmed"): name of the database containing the input UIDs; options for "ELink"; if "Database" and "DatabaseFrom" are set to the same database value, then ELink will return computational neighbors within that database; for available computational neighbors see the full list of Entrez links
"CommandMode" (default: "neighbor"): command mode specifying which function "ELink" will perform; allowed values include:
  "neighbor": returns a set of UIDs in the "Database" linked to the input UIDs in the "DatabaseFrom"
  "neighbor_score": returns a set of UIDs in a database that are similar or related to the input UIDs, along with the computed similarity scores
  "neighbor_history": functions similarly to "neighbor", but stores the results on the NCBI History server and returns a Web environment string and query keys for each set of resulting UIDs
  "acheck": returns a list of all of the available links for the input UIDs
  "ncheck": checks for the existence of links between the input UIDs and other UIDs in the same database
  "lcheck": checks for the existence of external links for the set of input UIDs
  "llinks": checks for the existence of external links for the set of input UIDs, and returns URLs and provider attributes for all non-library providers
  "llinkslib": checks for the existence of external links for the set of input UIDs, and returns URLs and provider attributes for all providers
  "prlinks": returns the primary external link provider for each input UID
"LinkName" (default: None): name of the Entrez link for "ELink" to retrieve; see the full list of Entrez links
"Term" (default: None): string term used to limit the output set of "ELink" UIDs; the "Term" parameter only functions when "Database" and "DatabaseFrom" are set to the same database value
"Holding" (default: None): string name of the external link provider for "ELink"; the "Holding" parameter only functions when "CommandMode" is set to "llinks" or "llinkslib"
"APIKey" (default: None): an API key provided by NCBI; by including a key, up to 10 requests per second are permitted by default
For the "ECitMatch" search, each input citation must be represented by a citation string in the following format, where the "your_key" value is an arbitrary label that may serve as a local identifier for the citation and it will be included in the output: "journal_title|year|volume|first_page|author_name|your_key|".
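The citation format above can be assembled from its individual fields. The following is a minimal sketch; citationString is a hypothetical helper name introduced here for illustration and is not part of NCBIEntrezData:

```wolfram
(* Hypothetical helper, not part of NCBIEntrezData: joins the six citation
   fields with "|" and appends the trailing "|" that the format requires. *)
citationString[journal_String, year_, volume_, firstPage_, author_String, key_String] :=
  StringRiffle[ToString /@ {journal, year, volume, firstPage, author, key}, "|"] <> "|"

citationString["science", 1987, 235, 182, "palmenberg ac", "Art2"]
(* "science|1987|235|182|palmenberg ac|Art2|" *)
```

The resulting string has the same form as the citation passed to "ECitMatch" in the examples.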

Examples

Basic Examples (6) 

Retrieve a list of all Entrez database names:

In[1]:=
ResourceFunction["NCBIEntrezData"]["AllDatabases"]
Out[1]=

Use EInfo and get information on the assembly database:

In[2]:=
ResourceFunction["NCBIEntrezData"]["assembly", "EInfo"]
Out[2]=

Use ESearch and find PubMed articles on measles vaccination:

In[3]:=
ResourceFunction["NCBIEntrezData"]["measles vaccination", "ESearch"]
Out[3]=

Use ESummary and get summaries on selected proteins:

In[4]:=
ResourceFunction[
 "NCBIEntrezData"][{"15718680", "119703751"}, "ESummary", "Database" -> "protein"]
Out[4]=

Use EFetch and retrieve a nucleotide sequence in the BioSequence format:

In[5]:=
ResourceFunction[
 "NCBIEntrezData"]["5", "EFetch", {"Database" -> "nucleotide", "RetType" -> "BioSequence"}]
Out[5]=

Use ELink and retrieve the ID of a gene linked to a specified protein:

In[6]:=
ResourceFunction[
 "NCBIEntrezData"][15718680, "ELink", {"DatabaseFrom" -> "protein", "Database" -> "gene"}]
Out[6]=

Scope (3) 

Use EPost and upload a list of UIDs associated with articles on cancer studies:

In[7]:=
cancerpost = ResourceFunction[
  "NCBIEntrezData"][{"40048260", "40048237", "40048221", "40048212", "40048195", "40048190", "40048187", "40048163", "40048159", "40048143", "40048083", "40048052", "40048045", "40048042", "40048041", "40048036", "40048034", "40048031", "40048030", "40048028"}, "EPost"]
Out[7]=

Use the History server to recall the previously posted UIDs and restrict them to articles related to lung cancer:

In[8]:=
ResourceFunction["NCBIEntrezData"]["lung cancer", "ESearch", "UseHistory" -> True, "QueryKey" -> cancerpost["QueryKey"], "WebEnv" -> cancerpost["WebEnv"]]
Out[8]=

Use ECitMatch and find the PubMed article ID associated with the input citation string:

In[9]:=
pubmedid = ResourceFunction["NCBIEntrezData"][
  "science|1987|235|182|palmenberg ac|Art2|", "ECitMatch"]
Out[9]=

Get the abstract of the article:

In[10]:=
ResourceFunction["NCBIEntrezData"][
 pubmedid[[2]], "EFetch", {"RetType" -> "abstract"}]
Out[10]=

Use ESpell and find spelling suggestions for a term:

In[11]:=
ResourceFunction["NCBIEntrezData"]["athma", "ESpell"]
Out[11]=

Options (23) 

Database (1) 

Access the "gene" database and find genes associated with signal transduction:

In[12]:=
ResourceFunction["NCBIEntrezData"]["signal transduction", "ESearch", "Database" -> "gene"]
Out[12]=

RetMode (1) 

Retrieve related article information in the "XML" format:

In[13]:=
ResourceFunction["NCBIEntrezData"][139795111, "ELink", {"RetMode" -> "XML"}] // Short
Out[13]=

RetType (1) 

Retrieve a nucleotide sequence in FASTA format:

In[14]:=
ResourceFunction["NCBIEntrezData"][21614549, "EFetch", {"Database" -> "nucleotide", "RetType" -> "fasta"}] // Short
Out[14]=

UseHistory (1) 

Post the resulting UIDs onto the history server:

In[15]:=
ResourceFunction[
 "NCBIEntrezData"]["Parkinson's disease treatment", "ESearch", {"UseHistory" -> True}]
Out[15]=

WebEnv (1) 

Use the Web environment string returned from a previous "ESearch". NCBIEntrezData will append the results to the existing environment:

In[16]:=
ResourceFunction[
 "NCBIEntrezData"]["Alzheimer's disease treatment", "ESearch", {"UseHistory" -> True, "WebEnv" -> "MCID_67ca330f0575e56b50054e34"}]
Out[16]=

QueryKey (1) 

Use QueryKey returned from a previous "ESearch". NCBIEntrezData will find the intersection of the set specified by QueryKey and the set retrieved by the query:

In[17]:=
ResourceFunction[
 "NCBIEntrezData"]["genetic variation", "ESearch", {"UseHistory" -> True, "WebEnv" -> "MCID_67ca330f0575e56b50054e34", "QueryKey" -> 2}]
Out[17]=

RetStart (1) 

Retrieve relevant articles starting with the seventh item in the list:

In[18]:=
ResourceFunction[
 "NCBIEntrezData"]["PNAS", "ESearch", {"RetStart" -> 6}]
Out[18]=

RetMax (1) 

Retrieve a total of 100 relevant articles:

In[19]:=
ResourceFunction[
 "NCBIEntrezData"]["vaccination", "ESearch", {"RetMax" -> 100}]
Out[19]=
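
"RetStart" and "RetMax" can be combined to step through a large result set one page at a time. The following is a rough sketch under that assumption; pageOffsets and searchPages are hypothetical names introduced here, and the total number of results is assumed to be known in advance:

```wolfram
(* Hypothetical helper: the "RetStart" offsets needed to cover `total` UIDs
   when each request returns at most `pageSize` of them. *)
pageOffsets[total_Integer?Positive, pageSize_Integer?Positive] :=
  Range[0, total - 1, pageSize]

(* Hypothetical wrapper: one "ESearch" call per page. Because this is a
   delayed definition, nothing is sent to NCBI until searchPages is called. *)
searchPages[query_String, total_Integer, pageSize_Integer] :=
  Table[
    ResourceFunction["NCBIEntrezData"][query, "ESearch",
      {"RetStart" -> start, "RetMax" -> pageSize}],
    {start, pageOffsets[total, pageSize]}]

pageOffsets[250, 100]
(* {0, 100, 200} *)
```

Each element returned by searchPages would be the result of a separate request, so the request-rate recommendations in the Details section still apply.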

Sort (1) 

Sort the results by JournalName:

In[20]:=
ResourceFunction[
 "NCBIEntrezData"]["vaccination", "ESearch", {"Sort" -> "JournalName"}]
Out[20]=

Field (1) 

Search relevant articles with the specified term in the title:

In[21]:=
ResourceFunction[
 "NCBIEntrezData"]["vaccination", "ESearch", {"Field" -> "Title"}]
Out[21]=

IDType (1) 

Return protein accession numbers associated with the term:

In[22]:=
ResourceFunction[
 "NCBIEntrezData"]["mitogen activated protein kinase", "ESearch", {"Database" -> "protein", "IDType" -> "acc"}]
Out[22]=

DateType (1) 

Return the relevant articles published in a specified date range:

In[23]:=
ResourceFunction[
 "NCBIEntrezData"]["signal transduction", "ESearch", {"MinDate" -> DateObject[{2025, 1, 1}, "Day"], "MaxDate" -> DateObject[{2025, 1, 31}, "Day"], "DateType" -> "pdat"}]
Out[23]=

RelDate (1) 

Return the relevant articles published in the last 30 days:

In[24]:=
ResourceFunction[
 "NCBIEntrezData"]["signal transduction", "ESearch", {"DateType" -> "pdat", "RelDate" -> 30}]
Out[24]=

MinDate/MaxDate (1) 

Return the relevant articles published between "MinDate" and "MaxDate":

In[25]:=
ResourceFunction[
 "NCBIEntrezData"]["signal transduction", "ESearch", {"DateType" -> "pdat", "MinDate" -> DateObject[{2024, 1}, "Month"], "MaxDate" -> DateObject[{2024, 12}, "Month"]}]
Out[25]=
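
The Details section describes the general date format as YYYY/MM/DD, while the examples here pass DateObject values. As a small sketch of how the two correspond, a DateObject can be rendered in that form with DateString; entrezDateString is a hypothetical helper name, and whether NCBIEntrezData also accepts such strings directly is an assumption not confirmed by the examples:

```wolfram
(* Hypothetical helper: renders a date in the YYYY/MM/DD form described
   for "MinDate" and "MaxDate". *)
entrezDateString[date_DateObject] :=
  DateString[date, {"Year", "/", "Month", "/", "Day"}]

entrezDateString[DateObject[{2025, 1, 31}, "Day"]]
(* "2025/01/31" *)
```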

Strand (1) 

Retrieve the minus strand of a nucleotide sequence in FASTA format:

In[26]:=
ResourceFunction["NCBIEntrezData"]["383209646", "EFetch", {"Database" -> "nucleotide", "RetType" -> "fasta", "Strand" -> "minus"}] // Short
Out[26]=

SeqStart/SeqStop (1) 

Retrieve the first 100 bases of a nucleotide sequence:

In[27]:=
ResourceFunction[
 "NCBIEntrezData"]["383209646", "EFetch", {"Database" -> "nucleotide",
   "RetType" -> "fasta", "SeqStart" -> 1, "SeqStop" -> 100}]
Out[27]=

Complexity (1) 

Retrieve the protein data with sequence information:

In[28]:=
ResourceFunction[
 "NCBIEntrezData"]["11154", "EFetch", {"Database" -> "protein", "Complexity" -> 1}]
Out[28]=

DatabaseFrom (1) 

Find proteins linked from a specified gene:

In[29]:=
ResourceFunction[
 "NCBIEntrezData"]["7157", "ELink", {"DatabaseFrom" -> "gene", "Database" -> "protein"}]
Out[29]=

CommandMode (1) 

Find related articles and retrieve computed similarity scores:

In[30]:=
ResourceFunction[
 "NCBIEntrezData"]["20210808", "ELink", {"CommandMode" -> "neighbor_score"}]
Out[30]=

LinkName (1) 

Find neighboring genes:

In[31]:=
ResourceFunction[
 "NCBIEntrezData"]["7158", "ELink", {"LinkName" -> "gene_gene_neighbors"}]
Out[31]=

Term (1) 

Find related articles that are supported by National Institutes of Health (NIH):

In[32]:=
ResourceFunction[
 "NCBIEntrezData"]["36766853", "ELink", {"Term" -> "NIH support"}]
Out[32]=

Holding (1) 

Find information from MedlinePlus for a specified article:

In[33]:=
ResourceFunction[
 "NCBIEntrezData"]["39112715", "ELink", {"CommandMode" -> "llinks", "Holding" -> "MEDPLUS"}]
Out[33]=

APIKey (1) 

Supply an API key issued by NCBI to allow up to 10 requests per second:

In[34]:=
ResourceFunction[
 "NCBIEntrezData"]["39112715", "ELink", {"APIKey" -> "examplekey"}]
Out[34]=
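
When issuing many requests in a row, one way to stay within the recommended rate is to pause between calls. The following is a minimal sketch, not a feature of NCBIEntrezData; throttledMap and proteinSummaries are hypothetical names, and the default of 3 requests per second reflects the recommendation for use without an API key:

```wolfram
(* Hypothetical helper, not part of NCBIEntrezData: applies f to each element
   of list, pausing before every call so that successive calls start at least
   1/requestsPerSecond seconds apart. *)
throttledMap[f_, list_List, requestsPerSecond_ : 3] :=
  Map[(Pause[N[1/requestsPerSecond]]; f[#]) &, list]

(* Hypothetical usage: one "ESummary" request per protein UID. Because this is
   a delayed definition, NCBI is not contacted until proteinSummaries is called. *)
proteinSummaries[uids_List] :=
  throttledMap[
    ResourceFunction["NCBIEntrezData"][#, "ESummary", "Database" -> "protein"] &,
    uids]
```

With an API key, the third argument could be raised to 10, matching the limit stated in the Details section.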

Properties and Relations (4) 

Find genes associated with Homo sapiens pancreatic islets:

In[35]:=
pancreaticislet = ResourceFunction["NCBIEntrezData"]["Homo sapiens pancreatic islet", "ESearch", "Database" -> "gene", "RetMax" -> 100]
Out[35]=

Get summaries of genes:

In[36]:=
pancreaticisletGene = ResourceFunction["NCBIEntrezData"][Normal@pancreaticislet["IDList"], "ESummary", {"Database" -> "gene"}, "RetMax" -> 100]
Out[36]=

Find SNPs associated with those genes:

In[37]:=
pancreaticisletSNPs = ResourceFunction["NCBIEntrezData"][
  Normal[pancreaticisletGene[All, "uid"]], "ELink", {"DatabaseFrom" -> "gene", "Database" -> "snp"}]
Out[37]=

Use NCBIGenomicSNPData to get more information on a selected SNP:

In[38]:=
ResourceFunction["NCBIGenomicSNPData"][
 pancreaticisletSNPs[1, -1, -1, -1, 1]]
Out[38]=

Requirements

Wolfram Language 14.0 (January 2024) or above

Version History

  • 1.0.0 – 02 April 2025
