Function Repository Resource:

UCSCGenomeSequenceData

Retrieve DNA sequences from the UCSC Genome Browser database

Contributed by: Keiko Hirayama

ResourceFunction["UCSCGenomeSequenceData"]["GenomeAssemblies"]

gives a dataset of the available genome assemblies.

ResourceFunction["UCSCGenomeSequenceData"]["genome","chromosome"]

gives the DNA sequence from a specified chromosome in the genome.

ResourceFunction["UCSCGenomeSequenceData"]["genome","chromosome",{start,end}]

gives the part of the DNA sequence between the start and end positions on a specified chromosome in the genome.

ResourceFunction["UCSCGenomeSequenceData"]["species","chromosome"]

gives the DNA sequence from a specified chromosome in the latest available genome of a given species.

ResourceFunction["UCSCGenomeSequenceData"]["species","chromosome",{start,end}]

gives the part of the DNA sequence between the start and end positions on a specified chromosome in the latest available genome of a given species.

Details

The retrieved sequence is based on the assembled genomes in the UCSC Genome Browser database (Raney BJ, Barber GP, Benet-Pagès A, Casper J, Clawson H, Cline MS, Diekhans M, Fischer C, Navarro Gonzalez J, Hickey G et al., The UCSC Genome Browser database: 2024 update).
Selected "TaxonomicSpecies" entities can be used to specify the species.
If the specified start position is larger than the end position, the DNA sequence of the complementary strand is returned.
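The strand behavior described above can be checked locally against a forward-strand result. The following is a minimal sketch, assuming the sequence is an ordinary string and that the opposite strand is read as the reverse complement (the notes above do not say whether the returned sequence is also reversed). The name reverseComplement is a hypothetical helper, not part of the resource function:

```wolfram
(* Hypothetical helper: reverse complement of a DNA string. *)
(* Case is preserved, so lowercase letters stay lowercase.  *)
(* Characters other than A, C, G, T (such as N) are left unchanged. *)
reverseComplement[s_String] :=
 StringReverse[
  StringReplace[s, {
    "A" -> "T", "T" -> "A", "C" -> "G", "G" -> "C",
    "a" -> "t", "t" -> "a", "c" -> "g", "g" -> "c"}]]

(* Made-up input, not database output *)
reverseComplement["ACGTtgca"]
```

Comparing reverseComplement applied to a {start, end} result with the result for {end, start} is one way to confirm which convention the function follows.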

Examples

Basic Examples (3) 

Retrieve available genome assembly information as a dataset:

In[1]:=
ResourceFunction["UCSCGenomeSequenceData"]["GenomeAssemblies"]
Out[1]=

Retrieve a DNA sequence from chromosome 21 of the human genome assembly hg38:

In[2]:=
ResourceFunction[
 "UCSCGenomeSequenceData"]["hg38", "21", {31660001, 31660020}]
Out[2]=

Retrieve a DNA sequence from the X chromosome of the latest available dog genome assembly:

In[3]:=
ResourceFunction["UCSCGenomeSequenceData"][
 Entity["TaxonomicSpecies", "CanisLupusFamiliaris::4t62p"], "X", {200001, 200100}]
Out[3]=

Scope (1) 

Repeated sequences are shown in lowercase letters:

In[4]:=
ResourceFunction["UCSCGenomeSequenceData"]["dm6", "X", {15001, 15500}]
Out[4]=
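Because repeats are marked only by letter case, they can be located with ordinary string operations. The following is a minimal sketch, assuming the retrieved sequence is a plain string; softMaskedFraction and softMaskedRuns are hypothetical helper names, and the input is a made-up string rather than database output:

```wolfram
(* Hypothetical helper: fraction of lowercase (repeat) letters in a non-empty string *)
softMaskedFraction[s_String] /; StringLength[s] > 0 :=
 N[StringCount[s, RegularExpression["[a-z]"]]/StringLength[s]]

(* Hypothetical helper: {start, end} positions of each lowercase run within the string. *)
(* Overlaps -> False is needed because StringPosition reports overlapping matches by default. *)
softMaskedRuns[s_String] :=
 StringPosition[s, RegularExpression["[a-z]+"], Overlaps -> False]

softMaskedFraction["ACgtACgt"]
softMaskedRuns["ACgtACgt"]
```

The positions returned by softMaskedRuns are relative to the string, so the requested start position has to be added to convert them to chromosome coordinates.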

Applications (3) 

Convert the retrieved DNA sequence to a BioSequence object:

In[5]:=
seq = ResourceFunction["UCSCGenomeSequenceData"]["hg38", "6", {33572552, 33572581}] // BioSequence
Out[5]=

Translate the DNA sequence into the corresponding peptide sequence:

In[6]:=
pep = BioSequenceTranslate[seq]
Out[6]=

Create a molecule from the peptide sequence and plot it:

In[7]:=
MoleculePlot[Molecule[pep]]
Out[7]=

Possible Issues (2) 

Retrieving the entire sequence of a chromosome may take a long time because of the amount of data to download:

In[8]:=
ResourceFunction["UCSCGenomeSequenceData"]["hg38", "1"] // ByteCount // AbsoluteTiming
Out[8]=
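When only a region is needed, requesting it in smaller windows keeps each download short. The following is a minimal sketch, assuming that positions are inclusive and that each call returns a string; chunkRanges and fetchInChunks are hypothetical helper names, not part of the resource function:

```wolfram
(* Hypothetical helper: split an inclusive range into consecutive windows of at most `size` positions *)
chunkRanges[start_Integer, end_Integer, size_Integer?Positive] /; start <= end :=
 Table[{s, Min[s + size - 1, end]}, {s, start, end, size}]

(* Hypothetical helper: fetch each window separately and join the pieces. *)
(* Calling it requires network access; it assumes every call returns a string. *)
fetchInChunks[genome_, chr_, {start_Integer, end_Integer}, size_Integer] :=
 StringJoin[
  ResourceFunction["UCSCGenomeSequenceData"][genome, chr, #] & /@
   chunkRanges[start, end, size]]

chunkRanges[1, 25, 10]
```

A call such as fetchInChunks["hg38", "21", {31660001, 31660020}, 10] would then issue two requests of ten positions each. Whether this is faster than a single request depends on the server and the window size, so it is worth timing both with AbsoluteTiming.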

If the requested end position extends past the end of the chromosome, the returned sequence is truncated at the end of the chromosome:

In[9]:=
ResourceFunction[
 "UCSCGenomeSequenceData"]["hg19", "mt", {16000, 17000}]
Out[9]=

Requirements

Wolfram Language 13.0 (December 2021) or above

Version History

  • 1.0.0 – 10 April 2024
