Function Repository Resource:

EnsemblGenomeAssemblyConversion (1.0.0), current version: 1.0.1


Convert chromosome coordinates from one genome assembly to another

Contributed by: Keiko Hirayama

ResourceFunction["EnsemblGenomeAssemblyConversion"][assembly1, assembly2, {chromosome, start, end}]

converts the chromosome coordinates given by the start and end positions on genome assembly1 to the corresponding coordinates on assembly2.

Details and Options

EnsemblGenomeAssemblyConversion is based on Ensembl, which provides comparative genomics information.
A genome assembly is the complete DNA sequence of an organism, reconstructed from smaller segments of sequenced nucleotides.
EnsemblGenomeAssemblyConversion can be used to convert genome positions from archived assemblies to those of more recent assemblies, or vice versa.
Inputs to EnsemblGenomeAssemblyConversion are the common names of the input and output genome assemblies, together with the position in the input assembly, specified by the chromosome name and integer start and end coordinates.
Chromosome names are given as integers or as short names such as "1", "X" and "mt" for chromosome 1, chromosome X and the mitochondrion, respectively.
The following option can be given:
"Species"    "human"    species to query; the following scientific and common names are accepted: "Homo sapiens", "human", "Mus musculus", "mouse", "Danio rerio", "zebrafish", "Caenorhabditis elegans", "Saccharomyces cerevisiae"
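
As an illustration of the calling convention described above, the following is a minimal sketch that converts several regions in one pass and selects a species by its common name. The names convert, regions, humanResults and zebrafishResult are hypothetical, the second human region is made up for illustration, and no output is shown because each call queries Ensembl:

```wolfram
(* Sketch only: every call below queries Ensembl and needs network access. *)
convert = ResourceFunction["EnsemblGenomeAssemblyConversion"];

(* The first region comes from the Basic Examples; the second is a made-up region. *)
regions = {{"X", 1000000, 1000100}, {"X", 2000000, 2000100}};

(* "Species" defaults to "human", so the option is omitted here. *)
humanResults = convert["GRCh37", "GRCh38", #] & /@ regions;

(* Zebrafish selected by its common name rather than "Danio rerio". *)
zebrafishResult = convert["GRCz11", "GRCz10", {10, 30000, 30100}, "Species" -> "zebrafish"];
```

Binding the resource function to a symbol first keeps the repeated calls short; it does not change how the function behaves.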

Examples

Basic Examples (2) 

Convert the human chromosome coordinates on the previous genome assembly, GRCh37, to the latest assembly, GRCh38:

In[1]:=
ResourceFunction["EnsemblGenomeAssemblyConversion", ResourceVersion->"1.0.0"]["GRCh37", "GRCh38", {"X", 1000000, 1000100}]
Out[1]=

Convert the zebrafish chromosome coordinates on the latest genome assembly, GRCz11, to the previous assembly, GRCz10:

In[2]:=
ResourceFunction["EnsemblGenomeAssemblyConversion", ResourceVersion->"1.0.0"]["GRCz11", "GRCz10", {10, 30000, 30100}, "Species" -> "Danio rerio"]
Out[2]=

Scope (5) 

Use the EnsemblPhenotype function to find the genomic location of the variation associated with dysosteosclerosis:

In[3]:=
genomefeature = ResourceFunction["EnsemblPhenotype"]["dysosteosclerosis", "GenomicFeatures"]
Out[3]=

Use EnsemblGenomeAssemblyConversion to convert the variation-associated chromosomal coordinates from the latest genome assembly to the previous one:

In[4]:=
precoord = ResourceFunction["EnsemblGenomeAssemblyConversion"]["GRCh38", "GRCh37", ToExpression@StringSplit[genomefeature[1, "Location"], ":" | "-"]]
Out[4]=
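
The input above splits a location string on ":" and "-" and passes the pieces through ToExpression. The sketch below shows the same parsing step on a hypothetical location string (not real EnsemblPhenotype output), assuming the "chromosome:start-end" form implied by the split pattern. It keeps the chromosome name as a string and converts only the positions, since ToExpression applied to a name such as "X" would yield a symbol rather than a string:

```wolfram
(* Hypothetical location string in "chromosome:start-end" form; not real query output. *)
location = "10:1000-2000";

(* Split on ":" and "-" to get the chromosome name and the two positions as strings. *)
parts = StringSplit[location, ":" | "-"];

(* Keep the chromosome name as a string; turn only the positions into integers. *)
region = Join[{First[parts]}, FromDigits /@ Rest[parts]]
(* {"10", 1000, 2000} *)
```

The resulting list has the {chromosome, start, end} shape that EnsemblGenomeAssemblyConversion takes as its third argument.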

Retrieve the DNA sequence associated with the identified genomic variation. Use the NCBIEntrezData function to get the nucleotide accession for human chromosome 10:

In[5]:=
chr10acc = ResourceFunction["NCBIEntrezData"]["human chromosome 10", "ESearch", "Database" -> "nucleotide", "IDType" -> "acc", "RetMax" -> 1]
Out[5]=

Use the ImportFASTA function to retrieve the DNA sequences from both the latest and the previous assemblies, using the original and the converted coordinates:

In[6]:=
latestseq = StringTake[
  ResourceFunction["ImportFASTA"][chr10acc["IDList"][1], "BioSequence"],
   ToExpression@
   StringSplit[genomefeature[1, "Location"], ":" | "-"][[-2 ;;]]]
Out[6]=

In[7]:=
previousseq = StringTake[
  ResourceFunction["ImportFASTA"][
   StringSplit[chr10acc["IDList"][1], "."] /. {nc_String, ver_String} :> StringJoin[nc, ".", ToString[ToExpression[ver] - 1]], "BioSequence"], precoord]
Out[7]=
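
The input above obtains the accession for the previous assembly by lowering the version suffix of the current accession by one. The sketch below isolates that step on an illustrative accession string; the value of acc and the names nc, ver and previousAcc are assumptions for demonstration, and, as in the example above, it is assumed that the preceding accession version corresponds to the previous assembly:

```wolfram
(* Illustrative accession in "name.version" form; an assumption, not query output. *)
acc = "NC_000010.11";

(* Separate the accession name from its version number. *)
{nc, ver} = StringSplit[acc, "."];

(* Rebuild the accession with the version lowered by one. *)
previousAcc = nc <> "." <> ToString[FromDigits[ver] - 1]
(* "NC_000010.10" *)
```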

Confirm that the sequences from both assemblies match:

In[8]:=
Diff[latestseq, previousseq]
Out[8]=

Requirements

Wolfram Language 14.0 (January 2024) or above

Version History

  • 1.0.1 – 21 April 2025
  • 1.0.0 – 17 April 2025
