Function Repository Resource:

EnsemblGenomeRegion

Retrieve features that overlap a given genomic region

Contributed by: Keiko Hirayama

ResourceFunction["EnsemblGenomeRegion"][chromosome,{start,end}]

retrieves genomic features that overlap the chromosome region specified by the start and end positions and returns the information as a Dataset.

ResourceFunction["EnsemblGenomeRegion"][chromosome,{start,end},format]

retrieves the genomic sequence that overlaps the chromosome region specified by the start and end positions, in the specified format.

Details and Options

EnsemblGenomeRegion is based on Ensembl, which provides genomics information for various organisms.
Query regions are specified by the chromosome name and integer coordinates.
The maximum length of a requested region is 5,000,000 base pairs.
By default, human nucleotide sequences are returned.
EnsemblGenomeRegion[chromosome, {start, end}] is equivalent to EnsemblGenomeRegion[chromosome, {start, end}, "Dataset", "Feature"->"sequence"].
The value for format can be any of the following:
"Dataset"        genomic features returned as a Dataset (default)
"BioSequence"    genomic sequence returned as a BioSequence object; applicable to the "sequence" Feature
"FASTA"          genomic sequence returned in FASTA format; applicable to the "sequence" Feature
The following options can be given:
"Species"           "Homo sapiens"    species to query; genome assemblies of the following species are supported: Homo sapiens (human), Mus musculus (mouse), Danio rerio (zebrafish), Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (yeast)
"Assembly"          None              assembly version to query; if not specified, the latest available version is used
"Mask"              None              whether to return the sequence masked for repeats; "Hard" masks all repeats as Ns and "Soft" masks repeats as lowercase characters
"Strand"            1                 strand of the nucleotide sequence to retrieve; allowed values are 1 or -1
"Feature"           None              type of genomic feature to retrieve; a list of multiple values is also accepted; allowed values include: "sequence", "band", "gene", "transcript", "cds", "exon", "repeat", "simple", "misc", "variation", "somatic_variation", "structural_variation", "somatic_structural_variation", "constrained", "regulatory", "motif", "mane"
"BioType"           None              functional classification of "gene" or "transcript" features to fetch; allowed values include "protein_coding"
"DBType"            "core"            database type to retrieve features from; allowed values include: "core", "otherfeatures"
"SOTerm"            None              Sequence Ontology term used to restrict the variants returned
"TrimDownstream"    False             whether to exclude features that overlap the downstream end of the region
"TrimUpstream"      False             whether to exclude features that overlap the upstream end of the region
"VariantSet"        None              short name of a variant set used to restrict the variants returned, such as "ClinVar" and "ph_uniprot"; the Ensembl documentation lists the available short names

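The coordinate and length constraints described above can be checked before a query is sent. The following is a minimal sketch, not part of EnsemblGenomeRegion: the helper names validEnsemblRegionQ and chunkRegion are hypothetical, and the sketch assumes that coordinates are 1-based and inclusive, so that a region {start, end} has length end - start + 1.

```wolfram
(* Hypothetical helpers; assume 1-based, inclusive coordinates *)
$maxRegionLength = 5000000;

(* True if the region has integer coordinates and fits within the length limit *)
validEnsemblRegionQ[{start_Integer, end_Integer}] :=
  1 <= start <= end && end - start + 1 <= $maxRegionLength
validEnsemblRegionQ[_] := False

(* Split a long region into consecutive pieces of at most max positions each *)
chunkRegion[{start_Integer, end_Integer}, max_Integer : 5000000] :=
  Table[{s, Min[s + max - 1, end]}, {s, start, end, max}]

validEnsemblRegionQ[{32315077, 32315080}]  (* True *)
validEnsemblRegionQ[{1, 6000000}]          (* False *)
chunkRegion[{1, 12000000}]
(* {{1, 5000000}, {5000001, 10000000}, {10000001, 12000000}} *)
```

Each piece returned by chunkRegion could then be passed to the resource function separately; how features that span a piece boundary should be merged is left open here.
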
Examples

Basic Examples (2) 

Retrieve a nucleotide sequence for a specified region of human chromosome 13:

In[1]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["13", {32315077, 32315080}]
Out[1]=

Get the result as a BioSequence object:

In[2]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["13", {32315077, 32315080}, "BioSequence"]
Out[2]=
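
The "FASTA" format listed under Details and Options follows the same calling pattern. This is a sketch based only on the usage template at the top of this page; the exact shape of the returned FASTA text is not shown here because it has not been verified.

```wolfram
(* Same region as above, requested in FASTA format; requires network access to Ensembl *)
ResourceFunction["EnsemblGenomeRegion"]["13", {32315077, 32315080}, "FASTA"]
```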

Find genes that overlap a specified region of human chromosome 17:

In[3]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["17", {7000000, 7001000}, "Feature" -> "gene"]
Out[3]=

Scope (3) 

Find genomic variations that overlap a specified region of human chromosome 6:

In[4]:=
variations = ResourceFunction[
  "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["6", {26090950, 26090960}, "Feature" -> {"variation"}]
Out[4]=

Use the NCBIGenomicSNPData resource function to retrieve more information on a selected variation:

In[5]:=
ResourceFunction["NCBIGenomicSNPData"][variations[1, 1]]
Out[5]=

Find its clinical significance:

In[6]:=
ResourceFunction["NCBIGenomicSNPData"][
 variations[1, 1], "ClinicalSignificance"]
Out[6]=

Options (11) 

Species (1) 

Use the Species option to specify the organism to query:

In[7]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["15", {20000001, 20000100}, "Species" -> "Bos taurus"]
Out[7]=

Assembly (1) 

Use the Assembly option to specify the version of the genome assembly:

In[8]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["1", {150001, 150100}, "Assembly" -> "NCBI36"]
Out[8]=

Mask (1) 

Use the Mask option to retrieve the masked genome sequence, in which repeats are shown as lowercase characters:

In[9]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["1", {150001, 150100}, "Mask" -> "Soft"]
Out[9]=

Strand (1) 

Use the Strand option to retrieve the sequence of the reverse (complementary) DNA strand:

In[10]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["1", {150001, 150100}, "Strand" -> -1]
Out[10]=

Feature (1) 

Use the Feature option to selectively retrieve bands and transcripts associated with the given genomic region:

In[11]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["7", {140400001, 140500000}, "Feature" -> {"band", "transcript"}]
Out[11]=

BioType (1) 

Use the BioType option to retrieve protein-coding genes associated with the given genomic region:

In[12]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["17", {7000001, 7100000}, "Feature" -> "gene",
  "BioType" -> "protein_coding"]
Out[12]=

DBType (1) 

Use the DBType option to retrieve additional gene features associated with the given genomic region:

In[13]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["17", {7000101, 7000200}, "Feature" -> "gene",
  "DBType" -> "otherfeatures"]
Out[13]=

SOTerm (1) 

Use the SOTerm option to retrieve missense variants (SO:0001583) associated with the given genomic region:

In[14]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["17", {7000001, 7001000}, "Feature" -> "variation", "SOTerm" -> "SO:0001583"]
Out[14]=

TrimDownstream (1) 

Use the TrimDownstream option to retrieve genes that overlap with the given genomic region, but not with the downstream region:

In[15]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["17", {7000001, 7020000}, "Feature" -> "gene",
  "TrimDownstream" -> True]
Out[15]=

TrimUpstream (1) 

Use the TrimUpstream option to retrieve genes that overlap with the given genomic region, but not with the upstream region:

In[16]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["17", {7000001, 7020000}, "Feature" -> "gene",
  "TrimUpstream" -> True]
Out[16]=

VariantSet (1) 

Use the VariantSet option to retrieve variants with ClinVar annotation associated with the given genomic region:

In[17]:=
ResourceFunction[
 "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["17", {7700001, 7701000}, "Feature" -> "variation", "VariantSet" -> "ClinVar"]
Out[17]=

Applications (3) 

Find regulatory regions in the first 1,000,000 bases of human chromosome 1:

In[18]:=
regulatory = ResourceFunction[
  "EnsemblGenomeRegion", ResourceSystemBase -> "https://www.wolframcloud.com/obj/resourcesystem/api/1.0"]["1", {1, 1000000}, "Feature" -> "regulatory"]
Out[18]=

Group regions by the type of regulatory features:

In[19]:=
regregbytype = regulatory[GroupBy["Description"], All, {#"Start", #"End"} &]
Out[19]=
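
The same GroupBy pattern can also aggregate per type, for example to total the length covered by each regulatory feature type. The sketch below uses a small made-up Dataset (the rows in mock are illustrative, not Ensembl output) with the "Description", "Start" and "End" keys used above, and assumes inclusive coordinates.

```wolfram
(* Illustrative rows only; real rows would come from the "regulatory" query above *)
mock = Dataset[{
    <|"Description" -> "promoter", "Start" -> 100, "End" -> 199|>,
    <|"Description" -> "enhancer", "Start" -> 300, "End" -> 349|>,
    <|"Description" -> "promoter", "Start" -> 500, "End" -> 549|>}];

(* Group rows by type, compute each row's length, then total the lengths per group *)
lengthByType = mock[GroupBy["Description"], Total, #"End" - #"Start" + 1 &]
(* promoter -> 150, enhancer -> 50 *)
```

Replacing mock with the regulatory Dataset should give the covered length per regulatory type, provided the query result carries the same keys.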

Visualize the regulatory regions in a circular diagram of their chromosome positions:

In[20]:=
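(* Build a rule type -> sectors, alternating gray gap sectors with colored region sectors *)
regulatoryregionbytypecol[col_, regdat_] := regdat[[1]] -> Riffle[
   (* gaps between consecutive regions, from position 0 up to position 1000000 *)
   Style[{Subtract[#[[2]], #[[1]]], 1}, GrayLevel[.2]] & /@ 
    Transpose[{Prepend[regdat[[2]][[All, 2]], 0], Append[regdat[[2]][[All, 1]], 1000000]}],
   (* the regions themselves, each with width End - Start *)
   Style[{Subtract @@ Reverse[#], 1}, col] & /@ regdat[[2]]]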
In[21]:=
regulatoryregioncolored = MapThread[
   regulatoryregionbytypecol[#1, #2] &, {{RGBColor[0.6, 0.2, 0.6], RGBColor[1, 0, 0], RGBColor[1, 0.5, 0], RGBColor[1, 1, 0]}, Normal@Normal@regregbytype}];
In[22]:=
Show[SectorChart[regulatoryregioncolored[[1, 2]], SectorSpacing -> .1,
   ChartBaseStyle -> EdgeForm[None], Epilog -> {Text[
     Style[regulatoryregioncolored[[1, 1]], GrayLevel[.9], 10], {0, 4 - .2}], MapIndexed[
     Text[Style[#1, GrayLevel[.9], 10], {0, #2[[1]] - .8/#2[[1]]}] &, regulatoryregioncolored[[2 ;; 4, 1]]]}, PolarAxes -> {True, False}, SectorOrigin -> {{Pi/2, Automatic}, 3.4},
   PlotRange -> Full], SectorChart[regulatoryregioncolored[[2 ;; 4, 2]], SectorSpacing -> .1, ChartBaseStyle -> EdgeForm[None], PlotRange -> Full, SectorOrigin -> {{Pi/2, Automatic}, 0.1}]]
Out[22]=

Requirements

Wolfram Language 13.0 (December 2021) or above

Version History

  • 1.0.1 – 21 April 2025
  • 1.0.0 – 17 April 2025
