Function Repository Resource:

BioMoleculeIDRs

Compute the intrinsically disordered regions (IDRs) in a protein

Contributed by: Soutick Saha

ResourceFunction["BioMoleculeIDRs"][protein]

computes the intrinsically disordered regions in a protein.

ResourceFunction["BioMoleculeIDRs"][protein,"AveragePredictionConfidence"]

computes the averaged prediction confidence for a protein, from which the disordered regions are determined.

ResourceFunction["BioMoleculeIDRs"][protein,"BioSequences"]

returns the sequences of the intrinsically disordered regions in a protein.

Details and Options

Intrinsically Disordered Regions (IDRs) of proteins are the regions that lack a fixed or ordered three-dimensional structure.

The function works on computationally predicted protein structures and computes the IDRs based on the averaged confidence of prediction.

The confidence values range from 0 to 100 with higher values indicating better prediction accuracy.

Whether a given residue (an amino acid, for proteins) belongs to a disordered region is determined by the Mean of the prediction confidence over a window consisting of the residue itself and the 20 residues on either side of it. The averaging length and the averaging function can be changed using the "AveragingLength" and "AveragingFunction" options, respectively.
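The windowed averaging described above can be sketched as follows. This is a minimal illustration, not the function's actual implementation: smoothConfidence is a hypothetical helper, and truncating the window at the chain termini is an assumption, since the source does not say how terminal residues are handled.

```wolfram
(* Hypothetical helper, not part of BioMoleculeIDRs. *)
(* Averages per-residue confidence over up to n residues on each side. *)
(* Assumption: the window is truncated at the chain termini. *)
smoothConfidence[conf_List, n_Integer : 20, f_ : Mean] :=
 Table[
  f[conf[[Max[1, i - n] ;; Min[Length[conf], i + n]]]],
  {i, Length[conf]}]

(* A single low-confidence residue lowers the average of its neighbors *)
smoothConfidence[{90., 90., 30., 90., 90.}, 1]
(* {90., 70., 70., 70., 90.} *)
```

Passing a different function as the third argument, such as Min or Median, mimics the effect of the "AveragingFunction" option.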

Residues with an averaged prediction confidence above 80, below 70, or between 70 and 80 are initially labeled as folded, disordered, or gap regions, respectively. Next, folded and disordered regions shorter than 10 residues are reclassified as gaps.
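The initial labeling and the reclassification of short regions can be sketched as below. The helper names labelResidue and relabelShortRuns are hypothetical, and the code is an illustration of the rules stated above under the default cutoffs, not the function's own source.

```wolfram
(* Hypothetical helper: label one averaged confidence value. *)
labelResidue[c_, lo_ : 70., hi_ : 80.] :=
 Which[c > hi, "Folded", c < lo, "Disordered", True, "Gap"]

(* Hypothetical helper: folded or disordered runs shorter than minLen become gaps. *)
relabelShortRuns[labels_List, minLen_Integer : 10] :=
 Flatten[
  Map[
   If[Length[#] < minLen && First[#] =!= "Gap",
     ConstantArray["Gap", Length[#]], #] &,
   Split[labels]]]

labels = labelResidue /@ {90., 85., 75., 60., 65., 95.}
(* {"Folded", "Folded", "Gap", "Disordered", "Disordered", "Folded"} *)

(* With a minimum length of 2, the single trailing folded residue becomes a gap *)
relabelShortRuns[labels, 2]
(* {"Folded", "Folded", "Gap", "Disordered", "Disordered", "Gap"} *)
```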

The initial confidence cutoff for determining the disordered or folded regions can be changed using the "ConfidenceCutoff" Option.

Finally, gap regions are redefined as disordered if they (i) have disordered regions on both sides or (ii) are terminal and are preceded or followed by a disordered region.
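A minimal sketch of this final step, assuming the per-residue labels are the strings "Folded", "Disordered" and "Gap" (an illustrative encoding, not necessarily the one used internally). mergeGaps is a hypothetical helper.

```wolfram
(* Hypothetical helper: redefine gap runs as disordered when they are flanked *)
(* by disordered runs, or are terminal and adjacent to a disordered run. *)
mergeGaps[labels_List] :=
 Module[{runs = Split[labels], n},
  n = Length[runs];
  Flatten[
   Table[
    With[{run = runs[[i]],
      left = If[i > 1, runs[[i - 1, 1]], None],
      right = If[i < n, runs[[i + 1, 1]], None]},
     If[First[run] === "Gap" &&
       ((left === "Disordered" && right === "Disordered") ||
        (left === None && right === "Disordered") ||
        (right === None && left === "Disordered")),
      ConstantArray["Disordered", Length[run]],
      run]],
    {i, n}]]]

(* The terminal gap and the flanked gap are merged; the gap next to a folded run is kept *)
mergeGaps[{"Gap", "Disordered", "Gap", "Disordered", "Gap", "Folded"}]
(* {"Disordered", "Disordered", "Disordered", "Disordered", "Gap", "Folded"} *)
```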

The default minimum length of a disordered or folded region is 10. This can be changed using the "MinIDRLength" option.

protein can be an ExternalIdentifier, BioSequence, BioMolecule or String.

A BioSequence must be a continuous peptide sequence of 400 amino acids or fewer.

The IDRs of a BioSequence correspond to the disorder in the predicted protein structure with the same sequence.

When protein is an ExternalIdentifier, its type must be "UniProtKBAccessionNumber" and its ID must be the UniProt ID of a predicted structure in the AlphaFold Protein Structure Database.

When protein is a String, then it can be either a MGnify Protein ID, corresponding to structures in the ESM Metagenomic Atlas, or a filename.

When protein is a filename or a BioMolecule, it should be a predicted structure.
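As an illustration of the accepted input forms, the calls below pass an ExternalIdentifier and a String. The accession "P04637" (human p53) is an assumed example chosen for this sketch and does not come from the original examples; the MGnify ID is the one used elsewhere on this page. Both calls need network access, and no output is shown because it depends on the retrieved structures.

```wolfram
(* Assumed example accession (human p53); any UniProt ID with an AlphaFold model should work *)
uniprot = ExternalIdentifier["UniProtKBAccessionNumber", "P04637"];
ResourceFunction["BioMoleculeIDRs"][uniprot]

(* A MGnify Protein ID given as a String, resolved against the ESM Metagenomic Atlas *)
ResourceFunction["BioMoleculeIDRs"]["MGYP002143454457"]
```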

The output is an Association containing the disordered regions of a biomolecule with Keys as the chain labels.

Structures from the ESM Metagenomic Atlas or AlphaFold Protein Structure Database are monomeric and have only one chain with chain label "A". So currently, in most of the outputs, the only key present is "A".

BioMoleculeIDRs has the following options:

"ConfidenceCutoff"    <|"Disordered"→70.,"Folded"→80.|>    confidence cutoffs below (above) which regions are initially marked as disordered (folded)
"MinIDRLength"    10    minimum length of a disordered or folded region
"AveragingLength"    20    number of residues before and after the target residue used for computing the averaged confidence of prediction
"AveragingFunction"    Mean    averaging function for confidence scores
"FoldedRegions"    False    whether to include data for the folded regions
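A hedged sketch of a call that sets every option at once. The option names and the form of the "ConfidenceCutoff" value come from the list above; the specific values and the accession "P04637" (human p53) are illustrative assumptions. The call needs network access, so no output is shown.

```wolfram
(* Assumed example accession; the option values are arbitrary illustrations *)
protein = ExternalIdentifier["UniProtKBAccessionNumber", "P04637"];

ResourceFunction["BioMoleculeIDRs"][protein,
 "ConfidenceCutoff" -> <|"Disordered" -> 60., "Folded" -> 85.|>,
 "MinIDRLength" -> 15,
 "AveragingLength" -> 10,
 "AveragingFunction" -> Median,
 "FoldedRegions" -> True]
```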

BioMoleculeIDRs is based on Tesei, G., Trolle, A.I., Jonsson, N. et al. Conformational ensembles of the human intrinsically disordered proteome. Nature 626, 897–904 (2024). https://doi.org/10.1038/s41586-023-07004-5

Examples

Basic Examples (4)

Compute the disordered regions of a protein, where the Keys are the chain labels:

In[1]:=

Out[1]=

We can also compute the average confidence of prediction for a structure:

In[2]:=

Out[2]=

And visualize the confidence of prediction:

In[3]:=

Out[3]=

We can also obtain the amino acid sequences for the disordered regions in the protein:

In[4]:=

Out[4]=

Scope (3)

Compute the disordered regions of a protein structure predicted from a peptide BioSequence:

In[5]:=

Out[5]=

Compute the disordered regions of a protein structure from the ESM Metagenomic Atlas:

In[6]:=

Out[6]=

Compute the disordered regions of a BioMolecule:

In[7]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/1f3846f7-651b-4d63-9751-641196fb1690"]

Out[7]=

Options (11)

ConfidenceCutoff (2)

Set different confidence cutoffs to compute disordered regions:

In[8]:=

Out[8]=

With more restrictive cutoffs, the regions are shorter:

In[9]:=

Out[9]=

MinIDRLength (2)

Set two different minimum lengths for disordered regions and see the difference in the output:

In[10]:=

Out[10]=

With a larger value for minimum IDR length, one of the smaller disordered regions is no longer present:

In[11]:=

Out[11]=

AveragingLength (1)

Set the number of residues on each side of the target residue used for averaging the confidence of prediction:

In[12]:=

Out[12]=

AveragingFunction (4)

Functions such as Median, Min and Max give a different confidence measure for each residue and therefore slightly different disordered regions. Use Mean as the averaging function:

In[13]:=

Out[13]=

Use Median as the averaging function for confidence:

In[14]:=

Out[14]=

The IDRs are longer when Min is used as the averaging function, because each residue is then assigned the minimum confidence found in its averaging window:

In[15]:=

Out[15]=

For the opposite reason, the IDRs are shorter when Max is used as the averaging function:

In[16]:=

Out[16]=

FoldedRegions (2)

Include data for folded regions:

In[17]:=

Out[17]=

This is also applicable when we want to obtain "BioSequences":

In[18]:=

Out[18]=

Applications (2)

Here are some common UniProt IDs of human proteins that have a large amount of disorder:

In[19]:=

Compute the disordered regions for these proteins:

In[20]:=

Out[20]=

Neat Examples (1)

Visualize the disordered regions and average confidence of prediction of a protein from the ESM Metagenomic Atlas:

In[21]:=

id = "MGYP002143454457";
cf = "Rainbow";

(* Averaged prediction confidence and the predicted structure *)
conf = ResourceFunction["BioMoleculeIDRs"][id, "AveragePredictionConfidence"];
bm = ServiceExecute["ESMAtlas", "PredictedStructure", {"MGnifyID" -> id}];

(* Color rules from the confidence values, rescaled from 0-100 to 0-1 *)
cols = Map[ColorData[cf], (0.01*conf), {2}];
keys = Flatten[
   Table[{(Keys@cols)[[i]], #} & /@ Range@Length@(Values@cols)[[i]],
    {i, Length@cols}], 1];
colRules = Thread[keys -> Flatten@Values@cols];

(* Disordered regions; chains without any are dropped before the emptiness check *)
idr = ResourceFunction["BioMoleculeIDRs"][id];
validIDRs = DeleteCases[idr, {}];
pos = If[SameQ[validIDRs, <||>], {}, MapApply[Range, idr]];

(* Color rules marking disordered positions, with folded as the fallback *)
disorderColor = Lighter@Red;
foldedColor = Lighter@Gray;
colorRules = If[SameQ[pos, {}],
   {_ -> foldedColor},
   Join[
    Map[(# -> disorderColor) &,
     Flatten[KeyValueMap[Thread[{#1, Flatten@#2}] &, pos], 1]],
    {_ -> foldedColor}]];

(* Left: disordered vs. folded; right: averaged prediction confidence *)
{BioMoleculePlot3D[bm, ColorRules -> colorRules,
  PlotLegends -> SwatchLegend[{disorderColor, foldedColor},
    {"Disordered", "Folded"}, LegendMarkers -> "Bubble"]],
 BioMoleculePlot3D[bm, ColorRules -> colRules,
  PlotLegends -> BarLegend[{cf, {0, 100}},
    LegendLabel -> Placed["Avg. prediction confidence", Left, Rotate[#, 90 Degree] &]]]}

Out[34]=

Publisher

WolframChemistry

Version History

1.0.0 – 11 June 2025

Source Metadata

Citation:
- Tesei, G., Trolle, A.I., Jonsson, N. et al. Conformational ensembles of the human intrinsically disordered proteome. Nature 626, 897–904 (2024). https://doi.org/10.1038/s41586-023-07004-5

License Information

This work is licensed under a Creative Commons Attribution 4.0 International License
