# Wolfram Function Repository

Instant-use add-on functions for the Wolfram Language

Function Repository Resource:

Plot a dendrogram for a set of genome nucleotide sequences

Contributed by:
Daniel Lichtblau

ResourceFunction["PhylogeneticTreePlot"][ gives a dendrogram plot that groups the genetic sequences |

ResourceFunction["PhylogeneticTreePlot"] uses an alignment-free method to compare pairs of sequences.

Input sequences should be strings comprised of the standard nucleotide letters {A,C,G,T}. Lowercase letters are also allowed. The character U is allowed and is replaced by T. All other characters, such as N, are removed.

ResourceFunction["PhylogeneticTreePlot"] uses Dendrogram and accepts all options for that function.

Most default option settings agree with those of Dendrogram.

ResourceFunction["PhylogeneticTreePlot"] creates the dendrogram from vectors that are derived via dimensional reductions of the input genetic sequences.

By default, each sequence is first converted to an image using the Frequency Chaos Game Representation (FCGR).

FCGR images are reduced in dimension using a Fourier Cosine Transform (FCT). A further dimensional reduction is done using the Singular Value Decomposition (SVD) on vectors comprised of the flattened FCT matrices.

An explanation may be found in the articles noted in the Related Links and Source Metadata sections.

The dimensional reduction described above has certain parameters that in principle might be changed. ResourceFunction["PhylogeneticTreePlot"] uses a fixed set of values for these that has been seen to perform fairly well in practice.

In order to produce a reasonable grouping, ResourceFunction["PhylogeneticTreePlot"] requires moderately long genetic sequences containing at least a few thousand nucleotides.

The SVD step of the dimension reduction, when applied to vectors that came from a set of very similar genome sequences, will tend to give a result where the reduced vectors are dominated by a large first value that is approximately equal across the set. This tends to distort the distances between genomes. The option "IgnoreOutlier" (default: Automatic) is provided to address this. The automatic behavior is to remove the first components whenever the largest singular value is at least ten times the size of the next largest.

ResourceFunction["PhylogeneticTreePlot"] takes an option "VectorMethod". The Automatic default converts sequences to vectors using the methodology described above. A user-defined method can be specified with this option. One possibility is to use the resource function StringsToVectors.

Obtain genes for a type of breast cancer from the entity store:

In[1]:= |

Out[361]= |

In[362]:= |

Out[362]= |

Obtain the nucleotide sequences and species names:

In[363]:= |

Out[363]= |

Plot the phylogenetic tree derived from these genes:

In[364]:= |

Out[364]= |

One can force PhylogeneticTreePlot to ignore the largest singular value in the dimension-reduction:

In[365]:= |

Out[365]= |

This example was introduced in the referenced article "A novel fast vector method for genetic sequence comparison" by Li, He, He and Yau and also used in "Alignment-free genomic sequence comparison using FCGR and signal processing".

Show an example that has appeared in literature on alignment-free methods for clustering gene sequences:

In[366]:= |

Out[369]= |

Use the resource function ImportFASTA to download the genetic sequences from GenBank for the previously shown accession identifiers:

In[370]:= |

Out[371]= |

Plot the phylogenetic tree:

In[372]:= |

Out[372]= |

Use a non-default method for clustering:

In[373]:= |

Out[373]= |

Use the default n-gram based method of the resource function StringsToVectors to convert the sequences:

In[374]:= |

Out[374]= |

We now apply this to samples of the novel coronavirus SARS-CoV-2, which is the virus responsible for the COVID-19 pandemic. For purposes of getting better comparisons, we remove trailing A nucleotides (their presence or absence seems to be a vagary of the sequencing protocols). We also restrict to a tighter size range that still contains most of the fully sequenced samples. For purposes of visualization we will color the labels by location and date of sample collection.

Obtain data from the Wolfram Data Repository:

In[375]:= |

Remove trailing A nucleotides and retain only a narrow size range:

In[376]:= |

Create labels colored both by collection locations and dates:

In[377]:= |

Show the phylogenetic tree:

In[378]:= |

Out[378]= |

This example illustrates the importance of clipping off the first singular value in the dimension reduction, in cases where it is much larger than the next one. In effect one component will dominate the rest, in a way that will tend to make the distances less meaningful. We show this now by forcing the large value to be retained. The result is a tree that is unbalanced in terms of branch lengths:

In[379]:= |

Out[379]= |

The implementation was changed in April 2020 to use the System` context Dendrogram instead of HierarchicalClustering`DendrogramPlot. This necessitated small changes to the argument structure and also means there are some changes in the functional behavior. For example, Dendrogram did not at that time properly handle CosineDistance as a DistanceFunction setting. We give the prior implementation here as a convenience for those who might have preferred it:

In[380]:= |

Set up the mammal benchmark:

In[381]:= |

Out[381]= |

Now show the old method, using the cosine distance:

In[382]:= |

Out[382]= |

- 4.0.1 – 22 March 2024
- 4.0.0 – 09 April 2020
- 3.0.0 – 18 February 2020
- 2.0.0 – 31 December 2019
- 1.0.0 – 20 August 2019

This work is licensed under a Creative Commons Attribution 4.0 International License