Details and Options
ResourceFunction["PhylogeneticTreePlot"] uses an alignment-free method to compare pairs of sequences.
Input sequences should be strings composed of the standard nucleotide letters {A,C,G,T}. Lowercase letters are also allowed. The character U is allowed and is replaced by T. All other characters, such as N, are removed.
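The cleaning rules above can be illustrated with a minimal Python sketch. The function name clean_sequence is hypothetical, and this is not the resource function's actual implementation, only a restatement of the stated rules.

```python
def clean_sequence(seq: str) -> str:
    """Uppercase the input, map U to T, and drop anything outside A, C, G, T."""
    upper = seq.upper().replace("U", "T")
    return "".join(ch for ch in upper if ch in "ACGT")


print(clean_sequence("acguNN-ACGT"))  # ACGTACGT
```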
ResourceFunction["PhylogeneticTreePlot"] uses Dendrogram and accepts all options for that function.
Most default option settings agree with those of Dendrogram.
ResourceFunction["PhylogeneticTreePlot"] creates the dendrogram from vectors that are derived via dimensional reductions of the input genetic sequences.
Each sequence is first converted to an image using the Frequency Chaos Game Representation (FCGR).
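As a rough illustration of the FCGR step, the sketch below counts k-mers into a 2^k by 2^k grid. The corner assignment for the four nucleotides and the parameter k are assumptions made for this example (conventions differ between papers, and the value used by the resource function is not stated here). The input is assumed to contain only A, C, G and T.

```python
from typing import List

# Assumed corner assignment: each nucleotide contributes one bit to the row
# index and one bit to the column index. Other conventions exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq: str, k: int) -> List[List[int]]:
    """Count every k-mer of seq into a 2^k x 2^k grid of integers."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        row = col = 0
        # As in the chaos game iteration, the most recent nucleotide selects
        # the coarsest quadrant, so the k-mer is read from its last letter.
        for ch in reversed(kmer):
            r_bit, c_bit = CORNERS[ch]
            row = 2 * row + r_bit
            col = 2 * col + c_bit
        grid[row][col] += 1
    return grid
```

Each cell of the grid holds the number of occurrences of one k-mer, so the entries sum to the number of k-mers in the sequence. Treating the grid as pixel intensities gives the image referred to above.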
FCGR images are reduced in dimension using a Fourier Cosine Transform (FCT). A further dimension reduction is performed by applying the Singular Value Decomposition (SVD) to vectors formed by flattening the FCT matrices.
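The cosine-transform step might look like the following sketch. It assumes an unnormalized type-II discrete cosine transform and assumes that the reduction keeps only a low-frequency block of coefficients. Both are assumptions for illustration, as are the function names and the parameter keep. The SVD step is omitted here.

```python
import math
from typing import List, Sequence


def dct_1d(values: Sequence[float]) -> List[float]:
    """Unnormalized type-II discrete cosine transform of a list of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * freq / n)
            for i, v in enumerate(values))
        for freq in range(n)
    ]


def dct_2d(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply dct_1d to every row, then to every column of the result."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def low_frequency_vector(matrix: Sequence[Sequence[float]], keep: int) -> List[float]:
    """Flatten the top-left keep x keep block of the 2D transform."""
    coeffs = dct_2d(matrix)
    return [coeffs[r][c] for r in range(keep) for c in range(keep)]
```

Applied to an FCGR grid, low_frequency_vector returns a vector of keep squared numbers, which is the kind of flattened vector that a subsequent SVD would reduce further.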
A detailed explanation of the method may be found in the articles noted in the Related Links and Source Metadata sections.
The dimension reduction described above has parameters that could in principle be changed. ResourceFunction["PhylogeneticTreePlot"] uses a fixed set of values that has been observed to perform fairly well in practice.
In order to produce a reasonable grouping, ResourceFunction["PhylogeneticTreePlot"] requires moderately long genetic sequences containing at least a few thousand nucleotides.
The SVD step of the dimension reduction, when applied to vectors derived from a set of very similar genome sequences, tends to produce reduced vectors dominated by a large first component that is approximately equal across the set. This tends to distort the distances between genomes. The option "IgnoreOutlier" (default: Automatic) is provided to address this. With the Automatic setting, the first component of each reduced vector is removed whenever the largest singular value is at least ten times the next largest.
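The automatic rule can be sketched in Python as follows. The function name drop_dominant_component and its arguments are hypothetical, and the singular values are assumed to be sorted in decreasing order. The sketch only restates the ten-to-one test described above and is not the resource function's code.

```python
from typing import List, Sequence


def drop_dominant_component(
    vectors: Sequence[Sequence[float]],
    singular_values: Sequence[float],
    ratio: float = 10.0,
) -> List[List[float]]:
    """Drop the first coordinate of every vector when the largest singular
    value is at least `ratio` times the next largest; otherwise keep all."""
    dominated = (
        len(singular_values) >= 2
        and singular_values[0] >= ratio * singular_values[1]
    )
    if dominated:
        return [list(v[1:]) for v in vectors]
    return [list(v) for v in vectors]
```

For example, with singular values 120, 3 and 1 the first coordinate would be dropped, while with 20, 3 and 1 the vectors would be returned unchanged.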