Genomic comparisons (2)
We begin with a set of viral genomes available in FASTA format.
A list of the corresponding genome accession identifiers:
Import these using the resource function ImportFASTA; this takes a bit of time:
Get the genomic sequence strings:
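A sketch of these import steps; the accession identifiers shown are illustrative stand-ins for the notebook's list, and the {labels, sequences} return shape assumed for ImportFASTA should be checked against its documentation:

    ids = {"NC_045512", "MN908947"};  (* illustrative accessions; the notebook gives the full list *)
    raw = ResourceFunction["ImportFASTA"] /@ ids;
    seqs = #[[2, 1]] & /@ raw;  (* assumes each result is {labels, sequences}; keep the first sequence string *)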
Create vectors for these sequences:
Show a dendrogram using different colors for the distinct viral types:
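The vectorization function under discussion is not reproduced here; the following is a minimal n-gram frequency stand-in (not the actual method), with hypothetical viralTypes and typeColors supplying the coloring:

    (* stand-in vectorizer: frequencies of the corpus's most common n-grams *)
    ngramBasis[strs_List, n_ : 3, len_ : 80] :=
        Keys@TakeLargest[Counts[Flatten[StringPartition[#, n, 1] & /@ strs]], UpTo[len]];
    stringVectors[strs_List, basis_List] :=
        Table[N[StringCount[s, g]/StringLength[s]], {s, strs}, {g, basis}];
    vecs = stringVectors[seqs, ngramBasis[seqs]];
    viralTypes = {"coronavirus", "coronavirus"};  (* hypothetical type labels for the genomes *)
    typeColors = <|"coronavirus" -> Red|>;
    styledLabels = MapThread[Style[#1, typeColors[#2]] &, {ids, viralTypes}];
    Dendrogram[Thread[vecs -> styledLabels], ClusterDissimilarityFunction -> "Average"]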
We get a similar dendrogram from the "GenomeFCGR" method:
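"GenomeFCGR" is a method of the document's function; as a runnable analogue, a classic frequency chaos game representation for DNA can be computed directly (corner assignment conventions vary):

    genomeFCGR[s_String, k_ : 3] :=
        Module[{corners = <|"A" -> {0, 0}, "C" -> {0, 1}, "G" -> {1, 1}, "T" -> {1, 0}|>,
            pt = {0.5, 0.5}, tallies = ConstantArray[0, {2^k, 2^k}]},
          Do[pt = (pt + corners[ch])/2;  (* chaos-game step toward the base's corner *)
            tallies[[Ceiling[2^k pt[[1]]], Ceiling[2^k pt[[2]]]]]++,
            {ch, Select[Characters[ToUpperCase[s]], KeyExistsQ[corners, #] &]}];
          N[Flatten[tallies]/StringLength[s]]];
    Dendrogram[Thread[(genomeFCGR /@ seqs) -> styledLabels]]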
We use a standard test set of mammal species:
Import the sequences in FASTA format:
Form vectors from these genome strings:
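The same import pipeline applies; one real accession is shown, with the rest of the test set as listed in the notebook:

    mammalIDs = {"V00662" (* human mtDNA; remaining accessions as in the notebook *)};
    mammalRaw = ResourceFunction["ImportFASTA"] /@ mammalIDs;
    mammalSeqs = #[[2, 1]] & /@ mammalRaw;  (* same assumed return shape as before *)
    mammalVecs = stringVectors[mammalSeqs, ngramBasis[mammalSeqs]];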
Set up data for a phylogenetic tree:
Show the dendrogram obtained from this encoding of the genomes into vectors:
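With mammalNames a hypothetical list of species names aligned with the vectors:

    treeData = Thread[mammalVecs -> mammalNames];
    Dendrogram[treeData, ClusterDissimilarityFunction -> "Average"]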
The genome FCGR encoding gives a similar grouping:
Find five nearest neighbors for each species:
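A sketch using the built-in Nearest; we request six neighbors because the nearest one is the vector itself, which we drop:

    nf = Nearest[mammalVecs -> mammalNames];
    AssociationThread[mammalNames -> Map[Rest[nf[#, 6]] &, mammalVecs]]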
Use the resource function MultidimensionalScaling to reduce to three dimensions for visualization:
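A sketch; the argument order assumed for the resource function should be checked against its documentation:

    pts3 = ResourceFunction["MultidimensionalScaling"][mammalVecs, 3];
    ListPointPlot3D[pts3, PlotStyle -> PointSize[Medium], BoxRatios -> {1, 1, 1}]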
Authorship identification (3)
We can use string vectors to determine when substrings might have a similar source.
Obtain several English texts from ExampleData:
Split into sets of substrings of length 2000:
Create vectors of length 80 for these strings:
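These three steps, sketched with three illustrative ExampleData texts and the n-gram stand-in from above:

    names = {"AliceInWonderland", "PrideAndPrejudice", "OriginOfSpecies"};
    texts = ExampleData[{"Text", #}] & /@ names;
    chunks = StringPartition[#, 2000] & /@ texts;
    basis = ngramBasis[Flatten[chunks], 3, 80];
    chunkVecs = stringVectors[#, basis] & /@ chunks;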
Split into odd- and even-numbered substrings for purposes of training and testing a classifier:
Train a neural net to classify the different texts:
Check that 99% of the test strings are classified correctly:
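A sketch of the split, the training and the accuracy check; the net architecture here is an assumption, not necessarily the document's:

    trainPairs = Flatten@MapThread[Thread[#1[[1 ;; ;; 2]] -> #2] &, {chunkVecs, names}];
    testPairs = Flatten@MapThread[Thread[#1[[2 ;; ;; 2]] -> #2] &, {chunkVecs, names}];
    net = NetChain[{LinearLayer[100], Ramp, LinearLayer[Length[names]], SoftmaxLayer[]},
        "Input" -> 80, "Output" -> NetDecoder[{"Class", names}]];
    trained = NetTrain[net, trainPairs];
    N@Mean[Boole[MapThread[SameQ, {trained[Keys[testPairs]], Values[testPairs]}]]]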
Authorship of nineteenth-century French novels. The data is from the reference "Understanding and explaining Delta measures for authorship attribution". The importing step can take as much as a few minutes.
Import 75 French novels with three by each of 25 authors, clipping the size of the larger ones in order to work with strings of comparable lengths:
Partition each into substrings of 10000 characters and split into training (two novels per author) and test (one per author) sets:
Create vectors of length 200:
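Assuming the import step produced an association novels with authors as keys and their three texts as values, the partitioning and vectorization might look like this (in the stand-in, vector length 200 comes from a 200-gram basis):

    chunked = Map[StringPartition[#, 10000] &, novels, {2}];
    basis = ngramBasis[Flatten[Values[chunked]], 3, 200];
    trainPairs = Flatten@KeyValueMap[Thread[stringVectors[Flatten[#2[[1 ;; 2]]], basis] -> #1] &, chunked];
    testPairs = Flatten@KeyValueMap[Thread[stringVectors[#2[[3]], basis] -> #1] &, chunked];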
A simple classifier associates 82% of the substrings in the test set with the correct author:
A simple neural net associates 89% with the actual author:
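Both measurements, sketched; Classify's default method stands in for the "simple classifier", and the net is as assumed before but with 200 inputs:

    cl = Classify[trainPairs];
    N@Mean[Boole[MapThread[SameQ, {cl[Keys[testPairs]], Values[testPairs]}]]]
    authors = Keys[novels];
    net = NetChain[{LinearLayer[100], Ramp, LinearLayer[Length[authors]], SoftmaxLayer[]},
        "Input" -> 200, "Output" -> NetDecoder[{"Class", authors}]];
    trained = NetTrain[net, trainPairs];
    N@Mean[Boole[MapThread[SameQ, {trained[Keys[testPairs]], Values[testPairs]}]]]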
We apply string vector similarity to the task of authorship attribution. As an example we use the Federalist Papers, where the authorship of several essays was disputed for around 175 years. The individual authors were Alexander Hamilton, James Madison and John Jay. Due to a happenstance of history (Hamilton may have recorded his claims of authorship in haste, with his duel with Aaron Burr looming), twelve essays were claimed by both Hamilton and Madison. The question was largely settled in the 1960s (see Author Notes). Nevertheless, this is generally acknowledged as a difficult test for authorship attribution methods. We use n-gram string vectors to illustrate one method of analysis.
Import data, split into individual essays and remove author names and boilerplate common to most or all articles:
Separate out the three essays jointly attributed to Hamilton and Madison, as well as the five written by John Jay, and show the numbering for the disputed set:
We split each essay into substrings of length 2000 and create vectors of length 80, using a larger-than-default setting for the number of n-grams to consider:
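Schematically, with essays the known-authorship essay strings from the import step; in the stand-in the basis size doubles as the n-gram count option, which is a simplification:

    chunks = StringPartition[#, 2000] & /@ essays;
    basis = ngramBasis[Flatten[chunks], 3, 80];
    vecs = stringVectors[#, basis] & /@ chunks;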
Remove from consideration several essays of known authorship, as well as the last chunk from each remaining essay of known authorship; these provide two validation sets with which to assess the quality of the classifier:
Since Hamilton wrote far more essays than Madison, we removed far more of his from the training set; now check that the two authors' contributions to the training set are not too different in size (that is, within a factor of 1.5):
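The check itself is short once each training chunk carries an author label (trainLabels is assumed from the preceding selection step):

    byAuthor = Counts[trainLabels];  (* chunks per author in the training set *)
    N[Max[byAuthor]/Min[byAuthor]]  (* should come out below 1.5 *)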
The two sets reduced to three dimensions, with red dots for Hamilton's strings and blue dots for Madison's, do not appear to separate nicely:
A different method gives a result that looks better but still does not show an obvious linear space separation:
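A sketch of both views, with hamVecs and madVecs the two authors' chunk vectors assumed from the preceding steps; t-SNE is offered as one plausible "different method", though the document does not name its choice:

    allVecs = Join[hamVecs, madVecs];
    pts3 = ResourceFunction["MultidimensionalScaling"][allVecs, 3];  (* argument order assumed *)
    {hamPts, madPts} = TakeDrop[pts3, Length[hamVecs]];
    Graphics3D[{PointSize[Medium], Red, Point[hamPts], Blue, Point[madPts]}]
    pts3b = DimensionReduce[allVecs, 3, Method -> "TSNE"];  (* an alternative reduction *)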
Using nearest neighbors of these vectors to classify authorship (as was done for the genome examples) is prone to failure, so instead we train a neural net, with numeric outcomes of 1 for Hamilton authorship and 2 for Madison:
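A minimal regression net for the 1-versus-2 encoding; the architecture is an assumption, and hamTrainVecs and madTrainVecs are the training vectors from the earlier split:

    trainData = Join[Thread[hamTrainVecs -> 1.], Thread[madTrainVecs -> 2.]];
    net = NetChain[{LinearLayer[50], Ramp, LinearLayer[1]}, "Input" -> 80, "Output" -> "Scalar"];
    trained = NetTrain[net, trainData, LossFunction -> MeanSquaredLossLayer[]];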
Now assess correctness percentage on the first validation set (withheld essays), as well as possible bias in incorrectness:
Do likewise for the second validation set (chunks withheld from the training group essays):
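Both checks follow one pattern; validation1 and validation2 are assumed to be lists of rules from chunk vector to known numeric label:

    classifyChunk[v_] := Round[Clip[trained[v], {1, 2}]];
    assess[validation_] := With[{pred = classifyChunk /@ Keys[validation]},
        {N@Mean[Boole[MapThread[SameQ, {pred, Values[validation]}]]],
         Counts[MapThread[Rule, {Values[validation], pred}]]}];  (* accuracy, plus actual -> predicted tallies *)
    assess[validation1]
    assess[validation2]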
Both exceed 90% correct, and both show roughly a 10% error rate in each incorrect direction (that is, Hamilton's authorship assessed as Madison's, and vice versa).
Assess authorship on the substrings in the test set (from the twelve disputed essays):
Tally these:
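With disputedVecs the chunk vectors from the twelve disputed essays:

    predictions = classifyChunk /@ disputedVecs;
    Counts[predictions /. {1 -> "Hamilton", 2 -> "Madison"}]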
We redo this experiment, this time using the "TextFCGR" method for conversion to numeric vectors:
Repeat processing through creating a neural net classifier:
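"TextFCGR" is a method of the document's function; as a crude runnable analogue, characters can be binned four ways and pushed through the same chaos-game construction used for genomes above:

    textFCGR[s_String, k_ : 3] :=
        Module[{corners = {{0, 0}, {0, 1}, {1, 0}, {1, 1}}, pt = {0.5, 0.5},
            tallies = ConstantArray[0, {2^k, 2^k}]},
          Do[pt = (pt + corners[[c]])/2;
            tallies[[Ceiling[2^k pt[[1]]], Ceiling[2^k pt[[2]]]]]++,
            {c, Mod[ToCharacterCode[ToLowerCase[s]], 4] + 1}];  (* crude 4-way character binning *)
          N[Flatten[tallies]/StringLength[s]]];
    fcgrVecs = Map[textFCGR, chunks, {2}];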
Repeat the first validation check:
Repeat the second validation check:
Both are just under 90% correct, and this time we observe a one-in-three (5/15) tendency in the first set to mistake actual Madison essays (those whose first value is 2) for Hamilton's.
Again assess authorship on the substrings in the test set (from the twelve disputed essays):
And tally these:
These results are consistent with the consensus opinion supporting Madison's authorship, or at least primary authorship, of the twelve disputed essays.