Wolfram Function Repository
Instant-use add-on functions for the Wolfram Language
Function Repository Resource:
Convert a set of text strings to numeric vectors
ResourceFunction["StringsToVectors"][strings] converts strings to a list of numeric vectors.
ResourceFunction["StringsToVectors"][strings,m] converts strings to vectors of length m.
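As a sketch of the two call forms above (illustrative only; the actual numeric values depend on the function's internal n-gram analysis):

```wl
stv = ResourceFunction["StringsToVectors"];

(* default form: one numeric vector per input string *)
vecs = stv[{"the quick brown fox", "jumps over the lazy dog"}];

(* second argument fixes the vector length, here 12 *)
vecs12 = stv[{"the quick brown fox", "jumps over the lazy dog"}, 12];
```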
Create 10 strings of text:
In[1]:= | ![]() |
Out[1]= | ![]() |
Convert them to vectors:
In[2]:= | ![]() |
Out[2]= | ![]() |
Start with four groups of five similar strings, shuffled into a randomized order:
In[3]:= | ![]() |
Create vectors for these 20 strings:
In[4]:= | ![]() |
Out[4]= | ![]() |
Further reduce the dimension so the four distinct groupings can be clustered and visualized in three dimensions:
In[5]:= | ![]() |
Out[5]= | ![]() |
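The reduction-and-visualization step might look like the following sketch, where vecs is assumed to hold the vectors created above; DimensionReduce, FindClusters and ListPointPlot3D are standard Wolfram Language functions:

```wl
(* reduce the string vectors to three dimensions *)
reduced = DimensionReduce[vecs, 3];

(* split into the four expected clusters and plot each with its own color *)
clusters = FindClusters[reduced, 4];
ListPointPlot3D[clusters, PlotStyle -> PointSize[Large]]
```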
Work with a set of three groups of four similar strings each, where not all strings have the same length:
In[6]:= | ![]() |
Out[7]= | ![]() |
Create a set of vectors each with 12 elements:
In[8]:= | ![]() |
Out[8]= | ![]() |
Cluster into three groups:
In[9]:= | ![]() |
Out[9]= | ![]() |
To see that the clustering makes sense, we use Nearest to show that each string has three fairly close neighbors, with the fourth closest being notably farther away than the first three:
In[10]:= | ![]() |
Out[10]= | ![]() |
Now extract the set of three closest neighbors for each string according to this vector encoding:
In[11]:= | ![]() |
Out[11]= | ![]() |
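A sketch of this Nearest-based neighbor extraction, assuming strings and vecs hold the strings and their vectors from the steps above:

```wl
(* map each vector back to its source string *)
nf = Nearest[Thread[vecs -> strings]];

(* four nearest per vector; the first is the string itself, so drop it *)
neighborSets = Rest[nf[#, 4]] & /@ vecs;
```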
Instead use the "TextFCGR" method on this set of strings:
In[12]:= | ![]() |
Out[12]= | ![]() |
Again find neighbor sets using Nearest:
In[13]:= | ![]() |
Out[13]= | ![]() |
Again extract the set of three closest neighbors using this vector encoding:
In[14]:= | ![]() |
Out[14]= | ![]() |
Verify that these give the same sets of closest neighbors:
In[15]:= | ![]() |
Out[15]= | ![]() |
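The verification might compare the two neighbor lists as sets (a sketch; neighborsDefault and neighborsFCGR are hypothetical names for the neighbor sets computed with the default and "TextFCGR" encodings):

```wl
(* order within each neighbor set does not matter, so sort before comparing *)
(Sort /@ neighborsDefault) === (Sort /@ neighborsFCGR)
```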
A utility function that creates similar random strings each time it is invoked:
In[16]:= | ![]() |
Create eight sets of strings of various lengths, in randomized order:
In[17]:= | ![]() |
Create vectors of dimension 6:
In[18]:= | ![]() |
Show a dendrogram produced by these vectors:
In[19]:= | ![]() |
Out[19]= | ![]() |
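The dendrogram step could be sketched as follows, with vecs and strings assumed from the steps above:

```wl
(* label each leaf of the hierarchical clustering with its string *)
Dendrogram[Thread[vecs -> strings], DistanceFunction -> EuclideanDistance]
```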
We obtain a very similar dendrogram, with the same low-level groupings, from the "TextFCGR" method:
In[20]:= | ![]() |
Out[6]= | ![]() |
We also obtain a dendrogram, with the same low-level groupings, from the strings themselves, albeit more slowly since the string distances take more time to compute:
In[21]:= | ![]() |
Out[21]= | ![]() |
Create three sets of random strings:
In[22]:= | ![]() |
Out[23]= | ![]() |
Reduce the dimension to 3 using the default method, and also by normalizing the result of DimensionReduce with the setting Method→"LatentSemanticAnalysis":
In[24]:= | ![]() |
Out[25]= | ![]() |
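One plausible reading of this comparison (the exact normalization applied in the notebook is not shown, so normalizing each reduced vector with Normalize is an assumption):

```wl
(* default three-dimensional reduction *)
r1 = DimensionReduce[vecs, 3];

(* latent semantic analysis, with each reduced vector normalized *)
r2 = Normalize /@ DimensionReduce[vecs, 3, Method -> "LatentSemanticAnalysis"];
```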
These agree to close approximation, up to signs:
In[26]:= | ![]() |
Out[26]= | ![]() |
Get genomic data for the BRCA2 gene in different species:
In[27]:= | ![]() |
Out[3]= | ![]() |
A dendrogram from the default conversion to vectors places Gallus gallus (jungle fowl) and Canis lupus familiaris (dog) closer to Homo sapiens (human) than Pan troglodytes (chimpanzee), with the chimpanzee not far from Danio rerio (zebrafish):
In[28]:= | ![]() |
Out[20]= | ![]() |
These strings contain many "N" characters, and results will improve if we preprocess to remove them:
In[29]:= | ![]() |
Recompute and show a phylogenetic tree with a more plausible placement:
In[30]:= | ![]() |
Out[8]= | ![]() |
Note that the "GenomeFCGR" method does not suffer unduly from the presence of "N" characters, since it removes them:
In[31]:= | ![]() |
Out[10]= | ![]() |
Create a set of 30 strings, comprising three subsets of 10 similar strings:
In[32]:= | ![]() |
Out[2]= | ![]() |
Convert the strings to vectors:
In[33]:= | ![]() |
Out[33]= | ![]() |
Quite clearly the vectors obtained by default split into the three expected classes:
In[34]:= | ![]() |
Out[34]= | ![]() |
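The clustering check might be sketched as follows (strings and vecs are assumed from the previous steps):

```wl
(* cluster the vectors into three groups, labeling each point by its source string *)
FindClusters[Thread[vecs -> strings], 3]
```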
Using Method→"GenomeFCGR" on arbitrary (that is, non-genome) string sets can give quite bad results:
In[35]:= | ![]() |
Out[35]= | ![]() |
Some viral genomes in the FASTA database:
In[36]:= | ![]() |
A list of corresponding genome identifiers in FASTA format:
In[37]:= | ![]() |
Import these using the resource function ImportFASTA; this takes a bit of time:
In[38]:= | ![]() |
Out[38]= | ![]() |
Get the genomic sequence strings:
In[39]:= | ![]() |
Create vectors for these sequences:
In[40]:= | ![]() |
Show a dendrogram using different colors for the distinct viral types:
In[41]:= | ![]() |
Out[42]= | ![]() |
We get a similar dendrogram from the "GenomeFCGR" method:
In[43]:= | ![]() |
Out[44]= | ![]() |
We use a standard test set of mammal species:
In[45]:= | ![]() |
Import the sequences from the FASTA site:
In[46]:= | ![]() |
Out[46]= | ![]() |
Form vectors from these genome strings:
In[47]:= | ![]() |
Set up data for a phylogenetic tree:
In[48]:= | ![]() |
Out[31]= | ![]() |
Show the dendrogram obtained from this encoding of the genomes into vectors:
In[49]:= | ![]() |
Out[49]= | ![]() |
The genome FCGR encoding gives a similar grouping:
In[50]:= | ![]() |
Out[14]= | ![]() |
Compare to the resource function PhylogeneticTreePlot:
In[51]:= | ![]() |
Out[51]= | ![]() |
Find five nearest neighbors for each species:
In[52]:= | ![]() |
Out[53]= | ![]() |
Use the resource function MultidimensionalScaling to reduce to three dimensions for visualization:
In[54]:= | ![]() |
Out[54]= | ![]() |
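A sketch of this step; the exact argument signature of the MultidimensionalScaling resource function is an assumption here, and vecs is assumed to hold the genome vectors from above:

```wl
(* reduce to three dimensions for visualization; signature assumed *)
mds = ResourceFunction["MultidimensionalScaling"][vecs, 3];
ListPointPlot3D[mds, PlotStyle -> PointSize[Large]]
```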
We can use string vectors to determine when substrings might come from a similar source.
Obtain several English texts from ExampleData:
In[55]:= | ![]() |
Split into sets of substrings of length 2000:
In[56]:= | ![]() |
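The splitting step might use StringPartition (a sketch; texts is assumed to hold the texts imported from ExampleData):

```wl
(* split each text into consecutive substrings of 2000 characters *)
chunks = StringPartition[#, 2000] & /@ texts;
```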
Create vectors of length 80 for these strings:
In[57]:= | ![]() |
Out[58]= | ![]() |
Split into odd- and even-numbered substrings for purposes of training and testing a classifier:
In[59]:= | ![]() |
Train a neural net to classify the different texts:
In[60]:= | ![]() |
Check that 99% of the test strings are classified correctly:
In[61]:= | ![]() |
Out[63]= | ![]() |
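The training and accuracy steps might be sketched as follows; the exact network used in the notebook is not shown, so this minimal net is an assumption, as are the names classLabels, trainVecs, trainLabels, testVecs and testLabels:

```wl
(* a minimal classifier over 80-dimensional string vectors *)
net = NetChain[{LinearLayer[32], Ramp, LinearLayer[Length[classLabels]], SoftmaxLayer[]},
   "Input" -> 80, "Output" -> NetDecoder[{"Class", classLabels}]];
trained = NetTrain[net, Thread[trainVecs -> trainLabels]];

(* fraction of test vectors classified correctly *)
N[Mean[MapThread[Boole@*SameQ, {trained[testVecs], testLabels}]]]
```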
Authorship of nineteenth-century French novels: the data is from the reference Understanding and explaining Delta measures for authorship attribution. The importing step can take as much as a few minutes.
Import 75 French novels with three by each of 25 authors, clipping the size of the larger ones in order to work with strings of comparable lengths:
In[64]:= | ![]() |
Out[11]= | ![]() |
Partition each into substrings of 10000 characters and split into training (two novels per author) and test (one per author) sets:
In[65]:= | ![]() |
Create vectors of length 200:
In[66]:= | ![]() |
Out[67]= | ![]() |
A simple classifier associates around 80% of the substrings in the test set with the correct author:
In[68]:= | ![]() |
Out[37]= | ![]() |
A simple neural net associates nearly 90% with the actual author:
In[69]:= | ![]() |
Out[41]= | ![]() |
We apply string vector similarity to the task of authorship attribution, using as an example the Federalist Papers, where authorship of several essays was disputed for around 175 years. The individual authors were Alexander Hamilton, James Madison and John Jay. Due to a happenstance of history (Hamilton may have recorded his claims of authorship in haste, with his duel with Aaron Burr looming), twelve essays were claimed by both Hamilton and Madison. The question was largely settled in the 1960s (see Author Notes). Nevertheless, this is generally acknowledged as a difficult test for authorship attribution methods. We use n-gram string vectors to show one method of analysis.
Import data, split into individual essays and remove author names and boilerplate common to most or all articles:
In[70]:= | ![]() |
Separate out the three essays jointly attributed to Hamilton and Madison, as well as the five written by John Jay, and show the numbering for the disputed set:
In[71]:= | ![]() |
Out[75]= | ![]() |
We split each essay into substrings of length 2000 and create vectors of length 80, using a larger-than-default setting for the number of n-grams to consider:
In[76]:= | ![]() |
Out[80]= | ![]() |
Remove from consideration several essays of known authorship, as well as the last chunk from each remaining essay of known authorship; these provide two validation sets for which we will assess the quality of the classifier:
In[81]:= | ![]() |
Since Hamilton wrote far more essays than Madison, we removed far more of his from the training set; now check that the two authors' contributions to the training set are not terribly different in size (that is, within a factor of 1.5):
In[82]:= | ![]() |
Out[82]= | ![]() |
The two sets reduced to three dimensions, with red dots for Hamilton's strings and blue dots for Madison's, do not appear to separate nicely:
In[83]:= | ![]() |
Out[84]= | ![]() |
A different method gives a result that looks better but still does not show an obvious linear space separation:
In[85]:= | ![]() |
Out[87]= | ![]() |
Using nearest neighbors (as done for the genome set examples) of these vectors to classify authorship is prone to failure, so instead we train a neural net, with numeric outcomes of 1 for Hamilton authorship and 2 for Madison:
In[88]:= | ![]() |
Now assess correctness percentage on the first validation set (withheld essays), as well as possible bias in incorrectness:
In[89]:= | ![]() |
Out[90]= | ![]() |
Do likewise for the second validation set (chunks withheld from the training group essays):
In[91]:= | ![]() |
Out[92]= | ![]() |
Both exceed 90% correct, and both show around a 10% inclination in each incorrect direction (that is, Hamilton authorship assessed as Madison or vice versa).
Assess authorship on the substrings in the test set (from the twelve disputed essays):
In[93]:= | ![]() |
Out[93]= | ![]() |
Tally these:
In[94]:= | ![]() |
Out[94]= | ![]() |
We redo this experiment, this time using the "TextFCGR" method for conversion to numeric vectors:
In[95]:= | ![]() |
Out[95]= | ![]() |
Repeat processing through creating a neural net classifier:
In[96]:= | ![]() |
Repeat the first validation check:
In[97]:= | ![]() |
Out[98]= | ![]() |
Repeat the second validation check:
In[99]:= | ![]() |
Out[100]= | ![]() |
Both are just under 90%, and this time we observe a one-in-three (5/15) tendency in the first set to mistake actual Madison essays (where the first value is 2) for Hamilton's.
Again assess authorship on the substrings in the test set (from the twelve disputed essays):
In[101]:= | ![]() |
Out[101]= | ![]() |
And tally these:
In[102]:= | ![]() |
Out[102]= | ![]() |
These results are consistent with the consensus opinion supporting Madison authorship, or at least primary authorship, on the twelve disputed essays.
Wolfram Language 13.0 (December 2021) or above
This work is licensed under a Creative Commons Attribution 4.0 International License