# Wolfram Function Repository

Instant-use add-on functions for the Wolfram Language

Function Repository Resource:

Convert a set of text strings to numeric vectors

Contributed by:
Daniel Lichtblau

ResourceFunction["StringsToVectors"][ converts | |

ResourceFunction["StringsToVectors"][ converts |

ResourceFunction["StringsToVectors"] is intended for comparisons and other manipulations of sets of strings. It supports what are called alignment-free methods for string comparison, clustering and similar analyses on sets of strings.

Numeric vectors produced by ResourceFunction["StringsToVectors"] will all have the same dimension and nonzero vectors will be normalized to length 1. Typical usage of ResourceFunction["StringsToVectors"] will produce vectors of modest length, in the range tens-to-hundreds range.

Input strings need not have the same lengths.

The output is suitable for creating a NearestFunction or Dendrogram. Also it can be fed to FindClusters. The output can also be used to create a classifier function.

ResourceFunction["StringsToVectors"][*strings*,Automatic] is equivalent to ResourceFunction["StringsToVectors"][*strings*].

A given string will typically have a different vector representation when it appears in different string sets, even if all settings to ResourceFunction["StringsToVectors"] are the same. In other words, results are dependent on the full set of input strings. A consequence is that results from different calls to ResourceFunction["StringsToVectors"] cannot be compared to one another.

Strings with very different lengths can have similar vectors when using the frequency-based methods supported by ResourceFunction["StringsToVectors"]. If this is not desired, one can post-process the results, for example multiplying each vector by the length of the corresponding string.

ResourceFunction["StringsToVectors"] takes a Method option. The default, "NGrams", creates vectors based on frequencies of common *n*-grams (where *n* defaults to 4). Alternatively one can use the "TextFCGR" or "GenomeFCGR" methods.

"GenomeFCGR" assumes that the characters are the ones typically found in a genomic sequence, that is, from the set {"A","C","G","T","U","N"} (in upper, lower or mixed case). All occurrences of "N", as well any other characters not in this list, are discarded. Occurrences of "U" are replaced by "T", as is standard for genomic studies that use FCGR.

The "TextFCGR" method requires input in an alphabet consistent with English, that is, containing letters "a-z" along with numerals and punctuation. It uses preprocessing to remove diacritics and convert to lower case. Characters that fall outside the supported set are discarded.

StringToVectors takes a "DimensionReductionMethod" option, with default setting Automatic. ResourceFunction["StringsToVectors"][*strings*,Infinity] will override any "DimensionReductionMethod"setting and perform no reduction in dimension. This will also be the case if the target dimension is greater than the length or number of the input vectors.

With the setting "DimensionReductionMethod"→*func*, the vectors *vecs* produced by any method are post-processed all together as *func*[*vecs*].

When no second argument is specified, the length of the vectors produced will depend on the method used as well as whether dimension reduction is used.

The "NGrams" method can be given a list of suboptions "NGramLength", "RetainCount", "DiscardCount" and IgnoreCase. "NGramLength" gives the size of the *n*-grams to consider, with the default being 4. "RetainCount" specifies the maximum number of different *n*-grams to consider. The default is 2000. If there are more distinct *n*-grams present in the union of all the input strings, only the most common ones up to that count will be considered. Initially vectors are produced that have this length, with values being the relative frequencies of the respective *n*-grams in each input string. If the second argument to StringsToVectors is less than this, then dimension reduction will be used on the resulting vectors to reduce to that length. Otherwise it will reduce vectors to length 32. "DiscardCount" tells how many of the most commonly occurring *n*-grams to discard from further consideration. The default is 0. A nondefault setting can be useful in cases where one or a few *n*-grams are so common in the input strings that they will make the resulting vectors difficult to distinguish. The IgnoreCase suboption defaults to True, in which case letter characters are converted to lower case before *n*-grams are extracted.

If a particular string contains none of the commonest *n*-grams used when the default "NGrams" method is applied, the resulting vector will be all zeros. A consequence is that all such resulting vectors will appear identical in any further processing, e.g. they will cluster together.

The "TextFCGR" method can be given a suboption "Log2Dimension"→*d* (default 7). The Frequency Chaos Game Representation is used to create a square matrix of dimension *2*^{d}×*2*^{d}. The resulting matrices are flattened to vectors. Then dimension reduction is applied.This will use the singular value decomposition when "DimensionReductionMethod" is Automatic. One can alternatively set it to a specific method, or to None if the optional second argument is not given. For this method, when the resulting vector size is either unspecified or Automatic, the default is length 800.

The "GenomeFCGR" method can also be given the suboption "Log2Dimension", again with default 7. Again the Frequency Chaos Game Representation is used to create a square matrix for each input string. A second step reduces dimension using FourierCosTransform. Another suboption, "FCTDimension" (default 32), specifies the how many of the lowest frequency components to retain. Effectively this coarsens the FCGR, if one regards it as an image array. One can retain the full array of frequencies by setting the "FCTDimension" suboption to Infinity. After this step the resulting matrices are flattened to vectors. Then dimension reduction is applied in the same manner as for the "TextFCGR" described above, in this case however producing vectors of length 32 unless the second argument for vector length is given.

Both the "TextFCGR" and "GenomeFCGR" methods take a Boolean-valued suboption "Centered" (default False). When set to True, one more row and column are added to the FCGR array in order to produce a more symmetric array. In practice this does not seem to improve results in terms of how the vectors cluster, hence the default is False. See documentation for the resource function FCGRImage for further information about this suboption.

The methods and hyperparameter values used for dimension reduction in the automatic settings for the respective FCGR-based methods comes from empirical work.

Details regarding "TextFCGR" settings can be found in the Chaos game representation for authorship attribution reference.

See the Alignment-free genomic sequence comparison using FCGR and signal processing reference for details on the "GenomeFCGR" method.

Create 10 strings of text:

In[1]:= |

Out[1]= |

Convert them to vectors:

In[2]:= |

Out[2]= |

Start with four groups of five similar strings, shuffled into a randomized order:

In[3]:= |

Create vectors for these 20 strings:

In[4]:= |

Out[4]= |

Further reduce the dimension so the four distinct groupings can be clustered and visualized in three dimensions:

In[5]:= |

Out[5]= |

Work with a set of three groups of four similar strings each, where not all strings have the same length:

In[6]:= |

Out[7]= |

Create a set of vectors each with 12 elements:

In[8]:= |

Out[8]= |

Cluster into three groups:

In[9]:= |

Out[9]= |

To see the clustering makes sense, we use Nearest to show that each has three fairly close neighbors, with the fourth closest being notably further than the first three:

In[10]:= |

Out[10]= |

Now extract the set of three closest neighbors for each string according to this vector encoding:

In[11]:= |

Out[11]= |

Instead use the "TextFCGR" method on this set of strings:

In[12]:= |

Out[12]= |

Again find neighbor sets using Nearest:

In[13]:= |

Out[13]= |

Again extract the set of three closest neighbors using this vector encoding:

In[14]:= |

Out[14]= |

Verify that these give the same sets of closest neighbors:

In[15]:= |

Out[15]= |

A utility function that creates similar random strings each time it is invoked:

In[16]:= |

Create eight sets of strings of various lengths, in randomized order:

In[17]:= |

Create vectors of dimension 6:

In[18]:= |

Show a dendrogram produced by these vectors:

In[19]:= |

Out[19]= |

We obtain a very similar dendrogram, with the same low-level groupings, from the "TextFCGR" method:

In[20]:= |

Out[6]= |

We also obtain a dendrogram,with the same low-level groupings from the strings themselves, albeit more slowly since the distances take more time to compute:

In[21]:= |

Out[21]= |

Create three sets of random strings:

In[22]:= |

Out[23]= |

Reduce dimension to 3 using the default and also normalizing the result of DimensionReduce with the setting Method→"LatentSemanticAnalysis":

In[24]:= |

Out[25]= |

These agree to close approximation, up to signs:

In[26]:= |

Out[26]= |

Get genomic data for the BRCA2 gene in different species:

In[27]:= |

Out[3]= |

A dendrogram from the default conversion to vectors places *Gallus gallus* (jungle fowl) and *Canis lupus familiaris* (dogs) closer than *Pan troglodytes* (chimpanzees) to *Homo sapiens* (humans), with the chimpanzees not far from *Danio rerio* (zebrafish):

In[28]:= |

Out[20]= |

These strings contain many "N" characters, and results will improve if we preprocess to remove them:

In[29]:= |

Recompute and show a phylogenetic tree with a more plausible placement:

In[30]:= |

Out[8]= |

Note that the "GenomeFCGR" method does not suffer unduly from presence of "N" characters since it removes them:

In[31]:= |

Out[10]= |

Create a set of 30 strings, comprised of three subsets of 10 similar strings:

In[32]:= |

Out[2]= |

Convert the strings to vectors:

In[33]:= |

Out[33]= |

Quite clearly the vectors obtained by default split into the three expected classes:

In[34]:= |

Out[34]= |

Using Method→"GenomeFCGR" on arbitrary (that is, non-genome) string sets can give quite bad results:

In[35]:= |

Out[35]= |

Some viral genomes in the FASTA database:

In[36]:= |

A list of corresponding genome identifiers in FASTA format:

In[37]:= |

Import these using the resource function ImportFASTA; this takes a bit of time:

In[38]:= |

Out[38]= |

Get the genomic sequence strings:

In[39]:= |

Create vectors for these sequences:

In[40]:= |

Show a dendrogram using different colors for the distinct viral types:

In[41]:= |

Out[42]= |

We get a similar dendrogram from the "GenomeFCGR" method:

In[43]:= |

Out[44]= |

We use a standard test set of mammal species:

In[45]:= |

Import the sequences from the FASTA site:

In[46]:= |

Out[46]= |

Form vectors from these genome strings:

In[47]:= |

Set up data for a phylogenetic tree:

In[48]:= |

Out[31]= |

Show the dendrogram obtained from this encoding of the genomes into vectors:

In[49]:= |

Out[49]= |

The genome FCGR encoding gives a similar grouping:

In[50]:= |

Out[14]= |

Compare to the resource function PhylogeneticTreePlot:

In[51]:= |

Out[51]= |

Find five nearest neighbors for each species:

In[52]:= |

Out[53]= |

Use the resource function MultidimensionalScaling to reduce to three dimensions for visualization:

In[54]:= |

Out[54]= |

We can use string vectors for determining when substrings might have a similar source.

Obtain several English texts from ExampleData:

In[55]:= |

Split into sets of substrings of length 2000:

In[56]:= |

Create vectors of length 80 for these strings:

In[57]:= |

Out[58]= |

Split into odd and even numbered substrings for purposes of training and testing a classifier:

In[59]:= |

Train a neural net to classify the different texts:

In[60]:= |

Check that 99% of the test strings are classified correctly:

In[61]:= |

Out[63]= |

Authorship of nineteenth century French novels. Data is from the reference Understanding and explaining Delta measures for authorship attribution.The importing step can take as much as a few minutes.

Import 75 French novels with three by each of 25 authors, clipping the size of the larger ones in order to work with strings of comparable lengths:

In[64]:= |

Out[11]= |

Partition each into substrings of 10000 characters and split into training (two novels per author) and test (one per author) sets:

In[65]:= |

Create vectors of length 200:

In[66]:= |

Out[67]= |

A simple classifier associates around 80% of the substrings in the test set with the correct author:

In[68]:= |

Out[37]= |

A simple neural net associates nearly 90% with the actual author:

In[69]:= |

Out[41]= |

We apply string vector similarity to the task of authorship attribution. We use as an example the Federalist Papers, where authorship of several was disputed for around 175 years. The individual authors were Alexander Hamilton, James Madison and John Jay. Due to happenstance of history (Hamilton may have recorded his claims of authorship in haste, with his duel with Aaron Burr encroaching), twelve were claimed by both Hamilton and Madison. The question was largely settled in the 1960's (see Author Notes). Nevertheless this is generally acknowledged as a difficult test for authorship attribution methods. We use some analysis of *n*-gram strings to show one method of analysis.

Import data, split into individual essays and remove author names and boilerplate common to most or all articles:

In[70]:= |

Separate out the three essays jointly attributed to Hamilton and Madison, as well as the five written by John Jay, and show the numbering for the disputed set:

In[71]:= |

Out[75]= |

We split each essay into substrings of length 2000 and create vectors of length 80, using a larger-than-default setting for the number of *n*-grams to consider:

In[76]:= |

Out[80]= |

Remove from consideration several essays of known authorship, as well as the last chunk from each remaining essay of known authorship; these provide two validation sets for which we will assess the quality of the classifier:

In[81]:= |

Since Hamilton wrote far more essays than Madison, we removed far more of his from the training set and now check that the contributions of the two authors to the training set are not terribly different in size (that is, within a factor of 1.5):

In[82]:= |

Out[82]= |

The two sets reduced to three dimensions, with red dots for Hamilton's strings and blue dots for Madison's, do not appear to separate nicely:

In[83]:= |

Out[84]= |

A different method gives a result that looks better but still does not show an obvious linear space separation:

In[85]:= |

Out[87]= |

Using nearest neighbors (as done for the genome set examples) of these vectors to classify authorship is prone to failure, so instead we train a neural net, with numeric outcomes of 1 for Hamilton authorship and 2 for Madison:

In[88]:= |

Now assess correctness percentage on the first validation set (withheld essays), as well as possible bias in incorrectness:

In[89]:= |

Out[90]= |

Do likewise for the second validation set (chunks withheld from the training group essays):

In[91]:= |

Out[92]= |

Both exceed 90% correct, and both show around a 10% inclination in each incorrect direction (that is, Hamilton authorship assessed as Madison or vice versa).

Assess authorship on the substrings in the test set (from the twelve disputed essays):

In[93]:= |

Out[93]= |

Tally these:

In[94]:= |

Out[94]= |

We redo this experiment, this time using the "TextFGCR" method for conversion to numeric vectors:

In[95]:= |

Out[95]= |

Repeat processing through creating a neural net classifier:

In[96]:= |

Repeat the first validation check:

In[97]:= |

Out[98]= |

Repeat the second validation check:

In[99]:= |

Out[100]= |

Both are just under 90%, and we observe this time there is a one-in-three (5/15) tendency in the first set to mistake actual Madison essays (where the first value is 2) as Hamilton's.

Again assess authorship on the substrings in the test set (from the twelve disputed essays):

In[101]:= |

Out[101]= |

And tally these:

In[102]:= |

Out[102]= |

These results are consistent with the consensus opinion supporting Madison authorship, or at least primary authorship, on the twelve disputed essays.

- Authorship Forensics from Twitter Data
- Automated Authorship Verification: Did We Really Write Those Blogs We Said We Wrote?
- Florida Spring Break 2021: February COVID-19 Data Forecasts the March of the Variants
- From sequenced SARS-CoV-2 genomes to a phylogenetic tree
- Finding and analyzing a COVID subvariant in Australia

Wolfram Language 13.0 (December 2021) or above

This work is licensed under a Creative Commons Attribution 4.0 International License