Wolfram Function Repository
Instant-use add-on functions for the Wolfram Language
Function Repository Resource:
Convert a set of text strings to numeric vectors
ResourceFunction["StringsToVectors"][strings] converts strings to a list of numeric vectors.
ResourceFunction["StringsToVectors"][strings,m] converts strings to vectors of length m.
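As a sketch of the two call forms above (illustrative only; the actual numeric values depend on the function's internal n-gram analysis):

```wl
stv = ResourceFunction["StringsToVectors"];

(* default form: one numeric vector per input string *)
vecs = stv[{"the quick brown fox", "jumps over the lazy dog"}];

(* second argument fixes the vector length, here 12 *)
vecs12 = stv[{"the quick brown fox", "jumps over the lazy dog"}, 12];
```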
Create 10 strings of text:
In[1]:= | ![]() |
Out[1]= | ![]() |
Convert them to vectors:
In[2]:= | ![]() |
Out[2]= | ![]() |
Start with four groups of five similar strings, shuffled into a randomized order:
In[3]:= | ![]() |
Create vectors for these 20 strings:
In[4]:= | ![]() |
Out[4]= | ![]() |
Further reduce the dimension so the four distinct groupings can be clustered and visualized in three dimensions:
In[5]:= | ![]() |
Out[5]= | ![]() |
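The reduction-and-visualization step might look like the following sketch, where vecs is assumed to hold the vectors created above; DimensionReduce, FindClusters and ListPointPlot3D are standard Wolfram Language functions:

```wl
(* reduce the string vectors to three dimensions *)
reduced = DimensionReduce[vecs, 3];

(* split into the four expected clusters and plot each with its own color *)
clusters = FindClusters[reduced, 4];
ListPointPlot3D[clusters, PlotStyle -> PointSize[Large]]
```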
Work with a set of three groups of four similar strings each, where not all strings have the same length:
In[6]:= | ![]() |
Out[7]= | ![]() |
Create a set of vectors each with 12 elements:
In[8]:= | ![]() |
Out[8]= | ![]() |
Cluster into three groups:
In[9]:= | ![]() |
Out[9]= | ![]() |
To see that the clustering makes sense, we use Nearest to show that each string has three fairly close neighbors, with the fourth closest being notably farther away than the first three:
In[10]:= | ![]() |
Out[10]= | ![]() |
Now extract the set of three closest neighbors for each string according to this vector encoding:
In[11]:= | ![]() |
Out[11]= | ![]() |
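A sketch of this Nearest-based neighbor extraction, assuming strings and vecs hold the strings and their vectors from the steps above:

```wl
(* map each vector back to its source string *)
nf = Nearest[Thread[vecs -> strings]];

(* four nearest per vector; the first is the string itself, so drop it *)
neighborSets = Rest[nf[#, 4]] & /@ vecs;
```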
Instead use the "TextFCGR" method on this set of strings:
In[12]:= | ![]() |
Out[12]= | ![]() |
Again find neighbor sets using Nearest:
In[13]:= | ![]() |
Out[13]= | ![]() |
Again extract the set of three closest neighbors using this vector encoding:
In[14]:= | ![]() |
Out[14]= | ![]() |
Verify that these give the same sets of closest neighbors:
In[15]:= | ![]() |
Out[15]= | ![]() |
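The verification might compare the two neighbor lists as sets (a sketch; neighborsDefault and neighborsFCGR are hypothetical names for the neighbor sets computed with the default and "TextFCGR" encodings):

```wl
(* order within each neighbor set does not matter, so sort before comparing *)
(Sort /@ neighborsDefault) === (Sort /@ neighborsFCGR)
```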
A utility function that creates similar random strings each time it is invoked:
In[16]:= | ![]() |
Create eight sets of strings of various lengths, in randomized order:
In[17]:= | ![]() |
Create vectors of dimension 6:
In[18]:= | ![]() |
Show a dendrogram produced by these vectors:
In[19]:= | ![]() |
Out[19]= | ![]() |
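The dendrogram step could be sketched as follows, with vecs and strings assumed from the steps above:

```wl
(* label each leaf of the hierarchical clustering with its string *)
Dendrogram[Thread[vecs -> strings], DistanceFunction -> EuclideanDistance]
```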
We obtain a very similar dendrogram, with the same low-level groupings, from the "TextFCGR" method:
In[20]:= | ![]() |
Out[6]= | ![]() |
We also obtain a dendrogram, with the same low-level groupings, from the strings themselves, albeit more slowly since the string distances take more time to compute:
In[21]:= | ![]() |
Out[21]= | ![]() |
Create three sets of random strings:
In[22]:= | ![]() |
Out[23]= | ![]() |
Reduce the dimension to 3 using the default method, and also by normalizing the result of DimensionReduce with the setting Method→"LatentSemanticAnalysis":
In[24]:= | ![]() |
Out[25]= | ![]() |
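One plausible reading of this comparison (the exact normalization applied in the notebook is not shown, so normalizing each reduced vector with Normalize is an assumption):

```wl
(* default three-dimensional reduction *)
r1 = DimensionReduce[vecs, 3];

(* latent semantic analysis, with each reduced vector normalized *)
r2 = Normalize /@ DimensionReduce[vecs, 3, Method -> "LatentSemanticAnalysis"];
```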
These agree to close approximation, up to signs:
In[26]:= | ![]() |
Out[26]= | ![]() |
Get genomic data for the BRCA2 gene in different species:
In[27]:= | ![]() |
Out[3]= | ![]() |
A dendrogram from the default conversion to vectors places Gallus gallus (jungle fowl) and Canis lupus familiaris (dog) closer to Homo sapiens (human) than Pan troglodytes (chimpanzee), with the chimpanzee not far from Danio rerio (zebrafish):
In[28]:= | ![]() |
Out[20]= | ![]() |
These strings contain many "N" characters, and results will improve if we preprocess to remove them:
In[29]:= | ![]() |
Recompute and show a phylogenetic tree with a more plausible placement:
In[30]:= | ![]() |
Out[8]= | ![]() |
Note that the "GenomeFCGR" method does not suffer unduly from the presence of "N" characters, since it removes them:
In[31]:= | ![]() |
Out[10]= | ![]() |
Create a set of 30 strings, comprising three subsets of 10 similar strings:
In[32]:= | ![]() |
Out[2]= | ![]() |
Convert the strings to vectors:
In[33]:= | ![]() |
Out[33]= | ![]() |
Quite clearly the vectors obtained by default split into the three expected classes:
In[34]:= | ![]() |
Out[34]= | ![]() |
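The clustering check might be sketched as follows (strings and vecs are assumed from the previous steps):

```wl
(* cluster the vectors into three groups, labeling each point by its source string *)
FindClusters[Thread[vecs -> strings], 3]
```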
Using Method→"GenomeFCGR" on arbitrary (that is, non-genome) string sets can give quite bad results:
In[35]:= | ![]() |
Out[35]= | ![]() |
Some viral genomes in the FASTA database:
In[36]:= | ![]() |
A list of corresponding genome identifiers in FASTA format:
In[37]:= | ![]() |
Import these using the resource function ImportFASTA; this takes a bit of time:
In[38]:= | ![]() |
Out[38]= | ![]() |
Get the genomic sequence strings:
In[39]:= | ![]() |
Create vectors for these sequences:
In[40]:= | ![]() |
Show a dendrogram using different colors for the distinct viral types:
In[41]:= | ![]() |
Out[42]= | ![]() |
We get a similar dendrogram from the "GenomeFCGR" method:
In[43]:= | ![]() |
Out[44]= | ![]() |
We use a standard test set of mammal species:
In[45]:= | ![]() |
Import the sequences from the FASTA site:
In[46]:= | ![]() |
Out[46]= | ![]() |
Form vectors from these genome strings:
In[47]:= | ![]() |
Set up data for a phylogenetic tree:
In[48]:= | ![]() |
Out[31]= | ![]() |
Show the dendrogram obtained from this encoding of the genomes into vectors:
In[49]:= | ![]() |
Out[49]= | ![]() |
The genome FCGR encoding gives a similar grouping:
In[50]:= | ![]() |
Out[14]= | ![]() |
Compare to the resource function PhylogeneticTreePlot:
In[51]:= | ![]() |
Out[51]= | ![]() |
Find five nearest neighbors for each species:
In[52]:= | ![]() |
Out[53]= | ![]() |
Use the resource function MultidimensionalScaling to reduce to three dimensions for visualization:
In[54]:= | ![]() |
Out[54]= | ![]() |
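A sketch of this step; the exact argument signature of the MultidimensionalScaling resource function is an assumption here, and vecs is assumed to hold the genome vectors from above:

```wl
(* reduce to three dimensions for visualization; signature assumed *)
mds = ResourceFunction["MultidimensionalScaling"][vecs, 3];
ListPointPlot3D[mds, PlotStyle -> PointSize[Large]]
```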
We can use string vectors to determine when substrings might come from a similar source.
Obtain several English texts from ExampleData:
In[55]:= | ![]() |
Split into sets of substrings of length 2000:
In[56]:= | ![]() |
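The splitting step might use StringPartition (a sketch; texts is assumed to hold the texts imported from ExampleData):

```wl
(* split each text into consecutive substrings of 2000 characters *)
chunks = StringPartition[#, 2000] & /@ texts;
```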
Create vectors of length 80 for these strings:
In[57]:= | ![]() |
Out[58]= | ![]() |
Split into odd- and even-numbered substrings for purposes of training and testing a classifier:
In[59]:= | ![]() |
Train a neural net to classify the different texts:
In[60]:= | ![]() |
Check that 99% of the test strings are classified correctly:
In[61]:= | ![]() |
Out[63]= | ![]() |
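The training and accuracy steps might be sketched as follows; the exact network used in the notebook is not shown, so this minimal net is an assumption, as are the names classLabels, trainVecs, trainLabels, testVecs and testLabels:

```wl
(* a minimal classifier over 80-dimensional string vectors *)
net = NetChain[{LinearLayer[32], Ramp, LinearLayer[Length[classLabels]], SoftmaxLayer[]},
   "Input" -> 80, "Output" -> NetDecoder[{"Class", classLabels}]];
trained = NetTrain[net, Thread[trainVecs -> trainLabels]];

(* fraction of test vectors classified correctly *)
N[Mean[MapThread[Boole@*SameQ, {trained[testVecs], testLabels}]]]
```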
Authorship of nineteenth-century French novels: the data is from the reference Understanding and explaining Delta measures for authorship attribution. The importing step can take as much as a few minutes.
Import 75 French novels with three by each of 25 authors, clipping the size of the larger ones in order to work with strings of comparable lengths:
In[64]:= | ![]() |
Out[11]= | ![]() |
Partition each into substrings of 10000 characters and split into training (two novels per author) and test (one per author) sets:
In[65]:= | ![]() |
Create vectors of length 200:
In[66]:= | ![]() |
Out[67]= | ![]() |
A simple classifier associates around 80% of the substrings in the test set with the correct author:
In[68]:= | ![]() |
Out[37]= | ![]() |
A simple neural net associates nearly 90% with the actual author:
In[69]:= | ![]() |
Out[41]= | ![]() |
We apply string vector similarity to the task of authorship attribution, using as an example the Federalist Papers, where authorship of several essays was disputed for around 175 years. The individual authors were Alexander Hamilton, James Madison and John Jay. Due to a happenstance of history (Hamilton may have recorded his claims of authorship in haste, with his duel with Aaron Burr looming), twelve essays were claimed by both Hamilton and Madison. The question was largely settled in the 1960s (see Author Notes). Nevertheless, this is generally acknowledged as a difficult test for authorship attribution methods. We use n-gram string vectors to show one method of analysis.
Import data, split into individual essays and remove author names and boilerplate common to most or all articles:
In[70]:= | ![]() |
Separate out the three essays jointly attributed to Hamilton and Madison, as well as the five written by John Jay, and show the numbering for the disputed set:
In[71]:= | ![]() |
Out[75]= | ![]() |
We split each essay into substrings of length 2000 and create vectors of length 80, using a larger-than-default setting for the number of n-grams to consider:
In[76]:= | ![]() |
Out[80]= | ![]() |
Remove from consideration several essays of known authorship, as well as the last chunk from each remaining essay of known authorship; these provide two validation sets for which we will assess the quality of the classifier:
In[81]:= | ![]() |
Since Hamilton wrote far more essays than Madison, we removed far more of his from the training set; now check that the two authors' contributions to the training set are not terribly different in size (that is, within a factor of 1.5):
In[82]:= | ![]() |
Out[82]= | ![]() |
The two sets reduced to three dimensions, with red dots for Hamilton's strings and blue dots for Madison's, do not appear to separate nicely:
In[83]:= | ![]() |
Out[84]= | ![]() |
A different method gives a result that looks better but still does not show an obvious linear space separation:
In[85]:= | ![]() |
Out[87]= | ![]() |
Using nearest neighbors (as done for the genome set examples) of these vectors to classify authorship is prone to failure, so instead we train a neural net, with numeric outcomes of 1 for Hamilton authorship and 2 for Madison:
In[88]:= | ![]() |
Now assess correctness percentage on the first validation set (withheld essays), as well as possible bias in incorrectness:
In[89]:= | ![]() |
Out[90]= | ![]() |
Do likewise for the second validation set (chunks withheld from the training group essays):
In[91]:= | ![]() |
Out[92]= | ![]() |
Both exceed 90% correct, and both show around a 10% inclination in each incorrect direction (that is, Hamilton authorship assessed as Madison or vice versa).
Assess authorship on the substrings in the test set (from the twelve disputed essays):
In[93]:= | ![]() |
Out[93]= | ![]() |
Tally these:
In[94]:= | ![]() |
Out[94]= | ![]() |
We redo this experiment, this time using the "TextFCGR" method for conversion to numeric vectors:
In[95]:= | ![]() |
Out[95]= | ![]() |
Repeat processing through creating a neural net classifier:
In[96]:= | ![]() |
Repeat the first validation check:
In[97]:= | ![]() |
Out[98]= | ![]() |
Repeat the second validation check:
In[99]:= | ![]() |
Out[100]= | ![]() |
Both are just under 90%, and this time we observe a one-in-three (5/15) tendency in the first set to mistake actual Madison essays (where the first value is 2) for Hamilton's.
Again assess authorship on the substrings in the test set (from the twelve disputed essays):
In[101]:= | ![]() |
Out[101]= | ![]() |
And tally these:
In[102]:= | ![]() |
Out[102]= | ![]() |
These results are consistent with the consensus opinion supporting Madison authorship, or at least primary authorship, on the twelve disputed essays.
Wolfram Language 13.0 (December 2021) or above
This work is licensed under a Creative Commons Attribution 4.0 International License