
TriesWithFrequencies

Guides

  • Tries with frequencies

Tech Notes

  • English words infixes study
  • Using Tries for Markov chain text generation

Symbols

  • ToTrieFromJSON
  • ToTrieWithRoot
  • TrieBlank
  • TrieBodyQ
  • TrieClassify
  • TrieComparisonGrid
  • TrieContains
  • TrieCreateBySplit
  • TrieCreate
  • TrieDepth
  • TrieForm
  • TrieGetWords
  • TrieHasCompleteMatchQ
  • TrieInsert
  • TrieKeyExistsQ
  • TrieKeyQ
  • TrieKeyTraverse
  • TrieKeyValueTraverse
  • TrieLeafProbabilities
  • TrieLeafProbabilitiesWithPositions
  • TrieMap
  • TrieMemberQ
  • TrieMerge
  • TrieNodeCounts
  • TrieNodeFrequencies
  • TrieNodeProbabilities
  • TrieParetoFractionRemove
  • TriePathFromPosition
  • TriePosition
  • TriePrune
  • TrieQ
  • TrieRandomChoice
  • TrieRemove
  • TrieRetrieve
  • TrieRootToLeafPathProbabilityRules
  • TrieRootToLeafPathRules
  • TrieRootToLeafPaths
  • TrieRuleQ
  • TrieShrink
  • TrieSubTrie
  • TrieThresholdRemove
  • TrieToJSON
  • TrieToListTrie
  • TrieToRules
  • TrieValueRuleQ
  • TrieValueTotal
  • TrieWithTrieRootQ
  • $TrieRoot
  • $TrieValue
Using Tries for Markov chain text generation

Introduction
Ingest text
Markov chain construction and utilization
Using larger n-grams
Language model adherence verification
Possible extensions
References
Introduction
In this notebook we discuss the derivation and utilization of Markov chains for random text generation. The Markov chains are computed, represented, and utilized with the data structure Tries with frequencies, [AA2, AAp1, AAv1]. (Tries are also known as "prefix trees.")
We can say that for a given text we use Tries with frequencies to derive language models, and we generate new, plausible text using those language models.
In a previous article, [AA1], I discussed how text generation with Markov chains can be implemented with sparse arrays.
Remark: Tries with frequencies can also be implemented with WL's Tree structure. The implementation in [AAp1] relies only on Association, though. Further, during the Wolfram Technology Conference 2022 I think I successfully convinced Ian Ford to implement Tries with frequencies using Tree. (Ideally, at some point his implementation would be more scalable and faster than my implementation.)
Remark: We can say that this notebook provides examples of making language models that are (much) simpler than ChatGPT's models, as mentioned by Stephen Wolfram in [SWv1]. With these kinds of models we can easily generate sentences like "The house ate the lettuce." or "The chair was happy.", [SWv1].
Load the paclet
In[97]:=
Needs["AntonAntonov`TriesWithFrequencies`"]
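To make the "tries with frequencies" representation concrete, here is a small illustrative sketch (added; not part of the original notebook, and the word list is made up): a trie is built by splitting a few words into characters, and TrieNodeProbabilities converts the node frequencies into probabilities.
(* Added sketch: a small trie with frequencies over made-up words. *)
lsSampleWords = {"bar", "bark", "bars", "balm", "car", "card"};
trSample = TrieCreateBySplit[lsSampleWords];  (* split each word into characters *)
TrieForm[trSample]                            (* visualize node frequencies *)
TrieForm[TrieNodeProbabilities[trSample]]     (* same trie with probabilities at the nodes *)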
Ingest text
Get text:
In[98]:=
text = ExampleData[{"Text", "OriginOfSpecies"}];
StringLength[text]
Out[99]=
893121
Number of words in the text:
In[100]:=
Length[TextWords[text]]
Out[100]=
150151
Turn the text into separate words and punctuation characters:
In[101]:=
AbsoluteTiming[
 lsWords = TextCases[text, "Word" | "Punctuation"];
]
Out[101]=
{2.6172,Null}
Here is the number of words and punctuation characters extracted:
In[102]:=
Length[lsWords]
Out[102]=
168083
Here is a sample:
In[103]:=
Take[lsWords,{1200,1220}]
Out[103]=
{possible,,,and,,,what,is,equally,or,more,important,,,we,shall,see,how,great,is,the,power,of,man}
Here is the plot that shows Pareto principle adherence to the word counts:
In[104]:=
ResourceFunction["ParetoPrinciplePlot"][Tally[Flatten[StringCases[ToLowerCase[lsWords], LetterCharacter ..]]][[All, 2]], ImageSize -> 500]
Out[104]=
(Pareto principle plot of the word counts.)
Remark: We see a typical Pareto principle manifestation: 10% of the unique words correspond to 80% of all words in the text.
Markov chain construction and utilization
Here we create a trie using 2-grams:
In[105]:=
AbsoluteTiming[
 trWords2 = TrieCreate[Partition[ToLowerCase[lsWords], 2, 1]];
]
Out[105]=
{6.25625,Null}
Example of random n-gram generation:
In[106]:=
SeedRandom[989];
TrieRandomChoice[trWords2]
Out[107]=
{$TrieRoot,was,deposited}
Markov chain generation:
In[108]:=
NestJoin#,Rest@
TrieRandomChoice

TrieSubTrie
[trWords2,{Last[#]}]&,{"every","author"},30
Out[108]=
{every,author,makes,more,perfectly,kept,alive,;,and,fertility,affords,any,part,of,any,two,varieties,,,the,dorking,fowl,),from,a,formation,,,more,ancient,progenitor,,,by,several}
Here we modify the code line above to generate and splice random n-grams until 10 sentences are obtained:
In[109]:=
maxSentences = 10;
aRes = NestWhile[
   (ngram = Rest@TrieRandomChoice[TrieSubTrie[trWords2, {Last[#Words]}]];
    <|"Words" -> Join[#Words, ngram],
      "Sentences" -> If[ngram[[-1]] == ".", #Sentences + 1, #Sentences]|>) &,
   <|"Words" -> {"every", "author"}, "Sentences" -> 0|>,
   #Sentences < maxSentences &];
Remark: We consider a complete sentence to finish with a period ("."). Hence, the stopping criterion "counts" how many times "." has been reached with the generated n-grams.
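As a quick sanity check (an added sketch, not in the original notebook): in the 2-gram case each step appends exactly one word, so the number of periods in the generated word sequence should equal maxSentences.
(* Added check: count the sentence-ending periods in the generated words. *)
Count[aRes["Words"], "."]   (* expected: 10, i.e. maxSentences *)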
Here we make the generated text more "presentable":
In[111]:=
res1 = StringReplace[StringRiffle[aRes["Words"], " "], " " ~~ x : PunctuationCharacter :> x];
res2 = StringReplace[Capitalize[res1], ". " ~~ x : LetterCharacter :> ". " <> Capitalize[x]]
Out[112]=
Every author makes me to warn the beetle which has interested me by the adjoining cells and the little altered condition of centuries, that monarch. Aphides; other and new sub-genera, are preserved or three genera with its whole i presume that before applying the species may be said to so have been effected is published can be surmounted by the same species should never be quite unknown differences which are eminently liable to ducks and insignificant cause for the several stages of the beetles is certain species. But lyell's metaphor, sub-orders, but groups. Nor can be blown to those of which can clearly represented in checking immigration of the individuals during the lost and depends on trees together, when crossed; and animals, relate to cause, and that classification of our domestic breeds to long course preposterous; and various breeds, which include only slightly modified and coadaptation. It may be proved by the wonderful sort of the discovery. Fossil species will have been broadly confessed by whirlwinds in the commencement of classification than of structural differences in its general in which the work alternately on the world they differ little free intercrossing. But would have been rendered sterile; the habit of squirrels; and analogy whether species, why the animals; but when run selection, between 500 feet. Now happens that they have evidence given in relation to vary, as highly favourable that the revolution in the future work, differ considerably modified, etc., for them. This has, and we can be separated sexes are of species; but will not to touch; but we see in cases, and isolated mountain-ranges and widely implies in comparison, can understand, and northward or pebbles; in order to their becoming acclimatised: for life, plants, is, for the objection which have large territory we see no one form will be thus left modified by hosts of nutriment from flower; though generally be said to a remark, by some time. Within a suspicion existing in regard to escape serving for seeds in some are the other modifications of life, by diminished, we can be strengthened; it is always been already been more abstract.
Using larger n-grams
Let us repeat the text generation above with larger n-grams (e.g. 4-grams).
Here we generate the n-gram trie:
In[113]:=
n = 4;
AbsoluteTiming[
 trWords = TrieCreate[Partition[ToLowerCase[lsWords], n, 1]];
]
Out[114]=
{7.49459,Null}
Example of random n-gram generation:
In[115]:=
TrieRandomChoice[trWords]
Out[115]=
{$TrieRoot,have,reached,or,even}
Generate text:
In[116]:=
startPhrase = StringSplit["given time"];
maxSentences = 10;
aRes = NestWhile[
   (ngram = Rest@TrieRandomChoice[TrieSubTrie[trWords, {Last[#Words]}]];
    <|"Words" -> Join[#Words, ngram],
      "Sentences" -> If[ngram[[-1]] == ".", #Sentences + 1, #Sentences]|>) &,
   <|"Words" -> startPhrase, "Sentences" -> 0|>,
   #Sentences < maxSentences &];
res1 = StringReplace[StringRiffle[aRes["Words"], " "], " " ~~ x : PunctuationCharacter :> x];
res2 = StringReplace[Capitalize[res1], ". " ~~ x : LetterCharacter :> ". " <> Capitalize[x]]
Remark: Using longer n-grams generally produces longer sentences, since with longer n-grams the probability of "reaching" the period punctuation character becomes smaller.
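One way to inspect this (an added sketch, not part of the original notebook) is to look at the word counts of the individual sentences generated in the current run; comparing with the 2-gram run requires re-evaluating that run, since aRes is overwritten above.
(* Added sketch: lengths (in words) of the sentences generated in the current run. *)
Length /@ SequenceSplit[aRes["Words"], {"."}]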
Language model adherence verification
Let us convince ourselves that the function TrieRandomChoice produces words with a distribution that corresponds to the trie it is invoked upon.
Here we make a trie for a certain set of words:
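The corresponding input cell is missing from this rendering; the following is a minimal sketch of what it may look like (the word sample is made up):
(* Added sketch: a trie with node probabilities over a made-up random word sample. *)
SeedRandom[32];
lsTestWords = RandomChoice[{"bar", "bark", "bars", "balm", "car", "card", "care"}, 1000];
trTestWords = TrieNodeProbabilities[TrieCreateBySplit[lsTestWords]];
TrieForm[trTestWords]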
Here we generate random words using the trie above and make a new trie with them:
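Again, the input cell is missing; here is a hedged sketch that samples root-to-leaf paths from the trie above and rebuilds a trie from the generated words:
(* Added sketch: generate random words from the trie and derive a new trie from them. *)
SeedRandom[899];
lsGenWords = Table[StringJoin[Rest[TrieRandomChoice[trTestWords]]], 10000];
trGenWords = TrieNodeProbabilities[TrieCreateBySplit[lsGenWords]];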
Here is a comparison table between the two tries:
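The comparison cell is also missing; assuming TrieComparisonGrid takes a list of tries, a sketch is:
(* Added sketch: side-by-side comparison of the original and re-derived tries. *)
TrieComparisonGrid[{trTestWords, trGenWords}]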
We see that the probabilities at the nodes are very similar. It is expected that with a large number of generated words nearly the same probabilities would be obtained.

Word phrases example

Possible extensions
Possible extensions include the following:
  • Finding Part Of Speech (POS) labels for each word and making "generalized" sequences of POS labels.
    • Those kinds of POS-based language models can be combined with the "regular", word-based ones in a variety of ways.
    • One such way is to use a POS-based model as a censurer of a word-based model.
    • Another is to use a POS-based model to generate POS sequences, and then "fill in" those sequences with actual words.
  • N-gram-based predictions can be used to do phrase completions in (specialized) search engines; see the sketch after this list.
    • That can be especially useful if the phrases belong to a certain Domain Specific Language (DSL), and there is a large enough collection of search queries with that DSL.
  • Instead of words, any sequential data can be used.
    • See [AAv1] for an application to predicting driving trip destinations.
    • Certain business simulation models can be done with Trie-based sequential models.
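As a hedged sketch of the phrase-completion idea mentioned in the list above (added; it assumes TrieGetWords[tr, prefix] lists the key sequences in tr that start with the given prefix, which should be verified against the paclet documentation):
(* Added sketch (hypothetical usage): 2-gram continuations of the word "natural". *)
TrieGetWords[trWords2, {"natural"}]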
References

    Articles

    Packages

    Videos
