We can say that, for a given text, we use Tries with frequencies to derive language models, and then generate new, plausible text using those models.
In a previous article, [AA1], I discussed how text generation with Markov chains can be implemented with sparse arrays.
Remark: Tries with frequencies can also be implemented with WL's Tree structure; there is an ongoing effort to implement Tries with frequencies using Tree. (Ideally, at some point that implementation will be more scalable and faster than mine.)
Remark: We can say that this notebook provides examples of making language models that are (much) simpler than ChatGPT's models, as mentioned by Stephen Wolfram in [SWv1]. With this kind of model we can easily generate plausible-looking sentences.
Remark: We consider a complete sentence to end with a period ("."). Hence, the stopping criterion "counts" how many times "." has been reached in the generated n-grams.
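Here is a minimal sketch of such a generation loop, using plain WL built-ins rather than the trie functions of this notebook: a 2-gram model is built as an association from each word to the weighted counts of its successor words, and generation stops after a given number of periods. The helper generateWords, the safety cap, and the start word are illustrative assumptions, not the notebook's code.

    (* 2-gram model: word -> counts of successor words. *)
    tokens = StringSplit[ExampleData[{"Text", "AliceInWonderland"}]];
    model = GroupBy[Partition[tokens, 2, 1], First -> Last, Counts];

    (* Generate words until nSentences periods have been reached
       (with a length cap as a safety guard). *)
    generateWords[model_, start_, nSentences_] :=
      Module[{w = start, out = {start}, periods = 0},
        While[periods < nSentences && Length[out] < 500 && KeyExistsQ[model, w],
          w = RandomChoice[Values[model[w]] -> Keys[model[w]]];
          AppendTo[out, w];
          If[StringEndsQ[w, "."], periods++]];
        out];

    SeedRandom[3];
    genWords = generateWords[model, "Alice", 3];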
Here we make the generated text more "presentable":
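For instance, continuing the sketch above, one can join the generated words into a string, remove any spaces left before punctuation tokens, and capitalize the first letter; the helper presentable is, again, illustrative:

    presentable[ws : {__String}] :=
      Capitalize @
        StringReplace[StringRiffle[ws, " "], " " ~~ (p : PunctuationCharacter) :> p];

    presentable[genWords]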
Remark: Using longer n-grams generally produces longer sentences, since the probability of "reaching" the period character becomes smaller with longer n-grams.
Language model adherence verification
Let us convince ourselves that the function TrieRandomChoice produces words with a distribution that corresponds to the trie it is invoked on.
Here we make a trie for a certain set of words:
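Since the paclet code is not reproduced here, the following is a toy stand-in: a character trie with frequencies represented as an association from each prefix to its count. The word list and the helper makeTrie are hypothetical; makeTrie mimics trie creation but is not the "TriesWithFrequencies" API.

    words = {"bar", "bark", "bars", "balm", "car", "care", "cast"};

    (* Trie with frequencies as an association: prefix (list of characters) -> count. *)
    makeTrie[ws : {__String}] :=
      Counts[Catenate[Table[Take[cs, k], {cs, Characters[ws]}, {k, Length[cs]}]]];

    trie = makeTrie[words];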
Here we generate random words using the trie above and make a new trie with them:
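Continuing the sketch, randomWord below walks the trie by node probabilities, descending to a child with probability proportional to its count and stopping at a leaf or with the residual "end of word" probability. It mimics what TrieRandomChoice does, but it is a hypothetical helper, not the paclet's function.

    randomWord[tr_, total_] :=
      Module[{path = {}, parent, children, stop, wts, choices, pick},
        While[True,
          parent = If[path === {}, total, tr[path]];
          children = KeySelect[tr, Length[#] == Length[path] + 1 && Most[#] === path &];
          If[Length[children] == 0, Break[]];
          stop = parent - Total[children]; (* residual: words ending at this node *)
          choices = Keys[children]; wts = Values[children];
          If[stop > 0, AppendTo[choices, None]; AppendTo[wts, stop]];
          pick = RandomChoice[wts -> choices];
          If[pick === None, Break[], path = pick]];
        StringJoin[path]];

    SeedRandom[33];
    sample = Table[randomWord[trie, Length[words]], 10000];
    trie2 = makeTrie[sample];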
Here is a comparison table between the two tries:
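In the stand-in representation, the probability at a node is its count divided by its parent's count; tabulating that quantity for both tries gives a comparison along the lines discussed below. The helper nodeProb is illustrative.

    nodeProb[tr_, total_][path_] :=
      N[Lookup[tr, Key[path], 0]/
        If[Length[path] == 1, total, Lookup[tr, Key[Most[path]], 1]]];

    TableForm[
      Map[{StringJoin[#], nodeProb[trie, Length[words]][#],
           nodeProb[trie2, Length[sample]][#]} &, Sort[Keys[trie]]],
      TableHeadings -> {None, {"node", "original trie", "generated trie"}}]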
We see that the probabilities at the nodes are very similar. It is expected that, with a large number of generated words, nearly the same probabilities will be obtained.
Word phrases example
Possible extensions
Possible extensions include the following:
◼ Finding the Part of Speech (POS) label for each word and making "generalized" sequences of POS labels.
◼ Those kinds of POS-based language models can be combined with the "regular", word-based ones in a variety of ways.
◼ One such way is to use a POS-based model as a censor of a word-based model.
◼ Another is to use a POS-based model to generate POS sequences, and then "fill in" those sequences with actual words.
◼ N-gram-based predictions can be used to do phrase completions in (specialized) search engines.
◼ That can be especially useful if the phrases belong to a certain Domain Specific Language (DSL). (And there is a large enough collection of search queries with that DSL.)
◼ Instead of words, any sequential data can be used.
◼ See [AAv1] for an application to predicting driving trip destinations.
◼ Certain business simulation models can be done with Trie-based sequential models.