In this tech note we show how to find the most frequent prefixes (or infixes) in a large collection of words, for example the English dictionary words known to Wolfram Language (WL).
Load the paclet:
In[13]:=
Needs["AntonAntonov`TriesWithFrequencies`"]
Get all words from a dictionary (~93,000):
In[14]:=
allWords = DictionaryLookup["*"]; allWords // Length
Out[15]=
92518
Trie creation and shrinking:
In[34]:=
AbsoluteTiming[tr = TrieCreateBySplit[allWords]; trShrunk = TrieShrink[tr];]
Out[34]=
{7.18016,Null}
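To see what creation and shrinking do without waiting on the full dictionary, the same steps can be tried on a handful of words. This is a minimal sketch: the word list `smallWords` and the variable names are made up for illustration, and only the paclet functions already used in this note are called. The working assumption, not confirmed by this note, is that shrinking merges chains of single-child nodes into multi-character nodes, which is why the shrunk trie should report fewer nodes.

```wolfram
(* Hypothetical toy input; any short list of strings will do. *)
Needs["AntonAntonov`TriesWithFrequencies`"]
smallWords = {"bar", "bark", "barn", "car", "cart"};

(* Build a trie by splitting each word into characters, then shrink it. *)
trSmall = TrieCreateBySplit[smallWords];
trSmallShrunk = TrieShrink[trSmall];

(* Compare node statistics before and after shrinking. *)
{TrieNodeCounts[trSmall], TrieNodeCounts[trSmallShrunk]}
```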
Here are the node statistics of the original and shrunk tries:
In[35]:=
TrieNodeCounts[tr]
Out[35]=
<|"total" -> 224937, "internal" -> 160090, "leaves" -> 64847|>
In[36]:=
TrieNodeCounts[trShrunk]
Out[36]=
<|"total" -> 115504, "internal" -> 50657, "leaves" -> 64847|>
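The effect of shrinking is easier to read as a relative reduction. The sketch below uses only built-in functions and the counts reported above, typed in by hand; the names `before` and `after` are illustrative. It should show that the total node count drops by roughly 49% and the internal node count by roughly 68%, while the number of leaves is unchanged.

```wolfram
(* Node counts copied from the two outputs above. *)
before = <|"total" -> 224937, "internal" -> 160090, "leaves" -> 64847|>;
after = <|"total" -> 115504, "internal" -> 50657, "leaves" -> 64847|>;

(* Fraction of nodes removed by shrinking, per node category. *)
Merge[{before, after}, N[1 - #[[2]]/#[[1]]] &]
```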
Find the infixes that have more than three characters and appear more than