Function Repository Resource:

BulgarianStem

Find stems of Bulgarian words

Contributed by: Anton Antonov

ResourceFunction["BulgarianStem"][word]

gives a stem for the word word.

ResourceFunction["BulgarianStem"][words]

gives a stem for each of the words in the list words.

Details

The algorithm of ResourceFunction["BulgarianStem"] is based on 130,000 stem rules.
The stem rules are essentially suffix replacements.
The longest suffixes are tried first.
The character case of the input word is preserved in the stem.
Words with no suffix recognized by ResourceFunction["BulgarianStem"] are returned unchanged.
ResourceFunction["BulgarianStem"] takes arguments that allow control and monitoring of the stem rules that are applied.
There are three sets of rules, which can be obtained with the argument "AllStemRulesWithCounts". The sets are kept in an association with the integer keys 1, 2 and 3.
Stem rules are loaded with the (sub-value) function ResourceFunction["BulgarianStem"]["SetRules"[id,minCount]].
ResourceFunction["BulgarianStem"]["SetRules"[All,0]] loads all stem rules from all sets.
The (sub-value) function ResourceFunction["BulgarianStem"]["FetchRules"[id,minCount]] can be used to experiment with stem rules.
The function "SetRules" uses "FetchRules" in order to set the value of ResourceFunction["BulgarianStem"]["CurrentRules"].
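
The suffix-replacement idea described above can be illustrated with a minimal, self-contained sketch. The rules in toyRules and the names toyRules, toySuffixes and toyStem are hypothetical and chosen only for illustration; they are not the rules or the implementation used by ResourceFunction["BulgarianStem"], and the sketch does not handle character case.

```wolfram
(* Hypothetical toy rules (suffix -> replacement); NOT the rules shipped with BulgarianStem. *)
toyRules = <|"ността" -> "н", "ството" -> "ств", "ство" -> "ств", "то" -> ""|>;

(* Order the suffixes so that longer ones are tried before shorter ones. *)
toySuffixes = ReverseSortBy[Keys[toyRules], StringLength];

(* Replace the longest matching suffix; return the word unchanged if no suffix matches. *)
toyStem[word_String] := Module[{suffix},
   suffix = SelectFirst[toySuffixes, StringEndsQ[word, #] &];
   If[MissingQ[suffix],
    word,
    StringDrop[word, -StringLength[suffix]] <> toyRules[suffix]]];
toyStem[words_List] := toyStem /@ words;

toyStem[{"качество", "качеството", "линейността", "factor"}]
(* {"качеств", "качеств", "линейн", "factor"} *)
```

With these toy rules, "качеството" matches the longer suffix "ството" before the shorter "то" is considered, so both forms of the word reduce to the same stem; "factor" matches no suffix and is returned unchanged.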

Examples

Basic Examples (2) 

Here is a stem of the word "качество":

In[1]:=
ResourceFunction["BulgarianStem"]["качество"]
Out[1]=

Here are the stems for a list of words:

In[2]:=
ResourceFunction["BulgarianStem"]["SetRules"[2, 2]];
ResourceFunction[
 "BulgarianStem"][{"качество", "рандевуто", "линейността", "кафяво"}]
Out[3]=

Scope (3) 

The stem rules currently used by BulgarianStem can be retrieved with the argument "CurrentRules"; here is a sample of the current rules:

In[4]:=
SeedRandom[32];
RandomSample[ResourceFunction["BulgarianStem"]["CurrentRules"], 6]
Out[4]=

Words with no suffix recognized by BulgarianStem are returned unchanged:

In[5]:=
ResourceFunction["BulgarianStem"]["factor"]
Out[5]=

The symbol BulgarianStem is overloaded: it takes arguments that allow the control and monitoring of the stem rules that are applied. There are three sets of rules.

The following command sets up the use of the third set with each rule having a frequency (count) of at least 2:

In[6]:=
ResourceFunction["BulgarianStem"]["SetRules"[3, 2]];

Here is the number of rules (which were just set):

In[7]:=
Length[ResourceFunction["BulgarianStem"]["CurrentRules"]]
Out[7]=

Here is a sample of the rules:

In[8]:=
SeedRandom[32];
RandomSample[ResourceFunction["BulgarianStem"]["CurrentRules"], 6]
Out[9]=

Here are stems of the list of words above using the newly set rules:

In[10]:=
ResourceFunction[
 "BulgarianStem"][{"качество", "рандевуто", "линейността", "кафяво"}]
Out[10]=

Here we restore the default stem rules:

In[11]:=
ResourceFunction["BulgarianStem"]["SetRules"[Automatic]];
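
The sub-value "FetchRules" mentioned in the Details can be used to see how the minimum count affects the size of a rule set. The following is a minimal sketch under two assumptions that the text above does not confirm: that "FetchRules" returns a collection of rules in the same form as "CurrentRules", and that calling it leaves the current rules unchanged. The thresholds in minCounts are arbitrary, and the names minCounts and ruleCounts are introduced here for illustration.

```wolfram
(* Assumption: "FetchRules"[id, minCount] returns a collection of rules (like "CurrentRules")
   and does not modify the rules currently in use. *)
minCounts = {0, 2, 5};

(* Number of rules from the third set for each minimum count threshold. *)
ruleCounts = AssociationMap[
   Length[ResourceFunction["BulgarianStem"]["FetchRules"[3, #]]] &, 
   minCounts]
```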

Applications (2) 

Finding word stems is one of the fundamental procedures in information retrieval.

Take Bulgarian text from Wikipedia:

In[12]:=
textAZlatarov = WikipediaData["Asen Zlatarov", 
   Language -> {Entity["Language", "English::385w8"] -> Entity["Language", "Bulgarian::xmr5j"]}];

Here we get the words from the text:

In[13]:=
words = Select[
   TextWords@ToLowerCase@textAZlatarov, 
   ! MemberQ[ResourceFunction["LinguaStopwords"]["Bulgarian"], #] &];

Here we find the number of occurrences of each word and show the words with the largest counts:

In[14]:=
TakeLargest[GroupBy[words, Identity, Length], 20]
Out[14]=

Here we stem the words, find the number of occurrences of each word stem and show the stems with the largest counts:

In[15]:=
TakeLargest[
 GroupBy[ResourceFunction["BulgarianStem"]@words, Identity, Length], 20]
Out[15]=
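
To see which word forms are merged by stemming, the distinct words can be grouped by their stems. This is a minimal sketch; the list sampleWords is a small hand-picked set of inflected forms used only for illustration (the list words computed above can be used instead), and the names sampleWords and stemGroups are introduced here.

```wolfram
(* Sample inflected forms chosen for illustration; the list `words` computed above can be used instead. *)
sampleWords = {"качество", "качеството", "качества", "линейност", "линейността"};

(* Group the distinct words by their stems: keys are stems, values are lists of word forms. *)
stemGroups = GroupBy[DeleteDuplicates[sampleWords], ResourceFunction["BulgarianStem"]];

(* Keep only the stems shared by more than one word form. *)
Select[stemGroups, Length[#] > 1 &]
```

Which forms end up sharing a stem depends on the stem rules that are currently set.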

Consider the following random job titles in Bulgarian:

In[16]:=
SeedRandom[32];
lsTitles = ResourceFunction["RandomPretentiousJobTitle"]["Bulgarian", 4]
Out[16]=

Here is a list of tables that show the words of the job titles and their corresponding stems:

In[17]:=
Map[Labeled[
   Dataset[Transpose[{TextWords@#1, ResourceFunction["BulgarianStem"]@TextWords@#1}]][All, AssociationThread[{"Word", "Stemmed Form"}, #] &], #1, Top] &, lsTitles]
Out[17]=

Properties and Relations (4) 

Here is the number of rules in each of the three sets of rules:

In[18]:=
Length /@ ResourceFunction["BulgarianStem"]["AllStemRulesWithCounts"]
Out[18]=

Here are stem rule samples:

In[19]:=
ResourceFunction["GridTableForm"][
 RandomSample[#, 6] & /@ ResourceFunction["BulgarianStem"]["AllStemRulesWithCounts"]]
Out[19]=

The current set of stem rules can be obtained with the argument "CurrentRules":

In[20]:=
Length@ResourceFunction["BulgarianStem"]["CurrentRules"]
Out[20]=

Each stem rule has an associated count (or frequency); here is the minimum count of the current rules:

In[21]:=
ResourceFunction["BulgarianStem"]["CurrentMinCount"]
Out[21]=

Here is the number of current rules:

In[22]:=
Length@ResourceFunction["BulgarianStem"]["CurrentRules"]
Out[22]=

Here is the string length of the replacement values of the current stem rules:

In[23]:=
Tally[StringLength /@ Values@ResourceFunction["BulgarianStem"]["CurrentRules"]]
Out[23]=

Here is a histogram of the lengths of the suffixes that are replaced with the current stem rules:

In[24]:=
Histogram[
 StringLength /@ Keys@ResourceFunction["BulgarianStem"]["CurrentRules"], PlotRange -> All]
Out[24]=

The function WordStem gives stems of English words:

In[25]:=
WordStem["healthy"]
Out[25]=

Here is a corresponding call of BulgarianStem:

In[26]:=
ResourceFunction["BulgarianStem"]["здравеняк"]
Out[26]=

Here is a list of tables that show the words of job titles in English and their corresponding stems (analogous to the list of tables above):

In[27]:=
SeedRandom[32];
Map[Labeled[
   Dataset[Transpose[{TextWords@#1, WordStem@TextWords@#1}]][All, AssociationThread[{"Word", "Stemmed Form"}, #] &], #1, Top] &, ResourceFunction["RandomPretentiousJobTitle"]["English", 4]]
Out[27]=

The words of a text can be obtained with StringSplit or TextWords and then given to BulgarianStem:

In[28]:=
ResourceFunction["BulgarianStem"]@
 StringSplit[
  "Покълването на посевите се очаква с търпение, пиене и сланина.", RegularExpression["\\W+"]]
Out[28]=

Neat Examples (3) 

Here is some English text:

In[29]:=
text = "This party should get started.";

Here are the stems of the words in the English text:

In[30]:=
WordStem@TextWords@text
Out[30]=

Here we translate the English text into Bulgarian text, extract the words and stem them:

In[31]:=
ResourceFunction["BulgarianStem"]@
 TextWords@Echo@TextTranslation[text, "English" -> "Bulgarian"]
Out[31]=

Publisher

Anton Antonov

Version History

  • 1.0.0 – 22 April 2022

Author Notes

The original article and some of the packages refer to the values of replacement rules as "left contexts."
