Function Repository Resource:

BulgarianStem

Find stems of Bulgarian words

Contributed by: Anton Antonov

ResourceFunction["BulgarianStem"][word]

gives a stem for the word word.

ResourceFunction["BulgarianStem"][words]

gives a stem for each of the words in the list words.

Details

The algorithm of ResourceFunction["BulgarianStem"] is based on ≈130,000 stem rules.

The stems are obtained essentially by suffix replacement.

The longest suffixes are tried first.

The letter case of the argument word is preserved in the stem.

Words in which ResourceFunction["BulgarianStem"] recognizes no suffix are returned unchanged.
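
The suffix-replacement behavior described in the points above can be illustrated with a toy sketch. The three rules and the names toyRules, toySuffixes and toyStem are hypothetical; they are not taken from the actual rule sets, and the real implementation may differ. The sketch only shows the idea of trying longer suffixes first and returning a word unchanged when no suffix matches; it does not handle letter case.

```wolfram
(* Hypothetical toy rules: suffix -> replacement. NOT the actual BulgarianStem rules. *)
toyRules = <|"ството" -> "ств", "ство" -> "ств", "ите" -> ""|>;

(* Order the suffixes so that longer ones are tried first. *)
toySuffixes = ReverseSortBy[Keys[toyRules], StringLength];

(* Replace the longest matching suffix; return the word unchanged if none matches. *)
toyStem[word_String] :=
  Module[{suffix = SelectFirst[toySuffixes, StringEndsQ[word, #] &]},
   If[MissingQ[suffix],
    word,
    StringDrop[word, -StringLength[suffix]] <> toyRules[suffix]]];

toyStem /@ {"качеството", "качество", "дом"}
```

With these toy rules, the first two words are both reduced to "качеств", while "дом" matches no suffix and is returned as is.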

ResourceFunction["BulgarianStem"] takes arguments that allow control and monitoring of the stem rules that are applied.

There are three sets of rules, which can be obtained with the argument "AllStemRulesWithCounts". The sets are kept in an association whose keys are the integers 1, 2 and 3.

Stem rules are loaded with the (sub-value) function ResourceFunction["BulgarianStem"]["SetRules"[id,minCount]].

ResourceFunction["BulgarianStem"]["SetRules"[All,0]] loads all stem rules from all sets.

The (sub-value) function ResourceFunction["BulgarianStem"]["FetchRules"[id,minCount]] can be used to experiment with stem rules.

The function "SetRules" uses "FetchRules" in order to set the value of ResourceFunction["BulgarianStem"]["CurrentRules"].
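
The rule-management calls described above might be combined as in the following sketch. The call forms are the ones given in this section; the choice of set 3 and minimum count 2 is arbitrary, the variable names allRules and candidateRules are introduced only for illustration, and the exact form of the returned values is not shown because it is an assumption that they can be inspected this way.

```wolfram
(* Retrieve the three rule sets with their counts (an association with keys 1, 2, 3). *)
allRules = ResourceFunction["BulgarianStem"]["AllStemRulesWithCounts"];
Keys[allRules]

(* Fetch the rules of set 3 that have a count of at least 2, for experimentation. *)
candidateRules = ResourceFunction["BulgarianStem"]["FetchRules"[3, 2]];

(* Make the same selection the current rule set. *)
ResourceFunction["BulgarianStem"]["SetRules"[3, 2]];

(* Subsequent calls use the newly set rules. *)
ResourceFunction["BulgarianStem"]["качество"]

(* Inspect the rules that are currently in use. *)
ResourceFunction["BulgarianStem"]["CurrentRules"]
```

Note that "SetRules" changes the rules used by all subsequent calls, so it is worth restoring the preferred rule selection after experimenting.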

Examples

Basic Examples (2)

Here is a stem for the word "качество":

In[1]:=

Out[1]=

Here are the stems for a list of words:

In[2]:=

Out[3]=

Scope (3)

The stem rules currently used by BulgarianStem can be retrieved with the argument "CurrentRules"; here is a sample of the current rules:

In[4]:=

Out[2]=

Words in which BulgarianStem recognizes no suffix are returned unchanged:

In[5]:=

Out[5]=

The symbol BulgarianStem is overloaded: it takes arguments that allow control and monitoring of the stem rules that are applied. There are three sets of rules.

The following command loads the third rule set, keeping only the rules with a frequency (count) of at least 2:

In[6]:=

Here is the number of rules that were just set:

In[7]:=

Out[7]=

Here is a sample of the rules:

In[8]:=

Out[9]=

Here are stems of the list of words above using the newly set rules:

In[10]:=

Out[10]=

Here we restore the default stem rules:

In[11]:=

Applications (2)

Finding word stems is one of the fundamental procedures in information retrieval.

Take Bulgarian text from Wikipedia:

In[12]:=

textAZlatarov = WikipediaData["Asen Zlatarov", Language -> {Entity["Language", "English::385w8"] -> Entity["Language", "Bulgarian::xmr5j"]}];