Wolfram Research

BPEmb Subword Embeddings Trained on Wikipedia Data

Represent words or subwords as vectors

Released in 2017, this collection of models combines byte-pair encoding (BPE) tokenization and the Global Vectors (GloVe) method to create subword embeddings for 275 languages, available in a range of embedding dimensions and vocabulary sizes, all trained on Wikipedia. The models can be used out of the box as a basis for training NLP applications, or simply for generic BPE text segmentation. Since all digits were mapped to 0 before training, these models map every digit except 0 to the unknown token.

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data",
 "ParametersInformation"]
Out[2]=

Pick a non-default model by specifying the parameters:

In[3]:=
NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data", 
  "Language" -> "Chinese"}]
Out[3]=

Pick a non-default untrained net:

In[4]:=
NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data", 
  "Language" -> "Chinese"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

For each token, the net produces a vector of features:

In[5]:=
embeddings = 
 NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"][
  "hello world"]
Out[5]=

Obtain the dimensions of the vectors:

In[6]:=
Dimensions[embeddings]
Out[6]=

Visualize the embeddings:

In[7]:=
MatrixPlot[embeddings]
Out[7]=

Use the embedding layer inside a NetChain:

In[8]:=
chain = NetChain[{NetModel[
    "BPEmb Subword Embeddings Trained on Wikipedia Data"], 
   LongShortTermMemoryLayer[10]}]
Out[8]=
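
The chain must be initialized before it can be evaluated; once initialized, it returns one length-10 state vector per BPE token (a quick sketch; the LSTM weights are randomly initialized, so the values are only meaningful after training):

Dimensions[NetInitialize[chain]["hello world"]]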

BPE tokenization

The BPE tokenization can be extracted from the model as a NetEncoder:

In[9]:=
bpe = NetExtract[
  NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"], 
  "Input"]
Out[9]=

The encoder segments the input string into words and subwords using BPE tokenization and outputs integer codes for each token:

In[10]:=
codes = bpe["Electroencephalogram is a compound word"]
Out[10]=

Obtain the tokens. Rare words are usually split up into subwords:

In[11]:=
NetExtract[bpe, "Tokens"][[codes]]
Out[11]=
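
Since digits other than 0 were mapped to 0 before training, the encoder sends them to the unknown token. A quick check (a sketch; the exact segmentation depends on the vocabulary):

NetExtract[bpe, "Tokens"][[bpe["chapter 7, page 100"]]]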

Write a function to tokenize a string in any language:

In[12]:=
tokenizeBPE[language_, sentence_] := With[
   {tokenizer = 
     NetExtract[
      NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data", 
        "Language" -> language, "VocabularySize" -> 10000}], "Input"]},
   tokenizer[["Tokens"]][[tokenizer[sentence]]]
   ];

Compare tokenizations in various languages:

In[13]:=
sentences = <|
   "English" -> "Electroencephalogram is a compound word",
   "Spanish" -> "Electroencefalograma es una palabra compuesta",
   "German" -> 
    "Elektroenzephalogramm ist ein zusammengesetztes Wort",
   "French" -> "Électroencéphalogramme est un mot composé",
   "Italian" -> "Elettroencefalogramma è una parola composta",
   "Russian" -> "Электроэнцефалограмма - сложное слово",
   "Bengali" -> "ইলেক্ট্রোয়েন্ফালোগ্রাম একটি যৌগিক শব্দ"
   |>;
Grid[KeyValueMap[{#1, TextElement@tokenizeBPE[#1, #2]} &, sentences], 
 Dividers -> All]
Out[14]=

Feature visualization

Create two lists of related words:

In[15]:=
animals = {"Alligator", "Ant", "Bear", "Bee", "Bird", "Camel", "Cat", 
   "Cheetah", "Chicken", "Chimpanzee", "Cow", "Crocodile", "Deer", 
   "Dog", "Dolphin", "Duck", "Eagle", "Elephant", "Fish", "Fly"};
In[16]:=
fruits = {"Apple", "Apricot", "Avocado", "Banana", "Blackberry", 
   "Blueberry", "Cherry", "Coconut", "Cranberry", "Grape", "Turnip", 
   "Mango", "Melon", "Papaya", "Peach", "Pineapple", "Raspberry", 
   "Strawberry", "Ribes", "Fig"};

Visualize relationships between the words using the net as a feature extractor:

In[17]:=
FeatureSpacePlot[Join[animals, fruits], 
 FeatureExtractor -> 
  NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"]]
Out[17]=
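
The same feature extractor can also drive nearest-neighbor queries via FeatureNearest (a sketch; the query word "Lion" is an arbitrary example not in the lists above):

nearest = FeatureNearest[Join[animals, fruits], 
   FeatureExtractor -> 
    NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"]];
nearest["Lion", 3]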

Net information

Inspect the number of parameters of all arrays in the net:

In[18]:=
NetInformation[
 NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"],
 "ArraysElementCounts"]
Out[18]=

Obtain the layer type counts:

In[19]:=
NetInformation[
 NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"],
 "LayerTypeCounts"]
Out[19]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[20]:=
jsonPath = 
 Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], 
  NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"], 
  "MXNet"]
Out[20]=

Export also creates a net.params file containing parameters:

In[21]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[21]=

Get the size of the parameter file:

In[22]:=
FileByteCount[paramPath]
Out[22]=
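
Since MXNet stores the parameters as single-precision floats, the file size should be roughly four bytes per array element (a sanity check; a small amount of header overhead is expected):

4*Total[NetInformation[
   NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"], 
   "ArraysElementCounts"]]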

Requirements

Wolfram Language 12.0 (April 2019) or above
