Wolfram Research

BPEmb Subword Embeddings Trained on Wikipedia Data

Represent words or subwords as vectors

Released in 2017, this collection of models combines byte-pair encoding (BPE) tokenization and the Global Vectors (GloVe) method to create subword embeddings for 275 languages, available in a range of embedding dimensions and vocabulary sizes, all trained on Wikipedia. The models can be used out of the box as a basis for training NLP applications, or simply for generic BPE text segmentation. Since all digits were mapped to 0 before training, these models map every digit except 0 to the unknown token.

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data",
 "ParametersInformation"]
Out[2]=

Pick a non-default model by specifying the parameters:

In[3]:=
NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data", 
  "Language" -> "Chinese"}]
Out[3]=

Pick a non-default untrained net:

In[4]:=
NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data", 
  "Language" -> "Chinese"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

For each token, the net produces a vector of features:

In[5]:=
embeddings = 
 NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"][
  "hello world"]
Out[5]=

Obtain the dimensions of the vectors:

In[6]:=
Dimensions[embeddings]
Out[6]=

Visualize the embeddings:

In[7]:=
MatrixPlot[embeddings]
Out[7]=

Use the embedding layer inside a NetChain:

In[8]:=
chain = NetChain[{NetModel[
    "BPEmb Subword Embeddings Trained on Wikipedia Data"], 
   LongShortTermMemoryLayer[10]}]
Out[8]=
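
The chain must be initialized before it can be evaluated; once initialized, it returns one length-10 state vector per BPE token (a quick sketch; the LSTM weights are randomly initialized, so the values are only meaningful after training):

Dimensions[NetInitialize[chain]["hello world"]]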

BPE tokenization

The BPE tokenization can be extracted from the model as a NetEncoder:

In[9]:=
bpe = NetExtract[
  NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"], 
  "Input"]
Out[9]=

The encoder segments the input string into words and subwords using BPE tokenization and outputs integer codes for each token:

In[10]:=
codes = bpe["Electroencephalogram is a compound word"]
Out[10]=

Obtain the tokens. Rare words are usually split up into subwords:

In[11]:=
NetExtract[bpe, "Tokens"][[codes]]
Out[11]=
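
Since digits other than 0 were mapped to 0 before training, the encoder sends them to the unknown token. A quick check (a sketch; the exact segmentation depends on the vocabulary):

NetExtract[bpe, "Tokens"][[bpe["chapter 7, page 100"]]]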

Write a function to tokenize a string in any language:

In[12]:=
tokenizeBPE[language_, sentence_] := With[
   {tokenizer = 
     NetExtract[
      NetModel[{"BPEmb Subword Embeddings Trained on Wikipedia Data", 
        "Language" -> language, "VocabularySize" -> 10000}], "Input"]},
   tokenizer[["Tokens"]][[tokenizer[sentence]]]
   ];

Compare tokenizations in various languages:

In[13]:=
sentences = <|
   "English" -> "Electroencephalogram is a compound word",
   "Spanish" -> "Electroencefalograma es una palabra compuesta",
   "German" -> 
    "Elektroenzephalogramm ist ein zusammengesetztes Wort",
   "French" -> "Électroencéphalogramme est un mot composé",
   "Italian" -> "Elettroencefalogramma è una parola composta",
   "Russian" -> "Электроэнцефалограмма - сложное слово",
   "Bengali" -> "ইলেক্ট্রোয়েন্ফালোগ্রাম একটি যৌগিক শব্দ"
   |>;
Grid[KeyValueMap[{#1, TextElement@tokenizeBPE[#1, #2]} &, sentences], 
 Dividers -> All]
Out[14]=

Feature visualization

Create two lists of related words:

In[15]:=
animals = {"Alligator", "Ant", "Bear", "Bee", "Bird", "Camel", "Cat", 
   "Cheetah", "Chicken", "Chimpanzee", "Cow", "Crocodile", "Deer", 
   "Dog", "Dolphin", "Duck", "Eagle", "Elephant", "Fish", "Fly"};
In[16]:=
fruits = {"Apple", "Apricot", "Avocado", "Banana", "Blackberry", 
   "Blueberry", "Cherry", "Coconut", "Cranberry", "Grape", "Turnip", 
   "Mango", "Melon", "Papaya", "Peach", "Pineapple", "Raspberry", 
   "Strawberry", "Ribes", "Fig"};

Visualize relationships between the words using the net as a feature extractor:

In[17]:=
FeatureSpacePlot[Join[animals, fruits], 
 FeatureExtractor -> 
  NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"]]
Out[17]=
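
The same feature extractor can also drive nearest-neighbor queries via FeatureNearest (a sketch; the query word "Lion" is an arbitrary example not in the lists above):

nearest = FeatureNearest[Join[animals, fruits], 
   FeatureExtractor -> 
    NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"]];
nearest["Lion", 3]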

Net information

Inspect the number of parameters of all arrays in the net:

In[18]:=
NetInformation[
 NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"],
 "ArraysElementCounts"]
Out[18]=

Obtain the layer type counts:

In[19]:=
NetInformation[
 NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"],
 "LayerTypeCounts"]
Out[19]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[20]:=
jsonPath = 
 Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], 
  NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"], 
  "MXNet"]
Out[20]=

Export also creates a net.params file containing parameters:

In[21]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[21]=

Get the size of the parameter file:

In[22]:=
FileByteCount[paramPath]
Out[22]=
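
Since MXNet stores the parameters as single-precision floats, the file size should be roughly four bytes per array element (a sanity check; a small amount of header overhead is expected):

4*Total[NetInformation[
   NetModel["BPEmb Subword Embeddings Trained on Wikipedia Data"], 
   "ArraysElementCounts"]]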

Requirements

Wolfram Language 12.0 (April 2019) or above
