Wolfram Neural Net Repository
Immediate Computable Access to Neural Net Models
Represent text as a sequence of vectors
This model is also available through the built-in function FindTextualAnswer
Released in 2018, Bidirectional Encoder Representations from Transformers (BERT) is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right contexts in all layers. This model can be fine-tuned with an additional output layer to create state-of-the-art models for a wide range of tasks. It uses bidirectional self-attention, often referred to as a "transformer encoder".
Trained size: 436 MB | Number of models: 7
Accuracy of the Base-Uncased and Large-Uncased models for various natural language inference tasks:
Get the pre-trained net:
In[1]:= | ![]() |
Out[1]= | ![]() |
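A minimal sketch of this step, assuming the model's repository name matches the title of this page:

```wolfram
(* the model name is assumed to match the repository entry for this page *)
bert = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"]
```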
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
In[2]:= | ![]() |
Out[2]= | ![]() |
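A sketch of the same query using the "ParametersInformation" property of NetModel:

```wolfram
(* list the parameter combinations available for this family of nets *)
NetModel["BERT Trained on BookCorpus and English Wikipedia Data", "ParametersInformation"]
```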
Pick a non-default net by specifying the parameters:
In[3]:= | ![]() |
Out[3]= | ![]() |
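For example, assuming the family exposes "Type" and "InputType" parameters (check "ParametersInformation" for the exact keys and values):

```wolfram
(* pick the large variant that accepts a list of sentences *)
NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
  "Type" -> "Large", "InputType" -> "ListOfStrings"}]
```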
Pick a non-default uninitialized net:
In[4]:= | ![]() |
Out[4]= | ![]() |
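A sketch using the "UninitializedEvaluationNet" model element, which returns the architecture without its trained weights:

```wolfram
NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data", "Type" -> "Large"},
 "UninitializedEvaluationNet"]
```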
Given a piece of text, the BERT net produces a sequence of feature vectors of size 768, which corresponds to the sequence of input words or subwords:
In[5]:= | ![]() |
Obtain dimensions of the embeddings:
In[6]:= | ![]() |
Out[6]= | ![]() |
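Continuing from the sketch above (the input string is just an example):

```wolfram
embeddings = bert["Hello world! I am here"];
(* one 768-dimensional feature vector per subword token *)
Dimensions[embeddings]
```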
Visualize the embeddings:
In[7]:= | ![]() |
Out[7]= | ![]() |
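One simple way to visualize the sequence of vectors, shown here as a sketch:

```wolfram
(* each row is the 768-dimensional embedding of one subword token *)
MatrixPlot[embeddings, AspectRatio -> 1/4]
```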
Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:
In[8]:= | ![]() |
Out[9]= | ![]() |
For each input subword token, the encoder yields a pair of indices: the token index in the vocabulary and the index of the sentence within the list of input sentences:
In[10]:= | ![]() |
Out[10]= | ![]() |
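A sketch of these two steps, using the variant that accepts a list of sentences (parameter names are assumptions) and the NetEncoder attached to the input port:

```wolfram
net = NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
    "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"];
(* each element of the result is a {token index, segment index} pair *)
netencoder[{"Hello world!", "I am here"}]
```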
The list of tokens always starts with the special token index 102, which corresponds to the classification ([CLS]) token. The special token index 103 is used as a separator ([SEP]) between the different text segments. Each subword token is also assigned a positional index:
In[11]:= | ![]() |
Out[11]= | ![]() |
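Continuing from the previous sketch, the special indices and the positional indices can be checked directly:

```wolfram
codes = netencoder[{"Hello world!", "I am here"}][[All, 1]];
{First[codes], Position[codes, 103]}  (* classifier token first, separator token(s) at segment boundaries *)
Range[Length[codes]]                  (* positional indices are simply 1, 2, ... *)
```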
A lookup is done to map these indices to numeric vectors of size 768:
In[12]:= | ![]() |
Out[13]= | ![]() |
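The exact layer names inside the net are not shown here, but the three lookup tables can be located generically (a sketch; EmbeddingLayer is the layer type that performs index-to-vector lookups):

```wolfram
(* positions of all embedding (lookup) layers in the net *)
Keys@Select[Information[net, "Layers"], Head[#] === EmbeddingLayer &]
```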
For each subword token, the three embeddings (token, segment and positional) are combined by summing them elementwise with a ThreadingLayer:
In[14]:= | ![]() |
Out[14]= | ![]() |
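A toy, self-contained illustration of the same elementwise combination with ThreadingLayer:

```wolfram
(* sums corresponding elements of its three input arrays *)
sum3 = ThreadingLayer[#1 + #2 + #3 &];
sum3[{{1., 2., 3.}, {10., 20., 30.}, {100., 200., 300.}}]
(* -> {111., 222., 333.} *)
```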
The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:
In[15]:= | ![]() |
Out[15]= | ![]() |
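One way to confirm the block structure without knowing the internal layer names (a sketch):

```wolfram
(* positions of all attention layers in the graph, grouped by block *)
Keys@Select[Information[net, "Layers"], Head[#] === AttentionLayer &]
```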
The key part of these blocks is the attention module, comprising 12 parallel self-attention transformations, also called “attention heads”:
In[16]:= | ![]() |
Out[16]= | ![]() |
Each head uses an AttentionLayer at its core:
In[17]:= | ![]() |
Out[17]= | ![]() |
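A standalone toy sketch of dot-product self-attention with AttentionLayer, in which the same sequence is fed to the key, value and query ports (the port names and shape options follow the current AttentionLayer interface; the dimensions are arbitrary):

```wolfram
attn = NetInitialize@AttentionLayer["Dot",
    "Key" -> {"Varying", 8}, "Value" -> {"Varying", 8}, "Query" -> {"Varying", 8}];
seq = RandomReal[1, {5, 8}];
Dimensions@attn[<|"Key" -> seq, "Value" -> seq, "Query" -> seq|>]
(* -> {5, 8}: one output vector per query position *)
```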
BERT uses self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:
Define a sentence embedding that takes the last feature vector from BERT subword embeddings (as an arbitrary choice):
In[18]:= | ![]() |
Out[18]= | ![]() |
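A sketch of such a pooling net, composing the feature extractor obtained earlier with a SequenceLastLayer:

```wolfram
(* the whole text is represented by the feature vector of its final token *)
sentenceEmbedding = NetChain[{bert, SequenceLastLayer[]}]
```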
Define a list of sentences in two broad categories (food and music):
In[19]:= | ![]() |
Precompute the embeddings for a list of sentences:
In[20]:= | ![]() |
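A sketch with illustrative sentences (any short list in two categories will do), continuing with the sentenceEmbedding net defined above:

```wolfram
sentences = {
   "The baker pulled a fresh loaf of sourdough from the oven.",
   "She seasoned the soup with basil and thyme.",
   "We grilled vegetables and fish for dinner.",
   "The violinist tuned her instrument before the concert.",
   "The band rehearsed a new song all afternoon.",
   "He practices piano scales every morning."};
vectors = sentenceEmbedding[sentences];
```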
Visualize the similarity between the sentences using the net as a feature extractor:
In[21]:= | ![]() |
Out[21]= | ![]() |
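For instance, with FeatureSpacePlot and the pooling net as the feature extractor:

```wolfram
(* sentences about food and sentences about music should form two clusters *)
FeatureSpacePlot[sentences, FeatureExtractor -> sentenceEmbedding, LabelingFunction -> Callout]
```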
Get a text-processing dataset:
In[22]:= | ![]() |
View a random sample of the dataset:
In[23]:= | ![]() |
Out[23]= | ![]() |
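A sketch using a built-in movie-review sentiment dataset (any dataset of text -> class rules works the same way):

```wolfram
train = ExampleData[{"MachineLearning", "MovieReview"}, "TrainingData"];
valid = ExampleData[{"MachineLearning", "MovieReview"}, "TestData"];
RandomSample[train, 3]
```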
Precompute the BERT vectors for the training and the validation datasets (a GPU, if available, is highly recommended):
In[24]:= | ![]() |
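A sketch of the feature-extraction step (this is the slow part; switch TargetDevice to "GPU" if one is available):

```wolfram
trainIn = bert[Keys[train], TargetDevice -> "CPU"];
validIn = bert[Keys[valid], TargetDevice -> "CPU"];
trainData = Thread[trainIn -> Values[train]];
validData = Thread[validIn -> Values[valid]];
```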
Define a network to classify the sequences of subword embeddings, using a max-pooling strategy:
In[25]:= | ![]() |
Out[25]= | ![]() |
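A sketch of such a classifier head: max-pool over the variable-length token dimension, then a linear layer and a softmax:

```wolfram
classes = Union[Values[train]];
classifier = NetChain[{
    AggregationLayer[Max, 1],        (* {n, 768} -> {768} *)
    LinearLayer[Length[classes]],
    SoftmaxLayer[]},
   "Input" -> {"Varying", 768},
   "Output" -> NetDecoder[{"Class", classes}]]
```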
Train the network on the precomputed BERT vectors:
In[26]:= | ![]() |
Out[26]= | ![]() |
Check the classification error rate on the validation data:
In[27]:= | ![]() |
Out[27]= | ![]() |
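Training and evaluation then only involve the small head, since the BERT features are fixed (MaxTrainingRounds is an arbitrary choice):

```wolfram
trained = NetTrain[classifier, trainData, ValidationSet -> validData, MaxTrainingRounds -> 10];
NetMeasurements[trained, validData, "ErrorRate"]
```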
Let’s compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets:
In[28]:= | ![]() |
Train the classifier on the precomputed GloVe vectors:
In[29]:= | ![]() |
Out[29]= | ![]() |
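A sketch of this baseline, assuming the GloVe model name below and reusing the same classifier head with a 300-dimensional input:

```wolfram
glove = NetModel["GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data"];
trainInGlove = glove[Keys[train]];
validInGlove = glove[Keys[valid]];
trainedGlove = NetTrain[
   NetChain[{AggregationLayer[Max, 1], LinearLayer[Length[classes]], SoftmaxLayer[]},
    "Input" -> {"Varying", 300},
    "Output" -> NetDecoder[{"Class", classes}]],
   Thread[trainInGlove -> Values[train]],
   ValidationSet -> Thread[validInGlove -> Values[valid]]];
```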
Compare the results obtained with BERT and with GloVe:
In[30]:= | ![]() |
Out[30]= | ![]() |
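For example, by putting the two validation error rates side by side:

```wolfram
Dataset[<|
  "BERT features" -> NetMeasurements[trained, validData, "ErrorRate"],
  "GloVe features" -> NetMeasurements[trainedGlove, Thread[validInGlove -> Values[valid]], "ErrorRate"]|>]
```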
Inspect the number of parameters of all arrays in the net:
In[31]:= | ![]() |
Out[32]= | ![]() |
Obtain the total number of parameters:
In[33]:= | ![]() |
Out[34]= | ![]() |
Obtain the layer type counts:
In[35]:= | ![]() |
Out[36]= | ![]() |
Display the summary graphic:
In[37]:= | ![]() |
Out[38]= | ![]() |
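These four steps use the standard net-introspection properties of Information (Wolfram Language 12.1 or above); a sketch:

```wolfram
Information[bert, "ArraysElementCounts"]      (* element count of every array *)
Information[bert, "ArraysTotalElementCount"]  (* total number of parameters *)
Information[bert, "LayerTypeCounts"]          (* number of layers of each type *)
Information[bert, "SummaryGraphic"]           (* summary graphic of the net *)
```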
Wolfram Language 12.1 (March 2020) or above