Wolfram Neural Net Repository
Immediate Computable Access to Neural Net Models
Represent text as a sequence of vectors
This model is also available through the built-in function FindTextualAnswer
Released in 2018, Bidirectional Encoder Representations from Transformers (BERT) is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right contexts in all layers. This model can be fine-tuned with an additional output layer to create state-of-the-art models for a wide range of tasks. It uses bidirectional self-attention, often referred to as a "transformer encoder".
Trained size: 436 MB | Number of models: 7
Accuracy of the Base-Uncased and Large-Uncased models for various natural language inference tasks:
Get the pre-trained net:
In[1]:= | ![]() |
Out[1]= | ![]() |
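A minimal sketch of this step, assuming the model's repository name matches the title of this page:

```wolfram
(* the model name is assumed to match the repository entry for this page *)
bert = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"]
```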
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
In[2]:= | ![]() |
Out[2]= | ![]() |
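A sketch of the same query using the "ParametersInformation" property of NetModel:

```wolfram
(* list the parameter combinations available for this family of nets *)
NetModel["BERT Trained on BookCorpus and English Wikipedia Data", "ParametersInformation"]
```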
Pick a non-default net by specifying the parameters:
In[3]:= | ![]() |
Out[3]= | ![]() |
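For example, assuming the family exposes "Type" and "InputType" parameters (check "ParametersInformation" for the exact keys and values):

```wolfram
(* pick the large variant that accepts a list of sentences *)
NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
  "Type" -> "Large", "InputType" -> "ListOfStrings"}]
```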
Pick a non-default uninitialized net:
In[4]:= | ![]() |
Out[4]= | ![]() |
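A sketch using the "UninitializedEvaluationNet" model element, which returns the architecture without its trained weights:

```wolfram
NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data", "Type" -> "Large"},
 "UninitializedEvaluationNet"]
```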
Given a piece of text, the BERT net produces a sequence of feature vectors of size 768, which corresponds to the sequence of input words or subwords:
In[5]:= | ![]() |
Obtain dimensions of the embeddings:
In[6]:= | ![]() |
Out[6]= | ![]() |
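Continuing from the sketch above (the input string is just an example):

```wolfram
embeddings = bert["Hello world! I am here"];
(* one 768-dimensional feature vector per subword token *)
Dimensions[embeddings]
```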
Visualize the embeddings:
In[7]:= | ![]() |
Out[7]= | ![]() |
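One simple way to visualize the sequence of vectors, shown here as a sketch:

```wolfram
(* each row is the 768-dimensional embedding of one subword token *)
MatrixPlot[embeddings, AspectRatio -> 1/4]
```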
Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:
In[8]:= | ![]() |
Out[9]= | ![]() |
For each input subword token, the encoder yields a pair of indices: the token index in the vocabulary and the index of the sentence within the list of input sentences:
In[10]:= | ![]() |
Out[10]= | ![]() |
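A sketch of these two steps, using the variant that accepts a list of sentences (parameter names are assumptions) and the NetEncoder attached to the input port:

```wolfram
net = NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
    "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"];
(* each element of the result is a {token index, segment index} pair *)
netencoder[{"Hello world!", "I am here"}]
```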
The list of tokens always starts with the special token index 102, which corresponds to the classification ([CLS]) token. The special token index 103 is used as a separator ([SEP]) between the different text segments. Each subword token is also assigned a positional index:
In[11]:= | ![]() |
Out[11]= | ![]() |
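Continuing from the previous sketch, the special indices and the positional indices can be checked directly:

```wolfram
codes = netencoder[{"Hello world!", "I am here"}][[All, 1]];
{First[codes], Position[codes, 103]}  (* classifier token first, separator token(s) at segment boundaries *)
Range[Length[codes]]                  (* positional indices are simply 1, 2, ... *)
```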
A lookup is done to map these indices to numeric vectors of size 768:
In[12]:= | ![]() |
Out[13]= | ![]() |
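The exact layer names inside the net are not shown here, but the three lookup tables can be located generically (a sketch; EmbeddingLayer is the layer type that performs index-to-vector lookups):

```wolfram
(* positions of all embedding (lookup) layers in the net *)
Keys@Select[Information[net, "Layers"], Head[#] === EmbeddingLayer &]
```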
For each subword token, the three embeddings (token, segment and positional) are combined by summing them elementwise with a ThreadingLayer:
In[14]:= | ![]() |
Out[14]= | ![]() |
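A toy, self-contained illustration of the same elementwise combination with ThreadingLayer:

```wolfram
(* sums corresponding elements of its three input arrays *)
sum3 = ThreadingLayer[#1 + #2 + #3 &];
sum3[{{1., 2., 3.}, {10., 20., 30.}, {100., 200., 300.}}]
(* -> {111., 222., 333.} *)
```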
The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:
In[15]:= | ![]() |
Out[15]= | ![]() |
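One way to confirm the block structure without knowing the internal layer names (a sketch):

```wolfram
(* positions of all attention layers in the graph, grouped by block *)
Keys@Select[Information[net, "Layers"], Head[#] === AttentionLayer &]
```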
The key part of these blocks is the attention module, comprising 12 parallel self-attention transformations, also called “attention heads”:
In[16]:= | ![]() |
Out[16]= | ![]() |
Each head uses an AttentionLayer at its core:
In[17]:= | ![]() |
Out[17]= | ![]() |
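A standalone toy sketch of dot-product self-attention with AttentionLayer, in which the same sequence is fed to the key, value and query ports (the port names and shape options follow the current AttentionLayer interface; the dimensions are arbitrary):

```wolfram
attn = NetInitialize@AttentionLayer["Dot",
    "Key" -> {"Varying", 8}, "Value" -> {"Varying", 8}, "Query" -> {"Varying", 8}];
seq = RandomReal[1, {5, 8}];
Dimensions@attn[<|"Key" -> seq, "Value" -> seq, "Query" -> seq|>]
(* -> {5, 8}: one output vector per query position *)
```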
BERT uses self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:
Define a sentence embedding that takes the last feature vector from BERT subword embeddings (as an arbitrary choice):
In[18]:= | ![]() |
Out[18]= | ![]() |
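A sketch of such a pooling net, composing the feature extractor obtained earlier with a SequenceLastLayer:

```wolfram
(* the whole text is represented by the feature vector of its final token *)
sentenceEmbedding = NetChain[{bert, SequenceLastLayer[]}]
```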
Define a list of sentences in two broad categories (food and music):
In[19]:= | ![]() |
Precompute the embeddings for a list of sentences:
In[20]:= | ![]() |
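A sketch with illustrative sentences (any short list in two categories will do), continuing with the sentenceEmbedding net defined above:

```wolfram
sentences = {
   "The baker pulled a fresh loaf of sourdough from the oven.",
   "She seasoned the soup with basil and thyme.",
   "We grilled vegetables and fish for dinner.",
   "The violinist tuned her instrument before the concert.",
   "The band rehearsed a new song all afternoon.",
   "He practices piano scales every morning."};
vectors = sentenceEmbedding[sentences];
```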
Visualize the similarity between the sentences using the net as a feature extractor:
In[21]:= | ![]() |
Out[21]= | ![]() |
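For instance, with FeatureSpacePlot and the pooling net as the feature extractor:

```wolfram
(* sentences about food and sentences about music should form two clusters *)
FeatureSpacePlot[sentences, FeatureExtractor -> sentenceEmbedding, LabelingFunction -> Callout]
```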
Get a text-processing dataset:
In[22]:= | ![]() |
View a random sample of the dataset:
In[23]:= | ![]() |
Out[23]= | ![]() |
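A sketch using a built-in movie-review sentiment dataset (any dataset of text -> class rules works the same way):

```wolfram
train = ExampleData[{"MachineLearning", "MovieReview"}, "TrainingData"];
valid = ExampleData[{"MachineLearning", "MovieReview"}, "TestData"];
RandomSample[train, 3]
```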
Precompute the BERT vectors for the training and the validation datasets (a GPU, if available, is highly recommended):
In[24]:= | ![]() |
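A sketch of the feature-extraction step (this is the slow part; switch TargetDevice to "GPU" if one is available):

```wolfram
trainIn = bert[Keys[train], TargetDevice -> "CPU"];
validIn = bert[Keys[valid], TargetDevice -> "CPU"];
trainData = Thread[trainIn -> Values[train]];
validData = Thread[validIn -> Values[valid]];
```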
Define a network to classify the sequences of subword embeddings, using a max-pooling strategy:
In[25]:= | ![]() |
Out[25]= | ![]() |
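A sketch of such a classifier head: max-pool over the variable-length token dimension, then a linear layer and a softmax:

```wolfram
classes = Union[Values[train]];
classifier = NetChain[{
    AggregationLayer[Max, 1],        (* {n, 768} -> {768} *)
    LinearLayer[Length[classes]],
    SoftmaxLayer[]},
   "Input" -> {"Varying", 768},
   "Output" -> NetDecoder[{"Class", classes}]]
```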
Train the network on the precomputed BERT vectors:
In[26]:= | ![]() |
Out[26]= | ![]() |
Check the classification error rate on the validation data:
In[27]:= | ![]() |
Out[27]= | ![]() |
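Training and evaluation then only involve the small head, since the BERT features are fixed (MaxTrainingRounds is an arbitrary choice):

```wolfram
trained = NetTrain[classifier, trainData, ValidationSet -> validData, MaxTrainingRounds -> 10];
NetMeasurements[trained, validData, "ErrorRate"]
```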
Let’s compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets:
In[28]:= | ![]() |
Train the classifier on the precomputed GloVe vectors:
In[29]:= | ![]() |
Out[29]= | ![]() |
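A sketch of this baseline, assuming the GloVe model name below and reusing the same classifier head with a 300-dimensional input:

```wolfram
glove = NetModel["GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data"];
trainInGlove = glove[Keys[train]];
validInGlove = glove[Keys[valid]];
trainedGlove = NetTrain[
   NetChain[{AggregationLayer[Max, 1], LinearLayer[Length[classes]], SoftmaxLayer[]},
    "Input" -> {"Varying", 300},
    "Output" -> NetDecoder[{"Class", classes}]],
   Thread[trainInGlove -> Values[train]],
   ValidationSet -> Thread[validInGlove -> Values[valid]]];
```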
Compare the results obtained with BERT and with GloVe:
In[30]:= | ![]() |
Out[30]= | ![]() |
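For example, by putting the two validation error rates side by side:

```wolfram
Dataset[<|
  "BERT features" -> NetMeasurements[trained, validData, "ErrorRate"],
  "GloVe features" -> NetMeasurements[trainedGlove, Thread[validInGlove -> Values[valid]], "ErrorRate"]|>]
```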
Inspect the number of parameters of all arrays in the net:
In[31]:= | ![]() |
Out[32]= | ![]() |
Obtain the total number of parameters:
In[33]:= | ![]() |
Out[34]= | ![]() |
Obtain the layer type counts:
In[35]:= | ![]() |
Out[36]= | ![]() |
Display the summary graphic:
In[37]:= | ![]() |
Out[38]= | ![]() |
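These four steps use the standard net-introspection properties of Information (Wolfram Language 12.1 or above); a sketch:

```wolfram
Information[bert, "ArraysElementCounts"]      (* element count of every array *)
Information[bert, "ArraysTotalElementCount"]  (* total number of parameters *)
Information[bert, "LayerTypeCounts"]          (* number of layers of each type *)
Information[bert, "SummaryGraphic"]           (* summary graphic of the net *)
```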
Wolfram Language 12.1 (March 2020) or above