Wolfram Research

BERT Trained on BookCorpus and English Wikipedia Data

Represent text as a sequence of vectors

Released in 2018, Bidirectional Encoder Representations from Transformers (BERT) is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. The pre-trained model can be fine-tuned with an additional output layer to create state-of-the-art models for a wide range of tasks. It uses bidirectional self-attention, often referred to as a transformer encoder.

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["BERT Trained on BookCorpus and English Wikipedia Data"]
Out[1]=

Basic usage

For each token, the net produces a feature vector of length 768:

In[2]:=
embeddings = 
  NetModel["BERT Trained on BookCorpus and English Wikipedia Data"][
   "Hello world! I am here"];

Obtain dimensions of the embeddings:

In[3]:=
Dimensions@embeddings
Out[3]=

Visualize the embeddings:

In[4]:=
MatrixPlot@embeddings
Out[4]=
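
The first dimension of the output corresponds to the number of tokens, including the [CLS] and [SEP] markers added by the tokenizer. As a rough check (assuming the NetEncoder can be extracted from the “Input” port, as is done in the “Transformer architecture” section below), the length of the encoded token sequence matches the number of embedding vectors:

Length[
  NetExtract[
    NetModel["BERT Trained on BookCorpus and English Wikipedia Data"],
    "Input"]["Hello world! I am here"]] == First[Dimensions[embeddings]]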

NetModel parameters

Inspect the available parameters:

In[5]:=
NetModel["BERT Trained on BookCorpus and English Wikipedia Data", \
"ParametersInformation"]
Out[5]=

Pick a non-default model by specifying the parameters:

In[6]:=
NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data", 
  "Type" -> "LargeUncased", "InputType" -> "ListOfStrings"}]
Out[6]=
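
The large variants use a hidden size of 1024, so (note that this evaluation downloads the larger model) the per-token vectors they produce have length 1024 instead of 768:

Dimensions@
 NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
    "Type" -> "LargeUncased", "InputType" -> "ListOfStrings"}][
  {"Hello world!", "I am here"}]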

Pick a non-default untrained net:

In[7]:=
NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data", 
  "Type" -> "BaseUncased", 
  "InputType" -> "String"}, "UninitializedEvaluationNet"]
Out[7]=

Transformer architecture

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[8]:=
net = NetModel[{"BERT Trained on BookCorpus and English Wikipedia \
Data", "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"]
Out[9]=

For each token, the encoder returns a pair of the form {tokenIndex, segmentIndex}. For the model obtained by setting “InputType” to “ListOfStrings”, the segment index is 1 for tokens of the first sentence and 2 for tokens of the second. When “InputType” is set to “String”, all segment indices are 1:

In[10]:=
netencoder[{"Hello world!", "I am here"}]
Out[10]=

The first sentence always starts with the special token index 102, corresponding to the classification token [CLS], while both sentences always end with the special token index 103, corresponding to the separator token [SEP]:

In[11]:=
netencoder[{"I start with 102 and end with 103", "I end with 103"}]
Out[11]=
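
This can be checked programmatically; assuming the encoded sequence is a single flat list of {tokenIndex, segmentIndex} pairs spanning both sentences, the first token index is 102 and the last is 103:

codes = netencoder[{"Hello world!", "I am here"}]; (* temporary variable for this check *)
{codes[[1, 1]], codes[[-1, 1]]}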

Together with the token and segment indices, position indices are also generated:

In[12]:=
net[{"Hello world!", "I am here"}, 
 NetPort[{"embedding", "posembed", "Output"}]]
Out[12]=

Indices are then embedded into numeric vectors of size 768:

In[13]:=
embeddings = net[{"Hello world!", "I am here"},
  {NetPort[{"embedding", "embeddingpos", "Output"}],
   NetPort[{"embedding", "embeddingtokens", "Output"}],
   NetPort[{"embedding", "embeddingsegments", "Output"}]}]
Out[13]=

Obtain the dimensions:

In[14]:=
Map[Dimensions, embeddings]
Out[14]=

Visualize the embedding architecture:

In[15]:=
NetExtract[net, "embedding"]
Out[15]=
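
After the three embeddings are combined and normalized, the “embedding” subnetwork outputs one 768-dimensional vector per token. Assuming its output can be requested directly via NetPort, like the inner ports above, its dimensions can be inspected as follows:

Dimensions[net[{"Hello world!", "I am here"}, NetPort[{"embedding", "Output"}]]]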

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[16]:=
NetExtract[net, "encoder"]
Out[16]=

The key component of these blocks is the attention module, consisting of 12 parallel self-attention transformations, also called “attention heads”:

In[17]:=
NetExtract[net, {"encoder", 1, 1}]
Out[17]=

Each head uses an AttentionLayer at its core:

In[18]:=
NetExtract[net, {"encoder", 1, 1, "attention", 1}]
Out[18]=

Sentence analogies

Define a list of sentences for comparison:

In[19]:=
sentences = {"The music is soothing to the ears.", 
   "The song blasted from the little radio.", 
   "This soundtrack is too good.", "Food is needed for survival.", 
   "If you are hungry, please eat.", "She cooks really well."};

Precompute the embeddings for the list of sentences:

In[20]:=
net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"];
assoc = AssociationThread[sentences -> net[sentences][[All, -1]]];

Visualize the similarity between the sentences using the net as a feature extractor:

In[21]:=
FeatureSpacePlot[KeyValueMap[Labeled[#2, #1] &, assoc],
 LabelingFunction -> Callout, ImageSize -> Large]
Out[21]=
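
The embeddings can also be compared numerically, for instance with a matrix of pairwise cosine distances (smaller values indicate more similar sentences):

MatrixPlot[DistanceMatrix[Values[assoc], DistanceFunction -> CosineDistance]]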

Train a classifier model with the word embeddings

Get a text-processing dataset:

In[22]:=
train = ExampleData[{"MachineLearning", "MovieReview"}, 
   "TrainingData"];
valid = ExampleData[{"MachineLearning", "MovieReview"}, "TestData"];

View a random sample of the dataset:

In[23]:=
RandomSample[train, 1]
Out[23]=

Precompute the BERT vectors for the training and the validation datasets (a GPU, if available, is recommended):

In[24]:=
trainembeddings = 
  net[train[[All, 1]], TargetDevice -> "CPU"] -> train[[All, 2]];
validembeddings = 
  net[valid[[All, 1]], TargetDevice -> "CPU"] -> valid[[All, 2]];

Define a simple network for classification, using a max-pooling strategy:

In[25]:=
classifier = NetChain[
  {DropoutLayer[],
   NetMapOperator[2],
   AggregationLayer[Max, 1],
   SoftmaxLayer[]},
  "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]
Out[25]=

Train the network on the precomputed BERT vectors:

In[26]:=
results = NetTrain[classifier, trainembeddings, All,
  ValidationSet -> validembeddings,
  TrainingStoppingCriterion -> <|"Criterion" -> "ErrorRate", 
    "Patience" -> 50|>,
  TargetDevice -> "CPU",
  MaxTrainingRounds -> 500]
Out[26]=

Check the classification error rate on the validation data:

In[27]:=
Min@results["ValidationMeasurementsLists", "ErrorRate"]
Out[27]=
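
The trained classifier can be retrieved from the results object and chained with the BERT feature extractor to label new text (the review below is an arbitrary example sentence):

trained = results["TrainedNet"]; (* illustrative variable name *)
trained[net["A heartfelt and beautifully acted film."]]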

Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets (a GPU, if available, is recommended):

In[28]:=
glove = NetModel[
   "GloVe 300-Dimensional Word Vectors Trained on Wikipedia and \
Gigaword 5 Data"];
In[29]:=
trainembeddingsglove = 
  glove[train[[All, 1]], TargetDevice -> "CPU"] -> train[[All, 2]];
validembeddingsglove = 
  glove[valid[[All, 1]], TargetDevice -> "CPU"] -> valid[[All, 2]];

Train the classifier on the precomputed GloVe vectors:

In[30]:=
results = NetTrain[classifier, trainembeddingsglove, All,
  ValidationSet -> validembeddingsglove,
  TrainingStoppingCriterion -> <|"Criterion" -> "ErrorRate", 
    "Patience" -> 50|>,
  TargetDevice -> "CPU",
  MaxTrainingRounds -> 50]
Out[30]=

Check the classification error rate on the validation data:

In[31]:=
Min@results["ValidationMeasurementsLists", "ErrorRate"]
Out[31]=

Net information

Inspect the number of parameters of all arrays in the net:

In[32]:=
NetInformation[
 NetModel["BERT Trained on BookCorpus and English Wikipedia Data"], \
"ArraysElementCounts"]
Out[32]=

Obtain the total number of parameters:

In[33]:=
NetInformation[
 NetModel["BERT Trained on BookCorpus and English Wikipedia Data"], \
"ArraysTotalElementCount"]
Out[33]=

Obtain the layer type counts:

In[34]:=
NetInformation[
 NetModel["BERT Trained on BookCorpus and English Wikipedia Data"], \
"LayerTypeCounts"]
Out[34]=

Display the summary graphic:

In[35]:=
NetInformation[
 NetModel["BERT Trained on BookCorpus and English Wikipedia Data"], \
"SummaryGraphic"]
Out[35]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[36]:=
jsonPath = 
 Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], 
  NetModel["BERT Trained on BookCorpus and English Wikipedia Data"], 
  "MXNet"]
Out[36]=

Export also creates a net.params file containing parameters:

In[37]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[37]=

Get the size of the parameter file:

In[38]:=
FileByteCount[paramPath]
Out[38]=

The size is similar to the byte count of the resource object:

In[39]:=
ResourceObject[
  "BERT Trained on BookCorpus and English Wikipedia \
Data"]["ByteCount"]
Out[39]=
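
If the MXNet import format is available, the exported topology can be read back into the Wolfram Language; this sketch assumes the importer picks up the parameters from the adjacent "net.params" file:

Import[jsonPath, "MXNet"]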

Requirements

Wolfram Language 12.0 (April 2019) or above

Resource History

Reference