RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets

Represent text as a sequence of vectors

Released in 2019, this model uses several pre-training and design optimizations, such as training longer with bigger batches over more data, removing the next-sentence prediction objective, training on longer sequences and dynamically changing the masking pattern, to obtain a substantial improvement in performance over the existing BERT models.

Number of models: 3

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, \
OpenWebText and Stories Datasets"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, \
OpenWebText and Stories Datasets", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, \
OpenWebText and Stories Datasets", "Type" -> "Large", "InputType" -> "ListOfStrings"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, \
OpenWebText and Stories Datasets", "Type" -> "Base", "InputType" -> "ListOfStrings"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Given a piece of text, the RoBERTa net produces a sequence of feature vectors of size 768, which correspond to the sequence of input words or subwords:

In[5]:=
input = "Hello world! I am here";
embeddings = NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, \
CC-News, OpenWebText and Stories Datasets"][input];

Obtain dimensions of the embeddings:

In[6]:=
Dimensions@embeddings
Out[6]=

Visualize the embeddings:

In[7]:=
MatrixPlot@embeddings
Out[7]=

Transformer architecture

Each input text segment is first tokenized into words or subwords using a byte pair encoding (BPE) tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[8]:=
net = NetModel[{"RoBERTa Trained on BookCorpus, English Wikipedia, \
CC-News, OpenWebText and Stories Datasets", "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"]
Out[9]=

For each input subword token, the encoder yields a pair of indices that correspond to the token index in the vocabulary, and the index of the sentence within the list of input sentences:

In[10]:=
netencoder[{"Hello world!", "I am here"}]
Out[10]=

The list of tokens always starts with special token index 1, which corresponds to the classification index. The special token index 3 is used as a separator between the different text segments, marking the end of each sentence and the beginning of every sentence after the first. Each subword token is also assigned a positional index:

In[11]:=
net[{"Hello world!", "I am here"}, NetPort[{"embedding", "posembed", "Output"}]]
Out[11]=

A lookup is done to map these indices to numeric vectors of size 768:

In[12]:=
embeddings = net[{"Hello world!", "I am here"},
   {NetPort[{"embedding", "embeddingpos", "Output"}],
    NetPort[{"embedding", "embeddingtokens", "Output"}]}];
Map[MatrixPlot, embeddings]
Out[13]=
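
Both embedding arrays contain one vector of size 768 per subword token; this can be checked by inspecting their dimensions (reusing the embeddings computed above):

In[]:= Map[Dimensions, embeddings]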

For each subword token, these embeddings are combined by summing elements with ThreadingLayer:

In[14]:=
NetExtract[net, "embedding"]
Out[14]=

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[15]:=
NetExtract[net, "encoder"]
Out[15]=

The key part of these blocks is the attention module, comprising 12 parallel self-attention transformations, also called “attention heads”:

In[16]:=
NetExtract[net, {"encoder", 1, 1, "attention"}]
Out[16]=

BERT-like models use self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:
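
To make this connectivity pattern concrete, the following is a minimal stand-alone sketch of scaled dot-product self-attention on a toy array (not extracted from the RoBERTa net, and omitting the learned query, key and value projections used by the actual attention heads): each output position is a weighted sum over all input positions, with weights given by a softmax over dot products.

In[]:= (* toy self-attention: x is a sequence of 5 vectors of size 8 *)
selfAttend[x_] := Module[{scores},
  scores = SoftmaxLayer[][x . Transpose[x]/Sqrt[N@Last[Dimensions[x]]]];
  scores . x];
selfAttend[RandomReal[1, {5, 8}]] // Dimensions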

Sentence analogies

Define a sentence embedding that takes the last feature vector from RoBERTa subword embeddings (as an arbitrary choice):

In[17]:=
sentenceembedding = NetAppend[
  NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, \
CC-News, OpenWebText and Stories Datasets"], "pooling" -> SequenceLastLayer[]]
Out[17]=
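
Applied to a string, this sentence embedding yields a single vector of size 768 (a quick check of the construction above; the input sentence is just an example):

In[]:= Dimensions[sentenceembedding["The weather is nice today"]]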

Define a list of sentences in two broad categories (food and music):

In[18]:=
sentences = {"The music is soothing to the ears", "The song blasted from the little radio", "This soundtrack is too good", "Food is needed for survival", "If you are hungry, please eat", "She cooks really well"};

Precompute the embeddings for a list of sentences:

In[19]:=
assoc = AssociationThread[sentences -> sentenceembedding[sentences]];

Visualize the similarity between the sentences using the net as a feature extractor:

In[20]:=
FeatureSpacePlot[assoc, LabelingFunction -> Callout]
Out[20]=
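
The precomputed embeddings can also be queried directly. As a sketch (reusing assoc and sentenceembedding from above, with an example query sentence), rank the sentences by cosine distance to a new music-related sentence, closest first:

In[]:= query = sentenceembedding["I enjoy listening to melodies on the radio"];
SortBy[Keys[assoc], CosineDistance[assoc[#], query] &]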

Train a classifier model with the subword embeddings

Get a text-processing dataset:

In[21]:=
train = ResourceData["Sample Data: Movie Review Sentence Polarity", "TrainingData"];
valid = ResourceData["Sample Data: Movie Review Sentence Polarity", "TestData"];

View a random sample of the dataset:

In[22]:=
RandomSample[train, 1]
Out[22]=

Precompute the RoBERTa vectors for the training and the validation datasets (if available, a GPU is highly recommended):

In[23]:=
trainembeddings = NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, \
CC-News, OpenWebText and Stories Datasets"][train[[All, 1]], TargetDevice -> "CPU"] -> train[[All, 2]];
validembeddings = NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, \
CC-News, OpenWebText and Stories Datasets"][valid[[All, 1]], TargetDevice -> "CPU"] -> valid[[All, 2]];
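
Each precomputed training example is now a matrix of subword vectors, one row of size 768 per token, paired with its class label. Check the dimensions of the first example (reusing the trainembeddings rule just computed):

In[]:= Dimensions[First[First[trainembeddings]]]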

Define a network to classify the sequences of subword embeddings, using a max-pooling strategy:

In[24]:=
classifierhead = NetChain[{DropoutLayer[], NetMapOperator[2], AggregationLayer[Max, 1], SoftmaxLayer[]}, "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]
Out[24]=
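
The AggregationLayer[Max, 1] collapses the variable-length sequence of mapped vectors into a single fixed-size vector before the softmax. As a point of comparison (a hypothetical variant, not part of the original workflow), a mean-pooling head can be defined by swapping the aggregation function:

In[]:= meanpoolhead = NetChain[{DropoutLayer[], NetMapOperator[2], AggregationLayer[Mean, 1], SoftmaxLayer[]}, "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]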

Train the network on the precomputed vectors from RoBERTa:

In[25]:=
robertaresults = NetTrain[classifierhead, trainembeddings, All,
  ValidationSet -> validembeddings,
  TargetDevice -> "CPU",
  MaxTrainingRounds -> 50]
Out[25]=

Check the classification error rate on the validation data:

In[26]:=
robertaresults["ValidationMeasurements", "ErrorRate"]
Out[26]=
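
To classify a new piece of text end to end (a sketch reusing the robertaresults object above; the review sentence is just an example), embed it with RoBERTa and feed the resulting sequence of vectors to the trained head:

In[]:= trainednet = robertaresults["TrainedNet"];
trainednet[NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, \
CC-News, OpenWebText and Stories Datasets"]["An intelligent and quietly moving film"]]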

Let’s compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets:

In[27]:=
trainembeddingsglove = NetModel["GloVe 300-Dimensional Word Vectors Trained on Wikipedia \
and Gigaword 5 Data"][train[[All, 1]], TargetDevice -> "CPU"] -> train[[All, 2]];
validembeddingsglove = NetModel["GloVe 300-Dimensional Word Vectors Trained on Wikipedia \
and Gigaword 5 Data"][valid[[All, 1]], TargetDevice -> "CPU"] -> valid[[All, 2]];

Train the classifier on the precomputed GloVe vectors:

In[28]:=
gloveresults = NetTrain[classifierhead, trainembeddingsglove, All,
  ValidationSet -> validembeddingsglove,
  TrainingStoppingCriterion -> <|"Criterion" -> "ErrorRate", "Patience" -> 50|>,
  TargetDevice -> "CPU",
  MaxTrainingRounds -> 50]
Out[28]=

Compare the results obtained with RoBERTa and with GloVe:

In[29]:=
Dataset[<|"RoBERTa" -> robertaresults["ValidationMeasurements"], "GloVe" -> gloveresults["ValidationMeasurements"]|>]
Out[29]=

Net information

Inspect the number of parameters of all arrays in the net:

In[30]:=
Information[
 NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, \
OpenWebText and Stories Datasets"], "ArraysElementCounts"]
Out[30]=

Obtain the total number of parameters:

In[31]:=
Information[
 NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, \
OpenWebText and Stories Datasets"], "ArraysTotalElementCount"]
Out[31]=

Obtain the layer type counts:

In[32]:=
Information[
 NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, \
OpenWebText and Stories Datasets"], "LayerTypeCounts"]
Out[32]=

Display the summary graphic:

In[33]:=
Information[
 NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, \
OpenWebText and Stories Datasets"], "SummaryGraphic"]
Out[33]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[34]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["RoBERTa Trained on BookCorpus, English Wikipedia, \
CC-News, OpenWebText and Stories Datasets"], "MXNet"]
Out[34]=

Export also creates a net.params file containing parameters:

In[35]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[35]=

Get the size of the parameter file:

In[36]:=
FileByteCount[paramPath]
Out[36]=

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Reference