GPT2 Transformer Trained on WebText Data

Generate text in English and represent text as a sequence of vectors

Released in 2019, this model improves and scales up its predecessor model. It has a richer vocabulary and uses BPE tokenization on UTF-8 byte sequences and additional normalization at the end of all of the transformer blocks.

Number of models: 3

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["GPT2 Transformer Trained on WebText Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["GPT2 Transformer Trained on WebText Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
lm = NetModel[{"GPT2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling", "Size" -> "345M"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"GPT2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Given a piece of text, the GPT-2 net produces a sequence of feature vectors of size 768, which corresponds to the sequence of input words or subwords:

In[5]:=
embeddings = NetModel["GPT2 Transformer Trained on WebText Data"][
   "The cat is on the mat"];

Obtain dimensions of the embeddings:

In[6]:=
Dimensions[embeddings]
Out[6]=

Visualize the embeddings:

In[7]:=
MatrixPlot[embeddings]
Out[7]=

Transformer architecture

The input string is first normalized and then tokenized, or split into words or subwords. This two-step process is accomplished using the NetEncoder "Function":

In[8]:=
net = NetModel["GPT2 Transformer Trained on WebText Data"];
netencoder = NetExtract[net, "Input"]
Out[9]=

The tokenization step is performed using the NetEncoder "BPESubwordTokens" and can be extracted using the following steps:

In[10]:=
pos = First@ Position[NetExtract[netencoder, "Function"], _NetEncoder];
Extract[NetExtract[netencoder, "Function"], pos]
Out[11]=

The encoder produces an integer index for each subword token that corresponds to the position in the vocabulary:

In[12]:=
netencoder["Hello world! I am here"]
Out[12]=

Each subword token is also assigned a positional index:

In[13]:=
net["Hello world! I am here", NetPort[{"embedding", "posembed", "Output"}]]
Out[13]=

A lookup is done to map these indices to numeric vectors of size 768:

In[14]:=
embeddings = net["Hello world! I am here",
   {NetPort[{"embedding", "embeddingpos", "Output"}],
    NetPort[{"embedding", "embeddingtokens", "Output"}]}];
Map[MatrixPlot, embeddings]
Out[15]=

For each subword token, these two embeddings are combined by summing elements with ThreadingLayer:

In[16]:=
NetExtract[net, "embedding"]
Out[16]=

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[17]:=
NetExtract[net, "decoder"]
Out[17]=

The key part of these blocks is the attention module comprising of 12 parallel self-attention transformations, also called “attention heads”:

In[18]:=
NetExtract[net, {"decoder", 1, 1}]
Out[18]=

Attention is done with causal masking, which means that the embedding of a given subword token depends on the previous subword tokens and not on the next ones. This is a prerequisite to be able to generate text with the language model. The following figures compare causal attention to other forms of connectivity between input tokens:

Language modeling

Retrieve the language model by specifying the "Task" parameter:

In[19]:=
lm = NetModel[{"GPT2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling"}]
Out[19]=

Predict the next word in a given sequence:

In[20]:=
lm["Albert Einstein was a German-born theoretical physicist"]
Out[20]=

Obtain the top 15 probabilities:

In[21]:=
topProbs = lm["Albert Einstein was a German-born theoretical physicist", {"TopProbabilities", 15}]
Out[21]=

Plot the top 15 probabilities:

In[22]:=
BarChart[
 Thread@Labeled[Values@topProbs, Keys[topProbs] /. {"\n" -> "\\n", "\t" -> "\\t"}], ScalingFunctions -> "Log", ImageSize -> Large]
Out[22]=

Text generation

Define a function to predict the next token:

In[23]:=
lm = NetModel[{"GPT2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling"}];
generateSample[languagemodel_][input_String, numTokens_ : 10, temperature_ : 1] := Nest[Function[
    StringJoin[#, languagemodel[#, {"RandomSample", "Temperature" -> temperature}]]], input, numTokens];

Generate the next 20 tokens by using it on a piece of text:

In[24]:=
generateSample[
  lm]["Albert Einstein was a German-born theoretical physicist", 20]
Out[24]=

The third optional argument is a “temperature” parameter that scales the input to the final softmax. A high temperature flattens the distribution from which tokens are sampled, increasing the probability of extracting less likely tokens:

In[25]:=
generateSample[
  lm]["Albert Einstein was a German-born theoretical physicist", 40, 1.5]
Out[25]=

Decreasing the temperature sharpens the peaks of the sampling distribution, further decreasing the probability of extracting less likely tokens:

In[26]:=
generateSample[
  lm]["Albert Einstein was a German-born theoretical physicist", 50, 0.5]
Out[26]=

Very high temperature settings are equivalent to random sampling:

In[27]:=
generateSample[
  lm]["Albert Einstein was a German-born theoretical physicist", 50, 10]
Out[27]=

Very low temperature settings are equivalent to always picking the character with maximum probability. It is typical for sampling to “get stuck in a loop”:

In[28]:=
generateSample[
  lm]["Albert Einstein was a German-born theoretical physicist", 100, 0.01]
Out[28]=

Efficient text generation

The text generation example in the previous section wastes computational resources because every time a new token is produced, the language model reads the entire generated string from the beginning. This means that generating new tokens is more and more costly as text generation progresses. This can be avoided by using NetUnfold:

In[29]:=
lm = NetModel[{"GPT2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling"}];
unfolded = NetUnfold[lm];
encoder = NetExtract[lm, "Input"];

Write a function to efficiently generate text using the unfolded net:

In[30]:=
generateSampleEfficient[languagemodel_][input_String, numTokens_ : 10,
    temperature_ : 1] := Block[
   {encodedinput = encoder[input], index = 1, init, props, generated = {}},
   init = Join[
     <|"Input" -> First@encodedinput, "Index" -> index|>,
     Association@Table["State" <> ToString[i] -> {}, {i, 24}]
     ];
   props = Append[Table["OutState" <> ToString[i], {i, 24}], "Output" -> {"RandomSample", "Temperature" -> temperature}];
   Nest[
    Function@Block[
      {newinput = KeyMap[StringReplace["OutState" -> "State"], languagemodel[#, props]]},
      Join[newinput,
       <|
        "Index" -> ++index,
        "Input" -> If[index <= Length[encodedinput],
          encodedinput[[index]],
          AppendTo[generated, newinput["Output"]];
          Last@encoder@newinput["Output"]
          ]
        |>
       ]
      ],
    init,
    numTokens + Length[encodedinput] - 1
    ];
   StringJoin[input, generated]
   ];

Generate the next 20 tokens efficiently by using it on a piece of text:

In[31]:=
generateSampleEfficient[
  unfolded]["Albert Einstein was a German-born theoretical physicist", 20]
Out[31]=

Compute the timings of the two methods for an increasing number of tokens:

In[32]:=
inefficientTimings = AssociationMap[
   First@AbsoluteTiming[generateSample[lm]["I am", #]] &,
   Range[10, 210, 50]
   ];
efficientTimings = AssociationMap[
   First@
     AbsoluteTiming[generateSampleEfficient[unfolded]["I am", #]] &,
   Range[10, 210, 50]
   ];

Observe that the inefficient method grows quadratically with the number of tokens, while the efficient one is linear:

In[33]:=
ListLinePlot[{inefficientTimings, efficientTimings}, AxesLabel -> {"number of generated tokens", "time (sec)"}]
Out[33]=

Sentence analogies

Define a sentence embedding that consists of the last subword embedding of GPT-2 (this choice is justified by the fact that GPT-2 is a forward causal model):

In[34]:=
sentenceembedding = NetAppend[NetModel["GPT2 Transformer Trained on WebText Data"], "last" -> SequenceLastLayer[]]
Out[34]=

Define some sentences in two broad categories for comparison:

In[35]:=
sentences = {"I put on some nice soothing music.", "The song blasted from the little radio.", "The soundtrack from the movie was so good.", "Food is needed for survival.", "Go on, eat if you are hungry.", "Her baking skills are terrible."};

Precompute the embeddings for a list of sentences:

In[36]:=
assoc = AssociationThread[sentences -> sentenceembedding[sentences]];

Visualize the similarity between the sentences using the net as a feature extractor:

In[37]:=
FeatureSpacePlot[assoc, LabelingFunction -> Callout, ImageSize -> Medium]
Out[37]=

Train a classifier model with the subword embeddings

Get a text-processing dataset:

In[38]:=
train = ResourceData["Sample Data: Movie Review Sentence Polarity", "TrainingData"];
valid = ResourceData["Sample Data: Movie Review Sentence Polarity", "TestData"];

View a random sample of the dataset:

In[39]:=
RandomSample[train, 1]
Out[39]=

Define a sentence embedding that consists of the last subword embedding of GPT-2 (this choice is justified by the fact that GPT-2 is a forward causal model):

In[40]:=
sentenceembedding = NetAppend[NetModel["GPT2 Transformer Trained on WebText Data"], "last" -> SequenceLastLayer[]]
Out[40]=

Precompute the GPT-2 vectors for the training and the validation datasets (if available, GPU is recommended), using the last embedded vector as a representation of the entire text:

In[41]:=
trainembeddings = sentenceembedding[train[[All, 1]], TargetDevice -> "CPU"] -> train[[All, 2]];
validembeddings = sentenceembedding[valid[[All, 1]], TargetDevice -> "CPU"] -> valid[[All, 2]];

Define a simple network for classification:

In[42]:=
classifierhead = NetChain[
  {DropoutLayer[], 2, SoftmaxLayer[]},
  "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]
  ]
Out[42]=

Train the network on the precomputed GPT-2 vectors:

In[43]:=
gpt2results = NetTrain[classifierhead, trainembeddings, All,
  ValidationSet -> validembeddings,
  TrainingStoppingCriterion -> <|"Criterion" -> "ErrorRate", "Patience" -> 50|>,
  TargetDevice -> "CPU",
  MaxTrainingRounds -> 500]
Out[43]=

Check the classification error rate on the validation data:

In[44]:=
gpt2results["ValidationMeasurements", "ErrorRate"]
Out[44]=

Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets (if available, GPU is recommended):

In[45]:=
glove = NetModel[
   "GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data"];
In[46]:=
trainembeddingsglove = glove[train[[All, 1]], TargetDevice -> "CPU"] -> train[[All, 2]];
validembeddingsglove = glove[valid[[All, 1]], TargetDevice -> "CPU"] -> valid[[All, 2]];

Define a simple network for classification using a max-pooling strategy:

In[47]:=
gloveclassifierhead = NetChain[
  {DropoutLayer[],
   NetMapOperator[2],
   AggregationLayer[Max, 1],
   SoftmaxLayer[]},
  "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]
Out[47]=

Train the classifier on the precomputed GloVe vectors:

In[48]:=
gloveresults = NetTrain[gloveclassifierhead, trainembeddingsglove, All,
  ValidationSet -> validembeddingsglove,
  TrainingStoppingCriterion -> <|"Criterion" -> "ErrorRate", "Patience" -> 50|>,
  TargetDevice -> "CPU",
  MaxTrainingRounds -> 50]
Out[48]=

Compare the results obtained with GPT-2 and with GloVe:

In[49]:=
Dataset[<|"GPT-2" -> gpt2results["ValidationMeasurements"], "GloVe" -> gloveresults["ValidationMeasurements"]|>]
Out[49]=

Net information

Inspect the number of parameters of all arrays in the net:

In[50]:=
Information[
 NetModel[
  "GPT2 Transformer Trained on WebText Data"], "ArraysElementCounts"]
Out[50]=

Obtain the total number of parameters:

In[51]:=
Information[
 NetModel[
  "GPT2 Transformer Trained on WebText Data"], "ArraysTotalElementCount"]
Out[51]=

Obtain the layer type counts:

In[52]:=
Information[
 NetModel[
  "GPT2 Transformer Trained on WebText Data"], "LayerTypeCounts"]
Out[52]=

Display the summary graphic:

In[53]:=
Information[
 NetModel[
  "GPT2 Transformer Trained on WebText Data"], "SummaryGraphic"]
Out[53]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[54]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["GPT2 Transformer Trained on WebText Data"], "MXNet"]
Out[54]=

Export also creates a net.params file containing parameters:

In[55]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[55]=

Get the size of the parameter file:

In[56]:=
FileByteCount[paramPath]
Out[56]=

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Reference