Wolfram Research

GPT Transformer Trained on BookCorpus Data

Generate text in English and represent text as a sequence of vectors

Released in 2018, this language model uses a multilayer transformer decoder. It applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feedforward layers, to produce an output distribution over target tokens.

Number of layers: 857 | Parameter count: 116,534,784 | Trained size: 474 MB

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["GPT Transformer Trained on BookCorpus Data"]
Out[1]=

Basic usage

For each token, the net produces a feature vector of length 768:

In[2]:=
embeddings = 
 NetModel["GPT Transformer Trained on BookCorpus Data"][
  "Hello world! I am here"]
Out[2]=

Obtain dimensions of the embeddings:

In[3]:=
Dimensions@embeddings
Out[3]=

Visualize the embeddings:

In[4]:=
MatrixPlot@embeddings
Out[4]=

NetModel parameters

Inspect the available parameters:

In[5]:=
NetModel["GPT Transformer Trained on BookCorpus Data", \
"ParametersInformation"]
Out[5]=

Pick a non-default model by specifying the parameters:

In[6]:=
lm = NetModel[{"GPT Transformer Trained on BookCorpus Data", 
   "Task" -> "LanguageModeling"}]
Out[6]=

Pick a non-default untrained net:

In[7]:=
NetModel[{"GPT Transformer Trained on BookCorpus Data", 
  "Task" -> "LanguageModeling"}, "UninitializedEvaluationNet"]
Out[7]=

Transformer architecture

The input string is first tokenized into words or subwords using a BPE encoder and additional text normalizations:

In[8]:=
net = NetModel["GPT Transformer Trained on BookCorpus Data"];
 netencoder = NetExtract[net, "Input"]
Out[9]=

The encoder produces integer indices for each input token:

In[10]:=
netencoder["Hello world! I am here"]
Out[10]=

Together with the token indices, positional indices are also generated:

In[11]:=
net["Hello world! I am here", 
 NetPort[{"embedding", "posembed", "Output"}]]
Out[11]=

Indices are then embedded into numeric vectors of size 768:

In[12]:=
embeddings = 
 net["Hello world! I am here", {NetPort[{"embedding", "embeddingpos", 
     "Output"}], NetPort[{"embedding", "embeddingtokens", "Output"}]}]
Out[12]=

Obtain the dimensions:

In[13]:=
Map[Dimensions, embeddings]
Out[13]=
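
As in the original GPT setup, the positional and token embeddings are summed elementwise to form the input to the decoder stack. A quick shape check (a sketch, assuming embeddings still holds the two arrays computed above):

(* sum the positional and token embeddings, as the embedding module does *)
Dimensions[embeddings[[1]] + embeddings[[2]]]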

Visualize the embedding architecture:

In[14]:=
NetExtract[net, "embedding"]
Out[14]=

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[15]:=
NetExtract[net, "decoder"]
Out[15]=

The key part of these blocks is the attention module consisting of 12 parallel self-attention transformations, also called “attention heads”:

In[16]:=
NetExtract[net, {"decoder", 1, 1}]
Out[16]=

Each head uses an AttentionLayer at its core:

In[17]:=
NetExtract[net, {"decoder", 1, 1, "attention", 1}]
Out[17]=
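
For reference, the computation at the heart of each head is a scaled dot-product attention. The following is a minimal sketch using plain matrix operations, omitting the learned query/key/value projections and the causal mask that the layer applies (all names and the toy dimensions are illustrative):

(* rowwise softmax *)
softmaxRows[m_] := Map[Exp[# - Max[#]]/Total[Exp[# - Max[#]]] &, m];

(* scaled dot-product attention: softmax(q.Transpose[k]/Sqrt[d]).v *)
scaledDotAttention[q_, k_, v_] := 
  softmaxRows[q . Transpose[k]/Sqrt[N@Last@Dimensions[k]]] . v;

(* toy example: a sequence of 4 tokens with head dimension 64 *)
SeedRandom[1];
{q, k, v} = RandomReal[1, {3, 4, 64}];
Dimensions@scaledDotAttention[q, k, v]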

Language modeling: Basic usage

Retrieve the language model by specifying the “Task” parameter:

In[18]:=
lm = NetModel[{"GPT Transformer Trained on BookCorpus Data", 
   "Task" -> "LanguageModeling"}]
Out[18]=

Predict the next word of a given sequence:

In[19]:=
lm["Where have you "]
Out[19]=

Obtain the top 15 probabilities:

In[20]:=
topProbs = lm["Where have you ", {"TopProbabilities", 15}]
Out[20]=

Plot the top 15 probabilities:

In[21]:=
BarChart[Thread@
  Labeled[Values@topProbs, 
   Keys[topProbs] /. {"\n" -> "\\n", "\t" -> "\\t"}], 
 ScalingFunctions -> "Log", ImageSize -> Large]
Out[21]=

Text generation

Modify the language model so that it accepts encoded token indices as input and produces token indices as output:

In[22]:=
netencoder = NetExtract[lm, "Input"];
netdecoder = NetExtract[lm, "Output"];
numwords = NetExtract[netdecoder, "Dimensions"];
lmmod = NetReplacePart[lm,
   {"Input" -> None,
    "Output" -> NetDecoder[{"Class", Range@numwords}]}];

Create a new decoder that looks up the string corresponding to each token index, followed by some text cleanup:

In[23]:=
assoc = AssociationThread[
   Range@numwords -> NetExtract[netdecoder, "Labels"]];
decoder = Function[array,
   StringReplace[
    StringJoin@Lookup[assoc, array], {"\n " -> "\n", 
     " " ~~ x : PunctuationCharacter :> x}]];

Define a function that generates text by iteratively sampling the next token from the modified language model:

In[24]:=
generateSample[{lmmodified_, netencoder_, decoder_}][input_String, 
  numTokens_: 10, temperature_: 1] :=
 Module[{inputcodes, outputcodes},
  inputcodes = netencoder[input];
  outputcodes = 
   Nest[Function[
     Join[#, {lmmodified[#, {"RandomSample", 
         "Temperature" -> temperature}]}]], inputcodes, numTokens];
  decoder[outputcodes]]

Get an input:

In[25]:=
input = TextSentences[ResourceData["Alice in Wonderland"]][[2]]
Out[25]=

Generate the next 50 tokens by applying the function to the input:

In[26]:=
generateSample[{lmmod, netencoder, decoder}][input, 50]
Out[26]=

The third optional argument is a “temperature” parameter that scales the input to the final softmax. A high temperature flattens the distribution from which tokens are sampled, increasing the probability of extracting less likely tokens:

In[27]:=
generateSample[{lmmod, netencoder, decoder}][input, 50, 1.5]
Out[27]=

Decreasing the temperature sharpens the peaks of the sampling distribution, further decreasing the probability of extracting less likely tokens:

In[28]:=
generateSample[{lmmod, netencoder, decoder}][input, 50, 0.4]
Out[28]=

Very high temperature settings approach uniform random sampling over the vocabulary:

In[29]:=
generateSample[{lmmod, netencoder, decoder}][input, 20, 10]
Out[29]=

Very low temperature settings are equivalent to always picking the token with maximum probability. It is typical for sampling to “get stuck in a loop”:

In[30]:=
generateSample[{lmmod, netencoder, decoder}]["How are you?" , 100, 0]
Out[30]=
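
The effect of the temperature can also be illustrated directly on a toy distribution: the logits are divided by the temperature before the softmax, so values above 1 flatten the distribution and values below 1 sharpen it. A minimal sketch (names and numbers are illustrative):

(* softmax with temperature: divide the logits by t before normalizing *)
softmaxWithTemperature[logits_, t_] := 
  With[{z = Exp[(logits - Max[logits])/t]}, z/Total[z]];

(* the same toy logits at low, default and high temperature *)
softmaxWithTemperature[{2., 1., 0.1}, #] & /@ {0.4, 1., 10.}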

Sentence analogies

Define a list of sentences for comparison:

In[31]:=
sentences = {"I put on some nice soothing music.", 
   "The song blasted from the little radio.", 
   "The soundtrack from the movie was so good.", 
   "Food is needed for survival.", "Go on, eat if you are hungry.", 
   "Her baking skills are terrible."};

Precompute the embeddings for the list of sentences:

In[32]:=
assoc = AssociationThread[sentences -> net[sentences][[All, -1]]];

Visualize the similarity between the sentences using the net as a feature extractor:

In[33]:=
FeatureSpacePlot[
 Table[Labeled[(Values@assoc)[[i]], (Keys@assoc)[[i]]], {i, 
   Length@assoc}], LabelingFunction -> Callout, ImageSize -> Large]
Out[33]=
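
The same comparison can be made numerically, for example by ranking the sentence pairs from most to least similar using the cosine distance of their precomputed features (a sketch reusing assoc from above):

(* rank all sentence pairs by cosine distance of their GPT features *)
SortBy[Subsets[sentences, {2}], 
 CosineDistance[assoc[#[[1]]], assoc[#[[2]]]] &]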

Train a classifier model with the word embeddings

Get a text-processing dataset:

In[34]:=
train = ExampleData[{"MachineLearning", "MovieReview"}, 
   "TrainingData"];
valid = ExampleData[{"MachineLearning", "MovieReview"}, "TestData"];

View a random sample of the dataset:

In[35]:=
RandomSample[train, 1]
Out[35]=

Precompute the GPT vectors on the training and the validation datasets (if a GPU is available, TargetDevice -> "GPU" is recommended), using the last embedded vector as a representation of the entire text:

In[36]:=
trainembeddings = 
  net[train[[All, 1]], TargetDevice -> "CPU"][[All, -1]] -> 
   train[[All, 2]];
validembeddings = 
  net[valid[[All, 1]], TargetDevice -> "CPU"][[All, -1]] -> 
   valid[[All, 2]];

Define a simple network for classification:

In[37]:=
classifier = NetChain[
  {DropoutLayer[], 2, SoftmaxLayer[]},
  "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]
  ]
Out[37]=

Train the network on the precomputed GPT vectors:

In[38]:=
results = NetTrain[classifier, trainembeddings, All,
  ValidationSet -> validembeddings,
  TrainingStoppingCriterion -> <|"Criterion" -> "ErrorRate", 
    "Patience" -> 50|>,
  TargetDevice -> "CPU",
  MaxTrainingRounds -> 500]
Out[38]=

Check the classification error rate on the validation data:

In[39]:=
Min@results["ValidationMeasurementsLists", "ErrorRate"]
Out[39]=
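
The trained head can then be applied to new text by feeding it the GPT feature of the last token (a sketch; the example review is arbitrary):

(* classify a new review with the trained head *)
trained = results["TrainedNet"];
trained[net["This movie was a delightful surprise."][[-1]]]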

Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors on the training and the validation datasets (if a GPU is available, TargetDevice -> "GPU" is recommended):

In[40]:=
glove = NetModel["GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data"];
In[41]:=
trainembeddingsglove = 
  glove[train[[All, 1]], TargetDevice -> "CPU"] -> train[[All, 2]];
validembeddingsglove = 
  glove[valid[[All, 1]], TargetDevice -> "CPU"] -> valid[[All, 2]];

Define a simple network for classification, using a max-pooling strategy:

In[42]:=
classifier = NetChain[
  {DropoutLayer[],
   NetMapOperator[2],
   AggregationLayer[Max, 1],
   SoftmaxLayer[]},
  "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]
Out[42]=

Train the classifier on the precomputed GloVe vectors:

In[43]:=
results = NetTrain[classifier, trainembeddingsglove, All,
  ValidationSet -> validembeddingsglove,
  TrainingStoppingCriterion -> <|"Criterion" -> "ErrorRate", 
    "Patience" -> 50|>,
  TargetDevice -> "CPU",
  MaxTrainingRounds -> 50]
Out[43]=

Check the classification error rate on the validation data:

In[44]:=
Min@results["ValidationMeasurementsLists", "ErrorRate"]
Out[44]=

Net information

Inspect the number of parameters of all arrays in the net:

In[45]:=
NetInformation[NetModel["GPT Transformer Trained on BookCorpus Data"], "ArraysElementCounts"]
Out[45]=

Obtain the total number of parameters:

In[46]:=
NetInformation[NetModel["GPT Transformer Trained on BookCorpus Data"], "ArraysTotalElementCount"]
Out[46]=

Obtain the layer type counts:

In[47]:=
NetInformation[NetModel["GPT Transformer Trained on BookCorpus Data"], "LayerTypeCounts"]
Out[47]=

Display the summary graphic:

In[48]:=
NetInformation[NetModel["GPT Transformer Trained on BookCorpus Data"], "SummaryGraphic"]
Out[48]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[49]:=
jsonPath = 
 Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], 
  NetModel["GPT Transformer Trained on BookCorpus Data"], "MXNet"]
Out[49]=

Export also creates a net.params file containing parameters:

In[50]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[50]=

Get the size of the parameter file:

In[51]:=
FileByteCount[paramPath]
Out[51]=

The size is similar to the byte count of the resource object:

In[52]:=
ResourceObject[
  "GPT Transformer Trained on BookCorpus Data"]["ByteCount"]
Out[52]=

Requirements

Wolfram Language 12.0 (April 2019) or above

Resource History

Reference