Resource retrieval
Get the pre-trained net:
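A minimal sketch of this step, assuming the resource name "GPT-2 Transformer Trained on WebText Data" (check the repository for the exact name); gpt2 is an illustrative symbol reused in the sketches below:

    (* retrieve the default (feature-extraction) GPT-2 net from the Neural Net Repository *)
    gpt2 = NetModel["GPT-2 Transformer Trained on WebText Data"]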
NetModel parameters
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
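One way to do this, assuming the same resource name, is through the "ParametersInformation" property of NetModel:

    (* tabulate the parameters and allowed values for this model family *)
    NetModel["GPT-2 Transformer Trained on WebText Data", "ParametersInformation"]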
Pick a non-default net by specifying the parameters:
Pick a non-default uninitialized net:
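A sketch of both steps; the "Task" parameter and its "LanguageModeling" value are the ones used later on this page, and "UninitializedEvaluationNet" is a standard NetModel property:

    (* a non-default variant, selected by parameter *)
    NetModel[{"GPT-2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling"}]

    (* the same variant without its trained weights *)
    NetModel[{"GPT-2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling"},
     "UninitializedEvaluationNet"]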
Basic usage
Given a piece of text, the GPT-2 net produces a sequence of feature vectors of size 768, which correspond to the sequence of input words or subwords:
Obtain dimensions of the embeddings:
Visualize the embeddings:
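A combined sketch of the three steps above, using an arbitrary input sentence:

    embeddings = gpt2["Hello world! I am here"];
    Dimensions[embeddings]      (* {numberOfSubwordTokens, 768} *)
    MatrixPlot[embeddings]      (* one row per subword token *)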
Transformer architecture
The input string is first normalized and then tokenized, or split into words or subwords. This two-step process is accomplished using the NetEncoder "Function":
The tokenization step is performed using the NetEncoder "BPESubwordTokens" and can be extracted using the following steps:
The encoder produces an integer index for each subword token, corresponding to its position in the vocabulary:
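A sketch of inspecting the encoder, assuming the input port is named "Input"; the inner "BPESubwordTokens" encoder lives inside the returned NetEncoder["Function"] expression:

    netencoder = NetExtract[gpt2, "Input"]   (* the NetEncoder["Function"] attached to the input *)
    netencoder["Hello world!"]               (* one integer code per subword token *)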
Each subword token is also assigned a positional index:
An embedding lookup maps these indices to numeric vectors of size 768:
For each subword token, the token embedding and the positional embedding are summed element-wise with a ThreadingLayer:
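The following standalone sketch mirrors the embedding stage just described; it is an illustration rather than a piece extracted from the model. The vocabulary size (50,257 subword tokens) and maximum context length (1,024 positions) are the published GPT-2 values:

    NetGraph[
     <|
      "token embedding"    -> EmbeddingLayer[768, 50257],  (* subword-token index -> 768-vector *)
      "position embedding" -> EmbeddingLayer[768, 1024],   (* position index -> 768-vector *)
      "sum"                -> ThreadingLayer[Plus]         (* element-wise sum of the two embeddings *)
      |>,
     {
      NetPort["TokenIndices"] -> "token embedding",
      NetPort["PositionIndices"] -> "position embedding",
      {"token embedding", "position embedding"} -> "sum"
      }
     ]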
The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:
The key part of these blocks is the attention module, which comprises 12 parallel self-attention transformations, also called “attention heads”:
Each head uses an AttentionLayer at its core:
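A causally masked dot-product AttentionLayer can also be constructed on its own; the following is a standalone illustration of the layer type (with the causal masking discussed next), not a part extracted from the model:

    AttentionLayer["Dot", "Mask" -> "Causal"]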
Attention is performed with causal masking, which means that the embedding of a given subword token depends only on the previous subword tokens and not on the following ones.
This is a prerequisite for generating text with the language model. The following figures compare causal attention to other forms of connectivity between input tokens:

Language modeling: Basic usage
Retrieve the language model by specifying the "Task" parameter:
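Assuming the same resource name as above, with the "Task" parameter set to "LanguageModeling"; lm is an illustrative symbol reused below:

    lm = NetModel[{"GPT-2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling"}]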
Predict the next word in a given sequence:
Obtain the top 15 probabilities:
Plot the top 15 probabilities:
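A combined sketch of the three steps above, assuming the language model ends in a "Class" decoder over the subword vocabulary, so decoder properties such as "TopProbabilities" can be requested at evaluation time; the prompt is arbitrary:

    lm["Where have you been? I was worried"]    (* most probable next token *)
    top = lm["Where have you been? I was worried", {"TopProbabilities", 15}];
    BarChart[Association[top], ChartLabels -> Automatic, BarOrigin -> Left]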
Text generation
Define a function to predict the next token:
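One possible implementation, assuming the language model's "Class" decoder supports the "RandomSample" specification with a "Temperature" option; generateSample is an illustrative name:

    generateSample[languagemodel_][text_String, numTokens_Integer : 10, temperature_ : 1] :=
     Nest[
      Function[current,
       StringJoin[current,
        languagemodel[current, {"RandomSample", "Temperature" -> temperature}]]],
      text, numTokens]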
Generate the next 20 tokens by using it on a piece of text:
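For example, with an arbitrary prompt:

    generateSample[lm]["It was a dark and stormy night, and", 20]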
The third optional argument is a “temperature” parameter that scales the input to the final softmax. A high temperature flattens the distribution from which tokens are sampled, increasing the probability of extracting less likely tokens:
Decreasing the temperature sharpens the peaks of the sampling distribution, further decreasing the probability of extracting less likely tokens:
Very high temperature settings are equivalent to random sampling:
Very low temperature settings are equivalent to always picking the token with maximum probability. It is typical for sampling to “get stuck in a loop”:
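A sketch of the temperature argument in action (the exact outputs are random):

    generateSample[lm]["It was a dark and stormy night, and", 20, 1.5]    (* flatter sampling distribution *)
    generateSample[lm]["It was a dark and stormy night, and", 20, 0.5]    (* sharper sampling distribution *)
    generateSample[lm]["It was a dark and stormy night, and", 20, 10.]    (* close to uniform random sampling *)
    generateSample[lm]["It was a dark and stormy night, and", 20, 0.01]   (* close to greedy decoding; tends to loop *)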
Sentence analogies
Define a sentence embedding that consists of the last subword embedding of GPT-2 (this choice is justified by the fact that GPT-2 is a forward causal model):
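One way to build such an embedding is to chain the feature extractor with a SequenceLastLayer, which keeps only the vector of the last subword token; a sketch (the original example may construct it differently):

    sentenceEmbedding = NetChain[{gpt2, SequenceLastLayer[]}]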
Define some sentences in two broad categories for comparison:
Precompute the embeddings for a list of sentences:
Visualize the similarity between the sentences using the net as a feature extractor:
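A sketch with illustrative sentences (not the ones from the original example), split between everyday scenes and financial news. FeatureSpacePlot applies the sentence embedding as a feature extractor; precomputing the vectors with sentenceEmbedding[sentences] is an optional optimization:

    sentences = {
       "The cat sat quietly on the warm windowsill.",
       "A dog chased the ball across the garden.",
       "My parrot learned to mimic the doorbell.",
       "The stock market rallied after the earnings report.",
       "Higher inflation figures pushed bond yields up.",
       "The central bank left interest rates unchanged."};
    FeatureSpacePlot[sentences, FeatureExtractor -> sentenceEmbedding, LabelingFunction -> Callout]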
Train a classifier model with the subword embeddings
Get a text-processing dataset:
View a random sample of the dataset:
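A sketch assuming the "Sample Data: Movie Review Sentence Polarity" data resource, whose "TrainingData" and "TestData" elements are taken to be lists of text -> polarity rules; any labeled text dataset can be substituted:

    trainData = ResourceData["Sample Data: Movie Review Sentence Polarity", "TrainingData"];
    validData = ResourceData["Sample Data: Movie Review Sentence Polarity", "TestData"];
    RandomSample[trainData, 4]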
Define a sentence embedding that consists of the last subword embedding of GPT-2 (this choice is justified by the fact that GPT-2 is a forward causal model):
Precompute the GPT-2 vectors for the training and the validation datasets (a GPU is recommended, if available), using the last embedded vector as a representation of the entire text:
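A sketch, assuming the data is a list of text -> label rules; drop the TargetDevice option if no GPU is available:

    embedRules[data_] :=
     Thread[sentenceEmbedding[Keys[data], TargetDevice -> "GPU"] -> Values[data]]
    trainEmbedded = embedRules[trainData];
    validEmbedded = embedRules[validData];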
Define a simple network for classification:
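A sketch of a small classifier head on top of the 768-dimensional sentence vectors; the class labels are an assumption matching the polarity dataset above:

    classifier = NetChain[
      {DropoutLayer[], LinearLayer[2], SoftmaxLayer[]},
      "Input" -> 768,
      "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]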
Train the network on the precomputed GPT-2 vectors:
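For example:

    trainedClassifier = NetTrain[classifier, trainEmbedded, ValidationSet -> validEmbedded]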
Check the classification error rate on the validation data:
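Using NetMeasurements on the held-out examples:

    NetMeasurements[trainedClassifier, validEmbedded, "ErrorRate"]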
Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets (a GPU is recommended, if available):
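A sketch, assuming the "GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data" model from the repository; each text becomes a variable-length sequence of 300-dimensional word vectors:

    glove = NetModel["GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data"];
    gloveRules[data_] := Thread[glove[Keys[data], TargetDevice -> "GPU"] -> Values[data]]
    trainEmbeddedGloVe = gloveRules[trainData];
    validEmbeddedGloVe = gloveRules[validData];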
Define a simple network for classification, using a max-pooling strategy:
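A sketch in which an AggregationLayer takes the maximum over the sequence dimension before the classification layers:

    gloveClassifier = NetChain[
      {AggregationLayer[Max, 1], DropoutLayer[], LinearLayer[2], SoftmaxLayer[]},
      "Input" -> {"Varying", 300},
      "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]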
Train the classifier on the precomputed GloVe vectors:
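For example:

    trainedGloveClassifier = NetTrain[gloveClassifier, trainEmbeddedGloVe, ValidationSet -> validEmbeddedGloVe]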
Compare the results obtained with GPT-2 and with GloVe:
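For instance, side by side on the validation set:

    <|
     "GPT-2" -> NetMeasurements[trainedClassifier, validEmbedded, "Accuracy"],
     "GloVe" -> NetMeasurements[trainedGloveClassifier, validEmbeddedGloVe, "Accuracy"]
     |>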
Net information
Inspect the number of parameters of all arrays in the net:
Obtain the total number of parameters:
Obtain the layer type counts:
Display the summary graphic:
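A combined sketch of the four queries above, using standard NetInformation properties:

    NetInformation[gpt2, "ArraysElementCounts"]
    NetInformation[gpt2, "ArraysTotalElementCount"]
    NetInformation[gpt2, "LayerTypeCounts"]
    NetInformation[gpt2, "SummaryGraphic"]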
Export to MXNet
Export the net into a format that can be opened in MXNet:
Export also creates a net.params file containing parameters:
Get the size of the parameter file:
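A sketch of these steps; the file names are illustrative, and the .params file is written alongside the exported .json file with the same base name:

    jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "gpt2.json"}], gpt2, "MXNet"]
    paramsPath = FileNameJoin[{$TemporaryDirectory, "gpt2.params"}]   (* created by the export *)
    FileByteCount[paramsPath]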