# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Generate text in English and represent text as a sequence of vectors

Released in 2019, this model improves on and scales up its predecessor. It has a richer vocabulary, uses BPE tokenization on UTF-8 byte sequences and adds normalization at the end of all of the transformer blocks.

Number of models: 4

- Preliminary version of the WebText dataset, consisting of 40 GB of text scraped from webpages that have been curated by humans.

The small GPT-2 model (117M parameters) achieves the following results on various datasets. Accuracy (%): 45.99 on LAMBADA, 87.65 on Children’s Book Test Common Nouns, 83.4 on Children’s Book Test Named Entities. Bits per character: 1.16 on enwik8 and 1.17 on text8. Perplexity: 35.13 on LAMBADA, 29.41 on WikiText2, 65.85 on Penn Tree Bank, 37.50 on WikiText103, 75.20 on Google One Billion Words (1BW).

The medium GPT-2 model (345M parameters) achieves the following results on various datasets. Accuracy (%): 55.48 on LAMBADA, 92.35 on Children’s Book Test Common Nouns, 87.1 on Children’s Book Test Named Entities. Bits per character: 1.01 on enwik8 and 1.06 on text8. Perplexity: 15.60 on LAMBADA, 22.76 on WikiText2, 47.33 on Penn Tree Bank, 26.37 on WikiText103, 55.72 on Google One Billion Words (1BW).
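Perplexity and bits per character are both transformations of the probability a model assigns to the correct next token or character in held-out text. The following Python sketch is not part of the original page and uses made-up probabilities; it only illustrates how the two quantities are conventionally defined, not how the reported benchmark numbers were computed.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative natural-log probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def bits_per_character(char_probs):
    """Average negative log2 probability per character."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

# Hypothetical probabilities assigned to the correct next token/character.
uniform_guess = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_guess))          # approximately 4
print(bits_per_character(uniform_guess))  # 2.0
```

A model that always spreads its probability evenly over four options has a perplexity of about 4 and needs 2 bits per symbol, so lower values of either metric indicate a better model.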

Get the pre-trained net:

In[1]:= |

Out[1]= |

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:= |

Out[2]= |

Pick a non-default net by specifying the parameters:

In[3]:= |

Out[3]= |

Pick a non-default uninitialized net:

In[4]:= |

Out[4]= |

Given a piece of text, the GPT-2 net produces a sequence of feature vectors of size 768, which correspond to the sequence of input words or subwords:

In[5]:= |

Out[5]= |

Obtain dimensions of the embeddings:

In[6]:= |

Out[6]= |

Visualize the embeddings:

In[7]:= |

Out[7]= |

The input string is first normalized and then tokenized, or split into words or subwords. This two-step process is accomplished using the NetEncoder "Function":

In[8]:= |

Out[8]= |

The tokenization step is performed using the NetEncoder "BPESubwordTokens" and can be extracted using the following steps:

In[9]:= |

Out[10]= |

The encoder produces an integer index for each subword token that corresponds to the position in the vocabulary:

In[11]:= |

Out[11]= |
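To make the idea of subword tokenization concrete, here is a simplified Python sketch of byte-level BPE followed by a vocabulary lookup. It is an illustration only: the merge rules and the vocabulary are hypothetical toy values, and the procedure is a simplification rather than the encoder used by the net.

```python
def bpe_tokenize(text, merges):
    """Simplified byte-level BPE: apply each merge rule, in order, to the UTF-8 bytes."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules and vocabulary, for illustration only.
toy_merges = [(b"l", b"o"), (b"lo", b"w")]
toy_vocabulary = [b" ", b"e", b"low", b"r"]

tokens = bpe_tokenize("low lower", toy_merges)
# 1-based positions, following Wolfram Language indexing conventions.
indices = [toy_vocabulary.index(t) + 1 for t in tokens]
print(tokens)   # [b'low', b' ', b'low', b'e', b'r']
print(indices)  # [3, 1, 3, 2, 4]
```

Because the starting symbols are raw bytes, any input string can be tokenized, even if it contains characters never seen when the merges were learned.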

Each subword token is also assigned a positional index:

In[12]:= |

Out[12]= |

A lookup is done to map these indices to numeric vectors of size 768:

In[13]:= |

Out[14]= |

For each subword token, these two embeddings are combined by summing elements with ThreadingLayer:

In[15]:= |

Out[15]= |
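The lookup-and-sum step can be sketched in plain Python as follows. The tables here are small random stand-ins (the vocabulary size and maximum length are hypothetical); only the vector size of 768 comes from the text.

```python
import random

random.seed(0)
EMBEDDING_SIZE = 768  # vector size stated in the text

# Hypothetical toy tables; the real net uses learned weights.
toy_vocab_size, toy_max_positions = 10, 8
token_table = [[random.gauss(0, 1) for _ in range(EMBEDDING_SIZE)]
               for _ in range(toy_vocab_size)]
position_table = [[random.gauss(0, 1) for _ in range(EMBEDDING_SIZE)]
                  for _ in range(toy_max_positions)]

def embed(token_indices):
    """Look up token and position vectors (1-based indices) and add them elementwise."""
    embedded = []
    for position, token in enumerate(token_indices, start=1):
        tok_vec = token_table[token - 1]
        pos_vec = position_table[position - 1]
        embedded.append([t + p for t, p in zip(tok_vec, pos_vec)])
    return embedded

vectors = embed([3, 1, 3, 2, 4])
print(len(vectors), len(vectors[0]))  # 5 768
```

Note that the same token index (3 in this example) yields different vectors at positions 1 and 3, because the positional embedding differs.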

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[16]:= |

Out[16]= |

The key part of these blocks is the attention module, which consists of 12 parallel self-attention transformations, also called “attention heads”:

In[17]:= |

Out[17]= |

Each head uses an AttentionLayer at its core:

In[18]:= |

Out[18]= |

Attention is done with causal masking, which means that the embedding of a given subword token depends on the previous subword tokens and not on the next ones. This is a prerequisite for generating text with the language model. The following figures compare causal attention to other forms of connectivity between input tokens:
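Causal masking can also be illustrated numerically. The Python sketch below is an assumption-laden toy, not the net's AttentionLayer: it takes hypothetical raw attention scores and normalizes each row so that position i only distributes weight over positions up to and including i.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]              # drop scores for future tokens
        peak = max(visible)                 # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw attention scores for a 3-token input (all equal).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(scores):
    print(row)
```

The resulting weight matrix is lower triangular: the first token attends only to itself, the second to the first two tokens, and so on.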

Retrieve the language model by specifying the "Task" parameter:

In[19]:= |

Out[19]= |

Predict the next word in a given sequence:

In[20]:= |

Out[20]= |

Obtain the top 15 probabilities:

In[21]:= |

Out[21]= |

Plot the top 15 probabilities:

In[22]:= |

Out[22]= |

Define a function to predict the next token:

In[23]:= |

Generate the next 20 tokens by using it on a piece of text:

In[24]:= |

Out[24]= |
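Generation of this kind is autoregressive: each new token is predicted from the text produced so far and then appended to it. A minimal Python sketch of that loop, with a hypothetical lookup table standing in for the language model:

```python
def generate(prompt_tokens, next_token, n):
    """Append n tokens, each predicted from everything generated so far."""
    tokens = list(prompt_tokens)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens

# Hypothetical stand-in for the language model: a fixed lookup on the last token.
toy_table = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def toy_next_token(tokens):
    return toy_table[tokens[-1]]

print(" ".join(generate(["the"], toy_next_token, 5)))  # the cat sat on the cat
```

The real model conditions on the whole preceding sequence rather than only the last token, but the surrounding loop has the same shape.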

The third optional argument is a “temperature” parameter that scales the input to the final softmax. A high temperature flattens the distribution from which tokens are sampled, increasing the probability of extracting less likely tokens:

In[25]:= |

Out[25]= |

Decreasing the temperature sharpens the peaks of the sampling distribution, further decreasing the probability of extracting less likely tokens:

In[26]:= |

Out[26]= |

Very high temperature settings are equivalent to random sampling:

In[27]:= |

Out[27]= |

Very low temperature settings are equivalent to always picking the token with maximum probability. It is typical for sampling to “get stuck in a loop”:

In[28]:= |

Out[28]= |
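The effect of temperature on the sampling distribution can be checked with a small Python sketch. It assumes the common convention that the scores fed to the softmax are divided by the temperature; the scores themselves are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature before applying softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.01, 1.0, 100.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

At a temperature of 0.01 nearly all probability mass falls on the highest-scoring token, while at 100 the three probabilities are almost equal, matching the two limiting cases described above.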

Define a sentence embedding that consists of the last subword embedding of GPT-2 (this choice is justified by the fact that GPT-2 is a forward causal model):

In[29]:= |

Out[29]= |

Define some sentences in two broad categories for comparison:

In[30]:= |

Precompute the embeddings for a list of sentences:

In[31]:= |

Visualize the similarity between the sentences using the net as a feature extractor:

In[32]:= |

Out[32]= |
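As a rough illustration of how such embeddings can be compared, the Python sketch below takes the last token vector of each sentence and measures cosine similarity. Cosine similarity is one common choice and is an assumption here, not necessarily what the visualization above uses; the 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional ones.

```python
import math

def sentence_embedding(token_vectors):
    """Use the vector of the last subword token as the sentence representation."""
    return token_vectors[-1]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical token vectors for three short "sentences".
sentence_a = [[0.1, 0.9, 0.0], [1.0, 0.0, 0.0]]
sentence_b = [[0.5, 0.5, 0.5], [2.0, 0.0, 0.0]]
sentence_c = [[0.3, 0.3, 0.3], [0.0, 1.0, 0.0]]

a, b, c = (sentence_embedding(s) for s in (sentence_a, sentence_b, sentence_c))
print(cosine_similarity(a, b))  # 1.0: same direction
print(cosine_similarity(a, c))  # 0.0: orthogonal
```

Because of causal masking, the last token is the only position whose vector can depend on every token in the sentence, which is why it is the natural candidate for a whole-sentence representation.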

Get a text-processing dataset:

In[33]:= |

View a random sample of the dataset:

In[34]:= |

Out[34]= |

Define a sentence embedding that consists of the last subword embedding of GPT-2 (this choice is justified by the fact that GPT-2 is a forward causal model):

In[35]:= |

Out[35]= |

Precompute the GPT-2 vectors for the training and the validation datasets (if available, GPU is recommended), using the last embedded vector as a representation of the entire text:

In[36]:= |

Define a simple network for classification:

In[37]:= |

Out[37]= |

Train the network on the precomputed GPT-2 vectors:

In[38]:= |

Out[38]= |

Check the classification error rate on the validation data:

In[39]:= |

Out[39]= |

Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets (if available, GPU is recommended):

In[40]:= |

In[41]:= |

Define a simple network for classification, using a max-pooling strategy:

In[42]:= |

Out[42]= |
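Max pooling reduces a variable-length sequence of word vectors to a single fixed-size vector by taking the elementwise maximum over the sequence. The Python sketch below contrasts it with the last-vector strategy used for GPT-2 earlier; the 3-dimensional vectors are hypothetical.

```python
def max_pool(token_vectors):
    """Elementwise maximum over the sequence: one fixed-size vector per text."""
    return [max(column) for column in zip(*token_vectors)]

def last_vector(token_vectors):
    """The alternative used for GPT-2: keep only the final token's vector."""
    return token_vectors[-1]

# Hypothetical 3-dimensional word vectors for a 3-word text.
toy_vectors = [[0.2, -1.0, 0.5], [0.9, 0.1, -0.3], [-0.4, 0.7, 0.0]]
print(max_pool(toy_vectors))     # [0.9, 0.7, 0.5]
print(last_vector(toy_vectors))  # [-0.4, 0.7, 0.0]
```

Pooling over all positions suits context-independent embeddings, where no single word vector carries information about the rest of the text.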

Train the classifier on the precomputed GloVe vectors:

In[43]:= |

Out[43]= |

Compare the results obtained with GPT-2 and with GloVe:

In[44]:= |

Out[44]= |

Inspect the number of parameters of all arrays in the net:

In[45]:= |

Out[45]= |

Obtain the total number of parameters:

In[46]:= |

Out[46]= |

Obtain the layer type counts:

In[47]:= |

Out[47]= |

Display the summary graphic:

In[48]:= |

Out[48]= |

Wolfram Language 12.0 (April 2019) or above

- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, "Language Models are Unsupervised Multitask Learners" (2019)
- (available from https://github.com/openai/gpt-2)
- Rights: MIT License