# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Generate text in English and represent text as a sequence of vectors

Released in 2018, this language model uses a multilayer transformer decoder. It applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feedforward layers, to produce an output distribution over target tokens.

Number of layers: 857 | Parameter count: 116,534,784 | Trained size: 474 MB

- BookCorpus, a dataset consisting of 11,038 unpublished books from 16 different genres.

- On natural language inference tasks, the model obtains accuracies of 82.1%, 81.4%, 89.9%, 88.3%, 88.1% and 56% on the MNLI-m, MNLI-mm, SNLI, SciTail, QNLI and RTE datasets, respectively. For question answering and common-sense reasoning, it obtains accuracies of 86.5%, 62.9%, 57.4% and 59.0% on the Story Cloze, RACE-m, RACE-h and RACE datasets, respectively.

Get the pre-trained net:

In[1]:= |

Out[1]= |

For each token, the net produces a feature vector of length 768:

In[2]:= |

Out[2]= |

Obtain dimensions of the embeddings:

In[3]:= |

Out[3]= |

Visualize the embeddings:

In[4]:= |

Out[4]= |

Inspect the available parameters:

In[5]:= |

Out[5]= |

Pick a non-default model by specifying the parameters:

In[6]:= |

Out[6]= |

Pick a non-default untrained net:

In[7]:= |

Out[7]= |

The input string is first tokenized into words or subwords using a BPE encoder and additional text normalizations:

In[8]:= |

Out[9]= |
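The subword splitting described above can be illustrated with a toy byte-pair encoding (BPE) routine. This is a minimal Python sketch, not the repository's Wolfram Language encoder: the function `bpe_encode` and the merge table `toy_merges` are hypothetical, the end-of-word marker `</w>` is a common BPE convention assumed here, and the text normalizations are omitted.

```python
def bpe_encode(word, merges):
    """Split a non-empty word into subword tokens by applying ranked BPE merges."""
    # Start from single characters and mark the end of the word.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table; a trained encoder learns tens of thousands of merges.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r</w>")]
tokens = bpe_encode("lower", toy_merges)
print(tokens)
```

With this toy table, "lower" is split into a frequent stem and a suffix, which is how a BPE vocabulary can cover rare words with a fixed set of subword tokens.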

The encoder produces integer indices for each input token:

In[10]:= |

Out[10]= |

Together with the token indices, positional indices are also generated:

In[11]:= |

Out[11]= |

Indices are then embedded into numeric vectors of size 768:

In[12]:= |

Out[12]= |

Obtain the dimensions:

In[13]:= |

Out[13]= |
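The lookup from indices to vectors can be sketched in a few lines of Python. The sketch assumes, as in the original GPT paper, that the token embedding and the positional embedding are summed; the table sizes are toy values, the weights are random rather than trained, and positions are 0-based here whereas Wolfram Language indices start at 1.

```python
import random

random.seed(0)
vocab_size, max_positions, dim = 10, 8, 4  # toy sizes; the net uses dim = 768

# Randomly initialised lookup tables stand in for the trained weights.
token_table = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
position_table = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(max_positions)]

def embed(token_ids):
    """Return one vector per token: token embedding plus positional embedding."""
    positions = range(len(token_ids))
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for tok, pos in zip(token_ids, positions)
    ]

vectors = embed([3, 7, 1])
dims = (len(vectors), len(vectors[0]))  # (number of tokens, embedding size)
print(dims)
```

The output has one row per input token, so a sequence of n tokens yields an n × 768 array in the actual net.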

Visualize the embedding architecture:

In[14]:= |

Out[14]= |

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[15]:= |

Out[15]= |

The key part of these blocks is the attention module consisting of 12 parallel self-attention transformations, also called “attention heads”:

In[16]:= |

Out[16]= |

Each head uses an AttentionLayer at its core:

In[17]:= |

Out[17]= |
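What a single attention head computes can be sketched in plain Python as scaled dot-product attention with a causal mask, so that each position only attends to itself and earlier positions. This is an illustrative sketch rather than the AttentionLayer implementation: the learned query, key and value projections are left out (the same toy vectors are used for all three), and the function names are hypothetical. Assuming the usual equal split, 12 heads over 768-dimensional vectors would each work on 64-dimensional projections.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Position i may only attend to positions 0..i (no look-ahead).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        outputs.append([sum(w * values[j][k] for j, w in enumerate(weights))
                        for k in range(len(values[0]))])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out = causal_attention(x, x, x)
print(out)
```

The first output equals the first value vector, because the first position has nothing earlier to attend to.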

Retrieve the language model by specifying the “Task” parameter:

In[18]:= |

Out[18]= |

Predict the next word of a given sequence:

In[19]:= |

Out[19]= |

Obtain the top 15 probabilities:

In[20]:= |

Out[20]= |
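Extracting the most likely next tokens amounts to a softmax over the model's output scores followed by a sort. The Python sketch below shows the idea on a made-up four-word vocabulary; `top_k_probabilities` and `toy_logits` are hypothetical names and values, not output of the actual net.

```python
import math

def top_k_probabilities(logits, k):
    """Softmax over a {token: logit} dict, returning the k most likely tokens."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

toy_logits = {"cat": 2.0, "dog": 1.5, "car": 0.1, "the": -1.0}  # made-up scores
top2 = top_k_probabilities(toy_logits, 2)
print(top2)
```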

Plot the top 15 probabilities:

In[21]:= |

Out[21]= |

Modify the language model so that it accepts encoded token indices as input and produces token indices as output:

In[22]:= |

Create a new decoder that performs a lookup to find the corresponding string, followed by some text cleaning:

In[23]:= |

Define a function to predict the next token using the modified language model:

In[24]:= |

Get an input:

In[25]:= |

Out[25]= |

Generate the next 20 tokens by applying the function to the input:

In[26]:= |

Out[26]= |
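Generation of this kind is autoregressive: each sampled token is appended to the input before the next one is predicted. The Python sketch below shows the loop with a hypothetical stand-in model over a five-token vocabulary; `generate` and `toy_model` are illustrative names, and the real language model would supply the probabilities instead.

```python
import random

def generate(next_token_probs, prompt_ids, n_tokens, rng):
    """Append n_tokens sampled ids to prompt_ids, one at a time."""
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        probs = next_token_probs(ids)  # distribution over the vocabulary
        candidates = list(range(len(probs)))
        ids.append(rng.choices(candidates, weights=probs, k=1)[0])
    return ids

def toy_model(ids):
    """Hypothetical stand-in for the language model: favours (last id + 1) mod 5."""
    probs = [0.05] * 5
    probs[(ids[-1] + 1) % 5] = 0.8
    return probs

sample = generate(toy_model, [0], 20, random.Random(42))
print(sample)
```

Because every step conditions on everything generated so far, the cost of producing a long continuation grows with its length unless intermediate results are cached.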

The third optional argument is a “temperature” parameter that scales the input to the final softmax. A high temperature flattens the distribution from which tokens are sampled, increasing the probability of extracting less likely tokens:

In[27]:= |

Out[27]= |

Decreasing the temperature sharpens the peaks of the sampling distribution, further decreasing the probability of extracting less likely tokens:

In[28]:= |

Out[28]= |

Very high temperature settings are equivalent to sampling tokens uniformly at random:

In[29]:= |

Out[29]= |

Very low temperature settings are equivalent to always picking the token with maximum probability. In this regime, it is typical for sampling to “get stuck in a loop”:

In[30]:= |

Out[30]= |
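The effect of temperature in the preceding examples can be reproduced numerically. The Python sketch below divides a made-up set of scores by the temperature before the softmax, which is one common way to implement the scaling; `softmax_with_temperature` is a hypothetical helper, not part of the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature before applying softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
default = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 100.0)   # close to uniform
sharp = softmax_with_temperature(logits, 0.01)   # close to one-hot on the largest logit
print(default, flat, sharp)
```

At a temperature of 100 the three probabilities are nearly equal, while at 0.01 almost all of the mass sits on the highest-scoring token, which is why low temperatures make repetitive loops more likely.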

Define a list of sentences for comparison:

In[31]:= |

Precompute the embeddings for the list of sentences:

In[32]:= |

Visualize the similarity between the sentences using the net as a feature extractor:

In[33]:= |

Out[33]= |
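One way to compare sentences with a feature extractor is to pool the per-token vectors into a single vector per sentence and measure the cosine similarity between them. The source does not state which pooling or distance is used, so both choices in this Python sketch are assumptions, and the vectors are made-up three-dimensional values.

```python
import math

def mean_pool(vectors):
    """Average the per-token vectors into a single sentence vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "token vectors" for two sentences (made-up numbers).
sentence_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sentence_b = [[1.0, 1.0, 0.0]]
similarity = cosine_similarity(mean_pool(sentence_a), mean_pool(sentence_b))
print(similarity)
```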

Get a text-processing dataset:

In[34]:= |

View a random sample of the dataset:

In[35]:= |

Out[35]= |

Precompute the GPT vectors on the training and validation datasets (“GPU” is recommended if available), using the last embedded vector as a representation of the entire text:

In[36]:= |

Define a simple network for classification:

In[37]:= |

Out[37]= |

Train the network on the precomputed GPT vectors:

In[38]:= |

Out[38]= |

Check the classification error rate on the validation data:

In[39]:= |

Out[39]= |

Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors on the training and validation datasets (“GPU” is recommended if available):

In[40]:= |

In[41]:= |

Define a simple network for classification, using a max-pooling strategy:

In[42]:= |

Out[42]= |
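The two classifiers reduce a variable-length sequence of vectors to a single fixed-size vector in different ways. In a causal model such as GPT, the last position has attended to every earlier token, so its vector can stand for the whole text; context-independent embeddings such as GloVe carry no such summary, so an element-wise maximum over all tokens is used instead. A minimal Python sketch of both reductions, with hypothetical helper names and made-up vectors:

```python
def last_vector(vectors):
    """Use the final token's vector as the representation of the whole text."""
    return list(vectors[-1])

def max_pool(vectors):
    """Take the element-wise maximum over all token vectors."""
    return [max(col) for col in zip(*vectors)]

token_vectors = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]  # made-up 2-d vectors
print(last_vector(token_vectors), max_pool(token_vectors))
```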

Train the classifier on the precomputed GloVe vectors:

In[43]:= |

Out[43]= |

Check the classification error rate on the validation data:

In[44]:= |

Out[44]= |

Inspect the number of parameters of all arrays in the net:

In[45]:= |

Out[45]= |

Obtain the total number of parameters:

In[46]:= |

Out[46]= |
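The total of 116,534,784 parameters can be checked by hand from the architecture. The Python arithmetic below relies on values that are not stated on this page and are taken as assumptions from the original GPT release: a vocabulary of 40,478 tokens, a maximum of 512 positions, a feedforward width of 3,072, and no separate output layer beyond the embeddings.

```python
vocab_size = 40478      # assumption: BPE vocabulary size of the original GPT release
max_positions = 512     # assumption: maximum context length
dim = 768               # feature size stated on this page
n_blocks = 12           # number of self-attention blocks stated on this page
ffn_dim = 4 * dim       # assumption: feedforward width of 3072

embeddings = vocab_size * dim + max_positions * dim
attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)        # QKV + output projection
feedforward = (dim * ffn_dim + ffn_dim) + (ffn_dim * dim + dim)  # two linear layers
layer_norms = 2 * (2 * dim)                                      # scale and bias, twice per block
per_block = attention + feedforward + layer_norms

total = embeddings + n_blocks * per_block
print(f"{total:,}")
```

Under these assumptions each block holds 7,087,872 parameters, and the embeddings account for roughly 27% of the total.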

Obtain the layer type counts:

In[47]:= |

Out[47]= |

Display the summary graphic:

In[48]:= |

Out[48]= |

Wolfram Language 12.0 (April 2019) or above

- A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, "Improving Language Understanding by Generative Pre-Training," preprint (2018)
- Available from: https://github.com/openai/finetune-transformer-lm
- Rights: MIT License