# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Represent text as a sequence of vectors

Released in 2019, this model refines the BERT pre-training recipe in several ways: it trains for longer, with larger batches and on more data; it drops the next-sentence prediction objective; it trains on longer sequences; and it changes the masking pattern dynamically. Together, these changes yield a substantial improvement in performance over the existing BERT models.

Number of models: 3

- Five English-language corpora of varying sizes and domains, totaling over 160GB of uncompressed text: BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets.

Accuracy of the RoBERTa-Large model for various natural language inference tasks:

Get the pre-trained net:

In[1]:= |

Out[1]= |
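A minimal sketch of this step, assuming the model is registered under the name below. The name is an assumption inferred from the training corpora listed earlier; substitute the exact name shown in the repository if it differs.

```wolfram
(* Assumed resource name, inferred from the listed training corpora *)
name = "RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets";

(* Downloads the default net on first use, then loads it from the local cache *)
net = NetModel[name]
```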

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:= |

Out[2]= |
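The parameter listing can be requested as a property of the model. A sketch, again assuming the resource name below (inferred from the training corpora, not confirmed by this text):

```wolfram
(* Assumed resource name, inferred from the listed training corpora *)
name = "RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets";

(* Lists each parameter together with its allowed values and default *)
NetModel[name, "ParametersInformation"]
```

The parameter names and values reported here are the ones to use when requesting a non-default member of the family.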

Pick a non-default net by specifying the parameters:

In[3]:= |

Out[3]= |

Pick a non-default uninitialized net:

In[4]:= |

Out[4]= |

Given a piece of text, the RoBERTa net produces a sequence of feature vectors of size 768, which correspond to the sequence of input words or subwords:

In[5]:= |

Obtain dimensions of the embeddings:

In[6]:= |

Out[6]= |

Visualize the embeddings:

In[7]:= |

Out[7]= |
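The three steps above (embedding a piece of text, checking the dimensions and plotting the result) might look as follows. The resource name and the input sentence are assumptions chosen for illustration.

```wolfram
(* Assumed resource name, inferred from the listed training corpora *)
name = "RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets";
net = NetModel[name];

(* Illustrative input text *)
input = "Hello world! I am here";

(* One feature vector per subword token *)
embeddings = net[input];

(* {number of tokens, feature size} *)
Dimensions[embeddings]

(* Rows are tokens, columns are feature components *)
MatrixPlot[embeddings]
```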

Each input text segment is first tokenized into words or subwords using a byte-level byte-pair encoding (BPE) tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[8]:= |

Out[9]= |

For each input subword token, the encoder yields a pair of indices: the index of the token in the vocabulary and the index of the sentence within the list of input sentences:

In[10]:= |

Out[10]= |
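One way to inspect this encoding is to extract the encoder attached to the net's input port and apply it on its own. This sketch assumes the resource name below and that the input port is named "Input"; the sentence is illustrative.

```wolfram
(* Assumed resource name, inferred from the listed training corpora *)
name = "RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets";
net = NetModel[name];

(* The NetEncoder attached to the input port (port name assumed to be "Input") *)
netencoder = NetExtract[net, "Input"];

(* Apply the encoder alone to see the indices produced for each subword token *)
netencoder["Hello world! I am here"]
```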

The list of tokens always starts with special token index 1, which corresponds to the classification token. Special token index 3 serves as a separator between text segments, marking the end of each sentence and the beginning of every sentence except the first. Each subword token is also assigned a positional index:

In[11]:= |

Out[11]= |

A lookup is done to map these indices to numeric vectors of size 768:

In[12]:= |

Out[13]= |

For each subword token, these three embeddings are combined by summing elements with ThreadingLayer:

In[14]:= |

Out[14]= |
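The elementwise sum can be illustrated in isolation with toy arrays. This is a sketch only: the arrays are random stand-ins for the token, segment and position embeddings, with a feature size of 4 instead of 768, and the layer here is built independently of the pre-trained net.

```wolfram
(* Toy stand-ins for the three embeddings of a 2-token input, feature size 4 *)
token = RandomReal[1, {2, 4}];
segment = RandomReal[1, {2, 4}];
position = RandomReal[1, {2, 4}];

(* A layer that adds its three inputs element by element *)
sum = ThreadingLayer[#1 + #2 + #3 &];
combined = sum[{token, segment, position}];

(* Same shape as each input *)
Dimensions[combined]

(* Difference from the plain array sum; small, up to single-precision rounding *)
Max[Abs[combined - (token + segment + position)]]
```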

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[15]:= |

Out[15]= |

The key part of these blocks is the attention module, comprising 12 parallel self-attention transformations, also called “attention heads”:

In[16]:= |

Out[16]= |

BERT-like models use self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:

Define a sentence embedding that takes the last feature vector from RoBERTa subword embeddings (as an arbitrary choice):

In[17]:= |

Out[17]= |
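A sketch of such a sentence embedding, assuming the resource name below. The layer name "pooling" is an arbitrary label chosen for this example.

```wolfram
(* Assumed resource name, inferred from the listed training corpora *)
name = "RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets";

(* Append a layer that keeps only the last vector of the output sequence *)
sentenceembedding = NetAppend[NetModel[name], "pooling" -> SequenceLastLayer[]]
```

Applied to a string, the resulting net returns a single vector rather than one vector per subword token.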

Define a list of sentences in two broad categories (food and music):

In[18]:= |

Precompute the embeddings for a list of sentences:

In[19]:= |

Visualize the similarity between the sentences using the net as a feature extractor:

In[20]:= |

Out[20]= |
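The sentence list, the precomputed embeddings and the plot could be put together roughly as follows. The sentences are made up for this sketch and the resource name is an assumption. Each sentence is embedded separately with Map so that every string is treated as its own input.

```wolfram
(* Assumed resource name, inferred from the listed training corpora *)
name = "RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets";
sentenceembedding = NetAppend[NetModel[name], "pooling" -> SequenceLastLayer[]];

(* Illustrative sentences in two broad categories: food and music *)
sentences = {
   "The pasta was cooked perfectly",
   "I had soup and bread for lunch",
   "She baked a chocolate cake",
   "The orchestra played a symphony",
   "He practices the guitar every day",
   "The band released a new album"
   };

(* One embedding vector per sentence *)
vectors = Map[sentenceembedding, sentences];

(* Plot the vectors in a reduced feature space, labeled by their sentences *)
FeatureSpacePlot[Thread[vectors -> sentences], LabelingFunction -> Callout]
```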

Get a text-processing dataset:

In[21]:= |

View a random sample of the dataset:

In[22]:= |

Out[22]= |
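As a sketch, a sentence-classification dataset can be fetched from the Wolfram Data Repository. The dataset named below is an assumption chosen for illustration, as are its element names; any dataset of text-to-class rules would serve the same purpose.

```wolfram
(* Assumed dataset and element names, used here for illustration *)
train = ResourceData["Sample Data: Movie Review Sentence Polarity", "TrainingData"];
valid = ResourceData["Sample Data: Movie Review Sentence Polarity", "TestData"];

(* Show one randomly chosen training example *)
RandomSample[train, 1]
```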

Precompute the RoBERTa vectors for the training and the validation datasets (a GPU is highly recommended, if available):

In[23]:= |

Define a network to classify the sequences of subword embeddings, using a max-pooling strategy:

In[24]:= |

Out[24]= |
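A possible shape for such a classifier head is sketched below. The two class labels are hypothetical placeholders for a binary task, and the explicit input specification (a variable-length sequence of 768-dimensional vectors) is added here so that the net can be initialized and tried on random data without the pre-trained model.

```wolfram
classifierhead = NetChain[
  {
   DropoutLayer[],
   NetMapOperator[LinearLayer[2]], (* per-token scores for the two classes *)
   AggregationLayer[Max, 1],       (* max-pooling over the sequence dimension *)
   SoftmaxLayer[]
   },
  "Input" -> {"Varying", 768},
  (* Hypothetical class labels for a two-class task *)
  "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]
  ]

(* Quick shape check on a random 5-token sequence, using random weights *)
initialized = NetInitialize[classifierhead];
prediction = initialized[RandomReal[1, {5, 768}]]
```

Max-pooling over the sequence keeps, for each class, the strongest score found at any token position, so the head works for inputs of any length.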

Train the network on the precomputed vectors from RoBERTa:

In[25]:= |

Out[25]= |

Check the classification error rate on the validation data:

In[26]:= |

Out[26]= |

Let’s compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets:

In[27]:= |

Train the classifier on the precomputed GloVe vectors:

In[28]:= |

Out[28]= |

Compare the results obtained with RoBERTa and with GloVe:

In[29]:= |

Out[29]= |

Inspect the number of parameters of all arrays in the net:

In[30]:= |

Out[30]= |

Obtain the total number of parameters:

In[31]:= |

Out[31]= |

Obtain the layer type counts:

In[32]:= |

Out[32]= |

Display the summary graphic:

In[33]:= |

Out[33]= |
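The four inspections above are all properties available through Information. A sketch, assuming the resource name below:

```wolfram
(* Assumed resource name, inferred from the listed training corpora *)
name = "RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets";
net = NetModel[name];

(* Number of parameters in each array of the net *)
Information[net, "ArraysElementCounts"]

(* Total number of parameters *)
Information[net, "ArraysTotalElementCount"]

(* How many layers of each type the net contains *)
Information[net, "LayerTypeCounts"]

(* Summary graphic of the net *)
Information[net, "SummaryGraphic"]
```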

Wolfram Language 12.1 (March 2020) or above

- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv:1907.11692 (2019)
- Available from: https://github.com/pytorch/fairseq/tree/master/examples/roberta
- Rights: MIT License