# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Represent text as a sequence of vectors

Released in 2019, this model is a collection of 24 pre-trained miniature BERT nets of different depths and widths, all trained using knowledge distillation (from the original BERT) together with student pre-training. The largest net is equivalent to the BERT-base model and is 3 times smaller and 1.25 times faster than the teacher; the smallest net is 77 times smaller and 65 times faster. All nets are case-insensitive.

Number of models: 24

- Trained on BookCorpus, a dataset consisting of 11,038 unpublished books from 16 different genres, together with 2,500 million words from text passages of English Wikipedia.

Accuracy of four of these nets for various natural language inference tasks:

Get the pre-trained net:

In[1]:= |

Out[2]= |

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[3]:= |

Out[4]= |

Pick a non-default net by specifying the parameters:

In[5]:= |

Out[6]= |

Pick a non-default uninitialized net:

In[7]:= |

Out[8]= |

Given a piece of text, the default pre-trained distilled BERT net produces a sequence of feature vectors of size 512 (in general, the size of the feature vector is 64 × "AttentionHeads"), which corresponds to the sequence of input words or subwords:

In[9]:= |

Obtain dimensions of the embeddings:

In[10]:= |

Out[10]= |

Visualize the embeddings:

In[11]:= |

Out[11]= |

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[12]:= |

Out[13]= |

For each input subword token, the encoder yields a pair of indices that corresponds to the token index in the vocabulary and the index of the sentence within the list of input sentences:

In[14]:= |

Out[14]= |

The list of tokens always starts with the special token index 102, which corresponds to the classification token [CLS]. The special token index 103 ([SEP]) is used as a separator between the different text segments. Each subword token is also assigned a positional index:

In[15]:= |

Out[15]= |
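The index-generation scheme described above can be sketched in plain Python. This is a toy illustration, not the Wolfram encoder: the vocabulary entries are made up, and only the 1-based special indices 102 ([CLS]) and 103 ([SEP]) follow the convention described in the text.

```python
# Toy sketch of BERT-style index generation for a list of text segments.
# 1-based convention from the text: 102 = [CLS], 103 = [SEP].
# The vocabulary below is fabricated for illustration only.
CLS, SEP = 102, 103
toy_vocab = {"hello": 7593, "world": 2089, "good": 2205, "morning": 2852}

def encode(text_segments):
    """Return (token, segment, position) index lists for the segments."""
    tokens = [CLS]
    segments = [1]                         # [CLS] belongs to the first segment
    for seg_id, segment in enumerate(text_segments, start=1):
        for word in segment.lower().split():
            tokens.append(toy_vocab[word])
            segments.append(seg_id)
        tokens.append(SEP)                 # a separator closes every segment
        segments.append(seg_id)
    positions = list(range(1, len(tokens) + 1))
    return tokens, segments, positions

tokens, segments, positions = encode(["hello world", "good morning"])
print(tokens)     # [102, 7593, 2089, 103, 2205, 2852, 103]
print(segments)   # [1, 1, 1, 1, 2, 2, 2]
print(positions)  # [1, 2, 3, 4, 5, 6, 7]
```

Note how each subword token ends up with three aligned indices, which is exactly the triple the embedding lookup consumes in the next step.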

A lookup is done to map these indices to numeric vectors of size 512:

In[16]:= |

Out[17]= |

For each subword token, these three embeddings are combined by summing elements with ThreadingLayer:

In[18]:= |

Out[18]= |
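The lookup-and-sum step can be sketched with NumPy. This is a conceptual illustration with randomly initialized tables, not the trained Wolfram net; the elementwise sum plays the role of ThreadingLayer, and the table sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512                                    # feature size of the default net
vocab_size, n_segments, max_pos = 30000, 2, 512  # assumed table sizes

# Three lookup tables: token, segment (sentence) and positional embeddings.
token_table = rng.normal(size=(vocab_size, dim))
segment_table = rng.normal(size=(n_segments, dim))
position_table = rng.normal(size=(max_pos, dim))

tokens = [102, 7593, 2089, 103]              # 1-based token indices
segments = [1, 1, 1, 1]
positions = [1, 2, 3, 4]

# Map each index triple to its three vectors and sum them elementwise
# (the role played by ThreadingLayer in the Wolfram net).
embeddings = (token_table[np.array(tokens) - 1]
              + segment_table[np.array(segments) - 1]
              + position_table[np.array(positions) - 1])
print(embeddings.shape)  # (4, 512): one vector of size 512 per subword token
```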

The transformer architecture then processes the vectors using six structurally identical self-attention blocks stacked in a chain:

In[19]:= |

Out[19]= |

The key part of these blocks is the attention module, which comprises parallel self-attention transformations, also called "attention heads". The number of such blocks is given by the parameter "AttentionUnits":

In[20]:= |

Out[20]= |

BERT-like models use self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:
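A single scaled dot-product self-attention head, the core operation inside each block, can be sketched in NumPy. The shapes and random weights are illustrative assumptions; the point is that every output vector is a weighted average over all input positions, so each subword's embedding depends on the full input text.

```python
import numpy as np

def self_attention_head(X, Wq, Wk, Wv):
    """One self-attention head over a (n_tokens, dim) input sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over positions
    return weights @ V                              # (n, head_dim)

rng = np.random.default_rng(0)
n_tokens, dim, head_dim = 4, 512, 64                # head size = 512 / 8 heads
X = rng.normal(size=(n_tokens, dim))
out = self_attention_head(X,
                          rng.normal(size=(dim, head_dim)),
                          rng.normal(size=(dim, head_dim)),
                          rng.normal(size=(dim, head_dim)))
print(out.shape)  # (4, 64); 8 such heads concatenated give vectors of size 512
```

The 64-dimensional head output matches the relation stated earlier: the feature size is 64 × "AttentionHeads".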

Define a sentence embedding that takes the last feature vector from pre-trained distilled BERT subword embeddings (as an arbitrary choice):

In[21]:= |

Out[22]= |

Define a list of sentences:

In[23]:= |

Precompute the embeddings for a list of sentences:

In[24]:= |

Visualize the similarity between the sentences using the net as a feature extractor:

In[25]:= |

Out[25]= |
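The similarity comparison above can be sketched with plain cosine similarity over sentence vectors. Random vectors stand in for the precomputed BERT features here; this is an assumed illustration, not the Wolfram FeatureSpacePlot pipeline.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between row vectors."""
    V = np.asarray(vectors, dtype=float)
    U = V / np.linalg.norm(V, axis=1, keepdims=True)   # normalize rows
    return U @ U.T

rng = np.random.default_rng(0)
# Stand-ins for sentence embeddings (e.g. the last BERT feature vector).
sentence_vectors = rng.normal(size=(3, 512))
S = cosine_similarity_matrix(sentence_vectors)
print(np.round(S, 2))  # symmetric 3x3 matrix with ones on the diagonal
```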

Get a text-processing dataset:

In[26]:= |

View a random sample of the dataset:

In[27]:= |

Out[27]= |

Precompute the pre-trained distilled BERT vectors for the training and the validation datasets (if a GPU is available, its use is highly recommended):

In[28]:= |

Define a network to classify the sequences of subword embeddings, using a max-pooling strategy:

In[29]:= |

Out[29]= |
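The max-pooling strategy can be sketched as follows: pool the variable-length sequence of subword vectors down to a single fixed-size vector, then apply a linear layer and softmax. The NumPy version below uses random, untrained weights purely as an assumed illustration of the data flow.

```python
import numpy as np

def classify(seq_embeddings, W, b):
    """Max-pool a (n_tokens, dim) sequence over the time dimension,
    then apply a linear layer and softmax to get class probabilities."""
    pooled = seq_embeddings.max(axis=0)        # (dim,): one vector per text
    logits = pooled @ W + b                    # (n_classes,)
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
dim, n_classes = 512, 2
seq = rng.normal(size=(10, dim))               # 10 subword vectors from BERT
probs = classify(seq, rng.normal(size=(dim, n_classes)), np.zeros(n_classes))
print(probs.shape, round(probs.sum(), 6))      # (2,) 1.0
```

Max-pooling makes the classifier independent of the input length, which is why it can be trained directly on the precomputed variable-length BERT sequences.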

Train the network on the precomputed vectors from the pre-trained distilled BERT:

In[30]:= |

Out[30]= |

Check the classification error rate on the validation data:

In[31]:= |

Out[31]= |

Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation dataset:

In[32]:= |

Train the classifier on the precomputed GloVe vectors:

In[33]:= |

Out[33]= |

Compare the results obtained from the pre-trained distilled BERT with GloVe:

In[34]:= |

Out[34]= |

Inspect the number of parameters of all arrays in the net:

In[35]:= |

Out[36]= |

Obtain the total number of parameters:

In[37]:= |

Out[38]= |

Obtain the layer type counts:

In[39]:= |

Out[40]= |

Display the summary graphic:

In[41]:= |

Out[42]= |

Wolfram Language 12.1 (March 2020) or above

- I. Turc, M.-W. Chang, K. Lee, K. Toutanova, "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models," arXiv:1908.08962 (2019)
- Available from: https://github.com/google-research/bert
- Rights: Apache 2.0 License