# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Represent text as a sequence of vectors

Released in 2018, Bidirectional Encoder Representations from Transformers (BERT) is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right contexts in all layers. This model can be fine-tuned with an additional output layer to create state-of-the-art models for a wide range of tasks. It uses bidirectional self-attention, often referred to as a transformer encoder.

- Training data consists of BookCorpus, a dataset of 11,038 unpublished books from 16 different genres, and 2,500 million words from text passages of English Wikipedia.

- BERT obtains the following accuracies on GLUE benchmark tasks:

| Model | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|---|---|---|---|---|---|---|---|---|
| BERT Large | 86.7/85.9% | 72.1% | 91.1% | 94.9% | 60.5% | 86.5% | 89.3% | 70.1% |
| BERT Base | 84.6/83.4% | 71.2% | 90.1% | 93.5% | 52.1% | 85.8% | 88.9% | 66.4% |

Get the pre-trained net:

In[1]:= |

Out[1]= |
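A sketch of this step in the Wolfram Language (the model name is assumed from the repository listing):

```wl
(* Fetch the pre-trained BERT model from the Wolfram Neural Net Repository *)
net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"]
```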

For each token, the net produces a feature vector of length 768:

In[2]:= |

Obtain dimensions of the embeddings:

In[3]:= |

Out[3]= |

Visualize the embeddings:

In[4]:= |

Out[4]= |
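The evaluation steps above can be sketched as follows (the example sentence is illustrative):

```wl
net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"];

(* One length-768 feature vector per (sub)word token *)
embeddings = net["Hello world! I am here"];

(* Dimensions are {numberOfTokens, 768} *)
Dimensions[embeddings]

(* Visualize the embeddings as a matrix of token rows *)
MatrixPlot[embeddings]
```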

Inspect the available parameters:

In[5]:= |

Out[5]= |

Pick a non-default model by specifying the parameters:

In[6]:= |

Out[6]= |

Pick a non-default untrained net:

In[7]:= |

Out[7]= |
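These parameter queries might look as follows (the parameter and property names follow the usual NetModel interface and are assumptions here):

```wl
(* Inspect the available parameters of the model family *)
NetModel["BERT Trained on BookCorpus and English Wikipedia Data",
  "ParametersInformation"]

(* Pick a non-default model by specifying parameters,
   e.g. paired-sentence input *)
NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
  "InputType" -> "ListOfStrings"}]

(* Pick the corresponding untrained (uninitialized) net *)
NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
  "InputType" -> "ListOfStrings"}, "UninitializedEvaluationNet"]
```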

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[8]:= |

Out[8]= |

For each token, a list of the form {tokenIndex, segmentIndex} is returned by the encoder. For the model returned by setting “InputType” to “ListOfStrings”, segment indices are 1 for the first sentence and 2 for the second. For the case where “InputType” is set to “String”, segment indices are all 1:

In[10]:= |

Out[10]= |
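One way to reproduce the encoding step (model and port names are assumed):

```wl
net = NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
    "InputType" -> "ListOfStrings"}];

(* The NetEncoder attached to the input port performs word-piece
   tokenization and text normalization *)
netencoder = NetExtract[net, "Input"]

(* Produces {tokenIndex, segmentIndex} pairs: segment 1 for the
   first sentence, segment 2 for the second *)
netencoder[{"Hello world!", "I am here"}]
```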

The first sentence always starts with the special code 102, corresponding to the classification token [CLS]. Both sentences always end with the special code 103, corresponding to the separator token [SEP]:

In[11]:= |

Out[11]= |

Together with the token and segment indices, position indices are also generated:

In[12]:= |

Out[12]= |

Indices are then embedded into numeric vectors of size 768:

In[13]:= |

Out[13]= |

Obtain the dimensions:

In[14]:= |

Out[14]= |

Visualize the embedding architecture:

In[15]:= |

Out[15]= |
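The embedding stage can be inspected by extracting the corresponding module; the internal name "embedding" is an assumption about the net's naming:

```wl
net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"];

(* Extract the module that embeds token, segment and position indices
   into vectors of size 768; the name "embedding" is assumed *)
NetExtract[net, "embedding"]
```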

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[16]:= |

Out[16]= |

The key part of these blocks is the attention module, comprising 12 parallel self-attention transformations, also called “attention heads”:

In[17]:= |

Out[17]= |

Each head uses an AttentionLayer at its core:

In[18]:= |

Out[18]= |
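The attention structure can be surveyed without knowing the exact layer paths, for instance:

```wl
net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"];

(* Association of layer paths to layers *)
layers = NetInformation[net, "Layers"];

(* Collect all AttentionLayer objects across the 12 blocks *)
attention = Select[layers, Head[#] === AttentionLayer &];
Keys[attention]
```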

Define a list of sentences for comparison:

In[19]:= |

Precompute the embeddings for the list of sentences:

In[20]:= |

Out[20]= |

Visualize the similarity between the sentences using the net as a feature extractor:

In[21]:= |

Out[21]= |
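A possible sketch of the similarity comparison; the sentences and the mean-pooling strategy are illustrative choices:

```wl
net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"];

sentences = {
   "The cat sat on the mat.",
   "A feline rested on the rug.",
   "Stock markets fell sharply today."};

(* Pool the per-token vectors into one fixed-size vector per sentence *)
vectors = Mean /@ (net /@ sentences);

(* Pairwise cosine similarity matrix *)
MatrixPlot[Outer[1 - CosineDistance[#1, #2] &, vectors, vectors, 1],
  FrameTicks -> None]
```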

Get a text-processing dataset:

In[22]:= |

View a random sample of the dataset:

In[23]:= |

Out[23]= |

Precompute the BERT vectors on the training and the validation datasets (a GPU is recommended, if available):

In[24]:= |
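An illustrative version of this precomputation over a toy dataset (the real example uses a text-processing dataset; the data here is a placeholder):

```wl
net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"];

(* Toy stand-in for the training data: text -> class rules *)
trainData = {"a gorgeous, witty film" -> "positive",
   "a complete waste of time" -> "negative"};

(* Replace each text with its BERT vectors; add
   TargetDevice -> "GPU" to the net call if a GPU is available *)
bertVectors[data_] := MapAt[net, data, {All, 1}]
trainVectors = bertVectors[trainData]
```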

Define a simple network for classification, using a max-pooling strategy:

In[25]:= |

Out[25]= |

Train the network on the precomputed BERT vectors:

In[26]:= |

Out[26]= |

Check the classification error rate on the validation data:

In[27]:= |

Out[27]= |
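A sketch of the classification head, training and evaluation; the class names are illustrative, and trainVectors/validVectors are assumed to hold rules of precomputed BERT vectors to classes:

```wl
(* Simple classification head: max-pool over the token dimension,
   then a linear + softmax classifier *)
classifier = NetChain[{
    AggregationLayer[Max, 1],  (* {numberOfTokens, 768} -> {768} *)
    LinearLayer[2],
    SoftmaxLayer[]},
   "Input" -> {"Varying", 768},
   "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]];

(* Train on the precomputed BERT vectors *)
trained = NetTrain[classifier, trainVectors, ValidationSet -> validVectors];

(* Classification error rate on the validation data *)
NetMeasurements[trained, validVectors, "ErrorRate"]
```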

Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors on the training and the validation datasets (a GPU is recommended, if available):

In[28]:= |

In[29]:= |

Train the classifier on the precomputed GloVe vectors:

In[30]:= |

Out[30]= |

Check the classification error rate on the validation data:

In[31]:= |

Out[31]= |
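The GloVe baseline can be sketched similarly (the repository name of the GloVe model and the data variables are assumptions):

```wl
(* Context-independent word embeddings for comparison *)
glove = NetModel["GloVe 300-Dimensional Word Vectors Trained on \
Wikipedia and Gigaword 5 Data"];

(* Same max-pooling classifier head, but with 300-dimensional inputs *)
gloveClassifier = NetChain[{
    AggregationLayer[Max, 1],
    LinearLayer[2],
    SoftmaxLayer[]},
   "Input" -> {"Varying", 300},
   "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]];

(* gloveTrain/gloveValid: rules of precomputed GloVe vectors -> class *)
gloveTrained = NetTrain[gloveClassifier, gloveTrain,
   ValidationSet -> gloveValid];
NetMeasurements[gloveTrained, gloveValid, "ErrorRate"]
```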

Inspect the number of parameters of all arrays in the net:

In[32]:= |

Out[32]= |

Obtain the total number of parameters:

In[33]:= |

Out[33]= |

Obtain the layer type counts:

In[34]:= |

Out[34]= |

Display the summary graphic:

In[35]:= |

Out[35]= |
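These inspection steps correspond to standard NetInformation properties:

```wl
net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"];

NetInformation[net, "ArraysElementCounts"]      (* parameters per array *)
NetInformation[net, "ArraysTotalElementCount"]  (* total parameter count *)
NetInformation[net, "LayerTypeCounts"]          (* counts by layer type *)
NetInformation[net, "SummaryGraphic"]           (* summary graphic *)
```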

Wolfram Language 12.0 (April 2019) or above

- J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805 (2018)
- Available from: https://github.com/google-research/bert
- Rights: Apache 2.0 License