Resource retrieval
Get the pre-trained net:
Out[1]= |  |
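For reference, obtaining the net is a single NetModel call; the repository name below is assumed to match this model's entry:

    net = NetModel["BERT Trained on BookCorpus and English Wikipedia Data"]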
NetModel parameters
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
Out[2]= |  |
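This uses the standard "ParametersInformation" property of NetModel, for example:

    NetModel["BERT Trained on BookCorpus and English Wikipedia Data", "ParametersInformation"]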
Pick a non-default net by specifying the parameters:
Out[3]= |  |
Pick a non-default uninitialized net:
Out[4]= |  |
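A sketch of both calls; the parameter names and values ("Type", "InputType", "LargeUncased", "ListOfStrings") are illustrative guesses, and "UninitializedEvaluationNet" is the standard NetModel property for retrieving the architecture with untrained weights:

    NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
      "Type" -> "LargeUncased", "InputType" -> "ListOfStrings"}]

    NetModel[{"BERT Trained on BookCorpus and English Wikipedia Data",
      "Type" -> "LargeUncased", "InputType" -> "ListOfStrings"}, "UninitializedEvaluationNet"]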
Basic usage
Given a piece of text, the BERT net produces a sequence of feature vectors of size 768, which correspond to the sequence of input words or subwords:
Obtain dimensions of the embeddings:
Out[6]= |  |
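A minimal sketch of the evaluation; the sample text is arbitrary:

    embeddings = net["Hello world! I am here"];
    Dimensions[embeddings]   (* {number of subword tokens, 768} *)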
Visualize the embeddings:
Out[7]= |  |
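One simple way to do this, for instance:

    MatrixPlot[embeddings]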
Transformer architecture
Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:
Out[9]= |  |
For each input subword token, the encoder yields a pair of indices: the index of the token in the vocabulary and the index of the sentence it belongs to within the list of input sentences:
Out[10]= |  |
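One way to inspect these indices is to extract the NetEncoder attached to the net's input port and apply it to text directly; whether it returns the token and segment indices in exactly this form depends on this particular model's encoder, so treat this as a sketch:

    tokenizer = NetExtract[net, "Input"];   (* the NetEncoder attached to the input port *)
    tokenizer["Hello world! I am here"]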
The list of tokens always starts with the special token index 102, which corresponds to the classification ([CLS]) token.
The special token index 103, corresponding to the separator ([SEP]) token, is used to separate the different text segments. Each subword token is also assigned a positional index:
Out[11]= |  |
A lookup is done to map these indices to numeric vectors of size 768:
Out[13]= |  |
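This lookup is what an EmbeddingLayer performs; a standalone toy illustration, where the vocabulary size of 30000 is a placeholder rather than the actual size of BERT's vocabulary:

    lookup = NetInitialize[EmbeddingLayer[768, 30000]];
    Dimensions[lookup[{102, 7632, 103}]]   (* -> {3, 768} *)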
For each subword token, these three embeddings are combined by summing elements with ThreadingLayer:
Out[14]= |  |
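ThreadingLayer applies a function elementwise across several input arrays; a toy illustration summing three equal-length sequences (the port names "Input1", "Input2", "Input3" are assumed):

    sum3 = ThreadingLayer[#1 + #2 + #3 &];
    sum3[<|"Input1" -> {1., 2., 3.}, "Input2" -> {10., 20., 30.}, "Input3" -> {100., 200., 300.}|>]
    (* -> {111., 222., 333.} *)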
The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:
Out[15]= |  |
The key part of these blocks is the attention module, comprising 12 parallel self-attention transformations, also called “attention heads”:
Out[16]= |  |
Each head uses an AttentionLayer at its core:
Out[17]= |  |
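These nested modules can be pulled out with NetExtract once the internal layer names are known; the names below ("encoder", "attention") are hypothetical placeholders for whatever the net's summary graphic actually reports:

    NetInformation[net, "SummaryGraphic"]          (* discover how the blocks are named and nested *)
    block1 = NetExtract[net, {"encoder", 1}];      (* hypothetical path: first of the 12 blocks *)
    attention1 = NetExtract[block1, "attention"]   (* hypothetical path: its multi-head attention module *)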
BERT uses self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:

Sentence analogies
Define a sentence embedding that takes the last feature vector from BERT subword embeddings (as an arbitrary choice):
Out[18]= |  |
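A sketch of such a pooling net, chaining BERT with SequenceLastLayer so that only the final token's feature vector is kept:

    sentenceEmbedding = NetChain[{net, SequenceLastLayer[]}]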
Define a list of sentences in two broad categories (food and music):
Precompute the embeddings for a list of sentences:
Visualize the similarity between the sentences using the net as a feature extractor:
Out[21]= |  |
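A sketch of the whole step; the sentences below are illustrative placeholders for the food and music examples:

    sentences = {
       "I enjoy eating fresh pasta", "This soup is too salty",
       "The guitarist played a beautiful solo", "The orchestra tuned before the symphony"};
    FeatureSpacePlot[sentences, FeatureExtractor -> sentenceEmbedding, LabelingFunction -> Callout]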
Train a classifier model with the subword embeddings
Get a text-processing dataset:
View a random sample of the dataset:
Out[23]= |  |
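The particular dataset is not fixed here; as an illustration, a sentence-polarity resource from the Wolfram Data Repository can be used (the resource and element names below are assumptions):

    train = ResourceData["Sample Data: Movie Review Sentence Polarity", "TrainingData"];
    valid = ResourceData["Sample Data: Movie Review Sentence Polarity", "TestData"];
    RandomSample[train, 3]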
Precompute the BERT vectors for the training and the validation datasets (if a GPU is available, using it is highly recommended):
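A sketch of the precomputation, assuming train and valid are lists of "text -> label" rules and that the net maps over a list of texts in batch; drop TargetDevice if no GPU is available:

    bertTrain = Thread[net[Keys[train], TargetDevice -> "GPU"] -> Values[train]];
    bertValid = Thread[net[Keys[valid], TargetDevice -> "GPU"] -> Values[valid]];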
Define a network to classify the sequences of subword embeddings, using a max-pooling strategy:
Out[25]= |  |
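A minimal sketch of such a network: the sequence of 768-dimensional vectors is max-pooled over the token dimension and fed to a linear classifier (the class labels are placeholders):

    classifierNet = NetChain[
      {AggregationLayer[Max, 1], LinearLayer[2], SoftmaxLayer[]},
      "Input" -> {"Varying", 768},
      "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]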
Train the network on the precomputed BERT vectors:
Out[26]= |  |
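For instance, using the precomputed rules from above:

    trainedNet = NetTrain[classifierNet, bertTrain, ValidationSet -> bertValid]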
Check the classification error rate on the validation data:
Out[27]= |  |
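A sketch using NetMeasurements:

    NetMeasurements[trainedNet, bertValid, "ErrorRate"]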
Let’s compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets:
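A sketch of the GloVe precomputation; the repository name below is assumed:

    glove = NetModel["GloVe 300-Dimensional Word Vectors Trained on Common Crawl 42B Token Corpus"];
    gloveTrain = Thread[glove[Keys[train], TargetDevice -> "GPU"] -> Values[train]];
    gloveValid = Thread[glove[Keys[valid], TargetDevice -> "GPU"] -> Values[valid]];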
Train the classifier on the precomputed GloVe vectors:
Out[29]= |  |
Compare the results obtained with BERT and with GloVe:
Out[30]= |  |
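A sketch of the GloVe-side training and of the comparison; GloVe vectors are 300-dimensional, hence the different input shape:

    classifierNetGloVe = NetChain[
      {AggregationLayer[Max, 1], LinearLayer[2], SoftmaxLayer[]},
      "Input" -> {"Varying", 300},
      "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]];
    trainedGloVe = NetTrain[classifierNetGloVe, gloveTrain, ValidationSet -> gloveValid];
    {NetMeasurements[trainedNet, bertValid, "ErrorRate"],
     NetMeasurements[trainedGloVe, gloveValid, "ErrorRate"]}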
Net information
Inspect the number of parameters of all arrays in the net:
Out[31]= |  |
Obtain the total number of parameters:
Out[32]= |  |
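Both counts are standard NetInformation properties:

    NetInformation[net, "ArraysElementCounts"]
    NetInformation[net, "ArraysTotalElementCount"]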
Obtain the layer type counts:
Out[33]= |  |
Display the summary graphic:
Out[34]= |  |
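Likewise for the layer type counts and the summary graphic:

    NetInformation[net, "LayerTypeCounts"]
    NetInformation[net, "SummaryGraphic"]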
Export to MXNet
Export the net into a format that can be opened in MXNet:
Out[35]= |  |
Export also creates a net.params file containing parameters:
Out[36]= |  |
Get the size of the parameter file:
Out[37]= |  |
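A sketch of the full export; Export with the "MXNet" format writes the net.params file next to the .json file, and the temporary directory is just an illustrative destination:

    jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], net, "MXNet"]
    paramsPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
    FileByteCount[paramsPath]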