SciBERT Trained on Semantic Scholar Data

Represent text as a sequence of vectors

Released in 2019, these four pre-trained feature extractors leverage a large multi-domain scientific corpus with a total of 3.17 billion tokens. Two vocabularies are available, each coming in both cased and uncased versions: the original BERT vocabulary and a new "Scivocab", which overlaps with the original vocabulary by 42%. The resulting models achieve improved performance on a suite of downstream scientific NLP tasks, including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains.

Number of models: 4

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["SciBERT Trained on Semantic Scholar Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["SciBERT Trained on Semantic Scholar Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"SciBERT Trained on Semantic Scholar Data", "Type" ->  "ScivocabCased", "InputType" -> "ListOfStrings"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"SciBERT Trained on Semantic Scholar Data", "Type" -> "BasevocabUncased", "InputType" -> "ListOfStrings"}, "UninitializedEvaluationNet"]
Out[4]=
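
All four nets in the family can also be retrieved programmatically. The sketch below assumes the remaining "Type" values follow the same vocabulary/casing naming as the ones shown above ("ScivocabUncased" and "BasevocabCased" are inferred, not quoted from the parameter listing):

types = {"ScivocabUncased", "ScivocabCased", "BasevocabUncased", "BasevocabCased"};
nets = AssociationMap[
   NetModel[{"SciBERT Trained on Semantic Scholar Data", "Type" -> #}] &,
   types];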

Basic usage

Given a piece of text, the SciBERT net produces a sequence of feature vectors of size 768, one for each input word or subword token:

In[6]:=
input = "Cushing syndrome symptoms with adrenal suppression is caused by a exogenous glucocorticoid depot triamcinolone.";
embeddings = NetModel["SciBERT Trained on Semantic Scholar Data"][input];

Obtain the dimensions of the embeddings:

In[7]:=
Dimensions@embeddings
Out[7]=
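
The first dimension is the number of subword tokens produced by the net's tokenizer. As a rough cross-check (a sketch assuming the encoder is attached at the "Input" port and yields one entry per token, as illustrated in the sections below):

tokenizer = NetExtract[NetModel["SciBERT Trained on Semantic Scholar Data"], "Input"];
Length[tokenizer[input]]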

Visualize the embeddings:

In[8]:=
MatrixPlot@embeddings
Out[8]=

Transformer architecture

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[9]:=
net = NetModel[{"SciBERT Trained on Semantic Scholar Data", "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"]
Out[10]=

For each input subword token, the encoder yields a pair of indices that corresponds to the token index in the vocabulary and the index of the sentence within the list of input sentences:

In[11]:=
netencoder[{"The patient was on clindamycin and topical tazarotene for his acne.", "His family history included hypertension, diabetes, and heart disease."}]
Out[11]=

The list of tokens always starts with the special token index 102, which corresponds to the classification token. The special token index 103 is used as a separator between the different text segments. Each subword token is also assigned a positional index:

In[12]:=
net[{"The patient was on clindamycin and topical tazarotene for his acne.", "His family history included hypertension, diabetes, and heart disease."}, NetPort[{"embedding", "posembed", "Output"}]]
Out[12]=
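
As a quick check of these special tokens, the token indices can be read off the encoder output; this sketch assumes each encoded entry is a {token index, segment index} pair, as described above:

encoded = netencoder[{"The patient was on clindamycin and topical tazarotene for his acne."}];
encoded[[All, 1]] (* should start with 102 and end with 103 *)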

A lookup is done to map these indices to numeric vectors of size 768:

In[13]:=
embeddings = net[{"The patient was on clindamycin and topical tazarotene for his acne.", "His family history included hypertension, diabetes, and heart disease."},
   {NetPort[{"embedding", "embeddingpos", "Output"}],
    NetPort[{"embedding", "embeddingtokens", "Output"}],
    NetPort[{"embedding", "embeddingsegments", "Output"}]}];
Map[MatrixPlot, embeddings]
Out[14]=

For each subword token, these three embeddings are combined by summing elements with ThreadingLayer:

In[15]:=
NetExtract[net, "embedding"]
Out[15]=

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[16]:=
NetExtract[net, "encoder"]
Out[16]=
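
Individual blocks can be extracted by their position in the chain; for instance, a sketch pulling out the third block (the integer indexing is an assumption about the chain structure shown above):

NetExtract[net, {"encoder", 3}]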

The key part of these blocks is the attention module, comprising 12 parallel self-attention transformations, also called “attention heads.” Each head uses an AttentionLayer at its core:

In[17]:=
NetExtract[net, {"encoder", 1, 1}]
Out[17]=

SciBERT uses self-attention, where the embedding of a given subword depends on the full input text; this sets it apart from other connectivity patterns that are popular in deep learning.

Sentence analogies

Define a sentence embedding that takes the last feature vector from SciBERT subword embeddings (as an arbitrary choice):

In[18]:=
sentenceembedding = NetAppend[NetModel["SciBERT Trained on Semantic Scholar Data"], "pooling" -> SequenceLastLayer[]]
Out[18]=
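
The choice of pooling is arbitrary; averaging over all subword vectors is another common option. A minimal sketch using AggregationLayer (not part of the original example):

meanembedding = NetAppend[NetModel["SciBERT Trained on Semantic Scholar Data"], "pooling" -> AggregationLayer[Mean, 1]]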

Define a list of sentences in three broad categories (diseases, medicines and NLP models):

In[19]:=
sentences = {"Hepatitis B is the most common infectious disease in the world.", "Malaria, is a mosquito-borne disease in tropical and subtropical climates.", "Hepatitis C can lead to liver cancer or cirrhosis of the liver over time.", "Tuberculosis is caused by a bacteria and can cause chest pain and a bad cough.",
   "Acetaminophen is used to treat mild to moderate pain and to reduce fever.",
   "Esomeprazole is a proton-pump inhibitor that decreases the acidicity in the stomach.",
   "Haloperidol is an antipsychotic medicine that is used to treat schizophrenia.",
   "Minocycline is used to treat many different bacterial infections.",
   "SpanBERT,a pre-training method designed to better represent spans of text.", "ALBERT uses two parameter reduction techniques to help scaling pre-trained models.",
   "DistilBERT retains 97% of the performance of BERT with 40% fewer parameters.",
   "Q-BERT achieves 13\[Times]compression ratio in weights with at most 2.3% accuracy loss."};

Precompute the embeddings for a list of sentences:

In[20]:=
assoc = AssociationThread[sentences -> sentenceembedding[sentences]];

Visualize the similarity between the sentences using the net as a feature extractor:

In[21]:=
FeatureSpacePlot[assoc, LabelingFunction -> Callout, LabelingSize -> {200, 60}, ImageSize -> 650]
Out[21]=
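
The precomputed embeddings can also drive a simple nearest-sentence lookup. The query sentence below is hypothetical, and cosine distance is just one reasonable similarity measure:

query = "Ibuprofen reduces fever and relieves mild to moderate pain.";
queryVector = sentenceembedding[query];
TakeSmallestBy[Keys[assoc], CosineDistance[assoc[#], queryVector] &, 3]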

Net information

Inspect the number of parameters of all arrays in the net:

In[22]:=
Information[
 NetModel["SciBERT Trained on Semantic Scholar Data"], "ArraysElementCounts"]
Out[22]=

Obtain the total number of parameters:

In[23]:=
Information[
 NetModel["SciBERT Trained on Semantic Scholar Data"], "ArraysTotalElementCount"]
Out[23]=

Obtain the layer type counts:

In[24]:=
Information[
 NetModel["SciBERT Trained on Semantic Scholar Data"], "LayerTypeCounts"]
Out[24]=

Display the summary graphic:

In[25]:=
Information[
 NetModel["SciBERT Trained on Semantic Scholar Data"], "SummaryGraphic"]
Out[25]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[26]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["SciBERT Trained on Semantic Scholar Data"], "MXNet"]
Out[26]=

Export also creates a net.params file containing parameters:

In[27]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[27]=

Get the size of the parameter file:

In[28]:=
FileByteCount[paramPath]
Out[28]=

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Reference