Wolfram Research

SciBERT Trained on Semantic Scholar Data

Represent text as a sequence of vectors

Released in 2019, these four pre-trained feature extractors leverage a large multidomain scientific corpus with a total of 3.17 billion tokens. Two vocabularies are available, each coming in both cased and uncased versions: the original BERT vocabulary and a new "Scivocab", which overlaps with the original by 42%. The resulting models show improved performance on a suite of downstream scientific NLP tasks, including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains.

Number of models: 4

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["SciBERT Trained on Semantic Scholar Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["SciBERT Trained on Semantic Scholar Data", \
"ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"SciBERT Trained on Semantic Scholar Data", "Type" ->  "ScivocabCased", "InputType" -> "ListOfStrings"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"SciBERT Trained on Semantic Scholar Data", "Type" -> "BasevocabUncased", "InputType" -> "ListOfStrings"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Given a piece of text, the SciBERT net produces a sequence of feature vectors of size 768, each corresponding to an input word or subword:

In[6]:=
input = "Cushing syndrome symptoms with adrenal suppression is caused \
by a exogenous glucocorticoid depot triamcinolone.";
embeddings = NetModel["SciBERT Trained on Semantic Scholar Data"][input];

Obtain the dimensions of the embeddings:

In[7]:=
Dimensions@embeddings
Out[7]=

Visualize the embeddings:

In[8]:=
MatrixPlot@embeddings
Out[8]=
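The per-token vectors can be pooled into a single fixed-size representation of the text. As a minimal sketch (an arbitrary pooling choice, not prescribed by the model), averaging the vectors computed above gives one 768-dimensional sentence vector:

(* average the subword feature vectors into a single sentence vector *)
sentenceVector = Mean[embeddings];
Dimensions[sentenceVector]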

Transformer architecture

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[9]:=
net = NetModel[{"SciBERT Trained on Semantic Scholar Data", "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"]
Out[10]=

For each input subword token, the encoder yields a pair of indices that corresponds to the token index in the vocabulary and the index of the sentence within the list of input sentences:

In[11]:=
netencoder[{"The patient was on clindamycin and topical tazarotene \
for his acne.", "His family history included hypertension, diabetes, and heart \
disease."}]
Out[11]=
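As a small follow-up sketch (not part of the original example), the segment index in the second column can be tallied to count how many subword tokens each sentence contributes, including the special tokens added by the encoder:

(* tally subword tokens per sentence; Normal converts a possible NumericArray to a plain list *)
indices = netencoder[{"The patient was on clindamycin and topical tazarotene for his acne.", "His family history included hypertension, diabetes, and heart disease."}];
Counts[Normal[indices][[All, 2]]]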

The list of tokens always starts with the special token index 102, which corresponds to the classification ([CLS]) token. The special token index 103 is used as a separator between the different text segments. Each subword token is also assigned a positional index:

In[12]:=
net[{"The patient was on clindamycin and topical tazarotene for his \
acne.", "His family history included hypertension, diabetes, and \
heart disease."}, NetPort[{"embedding", "posembed", "Output"}]]
Out[12]=

A lookup is done to map these indices to numeric vectors of size 768:

In[13]:=
embeddings = net[{"The patient was on clindamycin and topical tazarotene for his \
acne.", "His family history included hypertension, diabetes, and \
heart disease."},
   {NetPort[{"embedding", "embeddingpos", "Output"}],
    NetPort[{"embedding", "embeddingtokens", "Output"}],
    NetPort[{"embedding", "embeddingsegments", "Output"}]}];
Map[MatrixPlot, embeddings]
Out[14]=
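All three lookups yield one vector of size 768 per subword token, so their dimensions agree; this can be checked directly:

(* the position, token and segment embeddings have matching shapes *)
Map[Dimensions, embeddings]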

For each subword token, these three embeddings are combined by summing elements with ThreadingLayer:

In[15]:=
NetExtract[net, "embedding"]
Out[15]=

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[16]:=
NetExtract[net, "encoder"]
Out[16]=

The key part of these blocks is the attention module, which comprises 12 parallel self-attention transformations, also called "attention heads". Each head uses an AttentionLayer at its core:

In[17]:=
NetExtract[net, {"encoder", 1, 1}]
Out[17]=

SciBERT uses self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:

Sentence analogies

Define a sentence embedding that takes the last feature vector from SciBERT subword embeddings (as an arbitrary choice):

In[18]:=
sentenceembedding = NetAppend[NetModel["SciBERT Trained on Semantic Scholar Data"], "pooling" -> SequenceLastLayer[]]
Out[18]=
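Other pooling strategies are possible. As an alternative sketch (not used in the rest of this example), the subword vectors could instead be averaged by appending an AggregationLayer:

(* mean pooling over the sequence dimension instead of taking the last vector *)
meanembedding = NetAppend[NetModel["SciBERT Trained on Semantic Scholar Data"], "pooling" -> AggregationLayer[Mean, 1]]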

Define a list of sentences in three broad categories (diseases, medicines and NLP models):

In[19]:=
sentences = {"Hepatitis B is the most common infectious disease in \
the world.", "Malaria, is a mosquito-borne disease in tropical and subtropical \
climates.", "Hepatitis C can lead to liver cancer or cirrhosis of the liver \
over time.", "Tuberculosis is caused by a bacteria and can cause chest pain and \
a bad cough.",
   "Acetaminophen is used to treat mild to moderate pain and to \
reduce fever.",
   "Esomeprazole is a proton-pump inhibitor that decreases the \
acidicity in the stomach.",
   "Haloperidol is an antipsychotic medicine that is used to treat \
schizophrenia.",
   "Minocycline is used to treat many different bacterial infections.",
   "SpanBERT,a pre-training method designed to better represent spans \
of text.", "ALBERT uses two parameter reduction techniques to help scaling \
pre-trained models.",
   "DistilBERT retains 97% of the performance of BERT with 40% fewer \
parameters.",
   "Q-BERT achieves 13\[Times]compression ratio in weights with at \
most 2.3% accuracy loss."};

Precompute the embeddings for a list of sentences:

In[20]:=
assoc = AssociationThread[sentences -> sentenceembedding[sentences]];
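
Each sentence is now represented by a single feature vector of length 768:

Dimensions[Values[assoc]]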

Visualize the similarity between the sentences using the net as a feature extractor:

In[21]:=
FeatureSpacePlot[assoc, LabelingFunction -> Callout, LabelingSize -> {200, 60}, ImageSize -> 650]
Out[21]=
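The precomputed association can also be queried directly. As a usage sketch (the query sentence is made up for illustration), Nearest with a cosine distance retrieves the stored sentences whose embeddings are closest to a new query:

(* embed a new query and return the three most similar stored sentences *)
query = sentenceembedding["Ibuprofen is used to relieve pain and reduce fever."];
Nearest[assoc, query, 3, DistanceFunction -> CosineDistance]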

Net information

Inspect the number of parameters of all arrays in the net:

In[22]:=
Information[NetModel["SciBERT Trained on Semantic Scholar Data"], "ArraysElementCounts"]
Out[22]=

Obtain the total number of parameters:

In[23]:=
Information[NetModel["SciBERT Trained on Semantic Scholar Data"], "ArraysTotalElementCount"]
Out[23]=

Obtain the layer type counts:

In[24]:=
Information[NetModel["SciBERT Trained on Semantic Scholar Data"], "LayerTypeCounts"]
Out[24]=

Display the summary graphic:

In[25]:=
Information[NetModel["SciBERT Trained on Semantic Scholar Data"], "SummaryGraphic"]
Out[25]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[26]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["SciBERT Trained on Semantic Scholar Data"], "MXNet"]
Out[26]=

Export also creates a net.params file containing parameters:

In[27]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[27]=

Get the size of the parameter file:

In[28]:=
FileByteCount[paramPath]
Out[28]=

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Reference