Wolfram Research

SciBERT Trained on Semantic Scholar Data

Represent text as a sequence of vectors

Released in 2019, these four pre-trained feature extractors leverage a large multidomain scientific corpus totaling 3.17 billion tokens. Two vocabularies are available, each in cased and uncased versions: the original BERT vocabulary and a new "Scivocab," which overlaps with the original by 42%. The resulting models show improved performance on a suite of downstream scientific NLP tasks, including sequence tagging, sentence classification and dependency parsing, on datasets from a variety of scientific domains.

Number of models: 4

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["SciBERT Trained on Semantic Scholar Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["SciBERT Trained on Semantic Scholar Data", \
"ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"SciBERT Trained on Semantic Scholar Data", 
  "Type" ->  "scivocab_cased", "InputType" -> "ListOfStrings"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"SciBERT Trained on Semantic Scholar Data", 
  "Type" -> "basevocab_uncased", 
  "InputType" -> "ListOfStrings"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Given a piece of text, the SciBERT net produces a sequence of feature vectors of size 768, one for each input word or subword token:

In[5]:=
input = "Cushing syndrome symptoms with adrenal suppression is caused \
by a exogenous glucocorticoid depot triamcinolone.";
embeddings = 
  NetModel["SciBERT Trained on Semantic Scholar Data"][input];

Obtain the dimensions of the embeddings:

In[6]:=
Dimensions@embeddings
Out[6]=
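
A single fixed-size vector for the whole text can be obtained by pooling the token vectors, for example by averaging them. This is a minimal sketch of an arbitrary pooling choice, not part of the pre-trained model:

meanvector = Mean[embeddings]; (* average over the token dimension *)
Dimensions[meanvector]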

Visualize the embeddings:

In[7]:=
MatrixPlot@embeddings
Out[7]=

Transformer architecture

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[8]:=
net = NetModel[{"SciBERT Trained on Semantic Scholar Data", 
    "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"]
Out[9]=

For each input subword token, the encoder yields a pair of indices that corresponds to the token index in the vocabulary and the index of the sentence within the list of input sentences:

In[10]:=
netencoder[{"The patient was on clindamycin and topical tazarotene \
for his acne.", 
  "His family history included hypertension, diabetes, and heart \
disease."}]
Out[10]=

The list of tokens always starts with the special token index 102, which corresponds to the classification token. The special token index 103 is used as a separator between the different text segments. Each subword token is also assigned a positional index:

In[11]:=
net[{"The patient was on clindamycin and topical tazarotene for his \
acne.", "His family history included hypertension, diabetes, and \
heart disease."}, NetPort[{"embedding", "posembed", "Output"}]]
Out[11]=
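
As a quick check, the token indices produced by the encoder should start with 102 and contain the separator index 103 once per text segment. This is a sketch that reuses netencoder from above and assumes index 103 is produced only for separators:

indices = netencoder[{"The patient was on clindamycin and topical tazarotene for his acne.",
    "His family history included hypertension, diabetes, and heart disease."}];
(* first index is the classification token; two separators for two segments *)
{indices[[1, 1]] == 102, Count[Flatten[indices], 103] == 2}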

A lookup then maps these indices to numeric vectors of size 768:

In[12]:=
embeddings = 
  net[{"The patient was on clindamycin and topical tazarotene for his \
acne.", "His family history included hypertension, diabetes, and \
heart disease."},
   {NetPort[{"embedding", "embeddingpos", "Output"}],
    NetPort[{"embedding", "embeddingtokens", "Output"}],
    NetPort[{"embedding", "embeddingsegments", "Output"}]}];
Map[MatrixPlot, embeddings]
Out[13]=

For each subword token, these three embeddings are combined by summing elements with ThreadingLayer:

In[14]:=
NetExtract[net, "embedding"]
Out[14]=
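
Summing elementwise requires the three embedding arrays to have identical shapes; reusing the embeddings computed above, this can be checked directly (a minimal sketch):

Dimensions /@ embeddings (* position, token and segment embeddings are all sequences of 768-dimensional vectors *)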

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[15]:=
NetExtract[net, "encoder"]
Out[15]=

The key part of these blocks is the attention module, which comprises 12 parallel self-attention transformations, also called “attention heads.” Each head uses an AttentionLayer at its core:

In[16]:=
NetExtract[net, {"encoder", 1, 1}]
Out[16]=

SciBERT uses self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:

Sentence analogies

Define a sentence embedding that takes the last feature vector from SciBERT subword embeddings (as an arbitrary choice):

In[17]:=
sentenceembedding = 
 NetAppend[NetModel["SciBERT Trained on Semantic Scholar Data"], 
  "pooling" -> SequenceLastLayer[]]
Out[17]=
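
Other pooling strategies can be defined in the same way, for instance a mean-pooled sentence embedding that averages all subword vectors. This is a sketch of an alternative choice using AggregationLayer, not part of the original example:

meanpoolembedding =
 NetAppend[NetModel["SciBERT Trained on Semantic Scholar Data"],
  "pooling" -> AggregationLayer[Mean, 1]] (* average over the variable-length token dimension *)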

Define a list of sentences in three broad categories (diseases, medicines and NLP models):

In[18]:=
sentences = {"Hepatitis B is the most common infectious disease in \
the world.", 
   "Malaria, is a mosquito-borne disease in tropical and subtropical \
climates.", 
   "Hepatitis C can lead to liver cancer or cirrhosis of the liver \
over time.", 
   "Tuberculosis is caused by a bacteria and can cause chest pain and \
a bad cough.",
   "Acetaminophen is used to treat mild to moderate pain and to \
reduce fever.",
   "Esomeprazole is a proton-pump inhibitor that decreases the \
acidity in the stomach.",
   "Haloperidol is an antipsychotic medicine that is used to treat \
schizophrenia.",
   "Minocycline is used to treat many different bacterial infections.",
   "SpanBERT,a pre-training method designed to better represent spans \
of text.", 
   "ALBERT uses two parameter reduction techniques to help scaling \
pre-trained models.",
   "DistilBERT retains 97% of the performance of BERT with 40% fewer \
parameters.",
   "Q-BERT achieves 13\[Times]compression ratio in weights with at \
most 2.3% accuracy loss."};

Precompute the embeddings for a list of sentences:

In[19]:=
assoc = AssociationThread[sentences -> sentenceembedding[sentences]];

Visualize the similarity between the sentences using the net as a feature extractor:

In[20]:=
FeatureSpacePlot[assoc, LabelingFunction -> Callout, 
 LabelingSize -> {200, 60}, ImageSize -> 650]
Out[20]=
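
The precomputed embeddings can also drive a simple nearest-neighbor lookup. As a sketch (the query sentence below is made up for illustration), retrieve the three stored sentences closest to a new query, using cosine distance between sentence vectors:

query = "Ibuprofen is used to reduce fever and treat mild pain.";
Nearest[Values[assoc] -> Keys[assoc], sentenceembedding[query], 3,
 DistanceFunction -> CosineDistance]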

Net information

Inspect the number of parameters of all arrays in the net:

In[21]:=
Information[
 NetModel["SciBERT Trained on Semantic Scholar Data"], \
"ArraysElementCounts"]
Out[21]=

Obtain the total number of parameters:

In[22]:=
Information[
 NetModel["SciBERT Trained on Semantic Scholar Data"], \
"ArraysTotalElementCount"]
Out[22]=

Obtain the layer type counts:

In[23]:=
Information[
 NetModel["SciBERT Trained on Semantic Scholar Data"], \
"LayerTypeCounts"]
Out[23]=

Display the summary graphic:

In[24]:=
Information[
 NetModel["SciBERT Trained on Semantic Scholar Data"], \
"SummaryGraphic"]
Out[24]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[25]:=
jsonPath = 
 Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], 
  NetModel["SciBERT Trained on Semantic Scholar Data"], "MXNet"]
Out[25]=

Export also creates a net.params file containing parameters:

In[26]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[26]=

Get the size of the parameter file:

In[27]:=
FileByteCount[paramPath]
Out[27]=

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Reference