Wolfram Research

SciBERT Trained on Semantic Scholar Data

Represent text as a sequence of vectors

Released in 2019, these four pre-trained feature extractors leverage a large multidomain scientific corpus with a total of 3.17 billion tokens. Two vocabularies are available, each coming in both cased and uncased versions: the original BERT vocabulary and a new "Scivocab", which overlaps with the original by 42%. The resulting models show improved performance on a suite of downstream scientific NLP tasks, including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains.

Number of models: 4

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["SciBERT Trained on Semantic Scholar Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["SciBERT Trained on Semantic Scholar Data", \
"ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"SciBERT Trained on Semantic Scholar Data", "Type" ->  "ScivocabCased", "InputType" -> "ListOfStrings"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"SciBERT Trained on Semantic Scholar Data", "Type" -> "BasevocabUncased", "InputType" -> "ListOfStrings"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Given a piece of text, the SciBERT net produces a sequence of feature vectors of size 768, each corresponding to an input word or subword:

In[6]:=
input = "Cushing syndrome symptoms with adrenal suppression is caused \
by a exogenous glucocorticoid depot triamcinolone.";
embeddings = NetModel["SciBERT Trained on Semantic Scholar Data"][input];

Obtain the dimensions of the embeddings:

In[7]:=
Dimensions@embeddings
Out[7]=

Visualize the embeddings:

In[8]:=
MatrixPlot@embeddings
Out[8]=
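The per-token vectors can be pooled into a single fixed-size representation of the text. As a minimal sketch (an arbitrary pooling choice, not prescribed by the model), averaging the vectors computed above gives one 768-dimensional sentence vector:

(* average the subword feature vectors into a single sentence vector *)
sentenceVector = Mean[embeddings];
Dimensions[sentenceVector]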

Transformer architecture

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[9]:=
net = NetModel[{"SciBERT Trained on Semantic Scholar Data", "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"]
Out[10]=

For each input subword token, the encoder yields a pair of indices that corresponds to the token index in the vocabulary and the index of the sentence within the list of input sentences:

In[11]:=
netencoder[{"The patient was on clindamycin and topical tazarotene \
for his acne.", "His family history included hypertension, diabetes, and heart \
disease."}]
Out[11]=
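As a small follow-up sketch (not part of the original example), the segment index in the second column can be tallied to count how many subword tokens each sentence contributes, including the special tokens added by the encoder:

(* tally subword tokens per sentence; Normal converts a possible NumericArray to a plain list *)
indices = netencoder[{"The patient was on clindamycin and topical tazarotene for his acne.", "His family history included hypertension, diabetes, and heart disease."}];
Counts[Normal[indices][[All, 2]]]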

The list of tokens always starts with the special token index 102, which corresponds to the classification ([CLS]) token. The special token index 103 is used as a separator between the different text segments. Each subword token is also assigned a positional index:

In[12]:=
net[{"The patient was on clindamycin and topical tazarotene for his \
acne.", "His family history included hypertension, diabetes, and \
heart disease."}, NetPort[{"embedding", "posembed", "Output"}]]
Out[12]=

A lookup is done to map these indices to numeric vectors of size 768:

In[13]:=
embeddings = net[{"The patient was on clindamycin and topical tazarotene for his \
acne.", "His family history included hypertension, diabetes, and \
heart disease."},
   {NetPort[{"embedding", "embeddingpos", "Output"}],
    NetPort[{"embedding", "embeddingtokens", "Output"}],
    NetPort[{"embedding", "embeddingsegments", "Output"}]}];
Map[MatrixPlot, embeddings]
Out[14]=
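All three lookups yield one vector of size 768 per subword token, so their dimensions agree; this can be checked directly:

(* the position, token and segment embeddings have matching shapes *)
Map[Dimensions, embeddings]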

For each subword token, these three embeddings are combined by summing elements with ThreadingLayer:

In[15]:=
NetExtract[net, "embedding"]
Out[15]=

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[16]:=
NetExtract[net, "encoder"]
Out[16]=

The key part of these blocks is the attention module, which comprises 12 parallel self-attention transformations, also called "attention heads". Each head uses an AttentionLayer at its core:

In[17]:=
NetExtract[net, {"encoder", 1, 1}]
Out[17]=

SciBERT uses self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:

Sentence analogies

Define a sentence embedding that takes the last feature vector from SciBERT subword embeddings (as an arbitrary choice):

In[18]:=
sentenceembedding = NetAppend[NetModel["SciBERT Trained on Semantic Scholar Data"], "pooling" -> SequenceLastLayer[]]
Out[18]=
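Other pooling strategies are possible. As an alternative sketch (not used in the rest of this example), the subword vectors could instead be averaged by appending an AggregationLayer:

(* mean pooling over the sequence dimension instead of taking the last vector *)
meanembedding = NetAppend[NetModel["SciBERT Trained on Semantic Scholar Data"], "pooling" -> AggregationLayer[Mean, 1]]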

Define a list of sentences in three broad categories (diseases, medicines and NLP models):

In[19]:=
sentences = {"Hepatitis B is the most common infectious disease in \
the world.", "Malaria, is a mosquito-borne disease in tropical and subtropical \
climates.", "Hepatitis C can lead to liver cancer or cirrhosis of the liver \
over time.", "Tuberculosis is caused by a bacteria and can cause chest pain and \
a bad cough.",
   "Acetaminophen is used to treat mild to moderate pain and to \
reduce fever.",
   "Esomeprazole is a proton-pump inhibitor that decreases the \
acidicity in the stomach.",
   "Haloperidol is an antipsychotic medicine that is used to treat \
schizophrenia.",
   "Minocycline is used to treat many different bacterial infections.",
   "SpanBERT,a pre-training method designed to better represent spans \
of text.", "ALBERT uses two parameter reduction techniques to help scaling \
pre-trained models.",
   "DistilBERT retains 97% of the performance of BERT with 40% fewer \
parameters.",
   "Q-BERT achieves 13\[Times]compression ratio in weights with at \
most 2.3% accuracy loss."};

Precompute the embeddings for a list of sentences:

In[20]:=
assoc = AssociationThread[sentences -> sentenceembedding[sentences]];
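
Each sentence is now represented by a single feature vector of length 768:

Dimensions[Values[assoc]]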

Visualize the similarity between the sentences using the net as a feature extractor:

In[21]:=
FeatureSpacePlot[assoc, LabelingFunction -> Callout, LabelingSize -> {200, 60}, ImageSize -> 650]
Out[21]=
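The precomputed association can also be queried directly. As a usage sketch (the query sentence is made up for illustration), Nearest with a cosine distance retrieves the stored sentences whose embeddings are closest to a new query:

(* embed a new query and return the three most similar stored sentences *)
query = sentenceembedding["Ibuprofen is used to relieve pain and reduce fever."];
Nearest[assoc, query, 3, DistanceFunction -> CosineDistance]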

Net information

Inspect the number of parameters of all arrays in the net:

In[22]:=
Information[NetModel["SciBERT Trained on Semantic Scholar Data"], "ArraysElementCounts"]
Out[22]=

Obtain the total number of parameters:

In[23]:=
Information[NetModel["SciBERT Trained on Semantic Scholar Data"], "ArraysTotalElementCount"]
Out[23]=

Obtain the layer type counts:

In[24]:=
Information[NetModel["SciBERT Trained on Semantic Scholar Data"], "LayerTypeCounts"]
Out[24]=

Display the summary graphic:

In[25]:=
Information[NetModel["SciBERT Trained on Semantic Scholar Data"], "SummaryGraphic"]
Out[25]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[26]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["SciBERT Trained on Semantic Scholar Data"], "MXNet"]
Out[26]=

Export also creates a net.params file containing parameters:

In[27]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[27]=

Get the size of the parameter file:

In[28]:=
FileByteCount[paramPath]
Out[28]=

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Reference