BioBERT Trained on PubMed and PMC Data

Represent text as a sequence of vectors

Released in 2019, these three models were trained on large-scale biomedical corpora comprising 4.5 billion words from PubMed abstracts and 13.5 billion words from PMC full-text articles. The resulting models significantly outperformed previous state-of-the-art models and plain BERT models on biomedical text-mining tasks, namely biomedical named entity recognition, biomedical relation extraction and biomedical question answering.

Number of models: 3

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["BioBERT Trained on PubMed and PMC Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["BioBERT Trained on PubMed and PMC Data", \
"ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"BioBERT Trained on PubMed and PMC Data", "Type" -> "V1.0-Pubmed-PMC", "InputType" -> "ListOfStrings"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"BioBERT Trained on PubMed and PMC Data", "Type" -> "V1.0-PMC", "InputType" -> "ListOfStrings"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Given a piece of text, the BioBERT net produces a sequence of feature vectors of size 768, one for each input word or subword token:

In[5]:=
input = "Cushing syndrome symptoms with adrenal suppression is caused \
by a exogenous glucocorticoid depot triamcinolone.";
embeddings = NetModel["BioBERT Trained on PubMed and PMC Data"][input];

Obtain dimensions of the embeddings:

In[6]:=
Dimensions@embeddings
Out[6]=
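The number of feature vectors is generally larger than the number of words in the text, since the tokenizer adds special classification and separator tokens and may split rare words into several subwords. A rough, illustrative comparison using TextWords as a naive word counter:

(* number of subword feature vectors vs. naive word count *)
{Length[embeddings], Length[TextWords[input]]}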

Visualize the embeddings:

In[7]:=
MatrixPlot@embeddings
Out[7]=

Transformer architecture

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[8]:=
net = NetModel[{"BioBERT Trained on PubMed and PMC Data", "InputType" -> "ListOfStrings"}];
netencoder = NetExtract[net, "Input"]
Out[9]=

For each input subword token, the encoder yields a pair of indices: the token index in the vocabulary and the index of the sentence within the list of input sentences:

In[10]:=
netencoder[{"The patient was on clindamycin and topical tazarotene \
for his acne.", "His family history included hypertension, diabetes, and heart \
disease."}]
Out[10]=

The list of tokens always starts with the special token index 102, which corresponds to the classification index. The special token index 103 is used as a separator between the different text segments. Each subword token is also assigned a positional index:

In[11]:=
net[{"The patient was on clindamycin and topical tazarotene for his \
acne.", "His family history included hypertension, diabetes, and \
heart disease."}, NetPort[{"embedding", "posembed", "Output"}]]
Out[11]=
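As a quick check (this sketch assumes the encoder returns one {token index, segment index} pair per subword, as described above, and uses an arbitrary example sentence), the first and last token indices of a single encoded segment should be the classification index 102 and the separator index 103:

codes = netencoder[{"Aspirin reduces fever."}];
{codes[[1, 1]], codes[[-1, 1]]} (* expected: {102, 103} *)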

A lookup is done to map these indices to numeric vectors of size 768:

In[12]:=
embeddings = net[{"The patient was on clindamycin and topical tazarotene for his \
acne.", "His family history included hypertension, diabetes, and \
heart disease."},
   {NetPort[{"embedding", "embeddingpos", "Output"}],
    NetPort[{"embedding", "embeddingtokens", "Output"}],
    NetPort[{"embedding", "embeddingsegments", "Output"}]}];
Map[MatrixPlot, embeddings]
Out[13]=

For each subword token, these three embeddings are combined into a single vector by an elementwise sum, implemented with ThreadingLayer:

In[14]:=
NetExtract[net, "embedding"]
Out[14]=
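As an illustrative sketch, the three embedding matrices computed above can be summed elementwise directly; note that the "embedding" subnetwork may apply additional processing (such as normalization) after this sum, so the result is not necessarily identical to its final output:

(* elementwise sum of the position, token and segment embeddings *)
summed = Total[embeddings];
Dimensions[summed]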

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[15]:=
NetExtract[net, "encoder"]
Out[15]=

The key part of these blocks is the attention module, which comprises 12 parallel self-attention transformations, also called “attention heads.” Each head uses an AttentionLayer at its core:

In[16]:=
NetExtract[net, {"encoder", 1, 1}]
Out[16]=

BioBERT uses self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:

Sentence analogies

Define a sentence embedding net that takes the last feature vector from BioBERT subword embeddings (as an arbitrary choice):

In[17]:=
sentenceembedding = NetAppend[NetModel["BioBERT Trained on PubMed and PMC Data"], "pooling" -> SequenceLastLayer[]]
Out[17]=
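Another common pooling strategy, equally arbitrary here, is to average all the subword feature vectors. A possible sketch using AggregationLayer (mean pooling over the sequence dimension):

meanembedding = NetAppend[NetModel["BioBERT Trained on PubMed and PMC Data"], "pooling" -> AggregationLayer[Mean, 1]]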

Define a list of sentences in two broad categories (diseases and drugs):

In[18]:=
sentences = {"Hepatitis B is the most common infectious disease in \
the world.", "Malaria, is a mosquito-borne disease in tropical and subtropical \
climates.", "Hepatitis C can lead to liver cancer or cirrhosis of the liver \
over time.", "Tuberculosis is caused by a bacteria and can cause chest pain and \
a bad cough.",
   "Acetaminophen is used to treat mild to moderate pain and to \
reduce fever.",
   "Esomeprazole is a proton-pump inhibitor that decreases the \
acidicity in the stomach.",
   "Haloperidol is an antipsychotic medicine that is used to treat \
schizophrenia.",
   "Minocycline is used to treat many different bacterial \
infections."};

Precompute the embeddings for a list of sentences:

In[19]:=
assoc = AssociationThread[sentences -> sentenceembedding[sentences]];

Visualize the similarity between the sentences using the net as a feature extractor:

In[20]:=
FeatureSpacePlot[assoc, LabelingFunction -> Callout, LabelingSize -> {200, 60}]
Out[20]=
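The precomputed embeddings can also be used for a simple nearest-neighbor search. A sketch using Nearest with cosine distance (the query sentence is an arbitrary example):

nearest = Nearest[Values[assoc] -> Keys[assoc], DistanceFunction -> CosineDistance];
nearest[sentenceembedding["Amoxicillin is an antibiotic used to treat bacterial infections."], 2]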

Net information

Inspect the number of parameters of all arrays in the net:

In[21]:=
Information[NetModel["BioBERT Trained on PubMed and PMC Data"], "ArraysElementCounts"]
Out[21]=

Obtain the total number of parameters:

In[22]:=
Information[NetModel["BioBERT Trained on PubMed and PMC Data"], "ArraysTotalElementCount"]
Out[22]=

Obtain the layer type counts:

In[23]:=
Information[NetModel["BioBERT Trained on PubMed and PMC Data"], "LayerTypeCounts"]
Out[23]=

Display the summary graphic:

In[24]:=
Information[NetModel["BioBERT Trained on PubMed and PMC Data"], "SummaryGraphic"]
Out[24]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[25]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["BioBERT Trained on PubMed and PMC Data"], "MXNet"]
Out[25]=

Export also creates a net.params file containing parameters:

In[26]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[26]=

Get the size of the parameter file:

In[27]:=
FileByteCount[paramPath]
Out[27]=

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Reference

  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C.-H. So, J. Kang, "BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining," Bioinformatics, 36(4), 1234–1240 (2020)
  • Available from: https://github.com/dmis-lab/biobert
  • Rights: Apache 2.0 License