BioBERT Trained on PubMed and PMC Data

Represent text as a sequence of vectors

Released in 2019, these three models were trained on large-scale biomedical corpora comprising 4.5 billion words from PubMed abstracts and 13.5 billion words from PMC full-text articles. The resulting models significantly outperformed previous state-of-the-art models and standard BERT models on biomedical text-mining tasks, namely biomedical named entity recognition, biomedical relation extraction and biomedical question answering.

Number of models: 3

Training Set Information

The training text comes from four sources: BookCorpus, a dataset of 11,038 unpublished books from 16 different genres (0.8 billion words); text passages of English Wikipedia (2.5 billion words); PubMed abstracts (4.5 billion words); and PMC full-text articles (13.5 billion words).

Performance

Precision on various datasets, comparison between BioBERT and BERT:
Accuracy on various datasets, comparison between BioBERT and BERT:

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=

Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=

NetModel["BioBERT Trained on PubMed and PMC Data", "ParametersInformation"]

Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=

Out[3]=

Pick a non-default uninitialized net:

In[4]:=

Out[4]=

Basic usage

Given a piece of text, the BioBERT net produces a sequence of feature vectors of size 768, one for each input word or subword:

In[5]:=

input = "Cushing syndrome symptoms with adrenal suppression is caused by a exogenous glucocorticoid depot triamcinolone.";
embeddings = NetModel["BioBERT Trained on PubMed and PMC Data"][input];

Obtain dimensions of the embeddings:

In[6]:=

Out[6]=

Visualize the embeddings:

In[7]:=

Out[7]=
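
A minimal sketch of the two steps above, assuming the net returns one row per token so that the second dimension is the feature size of 768. The example sentence and variable names are illustrative only:

```wolfram
(* Load the default BioBERT net and embed a short text *)
net = NetModel["BioBERT Trained on PubMed and PMC Data"];
embeddings = net["Adrenal suppression can follow glucocorticoid therapy."];

(* Number of tokens, followed by the feature size *)
Dimensions[embeddings]

(* Plot the embedding matrix: one row per token, one column per feature *)
MatrixPlot[embeddings]
```

The first dimension depends on how the tokenizer splits the text, so it varies from input to input.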

Transformer architecture

Each input text segment is first tokenized into words or subwords using a word-piece tokenizer and additional text normalization. Integer codes called token indices are generated from these tokens, together with additional segment indices:

In[8]:=

Out[9]=
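
The inputs that follow use the symbols net and netencoder. A minimal sketch of how they could be defined, assuming the tokenizer is the NetEncoder attached to the "Input" port of the net:

```wolfram
(* Load the default BioBERT net *)
net = NetModel["BioBERT Trained on PubMed and PMC Data"];

(* Extract the encoder attached to the input port; it maps text to index pairs *)
netencoder = NetExtract[net, "Input"]
```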

For each input subword token, the encoder yields a pair of indices that corresponds to the token index in the vocabulary and the index of the sentence within the list of input sentences:

In[10]:=

netencoder[{"The patient was on clindamycin and topical tazarotene for his acne.", "His family history included hypertension, diabetes, and heart disease."}]

Out[10]=

The list of tokens always starts with the special token index 102, which corresponds to the classification index. The special token index 103 is used as a separator between the different text segments. Each subword token is also assigned a positional index:

In[11]:=

net[{"The patient was on clindamycin and topical tazarotene for his acne.", "His family history included hypertension, diabetes, and heart disease."}, NetPort[{"embedding", "posembed", "Output"}]]

Out[11]=
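
The layout of the three index streams can be sketched without loading the net. The token indices below are hypothetical placeholders rather than real vocabulary entries, and the sketch assumes that a separator closes every segment, including the last one, and that positions are counted from 1:

```wolfram
(* Hypothetical token indices for two short segments *)
segmentA = {1109, 5351, 1108};
segmentB = {1230, 1266};

cls = 102; (* classification index *)
sep = 103; (* separator index *)

(* Assumption: a separator closes every segment, including the last one *)
tokenIndices = Join[{cls}, segmentA, {sep}, segmentB, {sep}];
segmentIndices = Join[
   ConstantArray[1, Length[segmentA] + 2],
   ConstantArray[2, Length[segmentB] + 1]];

(* Assumption: positions are counted from 1 *)
positionIndices = Range[Length[tokenIndices]];

(* One row per token: {token index, segment index, position index} *)
Transpose[{tokenIndices, segmentIndices, positionIndices}]
```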

A lookup is done to map these indices to numeric vectors of size 768:

In[12]:=

embeddings = net[{"The patient was on clindamycin and topical tazarotene for his acne.", "His family history included hypertension, diabetes, and heart disease."},
   {NetPort[{"embedding", "embeddingpos", "Output"}],
    NetPort[{"embedding", "embeddingtokens", "Output"}],
    NetPort[{"embedding", "embeddingsegments", "Output"}]}];
Map[MatrixPlot, embeddings]

Out[13]=

For each subword token, these three embeddings are combined by summing elements with ThreadingLayer:

In[14]:=

Out[14]=
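
The summation itself is a plain elementwise addition of three arrays of equal shape. A minimal sketch with random stand-ins for the looked-up embeddings; it uses ordinary array addition rather than the ThreadingLayer inside the net, and the token count is an arbitrary choice:

```wolfram
SeedRandom[1];
n = 6;   (* number of subword tokens in this toy example *)
d = 768; (* embedding size used by BioBERT *)

(* Random stand-ins for the three looked-up embeddings, one row per token *)
tokenEmb = RandomReal[{-1, 1}, {n, d}];
segmentEmb = RandomReal[{-1, 1}, {n, d}];
positionEmb = RandomReal[{-1, 1}, {n, d}];

(* Elementwise sum; the result keeps the shape {n, d} *)
combined = tokenEmb + segmentEmb + positionEmb;
Dimensions[combined]
```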

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

In[15]:=

Out[15]=

The key part of these blocks is the attention module, which comprises 12 parallel self-attention transformations, also called “attention heads.” Each head uses an AttentionLayer at its core:

In[16]:=

Out[16]=
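
What a single head computes can be sketched with plain array operations. This is an illustration of scaled dot-product self-attention, not the internals of AttentionLayer. It assumes that the 768 features are split evenly across the 12 heads, and it uses random matrices in place of the learned projections:

```wolfram
SeedRandom[1];
n = 5;            (* number of tokens *)
d = 768;          (* embedding size *)
heads = 12;       (* number of attention heads *)
dHead = d/heads;  (* 64, assuming an even split across heads *)

x = RandomReal[{-1, 1}, {n, d}];

(* Random stand-ins for the learned query, key and value projections of one head *)
{wq, wk, wv} = Table[RandomReal[{-1, 1}, {d, dHead}]/Sqrt[d], 3];
q = x . wq; k = x . wk; v = x . wv;

(* Scaled dot-product scores between every pair of tokens *)
scores = (q . Transpose[k])/Sqrt[dHead];

(* Row-wise softmax turns the scores into attention weights *)
softmax[row_] := With[{e = Exp[row - Max[row]]}, e/Total[e]];
weights = softmax /@ scores;

(* Each output row is a weighted average of all value vectors *)
headOutput = weights . v;
Dimensions[headOutput]
```

Every row of weights sums to 1, so each token's output mixes information from the whole input. Under the even-split assumption, concatenating the outputs of all 12 heads restores a width of 768.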

BioBERT uses self-attention, where the embedding of a given subword depends on the full input text. The following figure compares self-attention (lower left) to other types of connectivity patterns that are popular in deep learning:

Sentence analogies

Define a sentence embedding net that takes the last feature vector from BioBERT subword embeddings (as an arbitrary choice):

In[17]:=

Out[17]=

Define a list of sentences in two broad categories (diseases and drugs):

In[18]:=

sentences = {
   "Hepatitis B is the most common infectious disease in the world.",
   "Malaria, is a mosquito-borne disease in tropical and subtropical climates.",
   "Hepatitis C can lead to liver cancer or cirrhosis of the liver over time.",
   "Tuberculosis is caused by a bacteria and can cause chest pain and a bad cough.",
   "Acetaminophen is used to treat mild to moderate pain and to reduce fever.",
   "Esomeprazole is a proton-pump inhibitor that decreases the acidicity in the stomach.",
   "Haloperidol is an antipsychotic medicine that is used to treat schizophrenia.",
   "Minocycline is used to treat many different bacterial infections."};

Precompute the embeddings for a list of sentences:

In[19]:=

Visualize the similarity between the sentences using the net as a feature extractor:

In[20]:=

Out[20]=
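
Sentence similarity can also be checked numerically. The sketch below takes the last row of the subword embedding matrix as the sentence vector and compares sentences by cosine similarity. The helper sentenceVector and the shortened example sentences are illustrative names and data, and the sketch assumes the net returns its output as an ordinary matrix:

```wolfram
net = NetModel["BioBERT Trained on PubMed and PMC Data"];

(* Sentence embedding: the last row of the subword embedding matrix *)
sentenceVector[text_String] := Last[net[text]];

examples = {
   "Hepatitis B is a common infectious disease.",
   "Tuberculosis is caused by bacteria.",
   "Acetaminophen is used to reduce fever."};
vectors = sentenceVector /@ examples;

(* Pairwise cosine similarity; the diagonal is 1 up to rounding *)
similarity = Outer[1 - CosineDistance[#1, #2] &, vectors, vectors, 1];
MatrixForm[similarity]
```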

Net information

Inspect the number of parameters of all arrays in the net:

In[21]:=

Information[NetModel["BioBERT Trained on PubMed and PMC Data"], "ArraysElementCounts"]

Out[21]=

Obtain the total number of parameters:

In[22]:=

Information[NetModel["BioBERT Trained on PubMed and PMC Data"], "ArraysTotalElementCount"]

Out[22]=

Obtain the layer type counts:

In[23]:=

Information[NetModel["BioBERT Trained on PubMed and PMC Data"], "LayerTypeCounts"]

Out[23]=

Display the summary graphic:

In[24]:=

Out[24]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[25]:=

Out[25]=

Export also creates a net.params file containing parameters:

In[26]:=

Out[26]=

Get the size of the parameter file:

In[27]:=

Out[27]=
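
A minimal sketch of the three export steps, assuming the "MXNet" export format is available in the installed version and that net.params is written next to the JSON file. The target directory is an arbitrary choice:

```wolfram
net = NetModel["BioBERT Trained on PubMed and PMC Data"];

(* Export the architecture as JSON; the target path is an arbitrary choice *)
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], net, "MXNet"];

(* Assumption: the parameter file is written next to the JSON file *)
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}];

(* Size of the parameter file in bytes *)
FileByteCount[paramPath]
```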

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Date Created: 2 June 2020

Reference

J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C.-H. So, J. Kang, "BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining," Bioinformatics, 36(4), 1234–1240 (2020)
Available from: https://github.com/dmis-lab/biobert
Rights: Apache 2.0 License