Clinical Concept Embeddings Trained on Health Insurance Claims, Clinical Narratives from Stanford and PubMed Journal Articles

Represent a clinical concept as a vector

Released in 2018 by researchers at Harvard Medical School and the University of North Carolina, this model provides clinical concept embeddings for roughly 110,000 medical concepts. The authors report that the embeddings equal or exceed the state of the art on the benchmarks they evaluated. Concepts from the different data sources were mapped into a common co-occurrence space to produce a single set of embeddings.

Number of layers: 1 | Parameter count: 54,527,000 | Trained size: 218 MB

Training Set Information

Training Set Data

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Clinical Concept Embeddings Trained on Health Insurance \
Claims, Clinical Narratives from Stanford and PubMed Journal \
Articles"]
Out[1]=

Basic usage

This net represents clinical concepts as 500-dimensional vectors. Concepts are identified by their concept unique identifier (CUI). The mapping between identifiers and human-readable concepts can be obtained from the clinical concepts ResourceData:

In[2]:=
ResourceData["Clinical Concepts from Massive Sources of Medical Data"]
Out[2]=

Find the concept unique identifier for the clinical concept "apnea":

In[3]:=
apnea = First[
  Normal[ResourceData[
     "Clinical Concepts from Massive Sources of Medical Data"][
    Select[StringMatchQ[#["Concept"], "apnea", IgnoreCase -> True] &],
     "ConceptUniqueIdentifier"]]]
Out[3]=
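
The lookup pattern used here has two properties worth knowing: First fails when no concept matches, and StringMatchQ (whole-string match) selects fewer rows than StringContainsQ (substring match). The following is a minimal sketch of both points on a small stand-in table. The two rows, their identifiers and the helper lookup are hypothetical and are not part of the resource:

```wolfram
(* Hypothetical two-row stand-in for the concepts resource; the identifiers are made up *)
concepts = Dataset[{
    <|"ConceptUniqueIdentifier" -> "X0000001", "Concept" -> "Apnea"|>,
    <|"ConceptUniqueIdentifier" -> "X0000002", "Concept" -> "Sleep Apnea"|>}];

(* Hypothetical helper: whole-string, case-insensitive match with a default for misses *)
lookup[name_String] := First[
   Normal[concepts[
     Select[StringMatchQ[#["Concept"], name, IgnoreCase -> True] &],
     "ConceptUniqueIdentifier"]],
   Missing["NotFound"]]

lookup["apnea"]   (* selects the "Apnea" row only, not "Sleep Apnea" *)
lookup["asthma"]  (* no matching row, so the default is used instead of an error *)
```

Replacing StringMatchQ with StringContainsQ in this sketch would also select "Sleep Apnea", which is why the substring-based lookups later on this page take the First of possibly several matches.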

Use the net to obtain the associated embedded vector:

In[4]:=
vector = NetModel[
   "Clinical Concept Embeddings Trained on Health Insurance Claims, \
Clinical Narratives from Stanford and PubMed Journal Articles"][apnea]
Out[4]=
In[5]:=
Dimensions[vector]
Out[5]=

Use the embedding layer inside a NetChain:

In[6]:=
chain = NetChain[{NetModel[
    "Clinical Concept Embeddings Trained on Health Insurance Claims, \
Clinical Narratives from Stanford and PubMed Journal Articles"], LongShortTermMemoryLayer[10]}]
Out[6]=

Feature visualization

Create two lists of related concepts:

In[7]:=
viruses = Style[#, Red] & /@ Normal[ResourceData[
     "Clinical Concepts from Massive Sources of Medical Data"][
    Select[StringContainsQ[#["Concept"], "virus", IgnoreCase -> True] &], "ConceptUniqueIdentifier"]]
Out[7]=
In[8]:=
bacteria = Style[#, Blue] & /@ Normal[ResourceData[
     "Clinical Concepts from Massive Sources of Medical Data"][
    Select[StringContainsQ[#["Concept"], "bacteria", IgnoreCase -> True] &], "ConceptUniqueIdentifier"]]
Out[8]=

Visualize the associated embeddings in two dimensions:

In[9]:=
FeatureSpacePlot[Join[bacteria, viruses], FeatureExtractor -> NetModel["Clinical Concept Embeddings Trained on Health Insurance \
Claims, Clinical Narratives from Stanford and PubMed Journal \
Articles"]]
Out[9]=

Visualize the associated embeddings in three dimensions:

In[10]:=
FeatureSpacePlot3D[Join[bacteria, viruses], FeatureExtractor -> NetModel["Clinical Concept Embeddings Trained on Health Insurance \
Claims, Clinical Narratives from Stanford and PubMed Journal \
Articles"]]
Out[10]=

Word analogies

Get the pre-trained net:

In[11]:=
net = NetModel[
  "Clinical Concept Embeddings Trained on Health Insurance Claims, \
Clinical Narratives from Stanford and PubMed Journal Articles"]
Out[11]=

Get a list of concept unique identifiers:

In[12]:=
conceptUIDs = NetExtract[net, "Input"][["Tokens"]]
Out[12]=

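Obtain the embeddings as a list of vectors, dropping the final row of the weight matrix so that the rows line up with the token list:

In[13]:=
vecs = Normal@NetExtract[net, "Weights"][[1 ;; -2]];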

Create an association whose keys are concept unique identifiers and whose values are vectors:

In[14]:=
word2vec = AssociationThread[conceptUIDs -> vecs];

Find the concept unique identifier for the clinical concept "DNA virus":

In[15]:=
dnavirus = First[Normal[
   ResourceData[
     "Clinical Concepts from Massive Sources of Medical Data"][
    Select[StringContainsQ[#["Concept"], "DNA virus", IgnoreCase -> True] &], "ConceptUniqueIdentifier"]]]
Out[15]=

Find the five nearest concept unique identifiers to "DNA virus":

In[16]:=
nearest = Nearest[word2vec, word2vec[dnavirus], 5]
Out[16]=
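
When Nearest is given an association, it returns the keys whose vectors are closest to the query. Because the query vector here is itself one of the values, its own key appears in the result. The following sketch illustrates this on a toy association (the keys and 2D vectors are invented for illustration) and shows two common variations: skipping the query itself and using cosine distance instead of the default Euclidean distance.

```wolfram
(* Toy stand-in for word2vec: made-up keys and 2D vectors *)
toy = <|"A" -> {1., 0.}, "B" -> {0.9, 0.1}, "C" -> {0., 1.}|>;

(* The query vector belongs to "A", so "A" is returned first *)
withSelf = Nearest[toy, toy["A"], 2]

(* Ask for one extra neighbor and drop the first to exclude the query itself *)
withoutSelf = Rest[Nearest[toy, toy["A"], 3]]

(* Same search using cosine distance, which ignores vector length *)
byCosine = Nearest[toy, toy["A"], 2, DistanceFunction -> CosineDistance]
```

Whether Euclidean or cosine distance is more appropriate for these embeddings is not stated in the source, so treat the choice as something to experiment with.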

Obtain the human-readable concept labels for these concept unique identifiers:

In[17]:=
ResourceData[
  "Clinical Concepts from Massive Sources of Medical Data"][
 Select[StringContainsQ[#["ConceptUniqueIdentifier"], nearest] &], "Concept"]
Out[17]=

Explore drugs similar to a given one. Find the concept unique identifier for "metronidazole":

In[18]:=
metronidazole = First[Normal[
   ResourceData[
     "Clinical Concepts from Massive Sources of Medical Data"][
    Select[StringContainsQ[#["Concept"], "metronidazole", IgnoreCase -> True] &], "ConceptUniqueIdentifier"]]]
Out[18]=

Find the five nearest concept unique identifiers to "metronidazole":

In[19]:=
nearest = Nearest[word2vec, word2vec[metronidazole], 5]
Out[19]=

Obtain the human-readable concept labels for these concept unique identifiers:

In[20]:=
DeleteDuplicates@
 ResourceData[
   "Clinical Concepts from Massive Sources of Medical Data"][
  Select[StringContainsQ[#["ConceptUniqueIdentifier"], nearest] &], "Concept"]
Out[20]=

Identify comorbidity relationships. A comorbidity is a disease or condition that frequently accompanies the primary diagnosis. For example, a comorbidity of the condition "premature infant" is "bronchopulmonary dysplasia". Comorbidities of another condition, such as obesity, can be investigated using word analogies. First obtain the relevant CUIs:

In[21]:=
{premature, dysplasia, obesity} = Table[First@
   Normal@ResourceData[
      "Clinical Concepts from Massive Sources of Medical Data"][
     Select[#Concept == concept &], "ConceptUniqueIdentifier"], {concept, {"Infant, Premature", "Bronchopulmonary Dysplasia", "Obesity"}}]
Out[21]=

"Premature infant" is to "bronchopulmonary dysplasia" as "obesity" is to:

In[22]:=
results = Nearest[word2vec, word2vec[dysplasia] - word2vec[premature] + word2vec[obesity], 5]
Out[22]=
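
In the usual vector formulation, an analogy "a is to b as c is to d" is solved by looking for the key whose vector is nearest to b - a + c. The following sketch shows that arithmetic on a toy association. The keys and 2D vectors are invented so that the answer is known in advance, and removing the three query keys before the search is a common convention rather than something the source prescribes.

```wolfram
(* Toy vectors: "b" is "a" shifted by {0, 1}, and "d" is "c" shifted by the same offset *)
toy = <|"a" -> {1., 0.}, "b" -> {1., 1.}, "c" -> {3., 0.},
   "d" -> {3., 1.}, "e" -> {0., 3.}|>;

(* Apply the offset from "a" to "b" to the vector of "c" *)
target = toy["b"] - toy["a"] + toy["c"]

(* Search among the remaining keys so the query terms cannot be returned *)
answer = Nearest[KeyDrop[toy, {"a", "b", "c"}], target, 1]
```

With real embeddings the target rarely coincides with a stored vector, so it is worth inspecting several neighbors rather than only the first.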

Obtain the human-readable concept labels for these concept unique identifiers:

In[23]:=
ResourceData[
  "Clinical Concepts from Massive Sources of Medical Data"][
 Select[StringContainsQ[#["ConceptUniqueIdentifier"], results, IgnoreCase -> True] &], "Concept"]
Out[23]=

Net information

Inspect the sizes of all arrays in the net:

In[24]:=
NetInformation[
 NetModel["Clinical Concept Embeddings Trained on Health Insurance \
Claims, Clinical Narratives from Stanford and PubMed Journal \
Articles"], "ArraysElementCounts"]
Out[24]=

Obtain the total number of parameters:

In[25]:=
NetInformation[
 NetModel["Clinical Concept Embeddings Trained on Health Insurance \
Claims, Clinical Narratives from Stanford and PubMed Journal \
Articles"], "ArraysTotalElementCount"]
Out[25]=
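
The figures quoted on this page can be cross-checked with simple arithmetic. The sketch below uses the parameter count from the summary line and the 500-dimensional vector size from the basic usage section. The 4 bytes per parameter (single-precision storage) is an assumption, not something stated in the source.

```wolfram
paramCount = 54527000;  (* parameter count quoted in the summary line *)
dim = 500;              (* vector size stated under "Basic usage" *)

(* Number of rows in a single weight matrix with dim columns *)
rows = paramCount/dim

(* Storage in bytes, assuming 4 bytes per parameter *)
bytes = 4*paramCount

(* The same figure in megabytes (10^6 bytes) *)
megabytes = N[bytes/10^6]
```

Under that assumption the result is about 218 MB, which agrees with the trained size quoted in the summary line.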

Obtain the layer type counts:

In[26]:=
NetInformation[
 NetModel["Clinical Concept Embeddings Trained on Health Insurance \
Claims, Clinical Narratives from Stanford and PubMed Journal \
Articles"], "LayerTypeCounts"]
Out[26]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[27]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["Clinical Concept Embeddings Trained on Health Insurance \
Claims, Clinical Narratives from Stanford and PubMed Journal \
Articles"], "MXNet"]
Out[27]=

Export also creates a net.params file containing parameters:

In[28]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[28]=

Get the size of the parameter file:

In[29]:=
FileByteCount[paramPath]
Out[29]=

The size is similar to the byte count of the resource object:

In[30]:=
ResourceObject[
  "Clinical Concept Embeddings Trained on Health Insurance Claims, \
Clinical Narratives from Stanford and PubMed Journal \
Articles"]["ByteCount"]
Out[30]=

Represent the MXNet net as a graph:

In[31]:=
Import[jsonPath, {"MXNet", "NodeGraphPlot"}]
Out[31]=

Requirements

Wolfram Language 11.3 (March 2018) or above

Resource History

Reference

  • A.L. Beam et al., "Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data," arXiv:1804.01486 (2018)
  • Available from: https://figshare.com/s/00d69861786cd0156d81
  • Rights: Unrestricted use