
ELMo Contextual Word Representations Trained on 1B Word Benchmark

Represent words as contextual word-embedding vectors

Released in 2018 by the research team at the Allen Institute for Artificial Intelligence (AI2), this representation was trained using a deep bidirectional language model. For each token, it produces three vectors, two of which are contextual, meaning that they depend on the entire sentence in which the token appears. These word vectors are intended to be combined linearly. The model is character-based and case-sensitive, so there is no fixed token dictionary.

Number of layers: 127 | Parameter count: 93,600,864 | Trained size: 375 MB

Training Set Information

Examples

Resource retrieval

Retrieve the resource object:

In[1]:=
ResourceObject["ELMo Contextual Word Representations Trained on 1B \
Word Benchmark"]
Out[1]=

Get the pre-trained net:

In[2]:=
NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"]
Out[2]=

Basic usage

For each token, the net produces three length-1024 feature vectors: one that is context-independent (port "Embedding") and two that are contextual (ports "ContextualEmbedding/1" and "ContextualEmbedding/2").

Input strings are tokenized, i.e. split into words and punctuation marks:

In[3]:=
embeddings = 
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"]["Hello world"]
Out[3]=

For each port and token, the net produces a 1024-dimensional feature vector:

In[4]:=
Dataset[embeddings]
Out[4]=

Pre-tokenized inputs can be given using TextElement:

In[5]:=
embeddings = 
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][TextElement[{"Hello", "world"}]]
Out[5]=

The representation of a given word differs depending on the sentence in which it appears. Extract the embeddings for a different sentence:

In[6]:=
embeddings2 = 
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][TextElement[{"Hello", "neighbor"}]]
Out[6]=

The context-independent embedding of a word is the same whatever the surrounding text is. For instance, for the word "Hello":

In[7]:=
embeddings[["Embedding", 1]] == embeddings2[["Embedding", 1]]
Out[7]=

The context-dependent embeddings are different for the same word in two different sentences:

In[8]:=
embeddings[["ContextualEmbedding/1", 1]] == 
 embeddings2[["ContextualEmbedding/1", 1]]
Out[8]=
In[9]:=
embeddings[["ContextualEmbedding/2", 1]] == 
 embeddings2[["ContextualEmbedding/2", 1]]
Out[9]=

The recommended usage is to take a (possibly weighted) average of the embeddings:

In[10]:=
Mean[Values[embeddings]]
Out[10]=
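The ELMo paper recommends learning task-specific softmax-normalized weights and a scalar factor for this combination; a minimal sketch, with hypothetical weights chosen purely for illustration, might look like:

```wolfram
(* Hypothetical example: weighted combination of the three per-token
   vectors; the weights and the scaling factor gamma are placeholders
   that would normally be learned for a downstream task *)
weights = {0.2, 0.4, 0.4};  (* should sum to 1 *)
gamma = 1.0;
weightedEmbedding = gamma*Total[weights*Values[embeddings]]
```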

Word analogies without context

Extract the non-contextual part of the net:

In[11]:=
netNonContextual = 
 NetTake[NetModel[
   "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], {NetPort["Input"], "embedding"}]
Out[11]=

Precompute the context-independent embeddings for a list of common words (if available, set TargetDevice -> "GPU" for faster evaluation):

In[12]:=
word2vec = 
 With[{words = WordList[]}, 
  AssociationThread[
   words -> netNonContextual[words, TargetDevice -> "CPU"][[All, 1]]]]
Out[12]=

Find the five nearest words to "king":

In[14]:=
Nearest[word2vec, word2vec["king"], 5]
Out[14]=

Man is to king as woman is to:

In[15]:=
Nearest[word2vec, 
 word2vec["king"] - word2vec["man"] + word2vec["woman"], 5]
Out[15]=

Visualize the similarity between the words using the net as a feature extractor:

In[16]:=
animals = {"alligator", "bear", "bird", "bee", "camel", "zebra", 
   "crocodile", "rhinoceros", "giraffe", "dolphin", "duck", "eagle", 
   "elephant", "fish", "fly"};
In[17]:=
fruits = {"apple", "apricot", "avocado", "banana", "blackberry", 
   "cherry", "coconut", "cranberry", "grape", "mango", "melon", 
   "papaya", "peach", "pineapple", "raspberry", "strawberry", "fig"};
In[18]:=
FeatureSpacePlot[Join[animals, fruits], 
 FeatureExtractor -> Function[w, word2vec[w]]]
Out[18]=

Word analogies in context

Define a function that associates each word in context (the word together with its sentence) with the average of its embeddings:

In[19]:=
netevaluateWithContext[sentence_String] := 
 With[{tokenizedSentence = TextElement[StringSplit[sentence]]},
  AssociationThread[
   Thread[{First[tokenizedSentence], sentence}],
   Mean@Values@
     NetModel[
       "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"]@tokenizedSentence
   ]
  ]

Check the result on a sentence:

In[20]:=
Dataset[netevaluateWithContext["I play the piano"]]
Out[20]=

Define a function to find the nearest word in context in a set of sentences, for a given word in context:

In[22]:=
findSemanticNearestWord[{word_, context_}, otherSentences_] := 
 First@Nearest[
   Association[Join @@ Map[netevaluateWithContext, otherSentences]],
   netevaluateWithContext[context][{word, context}]
   ]

Find the semantically nearest word to the word "play" in "I play the piano":

In[23]:=
findSemanticNearestWord[{"play", "I play the piano"},
 {"This was a nice play", "Guitar can be played with a pick"}
 ]
Out[23]=

Find the semantically nearest word to the word "set" in "The set of values higher than a threshold":

In[24]:=
findSemanticNearestWord[{"set", 
  "The set of values higher than a threshold"},
 {"They set the clock", "This ensemble of items belongs to her"}
 ]
Out[24]=

Train a model with the word embeddings

Take a text-processing dataset:

In[25]:=
trainingData = 
  ExampleData[{"MachineLearning", "MovieReview"}, "TrainingData"];
validationData = 
  ExampleData[{"MachineLearning", "MovieReview"}, "TestData"];
In[26]:=
trainingData[[;; 3]]
Out[26]=

Precompute the ELMo vectors on the training and the validation datasets (if available, GPU is recommended):

In[27]:=
trainingDataELMo = 
  Total[Values[
      NetModel[
        "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][Keys[trainingData], TargetDevice -> "CPU"]]/3.] -> 
   Values[trainingData];
In[28]:=
trainingDataELMo[[All, 1]]
Out[28]=
In[29]:=
validationDataELMo = 
  Total[Values[
      NetModel[
        "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][Keys[validationData], TargetDevice -> "CPU"]]/3.] -> 
   Values[validationData];
In[30]:=
validationDataELMo[[All, 1]]
Out[30]=

Define a network that takes word vectors instead of strings for the text-processing task:

In[31]:=
netArchitecture = 
 NetChain[{DropoutLayer[], NetMapOperator[2], 
   AggregationLayer[Max, 1], SoftmaxLayer[]}, 
  "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]
Out[31]=

Train the network on the pre-computed ELMo vectors:

In[32]:=
trainResultsELMo = 
 NetTrain[netArchitecture, trainingDataELMo, All, 
  ValidationSet -> validationDataELMo, MaxTrainingRounds -> 20]
Out[32]=

Check the classification error rate on the validation data:

In[33]:=
trainResultsELMo["LowestValidationErrorRate"]
Out[33]=

Compare the results with the performance of the same model trained on context-independent embeddings:

In[34]:=
trainingDataGlove = Thread@Rule[
    NetModel[
      "GloVe 300-Dimensional Word Vectors Trained on Wikipedia and \
Gigaword 5 Data"][Keys[trainingData]],
    Values[trainingData]
    ];
In[35]:=
trainingDataGlove[[1]]
Out[35]=
In[36]:=
validationDataGlove = Thread@Rule[
    NetModel[
      "GloVe 300-Dimensional Word Vectors Trained on Wikipedia and \
Gigaword 5 Data"][Keys[validationData]],
    Values[validationData]
    ];
In[37]:=
validationDataGlove[[1]]
Out[37]=
In[38]:=
trainResultsGlove = 
 NetTrain[netArchitecture, trainingDataGlove, All, 
  ValidationSet -> validationDataGlove, MaxTrainingRounds -> 20]
Out[38]=
In[39]:=
trainResultsGlove["LowestValidationErrorRate"]
Out[39]=

Net information

Inspect the number of parameters of all arrays in the net:

In[40]:=
NetInformation[
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "ArraysElementCounts"]
Out[40]=

Obtain the total number of parameters:

In[41]:=
NetInformation[
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "ArraysTotalElementCount"]
Out[41]=

Obtain the layer type counts:

In[42]:=
NetInformation[
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "LayerTypeCounts"]
Out[42]=

Display the summary graphic:

In[43]:=
NetInformation[
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "SummaryGraphic"]
Out[43]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[44]:=
jsonPath = 
 Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], 
  NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "MXNet"]
Out[44]=

Export also creates a net.params file containing parameters:

In[45]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[45]=

Get the size of the parameter file:

In[46]:=
FileByteCount[paramPath]
Out[46]=

The size is similar to the byte count of the resource object:

In[47]:=
ResourceObject[
  "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"]["ByteCount"]
Out[47]=

Requirements

Wolfram Language 11.3 (March 2018) or above

Resource History

Reference

  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, "Deep Contextualized Word Representations," NAACL (2018), arXiv:1802.05365
  • Available from: http://allennlp.org/elmo
  • Rights: Apache 2.0 License