MiniLM V2 Text Feature Extractor

Represent text as a sequence of vectors

Released in 2021, MiniLM V2 is a compact, distilled variant of BERT designed to derive semantically meaningful sentence embeddings suitable for large-scale textual similarity tasks. It addresses a major limitation of standard BERT, which requires jointly encoding sentence pairs and is therefore computationally inefficient for semantic search or clustering. Trained on the SNLI and MultiNLI datasets, the model uses a siamese/triplet architecture with a pooling operation on top to produce fixed-size sentence embeddings that can be compared efficiently using cosine similarity.

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["MiniLM V2 Text Feature Extractor"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["MiniLM V2 Text Feature Extractor", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel["MiniLM V2 Text Feature Extractor", "Part" -> "Large"]
Out[3]=

Evaluation function

Get the tokenizer to process text inputs into tokens:

In[4]:=
tokenizer = NetModel["MiniLM V2 Text Feature Extractor", "Tokenizer"]
Out[4]=

Write a function that preprocesses a list of input sentences:

In[5]:=
prepareBatch[inputStrings_?ListQ, tokenizer_ : tokenizer] := Block[
   {tokens, attentionMask, tokenTypes},
   tokens = tokenizer[inputStrings] - 1;
   attentionMask = PadRight[ConstantArray[1, Length[#]] & /@ tokens, Automatic];
   tokens = PadRight[tokens, Automatic];
   tokenTypes = ConstantArray[0, Dimensions[tokens]];
   <|
    "input_ids" -> tokens, "attention_mask" -> attentionMask, "token_type_ids" -> tokenTypes
    |>
   ];
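For readers unfamiliar with the Wolfram Language, the same preprocessing can be sketched in Python. This is a conceptual sketch, not the repository's code: the token IDs below are made up, and a real WordPiece tokenizer would produce them.

```python
# Sketch of the batch preprocessing step: pad token ID sequences to a common
# length, build the attention mask and the (all-zero) token type IDs.
def prepare_batch(token_id_lists):
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        pad = max_len - len(ids)
        input_ids.append(ids + [0] * pad)                  # pad IDs with 0
        attention_mask.append([1] * len(ids) + [0] * pad)  # 1 = token, 0 = padding
    token_type_ids = [[0] * max_len for _ in token_id_lists]  # single sentences
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids}

batch = prepare_batch([[101, 7592, 102], [101, 2088, 999, 102]])
```

The padded positions are flagged with zeros in the attention mask so that later pooling can ignore them.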

Write a function that applies mean pooling to the hidden states:

In[6]:=
meanPooler[vectors_?MatrixQ, weights_?VectorQ] := Mean[WeightedData[vectors, weights]]
meanPooler[vectors_?ArrayQ, weights_?ArrayQ] := MapThread[meanPooler, {vectors, weights}]
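In Python terms (again a sketch, with toy two-dimensional token vectors), the mean pooling above is a weighted average in which the attention mask supplies the weights, so padded positions contribute nothing:

```python
def mean_pool(vectors, weights):
    # Weighted average of token vectors; weights are the attention mask,
    # so padded positions (weight 0) do not contribute.
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(v[d] * w for v, w in zip(vectors, weights)) / total
            for d in range(dim)]

emb = mean_pool([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [1, 1, 0])
# emb == [2.0, 3.0]: the zero-weighted padding vector is ignored
```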

Write an evaluation function that runs the net on a list of sentences and returns the normalized average of the hidden states:

In[7]:=
Options[netevaluate] = {"Part" -> "Base"}; 
netevaluate[inputStrings_ ?ListQ, OptionsPattern[]] := Block[
   {preprocessedAssoc, embeddings, outputFeatures},
   preprocessedAssoc = prepareBatch[inputStrings];
   embeddings = NetModel["MiniLM V2 Text Feature Extractor", "Part" -> OptionValue["Part"]][preprocessedAssoc];
   outputFeatures = meanPooler[embeddings, preprocessedAssoc["attention_mask"]];
   Normalize /@ outputFeatures
   ];
netevaluate[inputString_ ?StringQ, OptionsPattern[]] := First@netevaluate[{inputString}, "Part" -> OptionValue["Part"]];

Basic usage

Get the sentence embedding:

In[8]:=
output = netevaluate["The air in the city is very polluted."];

Get the dimensions of the output:

In[9]:=
Dimensions@output
Out[9]=

Get the sentences:

In[10]:=
sentences = {"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.", "Berlin is well known for its museums.", "Berlin is the capital and largest city of Germany.", "Berlin is well known for its highly developed bicycle lane system."};

Get the sentence embeddings using a non-default model:

In[11]:=
output = netevaluate[sentences, "Part" -> "Large"];

Get the dimensions of the output:

In[12]:=
Dimensions[output]
Out[12]=

Input preprocessing

Preprocess a batch of sentences into inputs expected by the model. The result is an association:

"input_ids": integer token indices

"attention_mask": a binary mask indicating valid tokens vs. padding tokens

"token_type_ids": segment IDs used for sentence pair tasks showing which sentence each token belongs to (here all zeros since only single sentences are provided)

In[13]:=
inputs = prepareBatch[sentences];

Get the dimensions of the preprocessed sentences:

In[14]:=
Map[Dimensions, inputs]
Out[14]=

Visualize the preprocessed sentences:

In[15]:=
ArrayPlot /@ inputs
Out[15]=

Get the sentence embeddings:

In[16]:=
outputs = NetModel["MiniLM V2 Text Feature Extractor"][inputs];

Get the dimensions of the outputs:

In[17]:=
Dimensions@outputs
Out[17]=

Visualize the first sentence embedding:

In[18]:=
MatrixPlot@outputs[[1]]
Out[18]=

The sentence embedding is the normalized average of all non-padded token representations:

In[19]:=
Normalize@Mean@outputs[[1]] // Short
Out[19]=

FeatureSpacePlot

Get the sentences:

In[20]:=
sentences = {"The air in the city is very polluted.", "Trees help keep the air clean.", "Many people recycle plastic bottles.", "Solar panels make clean electricity from sunlight.", "The river water is getting dirty every year.", "The teacher writes on the board.", "Students read books in the classroom.", "Online lessons are easy to join from home.", "The exam will be next Monday.", "Group study helps students learn faster."};

Get the embeddings of the sentences by taking the mean of the features of the tokens for each sentence:

In[21]:=
embeddings = netevaluate[sentences, "Part" -> "Large"];

Visualize the embeddings:

In[22]:=
FeatureSpacePlot[AssociationThread[sentences -> embeddings], LabelingFunction -> Callout]
Out[22]=

Advanced usage

One-shot learning

Get the labels with one example of each:

In[23]:=
labelSentences = {
   "The football team won the championship after a close final match." -> "Sports",
   "The government announced a new policy to improve public education." -> "Politics",
   "The restaurant serves delicious pasta with fresh ingredients." -> "Food"
   };

Get a set of sentences:

In[24]:=
testSentences = {
   "The basketball player scored the winning point in overtime." -> "Sports",
   "The tennis tournament attracted fans from around the world." -> "Sports",
   "Lawmakers debated the proposal in parliament yesterday." -> "Politics",
   "The president gave a speech about international cooperation." -> "Politics",
   "She tried a new recipe that included homemade bread and soup." -> "Food",
   "A famous chef opened a new bakery downtown." -> "Food"
   };

Get the embeddings of the labels and test sentences:

In[25]:=
labelEmb = netevaluate[Keys@labelSentences];
inputEmb = netevaluate[Keys@testSentences];

Get the predictions. Since all of the embeddings are normalized, SquaredEuclideanDistance is used here: for unit vectors it equals twice the cosine distance, so it yields the same nearest-neighbor ranking:

In[26]:=
results = Flatten@Nearest[Thread[labelEmb -> Values@labelSentences], DistanceFunction -> SquaredEuclideanDistance][inputEmb];
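The equivalence is easy to verify numerically: for unit vectors a and b, |a - b|² = |a|² + |b|² - 2 a·b = 2(1 - a·b), which is twice the cosine distance. A quick Python check with arbitrary example vectors:

```python
import math

def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean distance
cos_dist = 1.0 - sum(x * y for x, y in zip(a, b))     # cosine distance
assert abs(sq_euclid - 2.0 * cos_dist) < 1e-12
```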

Create a table to visualize the correct and predicted labels for each sentence:

In[27]:=
Grid[Prepend[
  Transpose[{Keys@testSentences, Values@testSentences, results}], {"Text", "True Label", "Predicted Label"}], Frame -> All, Background -> {None, {LightGray}}, Alignment -> Left]
Out[27]=

Finding outliers

Get a sample of the sentences:

In[28]:=
movieData = {"The movie received great reviews from critics and audiences.", "The actor delivered an outstanding performance in the film.", "The director created a powerful story with deep emotions.", "The soundtrack perfectly matched the tone of the movie.", "Critics praised the movie for its realistic characters.", "The new film attracted millions of viewers worldwide.", "The main character faced many challenges in the plot.", "The audience applauded at the end of the movie.", "The film\[CloseCurlyQuote]s trailer got millions of views in one day.", "The team celebrated their victory in the final match.", "A new smartphone model was released with advanced features.", "The weather forecast predicts heavy rain for the weekend.", "Students are preparing for their final exams this month." };

Get the embeddings:

In[29]:=
movieEmb = netevaluate[movieData];

Calculate the distance of each sentence embedding from the median embedding to measure how semantically far each sentence is from the rest:

In[30]:=
distance = DistanceMatrix[movieEmb, {Median[movieEmb]}, DistanceFunction -> SquaredEuclideanDistance][[All, 1]]
Out[30]=

Compute a threshold based on the median and interquartile range to detect sentences that are semantic outliers:

In[31]:=
threshold = Median[distance] + 1/2 InterquartileRange[distance]
Out[31]=
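The same median-plus-half-IQR rule can be sketched in Python with made-up distance values (note that Python's statistics.quantiles and Wolfram's InterquartileRange may use slightly different quantile conventions, so thresholds can differ marginally):

```python
import statistics

def outlier_threshold(distances):
    # Robust threshold: median plus half the interquartile range.
    q1, _, q3 = statistics.quantiles(distances, n=4)
    return statistics.median(distances) + 0.5 * (q3 - q1)

dists = [0.10, 0.12, 0.11, 0.13, 0.95, 0.12, 0.90]   # hypothetical distances
t = outlier_threshold(dists)
outliers = [i for i, d in enumerate(dists) if d > t]  # indices of outliers
```

With these toy values, only the two sentences far from the bulk of the data exceed the threshold.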

Find the indices for which the distance is greater than the threshold:

In[32]:=
outlierIndices = Flatten[Position[distance, _?(# > threshold &)]]
Out[32]=

Get the outliers:

In[33]:=
movieData[[outlierIndices]] // Column
Out[33]=

Document retrieval

Get the data:

In[34]:=
data = RandomSample[ResourceData["Tweets by @WolframResearch"], 1000];

Extract the text column to get the list of sentences:

In[35]:=
dataText = Normal[data[All, "Text"]];

Get the embeddings of the sentences:

In[36]:=
vecData = netevaluate[dataText];

Get a question:

In[37]:=
question = "how should I register for the conference?";

Get the embedding of the question:

In[38]:=
questionEmb = netevaluate[question];

Find the top-three relevant tweets:

In[39]:=
Nearest[Thread[vecData -> dataText], questionEmb, 3, DistanceFunction -> SquaredEuclideanDistance] // Column
Out[39]=
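The retrieval step amounts to ranking the document embeddings by squared Euclidean distance to the query embedding and keeping the closest matches. A minimal Python sketch with invented 2-D embeddings and document texts:

```python
def top_k(query, docs, k):
    # docs: list of (text, embedding) pairs; rank by squared Euclidean distance.
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return [text for text, emb in
            sorted(docs, key=lambda d: sq_dist(query, d[1]))[:k]]

docs = [("register here", [0.9, 0.1]),
        ("weather today", [0.1, 0.9]),
        ("conference signup", [0.8, 0.2])]
nearest = top_k([1.0, 0.0], docs, 2)  # two closest documents to the query
```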

Resource History

Reference