NuTopic Text Feature Extractor

Represent text as a sequence of vectors

Released in 2024, NuTopic is a BERT-based transformer encoder from NuMind designed for topic-classification feature extraction. Publicly available configuration files indicate that it is built on top of the E5-base-v2 text-embedding architecture, with 12 layers, a hidden size of 768 and a maximum sequence length of 512. The model produces contextual text representations intended to capture topic-related properties of language for downstream applications.

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["NuTopic Text Feature Extractor"]
Out[1]=

Evaluation function

Get the tokenizer to process text inputs into tokens:

In[2]:=
tokenizer = NetModel["NuTopic Text Feature Extractor", "Tokenizer"]
Out[2]=

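Write a function that preprocesses a list of input sentences by tokenizing them, subtracting 1 from the token indices, padding the batch to a common length and building the matching attention mask: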

In[3]:=
prepareBatch[inputStrings_?ListQ] := Block[
   {tokens, attentionMask},
   tokens = tokenizer[inputStrings] - 1;
   attentionMask = PadRight[ConstantArray[1, Length[#]] & /@ tokens, Automatic];
   tokens = PadRight[tokens, Automatic, 1];
   <|
    "input_ids" -> tokens, "attention_mask" -> attentionMask
    |>
   ];
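
As an illustration of the padding and masking steps inside prepareBatch, the sketch below applies the same PadRight calls to two hand-written token lists. The token values and the names toyTokens, toyMask and toyPadded are made up for this example and do not come from the NuTopic tokenizer:

```wolfram
(* hypothetical token indices for two sentences of different lengths *)
toyTokens = {{101, 2054, 102}, {101, 2054, 2003, 2023, 102}};

(* one 1 per real token, right-padded with zeros to a rectangular array *)
toyMask = PadRight[ConstantArray[1, Length[#]] & /@ toyTokens, Automatic];

(* token lists padded to a common length with the value 1, as in prepareBatch *)
toyPadded = PadRight[toyTokens, Automatic, 1];

(* both arrays should share the same batch-by-length shape *)
Dimensions /@ {toyMask, toyPadded}
```

The mask is what distinguishes real tokens from padding afterwards, which is why it is returned alongside the padded token array.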

Write a function that applies mean pooling to the hidden states:

In[4]:=
meanPooler[vectors_?MatrixQ, weights_?VectorQ] := Divide[weights . vectors, Total[weights]]
meanPooler[vectors_?ArrayQ, weights_?ArrayQ] := MapThread[meanPooler, {vectors, weights}]
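
To see what the mask-weighted average does, here is a minimal sketch on a toy matrix. It uses the same formula as meanPooler, and the names toyHidden and toyWeights and their values are invented for the example:

```wolfram
(* toy hidden states for one sentence: 3 token positions, 2 features each *)
toyHidden = {{1., 2.}, {3., 4.}, {100., 100.}};

(* attention mask: the third position is padding *)
toyWeights = {1, 1, 0};

(* mask-weighted mean, the same formula used by meanPooler *)
maskedMean = Divide[toyWeights . toyHidden, Total[toyWeights]];

(* a plain mean would also average in the padded row *)
plainMean = Mean[toyHidden];

{maskedMean, plainMean}
```

Only the first two rows contribute to maskedMean, whereas plainMean is pulled toward the padded row.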

Write a function that returns the requested output from the NuTopic encoder: the last hidden state, the sentence embedding, the normalized embedding, the normalized mean-pooled embedding (the default) or the full association of net outputs. When the option "ApplyMask" is set to True, the "attention_mask" is used to trim the padding tokens from the "LastHiddenState" output:

In[5]:=
Options[netevaluate] = {"ApplyMask" -> False};

netevaluate[input_?StringQ, output : ("LastHiddenState" | "SentenceEmbedding" | "NormalizedEmbedding" | "MeanPooling" | "NetOutputs") : "MeanPooling", opts : OptionsPattern[]] := If[output === "NetOutputs", First /@ netevaluate[{input}, output, opts], First@netevaluate[{input}, output, opts]];

netevaluate[inputStrings_?ListQ, output : ("LastHiddenState" | "SentenceEmbedding" | "NormalizedEmbedding" | "MeanPooling" | "NetOutputs") : "MeanPooling", opts : OptionsPattern[]] := Module[
   {assoc, out, h, mask, pooled},
   assoc = prepareBatch[inputStrings];
   mask = assoc["attention_mask"];
   out = NetModel["NuTopic Text Feature Extractor"][assoc];
   Switch[output,
    "LastHiddenState",
    h = out["last_hidden_state"];
    (* optionally keep only the non-padded token positions of each sentence *)
    If[TrueQ@OptionValue["ApplyMask"], MapThread[Take, {h, Total /@ mask}], h],
    "SentenceEmbedding",
    out["sentence_embedding"],
    "NormalizedEmbedding",
    out["normalized_embedding"],
    "MeanPooling",
    h = out["last_hidden_state"];
    pooled = meanPooler[h, mask];
    Normalize /@ pooled,
    "NetOutputs",
    out
    ]
   ];
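
The trimming performed when "ApplyMask" is True can be sketched on toy data. The arrays toyStates and toyAttention below are hypothetical stand-ins for the hidden states and attention mask of a batch of two sentences:

```wolfram
(* toy hidden states: 2 sentences, 3 token positions, 2 features each *)
toyStates = {{{1, 1}, {2, 2}, {0, 0}}, {{3, 3}, {4, 4}, {5, 5}}};

(* the first sentence has one padded position, the second has none *)
toyAttention = {{1, 1, 0}, {1, 1, 1}};

(* keep only the first n positions of each sentence, where n is the number of real tokens *)
toyTrimmed = MapThread[Take, {toyStates, Total /@ toyAttention}];

Length /@ toyTrimmed
```

This relies on padding being added on the right, so the real tokens are always the leading positions. The trimmed result is a ragged list, not a rectangular array.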

Basic usage

Get the sentence embedding:

In[6]:=
output = netevaluate["query: The air in the city is very polluted."];

Get the dimensions of the output:

In[7]:=
Dimensions@output
Out[7]=

Define a list of query and passage sentences:

In[8]:=
sentences = {
   "query: sports article",
   "query: politics article",
   "passage: The team won the championship after a dramatic final.",
   "passage: Parliament passed a new education reform bill."
   };

Get the sentence embeddings using "NormalizedEmbedding":

In[9]:=
output = netevaluate[sentences, "NormalizedEmbedding"];

Get the dimensions of the output:

In[10]:=
Dimensions[output]
Out[10]=

Compute the similarity scores between the two queries (rows) and the two passages (columns) as dot products of their normalized embeddings:

In[11]:=
scores = output[[1 ;; 2]] . Transpose[output[[3 ;; 4]]]
Out[11]=
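
A score matrix of this shape can be read row by row: the largest entry in a row marks the passage closest to that query. A minimal sketch with hand-made two-dimensional unit vectors (toyQueries, toyPassages and their values are invented for illustration and are not model outputs):

```wolfram
(* hypothetical normalized "embeddings" for two queries and two passages *)
toyQueries = Normalize /@ {{1., 0.}, {0., 1.}};
toyPassages = Normalize /@ {{0.9, 0.1}, {0.2, 0.8}};

(* rows correspond to queries, columns to passages *)
toyScores = toyQueries . Transpose[toyPassages];

(* index of the highest-scoring passage for each query *)
bestPassage = Map[First[Ordering[#, -1]] &, toyScores]
```

In this toy case each query is matched to the passage that points in nearly the same direction.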

Input preprocessing

Preprocess a batch of sentences into inputs expected by the model. The result is an association:

"input_ids": integer token indices

"attention_mask": a binary mask indicating valid tokens vs. padding tokens

In[12]:=
inputs = prepareBatch[sentences];

Get the dimensions of the preprocessed sentences:

In[13]:=
Map[Dimensions, inputs]
Out[13]=

Visualize the preprocessed sentences:

In[14]:=
ArrayPlot /@ inputs
Out[14]=

Run the net on the preprocessed inputs:

In[15]:=
outputs = NetModel["NuTopic Text Feature Extractor"][inputs];

Get the dimensions of the outputs:

In[16]:=
Dimensions /@ outputs
Out[16]=

Visualize the token-level representations of the first sentence:

In[17]:=
MatrixPlot@outputs[[1]][[1]]
Out[17]=

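The sentence embedding is the normalized average of all non-padded token representations, so the attention mask is used to exclude the padding positions from the mean:

In[18]:=
Normalize@meanPooler[outputs[[1]][[1]], inputs["attention_mask"][[1]]] // Short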
Out[18]=

Advanced usage

One-shot learning

Get a list of classes with one example sentence for each:

In[19]:=
labelSentences = {
   "query: The central bank raised interest rates after inflation increased." -> "Economy",
   "query: Researchers developed a new battery that charges in minutes." -> "Technology",
   "query: The coach announced the starting lineup before the semifinal." -> "Sports",
   "query: The museum opened an exhibition of contemporary photography." -> "Culture"
   };

Get a set of sentences to classify and their correct labels:

In[20]:=
testSentences = {
   "query: Markets fell after the latest inflation report was released." -> "Economy",
   "query: The finance minister presented the new budget plan." -> "Economy",
   "query: Engineers introduced a chip designed for on-device AI processing." -> "Technology",
   "query: The company unveiled a laptop with a longer battery life." -> "Technology",
   "query: The captain scored the winning goal in extra time." -> "Sports",
   "query: Fans filled the arena for the opening playoff game." -> "Sports",
   "query: The festival featured films from several emerging directors." -> "Culture",
   "query: The gallery is known for its collection of modern sculpture." -> "Culture"
   };

Get the embeddings of the labels and test sentences:

In[21]:=
labelEmb = netevaluate[Keys@labelSentences];
inputEmb = netevaluate[Keys@testSentences];

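Get the predictions by assigning each test sentence the label of its nearest label embedding. Since all of the embeddings are normalized, SquaredEuclideanDistance is used here: for unit vectors it equals twice the cosine distance, so both give the same nearest neighbors: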

In[22]:=
results = Flatten@Nearest[Thread[labelEmb -> Values@labelSentences], DistanceFunction -> SquaredEuclideanDistance][inputEmb];
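
The relation between the two distances can be checked numerically. For unit vectors a and b, the squared Euclidean distance expands to 2 - 2 a.b, which is twice the cosine distance 1 - a.b. A small sketch with arbitrary example vectors (unitA and unitB are hypothetical, not model embeddings):

```wolfram
(* two arbitrary vectors scaled to unit length *)
unitA = Normalize[{1., 2., 3.}];
unitB = Normalize[{-2., 0.5, 4.}];

(* squared Euclidean distance and twice the cosine distance *)
sqDist = SquaredEuclideanDistance[unitA, unitB];
twiceCos = 2 CosineDistance[unitA, unitB];

(* the difference should vanish up to floating-point error *)
Chop[sqDist - twiceCos] == 0
```

The factor of 2 does not change which neighbor is nearest, so either distance leads to the same predictions for normalized embeddings.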

Create a table to visualize the correct and predicted label for each sentence:

In[23]:=
Grid[Prepend[
  Transpose[{Keys@testSentences, Values@testSentences, results}], {"Text", "True Label", "Predicted Label"}], Frame -> All, Background -> {None, {LightGray}}, Alignment -> Left]
Out[23]=

Transfer learning

Topic classification

Perform topic classification on the DBpedia dataset, where each input sentence is classified into one of 14 ontology-based classes. Texts are encoded using NuTopic Text Feature Extractor sentence embeddings and a simple classifier is trained on top of these embeddings.

Get the dataset:

In[24]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/99162ad1-fe1e-4e4c-8535-8fb1425a1765"]
Out[24]=

Get the label mapping:

In[25]:=
labelMap = <|
   0 -> "Company",
   1 -> "EducationalInstitution",
   2 -> "Artist",
   3 -> "Athlete",
   4 -> "OfficeHolder",
   5 -> "MeanOfTransportation",
   6 -> "Building",
   7 -> "NaturalPlace",
   8 -> "Village",
   9 -> "Animal",
   10 -> "Plant",
   11 -> "Album",
   12 -> "Film",
   13 -> "WrittenWork"
   |>;

Preprocess the dataset:

In[26]:=
i = 0; Monitor[
 encodeddata = Select[TransformColumns[data, "Input" -> Function[i++; Quiet@Check[
        Normal@netevaluate[("query: " <> #text)], $Failed]]], #Input =!= $Failed &], ProgressIndicator[i/Length[data]]]
Out[26]=

Define the classifier model for topic classification, which accepts the embeddings as input and outputs the probabilities for each class of labelMap:

In[27]:=
numClasses = 14;
classifier = NetChain[{LinearLayer[numClasses], SoftmaxLayer[] }]
Out[28]=

Split the encoded data into training, validation and test sets:

In[29]:=
trainData = Take[encodeddata, 1600];
{validationData, testData} = TakeDrop[Drop[encodeddata, 1600], 200];
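
The same Take/Drop/TakeDrop pattern can be tried on a short list to see how the three parts are cut. This sketch uses 20 placeholder integers and smaller split sizes, and all names in it are hypothetical:

```wolfram
(* placeholder "rows": the integers 1 to 20 *)
toyRows = Range[20];

(* the first 16 rows are used for training *)
toyTrain = Take[toyRows, 16];

(* of the remaining rows, the first 2 are for validation and the rest for testing *)
{toyValidation, toyTest} = TakeDrop[Drop[toyRows, 16], 2];

Length /@ {toyTrain, toyValidation, toyTest}
```

The three parts are disjoint and together cover every row, with the test set taking whatever is left after the first two cuts.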

Train the classifier:

In[30]:=
trainedClassifier = NetTrain[classifier, trainData, ValidationSet -> Dataset@validationData]
Out[30]=

Run the classifier on the NuTopic embeddings of the test set and mark each prediction as "Correct" or "Incorrect":

In[31]:=
resultsData = TransformColumns[testData, "Prediction" -> Function[trainedClassifier[#Input]]] // TransformColumns[{
    "Correct" -> (Boole[#Prediction == #Output] &),
    "Incorrect" -> (Boole[#Prediction != #Output] &)
    }]
Out[31]=

Compute the accuracy:

In[32]:=
AggregateRows[resultsData, {
  "Accuracy" -> Function[N@Total[#Correct]/(Total[#Correct] + Total[#Incorrect])]}]
Out[32]=
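
The accuracy above is the fraction of correct predictions. The same quantity can be sketched on plain lists, where toyPredicted and toyActual are made-up labels and not results from the model:

```wolfram
(* hypothetical predicted and true labels for four test items *)
toyPredicted = {"Film", "Album", "Plant", "Film"};
toyActual = {"Film", "Album", "Animal", "Film"};

(* 1 for each correct prediction, 0 otherwise *)
toyCorrect = Boole[MapThread[SameQ, {toyPredicted, toyActual}]];

(* accuracy = correct / (correct + incorrect) *)
toyAccuracy = N[Total[toyCorrect]/Length[toyCorrect]]
```

Because every row is either correct or incorrect, dividing by the total number of rows is equivalent to dividing by the sum of the "Correct" and "Incorrect" counts.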

Create a unified pipeline by merging the classifier and NuTopic:

In[33]:=
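topicModel = NetReplacePart[
  trainedClassifier, {
   (* prepend the same "query: " prefix that was used when encoding the training data *)
   "Input" -> NetEncoder[{"Function", netevaluate["query: " <> #] &, 768, SaveDefinitions -> False}],
   "Output" -> NetDecoder[{"Class", Values@labelMap}]}]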
Out[33]=

Show the results:

In[34]:=
topicModel /@ {
  "The company reported higher profits after launching its new phone.",
  "The college opened a new science center for graduate students.",
  "She broke the world record in the 100-meter final.",
  "The cathedral stands in the center of the old city.",
  "The book tells the story of a family over three generations."}
Out[34]=
In[35]:=
topicModel /@ {
  "The manufacturer expanded into new international markets this year.",
  "The academy is known for its engineering and medical programs.",
  "He returned from injury to win the championship in straight sets.",
  "The structure was completed in the nineteenth century and remains a landmark.",
  "The story was first published anonymously and later became a classic."}
Out[35]=

Resource History

Reference

  • L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, F. Wei, "Text Embeddings by Weakly-Supervised Contrastive Pre-training," arXiv:2212.03533v1 (2022)
  • Available from: https://huggingface.co/numind/NuTopic
  • Rights: MIT License