Wolfram Neural Net Repository
Immediate Computable Access to Neural Net Models
Represent text as a sequence of vectors
Released in 2024, NuNER v2.0 is a RoBERTa-based transformer encoder designed for entity-centric feature extraction. It addresses the data inefficiency of traditional named entity recognition models by leveraging large-scale LLM-annotated data instead of fully supervised corpora. Trained on a GPT-3.5–annotated subset of the C4 corpus using a contrastive-learning objective, the model outputs contextual token representations (last hidden states) and a pooled sequence embedding, making it suitable for downstream NER and embedding-based retrieval tasks.
Get the pre-trained net:
| In[1]:= |
| Out[1]= | ![]() |
Get the tokenizer to process text inputs into tokens:
| In[2]:= |
| Out[2]= | ![]() |
Write a function that preprocesses a list of input sentences:
| In[3]:= | ![]() |
Write a function that applies mean pooling to the hidden states:
| In[4]:= |
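Since the notebook code above is shown as images, here is a language-agnostic sketch in Python of masked mean pooling over the last hidden states. The array shapes are assumptions about the encoder's output layout, not the repository's actual implementation:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, counting only non-padding positions.

    hidden_states  : (batch, seq_len, dim) array of last hidden states
    attention_mask : (batch, seq_len) array of 1s (tokens) and 0s (padding)
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy example: batch of 1, two real tokens and one padding token
h = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])
m = np.array([[1, 1, 0]])
pooled = mean_pool(h, m)  # -> [[2.0, 3.0]]; padding is ignored
```

The padding row contributes nothing to either the sum or the count, which is what distinguishes masked mean pooling from a plain average.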
Write a function that returns one of the requested outputs from the NuNER-V2 encoder (full output, last hidden state, CLS pooling or mean pooling) and optionally trims padding tokens using the "attention_mask" when the option "ApplyMask" is set to True:
| In[5]:= | ![]() |
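As a rough illustration of what the "ApplyMask" option does, this NumPy sketch (an assumption about the behavior, not the repository's code) drops the rows of the hidden-state matrix that correspond to padding positions:

```python
import numpy as np

def trim_padding(hidden_states, attention_mask):
    """Keep only the token vectors whose attention_mask entry is 1.

    hidden_states  : (seq_len, dim) hidden states for one sentence
    attention_mask : (seq_len,) binary mask
    Returns a (num_real_tokens, dim) array.
    """
    keep = np.asarray(attention_mask, dtype=bool)
    return hidden_states[keep]

h = np.arange(8.0).reshape(4, 2)  # 4 token positions, 2 features each
m = [1, 1, 1, 0]                  # the last position is padding
trimmed = trim_padding(h, m)      # shape (3, 2): padding row removed
```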
Get the sentence embedding:
| In[6]:= |
Get the dimensions of the output:
| In[7]:= |
| Out[7]= |
Get the sentences:
| In[8]:= | ![]() |
Get the sentence embeddings using "ClassPooling":
| In[9]:= |
Get the dimensions of the output:
| In[10]:= |
| Out[10]= |
Preprocess a batch of sentences into inputs expected by the model. The result is an association:
• "input_ids": integer token indices
• "attention_mask": a binary mask indicating valid tokens vs. padding tokens
| In[11]:= |
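The padding step behind that association can be sketched in Python as follows. The helper name and the RoBERTa padding id of 1 are assumptions for illustration:

```python
def pad_batch(token_id_lists, pad_id=1):
    """Pad variable-length token-id lists into a rectangular batch.

    Returns {"input_ids": ..., "attention_mask": ...}, mirroring the
    association described above. pad_id=1 (RoBERTa's convention) is
    an assumption here.
    """
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two sentences of different lengths; the shorter one is padded
batch = pad_batch([[0, 713, 2], [0, 9064, 16, 2]])
```

The attention mask records exactly which positions were padded, so later pooling steps can exclude them.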
Get the dimensions of the preprocessed sentences:
| In[12]:= |
| Out[12]= |
Visualize the preprocessed sentences:
| In[13]:= |
| Out[13]= |
Get the sentence embeddings:
| In[14]:= |
Get the dimensions of the outputs:
| In[15]:= |
| Out[15]= |
Visualize the first sentence embedding:
| In[16]:= |
| Out[16]= | ![]() |
The sentence embedding is the normalized average of all non-padded token representations:
| In[17]:= |
| Out[17]= |
Get the sentences:
| In[18]:= | ![]() |
Get the embeddings of the sentences by taking the mean of the features of the tokens for each sentence:
| In[19]:= |
Visualize the embeddings:
| In[20]:= | ![]() |
| Out[20]= | ![]() |
Get a list of classes with one example sentence for each:
| In[21]:= | ![]() |
Get a set of sentences to classify and their correct labels:
| In[22]:= | ![]() |
Get the embeddings of the labels and test sentences:
| In[23]:= |
Get the predictions. Since all of the embeddings are normalized, SquaredEuclideanDistance, which is equivalent (up to a constant factor) to cosine distance, is used here:
| In[24]:= | ![]() |
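The equivalence used here follows from the identity ‖a − b‖² = 2(1 − cos(a, b)) for unit vectors, so squared Euclidean distance preserves the cosine-distance ranking. A small NumPy check, with made-up label embeddings:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

a = normalize(np.array([1.0, 2.0, 2.0]))
b = normalize(np.array([2.0, 1.0, 2.0]))
sq_euclid = np.sum((a - b) ** 2)
cos_dist = 1.0 - np.dot(a, b)
# For unit vectors: ||a - b||^2 == 2 * (1 - cos)
assert np.isclose(sq_euclid, 2 * cos_dist)

def nearest_label(embedding, label_embeddings):
    """Pick the label whose embedding has the smallest squared distance."""
    return min(label_embeddings,
               key=lambda k: np.sum((embedding - label_embeddings[k]) ** 2))

labels = {"sports": normalize(np.array([1.0, 0.1])),
          "finance": normalize(np.array([0.1, 1.0]))}
pred = nearest_label(normalize(np.array([0.9, 0.2])), labels)  # -> "sports"
```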
Create a table to visualize the correct and predicted label for each sentence:
| In[25]:= | ![]() |
| Out[25]= | ![]() |
Get a sample of sentences:
| In[26]:= | ![]() |
Get the embeddings:
| In[27]:= |
Calculate the distance of each sentence embedding from the median embedding to measure how semantically distant each sentence is:
| In[28]:= | ![]() |
| Out[28]= |
Compute a threshold based on the median and interquartile range to detect sentences that are semantic outliers:
| In[29]:= |
| Out[29]= |
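The thresholding step can be sketched as follows. The factor of 1.5 on the interquartile range is the conventional choice; the exact constant used in the notebook is not visible above, so it is an assumption here:

```python
import numpy as np

def iqr_threshold(distances, factor=1.5):
    """Outlier cutoff from the median and interquartile range.

    factor=1.5 is the conventional IQR multiplier; the constant used
    in the notebook above is an assumption.
    """
    q1, med, q3 = np.percentile(distances, [25, 50, 75])
    return med + factor * (q3 - q1)

dists = np.array([0.10, 0.12, 0.11, 0.13, 0.95])  # one obvious outlier
thr = iqr_threshold(dists)
outlier_indices = np.where(dists > thr)[0]        # -> array([4])
```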
Find the indices for which the distance is greater than the threshold:
| In[30]:= |
| Out[31]= |
Get the outliers:
| In[32]:= |
| Out[32]= | ![]() |
Perform binary sentiment analysis on the SST-2 dataset, where each input sentence is classified as expressing either negative or positive sentiment. The original dataset labels are 0 for negative sentiment and 1 for positive sentiment. Texts are encoded as sentence embeddings using the NuNER-V2 feature extractor, and a simple classifier is trained on top of these embeddings.
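The "simple classifier on top of embeddings" idea can be sketched with plain logistic regression on made-up 2-D embeddings (the real pipeline uses the 768-dimensional NuNER-V2 embeddings and a trained net, so this is only a minimal stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D "embeddings": class 1 clusters near (1, 1), class 0 near (-1, -1)
X = np.vstack([rng.normal( 1.0, 0.3, (50, 2)),
               rng.normal(-1.0, 0.3, (50, 2))])
y = np.array([1] * 50 + [0] * 50)

w, b = np.zeros(2), 0.0
for _ in range(500):                       # plain gradient descent
    p = 1 / (1 + np.exp(-(X @ w + b)))     # sigmoid probabilities
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = np.mean(pred == y)              # near-perfect on separable clusters
```

Because good sentence embeddings already separate the classes, even a linear head like this is usually sufficient, which is the point of the workflow below.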
Get the dataset:
| In[33]:= |
| Out[33]= | ![]() |
Preprocess the dataset:
| In[34]:= | ![]() |
| Out[34]= | ![]() |
Define the classifier model for sentiment analysis, which accepts the embeddings as an input and outputs the probabilities for each class (positive, negative):
| In[35]:= |
| Out[36]= |
Extract the training datasets from the initial data:
| In[37]:= |
Train the classifier:
| In[38]:= |
| Out[38]= |
Run the classifier on the embeddings obtained by the NuNER model using test sentences and categorize the results into true positive (TP), true negative (TN), false positive (FP) and false negative (FN):
| In[39]:= | ![]() |
| Out[39]= | ![]() |
Compute the precision, recall and "F1Score":
| In[40]:= | ![]() |
| Out[40]= |
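For reference, the metric definitions used here can be written out directly from the confusion-matrix counts (the example counts are made up):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = prf1(tp=8, fp=2, fn=2)  # -> (0.8, 0.8, 0.8)
```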
Create a unified pipeline by merging the classifier and NuNER-V2:
| In[41]:= | ![]() |
| Out[41]= |
Show the results:
| In[42]:= | ![]() |
| Out[42]= |
Perform named entity recognition (NER), where each token in a sentence is assigned a label indicating whether it belongs to an entity such as a person, organization or location. The dataset provides tokenized text along with token-level NER tags, which are expanded to align with the sub-word tokenization used by the model. Sentences are encoded using NuNER, and the last hidden states are used to predict an entity label for each sub-token. This allows the model to identify and classify named entities at the token level.
Get the dataset:
| In[43]:= |
| Out[43]= | ![]() |
Write a function to make the label tags compatible with the sub-word tokenized setup by expanding each word's tag to its sub-words:
| In[44]:= | ![]() |
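The tag-expansion logic can be sketched as follows. The continuation convention (first sub-word piece keeps the word's tag, later pieces of a B- tag become the matching I- tag) is a common choice and an assumption here, since the notebook's code is not visible:

```python
def expand_tags(tags, subword_counts):
    """Expand word-level BIO tags to sub-word level.

    tags           : one BIO tag per word, e.g. ["B-PER", "O"]
    subword_counts : number of sub-word pieces each word was split into
    """
    expanded = []
    for tag, count in zip(tags, subword_counts):
        # Continuation pieces of a B- tag become the matching I- tag
        cont = "I" + tag[1:] if tag.startswith("B-") else tag
        expanded.append(tag)
        expanded.extend([cont] * (count - 1))
    return expanded

sub_tags = expand_tags(["B-PER", "I-PER", "O"], [2, 1, 3])
# -> ["B-PER", "I-PER", "I-PER", "O", "O", "O"]
```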
Preprocess the dataset by joining the tokens into whole sentences, computing the last hidden states and expanding the token-level tags to sub-tokens:
| In[45]:= | ![]() |
| Out[45]= | ![]() |
Extract the training datasets from preprocessed data:
| In[46]:= |
Define the token classification model:
| In[47]:= |
| Out[48]= |
Train the model:
| In[49]:= | ![]() |
Write a function that will compute the confusion matrix components (TP, FP and FN) according to the micro-F1 metric defined in the NuNER paper. NER is evaluated at the entity-span level. Labels use the 1-to-7 index range. Each label sequence is first converted into spans (type, start and end) using BIO rules, starting at a B (begin) tag and extending through consecutive I (inside) tags of the same type. TP counts exact span matches (same type and boundaries), FP counts predicted spans not present in the gold annotation and FN counts gold spans not predicted:
| In[50]:= | ![]() |
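The span-extraction and matching rules just described can be sketched in Python (string BIO tags stand in for the 1-to-7 label indices used above):

```python
def bio_spans(tags):
    """Convert a BIO tag sequence into (type, start, end) spans.

    A span starts at a B- tag and extends through consecutive I- tags
    of the same type; end is exclusive.
    """
    spans, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype, start = tags[i][2:], i
            i += 1
            while i < len(tags) and tags[i] == "I-" + etype:
                i += 1
            spans.append((etype, start, i))
        else:
            i += 1
    return spans

def span_tp_fp_fn(gold, pred):
    """Exact-match span counts: same type and same boundaries."""
    g, p = set(bio_spans(gold)), set(bio_spans(pred))
    return len(g & p), len(p - g), len(g - p)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
tp, fp, fn = span_tp_fp_fn(gold, pred)  # -> (1, 1, 1)
```

Note that the mistyped span at position 3 counts as both a false positive and a false negative, which is characteristic of exact-match span evaluation.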
Write a function that will compute the confusion matrix components (TP, FP and FN) per token. Labels are first mapped to entity types, PER (person), ORG (organization), LOC (location) and O (outside any named entity), ignoring the B/I distinction. TP then counts matching entity-type tokens, FP counts entity-type predictions not matching gold and FN counts gold entity-type tokens missed by the prediction, summed across all sentences:
| In[51]:= | ![]() |
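The token-level counting can be sketched as follows, again using string BIO tags in place of label indices:

```python
def token_tp_fp_fn(gold, pred):
    """Token-level counts with the B-/I- prefix stripped.

    Only entity tokens count; "O" (outside any entity) is ignored.
    """
    strip = lambda t: t.split("-")[-1]  # "B-PER" and "I-PER" both -> "PER"
    tp = fp = fn = 0
    for g, p in zip(map(strip, gold), map(strip, pred)):
        if p != "O" and p == g:
            tp += 1
        else:
            if p != "O":
                fp += 1
            if g != "O":
                fn += 1
    return tp, fp, fn

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "B-PER", "O", "B-ORG"]
counts = token_tp_fp_fn(gold, pred)  # -> (2, 1, 1)
```

Unlike the span-level metric, the B/I confusion at position 1 is forgiven here; only the entity type per token matters.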
Get the trained model's scores for the test data:
| In[52]:= | ![]() |
| Out[53]= | ![]() |
Compute precision, recall and "F1Score" using the NuNER-defined span-level micro metric and the token-level metric, based on TP, FP and FN:
| In[54]:= | ![]() |
| Out[54]= |
Get the whole model, merging the head and NuNER-V2:
| In[55]:= | ![]() |
| Out[55]= |
Write a wrapper function to convert the nerModel output to human-readable entities:
| In[56]:= | ![]() |
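A hypothetical stand-in for such a wrapper, sketched in Python: it groups each B-/I- run of predicted tags into one human-readable (text, entity-type) pair:

```python
def readable_entities(tokens, tags):
    """Group tokens with BIO tags into (text, entity-type) pairs."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # flush the previous entity
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            current.append(tok)               # continue the current entity
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

ents = readable_entities(
    ["Marie", "Curie", "worked", "in", "Paris"],
    ["B-PER", "I-PER", "O", "O", "B-LOC"])
# -> [("Marie Curie", "PER"), ("Paris", "LOC")]
```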
Show the results without the wrapper function:
| In[57]:= |
| Out[57]= | ![]() |
Show the results with the wrapper function:
| In[58]:= |
| Out[58]= |