MiniLM V2 Reranker Trained on MS MARCO Data

Compute a relevance score for a query-passage pair

Released in 2021, this model is a MiniLM v2 cross-encoder designed to compute relevance scores for query-passage pairs. It is trained on the MS MARCO passage-ranking dataset with supervised labels. By jointly encoding the query-passage pairs and applying a classification head, it outputs a single scalar relevance score that enables tasks like ranking search and question answering.

Training Set Information

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering and passage ranking.

Model Information

Examples

Download Example Notebook

Open in Wolfram Cloud

Resource retrieval

Get the pre-trained net:

In[1]:=

Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=

Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=

Out[3]=

Evaluation function

Get the tokenizer:

In[4]:=

Out[4]=

Write a function that preprocesses a list of input sentences:

In[5]:=

$prepareBatch[inputQuery_? StringQ, inputDocuments_ ?ListQ, tokenizer_ : tokenizer] := Block[ {tokens, tokensQ, tokensD, pairs, attentionMask, tokenTypes}, {tokensQ, tokensD} = TakeDrop[tokenizer[Prepend[inputDocuments, inputQuery]], 1] - 1; tokensQ = First@tokensQ; tokens = Join[tokensQ, Rest[#]] & /@ tokensD; tokens = PadRight[tokens, Automatic]; attentionMask = UnitStep[tokens - 1]; tokenTypes = Join[ConstantArray[0, {Length@tokensD, Length@tokensQ}], ConstantArray[1, Length@# - 1] & /@ tokensD, 2]; tokenTypes = PadRight[tokenTypes, Automatic]; <| "input_ids" -> tokens, "attention_mask" -> attentionMask, "token_type_ids" -> tokenTypes |> ];$

Write a function that incorporates all the steps from tokenizing the input string up to outputting the scores:

In[6]:=

$Options[netevaluate] = {"Size" -> "L-6"}; netevaluate[inputQuery_? StringQ, inputDocuments_ ?ListQ, OptionsPattern[]] := Block[ {preprocessedAssoc, embeddings}, preprocessedAssoc = prepareBatch[inputQuery, inputDocuments]; embeddings = NetModel["MiniLM V2 Reranker Trained on MS MARCO Data", "Size" -> OptionValue["Size"]][preprocessedAssoc]; embeddings ];$

Basic usage

Get the question and the sentences:

In[7]:=

question = "How many people live in Berlin?";
sentences = {"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.", "Berlin is well known for its museums.", "Berlin is the capital and largest city of Germany.", "Berlin is well known for its highly developed bicycle lane system."};;

Get the scores for a given question for each sentence:

In[8]:=

Out[8]=

Get the scores for a given question for each sentence using a non-default model:

In[9]:=

Out[9]=

Input preprocessing

Preprocess a batch of sentences into inputs expected by a model. The result is an association:

• "input_ids": integer token indices representing each token of the concatenated question-sentence pairs

• "attention_mask": binary mask indicating valid tokens vs. padding tokens

• "token_type_ids": segment IDs used for sentence pair tasks showing which sentence each token belongs to (here 0 for the question and 1 for the sentence)

In[10]:=

Get the dimensions of the preprocessed pairs of the sentences:

In[11]:=

Out[11]=

Visualize the preprocessed sentences:

In[12]:=

Out[12]=

Get the scores from the pairs of sentences:

In[13]:=

Out[13]=

Reranker-only document retrieval

The reranker model directly evaluates the joint relationship between a query and each document by concatenating them and producing a single relevance score. This full query-document interaction provides high accuracy, but because reranker scores cannot be precomputed and must be recalculated for every pair, applying it to all documents is computationally expensive and scales poorly. Therefore, this approach is best suited for small corpora. First, get the documents:

In[14]:=

Split the input data into individual sentences to use each as a separate passage:

In[15]:=

Get the query:

In[16]:=

Run the reranker directly on all documents and get the scores from the pairs of sentences with recorded time:

In[17]:=

Get the top-K relevant answers for the given query with the corresponding relevance score:

In[18]:=

topK = 5;
{maxScoresRerankerOnly, topPassagesRerankerOnly} = Transpose@
TakeLargestBy[Transpose[{Flatten[outputsRerankerOnly], passages}], First, topK];

Show the results:

In[19]:=

Labeled[Grid[
Prepend[Transpose[{topPassagesRerankerOnly}], {"Top-K Retrieved Passages by Reranker-Only"}], Frame -> All, Background -> {None, {LightGray}}, Alignment -> Left, Spacings -> {1, 2}], query, Top]

Out[19]=

Embedding-only document retrieval

In large-scale information retrieval, processing every document with a reranker for each query is infeasible. To handle this efficiently, a text embedder encodes each document into a dense vector representation and stores these embeddings for reuse. At query time, the user query is encoded into the same vector space, and the system retrieves the most semantically similar documents based on vector distances. This approach enables fast and scalable retrieval with minimal latency. First, define the utility functions responsible for embedding extraction:

In[20]:=

$embedder[inputStrings_ ?(StringQ[#] || VectorQ[#, StringQ] &), modelName_ : "Base", tokenizer_ : tokenizer] := Block[{embeddings, outputFeatures, tokens, attentionMask, tokenTypes, isString, input, meanPooler}, meanPooler[ PatternTest[ Pattern[vectors, Blank[]], MatrixQ], PatternTest[ Pattern[weights, Blank[]], VectorQ]] := Mean[ WeightedData[vectors, weights]]; meanPooler[ PatternTest[ Pattern[vectors, Blank[]], ArrayQ], PatternTest[ Pattern[weights, Blank[]], ArrayQ]] := MapThread[ meanPooler, {vectors, weights}]; isString = StringQ[ inputStrings]; input = If[ isString, {inputStrings}, inputStrings]; tokens = tokenizer[ input] - 1; attentionMask = PadRight[ Map[ConstantArray[1, Length[#]]& , tokens], Automatic]; tokens = PadRight[ tokens, Automatic]; tokenTypes = ConstantArray[0, Dimensions[tokens]]; embeddings = NetModel[ "MiniLM V2 Text Feature Extractor", "Part" -> modelName][<|"input_ids" -> tokens, "attention_mask" -> attentionMask, "token_type_ids" -> tokenTypes|>]; outputFeatures = meanPooler[ embeddings, attentionMask]; If[isString, First[ Map[Normalize, outputFeatures]], Map[Normalize, outputFeatures]]];$

Evaluate the utility functions to compute embeddings for the query and all passages by recording the time:

In[21]:=

Use the utility function to compute distance scores between the query embedding and all passage embeddings:

In[22]:=

{timeEmbDist, distanceScores} = AbsoluteTiming[
First@DistanceMatrix[queryEmbedding, passagesEmbeddings, DistanceFunction -> SquaredEuclideanDistance]];

Get the top-K relevant answer for the given query with the corresponding distance score:

In[23]:=

Show the results:

In[24]:=

Labeled[Grid[
Prepend[Transpose[{topPassagesEmbOnly}], {"Top-K Retrieved Passages by Embedding-Only"}], Frame -> All, Background -> {None, {LightGray}},
Alignment -> Left, Spacings -> {1, 2}], query, Top]

Out[24]=

Record the overall runtime, excluding the document embedding step, since those embeddings are computed once and reused across queries:

In[25]:=

Out[25]=

Two-stage document retrieval (embedding + reranker)

Two-stage retrieval first uses the fast embedder to fetch a small pool of top-N candidate documents and then applies the more accurate reranker only to those candidates. This keeps retrieval efficient while still benefiting from the reranker’s fine-grained relevance scoring. This process is summarized in the diagram:

In[26]:=

Out[26]=

Get the indices of the top-N closest passages by selecting the smallest distance values from all computed distance scores:

In[27]:=

topN = 32;
{topNscores, topNPassages} = Transpose@
TakeSmallestBy[Transpose[{distanceScores, passages}], First, topN];

Run the reranker on the top-N most similar passages to get the relevance scores:

In[28]:=

Out[28]=

Get the top-K relevant answers for the given query with the corresponding relevance score:

In[29]:=

Show the results:

In[30]:=

Labeled[Grid[
Prepend[Transpose[{topPassagesReranker}], {"Top-K Retrieved Passages by Two-Stage Document Retrieval"}], Frame -> All, Background -> {None, {LightGray}}, Alignment -> Left, Spacings -> {1, 2}], query, Top]

Out[30]=

Record the overall runtime, excluding the document embedding step, since those embeddings are computed once and reused across queries:

In[31]:=

Out[31]=

Timing summary

This section compares the total runtime across three retrieval strategies: embedding-only retrieval, reranker-only retrieval and two-stage retrieval. Each approach was evaluated on the same query and document set to measure their relative computational efficiency. Create an association for each retrieval method by storing the total runtime values and show the results:

In[32]:=

totalTimes = {{"Method", "Time (seconds)"}, {"Embedding-Only Retrieval", timeEmbOnly}, {"Reranker-Only Retrieval", timeRerankerOnly}, {"Two-Stage Retrieval", timeReRankerPipeline}};
Grid[totalTimes, Frame -> All, Background -> {None, {LightGray}}, Alignment -> Left, Spacings -> {2, 1}]

Out[33]=

Requirements

Wolfram Language 12.1 (March 2020) or above

Resource History

Date Created: 20 December 2024

Reference

M. Pande, S. Kumar, A. Y. Damle, "When Fine-Tuning Fails: Lessons from MS MARCO Passage Ranking" arXiv:2506.18535 (2025)
Available from: https://github.com/omnikingzeno/ms-marco-fine-tuning-experiments
Rights: MIT License