BLIP Image-Text Matching Nets Trained on Captioning Datasets

Find the similarity score between a text and an image

The BLIP family of models offers a new approach to image-text matching, significantly enhancing the ability to accurately pair images with their corresponding textual descriptions. BLIP provides two primary methods for this task. The first employs unimodal feature extractors to process images and text independently, so that matching reduces to comparing embeddings in a shared feature space. The second uses BLIP's multimodal mixture of encoder-decoder architecture, which integrates visual and textual information jointly, capturing intricate interactions between images and text for more precise matching. BLIP was trained on 129 million image-text pairs and fine-tuned on the Microsoft COCO and Flickr30k datasets. Additionally, BLIP employs a captioning and filtering (CapFilt) method to refine its training dataset. These advancements enable BLIP to excel in image-text matching tasks, yielding a +2.7% improvement in average recall@1 on image-text retrieval.
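
These two routes correspond to different nets in this resource: the unimodal route uses the "ImageEncoder" and "TextEncoder" parts and compares their embeddings with a cosine score, while the multimodal route uses the "MultimodalEncoder" part to score an image-text pair jointly. A minimal preview sketch (the part names, ports and input format follow the examples in the Examples section below; the test image and caption are arbitrary placeholders):

img = ExampleData[{"TestImage", "House"}];
{textEncoder, imageEncoder, multimodalEncoder} = NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> #}] & /@ {"TextEncoder", "ImageEncoder", "MultimodalEncoder"};

(* unimodal route: independent embeddings compared with a cosine score *)
unimodalScore = 1 - CosineDistance[Normal@imageEncoder[img, NetPort["Output"]], Normal@textEncoder["a photo of a house"]];

(* multimodal route: joint scoring of the image-text pair; the second entry is used as the matching score, as in the examples below *)
multimodalScore = multimodalEncoder[<|"Input" -> {"a photo of a house"}, "ImageFeatures" -> {imageEncoder[img, NetPort["RawFeatures"]]}|>][[1, 2]];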

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["BLIP Image-Text Matching Nets Trained on Captioning Datasets"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["BLIP Image-Text Matching Nets Trained on Captioning Datasets", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "MultimodalEncoder", "Architecture" -> "ViT-L/16"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "TextEncoder"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Define a test image:

In[5]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/1b9f7ff2-1f1b-46b3-8f7d-1106d55d5a31"]

Define a list of text descriptions:

In[6]:=
descriptions = {
   "Blossoming rose on textbook among delicate petals",
   "Photo of foggy forest",
   "A portrait of a man in a park drinking tea",
   "Yellow flower in tilt shift lens",
   "Woman in black leather jacket and white pants",
   "A portrait of a lady in a park drinking tea",
   "Calm body of lake between mountains",
   "Close up shot of a baby girl",
   "Smiling man surfing on wave in ocean",
   "A portrait of two ladies in a park eating pizza",
   "A portrait of two ladies in a hotel drinking tea",
   "A woman with eyeglasses smiling",
   "Elderly woman carrying a baby",
   "A portrait of two ladies in a park drinking tea"
   };

Embed the test image and text descriptions into the same feature space:

In[7]:=
{textEncoder, imageEncoder} = NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> #}] & /@ {"TextEncoder", "ImageEncoder"};
In[8]:=
imgFeatures = imageEncoder[img, NetPort["Output"]];
In[9]:=
textFeatures = textEncoder[descriptions];
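
Both encoders project into the same feature space, so the single image embedding should have the same length as each row of the text feature matrix. A quick sanity check (the actual embedding size depends on the chosen architecture):

(* one row of textFeatures per description; each row length equals the image embedding length *)
Length[imgFeatures] == Last[Dimensions[textFeatures]]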

Rank the text descriptions by their correspondence to the input image according to CosineDistance. Smaller distances mean higher correspondence between the text and the image and thus higher cosine scores:

In[10]:=
cosineScores = 1 - (First@
     DistanceMatrix[{imgFeatures}, textFeatures, DistanceFunction -> CosineDistance]);
In[11]:=
Dataset@ReverseSortBy[Thread[{descriptions, cosineScores}], Last]
Out[11]=
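
Since the cosine score is just one minus the cosine distance, the same scores can also be obtained directly as normalized dot products, which makes the relationship explicit. A minimal equivalent sketch using the features computed above (Normal ensures plain lists in case the net returns NumericArray objects):

(* normalize every embedding and take dot products with the image embedding *)
cosineScoresAlt = Map[Normalize, Normal[textFeatures]] . Normalize[Normal[imgFeatures]];

(* agrees with cosineScores up to numerical precision *)
Max[Abs[cosineScoresAlt - cosineScores]]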

The "MultimodalEncoder" net outputs an image-text matching score that can be directly used to rank similarity between images and texts:

In[12]:=
imgFeatures = imageEncoder[img, NetPort["RawFeatures"]];
In[13]:=
multimodalEncoder = NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "MultimodalEncoder"}];
In[14]:=
itmScores = multimodalEncoder[<|"Input" -> descriptions, "ImageFeatures" -> ConstantArray[imgFeatures, Length[descriptions]]|>];

Rank the text descriptions by their correspondence to the input image according to the image-text matching score. Higher scores mean higher correspondence between the text and the image:

In[15]:=
Dataset@ReverseSortBy[Thread[{descriptions, itmScores[[;; , 2]]}], Last]
Out[15]=
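
The multimodal encoder returns a pair of scores per description, and the second component is used above as the matching score. Assuming the pair is a softmax over the (no match, match) classes, each pair should sum to approximately 1; a quick check under that assumption:

(* each row of itmScores should be a probability pair summing to about 1 *)
Total /@ Normal[itmScores]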

Note that the image-text matching scores separate matching from non-matching descriptions much more sharply than the cosine similarity scores:

In[16]:=
ReverseSortBy[
 Transpose@Dataset[<|
    "Textual Descriptions" -> descriptions,
    "Image-Text \nMatching Scores" -> itmScores[[;; , 2]],
    "Cosine\nScores" -> cosineScores|>],
 Last]
Out[16]=
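
The gap between the two score types can also be inspected visually by plotting them side by side for each description. A minimal sketch using the variables defined above (chart styling is arbitrary):

BarChart[
 Transpose[{itmScores[[;; , 2]], cosineScores}],
 ChartLabels -> {Placed[descriptions, Axis, Rotate[#, Pi/2] &], None},
 ChartLegends -> {"Image-text matching score", "Cosine score"},
 ImageSize -> Large]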

Compare a set of images with a set of texts using image-text matching scores:

In[17]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/9638e54d-4e0a-48e9-bb29-bcebf800fb62"]
In[18]:=
descriptions = {"a photo of Henry Ford", "a photo of Audrey Hepburn", "a photo of Ada Lovelace", "a photo of Stephen Wolfram", "a woman sitting on ottoman in front of paintings", "a dog looking out from a car window", "a man working out", "a clos-up photo of a cheeseburger", "a photo of a church", "a metro sign in Paris"};
In[19]:=
imgFeatures = imageEncoder[images, NetPort["RawFeatures"]];
In[20]:=
itmScores = Table[
   (* matching score of each description (rows) against each image (columns) *)
   multimodalEncoder[<|"Input" -> ids, "ImageFeatures" -> features|>][[2]],
   {ids, descriptions}, {features, imgFeatures}];
In[21]:=
ArrayPlot[
 itmScores,
 ColorFunction -> "BlueGreenYellow",
 PlotLegends -> Automatic,
 FrameTicks -> {
   {Transpose[{Range[Length[descriptions]], descriptions}], None},
   {None, Transpose[{Range[Length[images]], ImageResize[#, 60] & /@ images}]}},
 ImageSize -> Large]
Out[21]=
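
From the same score matrix, the best-matching description for each image can be read off directly. A minimal sketch reusing the itmScores matrix computed above, whose rows correspond to descriptions and columns to images:

(* position of the highest-scoring description in each image column *)
bestIndex = First[Ordering[#, -1]] & /@ Transpose[itmScores];

(* pair each image thumbnail with its best-matching description *)
Thread[(Thumbnail /@ images) -> descriptions[[bestIndex]]]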

Resource History

Reference