BLIP Image-Text Matching Nets Trained on Captioning Datasets

Find the similarity score between a text and an image

The BLIP family of models improves image-text matching, the task of pairing images with their corresponding textual descriptions. BLIP supports two methods for this task. The first uses unimodal feature extractors to process images and text independently, so that the two embeddings can be compared directly. The second uses BLIP's multimodal mixture of encoder-decoder architecture, which integrates visual and textual information jointly and captures finer interactions between an image and a text. BLIP was trained on 129 million image-text pairs and fine-tuned on the Microsoft COCO and Flickr30k datasets. BLIP also uses a method known as captioning and filtering (CapFilt) to refine its training dataset. Together, these advances yield a gain of 2.7% in average top-1 recall on image-text matching tasks.

Training Set Information

LAION-400M is a large-scale, open-access dataset comprising 400 million image-text pairs, filtered and prepared for research purposes in multimodal model training, sourced from Common Crawl web data between 2014 and 2021.

Visual Genome contains visual question answering data in a multi-choice setting. It consists of 101,174 images from MS-COCO with 1.7 million question-answer pairs and 17 questions per image on average.

The SBU Captions dataset contains one million Flickr images with captions curated from real users, filtered to ensure quality by retaining captions with meaningful content.

The Conceptual Captions 12M (CC12M) dataset contains around 12 million image-text pairs for vision-and-language pre-training, surpassing the size and diversity of the widely used Conceptual Captions (CC3M) dataset.

Conceptual Captions (CC3M), with more than three million images and web-sourced captions, offers diverse styles compared to MS-COCO. The raw descriptions are extracted from the web alt text associated with the web images. An automatic pipeline filters image-caption pairs for balanced cleanliness, informativeness, fluency and learnability of the resulting captions.

Flickr30k comprises a collection of 158,915 crowd-sourced captions describing 31,783 images, emphasizing individuals engaged in routine activities and various events.

Microsoft COCO, a dataset for image recognition, segmentation, captioning, object detection and keypoint estimation, consists of more than three hundred thousand images.

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=

Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=

Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=

Out[3]=

Pick a non-default uninitialized net:

In[4]:=

Out[4]=
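
The inputs of the four cells above are not reproduced in this text. As a minimal sketch, assuming the usual NetModel call forms (the "ParametersInformation" and "UninitializedEvaluationNet" properties) and taking "Part" -> "TextEncoder" from the later examples as one possible parameter choice, the steps could look as follows. Whether "TextEncoder" is a non-default value is not stated here, and running the code requires access to the Wolfram Neural Net Repository:

```wolfram
(* Name of the model family, as used throughout this page *)
modelName = "BLIP Image-Text Matching Nets Trained on Captioning Datasets";

(* Default pre-trained net *)
NetModel[modelName]

(* Available parameters and their allowed values *)
NetModel[modelName, "ParametersInformation"]

(* A specific member of the family; "TextEncoder" is only one possible choice *)
NetModel[{modelName, "Part" -> "TextEncoder"}]

(* The same member without its trained weights *)
NetModel[{modelName, "Part" -> "TextEncoder"}, "UninitializedEvaluationNet"]
```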

Basic usage

Define a test image:

In[5]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/1b9f7ff2-1f1b-46b3-8f7d-1106d55d5a31"]

Define a list of text descriptions:

In[6]:=

descriptions = {
"Blossoming rose on textbook among delicate petals",
"Photo of foggy forest",
"A portrait of a man in a park drinking tea",
"Yellow flower in tilt shift lens",
"Woman in black leather jacket and white pants",
"A portrait of a lady in a park drinking tea",
"Calm body of lake between mountains",
"Close up shot of a baby girl",
"Smiling man surfing on wave in ocean",
"A portrait of two ladies in a park eating pizza",
"A portrait of two ladies in a hotel drinking tea",
"A woman with eyeglasses smiling",
"Elderly woman carrying a baby",
"A portrait of two ladies in a park drinking tea"
};

Embed the test image and text descriptions into the same feature space:

In[7]:=

{textEncoder, imageEncoder} = NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> #}] & /@ {"TextEncoder", "ImageEncoder"};

In[8]:=

In[9]:=

Rank the text descriptions by their correspondence to the input image according to CosineDistance. A smaller distance (equivalently, a higher cosine score) means higher correspondence between the text and the image:

In[10]:=

In[11]:=

Out[11]=
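
The ranking cells are not reproduced in this text. The idea can be sketched on toy data, assuming that the cosine score is defined as one minus the cosine distance. The vectors and caption names below are made up for illustration and stand in for the actual encoder outputs:

```wolfram
(* Toy stand-ins for the image embedding and the text embeddings *)
toyImageVector = {0.9, 0.1, 0.3};
toyTextVectors = <|
  "caption A" -> {0.8, 0.2, 0.4},
  "caption B" -> {-0.5, 0.9, 0.1},
  "caption C" -> {0.1, 0.1, 0.9}
|>;

(* Cosine score = 1 - cosine distance, so a larger score means a closer match *)
toyCosineScores = (1 - CosineDistance[toyImageVector, #]) & /@ toyTextVectors;

(* Sort the captions from best to worst match *)
ranked = ReverseSort[toyCosineScores]
```

In this toy example "caption A" points in nearly the same direction as the image vector and ranks first, while "caption B" points away from it and ranks last.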

The "MultimodalEncoder" net outputs an image-text matching score that can be directly used to rank similarity between images and texts:

In[12]:=

In[13]:=

multimodalEncoder = NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "MultimodalEncoder"}];

In[14]:=

itmScores = multimodalEncoder[<|"Input" -> descriptions, "ImageFeatures" -> ConstantArray[imgFeatures, Length[descriptions]]|>];

Rank the text descriptions by their correspondence to the input image according to the image-text matching score. Higher scores mean higher correspondence between the text and the image:

In[15]:=

Out[15]=
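
The ranking cell is not reproduced in this text. A minimal sketch on toy data follows, assuming the encoder returns one pair of scores per text and that the second entry is the matching score, as suggested by the itmScores[[;; , 2]] expression used on this page. The numbers and names are illustrative only:

```wolfram
(* Toy stand-in for the encoder output: one {no-match, match} pair per text (assumed layout) *)
toyDescriptions = {"caption A", "caption B", "caption C"};
toyItmScores = {{0.10, 0.90}, {0.95, 0.05}, {0.40, 0.60}};

(* Keep only the second column, the assumed matching score *)
matchScores = toyItmScores[[All, 2]];

(* Pair each description with its score and sort from best to worst *)
rankedByItm = ReverseSort[AssociationThread[toyDescriptions -> matchScores]]
```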

Note that the image-text matching scores are significantly more precise than the cosine similarity scores:

In[16]:=

ReverseSortBy[
Transpose@Dataset[<|
"Textual Descriptions" -> descriptions,
"Image-Text \nMatching Scores" -> itmScores[[;; , 2]],
"Cosine\nScores" -> cosineScores
|>],
Last
]

Out[16]=

Compare a set of images with a set of texts using image-text matching scores:

In[17]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/9638e54d-4e0a-48e9-bb29-bcebf800fb62"]

In[18]:=

descriptions = {"a photo of Henry Ford", "a photo of Audrey Hepburn", "a photo of Ada Lovelace", "a photo of Stephen Wolfram", "a woman sitting on ottoman in front of paintings", "a dog looking out from a car window", "a man working out", "a close-up photo of a cheeseburger", "a photo of a church", "a metro sign in Paris"};

In[19]:=

In[20]:=

In[21]:=

ArrayPlot[
itmScores,
ColorFunction -> "BlueGreenYellow",
PlotLegends -> Automatic,
FrameTicks -> {
{Transpose[{Range[Length[descriptions]], descriptions}], None}, {None, Transpose[{Range[Length[images]], ImageResize[#, 60] & /@ images}]}
}
, ImageSize -> Large
]

Out[21]=

Feature space visualization

Get a set of images:

In[22]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/b9826de5-68db-4583-ba36-d8aacb6240d1"]

Visualize the feature space embedding performed by the image encoder. Notice that images from the same class are clustered together:

In[23]:=

FeatureSpacePlot[
Thread[NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "ImageEncoder"}][imgs, NetPort[{"Proj", "Output"}]] -> imgs],
LabelingSize -> 70,
RandomSeeding -> 37,
LabelingFunction -> Callout,
ImageSize -> 700,
AspectRatio -> 0.9
]

Out[23]=

Define a list of sentences in two categories:

In[24]:=

sentences = {
"The Empire State Building's observation deck in New York is a must-visit for its iconic skyline views.",
"The Charging Bull in the financial district of New York has become a symbol of market optimism.",
"Times Square in New York is best known for its bright billboards and bustling atmosphere.",
"The Statue of Liberty in New York stands as a universal symbol of freedom and opportunity.",
"Central Park in New York is an urban oasis, providing a natural escape amidst the city's skyscrapers.",
"Sacré-Cœur in Paris offers both spiritual solace and panoramic views from its hilltop location.",
"The Eiffel Tower's light show in Paris adds a romantic touch to the city's engineering marvel.",
"Bridges over the Seine in Paris are scenic spots that often host art and book vendors.",
"The Louvre's glass pyramid in Paris modernizes the entrance to a museum filled with historical art.",
"The Panthéon in Paris serves as a tribute to national heroes, complete with educational exhibits."
};

Visualize the similarity between the sentences using the net as a feature extractor:

In[25]:=

FeatureSpacePlot[
Thread[NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "TextEncoder"}][
sentences] -> (Tooltip[Style[Text@#, Medium]] & /@ sentences)],
LabelingSize -> {100, 100},
RandomSeeding -> 37,
LabelingFunction -> Callout,
ImageSize -> 700,
AspectRatio -> 0.9
]

Out[25]=

Zero-shot image classification

By using the text and image feature extractors together, it is possible to classify images among any set of classes without explicitly training a model for those particular classes (zero-shot classification). Obtain the FashionMNIST test data, which contains 10,000 test images in 10 classes:

In[26]:=

Display a few random examples from the set:

In[27]:=

Out[27]=

Get a mapping between class IDs and labels:

In[28]:=

Out[28]=

Generate the text templates for the FashionMNIST labels and embed them. The text templates will effectively act as classification labels:

In[29]:=

Out[29]=
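
The template cell is not reproduced in this text. As a minimal sketch, the ten FashionMNIST class names can be turned into sentences as follows. The wording "This is a photo of a ..." is a hypothetical template chosen for illustration, not necessarily the one used on this page, and the embedding step is left out:

```wolfram
(* The ten FashionMNIST class names *)
classNames = {"T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
  "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"};

(* Hypothetical template wording; one sentence per class *)
toyLabelTemplates = ("This is a photo of a " <> ToLowerCase[#]) & /@ classNames
```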

Classify an image from the test set. Obtain its embedding:

In[30]:=

Out[30]=

In[31]:=

imgFeatures = NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "ImageEncoder"}][img, NetPort["RawFeatures"]];

In[32]:=

Out[32]=

In[33]:=

scores = NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "MultimodalEncoder"}][<|
"Input" -> labelTemplates, "ImageFeatures" -> ConstantArray[imgFeatures, Length[labelTemplates]]|>];

In[34]:=

Out[34]=

The result of the classification is the description that has the highest similarity score:

In[35]:=

Out[35]=

Find the top 10 descriptions closest to the image:

In[36]:=

Out[36]=
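
The selection cells are not reproduced in this text. A minimal sketch on toy data follows, again assuming that the second entry of each score pair is the matching score. The templates and numbers are made up for illustration:

```wolfram
(* Toy templates and toy {no-match, match} score pairs (assumed layout) *)
toyTemplates = {"This is a photo of a sandal", "This is a photo of a sneaker",
  "This is a photo of a bag"};
toyScores = {{0.7, 0.3}, {0.2, 0.8}, {0.9, 0.1}};

(* Position of the largest matching score, then the corresponding template *)
bestIndex = First[Ordering[toyScores[[All, 2]], -1]];
prediction = toyTemplates[[bestIndex]]

(* The k best templates with their scores, here with k = 2 *)
topTwo = TakeLargest[AssociationThread[toyTemplates -> toyScores[[All, 2]]], 2]
```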

Cross-attention visualization for images

When processing its image and text inputs, the multimodal encoder attends to the image features produced by the image encoder. These features are a set of 577 vectors of length 768, where every vector except the first corresponds to one of the 24x24 patches taken from the input image. The extra first vector exists because the image encoder inherits the architecture of the image classification model Vision Transformer Trained on ImageNet Competition Data; here it has no special importance. The multimodal encoder's attention weights over these image features can therefore be interpreted as the image patches the encoder is "looking at" for every token of the input text, and this information can be visualized. Get a test image and compute the features:

In[37]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/19b223e8-ec43-4586-b11d-cf42ce05642f"]

In[38]:=

imgFeatures = NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "ImageEncoder"}][testImage, NetPort["RawFeatures"]];

In[39]:=

Out[39]=

Feed the features and some text to the multimodal encoder. There are 12 attention blocks in the encoder, and each generates its own set of attention weights. Inspect the attention weights for a single block:

In[40]:=

description = "a girl riding a brown horse on the beach";
tokenizer = NetExtract[
NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "TextEncoder"}], "Input"];
allTokens = NetExtract[tokenizer, "Tokens"];
codes = tokenizer[description];
tokens = allTokens[[codes]];

In[41]:=

tokenAttentionWeights = Thread[tokens -> NetModel[{"BLIP Image-Text Matching Nets Trained on Captioning Datasets", "Part" -> "MultimodalEncoder"}][<|"Input" -> description, "ImageFeatures" -> imgFeatures|>, NetPort[{"TextEncoder", "TextLayer1", "CrossAttention", "Attention", "AttentionWeights"}]]];

Each token corresponds to a 12x577 array of attention weights, where 12 is the number of attention heads and 577 is the number of image patches (24x24) plus the extra first vector:

In[42]:=

Out[42]=

Extract the attention weights related to the image patches:

In[43]:=

Out[44]=

Reshape the flat image patch dimension to 24x24 and take the average over the attention heads, thus obtaining a 24x24 attention matrix for each token:

In[45]:=

attentionWeights = ArrayReshape[
attentionWeights, {numTokens, numHeads, Sqrt[numPatches], Sqrt[numPatches]}];
attentionWeights = Map[Mean, attentionWeights, {1}];
attentionWeights // Dimensions

Out[46]=
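
The cell that defines attentionWeights, numTokens, numHeads and numPatches is not reproduced in this text. The whole shape manipulation can be sketched on random data with the dimensions stated above. The token count of 5 and all names prefixed with "toy" are assumptions made for illustration:

```wolfram
(* Toy weights with the stated layout: tokens x heads x (1 extra vector + 24*24 patches) *)
toyNumTokens = 5; toyNumHeads = 12; toyNumPatches = 24*24;
toyWeights = RandomReal[1, {toyNumTokens, toyNumHeads, toyNumPatches + 1}];

(* Drop the first entry of the last dimension, which does not correspond to a patch *)
toyPatchWeights = toyWeights[[All, All, 2 ;;]];

(* Unflatten the patch dimension into a 24 x 24 grid *)
toyReshaped = ArrayReshape[toyPatchWeights,
  {toyNumTokens, toyNumHeads, Sqrt[toyNumPatches], Sqrt[toyNumPatches]}];

(* Average over the heads: one 24 x 24 matrix per token *)
toyAveraged = Map[Mean, toyReshaped, {1}];
Dimensions[toyAveraged]
```

The final dimensions are {5, 24, 24}: one attention matrix per token.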

To reveal novel patch interactions specific to each token, suppress the consistently high attention weights by subtracting the minimum value aggregated across the token dimension:

In[47]:=
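
The input of this cell is not reproduced in this text. One way to carry out the subtraction, sketched here on random data, is to take the minimum over the token dimension for each patch and subtract that baseline from every token's matrix. This is an illustrative reading of the description above, not necessarily the exact code used on this page:

```wolfram
(* Toy per-token attention matrices: tokens x 24 x 24 *)
SeedRandom[1];
toyTokenWeights = RandomReal[1, {5, 24, 24}];

(* For each patch, the minimum weight over all tokens *)
toyBaseline = MapThread[Min, toyTokenWeights, 2];

(* Subtract the shared baseline from every token's matrix *)
toyAdjusted = (# - toyBaseline) & /@ toyTokenWeights;

{Dimensions[toyBaseline], Min[toyAdjusted]}
```

After the subtraction every patch has weight 0 for at least one token, so only the token-specific part of the attention remains.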

Visualize the attention weight matrices. Patches with higher values (red) are what the encoder is mostly "looking at" for the corresponding token:

In[48]:=

Out[48]=

In[49]:=

GraphicsGrid[
Partition[
MapThread[
Labeled[#1, #2] &, {MatrixPlot /@ attentionWeights, Keys[tokenAttentionWeights]}], 4, 4, {1, 1}, ""], ImageSize -> Large]

Out[49]=

Define a function to visualize the attention matrix on the image:

In[50]:=

visualizeAttention[img_Image, attentionMatrix_, label_] := Block[{heatmap, wh},
wh = ImageDimensions[img];
heatmap = ImageApply[{#, 1 - #, 1 - #} &, ImageAdjust@Image[attentionMatrix]];
heatmap = ImageResize[heatmap, wh];
Labeled[ImageCompose[img, {ColorConvert[heatmap, "RGB"], 0.5}], label]
]

Visualize the attention mechanism for each token. A recurrent noisy pattern of large positive activation can be observed, but notice the emphasis on the girl for the token "girl" and the beach for the token "beach":

In[51]:=

In[52]:=

Out[52]=

Net information

Inspect the number of parameters of all arrays in the net:

In[53]:=

Out[53]=

Obtain the total number of parameters:

In[54]:=

Out[54]=

Obtain the layer type counts:

In[55]:=

Out[55]=
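
The inputs of the three cells above are not reproduced in this text. The Information properties typically used for this purpose can be tried on a small stand-in net without downloading anything. The toy net below is an assumption made for illustration; for the BLIP nets, the NetChain would be replaced by the corresponding NetModel expression:

```wolfram
(* Small stand-in net: 3 inputs -> 4 hidden units -> 2 outputs *)
toyNet = NetInitialize@NetChain[{LinearLayer[4], Ramp, LinearLayer[2]}, "Input" -> 3];

(* Number of parameters of every array in the net *)
Information[toyNet, "ArraysElementCounts"]

(* Total number of parameters: (3*4 + 4) + (4*2 + 2) = 26 *)
Information[toyNet, "ArraysTotalElementCount"]

(* How many layers of each type the net contains *)
Information[toyNet, "LayerTypeCounts"]
```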

Resource History

Date Created: 19 June 2024

Reference

J. Li, D. Li, C. Xiong, S. Hoi, "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation," arXiv:2201.12086v2 (2022)
Available from: https://github.com/salesforce/BLIP/tree/main/models
Rights: BSD 3-Clause License