BLIP Image Captioning Nets Trained on MS-COCO Data

Generate a textual description of an image

The BLIP family of models introduces a new approach to image captioning within its vision-language pre-training framework. Its multimodal mixture of encoder-decoder architecture significantly enhances performance in generating accurate and contextually rich captions for images (+2.8% in CIDEr). The architecture comprises three key components: unimodal encoders that align visual and linguistic representations, an image-grounded text encoder that fuses the two modalities and an image-grounded text decoder tailored for caption generation. BLIP was trained on 129 million image-text pairs and fine-tuned on the MS-COCO Captions dataset, and it employs the captioning and filtering (CapFilt) method to refine the training data. With its design and dataset optimization strategy, BLIP sets new standards in image captioning, delivering state-of-the-art results for users seeking high-performance and reliable image description capabilities in their applications.

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["BLIP Image Captioning Nets Trained on MS-COCO Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["BLIP Image Captioning Nets Trained on MS-COCO Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"BLIP Image Captioning Nets Trained on MS-COCO Data", "Architecture" -> "ViT-B/16"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"BLIP Image Captioning Nets Trained on MS-COCO Data", "Architecture" -> "ViT-B/16", "Part" -> "TextDecoder"}, "UninitializedEvaluationNet"]
Out[4]=

Pick the tokenizer:

In[5]:=
NetModel["BLIP Image Captioning Nets Trained on MS-COCO Data", "Tokenizer"]
Out[5]=
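
The tokenizer converts a string into a list of integer token codes, and the corresponding string tokens can be recovered from its "Tokens" list. A minimal sketch, based on the calls used by the evaluation function below:

tokenizer = NetModel["BLIP Image Captioning Nets Trained on MS-COCO Data", "Tokenizer"];
codes = tokenizer["a photography of"];
tokens = NetExtract[tokenizer, "Tokens"];
tokens[[codes]]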

Evaluation function

Write an evaluation function to automate the computation of the image features and the text generation loop, directly producing the caption:

In[6]:=
Options[netevaluate] = {"Architecture" -> "ViT-L/16", MaxIterations -> 25, "NumberOfFrames" -> 16, "Temperature" -> 0, "TopProbabilities" -> 10, TargetDevice -> "CPU"};
netevaluate[input : (_?ImageQ | _?VideoQ), prompt : (_?StringQ | None) : None, opts : OptionsPattern[]] := Module[
   {images, outSpec, tokenizer, tokens, promptCodes, imageEncoder, textDecoder, imgFeatures, index = 1, init, generated = {}, eosCode = 103, netOut, isGenerating},
   (* For a video input, extract uniformly spaced frames; an image is used as is *)
   images = Switch[input,
     _?VideoQ,
      VideoFrameList[input, OptionValue["NumberOfFrames"]],
     _?ImageQ,
      input
     ];
   {imageEncoder, textDecoder} = NetModel[{"BLIP Image Captioning Nets Trained on MS-COCO Data", "Architecture" -> OptionValue["Architecture"], "Part" -> #}] & /@ {"ImageEncoder", "TextDecoder"};
   tokenizer = NetModel["BLIP Image Captioning Nets Trained on MS-COCO Data", "Tokenizer"];
   tokens = NetExtract[tokenizer, "Tokens"];
   promptCodes = If[SameQ[prompt, None], tokenizer["a photography of"], tokenizer[prompt]];
   (* Compute the image features; for a video, average the features over the frames *)
   imgFeatures = imageEncoder[images, TargetDevice -> OptionValue[TargetDevice]];
   If[MatchQ[input, _?VideoQ],
    imgFeatures = Mean[imgFeatures]
    ];
   (* Request all output ports, sampling the next token with the given temperature and top-probabilities settings *)
   outSpec = Replace[NetPort /@ Information[textDecoder, "OutputPortNames"], NetPort["Output"] -> (NetPort["Output"] -> {"RandomSample", "Temperature" -> OptionValue["Temperature"], "TopProbabilities" -> OptionValue["TopProbabilities"]}), {1}];
   (* Initial decoder input: the first prompt token, the image features and 24 empty recurrent states *)
   init = Join[
     <|
      "Index" -> index,
      "Input" -> First[promptCodes],
      "ImageFeatures" -> imgFeatures
      |>,
     Association@Table["State" <> ToString[i] -> {}, {i, 24}]
     ];
   isGenerating = False;
   (* Feed the prompt tokens first, then keep generating until the end-of-sequence code or the iteration limit is reached *)
   NestWhile[
    Function[
     If[index === Length[promptCodes], isGenerating = True];
     netOut = textDecoder[#, outSpec, TargetDevice -> OptionValue[TargetDevice]];
     If[isGenerating, AppendTo[generated, netOut["Output"]]];
     Join[
      KeyMap[StringReplace["OutState" -> "State"], netOut],
      <|
       "Index" -> ++index,
       "Input" -> If[isGenerating, netOut["Output"], promptCodes[[index]]],
       "ImageFeatures" -> imgFeatures
       |>
      ]
     ],
    init,
    #Input =!= eosCode &,
    1,
    Length[promptCodes] + OptionValue[MaxIterations]
    ];
   (* Drop the trailing end-of-sequence code and convert the token codes back to a string *)
   If[Last[generated] === eosCode,
    generated = Most[generated]
    ];
   StringTrim@StringJoin@tokens[[Join[Rest@promptCodes, generated]]]
   ];

Basic usage

Define a test image:

In[7]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/8424c547-b6dd-4026-9164-90147c2f0720"]

Generate an image caption:

In[8]:=
netevaluate[img]
Out[8]=

Try different initial prompts:

In[9]:=
netevaluate[img, #] & /@ {"the woman", "the weather today", "the park is", "the flowers", ""}
Out[9]=
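
The evaluation function also exposes sampling and hardware options. For instance, a nonzero temperature combined with top-probability sampling yields varied captions across repeated runs (a sketch; the option values are arbitrary and the outputs are not deterministic):

Table[netevaluate[img, "Temperature" -> 0.8, "TopProbabilities" -> 5], 3]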

Obtain a test video:

In[10]:=
video = ResourceData["Sample Video: Practicing Yoga"];
VideoFrameList[video, 5]
Out[10]=

Generate a caption for the video. The caption is generated from a number of uniformly spaced frames whose features are averaged. The number of frames can be controlled via an option:

In[11]:=
netevaluate[video, "a video of", "NumberOfFrames" -> 8]
Out[11]=

Feature space visualization

Get a set of images of cars and airplanes:

In[12]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/e92e7f52-89d7-4c52-ad8f-431625272097"]

Visualize the feature space embedding performed by the image encoder. Notice that images from the same class are clustered together:

In[13]:=
FeatureSpacePlot[
 Thread[NetModel[{"BLIP Image Captioning Nets Trained on MS-COCO Data", "Part" -> "ImageEncoder"}][imgs] -> imgs],
 LabelingSize -> 70,
 LabelingFunction -> Callout,
 ImageSize -> 700,
 AspectRatio -> 0.9,
 RandomSeeding -> 37
 ]
Out[13]=
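
The same embeddings can also be used for a simple nearest-neighbor image retrieval. In this sketch, the first of the 577 feature vectors (the extra vector inherited from the Vision Transformer architecture) is taken as a global image descriptor; this choice is an illustrative assumption rather than part of the resource's documented usage:

features = NetModel[{"BLIP Image Captioning Nets Trained on MS-COCO Data", "Part" -> "ImageEncoder"}][imgs];
nearestImage = Nearest[features[[All, 1]] -> imgs];
nearestImage[features[[1, 1]], 4]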

Cross-attention visualization

When generating a new token of the image caption, the text decoder attends to the image features produced by the image encoder. These features are a set of 577 vectors of length 1024, where every vector except the first corresponds to one of the 24x24 patches taken from the input image (the extra vector exists because the image encoder inherits the architecture of the image classification model Vision Transformer Trained on ImageNet Competition Data, but in this case it doesn't carry any special importance). This means that the decoder's attention weights over these image features can be interpreted as the image patches the decoder is "looking at" when generating each new token, and it is possible to visualize this information. Get a test image and compute the features:

In[14]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/259f9367-28bd-4655-bf5b-8ecf6000c8de"]
In[15]:=
imgFeatures = NetModel[{"BLIP Image Captioning Nets Trained on MS-COCO Data", "Part" -> "ImageEncoder"}][testImage];
In[16]:=
Dimensions[imgFeatures]
Out[16]=
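
The number of feature vectors is consistent with the 24x24 grid of image patches plus the extra vector described above:

24^2 + 1 == First[Dimensions[imgFeatures]]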

Generate the caption, accumulating the cross-attention weights at each token generation. The text decoder has 12 attention blocks, each producing its own set of attention weights; as is usual in deep learning, the deeper blocks generate the most semantic information. Hence the following code extracts the attention weights from the last attention block only:

In[17]:=
tokenizer = NetModel["BLIP Image Captioning Nets Trained on MS-COCO Data", "Tokenizer"];
tokens = NetExtract[tokenizer, "Tokens"];
textDecoder = NetModel[{"BLIP Image Captioning Nets Trained on MS-COCO Data", "Part" -> "TextDecoder"}];
promptCodes = tokenizer["a photography of"];
init = Join[
   <|
    "Index" -> 1,
    "Input" -> First[promptCodes],
    "ImageFeatures" -> imgFeatures
    |>,
   Association@Table["State" <> ToString[i] -> {}, {i, 24}]
   ];
attentionWeightsNetSpec = {"TextEncoder", "TextLayer12", "CrossAttention", "Attention", "AttentionWeights"};
outSpec = NetPort /@ Join[
    {{"Output"}, attentionWeightsNetSpec},
    Table[{"OutState" <> ToString[i]}, {i, 24}]
    ];
index = 1;
tokenAttentionWeights = {};
generated = {};
eosCode = 103;
isGenerating = False;
NestWhile[
  Function[
   If[index === Length[promptCodes], isGenerating = True];
   netOut = textDecoder[#, outSpec];
   attentionWeights = netOut[NetPort[attentionWeightsNetSpec]];
   netOut = KeyDrop[netOut, NetPort[attentionWeightsNetSpec]];
   If[isGenerating,
    AppendTo[tokenAttentionWeights, tokens[[netOut["Output"]]] -> attentionWeights];
    AppendTo[generated, netOut["Output"]]
    ];
   Join[
    KeyMap[StringReplace["OutState" -> "State"], netOut],
    <|
     "Index" -> ++index,
     "Input" -> If[isGenerating, netOut["Output"], promptCodes[[index]]],
     "ImageFeatures" -> imgFeatures
     |>
    ]
   ],
  init,
  #Input =!= eosCode &,
  1,
  Length[promptCodes] + 20
  ];
If[Last[generated] === eosCode,
  generated = Most[generated]
  ];
generatedText = StringTrim@StringJoin@tokens[[generated]]
Out[31]=

Each generated token corresponds to a 12x577 array of attention weights, where 12 is the number of attention heads and 577 is the number of 24x24 image patches plus the extra vector:

In[32]:=
MapAt[Dimensions, tokenAttentionWeights, {All, 2}]
Out[32]=

Extract the attention weights related to the image patches and shift them by subtracting their minimum value across the generated tokens, which makes them more suitable for visualization:

In[33]:=
attentionWeights = Values[tokenAttentionWeights][[All, All, 2 ;;]];
attentionWeights = attentionWeights - Threaded@ArrayReduce[Min, attentionWeights, 1];
In[34]:=
{numTokens, numHeads, numPatches} = Dimensions[attentionWeights]
Out[34]=

Reshape the flat image patch dimension to 24x24 and take the average over the attention heads, thus obtaining a 24x24 attention matrix for each of the 10 generated tokens:

In[35]:=
attentionWeights = ArrayReshape[
   attentionWeights, {numTokens, numHeads, Sqrt[numPatches], Sqrt[numPatches]}];
attentionWeights = ArrayReduce[Mean, attentionWeights, {2}];
In[36]:=
Dimensions[attentionWeights]
Out[36]=

Visualize the attention weight matrices. Patches with higher values (red) are the ones the decoder mostly "looks at" when generating the corresponding token:

In[37]:=
GraphicsGrid[
 Partition[
  MapThread[
   Labeled[#1, #2] &, {MatrixPlot /@ attentionWeights, Keys[tokenAttentionWeights]}], 5], ImageSize -> Full]
Out[37]=

Define a function to visualize the attention matrix on the image:

In[38]:=
visualizeAttention[img_Image, attentionMatrix_, label_] := Block[{heatmap, wh},
  wh = ImageDimensions[img];
  (* Map the attention matrix to a heatmap where high values appear red *)
  heatmap = ImageApply[{#, 1 - #, 1 - #} &, ImageAdjust@Image[attentionMatrix]];
  heatmap = ImageResize[heatmap, wh];
  (* Blend the heatmap over the image at 40% opacity and resize the result for display *)
  Labeled[
   ImageResize[ImageCompose[img, {ColorConvert[heatmap, "RGB"], 0.4}],
     wh*500/Min[wh]], label]
  ]

Visualize the attention mechanism for each token. Notice the emphasis on the head for the tokens "little" and "girl," on the hands for "taking" and on the camera for "camera":

In[39]:=
imgs = MapThread[
   visualizeAttention[testImage, #1, #2] &, {attentionWeights, Keys[tokenAttentionWeights]}];
In[40]:=
Grid@Partition[imgs, 6, 6, {1, 1}, ""]
Out[40]=

Net information

Inspect the number of parameters of all arrays in the net:

In[41]:=
Information[
 NetModel[
  "BLIP Image Captioning Nets Trained on MS-COCO Data"], "ArraysElementCounts"]
Out[41]=

Obtain the total number of parameters:

In[42]:=
Information[
 NetModel[
  "BLIP Image Captioning Nets Trained on MS-COCO Data"], "ArraysTotalElementCount"]
Out[42]=

Obtain the layer type counts:

In[43]:=
Information[
 NetModel[
  "BLIP Image Captioning Nets Trained on MS-COCO Data"], "LayerTypeCounts"]
Out[43]=

Resource History

Reference