BLIP Visual Question Answering Nets Trained on VQA Data

Generate an answer to a question given an image

The BLIP model offers a state-of-the-art approach to visual question answering (VQA), enabling precise and context-aware answers to questions about images (a +1.6% improvement in VQA score over previous methods). At the heart of BLIP is its multimodal mixture of encoder-decoder architecture, which aligns visual and language information, captures complex interactions between images and text, and generates detailed and accurate responses. BLIP was pretrained on 129 million image-text pairs and fine-tuned on the VQA2.0 visual question answering dataset. Additionally, BLIP enhances its training data with a method called captioning and filtering (CapFilt), in which a captioner generates synthetic captions for web images and a filter removes noisy image-text pairs. Thanks to these advancements, BLIP excels in VQA tasks, providing users with high-quality, reliable answers to questions about visual content.

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["BLIP Visual Question Answering Nets Trained on VQA Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["BLIP Visual Question Answering Nets Trained on VQA Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "Part" -> "TextDecoder"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "Part" -> "TextEncoder", "CapFilt" -> False}, "UninitializedEvaluationNet"]
Out[4]=
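
The three model parts used by the evaluation function in the next section can also be retrieved together. This is just a convenience sketch mirroring that code:

{imageEncoder, textEncoder, textDecoder} =
  NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "CapFilt" -> True, "Part" -> #}] & /@
   {"ImageEncoder", "TextEncoder", "TextDecoder"};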

Evaluation function

Define an evaluation function that uses all the model parts to compute the image features and automate the answer generation:

In[5]:=
Options[netevaluate] = {"CapFilt" -> True, MaxIterations -> 25, "NumberOfFrames" -> 16, "Temperature" -> 0, "TopProbabilities" -> 10, TargetDevice -> "CPU"};
netevaluate[input : (_?ImageQ | _?VideoQ), question : (_?StringQ), opts : OptionsPattern[]] := Module[
   {imgInput, imageEncoder, textEncoder, textDecoder, questionFeatures, tokens, imgFeatures, outSpec, init, netOut, index = 1, generated = {}, eosCode = 103, bosCode = 102},
   (* For video input, sample a number of uniformly spaced frames. *)
   imgInput = Switch[input,
     _?VideoQ,
     	VideoFrameList[input, OptionValue["NumberOfFrames"]],
     _?ImageQ,
     	input
     ];
   (* Fetch the three model parts. *)
   {imageEncoder, textEncoder, textDecoder} =
    NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "CapFilt" -> OptionValue["CapFilt"], "Part" -> #}] & /@ {"ImageEncoder", "TextEncoder", "TextDecoder"};
   tokens = NetExtract[textEncoder, {"Input", "Tokens"}];
   (* Compute the image features, averaging them across frames for video input. *)
   imgFeatures = imageEncoder[imgInput, TargetDevice -> OptionValue[TargetDevice]];
   If[MatchQ[input, _?VideoQ],
    	imgFeatures = Mean[imgFeatures]
    ];
   (* Encode the question conditioned on the image features. *)
   questionFeatures = textEncoder[<|"Input" -> question, "ImageFeatures" -> imgFeatures|>];
   (* Request all decoder output ports, sampling the "Output" token with the "Temperature" and "TopProbabilities" options. *)
   outSpec = Replace[NetPort /@ Information[textDecoder, "OutputPortNames"],
     NetPort["Output"] -> (NetPort["Output"] -> {"RandomSample", "Temperature" -> OptionValue["Temperature"], "TopProbabilities" -> OptionValue["TopProbabilities"]}), {1}];
   (* Initial decoder input: the start token code and empty attention states. *)
   init = Join[
     <|
      "Index" -> index,
      "Input" -> bosCode,
      "QuestionFeatures" -> questionFeatures
      |>,
     Association@Table["State" <> ToString[i] -> {}, {i, 24}]
     ];
   (* Generate tokens until the end-of-sequence code or MaxIterations is reached. *)
   NestWhile[
    Function[
     netOut = textDecoder[#, outSpec, TargetDevice -> OptionValue[TargetDevice]];
     AppendTo[generated, netOut["Output"]];
     Join[
      KeyMap[StringReplace["OutState" -> "State"], netOut],
      <|
       "Index" -> ++index,
       "Input" -> netOut["Output"],
       "QuestionFeatures" -> questionFeatures
       |>
      ]
     ],
    init,
    #Input =!= eosCode &,
    1,
    OptionValue[MaxIterations]
    ];
   (* Drop the end-of-sequence code and assemble the answer string. *)
   If[Last[generated] === eosCode,
    generated = Most[generated]
    ];
   StringTrim@StringJoin@tokens[[generated]]
   ];

Basic usage

Define a test image:

In[6]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/d514a8d9-c524-421a-abd1-a37b3c399aa8"]

Answer a question about the image:

In[7]:=
netevaluate[img, "what is the girl doing?"]
Out[7]=

Try different questions:

In[8]:=
netevaluate[img, #] & /@ {"Where is she?", "What is on the blanket?", "How many people are with her?", "How is the weather?", "Who took the picture?"}
Out[8]=
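
The evaluation function defined above also exposes decoding options. For instance, a nonzero "Temperature" together with "TopProbabilities" makes the decoder sample among the most likely tokens instead of picking greedily, so repeated calls can produce different answers (illustrative call):

netevaluate[img, "what is the girl doing?", "Temperature" -> 1, "TopProbabilities" -> 5]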

Obtain a test video:

In[9]:=
video = ResourceData["Sample Video: Friends at the Beach"];
In[10]:=
VideoFrameList[video, 5]
Out[10]=

Generate an answer to a question about the video. The answer is generated from a number of uniformly spaced frames whose features are averaged. The number of frames can be controlled via an option:

In[11]:=
netevaluate[video, "What is happening?", "NumberOfFrames" -> 8]
Out[11]=
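
For video input, netevaluate samples the requested number of frames and averages their image features before they are passed to the text encoder. A minimal sketch of that internal step, using a hypothetical frame count of 4:

frames = VideoFrameList[video, 4];
frameFeatures = NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "Part" -> "ImageEncoder"}][frames];
Dimensions[Mean[frameFeatures]]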

Feature space visualization

Get a set of images of coffee and ice cream:

In[12]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/f014372c-6ba0-4368-9462-a2c6e91e12a6"]

Visualize the feature space embedding performed by the image encoder. Notice that images from the same class are clustered together:

In[13]:=
FeatureSpacePlot[
 Thread[NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "Part" -> "ImageEncoder"}][imgs] -> imgs],
 LabelingSize -> 70,
 LabelingFunction -> Callout,
 ImageSize -> 700,
 AspectRatio -> 0.9
 ]
Out[13]=

Cross-attention visualization for images

When encoding a question, the text encoder attends to the image features produced by the image encoder. These features are a set of 577 vectors of length 768, where every vector except the first corresponds to one of the 24x24 patches taken from the input image (the extra vector exists because the image encoder inherits the architecture of the image classification model Vision Transformer Trained on ImageNet Competition Data, but in this case it doesn’t have any special importance). This means that the text encoder's attention weights over these image features can be interpreted as the image patches the encoder is "looking at" when processing each token of the question, and it is possible to visualize this information. Get a test image and compute the features:

In[15]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/94d26d67-ee4b-4c03-a45b-d341783a72e6"]
In[16]:=
imgFeatures = NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data",
      "Part" -> "ImageEncoder"}][testImage];
In[17]:=
Dimensions[imgFeatures]
Out[17]=
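
As a quick sanity check (not part of the original workflow), verify that the 577 feature vectors are the 24x24 image patches plus the extra vector:

Dimensions[imgFeatures] === {24^2 + 1, 768}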

Define a question and pass it through the text encoder. There are 12 attention blocks in the encoder, and each generates its own set of attention weights. Inspect the attention weights for a single block:

In[18]:=
question = "what is the boy doing?";
tokenizer = NetExtract[
   NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "Part" -> "TextEncoder"}], "Input"];
tokens = NetExtract[tokenizer, "Tokens"];
In[19]:=
tokenAttentionWeights = Thread[tokens[[tokenizer[question]]] -> NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "Part" -> "TextEncoder"}][<|"Input" -> question, "ImageFeatures" -> imgFeatures|>, NetPort[{"TextEncoder", "TextLayer1", "CrossAttention", "Attention", "AttentionWeights"}]]];
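
All 12 attention blocks can also be inspected in one evaluation. This is a sketch that assumes the remaining blocks expose their cross-attention weights at ports named analogously to "TextLayer1" ("TextLayer2" through "TextLayer12"):

allLayerWeights = NetModel[{"BLIP Visual Question Answering Nets Trained on VQA Data", "Part" -> "TextEncoder"}][
   <|"Input" -> question, "ImageFeatures" -> imgFeatures|>,
   Table[NetPort[{"TextEncoder", "TextLayer" <> ToString[i], "CrossAttention", "Attention", "AttentionWeights"}], {i, 12}]
   ];
Dimensions /@ allLayerWeights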

Each of the question's tokens corresponds to a 12x577 array of attention weights, where 12 is the number of attention heads and 577 is the number of image patches (24x24) plus the extra vector:

In[20]:=
MapAt[Dimensions, tokenAttentionWeights, {All, 2}]
Out[20]=
In[21]:=
Length[tokenAttentionWeights]
Out[21]=

Extract the attention weights related to the image patches:

In[22]:=
attentionWeights = Values[tokenAttentionWeights][[All, All, 2 ;;]];
{numTokens, numHeads, numPatches} = Dimensions[attentionWeights]
Out[23]=

Reshape the flat image patch dimension to 24x24 and take the average over the attention heads, thus obtaining a 24x24 attention matrix for each of the eight question tokens:

In[24]:=
attentionWeights = ArrayReshape[
   attentionWeights, {numTokens, numHeads, Sqrt[numPatches], Sqrt[numPatches]}];
attentionWeights = Map[Mean, attentionWeights, {1}];
attentionWeights // Dimensions
Out[25]=

To reveal the patch interactions specific to each token, suppress the consistently high attention weights by subtracting their minimum across the token dimension:

In[26]:=
attentionWeights = attentionWeights - Threaded@ArrayReduce[Min, attentionWeights, {1}];

Visualize the attention weight matrices. Patches with higher values (red) are what is mostly being "looked at" when processing the corresponding token:

In[27]:=
GraphicsGrid[
 Partition[
  MapThread[
   Labeled[#1, #2] &, {MatrixPlot /@ attentionWeights, Keys[tokenAttentionWeights]}], 4, 4, {1, 1}, ""], ImageSize -> Large]
Out[27]=

Define a function to visualize the attention matrix on the image:

In[28]:=
visualizeAttention[img_Image, attentionMatrix_, label_] := Block[{heatmap, wh},
  wh = ImageDimensions[img];
  (* map the attention values to colors: high attention becomes red, low attention cyan *)
  heatmap = ImageApply[{#, 1 - #, 1 - #} &, ImageAdjust@Image[attentionMatrix]];
  heatmap = ImageResize[heatmap, ImageDimensions[img]];
  (* overlay the heat map on the image at 50% opacity and resize for display *)
  Labeled[
   ImageResize[ImageCompose[img, {ColorConvert[heatmap, "RGB"], 0.5}],
     wh*500/Min[wh]], label]
  ]
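
For example, the attention matrix of a single token can be overlaid on the image (here the fifth token of the question, " boy", once StartOfString is included):

visualizeAttention[testImage, attentionWeights[[5]], Keys[tokenAttentionWeights][[5]]]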

Visualize the attention mechanism for each token. A recurrent noisy pattern of large positive activation can be observed, but notice the emphasis on the head for the token "boy" and on the hands for "doing":

In[29]:=
imgs = MapThread[
   visualizeAttention[testImage, #1, #2] &, {attentionWeights, Keys[tokenAttentionWeights]}];
In[30]:=
Grid@Partition[imgs, 4, 4, {1, 1}, ""]
Out[30]=

Net information

Inspect the number of parameters of all arrays in the net:

In[31]:=
Information[
 NetModel[
  "BLIP Visual Question Answering Nets Trained on VQA Data"], "ArraysElementCounts"]
Out[31]=

Obtain the total number of parameters:

In[32]:=
Information[
 NetModel[
  "BLIP Visual Question Answering Nets Trained on VQA Data"], "ArraysTotalElementCount"]
Out[32]=

Obtain the layer type counts:

In[33]:=
Information[
 NetModel[
  "BLIP Visual Question Answering Nets Trained on VQA Data"], "LayerTypeCounts"]
Out[33]=

Resource History

Reference