Resource retrieval
Get the pre-trained net:
NetModel parameters
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
Pick a non-default net by specifying the parameters:
Pick a non-default uninitialized net:
Basic usage
Define a test image:
Define a list of text descriptions:
Embed the test image and text descriptions into the same feature space:
Rank the text descriptions by their correspondence to the input image according to the CosineDistance. Smaller distances mean higher cosine similarity and therefore closer correspondence between the text and the image:
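As a language-neutral illustration of the ranking step above, the following Python sketch computes cosine distances between one image vector and several text vectors and sorts the texts from closest to farthest. The three-dimensional vectors and the caption names are made up for the example; real embeddings come from the image and text encoders.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_by_cosine_distance(image_vec, text_vecs):
    """Return (label, distance) pairs sorted from smallest to largest distance."""
    scored = [(label, cosine_distance(image_vec, vec))
              for label, vec in text_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical stand-in embeddings (assumption: not produced by the net).
image_embedding = [1.0, 0.0, 0.0]
text_embeddings = {
    "caption A": [0.0, 1.0, 0.0],  # orthogonal to the image vector
    "caption B": [2.0, 0.0, 0.0],  # same direction as the image vector
    "caption C": [1.0, 1.0, 0.0],  # 45 degrees away from the image vector
}

ranking = rank_by_cosine_distance(image_embedding, text_embeddings)
for label, distance in ranking:
    print(f"{label}: {distance:.4f}")
```

The vector length does not affect the result: "caption B" is twice as long as the image vector but still has distance 0, because only the direction matters.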
The "MultimodalEncoder" net outputs an image-text matching score that can be directly used to rank similarity between images and texts:
Rank the text descriptions by their correspondence to the input image according to the image-text matching score. Higher scores mean closer correspondence between the text and the image:
Note that the image-text matching scores are significantly more precise than the cosine similarity scores:
Compare a set of images with a set of texts using image-text matching scores:
Feature space visualization
Get a set of images:
Visualize the feature space embedding performed by the image encoder. Notice that images from the same class are clustered together:
Define a list of sentences in two categories:
Visualize the similarity between the sentences using the net as a feature extractor:
Zero-shot image classification
By using the text and image feature extractors together, it's possible to perform generic image classification between any set of classes without explicitly training a model for those particular classes (zero-shot classification). Obtain the FashionMNIST test data, which contains 10,000 test images in 10 classes:
Display a few random examples from the set:
Get a mapping between class IDs and labels:
Generate the text templates for the FashionMNIST labels and embed them. The text templates will effectively act as classification labels:
Classify an image from the test set. Obtain its embedding:
The result of the classification is the description that has the highest similarity score:
Find the top 10 descriptions closest to the image:
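The zero-shot procedure above reduces to building one text per class, embedding it and choosing the class whose embedding is most similar to the image embedding. The Python sketch below shows that logic under stated assumptions: the template "a photo of a ..." is a hypothetical example (the source does not state the template used), and the three-dimensional embeddings are made-up stand-ins for encoder outputs.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(image_vec, class_vecs, k):
    """Return the k (label, similarity) pairs with the highest similarity."""
    scored = [(label, cosine_similarity(image_vec, vec))
              for label, vec in class_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def classify(image_vec, class_vecs):
    """Return the label of the single most similar class embedding."""
    return top_k(image_vec, class_vecs, 1)[0][0]

# Three of the FashionMNIST class labels, turned into text prompts.
labels = ["Trouser", "Sneaker", "Bag"]
templates = {label: f"a photo of a {label.lower()}" for label in labels}  # hypothetical template

# Hypothetical embeddings of the prompts and of one test image.
class_embeddings = {
    "Trouser": [0.9, 0.1, 0.0],
    "Sneaker": [0.1, 0.9, 0.1],
    "Bag": [0.0, 0.2, 0.9],
}
image_embedding = [0.2, 0.8, 0.1]

predicted = classify(image_embedding, class_embeddings)
closest_two = top_k(image_embedding, class_embeddings, 2)
print(predicted, closest_two)
```

Because the classes are defined only by their text prompts, changing the label list changes the classifier without any retraining.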
Cross-attention visualization for images
When processing its image and text inputs, the multimodal encoder attends to the image features produced by the image encoder. These features are a set of 577 vectors of length 768, where every vector except the first corresponds to one of the 24x24 patches taken from the input image. The extra vector exists because the image encoder inherits the architecture of the image classification model Vision Transformer Trained on ImageNet Competition Data; it has no special importance here. The multimodal encoder's attention weights over these image features can therefore be interpreted as the image patches the encoder is "looking at" for each token of the input text, and this information can be visualized. Get a test image and compute the features:
Feed the features and some text to the multimodal encoder. There are 12 attention blocks in the encoder and each generates its own set of attention weights. Inspect the attention weights for a single block:
Each token corresponds to a 12x577 array of attention weights, where 12 is the number of attention heads and 577 is the number of 24x24 patches (576) plus the extra vector:
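The arithmetic behind the 577 feature vectors can be sketched in a few lines of Python. The mapping below assumes the patch vectors are stored in row-major order (left to right, then top to bottom), which the text does not state explicitly; the function name is hypothetical.

```python
GRID_SIZE = 24
NUM_FEATURES = GRID_SIZE * GRID_SIZE + 1  # 576 patch vectors plus the extra first vector

def feature_index_to_patch(index):
    """Map a 0-based feature index to a (row, column) patch position.

    Index 0 is the extra vector and has no patch, so None is returned.
    Assumption: patches are laid out in row-major order.
    """
    if not 0 <= index < NUM_FEATURES:
        raise IndexError(f"feature index {index} is outside 0..{NUM_FEATURES - 1}")
    if index == 0:
        return None
    return divmod(index - 1, GRID_SIZE)

print(NUM_FEATURES)
print(feature_index_to_patch(1), feature_index_to_patch(25), feature_index_to_patch(576))
```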
Extract the attention weights related to the image patches:
Reshape the flat image patch dimension to 24x24 and take the average over the attention heads, thus obtaining a 24x24 attention matrix for each token:
To reveal novel patch interactions specific to each token, suppress the consistently high attention weights by subtracting the minimum value aggregated across the token dimension:
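The three processing steps above (dropping the extra vector, averaging over heads and reshaping, then subtracting the minimum across tokens) can be sketched in plain Python on random stand-in weights. This is only an illustration: the function names are hypothetical, the row-major reshape is an assumption, and the minimum is taken separately for each patch position across all tokens, which is one reading of the description.

```python
import random

NUM_HEADS = 12
GRID = 24
NUM_FEATURES = GRID * GRID + 1  # 577: the extra first vector plus 576 patches

def patch_attention_maps(weights):
    """weights[token][head] is a list of 577 attention weights.

    Returns maps[token][row][col]: the extra first weight is dropped, the
    remaining 576 weights are averaged over heads and reshaped to 24x24.
    """
    maps = []
    for per_head in weights:
        patch_only = [head[1:] for head in per_head]  # drop the extra vector
        mean = [sum(column) / len(patch_only) for column in zip(*patch_only)]
        maps.append([mean[r * GRID:(r + 1) * GRID] for r in range(GRID)])
    return maps

def subtract_token_minimum(maps):
    """Subtract, at every patch position, the minimum value over all tokens."""
    result = [[[0.0] * GRID for _ in range(GRID)] for _ in maps]
    for r in range(GRID):
        for c in range(GRID):
            lowest = min(m[r][c] for m in maps)
            for t, m in enumerate(maps):
                result[t][r][c] = m[r][c] - lowest
    return result

# Random stand-in for the attention weights of one block (4 tokens).
rng = random.Random(0)
num_tokens = 4
weights = [[[rng.random() for _ in range(NUM_FEATURES)]
            for _ in range(NUM_HEADS)]
           for _ in range(num_tokens)]

maps = subtract_token_minimum(patch_attention_maps(weights))
print(len(maps), len(maps[0]), len(maps[0][0]))
```

After the subtraction, every patch position has value 0 for at least one token, so attention that is uniformly high for all tokens no longer dominates the maps.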
Visualize the attention weight matrices. Patches with higher values (red) are the ones being "looked at" the most when generating the corresponding token:
Define a function to visualize the attention matrix on the image:
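Overlaying a 24x24 attention matrix on an image requires bringing the matrix to the image's resolution first. The Python sketch below shows one simple way to do that, rescaling the values to the range 0 to 1 and enlarging the matrix with nearest-neighbor upsampling. It is not the function used on this page; the helper names and the tiny 2x2 input are made up for the example.

```python
def normalize(matrix):
    """Rescale values linearly so the smallest is 0 and the largest is 1.

    A constant matrix is mapped to all zeros to avoid dividing by zero.
    """
    flat = [value for row in matrix for value in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    if span == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[(value - lowest) / span for value in row] for row in matrix]

def upsample_nearest(matrix, height, width):
    """Enlarge a matrix to height x width by repeating the nearest source cell."""
    rows, cols = len(matrix), len(matrix[0])
    return [[matrix[y * rows // height][x * cols // width] for x in range(width)]
            for y in range(height)]

# Tiny stand-in for a 24x24 attention matrix.
attention = [[1.0, 2.0],
             [3.0, 5.0]]

heat = upsample_nearest(normalize(attention), 4, 6)
for row in heat:
    print(row)
```

The resulting array has one value per pixel and can be mapped to a color scale and blended with the image using any plotting tool.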
Visualize the attention mechanism for each token. A recurring noisy pattern of large positive activations can be observed, but notice the emphasis on the girl for the token "girl" and on the beach for the token "beach":
Net information
Inspect the number of parameters of all arrays in the net:
Obtain the total number of parameters:
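The total is the sum, over all arrays, of the product of each array's dimensions. The Python sketch below shows that calculation on a handful of hypothetical array names and shapes; they are placeholders and are not taken from this net.

```python
from math import prod

# Hypothetical array shapes (assumption: for illustration only).
array_shapes = {
    "embedding/Weights": (768, 768),
    "embedding/Biases": (768,),
    "classifier/Weights": (2, 768),
    "classifier/Biases": (2,),
}

# Number of parameters per array: the product of its dimensions.
counts = {name: prod(shape) for name, shape in array_shapes.items()}

# Total number of parameters: the sum over all arrays.
total = sum(counts.values())
print(counts)
print(total)
```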
Obtain the layer type counts: