# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Represent words and images as vectors

Released in 2022, the OpenCLIP (Contrastive Language–Image Pre-training) family of transformer-based neural nets is a collection of models trained from scratch as pure feature extractors that learn joint text and image representations. Using the LAION-5B dataset, which contains more than five billion image-text pairs, the authors examine the effects of scaling on the performance of OpenCLIP models. The study finds that performance improves consistently as model size, data and computational resources are scaled up, following a power law. Interestingly, OpenCLIP models come out ahead in zero-shot retrieval tasks, while the original OpenAI CLIP models excel in zero-shot classification. The authors hypothesize that the training dataset significantly influences these task-specific scaling differences.

- LAION-5B, containing 5.85 billion CLIP-filtered image-text pairs, 14 times bigger than LAION-400M, previously the biggest openly accessible image-text dataset in the world.

Get the pre-trained net:

In[1]:= |

Out[1]= |

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:= |

Out[2]= |

Pick a non-default net by specifying the parameters:

In[3]:= |

Out[3]= |

Pick a non-default uninitialized net:

In[4]:= |

Out[4]= |

Use the OpenCLIP text encoder to obtain the feature representation of a piece of text:

In[5]:= |

The default OpenCLIP text encoder embeds the input text into a vector of size 512:

In[6]:= |

Out[6]= |

Use the OpenCLIP image encoder to obtain the feature representation of an image:

In[7]:= |

The default OpenCLIP image encoder embeds the input image into a vector of size 512:

In[8]:= |

Out[8]= |

Get a set of images:

In[9]:= |

Visualize the features of a set of images:

In[10]:= |

Out[10]= |

Define a list of sentences in two broad categories:

In[11]:= |

Visualize the similarity between the sentences using the net as a feature extractor:

In[12]:= |

Out[12]= |

Define a test image:

In[13]:= |

Define a list of text descriptions:

In[14]:= |

Embed the test image and text descriptions into the same feature space:

In[15]:= |

Rank the text descriptions by how well they correspond to the input image, using CosineDistance. Smaller distances mean higher correspondence between the text and the image:

In[16]:= |

Out[16]= |
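The ranking step can be illustrated outside the notebook with a minimal Python sketch. It is not the code from the cells above: the three-dimensional vectors and the captions are made up for illustration (real OpenCLIP embeddings have 512 entries), and `cosine_distance` is a hand-written helper using the usual definition of one minus the cosine of the angle between two vectors.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy embeddings, standing in for the encoder outputs.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a diagram of a circuit": [0.0, 0.1, 0.9],
}

# Sort the descriptions from smallest to largest distance to the image.
ranked = sorted(
    text_embeddings,
    key=lambda text: cosine_distance(image_embedding, text_embeddings[text]),
)
for text in ranked:
    print(text, round(cosine_distance(image_embedding, text_embeddings[text]), 3))
```

The first entry of `ranked` is the description whose embedding points in nearly the same direction as the image embedding.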

By using the text and image feature extractors together, it's possible to perform generic image classification between any set of classes without having to explicitly train any model for those particular classes (zero-shot classification). Obtain the FashionMNIST test data, which contains 10,000 test images in 10 classes:

In[17]:= |

Display a few random examples from the set:

In[18]:= |

Out[18]= |

Get a mapping between class IDs and labels:

In[19]:= |

Out[19]= |

Generate the text templates for the FashionMNIST labels and embed them. The text templates will effectively act as classification labels:

In[20]:= |

Out[20]= |

In[21]:= |

In[22]:= |

Out[22]= |
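As a rough illustration of what "text templates" means here, the following Python sketch turns the ten standard FashionMNIST class names into one sentence per class. The template wording and the helper name `make_prompt` are assumptions made for this example; the template used in the cells above is not shown in this page and may differ.

```python
# The ten FashionMNIST class names, indexed by class ID 0..9.
labels = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def make_prompt(label):
    """Hypothetical template: wrap a class name in a short sentence."""
    return f"a photo of the fashion item: {label.lower()}"

# One sentence per class ID; each sentence is later embedded by the text encoder.
prompts = {class_id: make_prompt(label) for class_id, label in enumerate(labels)}
```

Each sentence would then be passed through the text encoder, and the resulting vectors play the role of class prototypes.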

Classify an image from the test set. Obtain its embedding:

In[23]:= |

Out[23]= |

In[24]:= |

In[25]:= |

Out[25]= |

The result of the classification is the description of the embedding that is closest to the image embedding:

In[26]:= |

Out[26]= |

Find the top 10 descriptions nearest to the image embedding:

In[27]:= |

Out[27]= |

Obtain the accuracy of this procedure on the entire test set. Extract the features for all the images (if a GPU is available, setting TargetDevice -> "GPU" is recommended as the computation will take several minutes on CPU):

In[28]:= |

In[29]:= |

Out[29]= |

Calculate the distance matrix between the computed text and image embeddings:

In[30]:= |

Obtain the top-1 predictions:

In[31]:= |

Obtain the final classification results:

In[32]:= |

Out[32]= |
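The whole zero-shot procedure (distance matrix, top-1 prediction, accuracy) fits in a few lines of plain Python. This is a sketch on made-up two-dimensional embeddings with two classes, not the FashionMNIST evaluation itself; the function names are chosen for this example only.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(image_embedding, class_embeddings):
    """Return the index of the class embedding nearest to the image."""
    distances = [cosine_distance(image_embedding, c) for c in class_embeddings]
    return min(range(len(distances)), key=distances.__getitem__)

def accuracy(image_embeddings, true_labels, class_embeddings):
    """Fraction of images whose nearest class embedding is the true class."""
    predictions = [classify(e, class_embeddings) for e in image_embeddings]
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

# Hypothetical toy data: two class embeddings and four image embeddings.
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_embeddings = [[0.9, 0.2], [0.1, 0.8], [0.6, 0.5], [0.3, 0.7]]
true_labels = [0, 1, 1, 1]

score = accuracy(image_embeddings, true_labels, class_embeddings)
print(score)
```

In this toy set, the third image lies slightly closer to class 0 than to its true class 1, so it is the one misclassified example.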

Just like the original Vision Transformer (see the model "Vision Transformer Trained on ImageNet Competition Data"), the image feature extractor divides the input image into a 7x7 grid of patches and performs self-attention on a set of 50 vectors: 49 vectors, or "tokens," representing the patches and one additional "feature extraction token" that is eventually used to produce the final feature representation of the image. The attention procedure for this model can therefore be visualized by inspecting the attention weights between the feature extraction token and the patch tokens. Define a test image:

In[33]:= |

Extract the attention weights used for the last block of self-attention:

In[34]:= |

In[35]:= |

Out[35]= |

Extract the attention weights between the feature extraction token and the input patches. These weights can be interpreted as which patches in the original image the net is "looking at" in order to perform the feature extraction:

In[36]:= |

Out[37]= |

Reshape the weights as a 3D array of 12 7x7 matrices. Each matrix corresponds to an attention head, while each element of the matrices corresponds to a patch in the original image:

In[38]:= |

In[39]:= |

Out[39]= |
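The bookkeeping behind this reshape can be sketched in Python with placeholder data. The sketch assumes an attention array laid out as heads x tokens x tokens with the feature extraction token at index 0; both are assumptions about the layout, and the uniform weights stand in for the real ones extracted from the net.

```python
GRID = 7                        # patches per side, as stated in the text
NUM_HEADS = 12                  # attention heads, as stated in the text
NUM_TOKENS = GRID * GRID + 1    # 49 patch tokens + 1 feature extraction token

def placeholder_attention(num_heads, num_tokens):
    """Stand-in for real weights: uniform rows, so every row sums to 1."""
    row = [1.0 / num_tokens] * num_tokens
    return [[row[:] for _ in range(num_tokens)] for _ in range(num_heads)]

def feature_token_maps(attention, grid, feature_index=0):
    """For each head, take the feature-token row, drop its weight on itself,
    and fold the remaining grid*grid weights into a grid x grid matrix."""
    maps = []
    for head in attention:
        row = head[feature_index]
        patch_weights = row[:feature_index] + row[feature_index + 1:]
        maps.append([patch_weights[r * grid:(r + 1) * grid] for r in range(grid)])
    return maps

attention = placeholder_attention(NUM_HEADS, NUM_TOKENS)
maps = feature_token_maps(attention, GRID)

# Average the per-head maps into a single grid x grid map.
mean_map = [
    [sum(m[r][c] for m in maps) / len(maps) for c in range(GRID)]
    for r in range(GRID)
]
```

Element `maps[h][r][c]` is the weight that head `h` assigns to the patch in row `r`, column `c` of the image.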

Visualize the attention weight matrices. Patches with higher values (red) are the ones each attention head is mostly "looking at":

In[40]:= |

Out[40]= |

Define a function to visualize the attention matrix on an image:

In[41]:= |

Visualize the mean attention across all the attention heads:

In[42]:= |

Out[42]= |

Visualize each attention head separately:

In[43]:= |

Out[43]= |

The text feature extractor tokenizes the input string, prepending and appending the special tokens StartOfString and EndOfString, and then performs causal self-attention on the token embedding vectors. After the self-attention stack, the last vector (corresponding to the token EndOfString) is used to obtain the final feature representation of the text. The attention procedure for this model can therefore be visualized by inspecting the attention weights between the last vector and the previous ones. Define a test string:

In[44]:= |
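To make "causal self-attention" concrete, here is a small Python sketch of a causally masked softmax. It is a generic illustration, not the net's implementation: the token list is a hypothetical tokenization and the attention scores are all zero, so only the effect of the mask is visible.

```python
import math

def causal_attention_weights(scores):
    """Row i gets a softmax over columns 0..i; later columns get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                       # tokens at or before position i
        peak = max(visible)                          # subtract the max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical tokenization of a short string, with the special tokens added.
tokens = ["StartOfString", "a", "cat", "EndOfString"]
scores = [[0.0] * len(tokens) for _ in tokens]       # placeholder raw scores
weights = causal_attention_weights(scores)

for token, row in zip(tokens, weights):
    print(f"{token:>14}", [round(w, 2) for w in row])
```

Only the last row has nonzero weight on every token, which is why the vector for EndOfString is the one that can summarize the whole string.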

Extract the NetEncoder of the net to encode the string:

In[45]:= |

Out[45]= |

In[46]:= |

Out[46]= |

Extract the list of available tokens and inspect how the input string was tokenized. Even though the BPE tokenization generally segments the input into subwords, it's common to observe that all tokens correspond to full words. Also observe that the StartOfString and EndOfString tokens are added automatically:

In[47]:= |

In[48]:= |

Out[48]= |

In[49]:= |

Out[49]= |

Feed the string to the net and extract the attention weights used for the last block of self-attention:

In[50]:= |

Out[51]= |

Extract the attention weights between the last vector and the previous ones, leaving the initial vector corresponding to StartOfString out. These weights can be interpreted as which tokens in the original sentence the net is "looking at" in order to perform the feature extraction:

In[52]:= |

Out[53]= |

Inspect the average attention weights for each token across the attention heads. Observe that the token the net is mostly focused on is "hair":

In[54]:= |

Out[54]= |
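The averaging step can be sketched in Python on a made-up example. The two attention matrices and the four-token sentence below are hypothetical; the layout `attention[h][i][j]` (the weight token `i` puts on token `j` in head `h`) is an assumption for this illustration.

```python
def last_token_attention(attention):
    """Per head, take the last row and drop the first column (StartOfString)."""
    return [head[-1][1:] for head in attention]

def mean_over_heads(rows):
    """Average the per-head rows element by element."""
    return [sum(column) / len(rows) for column in zip(*rows)]

# Hypothetical tokens and two hypothetical causal attention heads.
tokens = ["StartOfString", "red", "car", "EndOfString"]
attention = [
    [[1.0, 0.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.2, 0.3, 0.5, 0.0],
     [0.1, 0.2, 0.6, 0.1]],
    [[1.0, 0.0, 0.0, 0.0],
     [0.6, 0.4, 0.0, 0.0],
     [0.3, 0.3, 0.4, 0.0],
     [0.3, 0.2, 0.4, 0.1]],
]

mean_weights = mean_over_heads(last_token_attention(attention))
kept_tokens = tokens[1:]                              # StartOfString left out
top_token = kept_tokens[mean_weights.index(max(mean_weights))]
print(dict(zip(kept_tokens, mean_weights)), top_token)
```

Here the token with the largest average weight is "car", playing the role that "hair" plays in the example above.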

Visualize each head separately:

In[55]:= |

Out[55]= |

Extract the attention weights for all 12 attention layers:

In[56]:= |

In[57]:= |

Out[58]= |

Compute the average across all heads, leaving the StartOfString token out:

In[59]:= |

Out[60]= |

Define a function to visualize the attention weights:

In[61]:= |

Explore the attention weights for every layer. A thicker arrow pointing from token A to token B indicates that the layer is paying attention to token B when generating the vector corresponding to token A:

In[62]:= |

Out[62]= |

Use the pre-trained model to build a classifier for telling apart indoor and outdoor photos. Create a test set and a training set:

In[63]:= |

In[64]:= |

Remove the last linear layer from the pre-trained net:

In[65]:= |

Out[65]= |

Create a new net composed of the pre-trained net followed by a linear layer and a softmax layer:

In[66]:= |

Train on the dataset, freezing all the weights except for those in the "linearNew" layer (use TargetDevice -> "GPU" for training on a GPU):

In[67]:= |

Out[67]= |

Perfect accuracy is obtained on the test set:

In[68]:= |

Out[68]= |
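The idea of training only a new linear layer on top of frozen features can be sketched in plain Python. This is a toy illustration under stated assumptions: the two-dimensional "features" are made up and stand in for the frozen outputs of the pre-trained net, and the training loop is ordinary stochastic gradient descent on softmax cross-entropy, not the training procedure used in the cells above.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, biases, features):
    """Linear layer followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.5):
    """Fit only the linear layer; the feature vectors are never modified."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = predict_proba(weights, biases, x)
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)   # d(loss)/d(logit c)
                for d in range(dim):
                    weights[c][d] -= lr * grad * x[d]
                biases[c] -= lr * grad
    return weights, biases

# Hypothetical frozen features: class 0 = "indoor", class 1 = "outdoor".
train_features = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2],
                  [0.1, 0.9], [0.2, 0.7], [0.3, 0.8]]
train_labels = [0, 0, 0, 1, 1, 1]
test_features = [[0.85, 0.15], [0.15, 0.85]]

weights, biases = train_linear_probe(train_features, train_labels, num_classes=2)
predictions = []
for x in test_features:
    probs = predict_proba(weights, biases, x)
    predictions.append(probs.index(max(probs)))
print(predictions)
```

Because only the small linear layer is fitted, training is cheap, and the quality of the result depends mostly on how well the frozen features separate the classes.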

Inspect the number of parameters of all arrays in the net:

In[69]:= |

Out[69]= |

Obtain the total number of parameters:

In[70]:= |

Out[70]= |
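The total is simply the sum, over all arrays, of the product of each array's dimensions. A minimal Python sketch of that arithmetic, using hypothetical array names and shapes rather than the real ones from the net:

```python
import math

# Hypothetical array shapes; the real net has many more arrays.
array_shapes = {
    "embedding/Weights": (100, 8),
    "linear/Weights": (4, 8),
    "linear/Biases": (4,),
}

# Number of parameters per array is the product of its dimensions.
counts = {name: math.prod(shape) for name, shape in array_shapes.items()}
total = sum(counts.values())
print(counts, total)
```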

Obtain the layer type counts:

In[71]:= |

Out[71]= |

- M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, J. Jitsev, "Reproducible Scaling Laws for Contrastive Language–Image Learning," arXiv:2212.07143 (2022)
- Available from: https://github.com/mlfoundations/open_clip
- Rights: Copyright © 2012–2021 Gabriel Ilharco, Mitchell Wortsman, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, John Miller, Hongseok Namkoong, Hannaneh Hajishirzi, Ali Farhadi, Ludwig Schmidt