# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Detect, segment and localize objects in an image

The Fast Segment Anything Model (FastSAM) is a real-time, convolutional neural network (CNN)-based solution for the "segment anything" task that leverages the YOLO V8 Segment architecture. The task is to segment any object in an image based on a user hint that specifies the object: a single location in the image, a region of interest or a textual prompt. FastSAM divides the task into two steps: all-instance segmentation, in which the CNN segments all objects and regions in the image, followed by prompt-guided selection, which produces the final segmentation mask according to the hint. In the case of text-guided selection, the textual prompt and the segmented parts of the image are processed by CLIP models to produce embeddings that can be compared.

- Microsoft COCO, a dataset for image recognition, segmentation, captioning, object detection and keypoint estimation, consisting of more than three hundred thousand images.
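The two-step flow described above can be sketched in plain Python. This is a minimal runnable illustration, not the model's actual code: the stage-1 CNN is stubbed out with fixed toy masks, and all function names are hypothetical placeholders.

```python
# Minimal sketch of the FastSAM two-step flow: all-instance segmentation,
# then prompt-guided selection. Stage 1 is stubbed with toy 3x3 binary masks.

def all_instance_segmentation(image):
    # Stub for stage 1: pretend the CNN found two objects in a 3x3 image.
    return [
        [[1, 1, 0], [1, 1, 0], [0, 0, 0]],  # object A (top-left)
        [[0, 0, 0], [0, 0, 1], [0, 1, 1]],  # object B (bottom-right)
    ]

def prompt_guided_selection(masks, hint):
    # Stage 2: select masks according to the hint (only point hints sketched here).
    if hint is None:
        return masks  # no hint: return every mask
    kind, value = hint
    if kind == "point":
        r, c = value
        return [m for m in masks if m[r][c]]  # masks containing the point
    raise NotImplementedError(kind)

masks = all_instance_segmentation(None)  # placeholder instead of pixel data
picked = prompt_guided_selection(masks, ("point", (0, 0)))  # hint on object A
```

With the point hint at the top-left corner, only object A's mask is selected; with no hint, both masks are returned.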

Get the pre-trained net:

In[1]:= |

Out[2]= |

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[3]:= |

Out[3]= |

Pick a non-default net by specifying the parameters:

In[4]:= |

Out[4]= |

Pick a non-default uninitialized net:

In[5]:= |

Out[5]= |

Write an evaluation function:

In[6]:= |

Define a test image:

In[7]:= |

Obtain a segmentation mask for the desired object using a textual prompt, a bounding box and a single point:

In[8]:= |

In[9]:= |

In[10]:= |

All the obtained masks are binary and have the dimensions of the input image:

In[11]:= |

Out[11]= |

Out[12]= |

Out[13]= |

Visualize the mask obtained via the text prompt:

In[14]:= |

Out[14]= |

Visualize the box hint and mask obtained from it:

In[15]:= |

Out[15]= |

Visualize the point hint and mask obtained from it:

In[16]:= |

Out[16]= |

If no hint is specified, binary masks for all the identified objects will be returned:

In[17]:= |

In[18]:= |

Out[18]= |

Visualize the masks:

In[19]:= |

Out[19]= |

In[20]:= |

Out[20]= |

When hints are used, the process is split into two phases: all-instance segmentation, where the entire image is segmented into its components, followed by prompt-guided selection, where the final mask is obtained using the prompt. Define an image:

In[21]:= |

The initial all-instance segmentation follows the pipeline of the model "YOLO V8 Segment Trained on MS-COCO Data." Obtain all the segmentation masks:

In[22]:= |

Show all the obtained masks on top of the image:

In[23]:= |

Out[23]= |

In the case of point guidance, the final mask is the union of all the masks that contain the point. Show a point hint relative to the image and all of the masks:

In[24]:= |

Out[24]= |

Check which masks contain the point:

In[25]:= |

Out[26]= |

Take the union of the masks and show the final result:

In[27]:= |

Out[28]= |

In[29]:= |

Out[29]= |
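The point-guidance rule above can be sketched in plain Python (this is an illustration, not the Wolfram Language code used in this page): keep every candidate mask that contains the hint point, then take their pixelwise union.

```python
# Point-guided selection: union of all binary masks containing the hint point.

def point_guided_selection(masks, point):
    """Union of all binary masks (2D lists) that contain point (row, col)."""
    r, c = point
    selected = [m for m in masks if m[r][c] == 1]
    if not selected:
        return None  # no mask contains the point
    rows, cols = len(selected[0]), len(selected[0][0])
    return [[int(any(m[i][j] for m in selected)) for j in range(cols)]
            for i in range(rows)]

# Two overlapping 3x3 masks contain the point (1, 1); a third does not.
m1 = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
m2 = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
m3 = [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
final = point_guided_selection([m1, m2, m3], (1, 1))  # union of m1 and m2
```

Only `m1` and `m2` contain the point, so the result is their union; `m3` is ignored.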

In the case of box guidance, the final mask is the one with the maximal intersection over union (IOU) with the box. Show a box hint relative to the image and all masks:

In[30]:= |

Out[31]= |

Obtain the measure of the intersections between the masks and the box:

In[32]:= |

Out[33]= |

Obtain the measure of the unions between the masks and the box:

In[34]:= |

Out[35]= |

Compute the IOU and select the mask with maximal value:

In[36]:= |

Out[36]= |

In[37]:= |

Out[37]= |

Show the final result:

In[38]:= |

Out[38]= |

In[39]:= |

Out[39]= |
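The box-guidance rule above can be sketched in plain Python (an illustration, not the Wolfram Language code on this page): compute the IOU between each binary mask and the box region, then keep the mask with the maximal value. The box is given as hypothetical `(r0, c0, r1, c1)` pixel coordinates, inclusive.

```python
# Box-guided selection: pick the mask with the maximal IOU against the box.

def iou_with_box(mask, box):
    """IOU between a binary mask (2D list) and a box (r0, c0, r1, c1), inclusive."""
    r0, c0, r1, c1 = box
    inter = union = 0
    for i, row in enumerate(mask):
        for j, v in enumerate(row):
            in_box = r0 <= i <= r1 and c0 <= j <= c1
            inter += 1 if (v and in_box) else 0
            union += 1 if (v or in_box) else 0
    return inter / union if union else 0.0

def box_guided_selection(masks, box):
    return max(masks, key=lambda m: iou_with_box(m, box))

m1 = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]  # coincides exactly with the box below
m2 = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]  # only grazes the box
best = box_guided_selection([m1, m2], (0, 0, 1, 1))  # box covers the top-left 2x2
```

Here `m1` has IOU 1.0 with the box while `m2` has IOU 1/7, so `m1` is selected.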

In the case of text guidance, the segmented parts of the image and the text are fed to multi-domain CLIP models, producing feature vectors that can be compared. The selected mask is the one closest to the text in feature space. Define a text hint and segment the image into its parts:

In[40]:= |

In[41]:= |

Out[41]= |

Obtain the multi-domain features:

In[42]:= |

Out[42]= |

In[43]:= |

Out[44]= |

Compute the cosine similarities between the features and select the mask with the maximal value:

In[45]:= |

Out[45]= |

In[46]:= |

Out[46]= |

Show the final result:

In[47]:= |

Out[47]= |

In[48]:= |

Out[48]= |
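The text-guidance rule can be sketched in plain Python (an illustration, not the Wolfram Language code on this page): given CLIP-style feature vectors for each segmented region and for the text prompt, pick the region whose feature has the highest cosine similarity to the text feature. The two-dimensional vectors below are made up for illustration; real CLIP embeddings have hundreds of dimensions.

```python
# Text-guided selection: choose the region feature closest to the text feature
# in cosine-similarity terms.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def text_guided_selection(region_features, text_feature):
    """Index of the region whose feature is closest to the text feature."""
    sims = [cosine_similarity(f, text_feature) for f in region_features]
    return max(range(len(sims)), key=sims.__getitem__)

regions = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]  # toy per-region embeddings
best = text_guided_selection(regions, [1.0, 0.0])  # second region is closest
```

The second region's vector points almost exactly along the text vector, so its index is returned.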

Inspect the number of parameters of all arrays in the net:

In[49]:= |

Out[49]= |

Obtain the total number of parameters:

In[50]:= |

Out[50]= |

Obtain the layer type counts:

In[51]:= |

Out[51]= |

Display the summary graphic:

In[52]:= |

Out[52]= |

Export the net to the ONNX format:

In[53]:= |

Out[53]= |

Get the size of the ONNX file:

In[54]:= |

Out[54]= |

The size is similar to the byte count of the resource object:

In[55]:= |

Out[55]= |

Check some metadata of the ONNX model:

In[56]:= |

Out[56]= |

Import the model back into Wolfram Language. However, the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:

In[57]:= |

Out[57]= |

- X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, J. Wang, "Fast Segment Anything," arXiv:2306.12156 (2023)
- Available from: https://github.com/ultralytics/ultralytics
- Rights: GNU General Public License