FastSAM Trained on MS-COCO Data

Detect, segment and localize objects in an image

The Fast Segment Anything Model (FastSAM) is a novel, real-time solution to the "segment anything" task based on convolutional neural networks (CNNs), leveraging the YOLO V8 Segment architecture. The "segment anything" task aims to segment any object within an image based on a user hint that specifies the target object: a single location in the image, a region of interest or a textual prompt. FastSAM divides the task into two steps: all-instance segmentation by the CNN, which segments all objects and regions, followed by prompt-guided selection, which produces the final segmentation mask depending on the hint. In the case of text-guided selection, the textual prompt and the segmented parts of the image are processed by CLIP models to produce embeddings that can be compared.
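The two-step flow can be outlined in Python. This is a minimal sketch under stated assumptions, not FastSAM's actual implementation: `segment_all` stands in for the CNN's all-instance segmentation and `select` for the prompt-guided selection rule.

```python
# Illustrative sketch of the FastSAM flow (not the actual implementation):
# step 1 segments everything, step 2 selects among the masks by prompt.

def fastsam_like(segment_all, select, image, prompt=None):
    """segment_all: image -> list of binary instance masks (the CNN step).
    select: (masks, prompt) -> final mask (prompt-guided selection).
    With no prompt, all instance masks are returned as-is."""
    masks = segment_all(image)
    if prompt is None:
        return masks
    return select(masks, prompt)

# Toy usage with stand-in callables (hypothetical, for shape only):
masks = fastsam_like(lambda img: [[0, 1], [1, 0]], None, "image")
picked = fastsam_like(lambda img: [[0, 1], [1, 0]],
                      lambda ms, p: ms[p], "image", prompt=1)
```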

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["FastSAM Trained on MS-COCO Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[3]:=
NetModel["FastSAM Trained on MS-COCO Data", "ParametersInformation"]
Out[3]=

Pick a non-default net by specifying the parameters:

In[4]:=
NetModel[{"FastSAM Trained on MS-COCO Data", "Size" -> "S"}]
Out[4]=

Pick a non-default uninitialized net:

In[5]:=
NetModel[{"FastSAM Trained on MS-COCO Data", "Size" -> "S"}, "UninitializedEvaluationNet"]
Out[5]=

Evaluation function

Write an evaluation function:

In[6]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/1bf64db7-7cba-468f-8429-aedcac281e74"]

Basic usage

Define a test image:

In[7]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/01a72e89-665e-4f78-a61a-fa85d2c4806c"]

Obtain a segmentation mask for the desired object using a textual prompt, a bounding box and a single point:

In[8]:=
maskText = netevaluate[NetModel["FastSAM Trained on MS-COCO Data"], testImage, "man handstanding"];
In[9]:=
maskBox = netevaluate[NetModel["FastSAM Trained on MS-COCO Data"], testImage, Rectangle[{400, 200}, {480, 360}]];
In[10]:=
maskPoint = netevaluate[NetModel["FastSAM Trained on MS-COCO Data"], testImage, Point[{200, 250}]];

All the obtained masks are binary and have the dimensions of the input image:

In[11]:=
{Dimensions[maskText], DeleteDuplicates@Flatten[maskText]}
{Dimensions[maskBox], DeleteDuplicates@Flatten[maskBox]}
{Dimensions[maskPoint], DeleteDuplicates@Flatten[maskPoint]}
Out[11]=
Out[12]=
Out[13]=

Visualize the mask obtained via the text prompt:

In[14]:=
HighlightImage[testImage, maskText]
Out[14]=

Visualize the box hint and mask obtained from it:

In[15]:=
HighlightImage[testImage, {Rectangle[{400, 200}, {480, 360}], maskBox}]
Out[15]=

Visualize the point hint and mask obtained from it:

In[16]:=
HighlightImage[testImage, {PointSize[0.04], Point[{200, 250}], maskPoint}]
Out[16]=

If no hint is specified, binary masks for all the identified objects will be returned:

In[17]:=
allMasks = netevaluate[NetModel["FastSAM Trained on MS-COCO Data"], testImage];
In[18]:=
Dimensions[allMasks]
Out[18]=

Visualize the masks:

In[19]:=
imgs = Image /@ allMasks
Out[19]=
In[20]:=
HighlightImage[testImage, Thread[Range[Length[imgs]] -> Map[{Opacity[0.6], #} &, imgs]], ImageLabels -> None]
Out[20]=

Prompt-guided selection

When hints are used, the process is split into two phases: all-instance segmentation (where the entire image is segmented into its components) followed by prompt-guided selection (where a final mask is obtained using the prompt). Define an image:

In[21]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/7d03fe1a-cc0d-4c0f-bbbf-2fcea1972f68"]

The initial all-instance segmentation follows the pipeline of the model "YOLO V8 Segment Trained on MS-COCO Data." Obtain all the segmentation masks:

In[22]:=
allMasks = netevaluate[NetModel["FastSAM Trained on MS-COCO Data"], img];

Show all the obtained masks on top of the image:

In[23]:=
HighlightImage[img, Thread[Range[Length[allMasks]] -> Map[{Opacity[0.6], #} &, Image /@ allMasks]], ImageLabels -> None]
Out[23]=

In the case of point guidance, the final mask is the union of all the masks that contain the point. Show a point hint relative to the image and all of the masks:

In[24]:=
pointHint = Point[{550, 250}];
GraphicsGrid[
 Partition[
  HighlightImage[#, pointHint] & /@ Prepend[Image /@ allMasks, img], 5, 5, {1, 1}, ConstantImage[1, ImageDimensions[img]]],
 ImageSize -> 1000
 ]
Out[24]=

Check which masks contain the point:

In[25]:=
{x, y} = getcoords[pointHint, ImageDimensions[img]];
maskIds = Position[allMasks[[All, y, x]], 1]
Out[26]=

Take the union of the masks and show the final result:

In[27]:=
finalMask = Unitize@Total@Extract[allMasks, maskIds];
Image[finalMask]
Out[28]=
In[29]:=
HighlightImage[img, {pointHint, finalMask}]
Out[29]=
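The point rule demonstrated above can be sketched in plain Python. This is a hedged illustration only: the row-major 0/1 mask layout mirrors the arrays returned by the evaluation function, not the net's internal format.

```python
# Sketch of point-guided selection: union of all masks containing the point.
# Masks are 2D lists of 0/1 indexed as mask[row][column]; illustrative only.

def select_by_point(all_masks, x, y):
    hits = [m for m in all_masks if m[y][x] == 1]
    h, w = len(all_masks[0]), len(all_masks[0][0])
    # elementwise OR over the masks that contain the point
    return [[int(any(m[i][j] for m in hits)) for j in range(w)]
            for i in range(h)]
```

If no mask contains the point, the result is an all-zero mask.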

In the case of box guidance, the final mask is the one with maximal intersection over union (IOU) with the box. Show a box hint relative to the image and all masks:

In[30]:=
boxHint = Rectangle[{400, 200}, {580, 300}];
GraphicsGrid[
 Partition[
  HighlightImage[#, boxHint] & /@ Prepend[Image /@ allMasks, img], 5],
 ImageSize -> 1000
 ]
Out[31]=

Obtain the measure of the intersections between the masks and the box:

In[32]:=
{x1, y1, x2, y2} = getcoords[boxHint, ImageDimensions[img]];
intersection = Total[allMasks[[All, y1 ;; y2, x1 ;; x2]], {2, 3}]
Out[33]=

Obtain the measure of the unions between the masks and the box:

In[34]:=
boxArea = Area[boxHint];
masksArea = Total[allMasks, {2, 3}];
union = masksArea + boxArea - intersection
Out[35]=

Compute the IOU and select the mask with maximal value:

In[36]:=
iou = N[intersection/union]
Out[36]=
In[37]:=
maskId = PositionLargest[iou]
Out[37]=

Show the final result:

In[38]:=
finalMask = Extract[allMasks, maskId];
Image[finalMask]
Out[38]=
In[39]:=
HighlightImage[img, {boxHint, finalMask}]
Out[39]=
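The IOU-based selection walked through above can be condensed into a short Python sketch. It assumes inclusive mask indices for the box, mirroring the slicing in the Wolfram code; it is an illustration, not the model's actual code.

```python
# Sketch of box-guided selection: pick the mask with maximal IoU with the box.
# The box spans columns x1..x2 and rows y1..y2, inclusive; illustrative only.

def select_by_box(all_masks, x1, y1, x2, y2):
    box_area = (x2 - x1 + 1) * (y2 - y1 + 1)

    def iou(mask):
        inter = sum(mask[i][j]
                    for i in range(y1, y2 + 1)
                    for j in range(x1, x2 + 1))
        mask_area = sum(sum(row) for row in mask)
        return inter / (mask_area + box_area - inter)

    return max(all_masks, key=iou)
```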

In the case of text guidance, the segmented parts of the image and the text are fed to multi-domain CLIP models, obtaining feature vectors that can be compared. The selected mask is the one closest to the text in feature space. Define a text hint and segment the image into its parts:

In[40]:=
textHint = "tree";
In[41]:=
segmentedImgs = ImageAdd[img, Image[#]] & /@ (1 - allMasks)
Out[41]=

Obtain the multi-domain features:

In[42]:=
textFeatures = NetModel[{"CLIP Multi-domain Feature Extractor", "InputDomain" -> "Text", "Architecture" -> "ViT-B/32"}][textHint];
Dimensions[textFeatures]
Out[42]=
In[43]:=
imgFeatures = NetModel[{"CLIP Multi-domain Feature Extractor", "InputDomain" -> "Image", "Architecture" -> "ViT-B/32"}][
   segmentedImgs];
Dimensions[imgFeatures]
Out[44]=
In[45]:=
distances = CosineDistance[#, textFeatures] & /@ imgFeatures
Out[45]=
In[46]:=
maskId = PositionSmallest[distances]
Out[46]=

Show the final result:

In[47]:=
finalMask = Extract[allMasks, maskId];
Image[finalMask]
Out[47]=
In[48]:=
HighlightImage[img, finalMask]
Out[48]=
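The text-guided selection above reduces to a nearest-neighbor search in feature space, which can be sketched in Python. The feature vectors here are toy stand-ins for the CLIP embeddings; this is an illustration, not the repository's code.

```python
# Sketch of text-guided selection: the mask whose image embedding has the
# smallest cosine distance to the text embedding wins. Illustrative only.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

def select_by_text(image_features, text_features):
    # One feature vector per candidate mask; return the winning mask's index.
    dists = [cosine_distance(f, text_features) for f in image_features]
    return min(range(len(dists)), key=dists.__getitem__)
```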

Net information

Inspect the number of parameters of all arrays in the net:

In[49]:=
Information[
 NetModel["FastSAM Trained on MS-COCO Data"], "ArraysElementCounts"]
Out[49]=

Obtain the total number of parameters:

In[50]:=
Information[
 NetModel[
  "FastSAM Trained on MS-COCO Data"], "ArraysTotalElementCount"]
Out[50]=

Obtain the layer type counts:

In[51]:=
Information[
 NetModel["FastSAM Trained on MS-COCO Data"], "LayerTypeCounts"]
Out[51]=

Display the summary graphic:

In[52]:=
Information[
 NetModel["FastSAM Trained on MS-COCO Data"], "SummaryGraphic"]
Out[52]=

Export to ONNX

Export the net to the ONNX format:

In[53]:=
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], NetModel["FastSAM Trained on MS-COCO Data"]]
Out[53]=

Get the size of the ONNX file:

In[54]:=
FileByteCount[onnxFile]
Out[54]=

The size is similar to the byte count of the resource object:

In[55]:=
NetModel["FastSAM Trained on MS-COCO Data", "ByteCount"]
Out[55]=

Check some metadata of the ONNX model:

In[56]:=
{OpsetVersion, IRVersion} = {Import[onnxFile, "OperatorSetVersion"], Import[onnxFile, "IRVersion"]}
Out[56]=

Import the model back into the Wolfram Language. The NetEncoder and NetDecoder will be absent, as they are not supported by ONNX:

In[57]:=
Import[onnxFile]
Out[57]=

Resource History

Reference