Resource retrieval
Get the pre-trained net:
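A minimal retrieval sketch; the resource name below is a placeholder and should be replaced with the name shown at the top of this page:
net = NetModel["FastSAM Trained on MS-COCO and SA-1B Data"]  (* placeholder resource name *)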
NetModel parameters
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
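One way to list the available parameter combinations (the resource name is again a placeholder):
NetModel["FastSAM Trained on MS-COCO and SA-1B Data", "ParametersInformation"]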
Pick a non-default net by specifying the parameters:
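A sketch of a parameterized retrieval; the parameter name "Size" and the value "Large" are assumptions and should be replaced by a combination listed above:
NetModel[{"FastSAM Trained on MS-COCO and SA-1B Data", "Size" -> "Large"}]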
Pick a non-default uninitialized net:
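The same call with the "UninitializedEvaluationNet" property returns the architecture without trained weights:
NetModel[{"FastSAM Trained on MS-COCO and SA-1B Data", "Size" -> "Large"}, "UninitializedEvaluationNet"]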
Evaluation function
Write an evaluation function:
Basic usage
Define a test image:
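Any RGB image works; as a stand-in, a built-in example image is used here:
testImage = ExampleData[{"TestImage", "House"}]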
Obtain a segmentation mask for the desired object using a textual prompt, a bounding box and a single point:
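Hypothetical calls to the evaluation function written above; the hint argument forms (a string, a Rectangle and a point) are assumptions about that function, not a documented interface:
maskText = netevaluate[testImage, "the house"];
maskBox = netevaluate[testImage, Rectangle[{100, 50}, {300, 250}]];
maskPoint = netevaluate[testImage, {200, 150}];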
All the obtained masks are binary and have the same dimensions as the input image:
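A quick check, assuming the three masks from the previous step:
ImageDimensions /@ {maskText, maskBox, maskPoint}
ImageDimensions[testImage]
Union @@ (Flatten[ImageData[#]] & /@ {maskText, maskBox, maskPoint})  (* only 0s and 1s for binary masks *)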
Visualize the mask obtained via the text prompt:
Visualize the box hint and mask obtained from it:
Visualize the point hint and mask obtained from it:
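Sketches for the three visualizations above, reusing the hypothetical masks and hints from the earlier calls:
HighlightImage[testImage, maskText]
HighlightImage[testImage, {Rectangle[{100, 50}, {300, 250}], maskBox}]
HighlightImage[testImage, {Point[{200, 150}], maskPoint}]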
If no hint is specified, binary masks for all the identified objects will be returned:
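Assuming the hypothetical evaluation function returns one binary mask per detected object when called without a hint:
masks = netevaluate[testImage];
Length[masks]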
Visualize the masks:
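HighlightImage assigns a different color to each mask in the list:
HighlightImage[testImage, masks]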
Prompt-guided selection
When hints are used, the process is split into two phases: all-instance segmentation (where the entire image is segmented into its components), followed by prompt-guided selection (where a final mask is obtained using the prompt). Define an image:
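Again a built-in example image serves as a stand-in:
img = ExampleData[{"TestImage", "Mandrill"}]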
The initial all-instance segmentation follows the pipeline of the model "YOLO V8 Segment Trained on MS-COCO Data." Obtain all the segmentation masks:
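Assuming the hypothetical evaluation function returns the full list of binary masks when called without a hint:
allMasks = netevaluate[img];
Length[allMasks]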
Show all the obtained masks on top of the image:
In the case of point guidance, the final mask is the union of all the masks that contain the point. Show a point hint relative to the image and all of the masks:
Check which masks contain the point:
Take the union of the masks and show the final result:
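A sketch covering the three steps above, with a hypothetical point hint and the allMasks list from earlier:
point = {200, 150};  (* hypothetical point hint *)
selected = Select[allMasks, PixelValue[#, point] > 0.5 &];  (* masks that contain the point *)
finalMask = Binarize[Fold[ImageAdd, First[selected], Rest[selected]], 0.5];  (* union of the selected masks *)
HighlightImage[img, finalMask]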
In the case of box guidance, the final mask is the one with the maximal intersection over union (IOU) with the box. Show a box hint relative to the image and all of the masks:
Obtain the measure of the intersections between the masks and the box:
Obtain the measure of the unions between the masks and the box:
Compute the IOU for each mask and select the one with the maximal value:
Show the final result:
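A sketch covering the box-guidance steps above; the box coordinates are hypothetical, and the overlap is measured by counting mask pixels that fall inside the box:
box = {{100, 50}, {300, 250}};  (* hypothetical box hint: {{xmin, ymin}, {xmax, ymax}} *)
boxArea = (300 - 100)*(250 - 50);
intersections = Total[ImageData[ImageTrim[#, box]], 2] & /@ allMasks;  (* mask pixels inside the box *)
maskAreas = Total[ImageData[#], 2] & /@ allMasks;
ious = intersections/(maskAreas + boxArea - intersections);  (* union = mask area + box area - intersection *)
finalMask = allMasks[[First[Ordering[ious, -1]]]];  (* mask with maximal IOU *)
HighlightImage[img, finalMask]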
In the case of text guidance, the segmented parts of the image and the text are fed to multi-domain CLIP models, obtaining feature vectors that can be compared. The selected mask is the one whose features are closest to the text features. Define a text hint and segment the image into its parts:
Obtain the multi-domain features:
Compute the cosine distances between the image features and the text feature and select the mask with the minimal distance:
Show the final result:
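A sketch covering the text-guidance steps above; the CLIP resource names and parameters are assumptions, the text hint is hypothetical, and the segmented parts are obtained by masking the image:
imageFeaturizer = NetModel[{"CLIP Multi-domain Feature Extractor Trained on LAION-2B Data", "InputDomain" -> "Image"}];  (* assumed resource name and parameter *)
textFeaturizer = NetModel[{"CLIP Multi-domain Feature Extractor Trained on LAION-2B Data", "InputDomain" -> "Text"}];
textHint = "a monkey";  (* hypothetical text hint *)
parts = ImageMultiply[img, #] & /@ allMasks;  (* keep only the segmented part of the image *)
imageFeatures = imageFeaturizer /@ parts;
textFeature = textFeaturizer[textHint];
distances = CosineDistance[Normal[#], Normal[textFeature]] & /@ imageFeatures;
finalMask = allMasks[[First[Ordering[distances, 1]]]];  (* minimal distance = closest to the text *)
HighlightImage[img, finalMask]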
Net information
Inspect the number of parameters of all arrays in the net:
Obtain the total number of parameters:
Obtain the layer type counts:
Display the summary graphic:
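The standard net Information properties cover these four steps, assuming net is the trained net retrieved earlier:
Information[net, "ArraysElementCounts"]
Information[net, "ArraysTotalElementCount"]
Information[net, "LayerTypeCounts"]
Information[net, "SummaryGraphic"]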
Export to ONNX
Export the net to the ONNX format:
Get the size of the ONNX file:
The size is similar to the byte count of the resource object:
Check some metadata of the ONNX model:
Import the model back into Wolfram Language; the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:
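A sketch of the export/import round trip described above; the file path, the ResourceObject property name and the ONNX metadata element name are assumptions:
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], net, "ONNX"];
FileByteCount[onnxFile]  (* size of the exported file *)
ResourceObject["FastSAM Trained on MS-COCO and SA-1B Data"]["ByteCount"]  (* placeholder resource name; property name is an assumption *)
Import[onnxFile, "OpsetVersion"]  (* metadata element name is an assumption *)
imported = Import[onnxFile]  (* the graph comes back without NetEncoder/NetDecoder *)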