# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Detect and localize human joints and objects in an image

Released in 2019, this family of models estimates the locations of human joints in an image. Like the CenterNet object detection models, these models generate a heat map for each human joint class, and the heat map peaks are then corrected by the regressed joint offsets. In order to group the predicted keypoints by human instance, the models also detect entire human bodies separately and parametrize each keypoint as a displacement from the body center. Finally, the keypoints regressed from the object centers are aligned with the closest keypoints extracted from the human pose heat maps. Note that the CenterNet MobileNetV2 model detects only human instances, while the ResNet models detect all 80 classes in the MS-COCO dataset.
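The grouping scheme described above can be sketched numerically. This is an illustrative Python/NumPy sketch, not the repository's Wolfram Language implementation; all names and values are made up. Because each detected body center carries one regressed displacement per joint, the keypoints of person *i* are simply `center_i + displacements_i`, which yields keypoints already grouped by instance:

```python
import numpy as np

# Sketch (assumed shapes) of keypoint grouping via center displacements:
# each detected body center predicts one displacement per joint, so adding
# them groups the keypoints by person with no extra association step.
centers = np.array([[100.0, 150.0], [300.0, 140.0]])  # two detected people
displacements = np.array([                            # per-person, per-joint
    [[0.0, -40.0], [-15.0, 0.0], [15.0, 0.0]],
    [[0.0, -42.0], [-14.0, 2.0], [16.0, 1.0]],
])
keypoints = centers[:, None, :] + displacements       # (people, joints, 2)
print(keypoints[0])   # keypoints of the first person, already grouped
```

In the real models these regressed keypoints are then refined by snapping each one to the nearest heat-map peak of the same joint class.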

- Microsoft COCO, a dataset for image recognition, segmentation and captioning, consisting of more than 300,000 images with objects annotated in 80 classes.

Get the pre-trained net:

In[1]:= |

Out[1]= |

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:= |

Out[2]= |

Pick a non-default net by specifying the parameters:

In[3]:= |

Out[3]= |

Pick a non-default uninitialized net:

In[4]:= |

Out[4]= |

Define the label list for this model:

In[5]:= |

Define helper utilities for netevaluate:

In[6]:= |

Write an evaluation function to estimate the locations of the objects and human keypoints:

In[7]:= |

Obtain the detected bounding boxes with their corresponding classes and confidences as well as the locations of human joints for a given image:

In[8]:= |

In[9]:= |

Inspect the prediction keys:

In[10]:= |

Out[10]= |

The "ObjectDetection" key contains the coordinates of the detected objects as well as their confidences and classes:

In[11]:= |

Out[11]= |

Inspect which classes are detected:

In[12]:= |

Out[12]= |

The "KeypointEstimation" key contains the locations of the top predicted keypoints as well as their confidences for each person:

In[13]:= |

Out[13]= |

Inspect the predicted keypoint locations:

In[14]:= |

Visualize the keypoints:

In[15]:= |

Out[15]= |

Visualize the keypoints grouped by person:

In[16]:= |

Out[16]= |

Visualize the keypoints grouped by keypoint type:

In[17]:= |

Out[17]= |

Define a function to combine the keypoints into a skeleton shape:

In[18]:= |
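A skeleton builder of this kind can be sketched as follows. This Python sketch is an assumption, not the repository's definition: it uses the standard 17 MS-COCO keypoint names and the conventional COCO bone list, and draws a bone only when both of its endpoints were detected:

```python
# Sketch (assumed, not the repository's code) of combining keypoints into a
# skeleton: bones are pairs of keypoint indices in the standard COCO order.

COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Conventional COCO skeleton, as 0-indexed keypoint pairs
BONES = [
    (15, 13), (13, 11), (16, 14), (14, 12), (11, 12),
    (5, 11), (6, 12), (5, 6), (5, 7), (6, 8), (7, 9), (8, 10),
    (1, 2), (0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6),
]

def skeleton_lines(keypoints):
    """keypoints: dict mapping keypoint index -> (x, y). Returns the line
    segments for all bones whose two endpoints are both present."""
    return [(keypoints[a], keypoints[b])
            for a, b in BONES if a in keypoints and b in keypoints]

# Example: only the two shoulders and the left elbow were detected
kps = {5: (100, 200), 6: (160, 200), 7: (90, 260)}
print(skeleton_lines(kps))  # shoulder-shoulder and shoulder-elbow segments
```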

Visualize the pose keypoints, object detections and human skeletons:

In[19]:= |

Out[19]= |

In[20]:= |

Obtain the detected bounding boxes with their corresponding classes and confidences as well as the locations of human joints for a given image:

In[21]:= |

Visualize the pose keypoints, object detections and human skeletons. Note that some of the keypoints are misaligned:

In[22]:= |

Out[22]= |

Inspect the effect of the radius defined by the optional parameter "NeighborhoodRadius":

In[23]:= |

Out[23]= |
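The role this sketch assumes for "NeighborhoodRadius" (an assumption based on the alignment step described at the top of the page, not on the repository's code) is that a regressed keypoint is replaced by a heat-map peak only when that peak lies within the given radius; otherwise the regressed location is kept:

```python
import numpy as np

# Sketch of radius-limited alignment: snap a regressed keypoint to the
# nearest heat-map peak only if that peak is within `radius` pixels.
def snap(kp, peaks, radius):
    kp, peaks = np.asarray(kp, float), np.asarray(peaks, float)
    d = np.linalg.norm(peaks - kp, axis=1)   # distance to every candidate
    i = np.argmin(d)
    return tuple(peaks[i]) if d[i] <= radius else tuple(kp)

print(snap((50, 50), [(54, 53), (120, 10)], radius=10))  # -> (54.0, 53.0)
print(snap((50, 50), [(54, 53), (120, 10)], radius=2))   # -> (50.0, 50.0)
```

A small radius keeps more keypoints at their regressed positions; a large radius snaps more of them to heat-map peaks, possibly from a different person.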

For the default input size of 512x512, the net produces a 128x128 grid of bounding boxes whose centers mostly follow a square grid. For each bounding box, the net predicts the box size and the offset of the box center with respect to the grid:

In[24]:= |

In[25]:= |

Out[25]= |
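The geometry of this decoding can be sketched as follows. This is an illustrative Python sketch under assumed conventions (offsets and sizes expressed in grid units), not the repository's code: with a 512x512 input and a 128x128 output grid the stride is 4, the box center is the cell position plus its offset, and the corners sit half a box size away:

```python
import numpy as np

# Sketch of CenterNet-style box decoding on a 128x128 grid (stride 4).
stride = 512 // 128   # pixels of input image per grid cell

def decode_box(cell_xy, offset, size):
    """cell_xy: integer grid position; offset, size: head outputs in grid
    units (assumed convention). Returns (xmin, ymin, xmax, ymax) in pixels."""
    center = (np.asarray(cell_xy, float) + np.asarray(offset, float)) * stride
    half = np.asarray(size, float) * stride / 2
    return (*(center - half), *(center + half))

# Cell (60, 40) with a 0.3-cell offset and a 10x20-cell box size
print(decode_box((60, 40), (0.3, 0.3), (10, 20)))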

Change the coordinate system into a graphics domain:

In[26]:= |

Compute the box center positions:

In[27]:= |

Visualize the box center positions. They follow a square grid with offsets:

In[28]:= |

Out[28]= |

Compute the boxes' coordinates:

In[29]:= |

In[30]:= |

Out[30]= |

Define a function to rescale the box coordinates to the original image size:

In[31]:= |

Visualize all the boxes predicted by the net scaled by their "objectness" measures:

In[32]:= |

Out[32]= |

Visualize all the boxes scaled by the probability that they contain a dog:

In[33]:= |

Out[33]= |

In[34]:= |

Out[34]= |

Superimpose the cat prediction on top of the scaled input received by the net:

In[35]:= |

Out[35]= |

In[36]:= |

Every box is associated with a scalar strength value indicating the likelihood that its patch contains an object:

In[37]:= |

In[38]:= |

Out[38]= |

The strength of each patch is the maximum score across all classes. Obtain the strength of each patch:

In[39]:= |

Out[40]= |
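In array terms, this aggregation is a maximum over the class axis of the per-class heat maps. A minimal NumPy sketch with illustrative shapes (80 classes on a 128x128 grid; the data here is random, not model output):

```python
import numpy as np

# Sketch of patch-strength aggregation: the strength heat map is the
# per-patch maximum over all class score maps.
scores = np.random.default_rng(0).random((80, 128, 128))  # (classes, h, w)
strength = scores.max(axis=0)                             # (h, w) heat map
print(strength.shape)

# A class-specific heat map (e.g. for one class index) is just one slice:
class_strength = scores[17]
```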

Visualize the strength of each patch as a heat map:

In[41]:= |

Out[41]= |

Stretch and unpad the heat map to the original image domain:

In[42]:= |

Out[42]= |

Overlay the heat map on the image:

In[43]:= |

Out[43]= |

Obtain and visualize the strength of each patch for the "dog" class:

In[44]:= |

Out[47]= |

Overlay the heat map on the image:

In[48]:= |

Out[48]= |

Define a general function to visualize a heat map on an image:

In[49]:= |

In[50]:= |

Out[50]= |

Automatic image resizing can be avoided by replacing the NetEncoder. First get the NetEncoder:

In[51]:= |

Out[51]= |

Note that the NetEncoder resizes the image while keeping the aspect ratio and then pads the result to a fixed shape of 512x512. Visualize the output of the NetEncoder, adjusting for brightness:

In[52]:= |

Out[52]= |
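The geometry of this resize-and-pad step can be sketched as follows. This is an assumed sketch in Python of the behavior just described, not the NetEncoder's implementation: the image is scaled so its longer side fits the target, and the remainder is made up with padding:

```python
# Sketch (assumed behavior) of aspect-preserving resize followed by padding
# to a fixed square shape, as the NetEncoder does for 512x512 inputs.
def fit_and_pad(w, h, target=512):
    scale = target / max(w, h)                  # fit the longer side
    new_w, new_h = round(w * scale), round(h * scale)
    pad_w, pad_h = target - new_w, target - new_h
    return (new_w, new_h), (pad_w, pad_h)

print(fit_and_pad(640, 480))  # ((512, 384), (0, 128))
```

Replacing the NetEncoder with one of the image's native dimensions, as done below, skips this geometry entirely, at some cost in accuracy.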

Create a new NetEncoder with the desired dimensions:

In[53]:= |

Out[53]= |

Attach the new NetEncoder:

In[54]:= |

Out[54]= |

Obtain the detected bounding boxes with their corresponding classes and confidences for a given image:

In[55]:= |

Visualize the detection:

In[56]:= |

Out[56]= |

Note that even though the localization results and the box confidences are slightly worse compared to the original net, the resized network runs significantly faster:

In[57]:= |

Out[57]= |

In[58]:= |

Out[58]= |

Inspect the number of parameters of all arrays in the net:

In[59]:= |

Out[59]= |

Obtain the total number of parameters:

In[60]:= |

Out[60]= |

Obtain the layer type counts:

In[61]:= |

Out[61]= |

Display the summary graphic:

In[62]:= |

Out[62]= |

- X. Zhou, D. Wang, P. Krähenbühl, "Objects as Points," arXiv:1904.07850 (2019)
- Available from: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md
- Rights: Copyright 2022 Google LLC. All rights reserved. Apache License 2.0