CenterNet Pose Estimation Nets Trained on MS-COCO Data

Detect and localize human joints and objects in an image

Released in 2019, this family of models estimates the locations of human joints in an image. In the same manner as the CenterNet object detection models, these models generate a heat map for each human joint class, and the heat-map peaks are then refined by the regressed joint offsets. In order to group the predicted keypoints by human instance, the models also detect entire human bodies separately and parametrize each keypoint by a displacement from the center of the body. Finally, the keypoints regressed from the object centers are aligned with the closest keypoints extracted from the human pose heat maps. Note that the CenterNet MobileNetV2 model detects only human instances, while the ResNet models detect all 80 classes in the MS-COCO dataset.
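As a rough sketch of this decoding scheme (with made-up numbers and a hypothetical output stride; the actual post-processing is performed by the helper utilities defined below), a heat-map peak is refined by its regressed offset and scaled to image coordinates, and a keypoint regressed from a body center is snapped to the nearest refined peak:

stride = 4;  (* hypothetical output stride *)
peakCell = {56, 30};  (* illustrative heat-map peak position *)
offset = {0.3, -0.1};  (* illustrative regressed sub-pixel offset *)
refinedKeypoint = (peakCell + offset)*stride

center = {220, 118};  (* illustrative body center *)
displacement = {5., -12.};  (* illustrative center-to-joint displacement *)
jointPeaks = {{224.8, 107.2}, {300., 40.}};  (* illustrative refined peaks for this joint *)
First@Nearest[jointPeaks, center + displacement]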

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"CenterNet Pose Estimation Nets Trained on MS-COCO Data", "Architecture" -> "ResNetV1-50"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"CenterNet Pose Estimation Nets Trained on MS-COCO Data", "Architecture" -> "ResNetV1-50"}, "UninitializedEvaluationNet"]
Out[4]=

Evaluation function

Define the label list for this model:

In[5]:=
labels = {"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"};

Define helper utilities for netevaluate:

In[6]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/96ac21ef-3eb3-460b-8ae9-31fa3977a291"]

Write an evaluation function to estimate the locations of the objects and human keypoints:

In[7]:=
Options[netevaluate] = Join[Options[alignKeypoints], Options[decode]];
netevaluate[net_, img_Image, opts : OptionsPattern[]] := Block[
  {scale, predictions},
  (* run the net and decode its raw outputs into object detections and pose keypoints *)
  predictions = decode[net, img, Sequence @@ DeleteCases[{opts}, ("NeighborhoodRadius" | "FilterOutsideDetections") -> _]];
  (* scale factor from the pose heat-map grid to the image coordinates *)
  scale = Max@N[ImageDimensions[img]/Reverse@Rest@NetExtract[net, NetPort["PosePeaks"]]];
  (* align the center-regressed keypoints with the nearest heat-map keypoints *)
  alignKeypoints[predictions,
   "FilterOutsideDetections" -> OptionValue["FilterOutsideDetections"],
   "NeighborhoodRadius" -> OptionValue["NeighborhoodRadius"]*scale
   ]
  ]

Basic usage

Obtain the detected bounding boxes with their corresponding classes and confidences as well as the locations of human joints for a given image:

In[8]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/1ac8b227-06bb-4acf-bf64-123358367291"]
In[9]:=
predictions = netevaluate[
   NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"],
    testImage];

Inspect the prediction keys:

In[10]:=
Keys[predictions]
Out[10]=

The "ObjectDetection" key contains the coordinates of the detected objects as well as its confidences and classes:

In[11]:=
predictions["ObjectDetection"]
Out[11]=
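Each detection lists the bounding box, the class and the confidence; assuming the form {box, class, confidence}, low-confidence detections can be filtered out with a hypothetical threshold of 0.5:

Select[predictions["ObjectDetection"], Last[#] >= 0.5 &]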

Inspect which classes are detected:

In[12]:=
classes = DeleteDuplicates@Flatten@predictions[["ObjectDetection", All, 2]]
Out[12]=

The "KeypointEstimation" key contains the locations of top predicted keypoints as well as their confidences for each person:

In[13]:=
Dimensions@predictions["KeypointEstimation"]
Out[13]=

Extract the predicted keypoint locations:

In[14]:=
keypoints = predictions["KeypointEstimation"];
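Keypoints that could not be located are reported as Missing (hence the DeleteMissing calls in the visualizations below); for example, count the located keypoints per person:

Map[Length@DeleteMissing[#] &, keypoints]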

Visualize the keypoints:

In[15]:=
HighlightImage[testImage, DeleteMissing[keypoints, 2]]
Out[15]=

Visualize the keypoints grouped by person:

In[16]:=
HighlightImage[testImage, AssociationThread[Range[Length[#]] -> #] &@
  DeleteMissing[keypoints, 2], ImageLabels -> None]
Out[16]=

Visualize the keypoints grouped by keypoint type:

In[17]:=
HighlightImage[testImage, AssociationThread[Range[Length[#]] -> #] &@
  DeleteMissing[Transpose[keypoints], 2], ImageLabels -> None]
Out[17]=

Define a function to combine the keypoints into a skeleton shape:

In[18]:=
getSkeleton[personKeypoints_] := Line[DeleteMissing[
   Map[personKeypoints[[#]] &,
    {{1, 2}, {1, 3}, {2, 4}, {3, 5}, {1, 6}, {1, 7}, {6, 8}, {8, 10}, {7, 9}, {9, 11}, {6, 7}, {6, 12}, {7, 13}, {12, 13}, {12, 14}, {14, 16}, {13, 15}, {15, 17}}], 1, 2]]
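For example, since getSkeleton returns a Line primitive, the skeleton of a single person can be drawn directly:

HighlightImage[testImage, getSkeleton[First[keypoints]]]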

Visualize the pose keypoints, object detections and human skeletons:

In[19]:=
HighlightImage[testImage,
 Append[
  AssociationThread[Range[Length[#]] -> #] & /@ {keypoints, Map[getSkeleton, keypoints]},
  GroupBy[predictions["ObjectDetection"][[All, ;; 2]], Last -> First]
  ],
 ImageLabels -> None
 ]
Out[19]=

Resource History

Reference