Wolfram Research

CenterNet Pose Estimation Nets Trained on MS-COCO Data

Detect and localize human joints and objects in an image

Released in 2019, this family of models estimates the locations of human joints in an image. In a similar manner to the CenterNet object detection models, these models generate heat maps for each human joint class, and the heat maps are then corrected by the regressed joint offsets. In order to group the predicted keypoints by different human instances, the models also detect entire human bodies separately and parametrize each keypoint by a displacement from the center of the body. Finally, the keypoints regressed from the object centers are aligned with the closest keypoints extracted from the human pose heat maps. Note that CenterNet MobileNetV2 detects only the human instances while ResNet models detect all 80 classes in the MS-COCO dataset.

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"CenterNet Pose Estimation Nets Trained on MS-COCO Data", "Architecture" -> "ResNetV1-50"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"CenterNet Pose Estimation Nets Trained on MS-COCO Data", "Architecture" -> "ResNetV1-50"}, "UninitializedEvaluationNet"]
Out[4]=

Evaluation function

Define the label list for this model:

In[5]:=
labels = {
Sequence[
   "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"]};

Define helper utilities for netevaluate:

In[6]:=
(* Evaluate this cell to get the helper definitions *) CloudGet["https://www.wolframcloud.com/obj/96ac21ef-3eb3-460b-8ae9-31fa3977a291"]

Write an evaluation function to estimate the locations of the objects and human keypoints:

In[7]:=
Options[netevaluate] = Join[Options[alignKeypoints], Options[decode]];
netevaluate[net_, img_Image, opts : OptionsPattern[]] := Block[
  {scale, predictions},
  predictions = decode[net, img, Sequence @@ DeleteCases[{opts}, "NeighborhoodRadius" | "FilterOutsideDetections" -> _]];
  scale = Max@N[ImageDimensions[img]/Reverse@Rest@NetExtract[net, NetPort["PosePeaks"]]];
  alignKeypoints[predictions,
   "FilterOutsideDetections" -> OptionValue["FilterOutsideDetections"],
   "NeighborhoodRadius" -> OptionValue["NeighborhoodRadius"]*scale
   ]
  ]

Basic usage

Obtain the detected bounding boxes with their corresponding classes and confidences as well as the locations of human joints for a given image:

In[8]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/1ac8b227-06bb-4acf-bf64-123358367291"]
In[9]:=
predictions = netevaluate[
   NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"],
    testImage];

Inspect the prediction keys:

In[10]:=
Keys[predictions]
Out[10]=

The "ObjectDetection" key contains the coordinates of the detected objects as well as their confidences and classes:

In[11]:=
predictions["ObjectDetection"]
Out[11]=

Inspect which classes are detected:

In[12]:=
classes = DeleteDuplicates@Flatten@predictions[["ObjectDetection", All, 2]]
Out[12]=

The "KeypointEstimation" key contains the locations of the top predicted keypoints as well as their confidences for each person:

In[13]:=
Dimensions@predictions["KeypointEstimation"]
Out[13]=

Extract the predicted keypoint locations:

In[14]:=
keypoints = predictions["KeypointEstimation"];

Visualize the keypoints:

In[15]:=
HighlightImage[testImage, DeleteMissing[keypoints, 2]]
Out[15]=

Visualize the keypoints grouped by person:

In[16]:=
HighlightImage[testImage, AssociationThread[Range[Length[#]] -> #] &@
  DeleteMissing[keypoints, 2], ImageLabels -> None]
Out[16]=

Visualize the keypoints grouped by keypoint type:

In[17]:=
HighlightImage[testImage, AssociationThread[Range[Length[#]] -> #] &@
  DeleteMissing[Transpose[keypoints], 2], ImageLabels -> None]
Out[17]=

Define a function to combine the keypoints into a skeleton shape:

In[18]:=
(* Index pairs of the keypoints joined by each skeleton segment *)
skeletonEdges = {{1, 2}, {1, 3}, {2, 4}, {3, 5}, {1, 6}, {1, 7}, {6, 8}, {8, 10}, {7, 9}, {9, 11}, {6, 7}, {6, 12}, {7, 13}, {12, 13}, {12, 14}, {14, 16}, {13, 15}, {15, 17}};
getSkeleton[personKeypoints_] := Line[DeleteMissing[
   Map[personKeypoints[[#]] &, skeletonEdges], 1, 2]]

Visualize the pose keypoints, object detections and human skeletons:

In[19]:=
HighlightImage[testImage,
 Append[
  AssociationThread[Range[Length[#]] -> #] & /@ {keypoints, Map[getSkeleton, keypoints]},
  GroupBy[predictions["ObjectDetection"][[All, ;; 2]], Last -> First]
  ],
 ImageLabels -> None
 ]
Out[19]=

Advanced visualization

In[20]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/2e67ed23-827d-4440-a269-cb17a44084bb"]

Obtain the detected bounding boxes with their corresponding classes and confidences as well as the locations of human joints for a given image:

In[21]:=
predictions = netevaluate[
   NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"],
    testImage2, "ObjectDetectionThreshold" -> 0.28, "HumanPoseThreshold" -> 0.2, "FilterOutsideDetections" -> True];
keypoints = predictions["KeypointEstimation"];

Visualize the pose keypoints, object detections and human skeletons. Note that some of the keypoints are misaligned:

In[22]:=
HighlightImage[testImage2,
 AssociationThread[Range[Length[#]] -> #] & /@ {keypoints, Map[getSkeleton, keypoints], predictions[["ObjectDetection", All, 1]]}
 , ImageLabels -> None]
Out[22]=

Inspect the effect of different alignment radii, set with the option "NeighborhoodRadius":

In[23]:=
Grid@List@Table[
   predictions = netevaluate[
     NetModel[
      "CenterNet Pose Estimation Nets Trained on MS-COCO Data"], testImage2, "ObjectDetectionThreshold" -> 0.28, "HumanPoseThreshold" -> 0.2, "NeighborhoodRadius" -> r,
     "FilterOutsideDetections" -> False
     ];
   HighlightImage[testImage2,
    AssociationThread[Range[Length[#]] -> #] & /@ {predictions[
       "KeypointEstimation"], Map[getSkeleton, predictions["KeypointEstimation"]], predictions[["ObjectDetection", All, 1]]}
    , ImageLabels -> None],
    {r, {1, 5, 8}}
    ]
Out[23]=
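
The role of the radius can be pictured with a small sketch. The helper snapToPeak below is hypothetical and is not the alignKeypoints implementation fetched earlier. It assumes that a keypoint regressed from the body center is replaced by the nearest heat-map peak of the same joint type only when that peak lies within the given radius, and is kept unchanged otherwise:

```wolfram
(* Hypothetical helper, for illustration only: snap a center-regressed keypoint
   to the nearest heat-map peak if that peak is within `radius` *)
snapToPeak[regressed_, peaks_List, radius_] :=
  If[peaks === {},
   regressed,
   With[{nearest = First[Nearest[peaks, regressed]]},
    If[EuclideanDistance[nearest, regressed] <= radius, nearest, regressed]]]

(* The closest peak is 1 unit away from the regressed point *)
snapToPeak[{10., 12.}, {{11., 12.}, {40., 5.}}, 3]    (* snapped: {11., 12.} *)
snapToPeak[{10., 12.}, {{11., 12.}, {40., 5.}}, 0.5]  (* kept: {10., 12.} *)
```

Under this assumption, a small radius leaves most keypoints at their regressed positions, while a large radius lets them jump to peaks that may belong to a different person.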

Network object detection result

For the default input size of 512x512, the net produces 128x128 bounding boxes whose centers mostly follow a square grid. For each bounding box, the net produces the box size and the offset of the box center with respect to the square grid:

In[24]:=
res = NetModel[
    "CenterNet Pose Estimation Nets Trained on MS-COCO Data"][
   testImage];
In[25]:=
Map[Dimensions, res]
Out[25]=
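
The 128x128 grid corresponds to a downsampling factor of 512/128 = 4 between the net input and its output maps. As a minimal sketch, a grid cell plus its regressed offset could be mapped back to input pixel coordinates as follows. The helper name gridToInput is hypothetical, the offsets are assumed to be expressed in grid units, and any half-cell or index-origin convention is ignored:

```wolfram
(* Downsampling factor between the 512x512 input and the 128x128 output grid *)
stride = 512/128;

(* Hypothetical helper: grid cell (plus optional offset) -> input pixel coordinates *)
gridToInput[cell_, offset_ : {0, 0}] := stride*(cell + offset)

gridToInput[{64, 32}]                (* {256, 128} *)
gridToInput[{64, 32}, {0.25, 0.5}]   (* {257., 130.} *)
```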

Change the coordinate system into a graphics domain:

In[26]:=
res = Map[Transpose[#, {3, 2, 1}] &, res];
res["BoxOffsets"] = Map[Reverse, res["BoxOffsets"], {2}];
res["BoxSizes"] = Map[Reverse, res["BoxSizes"], {2}];

Compute the box center positions:

In[27]:=
grid = Table[{i, j}, {i, 128}, {j, 128}];
centers = grid + res["BoxOffsets"];

Visualize the box center positions. They follow a square grid with offsets:

In[28]:=
Graphics@{PointSize[0.005], Point@Flatten[centers, 1]}
Out[28]=

Compute the boxes' coordinates:

In[29]:=
boxes = Transpose@
   Flatten[{centers - 0.5*res["BoxSizes"], centers + 0.5*res["BoxSizes"]}, {{1, 4}, {2, 3}}];
In[30]:=
Dimensions[boxes]
Out[30]=

Define a function to rescale the box coordinates to the original image size:

In[31]:=
boxDecoder[{a_, b_, c_, d_}, {w_, h_}, scale_] := Rectangle[{a*scale, h - b*scale}, {c*scale, h - d*scale}];

Visualize all the boxes predicted by the net scaled by their "objectness" measures:

In[32]:=
Graphics[
 MapThread[{EdgeForm[Opacity[#1*0.5]], #2} &, {Flatten@
    Map[Max, res["ObjectHeatmaps"], {2}], Map[boxDecoder[#, {128, 128}, 1] &, boxes]}],
 BaseStyle -> {FaceForm[], EdgeForm[{Thin, Black}]}
 ]
Out[32]=

Visualize all the boxes scaled by the probability that they contain a dog:

In[33]:=
idx = Position[labels, "dog"][[1, 1]]
Out[33]=
In[34]:=
Graphics[
 MapThread[{EdgeForm[Opacity[#1]], #2} &, {Flatten[
    res["ObjectHeatmaps"][[All, All, idx]]], Map[boxDecoder[#, {128, 128}, 1] &, boxes]}],
 BaseStyle -> {FaceForm[], EdgeForm[{Thin, Black}]}
 ]
Out[34]=

Superimpose the dog predictions on the input image:

In[35]:=
HighlightImage[testImage, Graphics[
  MapThread[{EdgeForm[{Opacity[#1]}], #2} &, {Flatten[
     res["ObjectHeatmaps"][[All, All, idx]]], Map[boxDecoder[#, ImageDimensions[testImage], Max@N[ImageDimensions[testImage]/{128, 128}]] &, boxes]}]], BaseStyle -> {FaceForm[], EdgeForm[{Thin, Red}]}]
Out[35]=

Heat map visualization

In[36]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/cb65e6fb-7b1b-49ea-af60-8df89bc2ec56"]

Every box is associated with a scalar strength value indicating the likelihood that the patch contains an object:

In[37]:=
objectHeatmaps = NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"][
    testImage]["ObjectHeatmaps"];
In[38]:=
Dimensions[objectHeatmaps]
Out[38]=

The strength of each patch is the maximal element aggregated across all classes. Obtain the strength of each patch:

In[39]:=
strengthArray = Map[Max, Transpose[objectHeatmaps, {3, 1, 2}], {2}];
Dimensions[strengthArray]
Out[40]=
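
The class-wise maximum can be checked on a toy array. The sketch below uses made-up values with 2 classes on a 2x2 grid, laid out as class x row x column like the heat maps above:

```wolfram
(* Toy heat maps: 2 classes, each a 2x2 grid (values are made up) *)
toy = {
   {{0.1, 0.7}, {0.2, 0.3}},   (* class 1 *)
   {{0.5, 0.4}, {0.9, 0.1}}    (* class 2 *)
   };

(* Move the class dimension last, then take the maximum over classes per patch *)
toyStrength = Map[Max, Transpose[toy, {3, 1, 2}], {2}]
(* {{0.5, 0.7}, {0.9, 0.3}} *)
```

Each entry of the result is the largest score that any class assigns to that patch.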

Visualize the strength of each patch as a heat map:

In[41]:=
heatmap = ImageApply[{#, 1 - #, 1 - #} &, ImageAdjust@Image[strengthArray]]
Out[41]=

Stretch and unpad the heat map to the original image domain:

In[42]:=
heatmap = ImageTake[
    ImageResize[heatmap, {Max[#]}], {1, Last[#]}, {1, First[#]}] &@
  ImageDimensions[testImage]
Out[42]=

Overlay the heat map on the image:

In[43]:=
ImageCompose[testImage, {ColorConvert[heatmap, "RGB"], 0.4}]
Out[43]=

Obtain and visualize the strength of each patch for the "dog" class:

In[44]:=
idx = Position[labels, "dog"][[1, 1]];
strengthArray = objectHeatmaps[[idx]];
heatmap = ImageApply[{#, 1 - #, 1 - #} &, ImageAdjust@Image[strengthArray]];
heatmap = ImageTake[
    ImageResize[heatmap, {Max[#]}], {1, Last[#]}, {1, First[#]}] &@
  ImageDimensions[testImage]
Out[47]=

Overlay the heat map on the image:

In[48]:=
ImageCompose[testImage, {ColorConvert[heatmap, "RGB"], 0.4}]
Out[48]=

Define a general function to visualize a heat map on an image:

In[49]:=
visualizeHeatmap[img_Image, heatmap_] := Block[{strengthArray, w, h},
   {w, h} = ImageDimensions[img];
   strengthArray = Map[Max, Transpose[heatmap, {3, 1, 2}], {2}];
   strengthArray = ImageApply[{#, 1 - #, 1 - #} &, ImageAdjust@Image[strengthArray]];
   strengthArray = ImageTake[ImageResize[strengthArray, {Max[w, h]}], {1, h}, {1, w}];
   ImageCompose[img, {ColorConvert[strengthArray, "RGB"], 0.4}]
   ];
In[50]:=
Map[visualizeHeatmap[testImage, #] &, NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"][
  testImage, {"ObjectHeatmaps", "PoseHeatmaps"}]]
Out[50]=

Adapt to any size

Automatic image resizing can be avoided by replacing the NetEncoder. First get the NetEncoder:

In[51]:=
encoder = NetExtract[
  NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"], "Input"]
Out[51]=

Note that the NetEncoder resizes the image by keeping the aspect ratio and then pads the result to have a fixed shape of 512x512. Visualize the output of NetEncoder adjusting for brightness:

In[52]:=
Show[ImageAdjust@Image[encoder[testImage], Interleaving -> False], ImageSize -> 280]
Out[52]=
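
As a rough sketch of the resize step, the image dimensions before padding can be estimated by scaling the longer side to the target size. The helper fitDimensions is hypothetical, and the exact rounding used by NetEncoder is an assumption:

```wolfram
(* Hypothetical helper: dimensions after an aspect-ratio-preserving resize
   whose longer side equals `target`; the remainder of the square is padding *)
fitDimensions[{w_, h_}, target_ : 512] := Round[{w, h}*target/Max[w, h]]

fitDimensions[{640, 480}]        (* {512, 384} *)
fitDimensions[{640, 480}, 320]   (* {320, 240} *)
```

For a 640x480 image, this would leave 128 rows of the 512x512 input as padding.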

Create a new NetEncoder with the desired dimensions:

In[53]:=
newEncoder = NetEncoder[{"Image", {320, 320}, Method -> "Fit", Alignment -> {Left, Top}, "MeanImage" -> NetExtract[encoder, "MeanImage"], "VarianceImage" -> NetExtract[encoder, "VarianceImage"]}]
Out[53]=

Attach the new NetEncoder:

In[54]:=
resizedNet = NetReplacePart[
  NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"], "Input" -> newEncoder]
Out[54]=

Obtain the detected bounding boxes with their corresponding classes and confidences for a given image:

In[55]:=
detection = netevaluate[resizedNet, testImage];

Visualize the detection:

In[56]:=
HighlightImage[testImage, GroupBy[detection[["ObjectDetection", All, ;; 2]], Last -> First]]
Out[56]=

Note that even though the localization results and the box confidences are slightly worse compared to the original net, the resized network runs significantly faster:

In[57]:=
netevaluate[
   NetModel["CenterNet Pose Estimation Nets Trained on MS-COCO Data"],
    testImage]; // AbsoluteTiming
Out[57]=
In[58]:=
netevaluate[resizedNet, testImage]; // AbsoluteTiming
Out[58]=

Net information

Inspect the number of parameters of all arrays in the net:

In[59]:=
Information[
 NetModel[
  "CenterNet Pose Estimation Nets Trained on MS-COCO Data"], "ArraysElementCounts"]
Out[59]=

Obtain the total number of parameters:

In[60]:=
Information[
 NetModel[
  "CenterNet Pose Estimation Nets Trained on MS-COCO Data"], "ArraysTotalElementCount"]
Out[60]=

Obtain the layer type counts:

In[61]:=
Information[
 NetModel[
  "CenterNet Pose Estimation Nets Trained on MS-COCO Data"], "LayerTypeCounts"]
Out[61]=

Display the summary graphic:

In[62]:=
Information[
 NetModel[
  "CenterNet Pose Estimation Nets Trained on MS-COCO Data"], "SummaryGraphic"]
Out[62]=

Resource History

Reference