CenterNet Trained on MS-COCO Data

Detect and localize objects in an image

Released in 2019, this family of object detection models detects objects by their center points instead of directly computing axis-aligned boxes. The models use keypoint estimation techniques to find center point locations from generated heat maps and then regress the box sizes and offsets. The center locations are predicted per class, while box sizes and offsets are class agnostic. Compared to anchor-based approaches, CenterNet does not suffer from an extremely large number of box candidates, which require complicated labeling methods as well as additional post-processing such as non-maximum suppression. Also, the detector makes no implicit assumptions about object scales and aspect ratios, unlike other popular approaches that encode them in the anchors.
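
The idea of reading center points off a heat map can be sketched on a small synthetic array. This is an illustration only: it assumes that a center is a cell that is the maximum of its 3x3 neighborhood and exceeds a threshold, and the array heat and the threshold 0.5 are made up for the example.

```wolfram
(* Hypothetical 4x4 heat map for a single class *)
heat = {{0.1, 0.2, 0.1, 0.0},
        {0.2, 0.9, 0.3, 0.0},
        {0.1, 0.3, 0.2, 0.6},
        {0.0, 0.0, 0.1, 0.2}};

(* Maximum over each 3x3 neighborhood *)
localMax = MaxFilter[heat, 1];

(* A cell is a candidate center if it equals its neighborhood maximum
   and its score exceeds the (assumed) threshold of 0.5 *)
peaks = Position[
  MapThread[#1 == #2 && #1 > 0.5 &, {heat, localMax}, 2], True]
(* {{2, 2}, {3, 4}} *)
```

Each surviving position is a {row, column} pair. No pairwise box comparison is needed, which is why a separate non-maximum suppression step can be skipped.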

Training Set Information

Microsoft COCO, a dataset for image recognition, segmentation and captioning, consisting of more than three hundred thousand images in 80 different object classes.

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=

Out[1]=
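
A minimal sketch of the retrieval call, assuming the resource is registered under the title of this page. The variable name net is chosen for illustration, and evaluating the cell requires network access to download the model:

```wolfram
(* Assumes the resource name matches the page title *)
net = NetModel["CenterNet Trained on MS-COCO Data"]
```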

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=

Out[2]=
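
The parameter listing is typically obtained through the "ParametersInformation" property of NetModel. A sketch, again assuming the resource name matches the page title:

```wolfram
(* Lists the parameter names, their allowed values and the defaults *)
NetModel["CenterNet Trained on MS-COCO Data", "ParametersInformation"]
```

The parameter names and values reported by this call are the ones to use when picking a non-default net.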

Pick a non-default net by specifying the parameters:

In[3]:=

Out[3]=

Pick a non-default uninitialized net:

In[4]:=

Out[4]=

Evaluation function

Define the label list for this model:

In[5]:=

labels = {"person", "bicycle", "car", "motorcycle", "airplane", "bus",
"train", "truck", "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"};

Write an evaluation function to scale the result to the input image size and suppress the least probable detections:

In[6]:=

Options[netevaluate] = {MaxFeatures -> 100, AcceptanceThreshold -> 0.1^6};

netevaluate[net_, img_Image, detectionThreshold_ : 0.5, opts : OptionsPattern[]] := Module[
  {netOut, w, h, fc, fh, fw, flatHeatMapCenter, flatProbsCenter,
   suppresedPeaksCenter, newProbsCenter, flatPosCenter, posCenter,
   highScoredCenter, detectionsPerClass, filteredClassIdx, uniqueClasses,
   classes, scale, pHpW, offsetHw, boxCoordinates, tuples},
  netOut = net[img];
  {w, h} = ImageDimensions[img];
  {fc, fh, fw} = Dimensions[netOut["Peaks"]];

  (*Extract Peaks*)
  {flatHeatMapCenter, flatProbsCenter} = Map[
    Transpose@Flatten[Transpose[#, {3, 1, 2}], {1, 2}] &,
    {netOut["Peaks"], netOut["ClassProb"]}];
  suppresedPeaksCenter = UnitStep[OptionValue[AcceptanceThreshold] - flatHeatMapCenter];
  newProbsCenter = flatProbsCenter*suppresedPeaksCenter;

  (*Find positions of (up to) MaxFeatures strongest keypoints*)
  flatPosCenter = Map[Ordering[#, -Min[OptionValue[MaxFeatures], fw*fh]] &, newProbsCenter];
  posCenter = QuotientRemainder[flatPosCenter - 1, fw] + 1;
  newProbsCenter = MapThread[#1[[#2]] &, {newProbsCenter, flatPosCenter}];

  (*Filter low-scored detections*)
  highScoredCenter = UnitStep[newProbsCenter - detectionThreshold];
  (*Total over both levels so the comparison is against a scalar*)
  If[Total[highScoredCenter, 2] == 0, Return[{}]];
  {posCenter, newProbsCenter} = Map[Pick[#, highScoredCenter, 1] &, {posCenter, newProbsCenter}];
  detectionsPerClass = Map[Total, highScoredCenter];
  filteredClassIdx = UnitStep[detectionsPerClass - 1];
  {detectionsPerClass, uniqueClasses} = Map[Pick[#, filteredClassIdx, 1] &, {detectionsPerClass, labels}];
  classes = Flatten[MapThread[ConstantArray[#1, #2] &, {uniqueClasses, detectionsPerClass}], 1];
  {posCenter, newProbsCenter} = Map[Flatten[Pick[#, filteredClassIdx, 1], 1] &, {posCenter, newProbsCenter}];

  (*From array positions to image keypoint positions*)
  scale = Max@N[{w, h}/{fw, fh}];
  {pHpW, offsetHw} = Map[
    Extract[#, Prepend[All] /@ posCenter] &,
    {netOut[["BoxSizes"]], netOut[["BoxOffsets"]]}];
  boxCoordinates = Transpose[{
     Reverse[posCenter - 0.5*pHpW + offsetHw, 2],
     Reverse[posCenter + 0.5*pHpW + offsetHw, 2]}];
  boxCoordinates = Rectangle @@@ MapAt[h - # &, boxCoordinates*scale, {All, All, 2}];

  (*Extract the features and strengths*)
  tuples = MaximalBy[
    Transpose[{boxCoordinates, classes, newProbsCenter}], Last,
    UpTo[OptionValue@MaxFeatures]];
  tuples
]
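
The keypoint selection step in the function above combines Ordering, which finds the strongest entries of a flattened score map, with QuotientRemainder, which converts flat indices back to {row, column} positions. A small sketch on made-up data, where fw and probs are hypothetical values for illustration:

```wolfram
fw = 4;  (* hypothetical feature-map width *)

(* Hypothetical 2x4 score map, flattened row by row *)
probs = {0.1, 0.9, 0.3, 0.7, 0.05, 0.2, 0.6, 0.0};

(* Flat positions of the three largest scores, in ascending score order *)
flatPos = Ordering[probs, -3]
(* {7, 4, 2} *)

(* Convert 1-based flat positions into 1-based {row, column} pairs *)
pos = QuotientRemainder[flatPos - 1, fw] + 1
(* {{2, 3}, {1, 4}, {1, 2}} *)
```

Subtracting 1 before the division and adding 1 afterwards accounts for the 1-based indexing of Wolfram Language lists.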

Basic usage

Obtain the detected bounding boxes with their corresponding classes and confidences for a given image:

In[7]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/ea76869c-40be-4a4f-a0f8-693e9473d199"]

In[8]:=

Out[8]=

Inspect which classes are detected:

In[9]:=

Out[9]=
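
Since netevaluate returns a list of {box, class, confidence} tuples, the detected classes are the second element of each tuple. A sketch on hypothetical detections (the boxes, classes and confidences below are invented for illustration):

```wolfram
(* Hypothetical detections in the {box, class, confidence} form *)
detection = {
   {Rectangle[{10, 20}, {110, 220}], "cat", 0.92},
   {Rectangle[{150, 30}, {300, 200}], "dog", 0.81},
   {Rectangle[{40, 60}, {90, 120}], "cat", 0.55}};

(* Distinct classes, in order of first appearance *)
classes = DeleteDuplicates[detection[[All, 2]]]
(* {"cat", "dog"} *)

(* Number of detections per class *)
counts = Counts[detection[[All, 2]]]
(* <|"cat" -> 2, "dog" -> 1|> *)
```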

Visualize the detection:

In[10]:=

Out[10]=

Network result

For the default input size of 512x512, the net produces 128x128 bounding boxes whose centers mostly follow a square grid. For each bounding box, the net produces the box’s size and the offset of the box’s center with respect to the square grid:

In[11]:=

In[12]:=

Out[12]=

Change coordinate system into a graphics domain:

In[13]:=

res = Map[Transpose[#, {3, 2, 1}] &, res];
res["BoxOffsets"] = Map[Reverse, res["BoxOffsets"], {2}];
res["BoxSizes"] = Map[Reverse, res["BoxSizes"], {2}];

Compute the box center positions:

In[14]:=

Visualize the box center positions. They follow a square grid with offsets:

In[15]:=

Out[15]=

Compute the box coordinates:

In[16]:=

In[17]:=

Out[17]=
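
The box computation can be sketched on a few made-up values: the center is the grid position plus its offset, and the corners lie half a box size away from the center on each side. The names gridPos, offsets and sizes are hypothetical stand-ins for the corresponding net outputs:

```wolfram
(* Hypothetical grid cells, offsets and box sizes, all as {x, y} pairs *)
gridPos = {{3, 5}, {10, 2}};
offsets = {{0.25, -0.5}, {0., 0.5}};
sizes = {{4., 2.}, {6., 8.}};

(* Box centers: grid position shifted by the predicted offset *)
centers = gridPos + offsets
(* {{3.25, 4.5}, {10., 2.5}} *)

(* Boxes as {xmin, ymin, xmax, ymax}: center minus/plus half the size *)
boxes = MapThread[Join[#1 - #2/2, #1 + #2/2] &, {centers, sizes}]
(* {{1.25, 3.5, 5.25, 5.5}, {7., -1.5, 13., 6.5}} *)
```

Coordinates may fall outside the grid, as in the second box, when a predicted box extends past the image border.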

Define a function to rescale the box coordinates to the original image size:

In[18]:=

boxDecoder[{a_, b_, c_, d_}, {w_, h_}, scale_] := Rectangle[{a*scale, h - b*scale}, {c*scale, h - d*scale}];
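
A short usage sketch of this decoder. The definition is repeated so the cell runs on its own; the box, the 512x512 image size and the scale of 4 are made-up values for illustration:

```wolfram
boxDecoder[{a_, b_, c_, d_}, {w_, h_}, scale_] :=
  Rectangle[{a*scale, h - b*scale}, {c*scale, h - d*scale}];

(* Hypothetical box {xmin, ymin, xmax, ymax} in grid units *)
rect = boxDecoder[{1.25, 3.5, 5.25, 5.5}, {512, 512}, 4]
(* Rectangle[{5., 498.}, {21., 490.}] *)
```

The y coordinates are subtracted from the image height because array rows count from the top, while graphics coordinates count from the bottom. The width w is accepted for symmetry but not used.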

Visualize all the boxes predicted by the net scaled by their "objectness" measures:

In[19]:=

Graphics[
MapThread[{EdgeForm[Opacity[Total[#1]*0.5]], #2} &, {Flatten@
Map[Max, res["ClassProb"], {2}], Map[boxDecoder[#, {128, 128}, 1] &, boxes]}],
BaseStyle -> {FaceForm[], EdgeForm[{Thin, Black}]}
]

Out[19]=

Visualize all the boxes scaled by the probability that they contain a cat:

In[20]:=

Out[20]=
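
The following cells use an index idx for the "cat" class. A sketch of how such an index can be looked up, shown here on the first 16 entries of the label list only (the name labelsHead is hypothetical):

```wolfram
(* First 16 labels of the list defined earlier on this page *)
labelsHead = {"person", "bicycle", "car", "motorcycle", "airplane",
   "bus", "train", "truck", "boat", "traffic light", "fire hydrant",
   "stop sign", "parking meter", "bench", "bird", "cat"};

(* Position of "cat" in the label list *)
idx = First[FirstPosition[labelsHead, "cat"]]
(* 16 *)
```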

In[21]:=

Graphics[
MapThread[{EdgeForm[Opacity[#1]], Rectangle @@ #2} &, {Flatten[
res["ClassProb"][[All, All, idx]]], Map[boxDecoder[#, {128, 128}, 1] &, boxes]}],
BaseStyle -> {FaceForm[], EdgeForm[{Thin, Black}]}
]

Out[21]=

Superimpose the cat prediction on top of the scaled input received by the net:

In[22]:=

HighlightImage[testImage, Graphics[
MapThread[{EdgeForm[{Opacity[#1]}], #2} &, {Flatten[
res["ClassProb"][[All, All, idx]]], Map[boxDecoder[#, ImageDimensions[testImage], Max@N[ImageDimensions[testImage]/{128, 128}]] &, boxes]}]], BaseStyle -> {FaceForm[], EdgeForm[{Thin, Red}]}]

Out[22]=

Heat map visualization

Every box is associated with a scalar strength value indicating the likelihood that the patch contains an object:

In[23]:=

In[24]:=

Out[24]=

The strength of each patch is the maximal element aggregated across all classes. Obtain the strength of each patch:

In[25]:=

Out[26]=
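
The aggregation across classes can be sketched on a tiny synthetic array. The sketch assumes the class probabilities are laid out as {class, row, column}, and the values are invented for illustration:

```wolfram
(* Hypothetical probabilities for 2 classes on a 2x2 grid *)
classProbs = {
   {{0.1, 0.2}, {0.3, 0.4}},
   {{0.5, 0.1}, {0.2, 0.9}}};

(* Move the class dimension last, then take the maximum per patch *)
strengthArray = Map[Max, Transpose[classProbs, {3, 1, 2}], {2}]
(* {{0.5, 0.2}, {0.3, 0.9}} *)
```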

Visualize the strength of each patch as a heat map:

In[27]:=

Out[27]=

Stretch and unpad the heat map to the original image domain:

In[28]:=

Out[28]=
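
The stretch-and-unpad step can be sketched with a random heat map. It assumes, as in the encoder shown later on this page, that the padding sits on the right and bottom, so the original image occupies the top-left corner. The dimensions dims are hypothetical:

```wolfram
dims = {300, 200};  (* hypothetical original image {width, height} *)

(* Stand-in for the 128x128 heat map produced from the net output *)
heatmap = Image[RandomReal[1, {128, 128}]];

(* Stretch the square map so its side equals the longer image side *)
resized = ImageResize[heatmap, {Max[dims]}];

(* Keep the top-left region covering the original image: rows, then columns *)
unpadded = ImageTake[resized, {1, Last[dims]}, {1, First[dims]}];

ImageDimensions[unpadded]
(* {300, 200} *)
```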

Overlay the heat map on the image:

In[29]:=

Out[29]=

Obtain and visualize the strength of each patch for the "cat" class:

In[30]:=

strengthArray = classProbs[[idx]];
heatmap = ImageApply[{#, 1 - #, 1 - #} &, ImageAdjust@Image[strengthArray]];
heatmap = ImageTake[
ImageResize[heatmap, {Max[#]}], {1, Last[#]}, {1, First[#]}] &@
ImageDimensions[testImage]

Out[32]=

Overlay the heat map on the image:

In[33]:=

Out[33]=

Adapt to any size

Automatic image resizing can be avoided by replacing the NetEncoder. First get the NetEncoder:

In[34]:=

Out[34]=

Note that the NetEncoder resizes the image by keeping the aspect ratio and then pads the result to have a fixed shape of 512x512. Visualize the output of NetEncoder adjusting for brightness:

In[35]:=

Out[35]=
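
The aspect-preserving fit can be illustrated with simple arithmetic. This is an approximation that ignores how the encoder rounds to whole pixels, and the image size is hypothetical:

```wolfram
{w, h} = {300, 200};  (* hypothetical input image size *)
target = 512;         (* side of the square encoder output *)

(* One factor for both axes, set by the longer side *)
factor = N[target/Max[w, h]];

(* Size of the resized image before padding *)
fitted = {w, h}*factor
(* approximately {512., 341.333} *)

(* Remaining space filled with padding along each axis *)
padding = target - fitted
(* approximately {0., 170.667} *)
```

Only the shorter side is padded, so about a third of the encoder output is padding for a 3:2 image.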

Create a new NetEncoder with the desired dimensions:

In[36]:=

newEncoder = NetEncoder[{"Image", {320, 320}, Method -> "Fit", Alignment -> {Left, Top}, "MeanImage" -> NetExtract[encoder, "MeanImage"], "VarianceImage" -> NetExtract[encoder, "VarianceImage"]}]

Out[36]=

Attach the new NetEncoder:

In[37]:=

Out[37]=

Obtain the detected bounding boxes with their corresponding classes and confidences for a given image:

In[38]:=

Out[38]=

Visualize the detection:

In[39]:=

Out[39]=

Note that even though the localization results and the box confidences are slightly worse compared to the original net, the resized network runs significantly faster:

In[40]:=