PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data

Detect and localize text in an image

The Pixel Aggregation Network (PAN) is a family of efficient and accurate text detectors featuring a low-cost segmentation head and learnable postprocessing. The segmentation head consists of the Feature Pyramid Enhancement Module (FPEM), which enhances the features with multilevel information, and the Feature Fusion Module (FFM), which fuses the enhanced features. The learnable pixel aggregation step then groups text pixels into instances via similarity vectors, which improves precision. On CTW1500, PAN achieves a 79.9% F-measure at 84.2 FPS.
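The pixel-aggregation idea can be illustrated outside the Wolfram workflow with a minimal NumPy sketch (the function name aggregate_pixels and the simple iterative border growth are assumptions for illustration, not the model's exact inference code): starting from a kernel region, neighboring text pixels are absorbed whenever their similarity vector lies close to the kernel's mean embedding.

```python
import numpy as np

def aggregate_pixels(text_mask, kernel_mask, emb, dist_thresh=0.5):
    """Grow a kernel region over the text mask by embedding similarity.

    text_mask:   (H, W) bool, predicted text region
    kernel_mask: (H, W) bool, shrunken kernel of one text instance
    emb:         (H, W, D) float, per-pixel similarity vectors
    """
    # The mean embedding of the kernel acts as the instance signature.
    mean = emb[kernel_mask].mean(axis=0)
    grown = kernel_mask.copy()
    changed = True
    while changed:
        changed = False
        # 4-connected dilation of the current region gives candidate pixels.
        border = np.zeros_like(grown)
        border[1:, :] |= grown[:-1, :]
        border[:-1, :] |= grown[1:, :]
        border[:, 1:] |= grown[:, :-1]
        border[:, :-1] |= grown[:, 1:]
        border &= text_mask & ~grown
        if border.any():
            # Keep only border pixels close to the kernel signature.
            d = np.linalg.norm(emb[border] - mean, axis=1)
            accept = np.zeros_like(grown)
            accept[border] = d < dist_thresh
            if accept.any():
                grown |= accept
                changed = True
    return grown
```

The expandComponent function below performs the same expansion with Wolfram image operations, using the distance from each perimeter pixel's similarity vector to the kernel's mean.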

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific architecture. Inspect the available parameters:

In[2]:=
NetModel["PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the architecture:

In[3]:=
NetModel[{"PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data",
   "Dataset" -> "ICDAR2015"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data",
   "Dataset" -> "ICDAR2015"}, "UninitializedEvaluationNet"]
Out[4]=

Evaluation function

Write an evaluation function to extract the bounding regions and masks for each text instance:

In[5]:=
perimeter = ImageSubtract[Dilation[#, 1], #] &;
expandComponent[component_, similarity_, t_] := Module[{p, mean, dist, new},
  p = PixelValuePositions[perimeter@component, 1];
  mean = ImageMeasurements[similarity, "Mean", Masking -> component];
  dist = DistanceMatrix[PixelValue[similarity, p], {mean}][[All, 1]];
  new = Pick[p, UnitStep[t - dist], 1];
  ReplacePixelValue[component, new -> 1]]
In[6]:=
Options[netevaluate] = {"MaskThreshold" -> 0.5, "KernelThreshold" -> 0.2, "MinTextArea" -> 16, "RegionType" -> "MinOrientedRectangle", "Output" -> "Regions"};
netevaluate[img_, OptionsPattern[]] := Module[
   {result, kernel, similarity, comlist, labels, masks, elem, areas, i, inputImageDims, h, w, ratio, tRatio, contours, boundingReg},
   result = NetModel["PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data"][img];
   kernel = Image[
     UnitStep[result["Kernel"] - OptionValue["KernelThreshold"]]*
      UnitStep[result["TextRegion"] - OptionValue["MaskThreshold"]]];
   similarity = Image[result["Similarity"], Interleaving -> False];
   comlist = Image /@ Values@ComponentMeasurements[kernel, "Mask"];
   labels = Map[expandComponent[#, similarity, OptionValue["MaskThreshold"]] &, comlist];
   (*filter the results by thresholding the instance area*)
   masks = Association[];
   elem = 1;
   For[i = 1, i <= Length[labels], i++,
    With[{label = labels[[i]]},
     areas = Values[ComponentMeasurements[label, "Area"]];
     If[SameQ[areas, {}], Continue[]];
     If[First[areas] >= OptionValue["MinTextArea"],
      AppendTo[masks, elem -> label];
      elem += 1
      ]
     ]
    ];
   (*scale the results to match the shape of the original image*)
   inputImageDims = ImageDimensions[img];
   {w, h} = ImageDimensions[kernel];
   ratio = ImageAspectRatio[img];
   tRatio = ImageAspectRatio[kernel];
   masks = Map[
     If[tRatio/ratio > 1,
       ImageResize[ImageCrop[#, {w, w*ratio}], inputImageDims],
       ImageResize[ImageCrop[#, {h/ratio, h}], inputImageDims]
       ] &, masks];
   If[SameQ[OptionValue["Output"], "Masks"], Return[masks]];
   (*get the text contours*)
   contours = Map[
     Values[ComponentMeasurements[#, "PerimeterPositions", CornerNeighbors -> True]][[1]] &, masks];
   (*get the bounding region of the requested type for each contour*)
   boundingReg = Map[BoundingRegion[#[[1]], OptionValue["RegionType"]] &, contours];
   boundingReg
   ];

Basic usage

Obtain the bounding boxes and masks for each text instance in a given image:

In[7]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/c554b551-f58a-4ce8-8565-7db9f95fdc81"]
In[8]:=
detection = netevaluate[testImage];

The output is an Association mapping each text-instance index to its bounding region:

In[9]:=
detection
Out[9]=

Visualize the bounding regions:

In[10]:=
HighlightImage[testImage, detection, ImageLabels -> None]
Out[10]=

Advanced usage

Get an image:

In[11]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/2d82ac04-570f-419d-a511-81fdee62edaf"]

Obtain the bounding regions using the default evaluation and visualize them:

In[12]:=
reg1 = netevaluate[testImage];
In[13]:=
HighlightImage[testImage, reg1, ImageLabels -> None]
Out[13]=

Get the individual masks via the option "Output" -> "Masks":

In[14]:=
reg2 = netevaluate[testImage, "Output" -> "Masks"]
Out[14]=
In[15]:=
HighlightImage[testImage, reg2, ImageLabels -> None]
Out[15]=

Increase the "MinTextArea" to remove small regions:

In[16]:=
reg3 = netevaluate[testImage, "MinTextArea" -> 500];
In[17]:=
HighlightImage[testImage, reg3, ImageLabels -> None]
Out[17]=

Set the region type to "MinConvexPolygon" to generate convex polygonal regions that fit the text more tightly:

In[18]:=
reg4 = netevaluate[testImage, "RegionType" -> "MinConvexPolygon"];
In[19]:=
HighlightImage[testImage, reg4, ImageLabels -> None]
Out[19]=

Network result

Get an image:

In[20]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/0eb24de7-a93c-4416-9f0b-32c0ab1005d1"]

Run the model on the image:

In[21]:=
result = NetModel[
    "PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data"][
   testImage];
In[22]:=
Keys[result]
Out[22]=

The model's outputs are the "TextRegion", "Kernel" and "Similarity" components. The text region matrix outlines the entire area of each text instance, while the kernel matrix helps distinguish between individual text instances. The similarity vector then guides the grouping of pixels within each instance:

In[23]:=
textProbMap = Image[result["TextRegion"]]
Out[23]=
In[24]:=
kernelMap = Image[result["Kernel"]]
Out[24]=
In[25]:=
similarityMap = Image[result["Similarity"], Interleaving -> False]
Out[25]=

Binarize the text probability map and the kernel. Multiply both images to obtain the final kernel:

In[26]:=
mask = Binarize[textProbMap, 0.5];
kernel = Binarize[kernelMap, 0.5];
kernel = ImageMultiply[kernel, mask]
Out[28]=

Split the detected instances:

In[29]:=
comlist = ComponentMeasurements[kernel, "Mask"]
Out[29]=

Use the expandComponent function to expand each kernel region, using the similarity map as a guide:

In[30]:=
labels = Map[expandComponent[#, similarityMap, 0.5] &, Image /@ Values@comlist]
Out[30]=

Filter the small areas:

In[31]:=
labels = Select[labels, (Values[ComponentMeasurements[#, "Area"]][[1]] > 80) &]
Out[31]=

All outputs are rectangular matrices with fixed dimensions of 160×192. Rescale the results to the shape of the original image:

In[32]:=
ImageDimensions /@ labels
Out[32]=
In[33]:=
scaleResult[img_Image, orImg_Image] := Module[{inputImageDims, w, h, ratio, tRatio},
   (*scale the results to match the shape of the original image*)
   inputImageDims = ImageDimensions[orImg];
   {w, h} = ImageDimensions[img];
   ratio = ImageAspectRatio[orImg];
   tRatio = ImageAspectRatio[img];
   If[
    tRatio/ratio > 1,
    ImageResize[ImageCrop[img, {w, w*ratio}], inputImageDims],
    ImageResize[ImageCrop[img, {h /ratio, h}], inputImageDims]
    ]
   ];
labels = Map[scaleResult[#, testImage] &, labels]
Out[34]=
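The crop-then-resize logic of scaleResult can be sketched in NumPy (a minimal illustration; the function name crop_to_aspect is an assumption, and the centered crop mirrors ImageCrop's default behavior): the fixed-size output is cropped to the original aspect ratio, after which it can be resized to the original dimensions.

```python
import numpy as np

def crop_to_aspect(mask, orig_w, orig_h):
    """Crop a fixed-size network mask back to the original aspect ratio.

    mask: (H, W) array at the network's fixed resolution (e.g. 160x192).
    The network preserved the aspect ratio and padded the rest, so the
    valid region is recovered by a centered crop; the caller then
    resizes the result to (orig_h, orig_w).
    """
    h, w = mask.shape
    ratio = orig_h / orig_w  # original aspect ratio
    t_ratio = h / w          # network-output aspect ratio
    if t_ratio / ratio > 1:
        # Output is taller than the original: crop the height.
        target_h = round(w * ratio)
        top = (h - target_h) // 2
        return mask[top:top + target_h, :]
    else:
        # Output is wider than the original: crop the width.
        target_w = round(h / ratio)
        left = (w - target_w) // 2
        return mask[:, left:left + target_w]
```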

Visualize the detected text instances:

In[35]:=
HighlightImage[testImage, Thread[Range[Length[labels]] -> labels], ImageLabels -> None]
Out[35]=

Net information

Inspect the number of parameters of all arrays in the net:

In[36]:=
Information[
 NetModel[
  "PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data"], "ArraysElementCounts"]
Out[36]=

Obtain the total number of parameters:

In[37]:=
Information[
 NetModel[
  "PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data"], "ArraysTotalElementCount"]
Out[37]=

Obtain the layer type counts:

In[38]:=
Information[
 NetModel[
  "PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data"], "LayerTypeCounts"]
Out[38]=

Display the summary graphic:

In[39]:=
Information[
 NetModel[
  "PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data"], "SummaryGraphic"]
Out[39]=

Export to ONNX

Export the net to the ONNX format:

In[40]:=
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], NetModel[
   "PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data"]]
Out[40]=

Get the size of the ONNX file:

In[41]:=
FileByteCount[onnxFile]
Out[41]=

The size is similar to the byte count of the resource object:

In[42]:=
NetModel["PANet Text Detector Trained on ICDAR-2015 and CTW1500 Data", "ByteCount"]
Out[42]=

Check some metadata of the ONNX model:

In[43]:=
{opsetVersion, irVersion} = {Import[onnxFile, "OperatorSetVersion"], Import[onnxFile, "IRVersion"]}
Out[43]=

Import the model back into the Wolfram Language. The NetEncoder and NetDecoder will be absent because ONNX does not support them:

In[44]:=
Import[onnxFile]
Out[44]=

Resource History

Reference