PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data

Detect and localize text in an image

This family of models introduces the novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances of arbitrary shape. The basic framework of PSENet is a feature pyramid network with a ResNet backbone; the network produces seven segmentation masks, each at a different scale. In the postprocessing step, PSENet uses a progressive scale expansion algorithm that gradually expands the minimal-scale kernel to the complete shape of the text instance, avoiding conflicting pixel labels at each expansion step. Experiments on CTW1500 validate the effectiveness of PSENet, achieving an F-measure of 74.3% at 27 FPS.

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by the dataset on which it was trained. Inspect the available parameters:

In[2]:=
NetModel["PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the training dataset:

In[3]:=
NetModel[{"PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data", "Dataset" -> "ICDAR2015"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data", "Dataset" -> "ICDAR2015"}, "UninitializedEvaluationNet"]
Out[4]=

Evaluation function

Write an evaluation function to scale the result to the input image size and suppress the least probable detections:

In[5]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/25cb88da-9189-4f4c-b6bf-c5119995f37c"]
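The actual definition of netevaluate comes from the cloud object above. As a purely illustrative sketch (not the hidden definition), the postprocessing can be assembled from the steps demonstrated in the "Network result" section; the name netevaluateSketch, the mask threshold of 0.5, the score threshold of 0.85 and the mean-probability scoring rule are all assumptions:

```wolfram
(* Illustrative sketch only; the actual netevaluate is defined by the cloud object above *)
netevaluateSketch[img_Image, maskThreshold_ : 0.5, scoreThreshold_ : 0.85] :=
 Module[{probs, textMask, kernelMask, components, instanceMasks, probImg, scores, scaled, regions, keep},
  probs = NetModel["PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data"][img];
  (* first mask (largest scale) selects text pixels; last mask (smallest scale) separates instances *)
  textMask = Image[UnitStep[First[probs] - maskThreshold]];
  kernelMask = ImageMultiply[textMask, Image[UnitStep[Last[probs] - maskThreshold]]];
  components = MorphologicalComponents[kernelMask];
  instanceMasks = Table[Image[Unitize[SelectComponents[components, #Label == i &]]], {i, Max[components]}];
  (* single-step stand-in for the progressive scale expansion: grow each kernel inside the text mask *)
  instanceMasks = GeodesicDilation[#, textMask] & /@ instanceMasks;
  (* score each instance by its mean text probability (an assumed scoring rule) *)
  probImg = Image[First[probs]];
  scores = Mean[PixelValue[probImg, PixelValuePositions[#, 1]]] & /@ instanceMasks;
  (* rescale to the input size (ignoring the aspect-ratio cropping handled by scaleResult) and enclose in polygons *)
  scaled = ImageResize[#, ImageDimensions[img], Resampling -> "Nearest"] & /@ instanceMasks;
  regions = BoundingRegion[PixelValuePositions[#, 1], "MinConvexPolygon"] & /@ scaled;
  keep = Flatten[Position[scores, s_ /; s >= scoreThreshold]];
  <|"BoundingRegion" -> regions[[keep]], "Scores" -> scores[[keep]]|>
  ]
```

Applied to testImage, this sketch returns an Association with the same keys as the actual evaluation function, though its regions and scores will generally differ from those of the hidden definition.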

Basic usage

Obtain the detected bounding regions with their corresponding confidence scores for a given image:

In[6]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/680db4d5-3580-4712-b37a-7e17a6dd8695"]
In[7]:=
detection = netevaluate[testImage];

The evaluation function returns an Association with the keys "BoundingRegion" and "Scores":

In[8]:=
Keys[detection]
Out[8]=

The "BoundingRegion" is a list of Polygon expressions corresponding to the bounding regions of the detected objects:

In[9]:=
detection["BoundingRegion"]
Out[9]=

"Scores" contains the confidence scores of the detected objects:

In[10]:=
detection["Scores"]
Out[10]=

Visualize the bounding region for each text instance:

In[11]:=
HighlightImage[testImage, detection["BoundingRegion"], ImageLabels -> None]
Out[11]=

Get the individual masks via the option "Output"->"Masks":

In[12]:=
masks = netevaluate[testImage, "Output" -> "Masks"]
Out[12]=

Visualize the masks for each text instance with its assigned score:

In[13]:=
HighlightImage[testImage, Values@masks, ImageLegends -> Flatten@Values@detection["Scores"]]
Out[13]=

Network result

Get a sample image:

In[14]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/b462488a-4e49-487a-b5bb-3fd1be5a171e"]

The network computes seven segmentation masks for all the text instances, each at a different scale:

In[15]:=
results = NetModel[
    "PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data"][
   testImage];
In[16]:=
Dimensions[results]
Out[16]=

Visualize the segmentation masks:

In[17]:=
GraphicsRow[Map[Image, results]]
Out[17]=

Rescale the probability map of the first segmentation mask to the original image size:

In[18]:=
scaleResult[img_Image, orImg_Image] := Module[{inputImageDims, w, h, ratio, tRatio},
   (* scale the result to match the dimensions of the original image *)
   inputImageDims = ImageDimensions[orImg];
   {w, h} = ImageDimensions[img];
   ratio = ImageAspectRatio[orImg];
   tRatio = ImageAspectRatio[img];
   If[tRatio/ratio > 1,
    ImageResize[ImageCrop[img, {w, w*ratio}], inputImageDims],
    ImageResize[ImageCrop[img, {h/ratio, h}], inputImageDims]
    ]
   ];
In[19]:=
probmap = scaleResult[Image[results[[1]]], testImage];

Visualize the probability map for the presence of text:

In[20]:=
ImageCompose[
 Colorize[probmap, ColorFunction -> ColorData["TemperatureMap"], ColorRules -> {0 -> White}], {testImage, 0.6}]
Out[20]=

Threshold the results to get the masks:

In[21]:=
maskThreshold = 0.5;
masks = Image[Boole[#]] & /@ Map[# > maskThreshold &, results, {3}];

The first segmentation mask is used as the text mask because it has the largest scale, which allows the selection of entire text regions. Intersect the masks with the predicted text regions:

In[22]:=
textMask = masks[[1]];
kernelMasks = Map[ImageMultiply[textMask, #] &, masks];

The MorphologicalComponents function can create masks for each text instance, using the final segmentation mask. This mask, which has the smallest scale, clearly separates different text instances by keeping their boundaries apart:

In[23]:=
labels = MorphologicalComponents[kernelMasks[[-1]]];
Colorize@labels
Out[24]=

Use the SelectComponents function to split the components into different images:

In[25]:=
labels = Table[
  Image@SelectComponents[labels, SameQ[#Label, i] &],
  {i, Max[labels]}
  ]
Out[25]=

The progressive scale expansion algorithm starts from the pixels of multiple kernels and iteratively merges adjacent text pixels, avoiding conflicts over shared pixels and preserving the distinction between instances. Define a function that removes the shared pixels between kernels:

In[26]:=
removeIntersect[labels_, labelsPast_] := Module[{labelsNew, nLabels},
   nLabels = Length[labels];
   labelsNew = Table[
     ImageSubtract[labels[[i]], ImageMultiply[labelsPast[[i]], ImageAdd[labels[[Complement[Range[nLabels], {i}]]]]]],
     {i, nLabels}];
   Table[
    ImageMultiply[labelsNew[[i]], Sequence @@ ColorNegate@labelsNew[[Complement[Range[nLabels], {i}]]]],
    {i, nLabels}]
   ];

Apply the progressive scale expansion algorithm starting from the mask with the smallest scale and adding pixels progressively using the other masks:

In[27]:=
labels = Fold[removeIntersect[
     Map[Function[x, GeodesicDilation[x, #2]], #1], #1] &, labels, Reverse[kernelMasks]];
GraphicsRow[labels]
Out[28]=

Rescale the final list of masks to the original image size and visualize:

In[29]:=
masks = Map[scaleResult[#, testImage] &, labels];
HighlightImage[testImage, Thread[Range[Length[masks]] -> masks], ImageLabels -> None]
Out[30]=

It is possible to choose among several bounding region types. Find the contour points of each mask and compute different bounding region types to enclose each piece of text:

In[31]:=
contours = Map[
   Values[ComponentMeasurements[#, "PerimeterPositions", CornerNeighbors -> True]][[1]] &, masks];
regionTypes = {"MinRectangle", "MinOrientedRectangle", "MinConvexPolygon"};
regions = Table[BoundingRegion[contour[[1]], regType], {regType, regionTypes}, {contour, contours}];
In[32]:=
MapThread[
 Labeled[
   HighlightImage[testImage, Thread[Range[Length[#1]] -> #1], ImageLabels -> None], #2, Top] &,
 {regions, regionTypes}
 ]
Out[32]=

Net information

Inspect the number of parameters of all arrays in the net:

In[33]:=
Information[
 NetModel[
  "PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data"], "ArraysElementCounts"]
Out[33]=

Obtain the total number of parameters:

In[34]:=
Information[
 NetModel[
  "PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data"], "ArraysTotalElementCount"]
Out[34]=

Obtain the layer type counts:

In[35]:=
Information[
 NetModel[
  "PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data"], "LayerTypeCounts"]
Out[35]=

Display the summary graphic:

In[36]:=
Information[
 NetModel[
  "PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data"], "SummaryGraphic"]
Out[36]=

Export to ONNX

Export the net to the ONNX format:

In[37]:=
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], NetModel[
   "PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data"]]
Out[37]=

Get the size of the ONNX file:

In[38]:=
FileByteCount[onnxFile]
Out[38]=

The size is similar to the byte count of the resource object:

In[39]:=
NetModel["PSENet Text Detector Trained on ICDAR-2015 and CTW1500 Data", "ByteCount"]
Out[39]=

Check some metadata of the ONNX model:

In[40]:=
{OpsetVersion, IRVersion} = {Import[onnxFile, "OperatorSetVersion"], Import[onnxFile, "IRVersion"]}
Out[40]=

Import the model back into the Wolfram Language. Note that the NetEncoder and NetDecoder will be absent because they are not supported by the ONNX format:

In[41]:=
Import[onnxFile]
Out[41]=

Resource History

Reference