PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network
The reading of arbitrarily-shaped text has received increasing research
attention. However, existing text spotters are mostly built on two-stage
frameworks or character-based methods, which suffer from either Non-Maximum
Suppression (NMS), Region-of-Interest (RoI) operations, or character-level
annotations. In this paper, to address the above problems, we propose a novel
fully convolutional Point Gathering Network (PGNet) for reading
arbitrarily-shaped text in real-time. The PGNet is a single-shot text spotter,
where the pixel-level character classification map is learned with the
proposed PG-CTC loss, avoiding the need for character-level annotations. With
a PG-CTC
decoder, we gather high-level character classification vectors from
two-dimensional space and decode them into text symbols without NMS and RoI
operations involved, which guarantees high efficiency. Additionally, a graph
refinement module (GRM) is proposed to reason about the relations between each
character and its neighbors, optimizing the coarse recognition and improving the
end-to-end performance. Experiments show that the proposed method achieves
competitive accuracy while significantly improving the running speed. In
particular, on Total-Text, it runs at 46.7 FPS, surpassing previous
spotters by a large margin.
Comment: 10 pages, 8 figures, AAAI 202
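The core decoding idea described above — gathering per-pixel character
classification vectors along a text region and collapsing them with CTC rules,
with no NMS or RoI operations — can be illustrated with a minimal sketch. This
is not the authors' implementation; the function name, the greedy (best-path)
decoding, and the use of center-line sample points are assumptions for
illustration only.

```python
import numpy as np

def pg_ctc_greedy_decode(char_map, center_points, blank=0):
    """Toy sketch of a PG-CTC-style decoder (assumed form).

    char_map: (H, W, C) per-pixel character classification scores.
    center_points: (row, col) samples along a text center line.
    Gathers the classification vector at each point, then applies a
    greedy CTC collapse: merge repeated labels, then drop blanks.
    """
    # Gather high-level classification vectors from 2-D space.
    seq = [int(np.argmax(char_map[r, c])) for r, c in center_points]
    # Greedy CTC collapse: merge repeats, then remove the blank label.
    out, prev = [], None
    for s in seq:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out
```

Because decoding is a direct gather-and-collapse over the classification map,
no candidate boxes are generated, which is why no NMS or RoI step is needed.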
PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer
We present PBFormer, an efficient yet powerful scene text detector that
unifies the transformer with a novel text shape representation Polynomial Band
(PB). The representation has four polynomial curves to fit a text's top,
bottom, left, and right sides, which can capture a text with a complex shape by
varying polynomial coefficients. PB has appealing features compared with
conventional representations: 1) It can model different curvatures with a fixed
number of parameters, while polygon-points-based methods need to utilize a
different number of points. 2) It can distinguish adjacent or overlapping texts
as they have distinctly different curve coefficients, while segmentation-based or
points-based methods suffer from adhesive spatial positions. PBFormer combines
the PB with the transformer, which can directly generate smooth text contours
sampled from predicted curves without interpolation. A parameter-free
cross-scale pixel attention (CPA) module is employed to highlight the feature
map of a suitable scale while suppressing the other feature maps. The simple
operation can help detect small-scale texts and is compatible with the
one-stage DETR framework, where no postprocessing exists for NMS. Furthermore,
PBFormer is trained with a shape-contained loss, which not only enforces the
piecewise alignment between the ground truth and the predicted curves but also
makes the curves' positions and shapes consistent with each other. Without
bells and whistles such as text pre-training, our method outperforms previous
state-of-the-art text detectors on arbitrary-shaped text datasets.
Comment: 9 pages, 8 figures, accepted by ACM MM 202
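The Polynomial Band idea — fitting polynomial curves to a text's sides so that
a smooth contour can be sampled directly from predicted coefficients, without
interpolation — can be sketched as follows. This toy version uses only top and
bottom curves of a horizontal band (the paper fits four curves); the function
name and the [0, 1] parameter range are assumptions for illustration.

```python
import numpy as np

def sample_pb_contour(top, bottom, n=20):
    """Sketch of sampling a contour from polynomial coefficients.

    top, bottom: coefficient arrays (highest degree first, as in
    numpy.polyval) giving y = f(x) for the band's top and bottom
    curves over x in [0, 1].
    Returns a closed contour of 2*n points sampled directly from the
    curves; varying the coefficients changes the curvature while the
    parameter count stays fixed.
    """
    xs = np.linspace(0.0, 1.0, n)
    top_pts = np.stack([xs, np.polyval(top, xs)], axis=1)
    bot_pts = np.stack([xs, np.polyval(bottom, xs)], axis=1)
    # Close the contour: top left-to-right, then bottom right-to-left.
    return np.concatenate([top_pts, bot_pts[::-1]], axis=0)
```

Note how a fixed-length coefficient vector describes both straight and highly
curved text, in contrast to polygon-point methods whose point count must vary.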
Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning
Due to their flexible representation of arbitrarily-shaped scene text and
simple pipelines, bottom-up segmentation-based methods have become mainstream
in real-time scene text detection. Despite great progress, these methods show
deficiencies in robustness and still suffer from false positives and instance
adhesion. Different from existing methods which integrate multiple-granularity
features or multiple outputs, we resort to the perspective of representation
learning in which auxiliary tasks are utilized to enable the encoder to jointly
learn robust features with the main task of per-pixel classification during
optimization. For semantic representation learning, we propose global-dense
semantic contrast (GDSC), in which a vector is extracted for global semantic
representation, then used to perform element-wise contrast with the dense grid
features. To learn instance-aware representation, we propose to combine
top-down modeling (TDM) with the bottom-up framework to provide implicit
instance-level clues for the encoder. With the proposed GDSC and TDM, the
encoder learns stronger representations without introducing any extra
parameters or computation during inference. Equipped with a very light
decoder, the detector can achieve more robust real-time scene text detection.
Experimental results on four public datasets show that the proposed method
outperforms or is comparable to the state-of-the-art in both accuracy and
speed. Specifically, it achieves an 87.2% F-measure at 48.2 FPS on Total-Text
and an 89.6% F-measure at 36.9 FPS on MSRA-TD500 on a single GeForce RTX 2080
Ti GPU.
Comment: Accepted by ACM MM 202
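The GDSC auxiliary task described above — extracting a global semantic vector
and contrasting it element-wise against dense grid features — can be sketched
as a simple training-time loss. This is an assumed form for illustration, not
the authors' formulation: the masked-average pooling, cosine similarity, and
loss weighting here are all guesses at one plausible instantiation.

```python
import numpy as np

def gdsc_loss(feats, text_mask, eps=1e-8):
    """Toy sketch of a global-dense semantic contrast loss (assumed form).

    feats: (H, W, D) dense encoder features.
    text_mask: (H, W) binary ground-truth text mask.
    Pools a single global text vector, then contrasts it with every
    grid feature: text pixels are pulled toward the global vector,
    background pixels are pushed away.
    """
    # Global semantic vector: masked average over text pixels.
    g = (feats * text_mask[..., None]).sum((0, 1)) / (text_mask.sum() + eps)
    # Cosine similarity between the global vector and each grid feature.
    fn = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)
    gn = g / (np.linalg.norm(g) + eps)
    sim = (fn * gn).sum(-1)  # (H, W), values in [-1, 1]
    # Element-wise contrast: attract text pixels, repel background.
    pos = ((1 - sim) * text_mask).sum() / (text_mask.sum() + eps)
    neg = ((1 + sim) * (1 - text_mask)).sum() / ((1 - text_mask).sum() + eps)
    return pos + neg
```

Because the loss operates only on encoder features during training, it adds no
parameters or computation at inference, consistent with the claim above.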