Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network
Scene text detection, an important step of scene text reading systems, has
witnessed rapid development with convolutional neural networks. Nonetheless,
two main challenges still exist and hamper its deployment in real-world
applications. The first is the trade-off between speed and accuracy. The
second is how to model arbitrary-shaped text instances. Recently, some
methods have been proposed to tackle arbitrary-shaped text detection, but they
rarely take the speed of the entire pipeline into consideration, which may fall
short in practical applications. In this paper, we propose an efficient and
accurate arbitrary-shaped text detector, termed Pixel Aggregation Network
(PAN), which is equipped with a segmentation head of low computational cost
and a learnable post-processing step. More specifically, the segmentation head is made up
of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM).
FPEM is a cascadable U-shaped module that introduces multi-level
information to guide better segmentation. FFM gathers the features given
by the FPEMs of different depths into a final feature for segmentation. The
learnable post-processing is implemented by Pixel Aggregation (PA), which can
precisely aggregate text pixels by predicted similarity vectors. Experiments on
several standard benchmarks validate the superiority of the proposed PAN. It is
worth noting that our method can achieve a competitive F-measure of 79.9% at
84.2 FPS on CTW1500. Comment: Accepted by ICCV 2019
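A minimal sketch of the Pixel Aggregation idea described above, assuming the network has already produced a text-region mask, connected-component labels for the shrunk text kernels, and a per-pixel similarity embedding; the tensor layout and the distance threshold are illustrative assumptions, not the authors' exact implementation:

```python
import torch

def aggregate_pixels(text_mask, kernel_labels, embeddings, dist_thresh=3.0):
    """Assign each text pixel to the kernel with the nearest mean embedding.

    text_mask:     (H, W) bool  -- predicted text region
    kernel_labels: (H, W) long  -- connected-component ids of kernels (0 = background)
    embeddings:    (D, H, W)    -- predicted per-pixel similarity vectors
    Returns an (H, W) long map of text-instance ids.
    """
    D, H, W = embeddings.shape
    emb = embeddings.reshape(D, -1)                  # (D, H*W)
    labels = kernel_labels.reshape(-1)
    result = torch.zeros(H * W, dtype=torch.long)

    kernel_ids = [k for k in labels.unique().tolist() if k != 0]
    if not kernel_ids:
        return result.reshape(H, W)
    # Mean similarity vector of each kernel instance.
    means = torch.stack([emb[:, labels == k].mean(dim=1) for k in kernel_ids])

    # Candidate pixels: inside the text region but outside every kernel.
    cand = text_mask.reshape(-1) & (labels == 0)
    cand_idx = cand.nonzero(as_tuple=True)[0]
    dists = torch.cdist(emb[:, cand].t(), means)     # (#cand, #kernels)
    min_d, nearest = dists.min(dim=1)

    result[labels > 0] = labels[labels > 0]          # kernel pixels keep their id
    keep = min_d < dist_thresh                       # aggregate only similar pixels
    result[cand_idx[keep]] = torch.tensor(kernel_ids)[nearest[keep]]
    return result.reshape(H, W)
```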
PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network
The reading of arbitrarily-shaped text has received increasing research
attention. However, existing text spotters are mostly built on two-stage
frameworks or character-based methods, which suffer from either Non-Maximum
Suppression (NMS), Region-of-Interest (RoI) operations, or character-level
annotations. In this paper, to address the above problems, we propose a novel
fully convolutional Point Gathering Network (PGNet) for reading
arbitrarily-shaped text in real time. PGNet is a single-shot text spotter,
where the pixel-level character classification map is learned with the proposed
PG-CTC loss, avoiding the use of character-level annotations. With a PG-CTC
decoder, we gather high-level character classification vectors from
two-dimensional space and decode them into text symbols without NMS and RoI
operations involved, which guarantees high efficiency. Additionally, a graph
refinement module (GRM) is proposed to reason about the relations between each
character and its neighbors, optimizing the coarse recognition and improving
the end-to-end performance. Experiments show that the proposed method achieves
competitive accuracy while significantly improving the running speed. In
particular, on Total-Text it runs at 46.7 FPS, surpassing previous
spotters by a large margin. Comment: 10 pages, 8 figures, AAAI 2021
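As a rough sketch of the point-gathering and CTC decoding described above: character classification vectors are sampled from the 2D map along an ordered center-line point sequence, then collapsed greedily as in standard CTC decoding. The shapes and the blank index are assumptions for illustration, not the paper's exact interface:

```python
import torch

def pg_ctc_decode(char_map, center_points, blank=0):
    """char_map:      (C, H, W) per-pixel character class scores
    center_points:    (N, 2) integer (y, x) points ordered along the text line
    Returns the decoded class-id sequence."""
    ys, xs = center_points[:, 0], center_points[:, 1]
    gathered = char_map[:, ys, xs]          # (C, N): gather vectors from 2D space
    ids = gathered.argmax(dim=0)            # greedy per-point classification
    # CTC collapse: merge consecutive duplicates, then drop blanks.
    decoded, prev = [], None
    for i in ids.tolist():
        if i != prev and i != blank:
            decoded.append(i)
        prev = i
    return decoded
```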
Aggregated Text Transformer for Scene Text Detection
This paper explores the multi-scale aggregation strategy for scene text
detection in natural images. We present the Aggregated Text TRansformer (ATTR),
which is designed to represent texts in scene images with a multi-scale
self-attention mechanism. Starting from the image pyramid with multiple
resolutions, the features are first extracted at different scales with shared
weights and then fed into a Transformer encoder-decoder architecture. The
multi-scale image representations are robust and contain rich information on
text contents of various sizes. The text Transformer aggregates these features
to learn the interaction across different scales and improve text
representation. The proposed method detects scene texts by representing each
text instance as an individual binary mask, which is tolerant of curved texts
and regions with dense instances. Extensive experiments on public scene text
detection datasets demonstrate the effectiveness of the proposed framework.
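A schematic sketch of the multi-scale aggregation strategy, assuming a stand-in convolutional backbone shared across an image pyramid and a small Transformer encoder attending over the concatenated tokens of all scales; the scales, dimensions, and layer counts are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAggregator(nn.Module):
    def __init__(self, dim=256, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.backbone = nn.Conv2d(3, dim, 3, stride=4, padding=1)  # stand-in backbone
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img):                       # img: (B, 3, H, W)
        tokens = []
        for s in self.scales:                     # shared weights across all scales
            x = F.interpolate(img, scale_factor=s, mode='bilinear',
                              align_corners=False)
            f = self.backbone(x)                  # (B, dim, h, w)
            tokens.append(f.flatten(2).transpose(1, 2))   # (B, h*w, dim)
        tokens = torch.cat(tokens, dim=1)         # concatenate tokens of all scales
        return self.encoder(tokens)               # cross-scale self-attention
```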
PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer
We present PBFormer, an efficient yet powerful scene text detector that
unifies the transformer with a novel text shape representation Polynomial Band
(PB). The representation has four polynomial curves to fit a text's top,
bottom, left, and right sides, which can capture a text with a complex shape by
varying polynomial coefficients. PB has appealing features compared with
conventional representations: 1) It can model different curvatures with a fixed
number of parameters, while polygon-points-based methods need to utilize a
different number of points. 2) It can distinguish adjacent or overlapping texts
as they have apparent different curve coefficients, while segmentation-based or
points-based methods suffer from adhesive spatial positions. PBFormer combines
the PB with the transformer, which can directly generate smooth text contours
sampled from predicted curves without interpolation. A parameter-free
cross-scale pixel attention (CPA) module is employed to highlight the feature
map of a suitable scale while suppressing the other feature maps. The simple
operation can help detect small-scale texts and is compatible with the
one-stage DETR framework, which requires no post-processing such as NMS. Furthermore,
PBFormer is trained with a shape-contained loss, which not only enforces the
piecewise alignment between the ground truth and the predicted curves but also
makes curves' positions and shapes consistent with each other. Without bells
and whistles such as text pre-training, our method outperforms the previous
state-of-the-art text detectors on arbitrary-shaped text datasets. Comment: 9 pages, 8 figures, accepted by ACM MM 2023
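To make the Polynomial Band representation concrete, here is a minimal sketch of sampling a smooth contour from four polynomial curves (top and bottom as y = f(x), left and right as x = g(y)); the coefficient convention and sampling density are assumptions, not the paper's exact parameterization:

```python
import numpy as np

def sample_polynomial_band(top, bottom, left, right, x_range, y_range, n=20):
    """top/bottom: coefficient vectors for y = poly(x) (highest degree first,
    as in numpy.polyval); left/right: coefficient vectors for x = poly(y).
    Returns a (4n, 2) array of contour points ordered clockwise."""
    xs = np.linspace(*x_range, n)
    ys = np.linspace(*y_range, n)
    top_pts    = np.stack([xs, np.polyval(top, xs)], axis=1)
    right_pts  = np.stack([np.polyval(right, ys), ys], axis=1)
    bottom_pts = np.stack([xs[::-1], np.polyval(bottom, xs[::-1])], axis=1)
    left_pts   = np.stack([np.polyval(left, ys[::-1]), ys[::-1]], axis=1)
    return np.concatenate([top_pts, right_pts, bottom_pts, left_pts])
```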
MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild
Detecting small scene text instances in the wild is particularly challenging,
as irregular positions and non-ideal lighting often lead to
detection errors. We present MixNet, a hybrid architecture that combines the
strengths of CNNs and Transformers, capable of accurately detecting small text
from challenging natural scenes, regardless of the orientations, styles, and
lighting conditions. MixNet incorporates two key modules: (1) the Feature
Shuffle Network (FSNet) to serve as the backbone and (2) the Central
Transformer Block (CTBlock) to exploit the 1D manifold constraint of the scene
text. We first introduce a novel feature shuffling strategy in FSNet to
facilitate the exchange of features across multiple scales, generating
high-resolution features superior to those of the popular ResNet and HRNet backbones. The FSNet
backbone has achieved significant improvements over many existing text
detection methods, including PAN, DB, and FAST. Then we design a complementary
CTBlock to leverage center line based features similar to the medial axis of
text regions and show that it can outperform contour-based approaches in
challenging cases where small scene texts appear close together. Extensive experimental
results show that MixNet, which mixes FSNet with CTBlock, achieves
state-of-the-art results on multiple scene text detection datasets.
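One plausible reading of a cross-scale feature shuffle, sketched below: split each pyramid level's channels into groups, resize one group from every level to each target resolution, and re-concatenate, so every level mixes information from all scales. This is an illustrative interpretation, not FSNet's exact design:

```python
import torch
import torch.nn.functional as F

def shuffle_features(feats):
    """feats: list of (B, C, Hi, Wi) tensors, one per pyramid level.
    C must be divisible by the number of levels."""
    L = len(feats)
    groups = [f.chunk(L, dim=1) for f in feats]     # L channel groups per level
    out = []
    for i, f in enumerate(feats):
        h, w = f.shape[-2:]
        # Take the i-th channel group from every level, resized to this level.
        mixed = [F.interpolate(groups[j][i], size=(h, w), mode='bilinear',
                               align_corners=False) for j in range(L)]
        out.append(torch.cat(mixed, dim=1))         # back to C channels
    return out
```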
SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation
Existing techniques for text detection can be broadly classified into two
primary groups: segmentation-based methods and regression-based methods.
Segmentation models offer enhanced robustness to font variations but require
intricate post-processing, leading to high computational overhead.
Regression-based methods undertake instance-aware prediction but face
limitations in robustness and data efficiency due to their reliance on
high-level representations. In this work, we propose SRFormer, a
unified DETR-based model that amalgamates Segmentation and Regression, aiming
to harness both the inherent robustness of segmentation
representations and the straightforward post-processing of
instance-level regression. Our empirical analysis indicates that favorable
segmentation predictions can be obtained at the initial decoder layers. In
light of this, we constrain the incorporation of segmentation branches to the
first few decoder layers and employ progressive regression refinement in
subsequent layers, achieving performance gains while minimizing additional
computational load from the mask. Furthermore, we propose a Mask-informed Query
Enhancement module. We take the segmentation result as a natural soft-ROI to
pool and extract robust pixel representations, which are then employed to
enhance and diversify instance queries. Extensive experiments across
multiple benchmarks demonstrate our method's exceptional robustness,
superior training and data efficiency, and state-of-the-art performance.
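A hedged sketch of how a segmentation prediction could serve as a soft RoI for query enhancement, as described above: the sigmoid mask weights the pixel features, which are average-pooled into one vector per instance and added to the corresponding decoder query. Tensor names and the residual combination are illustrative assumptions, not the authors' exact module:

```python
import torch

def mask_informed_query_enhancement(queries, pixel_feats, mask_logits):
    """queries:     (B, Q, D)    decoder instance queries
    pixel_feats:    (B, D, H, W) encoder feature map
    mask_logits:    (B, Q, H, W) per-query segmentation predictions"""
    soft_masks = mask_logits.sigmoid()                       # soft RoI weights
    weights = soft_masks.flatten(2)                          # (B, Q, H*W)
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    feats = pixel_feats.flatten(2).transpose(1, 2)           # (B, H*W, D)
    pooled = torch.bmm(weights, feats)                       # (B, Q, D) mask-pooled
    return queries + pooled                                  # enhanced queries
```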
CT-Net: Arbitrary-Shaped Text Detection via Contour Transformer
Contour-based scene text detection methods have advanced rapidly in recent years,
but still suffer from inaccurate frontend contour initialization, multi-stage
error accumulation, or deficient local information aggregation. To tackle these
limitations, we propose a novel arbitrary-shaped scene text detection framework
named CT-Net by progressive contour regression with contour transformers.
Specifically, we first employ a contour initialization module that generates
coarse text contours without any post-processing. Then, we adopt contour
refinement modules to adaptively refine text contours in an iterative manner,
which is beneficial for capturing context information and progressively
deforming global contours. In addition, we propose an adaptive training strategy to
enable the contour transformers to learn more potential deformation paths, and
introduce a re-score mechanism that can effectively suppress false positives.
Extensive experiments are conducted on four challenging datasets, which
demonstrate the accuracy and efficiency of our CT-Net over state-of-the-art
methods. Particularly, CT-Net achieves an F-measure of 86.1 at 11.2 frames per
second (FPS) and an F-measure of 87.8 at 10.1 FPS on the CTW1500 and Total-Text
datasets, respectively. Comment: This paper has been accepted by IEEE Transactions on Circuits and
Systems for Video Technology
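A schematic sketch of progressive contour regression: starting from a coarse contour, each iteration samples features at the current vertex positions and adds predicted offsets. The linear offset head is a stand-in for the paper's contour transformers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContourRefiner(nn.Module):
    def __init__(self, feat_dim=256, iters=3):
        super().__init__()
        self.iters = iters
        self.offset_head = nn.Linear(feat_dim, 2)  # stand-in for a contour transformer

    def forward(self, feats, contour):
        """feats: (B, C, H, W); contour: (B, N, 2) vertices in [-1, 1] grid coords."""
        for _ in range(self.iters):
            # Sample features at the current contour vertices.
            sampled = F.grid_sample(feats, contour.unsqueeze(2),
                                    align_corners=False)      # (B, C, N, 1)
            sampled = sampled.squeeze(-1).transpose(1, 2)     # (B, N, C)
            contour = contour + self.offset_head(sampled)     # progressive deformation
        return contour
```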
Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning
Due to their flexible representation of arbitrary-shaped scene text and their
simple pipeline, bottom-up segmentation-based methods have become mainstream in
real-time scene text detection. Despite great progress, these methods show
deficiencies in robustness and still suffer from false positives and instance
adhesion. Different from existing methods which integrate multiple-granularity
features or multiple outputs, we resort to the perspective of representation
learning in which auxiliary tasks are utilized to enable the encoder to jointly
learn robust features with the main task of per-pixel classification during
optimization. For semantic representation learning, we propose global-dense
semantic contrast (GDSC), in which a vector is extracted for global semantic
representation and then used to perform element-wise contrast with the dense grid
features. To learn instance-aware representation, we propose to combine
top-down modeling (TDM) with the bottom-up framework to provide implicit
instance-level clues for the encoder. With the proposed GDSC and TDM, the
encoder network learns a stronger representation without introducing any extra
parameters or computation during inference. Equipped with a very light
decoder, the detector can achieve more robust real-time scene text detection.
Experimental results on four public datasets show that the proposed method can
outperform or be comparable to the state-of-the-art on both accuracy and speed.
Specifically, the proposed method achieves an F-measure of 87.2% at 48.2 FPS on
Total-Text and an F-measure of 89.6% at 36.9 FPS on MSRA-TD500 on a single GeForce
RTX 2080 Ti GPU. Comment: Accepted by ACM MM 2023
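A hedged sketch of the global-dense semantic contrast idea: pool a global text vector from the feature map, then pull dense features at text pixels toward it and push background features away via cosine similarity. The masked-average pooling and margin are illustrative choices, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def gdsc_loss(feats, text_mask, margin=0.2):
    """feats: (B, D, H, W) encoder features; text_mask: (B, H, W) bool GT text region."""
    B, D, H, W = feats.shape
    flat = F.normalize(feats.flatten(2), dim=1)            # (B, D, H*W), unit norm
    mask = text_mask.flatten(1).float()                    # (B, H*W)
    # Global vector: masked average over text pixels.
    global_vec = (flat * mask.unsqueeze(1)).sum(-1) / mask.sum(-1, keepdim=True).clamp(min=1.0)
    global_vec = F.normalize(global_vec, dim=1)            # (B, D)
    sim = torch.einsum('bd,bdn->bn', global_vec, flat)     # dense cosine similarities
    pos = (1.0 - sim) * mask                               # pull text pixels toward global
    neg = F.relu(sim - margin) * (1.0 - mask)              # push background pixels away
    return (pos.sum() + neg.sum()) / (B * H * W)
```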