SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation
Existing techniques for text detection can be broadly classified into two
primary groups: segmentation-based methods and regression-based methods.
Segmentation models offer enhanced robustness to font variations but require
intricate post-processing, leading to high computational overhead.
Regression-based methods undertake instance-aware prediction but face
limitations in robustness and data efficiency due to their reliance on
high-level representations. In this work, we propose SRFormer, a unified
DETR-based model that amalgamates segmentation and regression, aiming to
synergistically harness the inherent robustness of segmentation
representations together with the straightforward post-processing of
instance-level regression. Our empirical analysis indicates that favorable
segmentation predictions can be obtained at the initial decoder layers. In
light of this, we constrain the incorporation of segmentation branches to the
first few decoder layers and employ progressive regression refinement in
subsequent layers, achieving performance gains while minimizing additional
computational load from the mask. Furthermore, we propose a Mask-informed Query
Enhancement module. We take the segmentation result as a natural soft-ROI to
pool and extract robust pixel representations, which are then employed to
enhance and diversify instance queries. Extensive experimentation across
multiple benchmarks has yielded compelling findings, highlighting our method's
exceptional robustness, superior training and data efficiency, as well as its
state-of-the-art performance.
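The Mask-informed Query Enhancement described above can be sketched as mask-weighted pooling: the soft segmentation mask acts as a per-query attention map over pixel features, and the pooled representation is added back to the instance query. The following is a minimal numpy sketch under assumed shapes; the function name and additive fusion are illustrative, not the authors' code.

```python
import numpy as np

def mask_informed_query_enhancement(pixel_feats, masks, queries):
    """Illustrative sketch: use soft segmentation masks as soft ROIs to
    pool pixel features and enhance instance queries.

    pixel_feats: (H*W, C) flattened image feature map
    masks:       (Q, H*W) per-query segmentation logits
    queries:     (Q, C)   instance queries
    """
    # Normalize each mask over pixels (softmax) so it acts as a soft ROI.
    m = masks - masks.max(axis=1, keepdims=True)
    weights = np.exp(m) / np.exp(m).sum(axis=1, keepdims=True)
    pooled = weights @ pixel_feats   # (Q, C): mask-weighted pixel pooling
    return queries + pooled          # enhance queries with pooled features

# Toy usage with a 4x4 feature map, 8 channels, and 3 instance queries.
H, W, C, Q = 4, 4, 8, 3
rng = np.random.default_rng(0)
feats = rng.standard_normal((H * W, C))
masks = rng.standard_normal((Q, H * W))
queries = rng.standard_normal((Q, C))
out = mask_informed_query_enhancement(feats, masks, queries)
print(out.shape)  # (3, 8)
```

The additive fusion here is one plausible choice; the actual module may combine pooled features with queries differently (e.g. via an MLP or gating).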
On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention
Scene text recognition (STR) is the task of recognizing character sequences
in natural scenes. While there have been great advances in STR methods, current
methods still fail to recognize texts in arbitrary shapes, such as heavily
curved or rotated texts, which are abundant in daily life (e.g. restaurant
signs, product labels, company logos, etc.). This paper introduces a novel
architecture for recognizing texts of arbitrary shapes, named the
Self-Attention Text Recognition Network (SATRN), which is inspired by the
Transformer. SATRN
utilizes the self-attention mechanism to describe two-dimensional (2D) spatial
dependencies of characters in a scene text image. Exploiting the full-graph
propagation of self-attention, SATRN can recognize texts with arbitrary
arrangements and large inter-character spacing. As a result, SATRN outperforms
existing STR models by a large margin of 5.7 pp on average in "irregular text"
benchmarks. We provide empirical analyses that illustrate the inner mechanisms
and the extent to which the model is applicable (e.g. rotated and multi-line
text). We will open-source the code.
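The core idea of describing 2D spatial dependencies with self-attention can be illustrated by flattening the feature map into tokens, letting every spatial position attend to every other, and reshaping back. This is a simplified single-head numpy sketch with random projections standing in for learned weights; it shows the full-graph propagation, not SATRN's exact architecture.

```python
import numpy as np

def self_attention_2d(feat_map, seed=0):
    """Toy single-head self-attention over all positions of a 2D feature
    map: every location attends to every other, so information can
    propagate across arbitrary arrangements (curved, rotated, multi-line).
    """
    H, W, C = feat_map.shape
    x = feat_map.reshape(H * W, C)           # flatten spatial grid to tokens
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(C)            # (H*W, H*W) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over all positions
    out = attn @ v                           # aggregate over the full graph
    return out.reshape(H, W, C)              # restore 2D spatial layout

# Toy usage on a 5x7 feature map with 16 channels.
out = self_attention_2d(np.random.default_rng(1).standard_normal((5, 7, 16)))
print(out.shape)  # (5, 7, 16)
```

A real implementation would add 2D positional encodings so the model knows where each token sits in the grid, which is essential for the spatial reasoning the abstract describes.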
- …