74 research outputs found
SPTS v2: Single-Point Scene Text Spotting
End-to-end scene text spotting has made significant progress due to the
intrinsic synergy between text detection and recognition. Previous methods
commonly regard manual annotations such as horizontal rectangles, rotated
rectangles, quadrangles, and polygons as a prerequisite, all of which are much
more expensive than a single point. For the first time, we demonstrate that
scene text spotting models can be trained with extremely low-cost single-point
annotations using the proposed framework, termed SPTS v2. SPTS v2 retains the
advantage of the auto-regressive Transformer through an Instance Assignment
Decoder (IAD) that sequentially predicts the center points of all text
instances within a single sequence, while a Parallel Recognition Decoder (PRD)
recognizes the text in parallel. The two decoders share the same parameters
and are interactively connected by a simple but effective
information-transmission process that passes gradients and information between
them. Comprehensive experiments on various existing benchmark datasets
demonstrate that SPTS v2 outperforms previous state-of-the-art single-point
text spotters with fewer parameters while achieving 19× faster inference
speed. Most importantly, within the scope of SPTS v2, extensive experiments
further reveal an important phenomenon: the single point serves as the optimal
setting for scene text spotting compared with no-point, rectangular bounding
box, and polygonal bounding box annotations. Such an attempt opens significant
opportunities for scene text spotting applications beyond existing paradigms.
Code will be available at https://github.com/bytedance/SPTSv2.
Comment: arXiv admin note: text overlap with arXiv:2112.0791
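Since the abstract leaves the two-decoder interaction at a high level, here is
a minimal PyTorch sketch of the shared IAD/PRD idea, assuming illustrative
dimensions, vocabulary size, and an additive stand-in for the
information-transmission step; names such as SPTSv2Sketch are hypothetical and
this is not the authors' implementation.

import torch
import torch.nn as nn

class SPTSv2Sketch(nn.Module):
    def __init__(self, vocab_size=1100, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # One decoder stack serves as both the IAD and the PRD, since the
        # abstract states that the two decoders share the same parameters.
        self.shared_decoder = nn.TransformerDecoder(layer, num_layers)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, memory, point_tokens, char_queries):
        # IAD: auto-regressive decoding of one sequence holding the
        # center-point tokens of all text instances in the image.
        seq_len = point_tokens.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        point_feats = self.shared_decoder(
            self.token_embed(point_tokens), memory, tgt_mask=causal)
        # PRD: the parameter-shared decoder recognizes characters of all
        # instances in parallel (no causal mask); point features are injected
        # additively as a stand-in for the information-transmission process.
        char_feats = self.shared_decoder(
            char_queries + point_feats.mean(dim=1, keepdim=True), memory)
        return self.head(point_feats), self.head(char_feats)

# Toy usage: flattened 8x8 image features, 4 point tokens, 25 char queries.
model = SPTSv2Sketch()
point_logits, char_logits = model(
    torch.randn(1, 64, 256), torch.randint(0, 1100, (1, 4)),
    torch.randn(1, 25, 256))

Reusing one decoder stack for both passes mirrors the stated parameter
sharing: the causal mask applies only to the point sequence, while the
recognition pass omits it and therefore runs in parallel.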
DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting
End-to-end text spotting aims to integrate scene text detection and
recognition into a unified framework. Handling the relationship between the
two sub-tasks plays a pivotal role in designing effective spotters. Although
Transformer-based methods eliminate heuristic post-processing, they still
suffer from the synergy issue between the sub-tasks and from low training
efficiency. In this paper, we present DeepSolo, a simple detection-transformer
baseline that lets a single Decoder with Explicit Points Solo for text
detection and recognition simultaneously. Technically, for each text instance
we represent the character sequence as ordered points and model them with
learnable explicit point queries. After passing through a single decoder, the
point queries have encoded the requisite text semantics and locations, and can
thus be further decoded into the center line, boundary, script, and confidence
of the text by very simple parallel prediction heads, solving the sub-tasks of
text spotting in a unified framework. Besides, we introduce a text-matching
criterion that delivers more accurate supervisory signals, enabling more
efficient training. Quantitative experiments on public benchmarks demonstrate
that DeepSolo outperforms previous state-of-the-art methods and achieves
better training efficiency. In addition, DeepSolo is compatible with line
annotations, which require far less annotation cost than polygons. The code
will be released.
Comment: The code will be available at
https://github.com/ViTAE-Transformer/DeepSol
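To make the single-decoder design concrete, the following is a hedged PyTorch
sketch of explicit point queries read out by parallel heads. The query count,
head shapes, and charset size are assumptions, the "script" output is
approximated as a per-point classification head, and DeepSoloSketch is a
hypothetical name rather than the released code.

import torch
import torch.nn as nn

class DeepSoloSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2,
                 num_points=25, charset_size=97):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Ordered, learnable explicit point queries for one text instance.
        self.point_queries = nn.Parameter(torch.randn(num_points, d_model))
        # Very simple prediction heads applied in parallel after decoding.
        self.center_head = nn.Linear(d_model, 2)             # (x, y) point
        self.boundary_head = nn.Linear(d_model, 4)           # boundary offsets
        self.script_head = nn.Linear(d_model, charset_size)  # character class
        self.conf_head = nn.Linear(d_model, 1)               # confidence

    def forward(self, memory):  # memory: (B, H*W, d_model) encoder features
        b = memory.size(0)
        queries = self.point_queries.unsqueeze(0).expand(b, -1, -1)
        feats = self.decoder(queries, memory)  # a single decoder pass
        return (self.center_head(feats), self.boundary_head(feats),
                self.script_head(feats), self.conf_head(feats))

# Toy usage with random encoder features.
centers, bounds, chars, conf = DeepSoloSketch()(torch.randn(2, 64, 256))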
MANGO: A Mask Attention Guided One-Stage Scene Text Spotter
Recently, end-to-end scene text spotting has become a popular research topic
due to its advantages of global optimization and high maintainability in real
applications. Most methods develop various region-of-interest (RoI) operations
to concatenate a detection part and a sequence-recognition part into a
two-stage text spotting framework. In such a framework, however, the
recognition part is highly sensitive to the detected results (e.g., the
compactness of text contours). To address this problem, we propose a novel
Mask AttentioN Guided One-stage text spotting framework named MANGO, in which
character sequences can be recognized directly, without any RoI operation.
Concretely, a position-aware mask attention module is developed to generate
attention weights for each text instance and its characters. It allows
different text instances in an image to be allocated to different feature-map
channels, which are further grouped into a batch of instance features.
Finally, a lightweight sequence decoder is applied to generate the character
sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped
text spotting and can be trained end-to-end with only coarse position
information (e.g., rectangular bounding boxes) and text annotations.
Experimental results show that the proposed method achieves competitive and
even new state-of-the-art performance on both regular and irregular text
spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and
SCUT-CTW1500.
Comment: Accepted to AAAI 2021. Code is available at
https://davar-lab.github.io/publication.html or
https://github.com/hikopensource/DAVAR-Lab-OC
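As a rough illustration of recognition without RoI operations, the sketch
below pools features through per-slot spatial attention maps, in the spirit of
the position-aware mask attention described above. All shapes, the 1x1-conv
attention predictor, and the MaskAttentionSketch name are assumptions for
illustration, not the published model.

import torch
import torch.nn as nn

class MaskAttentionSketch(nn.Module):
    def __init__(self, in_channels=256, max_instances=8, max_chars=25,
                 charset_size=97):
        super().__init__()
        self.max_instances = max_instances
        self.max_chars = max_chars
        # Predict one attention map per (instance, character) slot, so each
        # text instance is allocated its own group of output channels.
        self.attn_conv = nn.Conv2d(in_channels, max_instances * max_chars, 1)
        self.classifier = nn.Linear(in_channels, charset_size)

    def forward(self, feats):                # feats: (B, C, H, W)
        b, c, _, _ = feats.shape
        attn = self.attn_conv(feats)         # (B, N*T, H, W)
        attn = attn.flatten(2).softmax(-1)   # spatial softmax per slot
        # Weighted pooling: one feature vector per slot, with no RoI step.
        pooled = torch.einsum("bsl,bcl->bsc", attn, feats.flatten(2))
        pooled = pooled.view(b, self.max_instances, self.max_chars, c)
        return self.classifier(pooled)       # (B, N, T, charset) logits

# Toy usage: per-slot character logits from a random feature map.
logits = MaskAttentionSketch()(torch.randn(2, 256, 32, 32))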