31 research outputs found
AON: Towards Arbitrarily-Oriented Text Recognition
Recognizing text from natural images is a hot research topic in computer
vision due to its various applications. Despite the enduring research of
several decades on optical character recognition (OCR), recognizing texts from
natural images is still a challenging task. This is because scene texts are
often in irregular (e.g. curved, arbitrarily-oriented or seriously distorted)
arrangements, which have not yet been well addressed in the literature.
Existing methods on text recognition mainly work with regular (horizontal and
frontal) texts and cannot be trivially generalized to handle irregular texts.
In this paper, we develop the arbitrary orientation network (AON) to directly
capture the deep features of irregular texts, which are combined into an
attention-based decoder to generate character sequence. The whole network can
be trained end-to-end by using only images and word-level annotations.
Extensive experiments on various benchmarks, including the CUTE80,
SVT-Perspective, IIIT5k, SVT and ICDAR datasets, show that the proposed
AON-based method achieves the-state-of-the-art performance in irregular
datasets, and is comparable to major existing methods in regular datasets.Comment: Accepted by CVPR201
Dynamic Low-Resolution Distillation for Cost-Efficient End-to-End Text Spotting
End-to-end text spotting has attached great attention recently due to its
benefits on global optimization and high maintainability for real applications.
However, the input scale has always been a tough trade-off since recognizing a
small text instance usually requires enlarging the whole image, which brings
high computational costs. In this paper, to address this problem, we propose a
novel cost-efficient Dynamic Low-resolution Distillation (DLD) text spotting
framework, which aims to infer images in different small but recognizable
resolutions and achieve a better balance between accuracy and efficiency.
Concretely, we adopt a resolution selector to dynamically decide the input
resolutions for different images, which is constraint by both inference
accuracy and computational cost. Another sequential knowledge distillation
strategy is conducted on the text recognition branch, making the low-res input
obtains comparable performance to a high-res image. The proposed method can be
optimized end-to-end and adopted in any current text spotting framework to
improve the practicability. Extensive experiments on several text spotting
benchmarks show that the proposed method vastly improves the usability of
low-res models. The code is available at
https://github.com/hikopensource/DAVAR-Lab-OCR/.Comment: Accept by ECCV202
E2-AEN: End-to-End Incremental Learning with Adaptively Expandable Network
Expandable networks have demonstrated their advantages in dealing with
catastrophic forgetting problem in incremental learning. Considering that
different tasks may need different structures, recent methods design dynamic
structures adapted to different tasks via sophisticated skills. Their routine
is to search expandable structures first and then train on the new tasks,
which, however, breaks tasks into multiple training stages, leading to
suboptimal or overmuch computational cost. In this paper, we propose an
end-to-end trainable adaptively expandable network named E2-AEN, which
dynamically generates lightweight structures for new tasks without any accuracy
drop in previous tasks. Specifically, the network contains a serial of powerful
feature adapters for augmenting the previously learned representations to new
tasks, and avoiding task interference. These adapters are controlled via an
adaptive gate-based pruning strategy which decides whether the expanded
structures can be pruned, making the network structure dynamically changeable
according to the complexity of the new tasks. Moreover, we introduce a novel
sparsity-activation regularization to encourage the model to learn
discriminative features with limited parameters. E2-AEN reduces cost and can be
built upon any feed-forward architectures in an end-to-end manner. Extensive
experiments on both classification (i.e., CIFAR and VDD) and detection (i.e.,
COCO, VOC and ICCV2021 SSLAD challenge) benchmarks demonstrate the
effectiveness of the proposed method, which achieves the new remarkable
results
MANGO: A Mask Attention Guided One-Stage Scene Text Spotter
Recently end-to-end scene text spotting has become a popular research topic
due to its advantages of global optimization and high maintainability in real
applications. Most methods attempt to develop various region of interest (RoI)
operations to concatenate the detection part and the sequence recognition part
into a two-stage text spotting framework. However, in such framework, the
recognition part is highly sensitive to the detected results (e.g.), the
compactness of text contours). To address this problem, in this paper, we
propose a novel Mask AttentioN Guided One-stage text spotting framework named
MANGO, in which character sequences can be directly recognized without RoI
operation. Concretely, a position-aware mask attention module is developed to
generate attention weights on each text instance and its characters. It allows
different text instances in an image to be allocated on different feature map
channels which are further grouped as a batch of instance features. Finally, a
lightweight sequence decoder is applied to generate the character sequences. It
is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting
and can be trained end-to-end with only coarse position information (e.g.),
rectangular bounding box) and text annotations. Experimental results show that
the proposed method achieves competitive and even new state-of-the-art
performance on both regular and irregular text spotting benchmarks, i.e., ICDAR
2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.Comment: Accepted to AAAI2021. Code is available at
https://davar-lab.github.io/publication.html or
https://github.com/hikopensource/DAVAR-Lab-OC