147 research outputs found
TextAdaIN: Paying Attention to Shortcut Learning in Text Recognizers
Leveraging the characteristics of convolutional layers, neural networks are
extremely effective for pattern recognition tasks. However, in some cases, their
decisions are based on unintended information, leading to high performance on
standard benchmarks but also to a lack of generalization to challenging testing
conditions and unintuitive failures. Recent work has termed this "shortcut
learning" and addressed its presence in multiple domains. In text recognition,
we reveal another such shortcut, whereby recognizers overly depend on local
image statistics. Motivated by this, we suggest an approach to regulate the
reliance on local statistics that improves text recognition performance.
Our method, termed TextAdaIN, creates local distortions in the feature map
which prevent the network from overfitting to local statistics. It does so by
viewing each feature map as a sequence of elements and deliberately mismatching
fine-grained feature statistics between elements in a mini-batch. Despite
TextAdaIN's simplicity, extensive experiments show its effectiveness compared
to other, more complicated methods. TextAdaIN achieves state-of-the-art results
on standard handwritten text recognition benchmarks. It generalizes to multiple
architectures and to the domain of scene text recognition. Furthermore, we
demonstrate that integrating TextAdaIN improves robustness towards more
challenging testing conditions. The official Pytorch implementation can be
found at https://github.com/amazon-research/textadain-robust-recognition.
Comment: 12 pages, 8 figures, Accepted to ECCV 202
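The mechanism described above (viewing a feature map as a sequence of windows and deliberately mismatching their fine-grained statistics across a mini-batch) can be illustrated with a minimal pure-Python sketch. This is not the authors' PyTorch implementation: it operates on 1-D lists rather than CNN feature maps, and the function names, fixed partner choice, and window count are assumptions made for illustration.

```python
import statistics

EPS = 1e-6  # numerical guard against division by a zero std


def adain_swap(content, style):
    """Give `content` the mean/std of `style` (the core AdaIN operation)."""
    c_mean, c_std = statistics.fmean(content), statistics.pstdev(content)
    s_mean, s_std = statistics.fmean(style), statistics.pstdev(style)
    return [(x - c_mean) / (c_std + EPS) * s_std + s_mean for x in content]


def text_adain(batch, num_windows=2):
    """Split each feature sequence into windows and give each window the
    statistics of the matching window from another sample in the batch."""
    out = []
    n = len(batch)
    for i, feats in enumerate(batch):
        partner = batch[(i + 1) % n]  # fixed partner here; the paper mismatches within the mini-batch
        w = len(feats) // num_windows
        mixed = []
        for k in range(num_windows):
            lo, hi = k * w, (k + 1) * w
            mixed.extend(adain_swap(feats[lo:hi], partner[lo:hi]))
        out.append(mixed)
    return out
```

After the swap, each window carries its partner's local statistics while keeping its own normalized shape, which is the kind of local distortion that discourages overfitting to local image statistics.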
Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study
The flourishing blossom of deep learning has witnessed the rapid development
of text recognition in recent years. However, the existing text recognition
methods are mainly proposed for English texts. Chinese is another widely
spoken language, and Chinese text recognition (CTR) has extensive application
markets. Based on our observations, we attribute the scarce attention paid to CTR to
the lack of reasonable dataset construction standards, unified evaluation
protocols, and results of the existing baselines. To fill this gap, we manually
collect CTR datasets from publicly available competitions, projects, and
papers. According to application scenarios, we divide the collected datasets
into four categories including scene, web, document, and handwriting datasets.
Besides, we standardize the evaluation protocols in CTR. With unified
evaluation protocols, we evaluate a series of representative text recognition
methods on the collected datasets to provide baselines. The experimental
results indicate that the performance of baselines on CTR datasets is not as
good as that on English datasets, owing to characteristics of Chinese texts
that differ greatly from those of the Latin alphabet. Moreover, we observe that by
introducing radical-level supervision as an auxiliary task, the performance of
baselines can be further boosted. The code and datasets are made publicly
available at https://github.com/FudanVI/benchmarking-chinese-text-recognition
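The radical-level auxiliary supervision mentioned above can be sketched as a weighted multi-task objective: the recognizer is trained on characters while an auxiliary term scores the corresponding radical sequences. The toy radical table, the 0/1 losses, and the `aux_weight` parameter below are illustrative assumptions, not the benchmark's actual training code.

```python
# Toy radical decomposition table (illustrative; real CTR systems use a
# full dictionary mapping each Chinese character to its radical sequence).
RADICALS = {"好": ["女", "子"], "明": ["日", "月"]}


def char_loss(pred, target):
    # Toy 0/1 mismatch rate at the character level.
    return sum(p != t for p, t in zip(pred, target)) / max(len(target), 1)


def radical_loss(pred, target):
    # Decompose both strings into radical sequences and compare those.
    pr = [r for ch in pred for r in RADICALS.get(ch, [ch])]
    tr = [r for ch in target for r in RADICALS.get(ch, [ch])]
    return sum(p != t for p, t in zip(pr, tr)) / max(len(tr), 1)


def total_loss(pred, target, aux_weight=0.5):
    # Joint objective: character recognition plus radical-level auxiliary task.
    return char_loss(pred, target) + aux_weight * radical_loss(pred, target)
```

The auxiliary term rewards predictions that share radicals with the target even when the character is wrong, which is one plausible reading of why radical-level supervision boosts the baselines.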
Deep Unrestricted Document Image Rectification
In recent years, tremendous efforts have been made on document image
rectification, but existing advanced algorithms are limited to processing
restricted document images, i.e., the input images must incorporate a complete
document. If the captured image covers only a local text region, its
rectification quality degrades and becomes unsatisfactory. Our previously proposed
DocTr, a transformer-assisted network for document image rectification, also
suffers from this limitation. In this work, we present DocTr++, a novel unified
framework for document image rectification, without any restrictions on the
input distorted images. Our major technical improvements can be summarized in
three aspects. Firstly, we upgrade the original architecture by adopting a
hierarchical encoder-decoder structure for multi-scale representation
extraction and parsing. Secondly, we reformulate the pixel-wise mapping
relationship between the unrestricted distorted document images and the
distortion-free counterparts. The obtained data is used to train our DocTr++
for unrestricted document image rectification. Thirdly, we contribute a
real-world test set and metrics applicable for evaluating the rectification
quality. To the best of our knowledge, this is the first learning-based method for the
rectification of unrestricted document images. Extensive experiments are
conducted, and the results demonstrate the effectiveness and superiority of our
method. We hope our DocTr++ will serve as a strong baseline for generic
document image rectification, promoting the further advancement and application
of learning-based algorithms. The source code and the proposed dataset are
publicly available at https://github.com/fh2019ustc/DocTr-Plus
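The pixel-wise mapping formulation above can be illustrated with a toy resampling sketch: rectification amounts to looking up, for each output pixel, which input pixel it came from. In DocTr++ this backward map is predicted by the network and sampled with interpolation; here the map is hand-made, nearest-neighbor, and all names are assumptions for illustration.

```python
def rectify(distorted, backward_map):
    """Rectify via a pixel-wise backward mapping.

    distorted:     2-D list of pixel values.
    backward_map:  2-D list where backward_map[y][x] = (src_y, src_x) gives,
                   for each rectified output pixel, its source location in
                   the distorted image (nearest-neighbor, no interpolation).
    """
    h, w = len(backward_map), len(backward_map[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = backward_map[y][x]
            out[y][x] = distorted[sy][sx]
    return out
```

For example, a 2x2 image whose rows were flipped by the "distortion" is restored by a map that points each output row back to the opposite input row.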
Decoupled Attention Network for Text Recognition
Text recognition has attracted considerable research interest because of its
various applications. The cutting-edge text recognition methods are based on
attention mechanisms. However, most attention methods suffer from a serious
alignment problem due to their recurrent alignment operation, in which the
alignment relies on historical decoding results. To remedy this issue, we
propose a decoupled attention network (DAN), which decouples the alignment
operation from using historical decoding results. DAN is an effective, flexible
and robust end-to-end text recognizer, which consists of three components: 1) a
feature encoder that extracts visual features from the input image; 2) a
convolutional alignment module that performs the alignment operation based on
visual features from the encoder; and 3) a decoupled text decoder that makes
the final prediction by jointly using the feature map and attention maps.
Experimental results show that DAN achieves state-of-the-art performance on
multiple text recognition tasks, including offline handwritten text recognition
and regular/irregular scene text recognition.
Comment: 9 pages, 8 figures, 6 tables, accepted by AAAI-202
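The three components above can be sketched as a toy pipeline in which the attention maps are computed from visual features alone, never from previous decoding results — the decoupling that defines DAN. Everything below (the trivial per-column encoder, the one-hot alignment, and the charset lookup) is an illustrative assumption, not the actual DAN architecture.

```python
def encode(image):
    # Stand-in feature encoder: one "feature vector" per image column.
    return [[float(px)] for px in image]


def align(features, max_steps):
    # Stand-in convolutional alignment module: attention maps are derived
    # from the visual features only, with no feedback from decoded text.
    t = len(features)
    maps = []
    for step in range(max_steps):
        pos = min(step, t - 1)  # attend to one position per decoding step
        maps.append([1.0 if i == pos else 0.0 for i in range(t)])
    return maps


def decode(features, attention_maps, charset):
    # Decoupled decoder: combines the feature map with the precomputed
    # attention maps to emit one character per step.
    out = []
    for amap in attention_maps:
        glimpse = sum(w * f[0] for w, f in zip(amap, features))
        out.append(charset[int(glimpse) % len(charset)])
    return "".join(out)
```

Because `align` never sees decoder output, an early misrecognition cannot corrupt later alignment, which is the failure mode the decoupling is meant to avoid.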