147 research outputs found
TextAdaIN: Paying Attention to Shortcut Learning in Text Recognizers
Leveraging the characteristics of convolutional layers, neural networks are
extremely effective for pattern recognition tasks. However, in some cases, their
decisions are based on unintended information, leading to high performance on
standard benchmarks but also to a lack of generalization to challenging testing
conditions and unintuitive failures. Recent work has termed this "shortcut
learning" and addressed its presence in multiple domains. In text recognition,
we reveal another such shortcut, whereby recognizers overly depend on local
image statistics. Motivated by this, we suggest an approach to regulate the
reliance on local statistics that improves text recognition performance.
Our method, termed TextAdaIN, creates local distortions in the feature map
which prevent the network from overfitting to local statistics. It does so by
viewing each feature map as a sequence of elements and deliberately mismatching
fine-grained feature statistics between elements in a mini-batch. Despite
TextAdaIN's simplicity, extensive experiments show its effectiveness compared
to other, more complicated methods. TextAdaIN achieves state-of-the-art results
on standard handwritten text recognition benchmarks. It generalizes to multiple
architectures and to the domain of scene text recognition. Furthermore, we
demonstrate that integrating TextAdaIN improves robustness towards more
challenging testing conditions. The official Pytorch implementation can be
found at https://github.com/amazon-research/textadain-robust-recognition.
Comment: 12 pages, 8 figures, Accepted to ECCV 202
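The mechanism described above (viewing a feature map as a sequence of windows and deliberately mismatching their fine-grained statistics across a mini-batch) can be illustrated with a minimal pure-Python sketch. This is not the authors' PyTorch implementation: it operates on 1-D lists rather than CNN feature maps, and the function names, fixed partner choice, and window count are assumptions made for illustration.

```python
import statistics

EPS = 1e-6  # numerical guard against division by a zero std


def adain_swap(content, style):
    """Give `content` the mean/std of `style` (the core AdaIN operation)."""
    c_mean, c_std = statistics.fmean(content), statistics.pstdev(content)
    s_mean, s_std = statistics.fmean(style), statistics.pstdev(style)
    return [(x - c_mean) / (c_std + EPS) * s_std + s_mean for x in content]


def text_adain(batch, num_windows=2):
    """Split each feature sequence into windows and give each window the
    statistics of the matching window from another sample in the batch."""
    out = []
    n = len(batch)
    for i, feats in enumerate(batch):
        partner = batch[(i + 1) % n]  # fixed partner here; the paper mismatches within the mini-batch
        w = len(feats) // num_windows
        mixed = []
        for k in range(num_windows):
            lo, hi = k * w, (k + 1) * w
            mixed.extend(adain_swap(feats[lo:hi], partner[lo:hi]))
        out.append(mixed)
    return out
```

After the swap, each window carries its partner's local statistics while keeping its own normalized shape, which is the kind of local distortion that discourages overfitting to local image statistics.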
Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study
The flourishing blossom of deep learning has witnessed the rapid development
of text recognition in recent years. However, the existing text recognition
methods are mainly proposed for English texts. Chinese is another widely
spoken language, and Chinese text recognition (CTR) has extensive application
markets. Based on our observations, we attribute the scarce attention paid to CTR to
the lack of reasonable dataset construction standards, unified evaluation
protocols, and results of the existing baselines. To fill this gap, we manually
collect CTR datasets from publicly available competitions, projects, and
papers. According to application scenarios, we divide the collected datasets
into four categories including scene, web, document, and handwriting datasets.
Besides, we standardize the evaluation protocols in CTR. With unified
evaluation protocols, we evaluate a series of representative text recognition
methods on the collected datasets to provide baselines. The experimental
results indicate that the performance of baselines on CTR datasets is not as
good as that on English datasets, owing to characteristics of Chinese texts
that differ greatly from those of the Latin alphabet. Moreover, we observe that by
introducing radical-level supervision as an auxiliary task, the performance of
baselines can be further boosted. The code and datasets are made publicly
available at https://github.com/FudanVI/benchmarking-chinese-text-recognition
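The radical-level auxiliary supervision mentioned above can be sketched as a weighted multi-task objective: the recognizer is trained on characters while an auxiliary term scores the corresponding radical sequences. The toy radical table, the 0/1 losses, and the `aux_weight` parameter below are illustrative assumptions, not the benchmark's actual training code.

```python
# Toy radical decomposition table (illustrative; real CTR systems use a
# full dictionary mapping each Chinese character to its radical sequence).
RADICALS = {"好": ["女", "子"], "明": ["日", "月"]}


def char_loss(pred, target):
    # Toy 0/1 mismatch rate at the character level.
    return sum(p != t for p, t in zip(pred, target)) / max(len(target), 1)


def radical_loss(pred, target):
    # Decompose both strings into radical sequences and compare those.
    pr = [r for ch in pred for r in RADICALS.get(ch, [ch])]
    tr = [r for ch in target for r in RADICALS.get(ch, [ch])]
    return sum(p != t for p, t in zip(pr, tr)) / max(len(tr), 1)


def total_loss(pred, target, aux_weight=0.5):
    # Joint objective: character recognition plus radical-level auxiliary task.
    return char_loss(pred, target) + aux_weight * radical_loss(pred, target)
```

The auxiliary term rewards predictions that share radicals with the target even when the character is wrong, which is one plausible reading of why radical-level supervision boosts the baselines.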
Deep Unrestricted Document Image Rectification
In recent years, tremendous efforts have been made on document image
rectification, but existing advanced algorithms are limited to processing
restricted document images, i.e., the input images must incorporate a complete
document. If the captured image covers only a local text region, its
rectification quality degrades and becomes unsatisfactory. Our previously proposed
DocTr, a transformer-assisted network for document image rectification, also
suffers from this limitation. In this work, we present DocTr++, a novel unified
framework for document image rectification, without any restrictions on the
input distorted images. Our major technical improvements can be summarized in
three aspects. Firstly, we upgrade the original architecture by adopting a
hierarchical encoder-decoder structure for multi-scale representation
extraction and parsing. Secondly, we reformulate the pixel-wise mapping
relationship between the unrestricted distorted document images and the
distortion-free counterparts. The obtained data is used to train our DocTr++
for unrestricted document image rectification. Thirdly, we contribute a
real-world test set and metrics applicable for evaluating the rectification
quality. To the best of our knowledge, this is the first learning-based method for the
rectification of unrestricted document images. Extensive experiments are
conducted, and the results demonstrate the effectiveness and superiority of our
method. We hope our DocTr++ will serve as a strong baseline for generic
document image rectification, promoting the further advancement and application
of learning-based algorithms. The source code and the proposed dataset are
publicly available at https://github.com/fh2019ustc/DocTr-Plus
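The pixel-wise mapping formulation above can be illustrated with a toy resampling sketch: rectification amounts to looking up, for each output pixel, which input pixel it came from. In DocTr++ this backward map is predicted by the network and sampled with interpolation; here the map is hand-made, nearest-neighbor, and all names are assumptions for illustration.

```python
def rectify(distorted, backward_map):
    """Rectify via a pixel-wise backward mapping.

    distorted:     2-D list of pixel values.
    backward_map:  2-D list where backward_map[y][x] = (src_y, src_x) gives,
                   for each rectified output pixel, its source location in
                   the distorted image (nearest-neighbor, no interpolation).
    """
    h, w = len(backward_map), len(backward_map[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = backward_map[y][x]
            out[y][x] = distorted[sy][sx]
    return out
```

For example, a 2x2 image whose rows were flipped by the "distortion" is restored by a map that points each output row back to the opposite input row.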
Decoupled Attention Network for Text Recognition
Text recognition has attracted considerable research interest because of its
various applications. The cutting-edge text recognition methods are based on
attention mechanisms. However, most attention methods suffer from a serious
alignment problem due to their recurrent alignment operation, in which the
alignment relies on historical decoding results. To remedy this issue, we
propose a decoupled attention network (DAN), which decouples the alignment
operation from using historical decoding results. DAN is an effective, flexible
and robust end-to-end text recognizer, which consists of three components: 1) a
feature encoder that extracts visual features from the input image; 2) a
convolutional alignment module that performs the alignment operation based on
visual features from the encoder; and 3) a decoupled text decoder that makes
the final prediction by jointly using the feature map and attention maps.
Experimental results show that DAN achieves state-of-the-art performance on
multiple text recognition tasks, including offline handwritten text recognition
and regular/irregular scene text recognition.
Comment: 9 pages, 8 figures, 6 tables, accepted by AAAI-202
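The three components above can be sketched as a toy pipeline in which the attention maps are computed from visual features alone, never from previous decoding results — the decoupling that defines DAN. Everything below (the trivial per-column encoder, the one-hot alignment, and the charset lookup) is an illustrative assumption, not the actual DAN architecture.

```python
def encode(image):
    # Stand-in feature encoder: one "feature vector" per image column.
    return [[float(px)] for px in image]


def align(features, max_steps):
    # Stand-in convolutional alignment module: attention maps are derived
    # from the visual features only, with no feedback from decoded text.
    t = len(features)
    maps = []
    for step in range(max_steps):
        pos = min(step, t - 1)  # attend to one position per decoding step
        maps.append([1.0 if i == pos else 0.0 for i in range(t)])
    return maps


def decode(features, attention_maps, charset):
    # Decoupled decoder: combines the feature map with the precomputed
    # attention maps to emit one character per step.
    out = []
    for amap in attention_maps:
        glimpse = sum(w * f[0] for w, f in zip(amap, features))
        out.append(charset[int(glimpse) % len(charset)])
    return "".join(out)
```

Because `align` never sees decoder output, an early misrecognition cannot corrupt later alignment, which is the failure mode the decoupling is meant to avoid.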