117 research outputs found
GTC: Guided Training of CTC Towards Efficient and Accurate Scene Text Recognition
Connectionist Temporal Classification (CTC) and the attention mechanism are the
two main approaches used in recent scene text recognition work. Compared with
attention-based methods, a CTC decoder has a much shorter inference time but
lower accuracy. To design an efficient and effective model, we propose the
guided training of CTC (GTC), in which the CTC model learns better alignments
and feature representations from a more powerful attentional guidance. With the
benefit of guided training, the CTC model achieves robust and accurate
prediction for both regular and irregular scene text while maintaining a fast
inference speed. Moreover, to further leverage the potential of the CTC decoder, a graph
convolutional network (GCN) is proposed to learn the local correlations of
extracted features. Extensive experiments on standard benchmarks demonstrate
that our end-to-end model achieves a new state-of-the-art for regular and
irregular scene text recognition while requiring six times less inference time
than attention-based methods.
Comment: Accepted by AAAI 202
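The speed advantage of the CTC decoder comes largely from its simple greedy decoding rule: take the per-frame argmax, merge consecutive duplicates, then drop blanks. A minimal sketch of that collapse step (the function name and blank index are illustrative assumptions, not taken from GTC):

```python
BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)

def ctc_greedy_decode(frame_labels):
    """Collapse per-frame argmax labels into a character sequence:
    first merge consecutive duplicates, then remove blank symbols."""
    decoded = []
    prev = None
    for label in frame_labels:
        # keep a label only if it differs from the previous frame
        # and is not the blank symbol
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return decoded

# with 1=h, 2=e, 3=l, 4=o, the frame sequence "h h - e - l l - l o"
# collapses to "hello"
print(ctc_greedy_decode([1, 1, 0, 2, 0, 3, 3, 0, 3, 4]))  # → [1, 2, 3, 3, 4]
```

Because this is a single linear pass with no per-step attention computation, decoding cost grows only with sequence length, which is why CTC inference is so much faster than attention-based decoding.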
Diversity vs. Recognizability: Human-like generalization in one-shot generative models
Robust generalization to new concepts has long remained a distinctive feature
of human intelligence. However, recent progress in deep generative models has
now led to neural architectures capable of synthesizing novel instances of
unknown visual concepts from a single training example. Yet, a more precise
comparison between these models and humans is not possible because existing
performance metrics for generative models (i.e., FID, IS, likelihood) are not
appropriate for the one-shot generation scenario. Here, we propose a new
framework to evaluate one-shot generative models along two axes: sample
recognizability vs. diversity (i.e., intra-class variability). Using this
framework, we perform a systematic evaluation of representative one-shot
generative models on the Omniglot handwritten dataset. We first show that
GAN-like and VAE-like models fall on opposite ends of the
diversity-recognizability space. Extensive analyses of the effects of key model
parameters further reveal that spatial attention and context integration
contribute linearly to the diversity-recognizability trade-off. In contrast,
disentanglement transports the model along a parabolic curve that could be used
to maximize recognizability. Using the diversity-recognizability framework, we
were able to identify models and parameters that closely approximate human
data.
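The two evaluation axes can be sketched numerically: recognizability as the fraction of generated samples that a critic classifier assigns to the intended concept, and diversity as intra-class variability of samples drawn from the same concept. The function names and the specific variability formula below are simplifying assumptions; the paper computes both axes with trained networks:

```python
import numpy as np

def recognizability(pred_labels, true_labels):
    """Fraction of generated samples a critic classifier assigns
    to the intended concept class (accuracy on generations)."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return float(np.mean(pred == true))

def diversity(samples):
    """Intra-class variability: mean per-feature standard deviation
    across samples generated from a single concept."""
    x = np.asarray(samples, dtype=float)
    return float(np.mean(np.std(x, axis=0)))

# a model that copies its training example exactly scores zero diversity
print(diversity([[1.0, 2.0], [1.0, 2.0]]))  # → 0.0
```

Plotting models in this two-dimensional space makes the trade-off explicit: mode-collapsed generators sit at high recognizability but near-zero diversity, while overly noisy ones show the reverse.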
On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention
Scene text recognition (STR) is the task of recognizing character sequences
in natural scenes. While there have been great advances in STR methods, current
methods still fail to recognize texts in arbitrary shapes, such as heavily
curved or rotated texts, which are abundant in daily life (e.g. restaurant
signs, product labels, company logos, etc.). This paper introduces a novel
architecture for recognizing texts of arbitrary shapes, named the Self-Attention
Text Recognition Network (SATRN), which is inspired by the Transformer. SATRN
utilizes the self-attention mechanism to describe two-dimensional (2D) spatial
dependencies of characters in a scene text image. Exploiting the full-graph
propagation of self-attention, SATRN can recognize texts with arbitrary
arrangements and large inter-character spacing. As a result, SATRN outperforms
existing STR models by a large margin of 5.7 pp on average in "irregular text"
benchmarks. We provide empirical analyses that illustrate the inner mechanisms
and the extent to which the model is applicable (e.g. rotated and multi-line
text). We will open-source the code.
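The core idea of attending over a full 2D feature map can be sketched as scaled dot-product self-attention applied to the flattened grid of spatial positions, so every position can attend to every other one regardless of text arrangement. The single-head setup and shapes below are simplifying assumptions, not SATRN's exact architecture:

```python
import numpy as np

def self_attention_2d(feat):
    """feat: (H, W, C) feature map -> (H, W, C) attended map.
    Flattens the 2D grid into H*W tokens, then applies scaled
    dot-product self-attention over all spatial positions."""
    h, w, c = feat.shape
    x = feat.reshape(h * w, c)                    # tokens over the full 2D grid
    scores = x @ x.T / np.sqrt(c)                 # pairwise similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over positions
    return (attn @ x).reshape(h, w, c)

out = self_attention_2d(np.ones((2, 3, 4)))
print(out.shape)  # → (2, 3, 4)
```

Because the attention weights connect all H×W positions directly, dependencies between characters on a curved or rotated baseline cost no more than those on a straight one, which is the property the abstract calls full-graph propagation.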
Deep Learning for Scene Text Detection, Recognition, and Understanding
Detecting and recognizing texts in images is a long-standing task in computer vision. The goal of this task is to extract textual information from images and videos, such as recognizing license plates. Despite the great progress made in recent years, the task remains challenging due to the wide range of variations in text appearance. In this thesis, we aim to review the issues that hinder current Optical Character Recognition (OCR) development and explore potential solutions. Specifically, we first investigate the phenomenon of unfair comparisons between different OCR algorithms caused by the lack of a consistent evaluation framework. The absence of a unified evaluation protocol leads to inconsistent and unreliable results, making it difficult to compare and improve upon existing methods. To tackle this issue, we design a new evaluation framework spanning datasets, metrics, and models, enabling consistent and fair comparisons between OCR systems. Another issue in the field is the imbalanced distribution of training samples. In particular, the sample distribution largely depends on where and how the data was collected, and the resulting data bias may lead to poor performance and low generalizability on under-represented classes. To address this problem, we take the license plate recognition task as an example and propose a text-to-image model that synthesizes photo-realistic text samples. Using this model, we synthesize more than one million samples to augment the training dataset, significantly improving the generalization capability of OCR models. Additionally, this thesis explores text-based visual question answering, a new and emerging research topic in the OCR community. This task challenges OCR models to understand the relationships between text and background and to answer the given questions.
In this thesis, we propose to investigate evidence-based text VQA, which involves designing models that can provide reasonable evidence for their predictions, thus improving generalization ability.
Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 202