Text detection and recognition in natural scene images
This thesis addresses the problem of end-to-end text detection and recognition in
natural scene images based on deep neural networks. Scene text detection and recognition
aim to find regions in an image that are considered as text by human beings,
generate a bounding box for each word and output a corresponding sequence of
characters. As a useful task in image analysis, scene text detection and recognition
attract much attention in the computer vision field. In this thesis, we tackle this problem
by taking advantage of the success of deep learning techniques.
Car license plates can be viewed as a special case of scene text, as they both consist
of characters and appear in natural scenes. Nevertheless, each has its own
specificities. In the course of this research, we start from car license plate detection
and recognition. Then we extend the methods to general scene text, with additional
ideas proposed.
For each task, we develop two approaches: a stepwise one and
an integrated one. Stepwise methods tackle text detection and recognition one after
the other with separate models, while integrated methods handle both text detection and
recognition simultaneously via one model. All approaches are based on the powerful
deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs), considering the tremendous breakthroughs they brought into the computer
vision community.
To begin with, a stepwise framework is proposed to tackle text detection and
recognition, with its application to car license plates and general scene text respectively.
A character CNN classifier is trained to detect characters in an image
in a sliding-window manner. The detected characters are then grouped together as
license plates or text lines according to some heuristic rules. A sequence labeling
based method is proposed to recognize the whole license plate or text line without
character level segmentation.
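The grouping step described above can be illustrated with a minimal sketch. The abstract does not state which heuristic rules are used, so the rule below (chain boxes left to right when they overlap vertically and the horizontal gap is small) is an assumption made purely for illustration, as are the names `group_into_lines`, `line_bbox`, `max_gap_ratio` and `min_v_overlap`.

```python
from typing import List, Tuple

# A character box as (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def group_into_lines(boxes: List[Box],
                     max_gap_ratio: float = 1.0,
                     min_v_overlap: float = 0.5) -> List[List[Box]]:
    """Greedily chain character boxes, left to right, into text lines.

    Hypothetical rule (not taken from the thesis): a box joins a line when
    it overlaps the line's last box vertically by at least `min_v_overlap`
    of the smaller height and the horizontal gap is at most
    `max_gap_ratio` times that height.
    """
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        for line in lines:
            last = line[-1]
            height = min(last[3] - last[1], box[3] - box[1])
            overlap = min(last[3], box[3]) - max(last[1], box[1])
            gap = box[0] - last[2]
            if overlap >= min_v_overlap * height and gap <= max_gap_ratio * height:
                line.append(box)
                break
        else:
            # No existing line accepted the box, so it starts a new one.
            lines.append([box])
    return lines


def line_bbox(line: List[Box]) -> Box:
    """Smallest box enclosing every character box in a line."""
    return (min(b[0] for b in line), min(b[1] for b in line),
            max(b[2] for b in line), max(b[3] for b in line))


# Two rows of detected characters: three in the upper row, two in the lower.
detections = [(0, 0, 10, 20), (12, 1, 22, 21), (24, 0, 34, 20),
              (0, 100, 10, 120), (13, 100, 23, 120)]
text_lines = group_into_lines(detections)
line_boxes = [line_bbox(line) for line in text_lines]
```

Each resulting line box would then be passed to the recognizer as a whole, which is what lets the recognition stage skip character-level segmentation.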
On the basis of the sequence labeling based recognition method, to accelerate the
processing speed, an integrated deep neural network is then proposed to address
car license plate detection and recognition concurrently. It integrates both CNNs
and RNNs in one network, and can be trained end-to-end. Both car license plate
bounding boxes and their labels are generated in a single forward evaluation of the
network. The whole process involves no heuristic rule, and avoids intermediate
procedures like image cropping or feature recalculation, which not only prevents
error accumulation, but also reduces the computational burden.
Lastly, the unified network is extended to simultaneous general text detection and
recognition in natural scenes. In contrast to the one for car license plates, some innovations
are proposed to accommodate the special characteristics of general text. A
varying-size RoI encoding method is proposed to handle the various aspect ratios of general text. An attention-based sequence-to-sequence learning structure is adopted
for word recognition. It is expected that a character-level language model can be
learnt in this manner. The whole framework can be trained end-to-end, requiring
only images, the ground-truth bounding boxes and text labels. Through end-to-end
training, the learned features can be more discriminative, which improves the overall
performance. The convolutional features are calculated only once and shared by both
detection and recognition, which saves processing time. The proposed method
has achieved state-of-the-art performance on several standard benchmark datasets.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
AON: Towards Arbitrarily-Oriented Text Recognition
Recognizing text from natural images is a hot research topic in computer
vision due to its various applications. Despite the enduring research of
several decades on optical character recognition (OCR), recognizing texts from
natural images is still a challenging task. This is because scene texts are
often in irregular (e.g. curved, arbitrarily-oriented or seriously distorted)
arrangements, which have not yet been well addressed in the literature.
Existing methods on text recognition mainly work with regular (horizontal and
frontal) texts and cannot be trivially generalized to handle irregular texts.
In this paper, we develop the arbitrary orientation network (AON) to directly
capture the deep features of irregular texts, which are combined into an
attention-based decoder to generate character sequence. The whole network can
be trained end-to-end by using only images and word-level annotations.
Extensive experiments on various benchmarks, including the CUTE80,
SVT-Perspective, IIIT5k, SVT and ICDAR datasets, show that the proposed
AON-based method achieves state-of-the-art performance on irregular
datasets, and is comparable to major existing methods on regular datasets.
Comment: Accepted by CVPR201
Enhancing Energy Minimization Framework for Scene Text Recognition with Top-Down Cues
Recognizing scene text is a challenging problem, even more so than the
recognition of scanned documents. This problem has gained significant attention
from the computer vision community in recent years, and several methods based
on energy minimization frameworks and deep learning approaches have been
proposed. In this work, we focus on the energy minimization framework and
propose a model that exploits both bottom-up and top-down cues for recognizing
cropped words extracted from street images. The bottom-up cues are derived from
individual character detections from an image. We build a conditional random
field model on these detections to jointly model the strength of the detections
and the interactions between them. These interactions are top-down cues
obtained from a lexicon-based prior, i.e., language statistics. The optimal
word represented by the text image is obtained by minimizing the energy
function corresponding to the random field model. We evaluate our proposed
algorithm extensively on a number of cropped scene text benchmark datasets,
namely Street View Text, ICDAR 2003, 2011 and 2013 datasets, and IIIT 5K-word,
and show better performance than comparable methods. We perform a rigorous
analysis of all the steps in our approach and analyze the results. We also show
that state-of-the-art convolutional neural network features can be integrated
in our framework to further improve the recognition performance.
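As a rough illustration of the energy minimization idea in this abstract, the sketch below treats a word as a chain of character positions. Each position has unary costs (standing in for bottom-up detection strength) and adjacent positions have pairwise costs (standing in for a lexicon-derived prior). The chain structure, the exact dynamic-programming solver, the cost values and the names `min_energy_word` and `default_pair_cost` are all assumptions for illustration; the paper's actual model and inference procedure may differ.

```python
from typing import Dict, List, Tuple


def min_energy_word(unary: List[Dict[str, float]],
                    pairwise: Dict[Tuple[str, str], float],
                    default_pair_cost: float = 5.0) -> Tuple[str, float]:
    """Return the character sequence with the lowest total energy.

    Energy = sum of unary costs + sum of pairwise costs between neighbours.
    Character pairs missing from `pairwise` get `default_pair_cost`
    (a hypothetical penalty for pairs the prior has never seen).
    """
    # best[c]: lowest energy of any labeling of the positions so far ending in c.
    best = dict(unary[0])
    back: List[Dict[str, str]] = []
    for costs in unary[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for char, cost in costs.items():
            prev, energy = min(
                ((p, e + pairwise.get((p, char), default_pair_cost))
                 for p, e in best.items()),
                key=lambda item: item[1])
            new_best[char] = energy + cost
            pointers[char] = prev
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the cheapest final character.
    last, total = min(best.items(), key=lambda item: item[1])
    chars = [last]
    for pointers in reversed(back):
        chars.append(pointers[chars[-1]])
    return "".join(reversed(chars)), total


# Made-up costs: the detector slightly prefers "o" at position 1,
# but the pairwise prior favours "c-a-t".
unary_costs = [{"c": 0.2, "e": 0.9}, {"a": 0.6, "o": 0.5}, {"t": 0.3, "l": 0.4}]
pair_costs = {("c", "a"): 0.1, ("a", "t"): 0.1, ("c", "o"): 0.5, ("o", "t"): 1.0}
word, energy = min_energy_word(unary_costs, pair_costs)
```

With these toy costs, picking the cheapest character at each position independently would give "cot", while minimizing the joint energy yields "cat", which is the kind of correction a top-down language cue is meant to provide.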
WordFences: Text localization and recognition
In collaboration with the Universitat de Barcelona (UB) and the Universitat Rovira i Virgili (URV).
In recent years, text recognition has achieved remarkable success in recognizing scanned
document text. However, word recognition in natural images is still an open problem,
which generally requires time consuming post-processing steps. We present a novel architecture
for individual word detection in scene images based on semantic segmentation.
Our contributions are twofold: the concept of WordFence, which detects border areas
surrounding each individual word and a unique pixelwise weighted softmax loss function
which penalizes background and emphasizes small text regions. WordFence ensures that
each word is detected individually, and the new loss function provides a strong training
signal to both text and word border localization. The proposed technique avoids intensive
post-processing by combining semantic word segmentation with a voting scheme
for merging segmentations of multiple scales, producing an end-to-end word detection
system. We achieve superior localization recall on common benchmark datasets - 92%
recall on ICDAR11 and ICDAR13 and 63% recall on SVT. Furthermore, end-to-end
word recognition achieves state-of-the-art 86% F-Score on ICDAR13.
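The general shape of a pixelwise weighted softmax loss can be sketched as follows. The abstract does not give the weighting formula, so the per-pixel weights here are simply inputs chosen by the caller. The three-class layout (background, text, word border) and the name `weighted_softmax_loss` are assumptions for illustration, not the paper's definition.

```python
import math
from typing import Sequence


def weighted_softmax_loss(logits: Sequence[Sequence[float]],
                          labels: Sequence[int],
                          weights: Sequence[float]) -> float:
    """Weighted mean of per-pixel softmax cross-entropy.

    logits:  one list of class scores per pixel (image flattened to a list).
    labels:  ground-truth class index per pixel.
    weights: per-pixel weight; under the assumption made here, small values
             would down-weight background and large values would emphasize
             pixels in small text regions.
    """
    total = 0.0
    for scores, label, weight in zip(logits, labels, weights):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += weight * (log_z - scores[label])
    return total / sum(weights)


# Two pixels with assumed classes 0 = background, 1 = text, 2 = border.
# The same wrong prediction costs more on the heavily weighted pixel.
scores = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
loss_error_on_heavy = weighted_softmax_loss(scores, [1, 1], [5.0, 1.0])
loss_error_on_light = weighted_softmax_loss(scores, [1, 1], [1.0, 5.0])
```

In this toy case the first pixel is misclassified, so giving it the larger weight produces the larger loss, which is the mechanism by which such a loss can push training toward small or hard regions.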
- …