Text detection and recognition in natural scene images
This thesis addresses the problem of end-to-end text detection and recognition in
natural scene images based on deep neural networks. Scene text detection and recognition
aim to find regions of an image that humans would read as text,
generate a bounding box for each word and output the corresponding sequence of
characters. As a useful task in image analysis, scene text detection and recognition
has attracted much attention in the computer vision field. In this thesis, we tackle this problem
by building on recent advances in deep learning.
Car license plates can be viewed as a special case of scene text, as both consist
of characters and appear in natural scenes. Nevertheless, each has its own particular
characteristics. In our research, we start with car license plate detection
and recognition, and then extend the methods to general scene text, proposing additional
ideas along the way.
For both tasks, we develop two approaches: a stepwise one and
an integrated one. Stepwise methods tackle text detection and recognition in
sequence with separate models, while integrated methods handle both detection and
recognition simultaneously with a single model. All approaches build on the powerful
deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs), given the tremendous breakthroughs they have brought to the computer
vision community.
To begin with, a stepwise framework is proposed to tackle text detection and
recognition, applied to car license plates and general scene text respectively.
A character CNN classifier is trained to detect characters in an image
in a sliding-window manner. The detected characters are then grouped into
license plates or text lines according to heuristic rules. A sequence-labeling-based
method is proposed to recognize the whole license plate or text line without
character-level segmentation.
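The stepwise pipeline described above can be sketched in a few lines. The trained character CNN is replaced here by a stand-in scoring function, and the grouping heuristic is reduced to merging detections by vertical position; the thesis's actual classifier and rules are more sophisticated.

```python
# Sketch of the stepwise pipeline: sliding-window character detection
# followed by heuristic grouping into text lines. `char_score` is a
# placeholder for the trained character CNN (an assumption, not the
# thesis's model): it just reports mean pixel intensity of the window.

def char_score(image, x, y, w, h):
    """Stand-in for the character CNN: score the window at (x, y)."""
    crop = [row[x:x + w] for row in image[y:y + h]]
    total = sum(sum(row) for row in crop)
    return total / float(w * h)

def detect_characters(image, win=4, stride=2, threshold=0.5):
    """Slide a fixed-size window over the image; keep high-scoring windows."""
    h, w = len(image), len(image[0])
    boxes = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            if char_score(image, x, y, win, win) > threshold:
                boxes.append((x, y, win, win))
    return boxes

def group_into_lines(boxes, y_tol=2):
    """Heuristic grouping: detections whose vertical positions differ by
    at most y_tol are merged into the same text line."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for line in lines:
            if abs(line[-1][1] - box[1]) <= y_tol:
                line.append(box)
                break
        else:
            lines.append([box])
    return lines
```

On a toy image with one horizontal bright band, the detector fires along the band and the grouping step collects all detections into a single line.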
Building on the sequence-labeling-based recognition method, and to accelerate
processing, an integrated deep neural network is then proposed to address
car license plate detection and recognition concurrently. It integrates both CNNs
and RNNs in one network and can be trained end-to-end. Both car license plate
bounding boxes and their labels are generated in a single forward evaluation of the
network. The whole process involves no heuristic rules and avoids intermediate
procedures such as image cropping or feature recalculation, which not only prevents
error accumulation but also reduces the computational burden.
Lastly, the unified network is extended to simultaneous general text detection and
recognition in natural scenes. Compared with the network for car license plates, several
innovations are proposed to accommodate the particular characteristics of general text. A
varying-size RoI encoding method is proposed to handle the varied aspect ratios of
general text, and an attention-based sequence-to-sequence learning structure is adopted
for word recognition, with the expectation that a character-level language model can be
learnt in this manner. The whole framework can be trained end-to-end, requiring
only images, the ground-truth bounding boxes and text labels. Through end-to-end
training, the learned features become more discriminative, which improves the overall
performance. The convolutional features are computed only once and shared by both
detection and recognition, which saves processing time. The proposed method
has achieved state-of-the-art performance on several standard benchmark datasets.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
Cascaded Segmentation-Detection Networks for Word-Level Text Spotting
We introduce an algorithm for word-level text spotting that accurately
and reliably determines the bounding regions of individual words of
text "in the wild". Our system is formed by a cascade of two convolutional
neural networks. The first network is fully convolutional and is in charge of
detecting areas containing text. This yields a very reliable, but possibly
inaccurate, segmentation of the input image. The second network (inspired by the
popular YOLO architecture) analyzes each segment produced in the first stage
and predicts oriented rectangular regions containing individual words. No
post-processing (e.g. text line grouping) is necessary. With an execution time of
450 ms for a 1000-by-560 image on a Titan X GPU, our system achieves the
highest score to date among published algorithms on the ICDAR 2015 Incidental
Scene Text dataset benchmark.
Comment: 7 pages, 8 figure
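The hand-off between the two stages can be sketched as connected-component extraction on the stage-1 segmentation mask: each component becomes one region for the second network to analyze. The YOLO-style word-box regressor itself is not reproduced here.

```python
# Extract one bounding box per 4-connected text region of a binary
# segmentation mask (the output of a stage-1 fully convolutional
# network). Each box would then be cropped and passed to the
# second-stage word detector.

from collections import deque

def text_segments(mask):
    """Return (x, y, w, h) for each 4-connected region of 1s in `mask`."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                q = deque([(r, c)])
                rs, cs = [r], [c]
                while q:  # BFS flood fill over this component
                    cr, cc = q.popleft()
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = cr + dr, cc + dc
                        if 0 <= nr < h and 0 <= nc < w \
                                and mask[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            q.append((nr, nc))
                            rs.append(nr)
                            cs.append(nc)
                boxes.append((min(cs), min(rs),
                              max(cs) - min(cs) + 1, max(rs) - min(rs) + 1))
    return boxes
```

Because the stage-1 mask only needs to be reliable rather than precise, slightly oversized components are acceptable: the second network refines each one into oriented word boxes.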
COCO_TS Dataset: Pixel-level Annotations Based on Weak Supervision for Scene Text Segmentation
The absence of large-scale datasets with pixel-level supervision is a
significant obstacle to training deep convolutional networks for scene
text segmentation. For this reason, synthetic data generation is normally
employed to enlarge the training set. Nonetheless, synthetic data cannot
reproduce the complexity and variability of natural images. In this paper, a
weakly supervised learning approach is used to reduce the shift between
training on real and synthetic data. Pixel-level supervisions are generated
for a text detection dataset (i.e. one where only bounding-box annotations
are available). In particular, the COCO-Text-Segmentation (COCO_TS) dataset, which
provides pixel-level supervisions for the COCO-Text dataset, is created and
released. The generated annotations are used to train a deep convolutional
neural network for semantic segmentation. Experiments show that the proposed
dataset can be used instead of synthetic data, allowing us to use only a
fraction of the training samples while significantly improving the performances
- …
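The core of the weak-supervision idea is deriving a pixel-level mask from box-level annotations alone. A minimal sketch of that starting point, assuming simple box rasterisation, is below; COCO_TS refines such masks further (e.g. with a model pre-trained on synthetic data), which this toy version omits.

```python
# Naive weak supervision: rasterise word bounding boxes (x, y, w, h)
# into a binary text/background mask usable as a segmentation target.
# Marking the whole box interior as text is a deliberate over-
# approximation; the paper's pipeline produces cleaner masks.

def boxes_to_mask(height, width, boxes):
    mask = [[0] * width for _ in range(height)]
    for x, y, w, h in boxes:
        for r in range(y, min(y + h, height)):
            for c in range(x, min(x + w, width)):
                mask[r][c] = 1
    return mask
```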