Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
Recently, models based on deep neural networks have dominated the fields of
scene text detection and recognition. In this paper, we investigate the problem
of scene text spotting, which aims at simultaneous text detection and
recognition in natural images. An end-to-end trainable neural network model for
scene text spotting is proposed. The proposed model, named Mask TextSpotter,
is inspired by the newly published work Mask R-CNN. Different from previous
methods that also accomplish text spotting with end-to-end trainable deep
neural networks, Mask TextSpotter takes advantage of a simple and smooth
end-to-end learning procedure, in which precise text detection and recognition
are acquired via semantic segmentation. Moreover, it is superior to previous
methods in handling text instances of irregular shapes, for example, curved
text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the
proposed method achieves state-of-the-art results in both scene text detection
and end-to-end text recognition tasks.
Comment: To appear in ECCV 201
Unconstrained Scene Text and Video Text Recognition for Arabic Script
Building robust recognizers for Arabic has always been challenging. We
demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid
architecture in recognizing Arabic text in videos and natural scenes. We
outperform previous state-of-the-art on two publicly available video text
datasets - ALIF and ACTIV. For the scene text recognition task, we introduce a
new Arabic scene text dataset and establish baseline results. For scripts like
Arabic, a major challenge in developing robust recognizers is the lack of large
quantities of annotated data. We overcome this by synthesising millions of Arabic
text images from a large vocabulary of Arabic words and phrases. Our
implementation is built on top of the model introduced in [37], which has
proven quite effective for English scene text recognition. The model follows a
segmentation-free, sequence-to-sequence transcription approach. The network
transcribes a sequence of convolutional features from the input image to a
sequence of target labels. This does away with the need for segmenting the input
image into constituent characters/glyphs, which is often difficult for Arabic
script. Further, the ability of RNNs to model contextual dependencies yields
superior recognition results.
Comment: 5 page
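
The segmentation-free transcription described in this abstract can be illustrated with a minimal decoding sketch. The abstract does not say which output layer the model uses; the sketch assumes a CTC-style layer with a blank label, and the function name `ctc_greedy_decode`, the `BLANK` index and the toy alphabet are hypothetical choices made for illustration only.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the blank label (an assumption, not from the abstract)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Turn per-frame label scores into a label sequence without segmentation.

    frame_scores[t][k] is the score of label k at feature frame t.
    Label 0 is the blank; label k > 0 maps to alphabet[k - 1].
    """
    decoded: List[str] = []
    previous = BLANK
    for scores in frame_scores:
        # Pick the highest-scoring label for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a character only when it is not blank and not a repeat
        # of the label chosen for the previous frame.
        if best != BLANK and best != previous:
            decoded.append(alphabet[best - 1])
        previous = best
    return "".join(decoded)


# Five frames over a two-letter toy alphabet: a, a, blank, a, b
frames = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
]
result = ctc_greedy_decode(frames, "ab")
```

Repeated frames collapse into one character and the blank separates genuine double letters, so no character boundaries ever have to be located in the image.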
Towards Detecting, Recognizing, and Parsing the Address Information from Bangla Signboard: A Deep Learning-based Approach
Retrieving textual information from natural scene images is an active
research area in the field of computer vision with numerous practical
applications. Detecting text regions and extracting text from signboards is a
challenging problem due to special characteristics like reflecting lights,
uneven illumination, or shadows found in real-life natural scene images. With
the advent of deep learning-based methods, different sophisticated techniques
have been proposed for text detection and text recognition from the natural
scene. Though a significant amount of effort has been devoted to extracting
natural scene text for resource-rich languages like English, little has been done
for low-resource languages like Bangla. In this research work, we have proposed
an end-to-end system with deep learning-based models for efficiently detecting,
recognizing, correcting, and parsing address information from Bangla
signboards. We have created manually annotated datasets and synthetic datasets
to train signboard detection, address text detection, address text recognition,
address text correction, and address text parser models. We have conducted a
comparative study among different CTC-based and Encoder-Decoder model
architectures for Bangla address text recognition. Moreover, we have designed a
novel address text correction model using a sequence-to-sequence
transformer-based network to improve the performance of the Bangla address text
recognition model by post-correction. Finally, we have developed a Bangla
address text parser using the state-of-the-art transformer-based pre-trained
language model
Text detection and recognition in natural scene images
This thesis addresses the problem of end-to-end text detection and recognition in
natural scene images based on deep neural networks. Scene text detection and recognition
aim to find regions in an image that are considered as text by human beings,
generate a bounding box for each word and output a corresponding sequence of
characters. As a useful task in image analysis, scene text detection and recognition
attract much attention in the computer vision field. In this thesis, we tackle this problem
by taking advantage of the success of deep learning techniques.
Car license plates can be viewed as a special case of scene text, as they both consist
of characters and appear in natural scenes. Nevertheless, they have their respective
specificities. In the course of this research, we start from car license plate detection
and recognition. Then we extend the methods to general scene text, with additional
ideas proposed.
For both tasks, we develop two approaches: a stepwise one and
an integrated one. Stepwise methods tackle text detection and recognition one after
the other with separate models, while integrated methods handle both text detection and
recognition simultaneously via one model. All approaches are based on the powerful
deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs), considering the tremendous breakthroughs they brought into the computer
vision community.
To begin with, a stepwise framework is proposed to tackle text detection and
recognition, with its application to car license plates and general scene text respectively.
A character CNN classifier is trained to detect characters from an image
in a sliding window manner. The detected characters are then grouped together as
license plates or text lines according to some heuristic rules. A sequence labeling
based method is proposed to recognize the whole license plate or text line without
character level segmentation.
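
The grouping step in the paragraph above can be sketched as follows. The thesis abstract does not state its heuristic rules, so the two rules used here (enough vertical overlap and a small horizontal gap), the thresholds, and the name `group_into_lines` are all hypothetical and serve only to illustrate the idea.

```python
from typing import List, Tuple

# A detected character box: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def group_into_lines(
    boxes: List[Box],
    max_gap_ratio: float = 1.0,   # assumed: gap may be at most one character height
    min_v_overlap: float = 0.5,   # assumed: at least half the height must overlap
) -> List[List[Box]]:
    """Greedily chain character boxes, left to right, into text lines."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        for line in lines:
            last = line[-1]
            height = min(last[3] - last[1], box[3] - box[1])
            overlap = min(last[3], box[3]) - max(last[1], box[1])
            gap = box[0] - last[2]
            if (
                height > 0
                and overlap >= min_v_overlap * height
                and gap <= max_gap_ratio * height
            ):
                line.append(box)
                break
        else:
            # No existing line accepts this box, so it starts a new one.
            lines.append([box])
    return lines


chars = [(0, 0, 10, 10), (12, 1, 22, 11), (100, 0, 110, 10), (0, 50, 10, 60)]
lines = group_into_lines(chars)
```

In this toy input the first two boxes form one line, while the distant box and the box on a lower row each start their own line.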
Building on the sequence-labeling-based recognition method, and to speed up
processing, an integrated deep neural network is then proposed to address
car license plate detection and recognition concurrently. It integrates both CNNs
and RNNs in one network, and can be trained end-to-end. Both car license plate
bounding boxes and their labels are generated in a single forward evaluation of the
network. The whole process involves no heuristic rule, and avoids intermediate
procedures like image cropping or feature recalculation, which not only prevents
error accumulation, but also reduces the computational burden.
Lastly, the unified network is extended to simultaneous general text detection and
recognition in natural scenes. In contrast to the one for car license plates, some innovations
are proposed to accommodate the special characteristics of general text. A
varying-size RoI encoding method is proposed to handle the various aspect ratios
of general text. An attention-based sequence-to-sequence learning structure is adopted
for word recognition. It is expected that a character-level language model can be
learnt in this manner. The whole framework can be trained end-to-end, requiring
only images, the ground-truth bounding boxes and text labels. Through end-to-end
training, the learned features can be more discriminative, which improves the overall
performance. The convolutional features are calculated only once and shared by both
detection and recognition, which saves processing time. The proposed method
has achieved state-of-the-art performance on several standard benchmark datasets.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
AON: Towards Arbitrarily-Oriented Text Recognition
Recognizing text from natural images is a hot research topic in computer
vision due to its various applications. Despite several decades of research
on optical character recognition (OCR), recognizing texts from
natural images is still a challenging task. This is because scene texts are
often in irregular (e.g. curved, arbitrarily-oriented or seriously distorted)
arrangements, which have not yet been well addressed in the literature.
Existing methods on text recognition mainly work with regular (horizontal and
frontal) texts and cannot be trivially generalized to handle irregular texts.
In this paper, we develop the arbitrary orientation network (AON) to directly
capture the deep features of irregular texts, which are combined into an
attention-based decoder to generate the character sequence. The whole network can
be trained end-to-end by using only images and word-level annotations.
Extensive experiments on various benchmarks, including the CUTE80,
SVT-Perspective, IIIT5k, SVT and ICDAR datasets, show that the proposed
AON-based method achieves state-of-the-art performance on irregular
datasets, and is comparable to major existing methods on regular datasets.
Comment: Accepted by CVPR201
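
A single step of an attention-based decoder, as mentioned in this abstract, can be sketched roughly as below. The abstract does not give AON's scoring function; this sketch assumes plain dot-product scoring, and the name `attention_context` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attention_context(
    query: Sequence[float],
    features: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one decoding step.

    query is the decoder state; features holds one vector per image position.
    Scoring is a dot product here, which is an assumption of this sketch.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    # Numerically stable softmax over the positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of the feature vectors.
    dim = len(features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


weights, context = attention_context([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
```

At each step the decoder would use the context vector to predict the next character, so the model can attend to different image positions for different characters, whatever the orientation of the text.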