10,391 research outputs found
A review on handwritten character and numeral recognition for Roman, Arabic, Chinese and Indian scripts
There has been intensive research on handwritten character recognition
(HCR) for almost four decades, focused on popular scripts such as Roman,
Arabic, Chinese and Indian. In this paper we present a review of HCR work on
these four popular scripts. We have summarized most of the papers published
from 2005 to the present and analyzed the various methods for creating a
robust HCR system. We also suggest some future directions of research on HCR.
Comment: 8 pages
Scene Text Recognition with Sliding Convolutional Character Models
Scene text recognition has attracted great interest from the computer vision
and pattern recognition community in recent years. State-of-the-art methods use
convolutional neural networks (CNNs), recurrent neural networks with long
short-term memory (RNN-LSTM), or a combination of the two. In this paper, we
investigate the intrinsic characteristics of text recognition, and inspired by
human cognition mechanisms in reading texts, we propose a scene text
recognition method with character models on convolutional feature map. The
method simultaneously detects and recognizes characters by sliding the text
line image with character models, which are learned end-to-end on text line
images labeled with text transcripts. The character classifier outputs on the
sliding windows are normalized and decoded with Connectionist Temporal
Classification (CTC) based algorithm. Compared to previous methods, our method
has a number of appealing properties: (1) It avoids the difficulty of character
segmentation which hinders the performance of segmentation-based recognition
methods; (2) The model can be trained simply and efficiently because it avoids
gradient vanishing/exploding in training RNN-LSTM based models; (3) It is
based on character models trained lexicon-free, and can recognize unknown
words; (4) The recognition process is highly parallel and enables fast
recognition. Our
experiments on several challenging English and Chinese benchmarks, including
the IIIT-5K, SVT, ICDAR03/13 and TRW15 datasets, demonstrate that the proposed
method yields superior or comparable performance to state-of-the-art methods
while the model size is relatively small.
Comment: 10 pages, 4 figures
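The decoding step described above, where per-window classifier outputs are collapsed by a CTC-based algorithm, can be sketched as best-path (greedy) CTC decoding: pick the top class for each sliding window, merge consecutive repeats, then drop the blank symbol. The alphabet and scores below are illustrative toy values, not from the paper.

```python
# Minimal best-path (greedy) CTC decoding over sliding-window outputs.
BLANK = "-"

def ctc_greedy_decode(window_scores, alphabet):
    """window_scores: one list of per-class scores per sliding window."""
    # argmax class per window
    best = [alphabet[max(range(len(s)), key=s.__getitem__)]
            for s in window_scores]
    decoded, prev = [], None
    for ch in best:
        # collapse consecutive repeats, then remove blanks
        if ch != prev and ch != BLANK:
            decoded.append(ch)
        prev = ch
    return "".join(decoded)

alphabet = [BLANK, "c", "a", "t"]
scores = [
    [0.1, 0.8, 0.05, 0.05],  # 'c'
    [0.7, 0.1, 0.1, 0.1],    # blank
    [0.1, 0.1, 0.7, 0.1],    # 'a'
    [0.1, 0.1, 0.6, 0.2],    # 'a' (repeat, collapsed)
    [0.1, 0.1, 0.1, 0.7],    # 't'
]
print(ctc_greedy_decode(scores, alphabet))  # -> cat
```

Full CTC training uses the forward-backward algorithm over all alignments; the greedy path above only illustrates how window-level outputs become a transcript.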
Reading Scene Text with Attention Convolutional Sequence Modeling
Reading text in the wild is a challenging task in the field of computer
vision. Existing approaches mainly adopt Connectionist Temporal
Classification (CTC) or attention models based on recurrent neural networks
(RNNs), which are computationally expensive and hard to train. In this paper, we
present an end-to-end Attention Convolutional Network for scene text
recognition. Firstly, instead of an RNN, we adopt stacked convolutional layers
to effectively capture the contextual dependencies of the input sequence, which
is characterized by lower computational complexity and easier parallel
computation. Compared to the chain structure of recurrent networks, the
Convolutional Neural Network (CNN) provides a natural way to capture long-term
dependencies between elements, which is 9 times faster than Bidirectional Long
Short-Term Memory (BLSTM). Furthermore, in order to enhance the representation
of foreground text and suppress the background noise, we incorporate the
residual attention modules into a small densely connected network to improve
the discriminability of CNN features. We validate the performance of our
approach on the standard benchmarks, including the Street View Text, IIIT5K and
ICDAR datasets. The resulting state-of-the-art or highly competitive
performance and efficiency demonstrate the superiority of the proposed
approach.
Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition
Although attention based end-to-end models have achieved promising
performance in speech recognition, the multi-pass forward computation in
beam-search increases inference time cost, which limits their practical
applications. To address this issue, we propose a non-autoregressive end-to-end
speech recognition system called LASO (listen attentively, and spell once).
Because of the non-autoregressive property, LASO predicts each textual token
in the sequence without depending on the other tokens. Without beam-search,
the one-pass propagation greatly reduces the inference time cost of LASO.
Because the model is based on an attention-based feedforward structure, the
computation can be implemented efficiently in parallel. We conduct experiments
on the publicly
available Chinese dataset AISHELL-1. LASO achieves a character error rate of
6.4%, which outperforms the state-of-the-art autoregressive transformer model
(6.7%). The average inference latency is 21 ms, which is 1/50 of the
autoregressive transformer model.
Comment: accepted by INTERSPEECH202
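The latency argument above reduces to a difference in how many model forward passes decoding needs. The toy functions below stand in for the real acoustic model (they are assumptions for illustration): an autoregressive decoder calls the model once per emitted token, while a non-autoregressive one predicts every position in a single pass.

```python
# Sketch: autoregressive vs. non-autoregressive decoding, counting model calls.
# predict_next_token / predict_all_positions are stand-ins, not the real model.

def predict_next_token(prefix):
    target = "ni hao"
    return target[len(prefix)] if len(prefix) < len(target) else None

def predict_all_positions(max_len):
    target = "ni hao"
    return [target[i] if i < len(target) else None for i in range(max_len)]

def autoregressive_decode():
    out, calls = [], 0
    while True:
        calls += 1                      # one forward pass per step
        tok = predict_next_token(out)
        if tok is None:
            break
        out.append(tok)
    return "".join(out), calls

def non_autoregressive_decode(max_len=8):
    toks = predict_all_positions(max_len)  # a single forward pass
    return "".join(t for t in toks if t is not None), 1

print(autoregressive_decode())      # ('ni hao', 7)
print(non_autoregressive_decode())  # ('ni hao', 1)
```

With beam search the autoregressive cost is multiplied again by the beam width, which is the gap the reported 1/50 latency reflects.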
Going Wider: Recurrent Neural Network With Parallel Cells
Recurrent neural networks (RNNs) have been widely applied to sequence
modeling. In an RNN, the hidden states at the current step are fully connected
to those at the previous step, so the influence of less related features from
the previous step may decrease the model's learning ability. We propose a simple
technique called parallel cells (PCs) to enhance the learning ability of
Recurrent Neural Network (RNN). In each layer, we run multiple small RNN cells
rather than one single large cell. In this paper, we evaluate PCs on two
tasks. On the language modeling task on PTB (Penn Treebank), our model
outperforms state-of-the-art models, decreasing perplexity from 78.6 to 75.3.
On a Chinese-English translation task, our model improves the BLEU score by
0.39 points over the baseline model.
SCAN: Sliding Convolutional Attention Network for Scene Text Recognition
Scene text recognition has drawn great attention in the computer vision and
artificial intelligence community due to its challenges and wide applications.
State-of-the-art recurrent neural network (RNN) based models map an input
sequence to a variable-length output sequence, but they are usually applied in
a black-box manner and lack transparency for further improvement, and
maintaining the entire past hidden states prevents parallel computation within
a sequence. In this paper, we investigate the intrinsic characteristics of text
recognition, and inspired by human cognition mechanisms in reading texts, we
propose a scene text recognition method with sliding convolutional attention
network (SCAN). Similar to the eye movement during reading, the process of SCAN
can be viewed as an alternation between saccades and visual fixations. Compared
to the previous recurrent models, computations over all elements of SCAN can be
fully parallelized during training. Experimental results on several challenging
benchmarks, including the IIIT5k, SVT and ICDAR 2003/2013 datasets, demonstrate
the superiority of SCAN over state-of-the-art methods in terms of both the
model interpretability and performance.
An optimized system to solve text-based CAPTCHA
CAPTCHAs (Completely Automated Public Turing tests to tell Computers and
Humans Apart) can be used to protect data from automated bots. Countless kinds
of CAPTCHAs have been designed, and the text-based scheme is used most
frequently because of its convenience and user-friendliness
\cite{bursztein2011text}. Different types of CAPTCHAs require corresponding
segmentation methods to identify single characters, since there are numerous
ways to segment them. Our goal is to defeat the CAPTCHA, so the CAPTCHAs first
need to be split character by character. There is no universal segmentation
algorithm that obtains the divided characters in all kinds of examples, which
means the segmentation must be handled individually. In this paper, we build a
complete system to defeat CAPTCHAs and achieve state-of-the-art performance.
Specifically, we present a self-adaptive algorithm to segment different kinds
of characters optimally, and then utilize both existing methods and our own
convolutional neural network as an extra classifier. Results show that our
system works well towards defeating these CAPTCHAs.
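The paper's self-adaptive segmentation algorithm is not detailed in the abstract; a common baseline it would adapt from is vertical-projection segmentation, where the ink in each pixel column is summed and cuts are made where the projection drops to zero. The toy binary image below only illustrates that baseline idea.

```python
# Baseline character segmentation by vertical projection profile: cut the
# image at columns with no ink. Real CAPTCHAs with touching characters need
# the more elaborate, adaptive handling the paper describes.

def segment_columns(image):
    """image: list of rows of 0/1 pixels; returns (start, end) column spans."""
    width = len(image[0])
    projection = [sum(row[c] for row in image) for c in range(width)]
    spans, start = [], None
    for c, ink in enumerate(projection):
        if ink and start is None:          # entering a character
            start = c
        elif not ink and start is not None:  # leaving a character
            spans.append((start, c))
            start = None
    if start is not None:                  # character touches right edge
        spans.append((start, width))
    return spans

img = [
    [1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
]
print(segment_columns(img))  # [(0, 2), (3, 4), (6, 7)]
```

Each span would then be cropped and passed to the per-character classifier.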
Writer-Aware CNN for Parsimonious HMM-Based Offline Handwritten Chinese Text Recognition
Recently, the hybrid convolutional neural network hidden Markov model
(CNN-HMM) has been introduced for offline handwritten Chinese text recognition
(HCTR) and has achieved state-of-the-art performance. However, modeling each of
the large vocabulary of Chinese characters with a uniform and fixed number of
hidden states requires high memory and computational costs and makes the tens
of thousands of HMM state classes confusing. Another key issue of CNN-HMM for
HCTR is the diversified writing style, which leads to model strain and a
significant performance decline for specific writers. To address these issues,
we propose a writer-aware CNN based on parsimonious HMM (WCNN-PHMM). First,
PHMM is designed using a data-driven state-tying algorithm to greatly reduce
the total number of HMM states, which not only yields a compact CNN by state
sharing of the same or similar radicals among different Chinese characters but
also improves the recognition accuracy due to the more accurate modeling of
tied states and the lower confusion among them. Second, WCNN integrates each
convolutional layer with one adaptive layer fed by a writer-dependent vector,
namely, the writer code, to extract the irrelevant variability in writer
information and improve recognition performance. The parameters of
writer-adaptive layers are jointly optimized with other network parameters in
the training stage, while a multiple-pass decoding strategy is adopted to learn
the writer code and generate recognition results. Validated on the ICDAR 2013
competition of CASIA-HWDB database, the more compact WCNN-PHMM of a 7360-class
vocabulary can achieve a relative character error rate (CER) reduction of 16.6%
over the conventional CNN-HMM without considering language modeling. By
adopting a powerful hybrid language model (N-gram language model and recurrent
neural network language model), the CER of WCNN-PHMM is reduced to 3.17%.
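The writer-adaptation mechanism can be pictured as a learned projection of the compact writer code producing a per-writer offset added to a layer's activations. The additive form and the shapes below are illustrative assumptions, not the exact WCNN-PHMM formulation.

```python
# Sketch of writer adaptation: a projection matrix maps a writer-dependent
# code vector to an offset on a layer's activations. In WCNN-PHMM the code
# is learned at decoding time via multiple passes; here it is just given.

def adapt_activations(activations, writer_code, projection):
    """activations: length-d list; writer_code: length-c list;
    projection: d x c matrix mapping the writer code to an offset."""
    offset = [sum(projection[i][j] * writer_code[j]
                  for j in range(len(writer_code)))
              for i in range(len(activations))]
    return [a + o for a, o in zip(activations, offset)]

acts = [1.0, 2.0]
code = [0.5, -0.5]           # hypothetical writer code
proj = [[1.0, 0.0],
        [0.0, 2.0]]
print(adapt_activations(acts, code, proj))  # [1.5, 1.0]
```

Because the offset depends only on the short writer code, adapting to a new writer means estimating that code, not retraining the network.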
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech--two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a diverse variety of speech
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition
Scene text recognition has attracted a great deal of research due to its
importance to various applications. Existing methods mainly adopt recurrence
or convolution based networks. Though they have obtained good performance,
these methods still suffer from two limitations: slow training speed due to the
internal recurrence of RNNs, and high complexity due to stacked convolutional
layers for long-term feature extraction. This paper, for the first time,
proposes a no-recurrence sequence-to-sequence text recognizer, named NRTR, that
dispenses with recurrences and convolutions entirely. NRTR follows the
encoder-decoder paradigm, where the encoder uses stacked self-attention to
extract image features, and the decoder applies stacked self-attention to
recognize texts based on the encoder output. NRTR relies solely on the
self-attention mechanism and thus can be trained with more parallelization and
less complexity. Considering that scene images have large variations in text
and background, we further design a modality-transform block to effectively
transform 2D input images to 1D sequences, combined with the encoder to extract
more discriminative
features. NRTR achieves state-of-the-art or highly competitive performance on
both regular and irregular benchmarks, while requiring only a small fraction
of training time compared to the best model from the literature (at least 8
times faster).
Comment: 6 pages, 3 figures, 3 tables
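The essence of a modality-transform block, turning a 2D image feature map into a 1D sequence for a self-attention encoder, is folding the height dimension into the feature axis so one vector remains per width position. NRTR's actual block uses learned convolutions; the function below shows only the reshaping idea.

```python
# Sketch of 2D-to-1D modality transform: each column of the feature map
# becomes one element of the sequence, with height folded into features.

def to_sequence(feature_map):
    """feature_map[h][w] -> sequence of length W, one height-vector per w."""
    height = len(feature_map)
    width = len(feature_map[0])
    return [[feature_map[h][w] for h in range(height)] for w in range(width)]

fmap = [
    [0.1, 0.2, 0.3],  # H = 2, W = 3
    [0.4, 0.5, 0.6],
]
print(to_sequence(fmap))  # [[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]]
```

The resulting width-ordered sequence is what the stacked self-attention encoder consumes in place of an RNN's timesteps.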