6,298 research outputs found
Joint Energy-based Detection and Classification of Multilingual Text Lines
This paper proposes a new hierarchical MDL-based model for a joint detection
and classification of multilingual text lines in images taken by hand-held
cameras. The majority of related text detection methods assume alphabet-based
writing in a single language, e.g. in Latin. They use simple clustering
heuristics specific to such texts: proximity between letters within one line,
larger distance between separate lines, etc. We are interested in a
significantly more ambiguous problem where images combine alphabet and
logographic characters from multiple languages and typographic rules vary a lot
(e.g. English, Korean, and Chinese). The complexity of detecting and classifying
text lines in multiple languages calls for a more principled approach based on
information-theoretic principles. Our new MDL model includes data costs
combining geometric errors with classification likelihoods and a hierarchical
sparsity term based on label costs. This energy model can be efficiently
minimized by fusion moves. We demonstrate the robustness of the proposed
algorithm on a large new database of multilingual text images collected in the
public transit system of Seoul.
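The energy described above (per-element data costs plus a sparsity penalty on the number of models in use) can be sketched as follows. This is a minimal illustration of a label-cost energy, not the paper's actual formulation; the cost values and function names are hypothetical.

```python
def label_cost_energy(assignment, data_cost, label_cost):
    """assignment: list mapping element index -> label (text-line model).
    data_cost: dict (element, label) -> float, e.g. a geometric error
               combined with a classification negative log-likelihood.
    label_cost: penalty per distinct label in use (the sparsity term)."""
    data = sum(data_cost[(i, lab)] for i, lab in enumerate(assignment))
    used = len(set(assignment))  # each used model pays a fixed label cost
    return data + label_cost * used

# Toy example: two elements, two candidate text-line models (labels 0 and 1).
costs = {(0, 0): 0.1, (0, 1): 2.0, (1, 0): 0.2, (1, 1): 1.5}
e_one_line = label_cost_energy([0, 0], costs, label_cost=1.0)   # 0.1 + 0.2 + 1.0
e_two_lines = label_cost_energy([0, 1], costs, label_cost=1.0)  # 0.1 + 1.5 + 2.0
```

Here the label cost makes the single-line explanation cheaper, which is the effect the sparsity term is designed to have; in the paper such energies are minimized by fusion moves rather than enumeration.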
TextProposals: a Text-specific Selective Search Algorithm for Word Spotting in the Wild
Motivated by the success of powerful yet expensive techniques that recognize
words in a holistic way, object proposal techniques have emerged as an
alternative to traditional text detectors. In this paper we introduce a novel
object proposals method that is specifically designed for text. We rely on a
similarity-based region grouping algorithm that generates a hierarchy of word
hypotheses. Over the nodes of this hierarchy, a holistic word recognition
method can be applied efficiently.
Our experiments demonstrate that the presented method is superior in its
ability to produce good-quality word proposals when compared with
class-independent algorithms. We show impressive recall rates with a few
thousand proposals on different standard benchmarks, including focused and
incidental text datasets and multi-language scenarios. Moreover, the
combination of our object proposals with existing whole-word recognizers shows
competitive performance in end-to-end word spotting and, in some benchmarks,
outperforms previously published results. Concretely, on the challenging
ICDAR2015 Incidental Text dataset, we surpass the best-performing method from
the last ICDAR Robust Reading Competition by more than 10 percent in f-score.
Source code of the complete end-to-end system is available at
https://github.com/lluisgomez/TextProposal
Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network
We introduce a new top-down pipeline for scene text detection. We propose a
novel Cascaded Convolutional Text Network (CCTN) that joins two customized
convolutional networks for coarse-to-fine text localization. The CCTN quickly
detects rough text regions from a low-resolution image, and then accurately
localizes text lines within each enlarged region. We recast previous
character-based detection as direct text region estimation, avoiding multiple
bottom-up post-processing steps. The method exhibits surprising robustness and
discriminative power by treating the whole text region as the detection
object, which provides strong semantic information. We customize the
convolutional network by developing rectangle convolutions and multiple
in-network fusions. This enables it to handle multi-shape and multi-scale text
efficiently. Furthermore, the CCTN is computationally efficient through shared
convolutional computations, and its high-level representation allows it to be
invariant to various languages and multiple orientations. It achieves 0.84 and
0.86 F-measures on ICDAR 2011 and ICDAR 2013, delivering substantial
improvements over state-of-the-art results [23, 1].
Stroke extraction for offline handwritten mathematical expression recognition
Offline handwritten mathematical expression recognition is often considered
much harder than its online counterpart due to the absence of temporal
information. In order to take advantage of the more mature methods for online
recognition and save resources, an oversegmentation approach is proposed to
recover strokes from textual bitmap images automatically. The proposed
algorithm first breaks down the skeleton of a binarized image into junctions
and segments, then segments are merged to form strokes, finally stroke order is
normalized by using recursive projection and topological sort. Good offline
accuracy was obtained in combination with ordinary online recognizers, which
are not specially designed for extracted strokes. Given a ready-made
state-of-the-art online handwritten mathematical expression recognizer, the
proposed procedure correctly recognized 58.22%, 65.65%, and 65.22% of the
offline formulas rendered from the datasets of the Competitions on Recognition
of Online Handwritten Mathematical Expressions (CROHME) in 2014, 2016, and 2019
respectively. Furthermore, given a trainable online recognition system,
retraining it with extracted strokes resulted in an offline recognizer with the
same level of accuracy. On the other hand, the speed of the entire pipeline was
fast enough to facilitate on-device recognition on mobile phones with limited
resources. To conclude, stroke extraction provides an attractive way to build
optical character recognition software.
Comment: 22 pages, 7 figures
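The final step of the pipeline above, normalizing stroke order, relies in part on a topological sort over precedence constraints between recovered strokes. A minimal sketch of that step (the constraints here are hypothetical; the paper derives them via recursive projection):

```python
from collections import deque

def order_strokes(strokes, before):
    """strokes: list of stroke ids; before: list of (a, b) pairs meaning
    stroke a should be written before stroke b. Returns the strokes in an
    order respecting all constraints (Kahn's algorithm)."""
    indeg = {s: 0 for s in strokes}
    succ = {s: [] for s in strokes}
    for a, b in before:
        succ[a].append(b)
        indeg[b] += 1
    # start from strokes with no predecessors; FIFO keeps input order for ties
    queue = deque(s for s in strokes if indeg[s] == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in succ[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)
    return order

# e.g. writing the horizontal bar of a '+' before its vertical bar:
print(order_strokes(["v", "h"], [("h", "v")]))  # ['h', 'v']
```

The resulting sequence can then be fed to an ordinary online recognizer, which is the point of the stroke-extraction approach.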
A Fast Hierarchical Method for Multi-script and Arbitrary Oriented Scene Text Extraction
Typography and layout lead to the hierarchical organisation of text into words,
text lines, and paragraphs. This inherent structure is a key property of text
in any script and language, yet it has been minimally leveraged by existing
text detection methods. This paper addresses the problem of text segmentation
in natural scenes from a hierarchical perspective. Contrary to existing
methods, we make explicit use of text structure, aiming directly at the
detection of region groupings corresponding to text within a hierarchy
produced by an agglomerative similarity clustering process over individual
regions. We propose an optimal way to construct such a hierarchy, introducing
a feature space designed to produce text group hypotheses with high recall,
and a novel stopping rule combining a discriminative classifier and a
probabilistic measure of group meaningfulness based on perceptual
organization. Results obtained over four standard datasets, covering text in
variable orientations and different languages, demonstrate that our algorithm,
while trained on a single mixed dataset, outperforms state-of-the-art methods
in unconstrained scenarios.
Comment: Manuscript Preprint. 11 pages. This work has been submitted to the
IEEE for possible publication. Copyright may be transferred without notice,
after which this version may no longer be accessible.
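The agglomerative clustering process described above can be sketched in miniature: regions (reduced here to hypothetical 1-D feature values, standing in for the paper's richer feature space) are merged greedily by similarity, and every merge yields one node of the hierarchy that a classifier or stopping rule could then inspect.

```python
def agglomerate(features):
    """Greedy single-center agglomeration over 1-D features; returns the
    list of merged groups in the order they are created (the hierarchy)."""
    clusters = [[i] for i in range(len(features))]
    centers = list(features)
    hierarchy = []
    while len(clusters) > 1:
        # find the most similar (closest) pair of current clusters
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: abs(centers[ab[0]] - centers[ab[1]]),
        )
        merged = sorted(clusters[i] + clusters[j])
        hierarchy.append(merged)
        size_i, size_j = len(clusters[i]), len(clusters[j])
        center = (centers[i] * size_i + centers[j] * size_j) / (size_i + size_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        centers = [c for k, c in enumerate(centers) if k not in (i, j)] + [center]
    return hierarchy

# Regions at 0.0, 0.1, 5.0: the two similar regions group first.
print(agglomerate([0.0, 0.1, 5.0]))  # [[0, 1], [0, 1, 2]]
```

In the paper, the nodes of such a hierarchy are the text group hypotheses over which the stopping rule decides which groupings correspond to actual text.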
Reading Text in the Wild with Convolutional Neural Networks
In this work we present an end-to-end system for text spotting -- localising
and recognising text in natural scene images -- and text-based image retrieval.
This system is based on a region proposal mechanism for detection and deep
convolutional neural networks for recognition. Our pipeline uses a novel
combination of complementary proposal generation techniques to ensure high
recall, and a fast subsequent filtering stage for improving precision. For the
recognition and ranking of proposals, we train very large convolutional neural
networks to perform word recognition on the whole proposal region at the same
time, departing from the character classifier based systems of the past. These
networks are trained solely on data produced by a synthetic text generation
engine, requiring no human labelled data.
Analysing the stages of our pipeline, we show state-of-the-art performance
throughout. We perform rigorous experiments across a number of standard
end-to-end text spotting benchmarks and text-based image retrieval datasets,
showing a large improvement over all previous methods. Finally, we demonstrate
a real-world application of our text spotting system to allow thousands of
hours of news footage to be instantly searchable via a text query.
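The "generate many proposals, then filter" pattern above is commonly realized with IoU-based non-maximum suppression over the scored regions. The sketch below illustrates that idea; NMS here is our assumption for the filtering stage, not necessarily the paper's exact filter, and all values are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box of every group of overlapping proposals."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < thresh for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2] -- the near-duplicate box is dropped
```

After suppression, each surviving proposal region would be passed whole to the word-recognition network, as the abstract describes.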
Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection
Detecting incidental scene text is a challenging task because of
multi-orientation, perspective distortion, and variation of text size, color
and scale. Previous research has focused only on using rectangular bounding
boxes or horizontal sliding windows to localize text, which may result in
redundant background noise, unnecessary overlap or even information loss. To
address these issues, we propose a new Convolutional Neural Network (CNN)
based method, named Deep Matching Prior Network (DMPNet), to detect text with
tighter quadrangles. First, we use quadrilateral sliding windows in several
specific intermediate convolutional layers to roughly recall text with a
higher overlapping area, and then a shared Monte-Carlo method is proposed for
fast and accurate computation of the polygonal areas. After that, we design a
sequential protocol for relative regression which can exactly predict text
with a compact quadrangle. Moreover, an auxiliary smooth Ln loss is also
proposed for further regressing the position of text, which has better overall
performance than the L2 loss and smooth L1 loss in terms of robustness and
stability. The effectiveness of our approach is evaluated on a public
word-level, multi-oriented scene text database, ICDAR 2015 Robust Reading
Competition Challenge 4, "Incidental scene text localization". The performance
of our method, measured by F-measure, is 70.64%, outperforming the existing
state-of-the-art method with an F-measure of 63.76%.
Comment: 8 Pages, 7 figures. Accepted to appear in CVPR 201
End-to-End Text Recognition with Hybrid HMM Maxout Models
The problem of detecting and recognizing text in natural scenes has proved to
be more challenging than its counterpart in documents, with most of the
previous work focusing on a single part of the problem. In this work, we
propose new solutions to the character and word recognition problems and then
show how to combine these solutions in an end-to-end text-recognition system.
We do so by leveraging the recently introduced Maxout networks along with
hybrid HMM models that have proven useful for voice recognition. Using these
elements, we build a tunable and highly accurate recognition system that beats
state-of-the-art results on all the sub-problems for both the ICDAR 2003 and
SVT benchmark datasets.
Comment: 9 pages, 7 figures
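The maxout unit the system above builds on computes its activation as the maximum over several learned linear pieces. A minimal sketch (the weights here are illustrative, not trained):

```python
def maxout(x, pieces):
    """x: input vector; pieces: list of (weights, bias) linear pieces.
    Returns max_i (w_i . x + b_i), a convex piecewise-linear activation."""
    return max(sum(w * xi for w, xi in zip(ws, x)) + b for ws, b in pieces)

# Two pieces implementing |x| for a 1-D input: max(x, -x).
pieces = [([1.0], 0.0), ([-1.0], 0.0)]
print(maxout([-3.0], pieces))  # 3.0
```

Because the unit learns its own piecewise-linear activation, maxout networks pair naturally with dropout training, which is part of why they transferred well from voice recognition to the character and word recognition problems here.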
Overlay Text Extraction From TV News Broadcast
The text data present in overlaid bands convey brief descriptions of news
events in broadcast videos. The process of text extraction becomes challenging
as overlay text is presented in widely varying formats and often with
animation effects. We note that existing edge-density-based methods are well
suited to our application on account of their simplicity and speed of
operation. However, these methods are sensitive to thresholds and have high
false positive rates. In this paper, we present a contrast-enhancement-based
preprocessing stage for overlay text detection and a parameter-free
edge-density-based scheme for efficient text band detection. The second
contribution of this paper is a novel approach for multiple text region
tracking with a formal identification of all possible detection failure cases.
The tracking stage enables us to establish the temporal presence of text bands
and their linking over time. The third contribution is the adoption of
Tesseract OCR for the specific task of overlay text recognition using web news
articles. The proposed approach is tested and found superior on news videos
acquired from three Indian English television news channels, along with
benchmark datasets.
Comment: Published in INDICON 201
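The edge-density idea above can be sketched as follows: compute the edge density of each row of a binary edge map and keep maximal runs of rows whose density exceeds the global mean. Thresholding at the mean is our illustrative stand-in for the paper's parameter-free scheme.

```python
def text_bands(edge_map):
    """edge_map: 2-D list of 0/1 edge pixels. Returns (start_row, end_row)
    pairs for horizontal bands of above-average edge density."""
    densities = [sum(row) / len(row) for row in edge_map]
    mean = sum(densities) / len(densities)
    bands, start = [], None
    for i, d in enumerate(densities + [0.0]):  # sentinel closes a trailing band
        if d > mean and start is None:
            start = i
        elif d <= mean and start is not None:
            bands.append((start, i - 1))
            start = None
    return bands

edges = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],  # dense rows -> candidate overlay text band
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(text_bands(edges))  # [(1, 2)]
```

Overlay text produces tight clusters of strong edges, so its rows stand out sharply against background video content; the tracking and OCR stages then operate on the detected bands.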
Scene Text Detection via Holistic, Multi-Channel Prediction
Recently, scene text detection has become an active research topic in
computer vision and document analysis, owing to its great importance and
significant challenge. However, the vast majority of existing methods detect
text within local regions, typically by extracting character-, word- or
line-level candidates followed by candidate aggregation and false positive
elimination, which potentially excludes the effect of wide-scope and
long-range contextual cues in the scene. To take full advantage of the rich
information available in the whole natural image, we propose to localize text
in a holistic manner, by casting scene text detection as a semantic
segmentation problem. The proposed algorithm directly runs on full images and
produces global, pixel-wise prediction maps, in which detections are
subsequently formed. To better make use of the properties of text, three types
of information regarding text regions, individual characters and their
relationships are estimated with a single Fully Convolutional Network (FCN)
model. With such predictions of text properties, the proposed algorithm can
simultaneously handle horizontal, multi-oriented and curved text in real-world
natural images. Experiments on standard benchmarks, including ICDAR 2013,
ICDAR 2015 and MSRA-TD500, demonstrate that the proposed algorithm
substantially outperforms previous state-of-the-art approaches. Moreover, we
report the first baseline result on the recently released, large-scale dataset
COCO-Text.
Comment: 10 pages, 9 figures, 5 tables
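One common way to form detections from a pixel-wise prediction map, as described above, is to threshold the map and extract connected components, each component becoming one text-region candidate. A minimal sketch (4-connectivity, plain BFS; the threshold and map values are illustrative, not the paper's):

```python
from collections import deque

def components(prob_map, thresh=0.5):
    """prob_map: 2-D list of per-pixel text probabilities. Returns the list
    of connected components (as sorted pixel lists) above the threshold."""
    h, w = len(prob_map), len(prob_map[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if prob_map[r][c] > thresh and not seen[r][c]:
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:  # BFS flood fill over 4-neighbours
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                                and prob_map[ny][nx] > thresh):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(sorted(comp))
    return comps

pmap = [
    [0.9, 0.8, 0.1],
    [0.1, 0.1, 0.9],
]
print(components(pmap))  # [[(0, 0), (0, 1)], [(1, 2)]]
```

In the paper's setting, the extra channels (character and relationship predictions) would further refine how components are grouped into final text lines.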