A Cost Efficient Approach to Correct OCR Errors in Large Document Collections
The word error rate of an OCR system is often higher than its character error
rate, especially when recognition is performed character by character. High word
accuracies are critical to tasks like the creation of content in digital
libraries and text-to-speech applications. In order to detect and correct the
misrecognised words, it is common for an OCR module to employ a post-processor
to further improve the word accuracy. However, conventional approaches to
post-processing, such as looking up a dictionary or using a statistical language
model (SLM), are still limited. In many such scenarios, it is often necessary to
remove the outstanding errors manually. We observe that the traditional
post-processing schemes look at error words sequentially, since OCR systems process
documents one at a time. We propose a cost-efficient model to address the error
words in batches rather than correcting them individually. We exploit the fact
that a collection of documents, unlike a single document, has a structure
leading to repetition of words. Such words, if efficiently grouped together and
corrected as a whole, can lead to a significant reduction in cost.
Correction can be fully automatic or with a human in the loop. Towards this, we
employ a novel clustering scheme to obtain fairly homogeneous clusters. We
compare the performance of our model with various baseline approaches including
the case where all the errors are removed by a human. We demonstrate the
efficacy of our solution empirically, reporting a more than 70% reduction in
human effort with near-perfect error correction. We validate our method on
books in multiple languages.
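To make the batching idea concrete, here is a minimal Python sketch, assuming a
greedy single-pass clustering over string similarity; the threshold and the
SequenceMatcher measure are illustrative placeholders, not the paper's novel
clustering scheme.

    # Sketch: group misrecognized words so one correction fixes a whole cluster.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()

    def cluster_error_words(error_words, threshold=0.8):
        clusters = []  # list of (representative, members) pairs
        for word in error_words:
            for rep, members in clusters:
                if similarity(word, rep) >= threshold:
                    members.append(word)
                    break
            else:
                clusters.append((word, [word]))
        return clusters

    # One correction (automatic or human-supplied) is then applied to
    # every member of a cluster at once.
    for rep, members in cluster_error_words(
            ["recieve", "recieved", "recieves", "teh", "tehm"]):
        print(f"correct once for {len(members)} occurrences: {members}")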
Measuring Human Perception to Improve Handwritten Document Transcription
The subtleties of human perception, as measured by vision scientists through
the use of psychophysics, are important clues to the internal workings of
visual recognition. For instance, measured reaction time can indicate whether a
visual stimulus is easy for a subject to recognize, or whether it is hard. In
this paper, we consider how to incorporate psychophysical measurements of
visual perception into the loss function of a deep neural network being trained
for a recognition task, under the assumption that such information can enforce
consistency with human behavior. As a case study to assess the viability of
this approach, we look at the problem of handwritten document transcription.
While good progress has been made towards automatically transcribing modern
handwriting, significant challenges remain in transcribing historical
documents. Here we describe a general enhancement strategy, underpinned by the
new loss formulation, which can be applied to the training regime of any deep
learning-based document transcription system. Through experimentation, reliable
performance improvement is demonstrated for the standard IAM and RIMES datasets
for three different network architectures. Further, we go on to show
feasibility for our approach on a new dataset of digitized Latin manuscripts,
originally produced by scribes in the Cloister of St. Gall in the 9th
century.
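As a hedged illustration of the loss formulation, the sketch below weights a
per-sample cross-entropy by measured reaction times, so that samples humans
find hard contribute differently to training; the specific weighting
(1 - 0.5 * rt) is an invented stand-in, not the paper's formulation.

    # Sketch: reaction-time-weighted recognition loss (assumed weighting).
    import torch
    import torch.nn.functional as F

    def psychophysical_loss(logits, targets, reaction_times):
        """logits: (N, C); targets: (N,); reaction_times: (N,) in seconds."""
        per_sample = F.cross_entropy(logits, targets, reduction="none")
        # Normalize reaction times to [0, 1]: fast (easy) items keep weight
        # near 1, slow (hard) items are down-weighted.
        rt = (reaction_times - reaction_times.min()) / (
            reaction_times.max() - reaction_times.min() + 1e-8)
        weights = 1.0 - 0.5 * rt  # hypothetical weighting scheme
        return (weights * per_sample).mean()

    logits = torch.randn(4, 10)
    targets = torch.tensor([1, 2, 3, 4])
    rts = torch.tensor([0.4, 0.9, 1.5, 2.1])
    print(psychophysical_loss(logits, targets, rts))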
FontCode: Embedding Information in Text Documents using Glyph Perturbation
We introduce FontCode, an information embedding technique for text documents.
Provided a text document with specific fonts, our method embeds user-specified
information in the text by perturbing the glyphs of text characters while
preserving the text content. We devise an algorithm that chooses unobtrusive yet
machine-recognizable glyph perturbations, leveraging a recently developed
generative model that alters the glyphs of each character continuously on a
font manifold. We then introduce an algorithm that embeds a user-provided
message in the text document and produces an encoded document whose appearance
is minimally perturbed from the original document. We also present a glyph
recognition method that recovers the embedded information from an encoded
document stored as a vector graphic or pixel image, or even on printed paper.
In addition, we introduce a new error-correction coding scheme that rectifies a
certain number of recognition errors. Lastly, we demonstrate that our technique
enables a wide array of applications, using it as a text document metadata
holder, an unobtrusive optical barcode, a cryptographic message embedding
scheme, and a text document signature.
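A toy sketch of the coding layer alone: if each character can take one of K
machine-recognizable glyph variants, it carries log2(K) message bits. The value
K = 4 and the bit packing below are illustrative assumptions; the font-manifold
perturbation, rendering and glyph recognition steps are stubbed out.

    # Sketch: pack 2 bits of a message into each character's glyph choice.
    K = 4  # hypothetical number of distinguishable perturbations per character

    def embed(text, bits):
        # Pad the message with zeros so every character gets 2 bits.
        it = iter(bits + [0] * (2 * len(text)))
        return [next(it) * 2 + next(it) for _ in text]

    def extract(indices):
        bits = []
        for idx in indices:
            bits += [idx // 2, idx % 2]
        return bits

    msg = [1, 0, 1, 1, 0, 1]
    indices = embed("hello", msg)  # one perturbation index per character
    assert extract(indices)[:len(msg)] == msg
    print(indices)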
A Study of Sindhi Related and Arabic Script Adapted languages Recognition
A large number of publications are available on Optical Character
Recognition (OCR). Significant research, as well as many articles, exists for
the Latin, Chinese and Japanese scripts, and Arabic script is also mature from
an OCR perspective. However, the languages that have adopted Arabic script or
its extended characters still lack OCR systems. In this paper we present the
efforts of researchers on Arabic and its related and adapted languages. The
survey is organized into sections: an introduction is followed by the
properties of the Sindhi language, then the OCR processing techniques and
methods used by various researchers, and a final section dedicated to future
work and conclusions.
A review on handwritten character and numeral recognition for Roman, Arabic, Chinese and Indian scripts
Handwritten character recognition (HCR) has been the subject of intensive
research for almost four decades, much of it on popular scripts such as Roman,
Arabic, Chinese and Indian. In this paper we present a review of HCR work on
these four scripts. We summarize most of the papers published from 2005
onwards and analyze the various methods for building a robust HCR system. We
also suggest some future directions for HCR research.
Exploring the Daschle Collection using Text Mining
A U.S. Senator from South Dakota donated documents that were accumulated
during his service as a House representative and Senator, to be housed at the
Bridges library at South Dakota State University. This project investigated the
utility of quantitative statistical methods to explore some portions of this
vast document collection. The available scanned documents and emails from
constituents are analyzed using natural language processing methods including
the Latent Dirichlet Allocation (LDA) model. This model identified major topics
being discussed in a given collection of documents. Important events and
popular issues from Senator Daschle's career are reflected in the changing
topics from the model. These quantitative statistical methods provide a summary
of the massive amount of text without requiring significant human effort or
time, and can be applied to similar collections.
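For readers unfamiliar with the technique, the following scikit-learn snippet
shows LDA in miniature; the documents, vectorizer settings and topic count are
toy stand-ins, not the project's actual pipeline.

    # Sketch: fit a 2-topic LDA model on a toy document collection.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "farm subsidies and agriculture policy in South Dakota",
        "health care reform debated in the senate",
        "constituent letters about farm prices and crop insurance",
    ]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]
        print(f"topic {k}: {', '.join(top)}")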
Similarity-based Text Recognition by Deeply Supervised Siamese Network
In this paper, we propose a new text recognition model based on measuring the
visual similarity of text and predicting the content of unlabeled texts. First,
a Siamese convolutional network is trained with deep supervision on a labeled
training dataset. This network projects texts into a similarity manifold. The
Deeply Supervised Siamese network learns visual similarity of texts. Then a
K-nearest neighbor classifier is used to predict unlabeled text based on
similarity distance to labeled texts. The performance of the model is evaluated
on three datasets combining machine-printed and handwritten text. We
demonstrate that the model reduces the cost of human estimation by .
The error of the system is less than . Thanks to deep supervision, the
proposed model outperforms a conventional Siamese network by matching visually
similar text, whether barely readable or readable, machine-printed or
handwritten. The results also demonstrate that the predicted labels are
sometimes better than the human labels, e.g. through spelling correction.
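The predict-by-similarity step can be sketched as follows, with an untrained
toy branch standing in for the deeply supervised Siamese network; the
architecture and its auxiliary losses are assumptions, and only the
embed-then-kNN mechanism is illustrated.

    # Sketch: embed word images, then label queries by k-nearest neighbours.
    import torch
    import torch.nn as nn

    class Branch(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(16 * 16, dim))

        def forward(self, x):
            return nn.functional.normalize(self.net(x), dim=1)

    branch = Branch().eval()
    with torch.no_grad():
        labeled = branch(torch.randn(20, 1, 32, 100))  # 20 labeled word images
        labels = torch.randint(0, 5, (20,))            # their (dummy) labels
        query = branch(torch.randn(3, 1, 32, 100))     # unlabeled word images
        dist = torch.cdist(query, labeled)             # pairwise distances
        knn = dist.topk(3, largest=False).indices      # 3 nearest neighbours
        pred = torch.mode(labels[knn], dim=1).values   # majority vote
    print(pred)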
Telugu OCR Framework using Deep Learning
In this paper, we address the task of Optical Character Recognition (OCR) for
the Telugu script. We present an end-to-end framework that segments the text
image, classifies the characters and extracts lines using a language model. The
segmentation is based on mathematical morphology. The classification module,
which is the most challenging task of the three, is a deep convolutional neural
network. The language is modelled as a third-order Markov chain at the glyph
level. Telugu script is a complex alphasyllabary and the language is
agglutinative, making the problem hard. In this paper we apply the latest
advances in neural networks to achieve state-of-the-art error rates. We also
review convolutional neural networks in great detail and expound the
statistical justification behind the many tricks needed to make Deep Learning
work.
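A minimal version of the glyph-level language model might look like the
following order-3 Markov chain with additive smoothing; the corpus and the
smoothing constant are placeholders rather than details from the paper.

    # Sketch: order-3 Markov chain over glyphs (next glyph conditioned on
    # the previous three), with additive smoothing for unseen histories.
    from collections import Counter, defaultdict

    def train(sequences):
        counts = defaultdict(Counter)
        for seq in sequences:
            padded = ["<s>"] * 3 + list(seq)
            for i in range(3, len(padded)):
                counts[tuple(padded[i - 3:i])][padded[i]] += 1
        return counts

    def prob(counts, history, glyph, vocab_size, alpha=1.0):
        c = counts[history]
        return (c[glyph] + alpha) / (sum(c.values()) + alpha * vocab_size)

    counts = train(["abcabd", "abcabc", "bcabca"])
    print(prob(counts, ("a", "b", "c"), "a", vocab_size=4))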
An open diachronic corpus of historical Spanish: annotation criteria and automatic modernisation of spelling
The IMPACT-es diachronic corpus of historical Spanish compiles over one
hundred books (containing approximately 8 million words) in addition to a
complementary lexicon which links more than 10 thousand lemmas with
attestations of the different variants found in the documents. This textual
corpus and the accompanying lexicon have been released under an open license
(Creative Commons by-nc-sa) in order to permit their intensive exploitation in
linguistic research. Approximately 7% of the words in the corpus (a selection
aimed at enhancing the coverage of the most frequent word forms) have been
annotated with their lemma, part of speech, and modern equivalent. This paper
describes the annotation criteria followed and the standards, based on the Text
Encoding Initiative recommendations, used to represent the texts in digital
form. As an illustration of the possible synergies between diachronic textual
resources and linguistic research, we describe the application of statistical
machine translation techniques to infer probabilistic context-sensitive rules
for the automatic modernisation of spelling. The automatic modernisation with
this type of statistical methods leads to very low character error rates when
the output is compared with the supervised modern version of the text.
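To give a flavour of such context-sensitive rules, here is a toy sketch; the
rules and probabilities below are invented illustrations, not entries inferred
from the IMPACT-es data.

    # Sketch: probabilistic rewrite rules for spelling modernisation.
    import re

    # Hypothetical (pattern, replacement, probability) rules; the paper
    # infers such rules from aligned historical/modern word pairs.
    RULES = [
        (r"ç", "z", 0.95),   # e.g. "cabeça" -> "cabeza"
        (r"ph", "f", 0.90),  # e.g. "philosophia" -> "filosofia"
        (r"ss", "s", 0.80),
    ]

    def modernise(word, threshold=0.5):
        # Apply every rule whose probability clears the threshold.
        for pattern, repl, p in RULES:
            if p >= threshold:
                word = re.sub(pattern, repl, word)
        return word

    print(modernise("cabeça"), modernise("philosophia"))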
Sentence Correction Based on Large-scale Language Modelling
With the continued development of informatization, more and more data is stored
in the form of text, and some of that text is lost during generation and
transmission. This paper aims to establish a language model based on a
large-scale corpus to restore such missing text. Specifically,
we introduce a novel measurement to find the missing words, and a way of
establishing a comprehensive candidate lexicon to insert the correct choice of
words. The paper also introduces some effective optimization methods, which
largely improve the efficiency of the text restoration, reducing the time
needed to process 1,000 sentences to 3.6 seconds.
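A minimal sketch of the restoration step: try each insertion point and
candidate word, score the result with a language model, and keep the best. A
tiny bigram count model and a hand-picked candidate lexicon stand in here for
the paper's large-scale corpus model.

    # Sketch: restore a missing word by maximizing a bigram LM score.
    from collections import Counter
    from itertools import pairwise  # Python 3.10+

    def score(words, counts):
        return sum(counts[b] for b in pairwise(words))

    counts = Counter(pairwise("the cat sat on the mat".split()))
    candidates = ["cat", "mat", "on"]  # placeholder candidate lexicon

    broken = "the sat on the mat".split()  # "cat" is missing
    i, w = max(
        ((i, w) for i in range(len(broken) + 1) for w in candidates),
        key=lambda iw: score(broken[:iw[0]] + [iw[1]] + broken[iw[0]:], counts),
    )
    print(" ".join(broken[:i] + [w] + broken[i:]))  # -> the cat sat on the mat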