OCR Error Correction Using Character Correction and Feature-Based Word Classification
This paper explores the use of a learned classifier for post-OCR text
correction. Experiments with the Arabic language show that this approach, which
integrates a weighted confusion matrix and a shallow language model, corrects
the vast majority of segmentation and recognition errors, the most frequent
types of error in our dataset.
Comment: Proceedings of the 12th IAPR International Workshop on Document Analysis Systems (DAS2016), Santorini, Greece, April 11-14, 2016
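The combination the abstract describes, character-level confusion weights scored together with a shallow language model, can be sketched roughly as follows. The confusion pairs, probabilities, and unigram frequencies below are all illustrative stand-ins, not values from the paper:

```python
# Illustrative sketch of confusion-matrix-based candidate scoring.
# CONFUSIONS maps (recognized, intended) character sequences to assumed
# error probabilities; the "shallow language model" is stood in for by
# simple word unigram frequencies.
CONFUSIONS = {("rn", "m"): 0.4, ("l", "1"): 0.2, ("c", "e"): 0.1}
UNIGRAM = {"modern": 0.7, "modem": 0.3}

def candidates(word):
    """Generate corrections by applying known confusions to the word."""
    out = {word: 1.0}  # keeping the word unchanged has weight 1.0
    for (seen, true), p in CONFUSIONS.items():
        if seen in word:
            out[word.replace(seen, true, 1)] = p
    return out

def correct(word):
    """Pick the candidate maximizing confusion weight x unigram probability."""
    scored = {c: p * UNIGRAM.get(c, 1e-6) for c, p in candidates(word).items()}
    return max(scored, key=scored.get)
```

With these toy tables, `correct("rnodern")` recovers `"modern"`, while a word the language model already likes is left alone.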
OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion
With the advent of digital optical scanners, a lot of paper-based books,
textbooks, magazines, articles, and documents are being transformed into an
electronic version that can be manipulated by a computer. For this purpose,
OCR, short for Optical Character Recognition, was developed to translate scanned
graphical text into editable computer text. Unfortunately, OCR is still
imperfect, as it occasionally mis-recognizes letters and falsely identifies
scanned text, leading to misspellings and linguistic errors in the OCR output
text. This paper proposes a post-processing context-based error correction
algorithm for detecting and correcting OCR non-word and real-word errors. The
proposed algorithm is based on Google's online spelling suggestion, which
harnesses an internal database containing a huge collection of terms and word
sequences gathered from all over the web, making it well suited to suggesting
possible replacements for words that have been misspelled during the OCR process.
Experiments carried out revealed a significant improvement in OCR error
correction rate. Future research can extend the proposed algorithm so that it
can be parallelized and executed on multiprocessing platforms.
Comment: LACSC - Lebanese Association for Computational Sciences, http://www.lacsc.org/; Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 1, January 201
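The overall post-processing loop might look like the following sketch, where `suggest()` stands in for the online spelling-suggestion service (the paper uses Google's, which is mocked here with a tiny lookup table since the real backend is not reproducible offline; all names and data are illustrative):

```python
# Hypothetical stand-ins for the suggestion service and the word list.
MOCK_SUGGESTIONS = {"tbe": ["the"], "scaned": ["scanned"]}
DICTIONARY = {"the", "scanned", "page", "was"}

def suggest(word):
    """Mock of an online spelling-suggestion lookup."""
    return MOCK_SUGGESTIONS.get(word, [])

def correct_text(text):
    """Replace non-word errors (tokens outside the dictionary) with the
    top suggestion, when one is available."""
    out = []
    for tok in text.split():
        if tok not in DICTIONARY and suggest(tok):
            tok = suggest(tok)[0]
        out.append(tok)
    return " ".join(out)
```

Real-word errors (valid words that are wrong in context) would additionally need the surrounding word sequence to be scored, which is where the web-scale word-sequence database comes in.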
A Complete Workflow for Development of Bangla OCR
Developing a Bangla OCR requires a number of algorithms and methods. Many
efforts have gone into developing a Bangla OCR, but all of them have failed to
provide an error-free system; each has its shortcomings. We discuss the problem
scope of the currently existing Bangla OCRs. In this paper, we present the
basic steps required for developing a Bangla OCR and a complete workflow for
its development, mentioning all the algorithms required
NAT: Noise-Aware Training for Robust Neural Sequence Labeling
Sequence labeling systems should perform reliably not only under ideal
conditions but also with corrupted inputs - as these systems often process
user-generated text or follow an error-prone upstream component. To this end,
we formulate the noisy sequence labeling problem, where the input may undergo
an unknown noising process, and propose two Noise-Aware Training (NAT)
objectives that improve the robustness of sequence labeling performed on
perturbed input. Our data augmentation method trains a neural model using a mixture of
clean and noisy samples, whereas our stability training algorithm encourages
the model to create a noise-invariant latent representation. We employ a
vanilla noise model at training time. For evaluation, we use both the original
data and its variants perturbed with real OCR errors and misspellings.
Extensive experiments on English and German named entity recognition benchmarks
confirmed that NAT consistently improved the robustness of popular sequence
labeling models while preserving accuracy on the original input. We make our
code and data publicly available for the research community.
Comment: Accepted to appear at ACL 2020
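The data-augmentation objective can be illustrated with a simple character-substitution noise model; the noise function and its parameters below are stand-ins for the paper's vanilla noise model, not its actual implementation:

```python
import random

def add_noise(tokens, p=0.1, rng=None):
    """Corrupt each token's characters independently with probability p
    by substituting a random lowercase letter (a toy noise model)."""
    rng = rng or random.Random(0)
    noisy = []
    for tok in tokens:
        chars = [rng.choice("abcdefghijklmnopqrstuvwxyz")
                 if rng.random() < p else ch
                 for ch in tok]
        noisy.append("".join(chars))
    return noisy

def augment(batch, noise_fraction=0.5, rng=None):
    """Return a mixed batch of clean and noised examples, mirroring the
    clean/noisy mixture used by the data-augmentation objective."""
    rng = rng or random.Random(1)
    return [add_noise(x) if rng.random() < noise_fraction else x
            for x in batch]
```

The stability-training objective would instead feed both the clean and the noised version of each sentence through the model and penalize divergence between their latent representations.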
Lipi Gnani - A Versatile OCR for Documents in any Language Printed in Kannada Script
A Kannada OCR, named Lipi Gnani, has been designed and developed from
scratch, with the motivation of being able to convert printed text or poetry
in Kannada script without any restriction on vocabulary. The training and test
sets have been collected from over 35 books published between 1970 and 2002,
including books written in Halegannada and pages containing
Sanskrit slokas written in Kannada script. The coverage of the OCR is nearly
complete in the sense that it recognizes all the punctuation marks, special
symbols, Indo-Arabic and Kannada numerals and also the interspersed English
words. Several minor and major original contributions have been made in
developing this OCR at the different processing stages, such as binarization,
line and character segmentation, recognition, and Unicode mapping. The result
is a Kannada OCR that performs as well as, and in some cases better than,
Google's Tesseract OCR, as shown by the results. To the knowledge of the
authors, this is the maiden report of a complete Kannada OCR, handling all the
issues involved. Currently, there is no dictionary-based post-processing, and
the obtained results are due solely to the recognition process. Four benchmark
test databases containing scanned pages from books in Kannada, Sanskrit,
Konkani and Tulu languages, but all of them printed in Kannada script, have
been created. The word level recognition accuracy of Lipi Gnani is 4% higher on
the Kannada dataset than that of Google's Tesseract OCR, 8% higher on the
datasets of Tulu and Sanskrit, and 25% higher on the Konkani dataset.
Comment: 21 pages, 16 figures, 12 tables, submitted to ACM Transactions on Asian and Low-Resource Language Information Processing
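Binarization is the first of the processing stages listed above; Otsu's method is one common choice for it (a sketch only, since the abstract does not say which binarization Lipi Gnani actually uses). It picks the threshold that maximizes between-class variance of the grayscale histogram:

```python
def otsu_threshold(pixels):
    """Otsu's global threshold over a flat list of grayscale values 0-255.
    Returns the threshold t maximizing between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w_bg, sum_bg = 0, -1.0, 0, 0.0
    for t in range(256):
        w_bg += hist[t]          # background weight grows with t
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # remaining pixels are foreground
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

For a cleanly bimodal page (e.g. dark glyphs on light paper) the threshold falls between the two modes, after which line and character segmentation operate on the binary image.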
Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms
We report the findings of a month-long online competition in which
participants developed algorithms for augmenting the digital version of patent
documents published by the United States Patent and Trademark Office (USPTO).
The goal was to detect figures and part labels in U.S. patent drawing pages.
The challenge drew 232 teams of two, of which 70 teams (30%) submitted
solutions. Collectively, teams submitted 1,797 solutions that were compiled on
the competition servers. Participants reported spending an average of 63 hours
developing their solutions, resulting in a total of 5,591 hours of development
time. A manually labeled dataset of 306 patents was used for training, online
system tests, and evaluation. The design and performance of the top-5 systems
are presented, along with a system developed after the competition which
illustrates that winning teams produced near state-of-the-art results under
strict time and computation constraints. For the 1st place system, the harmonic
mean of recall and precision (f-measure) was 88.57% for figure region
detection, 78.81% for figure regions with correctly recognized figure titles,
and 70.98% for part label detection and character recognition. Data and
software from the competition are available through the online UCI Machine
Learning repository to inspire follow-on work by the image processing
community.
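The quoted scores are f-measures, i.e. the harmonic mean of recall and precision mentioned in the abstract, which can be computed as:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean punishes imbalance: a system with perfect precision but half recall scores only 2/3, not the arithmetic-mean 3/4.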
Measuring Human Perception to Improve Handwritten Document Transcription
The subtleties of human perception, as measured by vision scientists through
the use of psychophysics, are important clues to the internal workings of
visual recognition. For instance, measured reaction time can indicate whether a
visual stimulus is easy for a subject to recognize, or whether it is hard. In
this paper, we consider how to incorporate psychophysical measurements of
visual perception into the loss function of a deep neural network being trained
for a recognition task, under the assumption that such information can enforce
consistency with human behavior. As a case study to assess the viability of
this approach, we look at the problem of handwritten document transcription.
While good progress has been made towards automatically transcribing modern
handwriting, significant challenges remain in transcribing historical
documents. Here we describe a general enhancement strategy, underpinned by the
new loss formulation, which can be applied to the training regime of any deep
learning-based document transcription system. Through experimentation, reliable
performance improvement is demonstrated for the standard IAM and RIMES datasets
for three different network architectures. Further, we go on to show
feasibility for our approach on a new dataset of digitized Latin manuscripts,
originally produced by scribes in the Cloister of St. Gall in the 9th
century.
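One plausible way to fold reaction-time measurements into a training loss is to scale each sample's loss by a difficulty weight derived from how slowly humans recognized that stimulus. This is an illustrative sketch only, with made-up scaling constants; the paper's actual loss formulation may differ:

```python
def psychophysical_weights(reaction_times_ms, t_min=300.0, t_max=2000.0):
    """Map measured reaction times to [0, 1] difficulty weights:
    slower human response = harder stimulus = larger weight.
    t_min/t_max are illustrative clamping bounds, not from the paper."""
    return [min(max((t - t_min) / (t_max - t_min), 0.0), 1.0)
            for t in reaction_times_ms]

def weighted_loss(per_sample_losses, reaction_times_ms):
    """Mean per-sample loss, with each sample scaled by (1 + difficulty)
    so that hard-for-humans samples contribute more to training."""
    ws = psychophysical_weights(reaction_times_ms)
    return sum((1.0 + w) * l
               for w, l in zip(ws, per_sample_losses)) / len(ws)
```

Because the weighting touches only the loss, a scheme like this can wrap the training regime of any deep-learning transcription system, which is the generality the abstract claims for the enhancement strategy.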
A Review of Research on Devnagari Character Recognition
English Character Recognition (CR) has been extensively studied over the last
half century and has progressed to a level sufficient to produce
technology-driven applications. But the same is not the case for Indian
languages, which are more complicated in terms of structure and computation.
Rapidly growing computational power may enable the implementation of Indic CR
methodologies.
Digital document processing is gaining popularity for application to office and
library automation, bank and postal services, publishing houses and
communication technology. Devnagari, the script of the national language of
India, spoken by more than 500 million people, should be given special
attention so that document retrieval and analysis of rich ancient and modern
Indian literature can be carried out effectively. This article is intended to
serve as a
guide and update for the readers, working in the Devnagari Optical Character
Recognition (DOCR) area. An overview of DOCR systems is presented and the
available DOCR techniques are reviewed. The current status of DOCR is discussed
and directions for future research are suggested.
Comment: 8 pages, 1 figure, 8 tables, journal paper
A review on handwritten character and numeral recognition for Roman, Arabic, Chinese and Indian scripts
There has been intensive research on handwritten character recognition (HCR)
for almost four decades. The research has covered some popular scripts such as
Roman, Arabic, Chinese, and Indian. In this paper we present a review of HCR
work on these four popular scripts. We have summarized most of the papers
published from 2005 to the present and also analyzed the various methods for
creating a robust HCR system. We also add some future directions of research
on HCR.
Comment: 8 pages
A Study of Sindhi Related and Arabic Script Adapted languages Recognition
A large number of publications are available on Optical Character Recognition
(OCR). Significant research, as well as many articles, exists for the Latin,
Chinese, and Japanese scripts. Arabic script is also mature from the OCR
perspective. However, the adapted languages that share the Arabic script or
its extended character set still lack OCRs of their own. In this paper we
present the efforts of researchers on Arabic and its related and adapted
languages. The survey is organized into sections: the introduction is followed
by the properties of the Sindhi language; OCR processing techniques and
methods used by various researchers are then presented; and the last section
is dedicated to future work and the conclusion.
Comment: 11 pages, 8 figures, Sindh Univ. Res. Jour. (Sci. Ser.)