OCR Error Correction Using Character Correction and Feature-Based Word Classification
This paper explores the use of a learned classifier for post-OCR text
correction. Experiments with the Arabic language show that this approach, which
integrates a weighted confusion matrix and a shallow language model, corrects
the vast majority of segmentation and recognition errors, the most frequent
types of error in our dataset.
Comment: Proceedings of the 12th IAPR International Workshop on Document Analysis Systems (DAS2016), Santorini, Greece, April 11-14, 2016
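The combination the abstract describes, character-level confusion weights scored together with a shallow language model, can be sketched roughly as follows. The confusion pairs, probabilities, and unigram frequencies below are all illustrative stand-ins, not values from the paper:

```python
# Illustrative sketch of confusion-matrix-based candidate scoring.
# CONFUSIONS maps (recognized, intended) character sequences to assumed
# error probabilities; the "shallow language model" is stood in for by
# simple word unigram frequencies.
CONFUSIONS = {("rn", "m"): 0.4, ("l", "1"): 0.2, ("c", "e"): 0.1}
UNIGRAM = {"modern": 0.7, "modem": 0.3}

def candidates(word):
    """Generate corrections by applying known confusions to the word."""
    out = {word: 1.0}  # keeping the word unchanged has weight 1.0
    for (seen, true), p in CONFUSIONS.items():
        if seen in word:
            out[word.replace(seen, true, 1)] = p
    return out

def correct(word):
    """Pick the candidate maximizing confusion weight x unigram probability."""
    scored = {c: p * UNIGRAM.get(c, 1e-6) for c, p in candidates(word).items()}
    return max(scored, key=scored.get)
```

With these toy tables, `correct("rnodern")` recovers `"modern"`, while a word the language model already likes is left alone.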
OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion
With the advent of digital optical scanners, a lot of paper-based books,
textbooks, magazines, articles, and documents are being transformed into an
electronic version that can be manipulated by a computer. For this purpose,
OCR, short for Optical Character Recognition, was developed to translate scanned
graphical text into editable computer text. Unfortunately, OCR is still
imperfect, as it occasionally mis-recognizes letters and falsely identifies
scanned text, leading to misspellings and linguistic errors in the OCR output
text. This paper proposes a post-processing context-based error correction
algorithm for detecting and correcting OCR non-word and real-word errors. The
proposed algorithm is based on Google's online spelling suggestion, which
harnesses an internal database containing a huge collection of terms and word
sequences gathered from all over the web, making it well suited to suggesting
possible replacements for words that have been misspelled during the OCR process.
Experiments carried out revealed a significant improvement in OCR error
correction rate. Future research can extend the proposed algorithm so that it
can be parallelized and executed on multiprocessing platforms.
Comment: LACSC - Lebanese Association for Computational Sciences, http://www.lacsc.org/; Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 1, January 201
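The overall post-processing loop might look like the following sketch, where `suggest()` stands in for the online spelling-suggestion service (the paper uses Google's, which is mocked here with a tiny lookup table since the real backend is not reproducible offline; all names and data are illustrative):

```python
# Hypothetical stand-ins for the suggestion service and the word list.
MOCK_SUGGESTIONS = {"tbe": ["the"], "scaned": ["scanned"]}
DICTIONARY = {"the", "scanned", "page", "was"}

def suggest(word):
    """Mock of an online spelling-suggestion lookup."""
    return MOCK_SUGGESTIONS.get(word, [])

def correct_text(text):
    """Replace non-word errors (tokens outside the dictionary) with the
    top suggestion, when one is available."""
    out = []
    for tok in text.split():
        if tok not in DICTIONARY and suggest(tok):
            tok = suggest(tok)[0]
        out.append(tok)
    return " ".join(out)
```

Real-word errors (valid words that are wrong in context) would additionally need the surrounding word sequence to be scored, which is where the web-scale word-sequence database comes in.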
A Complete Workflow for Development of Bangla OCR
Developing a Bangla OCR requires a number of algorithms and methods. Many
efforts have gone into developing a Bangla OCR, but all of them have failed to
provide an error-free system; each has its shortcomings. We discuss the problem
scope of the currently existing Bangla OCRs. In this paper, we present the
basic steps required for developing a Bangla OCR and a complete workflow for
its development, mentioning all the algorithms required
NAT: Noise-Aware Training for Robust Neural Sequence Labeling
Sequence labeling systems should perform reliably not only under ideal
conditions but also with corrupted inputs - as these systems often process
user-generated text or follow an error-prone upstream component. To this end,
we formulate the noisy sequence labeling problem, where the input may undergo
an unknown noising process, and propose two Noise-Aware Training (NAT)
objectives that improve the robustness of sequence labeling performed on
perturbed input. Our data augmentation method trains a neural model using a mixture of
clean and noisy samples, whereas our stability training algorithm encourages
the model to create a noise-invariant latent representation. We employ a
vanilla noise model at training time. For evaluation, we use both the original
data and its variants perturbed with real OCR errors and misspellings.
Extensive experiments on English and German named entity recognition benchmarks
confirmed that NAT consistently improved the robustness of popular sequence
labeling models while preserving accuracy on the original input. We make our
code and data publicly available for the research community.
Comment: Accepted to appear at ACL 2020
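The data-augmentation objective can be illustrated with a simple character-substitution noise model; the noise function and its parameters below are stand-ins for the paper's vanilla noise model, not its actual implementation:

```python
import random

def add_noise(tokens, p=0.1, rng=None):
    """Corrupt each token's characters independently with probability p
    by substituting a random lowercase letter (a toy noise model)."""
    rng = rng or random.Random(0)
    noisy = []
    for tok in tokens:
        chars = [rng.choice("abcdefghijklmnopqrstuvwxyz")
                 if rng.random() < p else ch
                 for ch in tok]
        noisy.append("".join(chars))
    return noisy

def augment(batch, noise_fraction=0.5, rng=None):
    """Return a mixed batch of clean and noised examples, mirroring the
    clean/noisy mixture used by the data-augmentation objective."""
    rng = rng or random.Random(1)
    return [add_noise(x) if rng.random() < noise_fraction else x
            for x in batch]
```

The stability-training objective would instead feed both the clean and the noised version of each sentence through the model and penalize divergence between their latent representations.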
Lipi Gnani - A Versatile OCR for Documents in any Language Printed in Kannada Script
A Kannada OCR, named Lipi Gnani, has been designed and developed from
scratch, with the motivation of being able to convert printed text or poetry
in Kannada script without any restriction on vocabulary. The training and test
sets have been collected from over 35 books published between 1970 and 2002,
including books written in Halegannada and pages containing
Sanskrit slokas written in Kannada script. The coverage of the OCR is nearly
complete in the sense that it recognizes all the punctuation marks, special
symbols, Indo-Arabic and Kannada numerals and also the interspersed English
words. Several minor and major original contributions have been made in
developing this OCR at the different processing stages, such as binarization,
line and character segmentation, recognition, and Unicode mapping. The result
is a Kannada OCR that performs as well as, and in some cases better than,
Google's Tesseract OCR, as shown by the results. To the knowledge of the
authors, this is the maiden report of a complete Kannada OCR, handling all the
issues involved. Currently, there is no dictionary-based post-processing, and
the obtained results are due solely to the recognition process. Four benchmark
test databases containing scanned pages from books in Kannada, Sanskrit,
Konkani and Tulu languages, but all of them printed in Kannada script, have
been created. The word level recognition accuracy of Lipi Gnani is 4% higher on
the Kannada dataset than that of Google's Tesseract OCR, 8% higher on the
datasets of Tulu and Sanskrit, and 25% higher on the Konkani dataset.
Comment: 21 pages, 16 figures, 12 tables, submitted to ACM Transactions on Asian and Low-Resource Language Information Processing
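Binarization is the first of the processing stages listed above; Otsu's method is one common choice for it (a sketch only, since the abstract does not say which binarization Lipi Gnani actually uses). It picks the threshold that maximizes between-class variance of the grayscale histogram:

```python
def otsu_threshold(pixels):
    """Otsu's global threshold over a flat list of grayscale values 0-255.
    Returns the threshold t maximizing between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w_bg, sum_bg = 0, -1.0, 0, 0.0
    for t in range(256):
        w_bg += hist[t]          # background weight grows with t
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # remaining pixels are foreground
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

For a cleanly bimodal page (e.g. dark glyphs on light paper) the threshold falls between the two modes, after which line and character segmentation operate on the binary image.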
Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms
We report the findings of a month-long online competition in which
participants developed algorithms for augmenting the digital version of patent
documents published by the United States Patent and Trademark Office (USPTO).
The goal was to detect figures and part labels in U.S. patent drawing pages.
The challenge drew 232 teams of two, of which 70 teams (30%) submitted
solutions. Collectively, teams submitted 1,797 solutions that were compiled on
the competition servers. Participants reported spending an average of 63 hours
developing their solutions, resulting in a total of 5,591 hours of development
time. A manually labeled dataset of 306 patents was used for training, online
system tests, and evaluation. The design and performance of the top-5 systems
are presented, along with a system developed after the competition which
illustrates that winning teams produced near state-of-the-art results under
strict time and computation constraints. For the 1st place system, the harmonic
mean of recall and precision (f-measure) was 88.57% for figure region
detection, 78.81% for figure regions with correctly recognized figure titles,
and 70.98% for part label detection and character recognition. Data and
software from the competition are available through the online UCI Machine
Learning repository to inspire follow-on work by the image processing
community.
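The quoted scores are f-measures, i.e. the harmonic mean of recall and precision mentioned in the abstract, which can be computed as:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean punishes imbalance: a system with perfect precision but half recall scores only 2/3, not the arithmetic-mean 3/4.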
Measuring Human Perception to Improve Handwritten Document Transcription
The subtleties of human perception, as measured by vision scientists through
the use of psychophysics, are important clues to the internal workings of
visual recognition. For instance, measured reaction time can indicate whether a
visual stimulus is easy for a subject to recognize, or whether it is hard. In
this paper, we consider how to incorporate psychophysical measurements of
visual perception into the loss function of a deep neural network being trained
for a recognition task, under the assumption that such information can enforce
consistency with human behavior. As a case study to assess the viability of
this approach, we look at the problem of handwritten document transcription.
While good progress has been made towards automatically transcribing modern
handwriting, significant challenges remain in transcribing historical
documents. Here we describe a general enhancement strategy, underpinned by the
new loss formulation, which can be applied to the training regime of any deep
learning-based document transcription system. Through experimentation, reliable
performance improvement is demonstrated for the standard IAM and RIMES datasets
for three different network architectures. Further, we go on to show
feasibility for our approach on a new dataset of digitized Latin manuscripts,
originally produced by scribes in the Cloister of St. Gall in the 9th
century.
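One plausible way to fold reaction-time measurements into a training loss is to scale each sample's loss by a difficulty weight derived from how slowly humans recognized that stimulus. This is an illustrative sketch only, with made-up scaling constants; the paper's actual loss formulation may differ:

```python
def psychophysical_weights(reaction_times_ms, t_min=300.0, t_max=2000.0):
    """Map measured reaction times to [0, 1] difficulty weights:
    slower human response = harder stimulus = larger weight.
    t_min/t_max are illustrative clamping bounds, not from the paper."""
    return [min(max((t - t_min) / (t_max - t_min), 0.0), 1.0)
            for t in reaction_times_ms]

def weighted_loss(per_sample_losses, reaction_times_ms):
    """Mean per-sample loss, with each sample scaled by (1 + difficulty)
    so that hard-for-humans samples contribute more to training."""
    ws = psychophysical_weights(reaction_times_ms)
    return sum((1.0 + w) * l
               for w, l in zip(ws, per_sample_losses)) / len(ws)
```

Because the weighting touches only the loss, a scheme like this can wrap the training regime of any deep-learning transcription system, which is the generality the abstract claims for the enhancement strategy.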
A Review of Research on Devnagari Character Recognition
English Character Recognition (CR) has been extensively studied over the last
half century and has progressed to a level sufficient to produce
technology-driven applications. But the same is not the case for Indian
languages, which are more complicated in terms of structure and computation.
Rapidly growing computational power may enable the implementation of Indic CR
methodologies.
Digital document processing is gaining popularity for application to office and
library automation, bank and postal services, publishing houses and
communication technology. Devnagari, the script of the national language of
India, spoken by more than 500 million people, should be given special
attention so that document retrieval and analysis of rich ancient and modern
Indian literature can be carried out effectively. This article is intended to
serve as a
guide and update for the readers, working in the Devnagari Optical Character
Recognition (DOCR) area. An overview of DOCR systems is presented and the
available DOCR techniques are reviewed. The current status of DOCR is discussed
and directions for future research are suggested.
Comment: 8 pages, 1 figure, 8 tables, journal paper
A review on handwritten character and numeral recognition for Roman, Arabic, Chinese and Indian scripts
There has been intensive research on handwritten character recognition (HCR)
for almost four decades. The research has covered some popular scripts such as
Roman, Arabic, Chinese, and Indian. In this paper we present a review of HCR
work on these four popular scripts. We have summarized most of the papers
published from 2005 to the present and also analyzed the various methods for
creating a robust HCR system. We also add some future directions of research
on HCR.
Comment: 8 pages
A Study of Sindhi Related and Arabic Script Adapted languages Recognition
A large number of publications are available on Optical Character Recognition
(OCR). Significant research, as well as many articles, exists for the Latin,
Chinese, and Japanese scripts. Arabic script is also mature from the OCR
perspective. However, the adapted languages that share the Arabic script or
its extended character set still lack OCRs of their own. In this paper we
present the efforts of researchers on Arabic and its related and adapted
languages. The survey is organized into sections: the introduction is followed
by the properties of the Sindhi language; OCR processing techniques and
methods used by various researchers are then presented; and the last section
is dedicated to future work and the conclusion.
Comment: 11 pages, 8 figures, Sindh Univ. Res. Jour. (Sci. Ser.)