
    The Bible, Truth, and Multilingual OCR Evaluation

    In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned-image dataset with groundtruth from an Arabic Bible, and we have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora such as the Koran and the Bhagavad Gita that have similar properties. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.
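
    Because the translations are parallel, OCR results in different languages can be compared with a common character-accuracy score. The sketch below assumes an edit-distance-based measure; the abstract does not say which metric the Maryland project actually uses, so the normalisation here is an illustrative assumption.

        def char_accuracy(groundtruth: str, ocr_output: str) -> float:
            """Character accuracy = 1 - edit_distance / len(groundtruth).
            Illustrative only; the project's evaluation protocol is an assumption."""
            m, n = len(groundtruth), len(ocr_output)
            prev = list(range(n + 1))                  # distances for the empty prefix
            for i in range(1, m + 1):
                cur = [i] + [0] * n
                for j in range(1, n + 1):
                    cost = 0 if groundtruth[i - 1] == ocr_output[j - 1] else 1
                    cur[j] = min(prev[j] + 1,          # deletion
                                 cur[j - 1] + 1,       # insertion
                                 prev[j - 1] + cost)   # substitution / match
                prev = cur
            return 1.0 - prev[n] / max(m, 1)

        # Example: the same verse OCR'd in two languages can then be scored directly.
        print(char_accuracy("In the beginning", "In the beginnmg"))   # ~0.875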

    Document Degradation Models and a Methodology for Degradation Model Validation

    Document Degradation Models and a Methodology for Degradation Model Validation, by Tapas Kanungo. Chairperson of Supervisory Committee: Professor Robert M. Haralick, Department of Electrical Engineering. Printing, photocopying and scanning processes degrade the image quality of a document. Although research in document understanding started in the sixties, only two document degradation models have been proposed thus far. Furthermore, no attempts have been made to rigorously validate them. In document understanding research, models for image degradations are crucial in many ways. Models allow us to (i) conduct controlled experiments to study the break-down points of the systems, (ii) create large data sets with groundtruth for training classifiers, (iii) design optimal noise-removal algorithms, (iv) choose values for the free parameters of the algorithms, etc. In this thesis two document degradation models are described. The first model accounts for local pixel-level degradations that occur…
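
    Local pixel-level degradation models of this kind are commonly realised by flipping each pixel with a probability that decays with its distance from the character boundary, followed by a morphological smoothing step. The sketch below follows that formulation; the parameter names and default values are illustrative assumptions, not the values used in the thesis.

        import numpy as np
        from scipy.ndimage import distance_transform_edt, binary_closing

        def degrade(page, alpha0=1.0, alpha=2.0, beta0=1.0, beta=2.0, eta=0.0, k=3, rng=None):
            """Local pixel-level degradation of a binary page image (True = ink).
            Flip probabilities decay exponentially with squared distance to the
            character boundary; all parameters here are illustrative."""
            rng = np.random.default_rng() if rng is None else rng
            fg = page.astype(bool)
            d_fg = distance_transform_edt(fg)        # ink pixels: distance to background
            d_bg = distance_transform_edt(~fg)       # background pixels: distance to ink
            p_flip = np.where(fg,
                              alpha0 * np.exp(-alpha * d_fg ** 2) + eta,   # ink -> white
                              beta0 * np.exp(-beta * d_bg ** 2) + eta)     # white -> ink
            flipped = rng.random(page.shape) < p_flip
            noisy = np.where(flipped, ~fg, fg)
            # Morphological closing smooths the speckle introduced by the random flips.
            return binary_closing(noisy, structure=np.ones((k, k), bool))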

    On the Use of Hierarchy Information in Mapping Patents to Biomedical Ontologies


    An Automatic Closed-Loop Methodology for Generating Character Groundtruth

    Character groundtruth for real, scanned document images is extremely useful for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not feasible because (i) groundtruth character bounding boxes cannot be delineated accurately enough by hand, (ii) the process is extremely laborious and time-consuming, and (iii) the manual labor required for this task is prohibitively expensive. In this paper we give a closed-loop methodology for collecting very accurate (to within a pixel) groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied and scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document…
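
    A rough sketch of such a registration step: a global geometric transform estimated from feature correspondences, followed by a local bitmap match that refines each character box. OpenCV is used here for convenience; the excerpt does not give the paper's transform model or matching criterion, so this is one plausible realisation rather than the authors' implementation.

        import cv2
        import numpy as np

        def register_and_refine(ideal, scanned, char_boxes, margin=5):
            """Estimate a global transform from the ideal page to the scan, then
            refine each groundtruth box with a local bitmap match (a sketch)."""
            orb = cv2.ORB_create(2000)
            k1, d1 = orb.detectAndCompute(ideal, None)
            k2, d2 = orb.detectAndCompute(scanned, None)
            matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
            src = np.float32([k1[m.queryIdx].pt for m in matches])
            dst = np.float32([k2[m.trainIdx].pt for m in matches])
            A, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)  # global transform

            refined = []
            for (x, y, w, h) in char_boxes:                 # boxes in ideal-image coords
                cx, cy = A @ np.array([x, y, 1.0])          # map box origin into the scan
                x0 = max(int(round(cx)) - margin, 0)
                y0 = max(int(round(cy)) - margin, 0)
                window = scanned[y0:y0 + h + 2 * margin, x0:x0 + w + 2 * margin]
                glyph = ideal[y:y + h, x:x + w]
                scores = cv2.matchTemplate(window, glyph, cv2.TM_CCOEFF_NORMED)
                _, _, _, (dx, dy) = cv2.minMaxLoc(scores)   # best local alignment
                refined.append((x0 + dx, y0 + dy, w, h))
            return A, refined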