The Bible, Truth, and Multilingual OCR Evaluation

Philip Resnik; Tapas Kanungo

Abstract

In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with groundtruth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora such as the Koran and the Bhagavad Gita that have similar properties. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.58.87

Last time updated on 22/10/2014