A tool for facilitating OCR postediting in historical documents
Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary; the assumed error is replaced in the post-edition by a presumably correct alternative, chosen on the basis of Language Model (LM) scores. The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As demonstrated below, the tool succeeds in correcting a number of common errors. Though sometimes unreliable, it is transparent and open to human intervention.
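The vocabulary lookup and LM-scored replacement described in this abstract can be sketched as follows. The vocabulary, the unigram counts standing in for an LM, and the use of `difflib` for candidate generation are illustrative assumptions, not the paper's actual resources:

```python
import difflib

# Toy vocabulary and unigram counts (illustrative assumptions only).
VOCAB = {"trade", "employing", "poor", "kingdom", "the", "of", "this"}
UNIGRAM_COUNTS = {"trade": 50, "employing": 10, "poor": 30,
                  "kingdom": 20, "the": 500, "of": 400, "this": 100}

def correct_token(token, vocab=VOCAB, counts=UNIGRAM_COUNTS):
    """Return token if it is in the vocabulary; otherwise suggest the
    closest in-vocabulary alternative, breaking ties by frequency."""
    if token.lower() in vocab:
        return token
    candidates = difflib.get_close_matches(token.lower(), vocab,
                                           n=3, cutoff=0.6)
    if not candidates:
        return token  # leave unrecognized tokens untouched
    return max(candidates, key=lambda w: counts.get(w, 0))

def postedit(text):
    """Apply token-level correction across a whitespace-split text."""
    return " ".join(correct_token(t) for t in text.split())
```

A real system would score alternatives in context with a proper language model rather than by unigram frequency, but the control flow (flag out-of-vocabulary forms, generate candidates, pick the best-scoring one) is the same.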
Integrating optical character recognition and machine translation of historical documents
Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital ones such as historical or legal documents, are encoded in formats not accessible to machine processing. Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often render MT ineffective. In this paper, we propose a new OCR-to-MT framework that adds an OCR error correction module to enhance the overall quality of translation. Experimentation shows that our correction system, based on a combination of Language Modeling and Translation methods, outperforms the baseline system by nearly 30% relative improvement.
Using SMT for OCR error correction of historical texts
A trend to digitize historical paper-based archives has emerged in recent years with the advent of digital optical scanners. Many paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, the OCR output still contains various kinds of errors; Automatic Error Correction tools can improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement over the initial baseline.
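As a toy illustration of the translation-style view of OCR correction, one can learn frequent character substitutions from a small parallel corpus of (OCR output, ground truth) pairs and apply a substitution only when it yields an in-vocabulary word. The corpus, vocabulary, and one-to-one character alignment below are illustrative assumptions, far simpler than a real SMT pipeline:

```python
from collections import Counter

# Tiny parallel corpus and vocabulary (illustrative assumptions only).
PARALLEL = [("tbe", "the"), ("tbis", "this"), ("bat", "hat")]
VOCAB = {"the", "this", "hat", "trade", "poor"}

def learn_substitutions(pairs):
    """Count character substitutions in length-preserving pairs
    (a trivial stand-in for real character alignment)."""
    subs = Counter()
    for src, tgt in pairs:
        if len(src) == len(tgt):
            subs.update((s, t) for s, t in zip(src, tgt) if s != t)
    return subs

def correct(word, subs, vocab=VOCAB):
    """Try learned substitutions, most frequent first; keep the first
    one that produces a known word."""
    if word in vocab:
        return word
    for (s, t), _ in subs.most_common():
        candidate = word.replace(s, t)
        if candidate in vocab:
            return candidate
    return word
```

A genuine SMT corrector would instead train a character-level translation model on such parallel data and decode with a language model; this sketch only shows the underlying idea of learning error patterns from OCR/ground-truth pairs.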