1,971 research outputs found

Flexible text recovery from degraded typewritten historical documents

Authors: Antonacopoulos A; Casado Castilla C
Publication venue: IEEE Computer Society
Publication date: 01/01/2006

University of Salford Institutional Repository

Crossref

Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents

Authors: Jones Gareth J.F.; Lam-Adesina Adenike M.
Publication venue: Elsevier BV
Publication date: 01/01/2006

Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and, significantly, the searching of these documents. The text contents of these document images can be transcribed automatically using OCR systems and then stored in an information retrieval system. However, OCR systems make character recognition errors, which have previously been shown to affect document retrieval behaviour. In particular, relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data.
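The abstract gives no algorithmic detail beyond "term collection frequency and a string edit-distance measure", so the following is only a minimal sketch of how such a combination might look, not the authors' method. The function names, the `max_dist` threshold and the `min_ratio` frequency test are all assumptions introduced for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def merge_variants(freqs, max_dist=1, min_ratio=5.0):
    """Map each rare term to a much more frequent term within max_dist edits.

    Assumption: a rare string that is close to a far more frequent string
    is treated as an OCR variant of it. Both thresholds are hypothetical.
    """
    mapping = {}
    canonical = []
    # Visit terms from most to least frequent so likely correct forms come first.
    for term in sorted(freqs, key=lambda t: (-freqs[t], t)):
        target = term
        for cand in canonical:
            frequent_enough = freqs[cand] >= min_ratio * freqs[term]
            if frequent_enough and edit_distance(term, cand) <= max_dist:
                target = cand
                break
        if target == term:
            canonical.append(term)
        mapping[term] = target
    return mapping


# Toy collection frequencies; the misrecognised strings are invented examples.
freqs = Counter({"retrieval": 120, "feedback": 80, "feed": 50,
                 "feedhack": 4, "retrieva1": 3, "retrleval": 2})
mapping = merge_variants(freqs)
```

With a mapping like this, candidate expansion terms could be conflated to their canonical form before their weights are computed. The abstract does not say that the authors proceed this way.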

Irish Universities

DCU Online Research Access Service

A tool for facilitating OCR postediting in historical documents

Authors: Aboomar Mohammad; Buts Jan; Hadley James; Poncelas Alberto; Way Andy
Publication venue: LREC
Publication date: 23/04/2020

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the postedited text based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
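The workflow described in this abstract (flag word forms missing from a vocabulary, propose alternatives, pick one by language model score) can be illustrated with a small sketch. This is not the authors' tool and does not call Tesseract. The candidate generator (`difflib.get_close_matches` from the Python standard library), the toy add-one-smoothed bigram model, and every function name below are assumptions made for illustration.

```python
import difflib
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over lowercased, whitespace-split text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev); a stand-in for a real LM score."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def postedit(tokens, vocab, unigrams, bigrams, n=5, cutoff=0.6):
    """Replace out-of-vocabulary tokens with the best-scoring close match.

    Tokens are lowercased; tokens with no close match are left unchanged
    so that a human posteditor can still inspect them.
    """
    output = []
    prev = "<s>"
    for token in tokens:
        word = token.lower()
        if word not in vocab:
            candidates = difflib.get_close_matches(
                word, sorted(vocab), n=n, cutoff=cutoff)
            if candidates:
                word = max(candidates, key=lambda c: bigram_logprob(
                    prev, c, unigrams, bigrams))
        output.append(word)
        prev = word
    return output


# Tiny invented training text; a real setup would use a large corpus.
corpus = ["the trade of this kingdom",
          "employing the poor of this kingdom",
          "the poor of the kingdom"]
unigrams, bigrams = train_bigram_lm(corpus)
vocab = set(unigrams) - {"<s>"}
corrected = postedit("tbe poor of tbis kingdom".split(), vocab, unigrams, bigrams)
```

Leaving unmatched tokens untouched is a deliberate choice in this sketch: it keeps the output inspectable, in line with the abstract's emphasis on transparency and human intervention.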

arXiv.org e-Print Archive

DCU Online Research Access Service