Using SMT for OCR error correction of historical texts
A trend to digitize historical paper-based archives has emerged in recent years, with the advent of digital optical scanners. A lot of
paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated
by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital
text into editable computer text. However, the output of OCR systems contains various kinds of errors. Automatic Error
Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a
qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows
that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly
13% relative improvement compared to the initial baseline.
Domain adaptation for social localisation-based SMT: a Case study using the Trommons platform
Social localisation is a kind of community action, which matches communities and the content
they need, and supports their localisation efforts. The goal of social localisation-based statistical machine translation (SL-SMT) is to support and bridge global communities exchanging
any type of digital content across different languages and cultures. Trommons is an open
platform maintained by The Rosetta Foundation to connect non-profit translation projects and
organisations with the skills and interests of volunteer translators, where they can translate,
post-edit or proofread different types of documents. Using Trommons as the experimental
platform, this paper focuses on domain adaptation techniques to augment SL-SMT to facilitate
translators/post-editors. Specifically, the Cross Entropy Difference algorithm is used to adapt
Europarl data to the social localisation data. Experimental results on English–Spanish show
that the domain adaptation techniques can significantly improve translation performance by
6.82 absolute BLEU points and 5.99 absolute TER points compared to the baseline.
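The abstract above names the Cross Entropy Difference algorithm but does not describe how it was configured. As a rough illustration only, the general idea can be sketched as follows: each sentence in a large general corpus (Europarl in the paper) is scored by its cross-entropy under an in-domain language model minus its cross-entropy under a general-domain language model, and the lowest-scoring sentences are kept. The sketch assumes add-one-smoothed unigram models and whitespace tokenisation, which are simplifications chosen here for brevity; the function names (`train_unigram`, `cross_entropy`, `select_by_ced`) are hypothetical and are not taken from the paper, which most likely used higher-order n-gram models.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count lowercased whitespace tokens; returns (counts, total_tokens)."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())


def cross_entropy(tokens, model, vocab_size):
    """Per-token cross-entropy (bits) under an add-one-smoothed unigram model."""
    counts, total = model
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_by_ced(in_domain, general, keep):
    """Return the `keep` general sentences with the lowest cross-entropy difference.

    Score = H_in_domain(s) - H_general(s); a lower score means the sentence
    looks more like the in-domain data than like the general corpus.
    """
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(general)
    # Shared vocabulary size, plus one slot for unseen tokens (an assumption).
    vocab_size = len(set(in_model[0]) | set(gen_model[0])) + 1

    scored = []
    for sentence in general:
        tokens = sentence.lower().split()
        if not tokens:
            continue  # skip empty lines rather than divide by zero
        score = (cross_entropy(tokens, in_model, vocab_size)
                 - cross_entropy(tokens, gen_model, vocab_size))
        scored.append((score, sentence))

    scored.sort(key=lambda pair: pair[0])
    return [sentence for _, sentence in scored[:keep]]
```

In practice the general-domain model is usually trained on a random sample of the general corpus comparable in size to the in-domain data, and the cut-off (`keep` here) is tuned on held-out data; both choices are left out of this sketch.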