Using SMT for OCR error correction of historical texts

Authors: Afli Haithem; Qui Zhengwei; Sheridan Páraic; Way Andy
Publication venue: European Language Resource Association
Publication date: 01/05/2016

Digitizing historical paper-based archives has become a trend in recent years with the advent of digital optical scanners. Many paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, OCR output contains different kinds of errors, and Automatic Error Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experiments show that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement over the initial baseline.

Irish Universities

DCU Online Research Access Service

Domain adaptation for social localisation-based SMT: a Case study using the Trommons platform

Authors: Du Jinhua; Qui Zhengwei; Schäler Reinhard; Wasala Asanka; Way Andy
Publication venue: Association for Machine Translation in the Americas (AMTA)
Publication date: 01/11/2015

Social localisation is a kind of community action that matches communities with the content they need and supports their localisation efforts. The goal of social localisation-based statistical machine translation (SL-SMT) is to support and bridge global communities exchanging any type of digital content across different languages and cultures. Trommons is an open platform maintained by The Rosetta Foundation to connect non-profit translation projects and organisations with the skills and interests of volunteer translators, where they can translate, post-edit or proofread different types of documents. Using Trommons as the experimental platform, this paper focuses on domain adaptation techniques that augment SL-SMT to assist translators and post-editors. Specifically, the Cross Entropy Difference algorithm is used to adapt Europarl data to the social localisation data. Experimental results on English–Spanish show that the domain adaptation techniques can significantly improve translation performance, by 6.82 absolute BLEU points and 5.99 absolute TER points compared to the baseline.
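
As a rough illustration of the cross-entropy difference idea mentioned in the abstract, the sketch below ranks general-domain sentences by H_in(s) - H_gen(s), where H_in and H_gen are per-token cross-entropies under an in-domain and a general-domain language model; lower scores indicate sentences that look more like the in-domain data. This is not the authors' implementation: it assumes simple add-one-smoothed unigram models in place of whatever language models the paper used, and the function names (`train_unigram`, `cross_entropy`, `rank_by_ced`) and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return an add-one-smoothed unigram log-probability function.

    Assumption: a unigram model stands in for a real n-gram language model.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def cross_entropy(sentence, logprob):
    """Per-token cross-entropy (in nats) of a sentence under a model."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return -sum(logprob(t) for t in tokens) / len(tokens)


def rank_by_ced(in_domain, general):
    """Sort general-domain sentences by H_in(s) - H_gen(s), lowest first."""
    lp_in = train_unigram(in_domain)
    lp_gen = train_unigram(general)
    scored = [
        (cross_entropy(s, lp_in) - cross_entropy(s, lp_gen), s)
        for s in general
    ]
    return [s for _, s in sorted(scored)]


# Toy data (hypothetical): a small in-domain set and a general-domain pool.
in_domain = [
    "volunteers translate documents",
    "volunteers proofread documents",
]
general = [
    "the parliament adopted the resolution",
    "volunteers translate the documents",
]
ranked = rank_by_ced(in_domain, general)
```

In a data-selection setting, one would typically keep only the top-ranked portion of `ranked` and add it to the training data; the cut-off is a tuning choice that this sketch does not address.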

DCU Online Research Access Service
