
    Using SMT for OCR error correction of historical texts

    A trend to digitize historical paper-based archives has emerged in recent years with the advent of digital optical scanners. Many paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, OCR output contains different kinds of errors, and Automatic Error Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement compared to the initial baseline.

    Domain adaptation for social localisation-based SMT: a Case study using the Trommons platform

    Social localisation is a kind of community action that matches communities with the content they need and supports their localisation efforts. The goal of social localisation-based statistical machine translation (SL-SMT) is to support and bridge global communities exchanging any type of digital content across different languages and cultures. Trommons is an open platform maintained by The Rosetta Foundation that connects non-profit translation projects and organisations with the skills and interests of volunteer translators, who can translate, post-edit or proofread different types of documents. Using Trommons as the experimental platform, this paper focuses on domain adaptation techniques that augment SL-SMT to assist translators and post-editors. Specifically, the Cross Entropy Difference algorithm is used to adapt Europarl data to the social localisation data. Experimental results on English–Spanish show that the domain adaptation techniques can significantly improve translation performance by 6.82 absolute BLEU points and 5.99 absolute TER points compared to the baseline.
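    The abstract names the Cross Entropy Difference algorithm without giving implementation details. As commonly described, the criterion scores each general-domain sentence by its cross-entropy under an in-domain language model minus its cross-entropy under a general-domain language model, and keeps the lowest-scoring sentences. The sketch below illustrates that idea only: the add-one smoothed unigram models, the function names, and the toy sentences are all assumptions made for illustration, not the paper's setup, which would normally rely on stronger n-gram models.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (illustrative stand-in)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(sentence, model):
    """Per-token cross-entropy of a non-empty sentence, in bits."""
    toks = sentence.split()
    return -sum(math.log2(model[t]) for t in toks) / len(toks)


def rank_by_ced(in_domain, general_pool):
    """Rank general-domain sentences by H_in(s) - H_general(s), lowest first.

    A lower score means the sentence looks more like the in-domain data
    than like the general pool.
    """
    pool = [s for s in general_pool if s.split()]  # skip empty lines
    vocab = {tok for s in in_domain + pool for tok in s.split()}
    lm_in = train_unigram(in_domain, vocab)
    lm_general = train_unigram(pool, vocab)
    scored = [
        (cross_entropy(s, lm_in) - cross_entropy(s, lm_general), s)
        for s in pool
    ]
    return [s for _, s in sorted(scored)]


# Hypothetical toy data: the in-domain side stands in for social localisation
# text, the pool for Europarl-like text.
in_domain = [
    "translate the volunteer document",
    "volunteer translators proofread the document",
]
general_pool = [
    "the parliament adopted the resolution",
    "the volunteer translators translate the document",
    "the commission rejected the amendment",
]
ranked = rank_by_ced(in_domain, general_pool)
```

    In practice a threshold or a fixed top fraction of the ranked list would then be added to the SMT training data; where to cut is an empirical choice that the abstract does not specify.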
