Using SMT for OCR error correction of historical texts
With the advent of digital optical scanners, a trend to digitize historical paper-based archives has emerged in recent years. Many paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, the OCR output contains various kinds of errors, and Automatic Error Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement over the initial baseline.
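As a side note on how such figures are usually reported, the sketch below shows the standard way a relative improvement in error rate is computed; the numbers are illustrative placeholders, not results from the paper.

```python
def relative_improvement(baseline_error: float, system_error: float) -> float:
    """Relative reduction of the error rate with respect to the baseline."""
    return (baseline_error - system_error) / baseline_error

# Illustrative placeholder values: a corrector that lowers the error rate
# from 20% to 17.4% achieves ~13% relative improvement over the baseline.
print(f"{relative_improvement(0.20, 0.174):.1%}")  # -> 13.0%
```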
Integrating optical character recognition and machine translation of historical documents
Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in formats that are not accessible to machine processing (e.g., historical or legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often render MT ineffective. In this paper, we propose a new OCR-to-MT framework that adds an OCR error correction module to enhance the overall quality of translation. Experimentation shows that our new correction system, based on a combination of Language Modeling and Translation methods, outperforms the baseline system by nearly 30% relative improvement.
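For illustration only, the snippet below sketches the general shape of such an OCR-to-MT pipeline with a correction stage in the middle; the function names and the toy stand-ins are assumptions for the example, not the framework described in the paper.

```python
from typing import Callable

def ocr_to_mt(image_path: str,
              ocr: Callable[[str], str],
              correct: Callable[[str], str],
              translate: Callable[[str], str]) -> str:
    """Recognize a scanned page, post-correct the text, then translate it."""
    raw_text = ocr(image_path)       # noisy recognition output
    clean_text = correct(raw_text)   # OCR error correction module
    return translate(clean_text)     # MT runs on the corrected text

# Toy stand-ins for the real OCR, correction, and MT components.
result = ocr_to_mt(
    "page_001.png",
    ocr=lambda path: "Tbe quick brovvn fox",
    correct=lambda text: text.replace("Tbe", "The").replace("brovvn", "brown"),
    translate=lambda text: f"[translated] {text}",
)
print(result)
```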
A tool for facilitating OCR postediting in historical documents
Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As demonstrated below, the tool is successful in correcting a number of common errors. Though sometimes unreliable, it is also transparent and open to human intervention.
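A minimal sketch of this kind of vocabulary-plus-LM correction loop is given below; the candidate generator, the toy vocabulary, and the scoring function are illustrative assumptions, not the tool's actual implementation.

```python
from difflib import get_close_matches
from typing import Callable

def postedit(text: str, vocabulary: set[str],
             lm_score: Callable[[str], float]) -> str:
    """Replace out-of-vocabulary tokens with the LM-preferred in-vocabulary candidate."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        if token.lower() in vocabulary:
            continue
        # Similar-looking in-vocabulary forms are the correction candidates.
        candidates = get_close_matches(token.lower(), vocabulary, n=5, cutoff=0.8)
        if not candidates:
            continue
        # Keep whichever candidate the language model prefers in this context.
        scored = [(lm_score(" ".join(tokens[:i] + [c] + tokens[i + 1:])), c)
                  for c in candidates]
        tokens[i] = max(scored)[1]
    return " ".join(tokens)

# Toy stand-ins: a tiny vocabulary and a "language model" that counts known words.
vocab = {"the", "trade", "poor", "of", "this", "kingdom", "employing"}
lm = lambda sentence: float(sum(w in vocab for w in sentence.lower().split()))
print(postedit("the trnde of this kingdom", vocab, lm))  # -> "the trade of this kingdom"
```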
Optimizing digital archiving: An artificial intelligence approach for OCR error correction
Project work presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. This thesis addresses the knowledge gap around effective ways to handle OCR errors and the importance of training datasets of adequate size and quality for efficient OCR recognition of digital documents. The main goal is to examine the effects of the following dimensions of sourcing data: input size vs. performance vs. time efficiency, and to propose a new design that includes a machine translation model to automate the correction of errors introduced by OCR scanning. The study implemented various LSTM models, with different thresholds, to recover errors generated by OCR systems. Although the results did not surpass the performance of existing OCR systems due to dataset size limitations, a step forward was achieved: a relationship between performance and input size was established, providing meaningful insights for the future optimisation of digital archiving systems. This dissertation proposes a new approach to OCR problems and implementation considerations that can be followed to optimise the efficiency and results of digital archive systems.
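The sketch below shows the general shape of a character-level LSTM encoder-decoder that could be trained on (OCR output, ground truth) pairs; the architecture and hyperparameters are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class Seq2SeqCorrector(nn.Module):
    """Character-level LSTM encoder-decoder for OCR post-correction (illustrative)."""
    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, noisy_ids: torch.Tensor, clean_ids: torch.Tensor) -> torch.Tensor:
        # Encode the noisy OCR line; its final state conditions the decoder.
        _, state = self.encoder(self.embed(noisy_ids))
        # Teacher forcing: the decoder reads the gold characters during training.
        dec_out, _ = self.decoder(self.embed(clean_ids), state)
        return self.out(dec_out)  # per-position logits over the character vocabulary
```

Training would minimise cross-entropy between these logits and the gold characters shifted by one position; sweeping the number of training pairs while logging accuracy and wall-clock time would mirror the size-versus-performance comparison the thesis focuses on.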
A transformer-based standardisation system for Scottish Gaelic
The transition from rule-based to neural-based architectures has made it more difficult for low-resource languages like Scottish Gaelic to participate in modern language technologies. The performance of deep-learning approaches correlates with the availability of training data, and low-resource languages have limited data reserves by definition. Historical and non-standard orthographic texts could be used to supplement training data, but manual conversion of these texts is expensive and time-consuming. This paper describes the development of a neural-based orthographic standardisation system for Scottish Gaelic and compares it to an earlier rule-based system. The best performance yielded a precision of 93.92, a recall of 92.20 and a word error rate of 11.01. This was obtained using a transformer-based mixed teacher model which was trained with augmented data.
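To make the reported metric concrete, here is a standard word error rate computation (a generic implementation, not the authors' evaluation script); the example strings are illustrative Gaelic tokens.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four gives a WER of 25%.
print(word_error_rate("tha mi gu math", "tha mi gle math"))  # -> 0.25
```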
A Large-Scale Comparison of Historical Text Normalization Systems
There is no consensus on the state-of-the-art approach to historical text
normalization. Many techniques have been proposed, including rule-based
methods, distance metrics, character-based statistical machine translation, and
neural encoder-decoder models, but studies have used different datasets,
different evaluation methods, and have come to different conclusions. This
paper presents the largest study of historical text normalization done so far.
We critically survey the existing literature and report experiments on eight
languages, comparing systems spanning all categories of proposed normalization
techniques, analysing the effect of training data quantity, and using different
evaluation methods. The datasets and scripts are made publicly available. (Accepted at NAACL 2019.)
From Arabic user-generated content to machine translation: integrating automatic error correction
With the widespread use of social media and online forums, individual users have been able to participate actively in the generation of online content in different languages and dialects. Arabic is one of the fastest-growing languages on the Internet, and dialects (such as Egyptian and Saudi Arabian) account for a large share of Arabic online content. There are many differences between Dialectal Arabic and Modern Standard Arabic, which pose many challenges for Machine Translation of informal Arabic. In this paper, we investigate the use of an Automatic Error Correction method to improve the quality of Arabic user-generated texts and their automatic translation. Our experiments show that the new system with an automatic correction module outperforms the baseline system by nearly 22.59% relative improvement.
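Purely as an illustration of what correcting dialectal user-generated text before MT can look like, the sketch below maps a few dialectal tokens to Modern Standard Arabic with a lookup table; the table entries and the lookup approach are assumptions for the example, not the method evaluated in the paper.

```python
# Toy dialect-to-MSA substitution table (illustrative entries only).
NORMALISATION = {
    "عايز": "أريد",       # Egyptian "want" -> MSA "I want"
    "ازيك": "كيف حالك",   # Egyptian greeting -> MSA "how are you"
}

def normalise_ugc(text: str) -> str:
    """Replace known dialectal tokens with MSA forms before handing text to MT."""
    return " ".join(NORMALISATION.get(token, token) for token in text.split())

print(normalise_ugc("انا عايز كتاب"))  # -> "انا أريد كتاب"
```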