Using SMT for OCR error correction of historical texts
With the advent of digital optical scanners, a trend to digitize historical paper-based archives has emerged in recent years. Many paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, the OCR output contains various kinds of errors, and Automatic Error Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement over the initial baseline.
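As a side note on how such figures are usually reported, the sketch below shows the standard way a relative improvement in error rate is computed; the numbers are illustrative placeholders, not results from the paper.

```python
def relative_improvement(baseline_error: float, system_error: float) -> float:
    """Relative reduction of the error rate with respect to the baseline."""
    return (baseline_error - system_error) / baseline_error

# Illustrative placeholder values: a corrector that lowers the error rate
# from 20% to 17.4% achieves ~13% relative improvement over the baseline.
print(f"{relative_improvement(0.20, 0.174):.1%}")  # -> 13.0%
```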
Integrating optical character recognition and machine translation of historical documents
Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in formats that are not accessible to machine processing (e.g., historical or legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often render MT ineffective. In this paper, we propose a new OCR-to-MT framework that adds an OCR error correction module to enhance the overall quality of translation. Experimentation shows that our new correction system, based on a combination of Language Modeling and Translation methods, outperforms the baseline system by nearly 30% relative improvement.
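For illustration only, the snippet below sketches the general shape of such an OCR-to-MT pipeline with a correction stage in the middle; the function names and the toy stand-ins are assumptions for the example, not the framework described in the paper.

```python
from typing import Callable

def ocr_to_mt(image_path: str,
              ocr: Callable[[str], str],
              correct: Callable[[str], str],
              translate: Callable[[str], str]) -> str:
    """Recognize a scanned page, post-correct the text, then translate it."""
    raw_text = ocr(image_path)       # noisy recognition output
    clean_text = correct(raw_text)   # OCR error correction module
    return translate(clean_text)     # MT runs on the corrected text

# Toy stand-ins for the real OCR, correction, and MT components.
result = ocr_to_mt(
    "page_001.png",
    ocr=lambda path: "Tbe quick brovvn fox",
    correct=lambda text: text.replace("Tbe", "The").replace("brovvn", "brown"),
    translate=lambda text: f"[translated] {text}",
)
print(result)
```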
A tool for facilitating OCR postediting in historical documents
Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As demonstrated below, the tool is successful in correcting a number of common errors. Though sometimes unreliable, it is also transparent and open to human intervention.
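A minimal sketch of this kind of vocabulary-plus-LM correction loop is given below; the candidate generator, the toy vocabulary, and the scoring function are illustrative assumptions, not the tool's actual implementation.

```python
from difflib import get_close_matches
from typing import Callable

def postedit(text: str, vocabulary: set[str],
             lm_score: Callable[[str], float]) -> str:
    """Replace out-of-vocabulary tokens with the LM-preferred in-vocabulary candidate."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        if token.lower() in vocabulary:
            continue
        # Similar-looking in-vocabulary forms are the correction candidates.
        candidates = get_close_matches(token.lower(), vocabulary, n=5, cutoff=0.8)
        if not candidates:
            continue
        # Keep whichever candidate the language model prefers in this context.
        scored = [(lm_score(" ".join(tokens[:i] + [c] + tokens[i + 1:])), c)
                  for c in candidates]
        tokens[i] = max(scored)[1]
    return " ".join(tokens)

# Toy stand-ins: a tiny vocabulary and a "language model" that counts known words.
vocab = {"the", "trade", "poor", "of", "this", "kingdom", "employing"}
lm = lambda sentence: float(sum(w in vocab for w in sentence.lower().split()))
print(postedit("the trnde of this kingdom", vocab, lm))  # -> "the trade of this kingdom"
```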
Optimizing digital archiving: An artificial intelligence approach for OCR error correction
Project work presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. This thesis addresses the knowledge gap around effective ways to handle OCR errors and the importance of training datasets of adequate size and quality for efficient OCR recognition of digital documents. The main goal is to examine the effects of the following dimensions of sourcing data: input size vs. performance vs. time efficiency, and to propose a new design that includes a machine translation model to automate the correction of errors introduced by OCR scanning. The study implemented various LSTM models, with different thresholds, to recover errors generated by OCR systems. Although the results did not surpass the performance of existing OCR systems due to dataset size limitations, a step forward was achieved: a relationship between performance and input size was established, providing meaningful insights for the future optimisation of digital archiving systems. This dissertation proposes a new approach to OCR problems and implementation considerations that can be followed to optimise the efficiency and results of digital archive systems.
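The sketch below shows the general shape of a character-level LSTM encoder-decoder that could be trained on (OCR output, ground truth) pairs; the architecture and hyperparameters are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class Seq2SeqCorrector(nn.Module):
    """Character-level LSTM encoder-decoder for OCR post-correction (illustrative)."""
    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, noisy_ids: torch.Tensor, clean_ids: torch.Tensor) -> torch.Tensor:
        # Encode the noisy OCR line; its final state conditions the decoder.
        _, state = self.encoder(self.embed(noisy_ids))
        # Teacher forcing: the decoder reads the gold characters during training.
        dec_out, _ = self.decoder(self.embed(clean_ids), state)
        return self.out(dec_out)  # per-position logits over the character vocabulary
```

Training would minimise cross-entropy between these logits and the gold characters shifted by one position; sweeping the number of training pairs while logging accuracy and wall-clock time would mirror the size-versus-performance comparison the thesis focuses on.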
A transformer-based standardisation system for Scottish Gaelic
The transition from rule-based to neural-based architectures has made it more difficult for low-resource languages like Scottish Gaelic to participate in modern language technologies. The performance of deep-learning approaches correlates with the availability of training data, and low-resource languages have limited data reserves by definition. Historical and non-standard orthographic texts could be used to supplement training data, but manual conversion of these texts is expensive and time-consuming. This paper describes the development of a neural-based orthographic standardisation system for Scottish Gaelic and compares it to an earlier rule-based system. The best performance yielded a precision of 93.92, a recall of 92.20 and a word error rate of 11.01. This was obtained using a transformer-based mixed teacher model which was trained with augmented data.
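To make the reported metric concrete, here is a standard word error rate computation (a generic implementation, not the authors' evaluation script); the example strings are illustrative Gaelic tokens.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four gives a WER of 25%.
print(word_error_rate("tha mi gu math", "tha mi gle math"))  # -> 0.25
```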
A Large-Scale Comparison of Historical Text Normalization Systems
There is no consensus on the state-of-the-art approach to historical text
normalization. Many techniques have been proposed, including rule-based
methods, distance metrics, character-based statistical machine translation, and
neural encoder-decoder models, but studies have used different datasets,
different evaluation methods, and have come to different conclusions. This
paper presents the largest study of historical text normalization done so far.
We critically survey the existing literature and report experiments on eight
languages, comparing systems spanning all categories of proposed normalization
techniques, analysing the effect of training data quantity, and using different
evaluation methods. The datasets and scripts are made publicly available. (Accepted at NAACL 2019.)
From Arabic user-generated content to machine translation: integrating automatic error correction
With the widespread use of social media and online forums, individual users have been able to participate actively in the generation of online content in different languages and dialects. Arabic is one of the fastest-growing languages on the Internet, and dialects (such as Egyptian and Saudi Arabian) account for a large share of Arabic online content. There are many differences between Dialectal Arabic and Modern Standard Arabic, which pose many challenges for Machine Translation of informal Arabic. In this paper, we investigate the use of an Automatic Error Correction method to improve the quality of Arabic user-generated texts and their automatic translation. Our experiments show that the new system with an automatic correction module outperforms the baseline system by nearly 22.59% relative improvement.
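Purely as an illustration of what correcting dialectal user-generated text before MT can look like, the sketch below maps a few dialectal tokens to Modern Standard Arabic with a lookup table; the table entries and the lookup approach are assumptions for the example, not the method evaluated in the paper.

```python
# Toy dialect-to-MSA substitution table (illustrative entries only).
NORMALISATION = {
    "عايز": "أريد",       # Egyptian "want" -> MSA "I want"
    "ازيك": "كيف حالك",   # Egyptian greeting -> MSA "how are you"
}

def normalise_ugc(text: str) -> str:
    """Replace known dialectal tokens with MSA forms before handing text to MT."""
    return " ".join(NORMALISATION.get(token, token) for token in text.split())

print(normalise_ugc("انا عايز كتاب"))  # -> "انا أريد كتاب"
```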