6 research outputs found

The Adaptability of a Transformer-Based OCR Model for Historical Documents

Author: Boente Walter
Hodel Tobias
Ströbel Phillip
Volk Martin
Publication venue: Springer
Publication date: 01/01/2023

Machine Learning for handwriting text recognition in historical documents

Author: Aradillas Jaramillo Jose Carlos
Publication date: 17/12/2021

In this thesis, we focus on the handwriting text recognition task over historical documents that are difficult to read for anyone who is not an expert in ancient languages and writing styles. We aim to take advantage of, and improve, the neural network architectures and techniques that other authors have proposed for handwriting text recognition in modern handwritten documents. These models perform this task very precisely when a large amount of data is available. However, the low availability of labeled data is a widespread problem in historical documents: the type of writing is singular, and it is expensive to hire an expert to transcribe a large number of pages. After investigating and analyzing the state of the art, we propose the efficient application of methods such as transfer learning and data augmentation. We also contribute an algorithm for purging mislabeled samples that affect the learning of models. Finally, we develop a variational autoencoder method for generating synthetic samples of handwritten text images for data augmentation. Experiments are performed on various historical handwritten text databases to validate the performance of the proposed algorithms. The included analyses focus on the evolution of the character and word error rates (CER and WER) as we increase the size of the training dataset. One of the most important results is our participation in a contest for the transcription of historical handwritten text. The organizers provided us with a dataset of documents to train the model; then just a few labeled pages of 5 new documents were handed over to adjust the solution further. Finally, the transcription of unlabeled images was requested to evaluate the algorithm. Our method ranked second in this contest.

idUS. Depósito de Investigación Universidad de Sevilla
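
The character and word error rates (CER and WER) that the abstract above tracks are commonly defined as the edit distance between a reference transcription and a model's output, divided by the reference length. The following is a minimal sketch under that common definition; it is not taken from the thesis, and the function names `edit_distance`, `cer`, and `wer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is one reason they are reported as error rates and not as accuracies.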

Flexible Techniques for Automatic Text Recognition of Historical Documents

Author: Ströbel Phillip
Publication date: 01/01/2023

ZORA

ICFHR 2018 Competition on recognition of historical Arabic scientific manuscripts - RASM2018

Author: Antonacopoulos A
Clausner C
McGregor N
Wilson-Nunn D
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 20/12/2018

This paper presents an objective comparative evaluation of page analysis and recognition methods for historical scientific manuscripts with text in the Arabic language and script. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICFHR2018, presenting the results of the evaluation of six methods: three submitted and three baseline systems. The challenges for the participants included page segmentation, text line detection, and optical character recognition (OCR). Different evaluation metrics were used to gain an insight into the algorithms, including new character accuracy metrics to better reflect the difficult circumstances presented by the documents. The results indicate that, despite the challenging nature of the material, useful digitisation outputs can be produced.

University of Salford Institutional Repository

Crossref

Handwritten text generation and strikethrough characters augmentation

Author: Chertok A.V.
Dimitrov D.V.
Karachev D.K.
Novopoltsev M.Y.
Potanin M.S.
Shonenkov A.V.
Publication venue: Samara State National Research University
Publication date: 01/06/2022

We introduce two data augmentation techniques, which, used with a ResNet-BiLSTM-CTC network, significantly reduce Word Error Rate and Character Error Rate beyond best-reported results on handwriting text recognition tasks. We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten text generation method based on printed text (StackMix), which proved to be very effective in handwriting text recognition tasks. StackMix uses a weakly supervised framework to get character boundaries. Because these data augmentation techniques are independent of the network used, they could also be applied to enhance the performance of other networks and approaches to handwriting text recognition. Extensive experiments on ten handwritten text datasets show that HandWritten Blots augmentation and StackMix significantly improve the quality of handwriting text recognition models.

Samara University