6 research outputs found
Machine Learning for handwriting text recognition in historical documents
Olmos
ABSTRACT
In this thesis, we focus on the handwriting text recognition task over historical
documents that are difficult to read for any person that is not an expert in ancient
languages and writing style.
We aim to take advantage and improve the neural networks architectures and
techniques that other authors are proposing for handwriting text recognition in
modern handwritten documents. These models perform this task very precisely
when a large amount of data is available. However, the low availability of labeled
data is a widespread problem in historical documents. The type of writing is
singular, and it is pretty expensive to hire an expert to transcribe a large number
of pages.
After investigating and analyzing the state-of-the-art, we propose the efficient
application of methods such as transfer learning and data augmentation. We also
contribute an algorithm for purging mislabeled samples that affect the learning of
models. Finally, we develop a variational auto encoder method for generating
synthetic samples of handwritten text images for data augmentation.
Experiments are performed on various historical handwritten text databases to
validate the performance of the proposed algorithms. The various included
analyses focus on the evolution of the character and word error rate (CER and
WER) as we increase the training dataset.
One of the most important results is the participation in a contest for transcription
of historical handwritten text. The organizers provided us with a dataset of
documents to train the model, then just a few labeled pages of 5 new documents
were handled to adjust the solution further. Finally, the transcription of nonlabeled
images was requested to evaluate the algorithm. Our method raked
second in this contest
ICFHR 2018 Competition on recognition of historical Arabic scientific manuscripts - RASM2018
This paper presents an objective comparative evaluation of page analysis and recognition methods for historical scientific manuscripts with text in Arabic language and script. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICFHR2018, presenting the results of the evaluation of six methods – three submitted and three baseline systems. The challenges for the participants included page segmentation, text line detection, and optical character recognition (OCR). Different evaluation metrics were used to gain an insight into the algorithms, including new character accuracy metrics to better reflect the difficult circumstances presented by the documents. The results indicate that, despite the challenging nature of the material, useful digitisation outputs can be produced
Handwritten text generation and strikethrough characters augmentation
We introduce two data augmentation techniques, which, used with a Resnet-BiLSTM-CTC network, significantly reduce Word Error Rate and Character Error Rate beyond best-reported results on handwriting text recognition tasks. We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten text generation method based on printed text (StackMix), which proved to be very effective in handwriting text recognition tasks. StackMix uses weakly-supervised framework to get character boundaries. Because these data augmentation techniques are independent of the network used, they could also be applied to enhance the performance of other networks and approaches to handwriting text recognition. Extensive experiments on ten handwritten text datasets show that HandWritten Blots augmentation and StackMix significantly improve the quality of handwriting text recognition models