6 research outputs found

    Machine Learning for handwriting text recognition in historical documents

    Olmos ABSTRACT In this thesis, we focus on the handwriting text recognition task over historical documents that are difficult to read for anyone who is not an expert in ancient languages and writing styles. We aim to take advantage of and improve the neural network architectures and techniques that other authors have proposed for handwriting text recognition in modern handwritten documents. These models perform this task very precisely when a large amount of data is available. However, the low availability of labeled data is a widespread problem in historical documents: the type of writing is singular, and it is expensive to hire an expert to transcribe a large number of pages. After investigating and analyzing the state of the art, we propose the efficient application of methods such as transfer learning and data augmentation. We also contribute an algorithm for purging mislabeled samples that affect the learning of models. Finally, we develop a variational autoencoder method for generating synthetic samples of handwritten text images for data augmentation. Experiments are performed on various historical handwritten text databases to validate the performance of the proposed algorithms. The included analyses focus on the evolution of the character and word error rates (CER and WER) as the training dataset grows. One of the most important results is our participation in a contest for the transcription of historical handwritten text. The organizers provided a dataset of documents to train the model; then just a few labeled pages from 5 new documents were handed over to adjust the solution further. Finally, the transcription of unlabeled images was requested to evaluate the algorithm. Our method ranked second in this contest.

    ICFHR 2018 Competition on recognition of historical Arabic scientific manuscripts - RASM2018

    This paper presents an objective comparative evaluation of page analysis and recognition methods for historical scientific manuscripts with text in Arabic language and script. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICFHR2018, presenting the results of the evaluation of six methods: three submitted and three baseline systems. The challenges for the participants included page segmentation, text line detection, and optical character recognition (OCR). Different evaluation metrics were used to gain insight into the algorithms, including new character accuracy metrics to better reflect the difficult circumstances presented by the documents. The results indicate that, despite the challenging nature of the material, useful digitisation outputs can be produced.

    Handwritten text generation and strikethrough characters augmentation

    We introduce two data augmentation techniques which, used with a Resnet-BiLSTM-CTC network, significantly reduce Word Error Rate and Character Error Rate beyond the best-reported results on handwriting text recognition tasks. We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten text generation method based on printed text (StackMix), which proved to be very effective in handwriting text recognition tasks. StackMix uses a weakly supervised framework to obtain character boundaries. Because these data augmentation techniques are independent of the network used, they could also be applied to enhance the performance of other networks and approaches to handwriting text recognition. Extensive experiments on ten handwritten text datasets show that HandWritten Blots augmentation and StackMix significantly improve the quality of handwriting text recognition models.