Machine Learning for handwriting text recognition in historical documents
Olmos
ABSTRACT
In this thesis, we focus on the handwriting text recognition task over historical
documents that are difficult to read for anyone who is not an expert in ancient
languages and writing styles.
We aim to take advantage of, and improve on, the neural network architectures and
techniques that other authors have proposed for handwriting text recognition in
modern handwritten documents. These models perform this task very accurately
when a large amount of data is available. However, the low availability of labeled
data is a widespread problem in historical documents. The type of writing is
singular, and it is expensive to hire an expert to transcribe a large number
of pages.
After investigating and analyzing the state-of-the-art, we propose the efficient
application of methods such as transfer learning and data augmentation. We also
contribute an algorithm for purging mislabeled samples that affect the learning of
models. Finally, we develop a variational autoencoder method for generating
synthetic samples of handwritten text images for data augmentation.
Experiments are performed on various historical handwritten text databases to
validate the performance of the proposed algorithms. The included
analyses focus on the evolution of the character and word error rates (CER and
WER) as the size of the training dataset increases.
One of the most important results is our participation in a contest for the transcription
of historical handwritten text. The organizers provided us with a dataset of
documents to train the model; then just a few labeled pages of 5 new documents
were handed over to adjust the solution further. Finally, the transcription of unlabeled
images was requested to evaluate the algorithm. Our method ranked
second in this contest.
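The character and word error rates mentioned in the abstract above are conventionally computed as an edit distance divided by the length of the reference transcription. The following is a minimal sketch in Python, assuming that standard Levenshtein-based definition; the abstract does not describe the evaluation code actually used, and the function names here are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is why they are reported as error rates rather than accuracies.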
Towards robust real-world historical handwriting recognition
In this thesis, we build a bridge from the past to the future by using artificial-intelligence methods for text recognition in a historical Dutch collection of the Natuurkundige Commissie, which explored Indonesia (1820-1850). In spite of the successes of systems like 'ChatGPT', reading historical handwriting is still quite challenging for AI. Whereas GPT-like methods work on digital texts, historical manuscripts are only available as an extremely diverse collection of (pixel) images. Despite their great results, current deep learning (DL) methods are very data-greedy and time-consuming, depend heavily on human experts from the humanities for labeling, and require machine-learning experts to design the models. Ideally, the use of deep learning methods should require minimal human effort, have an algorithm observe the evolution of the training process, and avoid inefficient use of the already sparse labeled data. We present several approaches to dealing with these problems, aiming to improve the robustness of current methods and the autonomy of training. We applied our novel word and line text recognition approaches to nine data sets differing in time period, language, and difficulty: three locally collected historical Latin-based data sets from Naturalis, Leiden; four public Latin-based benchmark data sets for comparability with other approaches; and two Arabic data sets. Using ensemble voting of just five neural networks, we achieved a level of accuracy that required hundreds of neural networks in earlier studies. Moreover, we increased the speed of evaluation of each training epoch without the need of labeled data
- …