114 research outputs found

    Spotting Keywords in Offline Handwritten Documents Using Hausdorff Edit Distance

    Get PDF
    Keyword spotting has become a crucial topic in handwritten document recognition, by enabling content-based retrieval of scanned documents using search terms. With a query keyword, one can search and index the digitized handwriting which in turn facilitates understanding of manuscripts. Common automated techniques address the keyword spotting problem through statistical representations. Structural representations such as graphs apprehend the complex structure of handwriting. However, they are rarely used, particularly for keyword spotting techniques, due to high computational costs. The graph edit distance, a powerful and versatile method for matching any type of labeled graph, has exponential time complexity to calculate the similarities of graphs. Hence, the use of graph edit distance is constrained to small size graphs. The recently developed Hausdorff edit distance algorithm approximates the graph edit distance with quadratic time complexity by efficiently matching local substructures. This dissertation speculates using Hausdorff edit distance could be a promising alternative to other template-based keyword spotting approaches in term of computational time and accuracy. Accordingly, the core contribution of this thesis is investigation and development of a graph-based keyword spotting technique based on the Hausdorff edit distance algorithm. The high representational power of graphs combined with the efficiency of the Hausdorff edit distance for graph matching achieves remarkable speedup as well as accuracy. In a comprehensive experimental evaluation, we demonstrate the solid performance of the proposed graph-based method when compared with state of the art, both, concerning precision and speed. The second contribution of this thesis is a keyword spotting technique which incorporates dynamic time warping and Hausdorff edit distance approaches. The structural representation of graph-based approach combined with statistical geometric features representation compliments each other in order to provide a more accurate system. The proposed system has been extensively evaluated with four types of handwriting graphs and geometric features vectors on benchmark datasets. The experiments demonstrate a performance boost in which outperforms individual systems

    Towards robust real-world historical handwriting recognition

    Get PDF
    In this thesis, we make a bridge from the past to the future by using artificial-intelligence methods for text recognition in a historical Dutch collection of the Natuurkundige Commissie that explored Indonesia (1820-1850). In spite of the successes of systems like 'ChatGPT', reading historical handwriting is still quite challenging for AI. Whereas GPT-like methods work on digital texts, historical manuscripts are only available as an extremely diverse collections of (pixel) images. Despite the great results, current DL methods are very data greedy, time consuming, heavily dependent on the human expert from the humanities for labeling and require machine-learning experts for designing the models. Ideally, the use of deep learning methods should require minimal human effort, have an algorithm observe the evolution of the training process, and avoid inefficient use of the already sparse amount of labeled data. We present several approaches towards dealing with these problems, aiming to improve the robustness of current methods and to improve the autonomy in training. We applied our novel word and line text recognition approaches on nine data sets differing in time period, language, and difficulty: three locally collected historical Latin-based data sets from Naturalis, Leiden; four public Latin-based benchmark data sets for comparability with other approaches; and two Arabic data sets. Using ensemble voting of just five neural networks, a level of accuracy was achieved which required hundreds of neural networks in earlier studies. Moreover, we increased the speed of evaluation of each training epoch without the need of labeled data

    Feature design and lexicon reduction for efficient offline handwriting recognition

    Get PDF
    This thesis establishes a pattern recognition framework for offline word recognition systems. It focuses on the image level features because they greatly influence the recognition performance. In particular, we consider two complementary aspects of prominent features impact: lexicon reduction and the actual recognition. The first aspect, lexicon reduction, consists in the design of a weak classifier which outputs a set of candidate word hypotheses given a word image. Its main purpose is to reduce the recognition computational time while maintaining (or even improving) the recognition rate. The second aspect is the actual recognition system itself. In fact, several features exist in the literature based on different fields of research, but no consensus exists concerning the most promising ones. The goal of the proposed framework is to improve our understanding of relevant features in order to build better recognition systems. For this purpose, we addressed two specific problems: 1) feature design for lexicon reduction (application to Arabic script), and 2) feature evaluation for cursive handwriting recognition (application to Latin and Arabic scripts). Few methods exist for lexicon reduction in Arabic script, unlike Latin script. Existing methods use salient features of Arabic words such as the number of subwords and diacritics, but totally ignore the shape of the subwords. Therefore, our first goal is to perform lexicon reductionn based on subwords shape. Our approach is based on shape indexing, where the shape of a query subword is compared to a labeled database of sample subwords. For efficient comparison with a low computational overhead, we proposed the weighted topological signature vector (W-TSV) framework, where the subword shape is modeled as a weighted directed acyclic graph (DAG) from which the W-TSV vector is extracted for efficient indexing. The main contributions of this work are to extend the existing TSV framework to weighted DAG and to propose a shape indexing approach for lexicon reduction. Good performance for lexicon reduction is achieved for Arabic subwords. Nevertheless, the performance remains modest for Arabic words. Considering the results of our first work on Arabic lexicon reduction, we propose to build a new index for better performance at the word level. The subword shape and the number of subwords and diacritics are all important components of Arabic word shape. We therefore propose the Arabic word descriptor (AWD) which integrates all the aforementioned components. It is built in two steps. First, a structural descriptor (SD) is computed for each connected component (CC) of the word image. It describes the CC shape using the bag-of-words model, where each visual word represents a different local shape structure. Then, the AWD is formed by concatenating the SDs using an efficient heuristic, implicitly discriminating between subwords and diacritics. In the context of lexicon reduction, the AWD is used to index a reference database. The main contribution of this work is the design of the AWD, which integrates lowlevel cues (subword shape structure) and symbolic information (subword counts and diacritics) into a single descriptor. The proposed method has a low computational overhead, it is simple to implement and it provides state-of-the-art performance for lexicon reduction on two Arabic databases, namely the Ibn Sina database of subwords and the IFN/ENIT database of words. The last part of this thesis focuses on features for word recognition. A large body of features exist in the literature, each of them being motivated by different fields, such as pattern recognition, computer vision or machine learning. Identifying the most promising approaches would improve the design of the next generation of features. Nevertheless, because they are based on different concepts, it is difficult to compare them on a theoretical ground and efficient empirical tools are needed. Therefore, the last objective of the thesis is to provide a method for feature evaluation that assesses the strength and complementarity of existing features. A combination scheme has been designed for this purpose, in which each feature is evaluated through a reference recognition system, based on recurrent neural networks. More precisely, each feature is represented by an agent, which is an instance of the recognition system trained with that feature. The decisions of all the agents are combined using a weighted vote. The weights are jointly optimized during a training phase in order to increase the weighted vote of the true word label. Therefore, they reflect the strength and complementarity of the agents and their features for the given task. Finally, they are converted into a numerical score assigned to each feature, which is easy to interpret under this combination model. To the best of our knowledge, this is the first feature evaluation method able to quantify the importance of each feature, instead of providing a ranking based on the recognition rate. Five state-of-the-art features have been tested, and our results provide interesting insight for future feature design

    Machine Learning for handwriting text recognition in historical documents

    Get PDF
    Olmos ABSTRACT In this thesis, we focus on the handwriting text recognition task over historical documents that are difficult to read for any person that is not an expert in ancient languages and writing style. We aim to take advantage and improve the neural networks architectures and techniques that other authors are proposing for handwriting text recognition in modern handwritten documents. These models perform this task very precisely when a large amount of data is available. However, the low availability of labeled data is a widespread problem in historical documents. The type of writing is singular, and it is pretty expensive to hire an expert to transcribe a large number of pages. After investigating and analyzing the state-of-the-art, we propose the efficient application of methods such as transfer learning and data augmentation. We also contribute an algorithm for purging mislabeled samples that affect the learning of models. Finally, we develop a variational auto encoder method for generating synthetic samples of handwritten text images for data augmentation. Experiments are performed on various historical handwritten text databases to validate the performance of the proposed algorithms. The various included analyses focus on the evolution of the character and word error rate (CER and WER) as we increase the training dataset. One of the most important results is the participation in a contest for transcription of historical handwritten text. The organizers provided us with a dataset of documents to train the model, then just a few labeled pages of 5 new documents were handled to adjust the solution further. Finally, the transcription of nonlabeled images was requested to evaluate the algorithm. Our method raked second in this contest

    Automatic handwriter identification using advanced machine learning

    Get PDF
    Handwriter identification a challenging problem especially for forensic investigation. This topic has received significant attention from the research community and several handwriter identification systems were developed for various applications including forensic science, document analysis and investigation of the historical documents. This work is part of an investigation to develop new tools and methods for Arabic palaeography, which is is the study of handwritten material, particularly ancient manuscripts with missing writers, dates, and/or places. In particular, the main aim of this research project is to investigate and develop new techniques and algorithms for the classification and analysis of ancient handwritten documents to support palaeographic studies. Three contributions were proposed in this research. The first is concerned with the development of a text line extraction algorithm on colour and greyscale historical manuscripts. The idea uses a modified bilateral filtering approach to adaptively smooth the images while still preserving the edges through a nonlinear combination of neighboring image values. The proposed algorithm aims to compute a median and a separating seam and has been validated to deal with both greyscale and colour historical documents using different datasets. The results obtained suggest that our proposed technique yields attractive results when compared against a few similar algorithms. The second contribution proposes to deploy a combination of Oriented Basic Image features and the concept of graphemes codebook in order to improve the recognition performances. The proposed algorithm is capable to effectively extract the most distinguishing handwriter’s patterns. The idea consists of judiciously combining a multiscale feature extraction with the concept of grapheme to allow for the extraction of several discriminating features such as handwriting curvature, direction, wrinkliness and various edge-based features. The technique was validated for identifying handwriters using both Arabic and English writings captured as scanned images using the IAM dataset for English handwriting and ICFHR 2012 dataset for Arabic handwriting. The results obtained clearly demonstrate the effectiveness of the proposed method when compared against some similar techniques. The third contribution is concerned with an offline handwriter identification approach based on the convolutional neural network technology. At the first stage, the Alex-Net architecture was employed to learn image features (handwritten scripts) and the features obtained from the fully connected layers of the model. Then, a Support vector machine classifier is deployed to classify the writing styles of the various handwriters. In this way, the test scripts can be classified by the CNN training model for further classification. The proposed approach was evaluated based on Arabic Historical datasets; Islamic Heritage Project (IHP) and Qatar National Library (QNL). The obtained results demonstrated that the proposed model achieved superior performances when compared to some similar method
    • …
    corecore