7 research outputs found

    A novel image matching approach for word spotting

    Word spotting has been adopted and used by various researchers as a complementary technique to Optical Character Recognition for document analysis and retrieval. The applications of word spotting include document indexing, image retrieval and information filtering. The important factors in word spotting techniques are pre-processing, selection and extraction of proper features, and image matching algorithms. The Correlation Similarity Measure (CORR) algorithm is considered a fast matching algorithm, originally defined for finding similarities between binary patterns. In the word spotting literature, the CORR algorithm has been used successfully to compare Gradient, Structural and Concavity (GSC) binary features extracted from binary word images. However, the problem with this approach is that binarization of images leads to a loss of very useful information. Furthermore, before extracting GSC binary features the word images must be skew corrected and slant normalized, which is not only difficult but in some cases impossible for Arabic and modified Arabic scripts. We present a new approach in which the CORR algorithm is used innovatively to compare gray-scale word images. In this approach, binarization, skew correction and slant normalization of word images are not required at all. The various features, i.e., projection profiles, word profiles and transitional features, are extracted from the gray-scale word images and converted into their binary equivalents, which are compared via the CORR algorithm with greater speed and higher accuracy. The experiments were conducted on gray-scale versions of newly created handwritten databases of the Pashto and Dari languages, written in modified Arabic scripts. For each of these languages we used 4599 words belonging to 21 different word classes collected from 219 writers. The average precision rates achieved for Pashto and Dari were 93.18% and 93.75%, respectively. The time taken for matching a pair of images was 1.43 milliseconds. In addition, we present the handwritten databases for two well-known Indo-Iranian languages, i.e., Pashto and Dari. These are large databases which contain six types of data, i.e., Dates, Isolated Digits, Numeral Strings, Isolated Characters, Different Words and Special Symbols, written by native speakers of the corresponding languages.
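    As a rough illustration of the matching step described above, the sketch below computes a correlation-style similarity between two binary feature vectors; the exact CORR normalisation used in the paper may differ, and the extraction of the projection-profile, word-profile and transitional features is not shown.

```python
import numpy as np

def corr_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Correlation-style similarity between two binary feature vectors.

    The score is the binary (phi) correlation coefficient rescaled to
    [0, 1]; the paper's exact CORR definition may use a different
    normalisation.
    """
    x, y = x.astype(bool), y.astype(bool)
    s11 = int(np.sum(x & y))      # bits set in both vectors
    s00 = int(np.sum(~x & ~y))    # bits clear in both vectors
    s10 = int(np.sum(x & ~y))     # bits set only in x
    s01 = int(np.sum(~x & y))     # bits set only in y
    denom = np.sqrt(float((s10 + s11) * (s01 + s00) * (s11 + s01) * (s00 + s10)))
    if denom == 0.0:
        return 1.0 if s10 + s01 == 0 else 0.0
    phi = (s11 * s00 - s10 * s01) / denom
    return 0.5 * (phi + 1.0)      # rescale from [-1, 1] to [0, 1]

# Compare two (hypothetical) binarised feature vectors of two word images.
a = np.array([1, 0, 1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 1, 0, 0, 1, 0, 1])
print(corr_similarity(a, b))      # 0.75
```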

    Alpha-Numerical Sequences Extraction in Handwritten Documents

    In this paper, we introduce an alpha-numerical sequence extraction system (keywords, numerical fields or alpha-numerical sequences) for unconstrained handwritten documents. Contrary to most of the approaches presented in the literature, our system relies on a global handwriting line model describing two kinds of information: (i) the relevant information and (ii) the irrelevant information, represented by a shallow parsing model. The shallow parsing of isolated text lines allows quick information extraction in any document while rejecting irrelevant information at the same time. Results on a public French incoming mail database show the efficiency of the approach.
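    A minimal sketch of the accept-versus-reject decision the abstract describes, operating here on already-recognised tokens rather than on the handwriting line model the system actually uses; the keyword lexicon and the patterns below are hypothetical placeholders.

```python
import re

# Hypothetical "relevant information" models; the real system works on
# handwriting line images, not on recognised text.
KEYWORDS = {"facture", "contrat", "client"}                     # example keyword lexicon
NUMERICAL_FIELD = re.compile(r"^\d{2,}([.,/-]\d+)*$")           # e.g. dates, amounts
ALPHANUM = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d-]+$")  # e.g. reference codes

def extract_sequences(line_tokens):
    """Keep keywords, numerical fields and alpha-numerical sequences;
    everything else falls into the irrelevant-information (reject) model."""
    hits = []
    for tok in line_tokens:
        if tok.lower() in KEYWORDS:
            hits.append((tok, "keyword"))
        elif NUMERICAL_FIELD.match(tok):
            hits.append((tok, "numerical field"))
        elif ALPHANUM.match(tok):
            hits.append((tok, "alpha-numerical sequence"))
    return hits

print(extract_sequences(["Votre", "facture", "AB-1234", "du", "12/05/2010"]))
```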

    Design of an Offline Handwriting Recognition System Tested on the Bangla and Korean Scripts

    This dissertation presents a flexible and robust offline handwriting recognition system which is tested on the Bangla and Korean scripts. Offline handwriting recognition is one of the most challenging and still unsolved problems in machine learning. While a few popular scripts (like Latin) have received a lot of attention, many other widely used scripts (like Bangla) have seen very little progress. Features such as connectedness and vowels structured as diacritics make Bangla a challenging script to recognize. A simple and robust design for offline recognition is presented which not only works reliably, but can also be applied to almost any alphabetic writing system. The framework has been rigorously tested on Bangla, and experiments on the Korean script, whose two-dimensional arrangement of characters makes it a challenge to recognize, demonstrate how the design can be transferred to other scripts. The base of this design is a character spotting network which detects the locations of different script elements (such as characters and diacritics) in an unsegmented word image. A transcript is formed from the detected classes based on their corresponding location information. This is the first reported lexicon-free offline recognition system for Bangla and achieves a Character Recognition Accuracy (CRA) of 94.8%. It is also one of the most flexible architectures presented to date. Recognition of Korean was achieved with a CRA of 91.2%. In addition, a powerful technique of autonomous tagging was developed which can drastically reduce the effort of preparing a dataset for any script. The combination of the character spotting method and autonomous tagging brings the entire offline recognition problem very close to a single, unified solution. Additionally, a database named the Boise State Bangla Handwriting Dataset was developed. This is one of the richest offline datasets currently available for Bangla and has been made publicly accessible to accelerate research progress. Many other tools were developed, and experiments were conducted to validate the framework more rigorously by evaluating the method against external datasets (CMATERdb 1.1.1, the Indic Word Dataset and REID2019: Early Indian Printed Documents). Offline handwriting recognition is an extremely promising technology, and the outcome of this research moves the field significantly ahead.
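    The transcript-formation step lends itself to a small sketch: given character-spotting detections (class label plus bounding box) from an unsegmented word image, order the base characters spatially and attach diacritics by horizontal overlap. This is a simplified illustration, not the dissertation's code, and the labels below are made up.

```python
# Simplified sketch: turn character-spotting detections into a transcript.
# Each detection is (label, x_min, y_min, x_max, y_max); real Bangla vowel
# signs need more careful ordering rules than the overlap test used here.

def detections_to_transcript(detections, diacritics=frozenset()):
    base, marks = [], []
    for label, x0, y0, x1, y1 in detections:
        (marks if label in diacritics else base).append((x0, label, x1))
    base.sort(key=lambda d: d[0])                  # left-to-right base characters
    out = []
    for x0, label, x1 in base:
        out.append(label)
        for mx0, mark, mx1 in marks:               # attach overlapping diacritics
            if mx0 < x1 and mx1 > x0:
                out.append(mark)
    return " ".join(out)

# Hypothetical detections for a two-character word carrying one diacritic.
dets = [("ka", 5, 10, 30, 60), ("i-sign", 2, 5, 12, 20), ("ta", 35, 10, 62, 60)]
print(detections_to_transcript(dets, diacritics={"i-sign"}))   # "ka i-sign ta"
```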

    Use of Markov processes in writing recognition

    In this paper, we present a brief survey on the use of different types of Markov models in writing recognition. Recognition is performed by computing the a posteriori probability of the pattern class. This computation involves several terms which, according to the dependency hypotheses specific to the considered application, can be decomposed into elementary conditional probabilities. Under the assumption that the pattern may be modeled as a one- or two-dimensional stochastic process (random field) with Markovian properties, local maximization of these probabilities yields the maximum likelihood of the pattern. Throughout the article we study several cases of conditioning the elementary sub-pattern probabilities. Each case is accompanied by practical illustrations from the field of printed and/or handwritten writing recognition.
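    In generic form (not the article's exact notation), the computation described above is the Bayes decision with a Markovian factorisation of the pattern likelihood; for a two-dimensional random field the chain product is replaced by neighbourhood conditioning:

```latex
\[
  \hat{c} = \arg\max_{c} P(c \mid X) = \arg\max_{c} P(X \mid c)\, P(c),
  \qquad
  P(X \mid c) = P(x_1 \mid c) \prod_{t=2}^{T} P(x_t \mid x_{t-1}, c),
\]
where $X = (x_1, \dots, x_T)$ is the observation sequence extracted from the pattern
and the product reflects a first-order Markov dependency hypothesis.
```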

    Keyword and regular expression detection for named entity recognition in handwritten documents

    This thesis presents a study on keyword and regular expression detection in handwritten documents, intended as a front end to a subsequent named entity recognition stage. Named entities such as first and last names, company names or numerical values often constitute the main informative part of a document, so detecting and recognizing them would allow a deep understanding of the processed document. Named entities are highly variable pieces of information whose definition depends strongly on the problem considered: the entities relevant to mail sorting (first and last names, street type and name, city name, postal code), for example, differ from those relevant to document categorization (a lexicon of domain keywords). This variability makes named entity detection difficult, even in electronic text. When dealing with images of handwritten documents, the problem is further compounded by the recognition issue: the intrinsic variability of handwriting (especially in handwritten documents) and the noise introduced by digitization. The first contribution of this thesis is an isolated word recognition engine based on a Conditional Random Field (CRF), which, according to our bibliography, had not been proposed before. The second contribution is a generic keyword and regular expression spotting system capable of detecting any sequence within a text line. A benchmark of discriminative models is proposed, in which one architecture, the BLSTM-CTC, clearly outperforms the other hybrid methods; its ability to handle very difficult queries makes it appear to be the key to solving the initial problem.
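    For reference, the standard linear-chain CRF posterior underlying the first contribution is sketched below; the thesis's actual feature functions and observation encoding are not detailed in the abstract:

```latex
\[
  P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
  \qquad
  Z(\mathbf{x}) = \sum_{\mathbf{y}'}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big),
\]
where $\mathbf{x}$ is the observation sequence for an isolated word, $\mathbf{y}$ its label
sequence, $f_k$ the feature functions and $\lambda_k$ the learned weights.
```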