68 research outputs found

    Applying Genetic Algorithm in Multi Language\u27s Characters Recognition

    Get PDF

    Non-english and non-latin signature verification systems: A survey

    Full text link
    Signatures continue to be an important biometric because they remain widely used as a means of personal verification and therefore an automatic verification system is needed. Manual signature-based authentication of a large number of documents is a difficult and time consuming task. Consequently for many years, in the field of protected communication and financial applications, we have observed an explosive growth in biometric personal authentication systems that are closely connected with measurable unique physical characteristics (e.g. hand geometry, iris scan, finger prints or DNA) or behavioural features. Substantial research has been undertaken in the field of signature verification involving English signatures, but to the best of our knowledge, very few works have considered non-English signatures such as Chinese, Japanese, Arabic etc. In order to convey the state-of-the-art in the field to researchers, in this paper we present a survey of non-English and non-Latin signature verification systems

    Content Recognition and Context Modeling for Document Analysis and Retrieval

    Get PDF
    The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge. In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting. Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (\eg, signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification. Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features. Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance

    Learning-Based Arabic Word Spotting Using a Hierarchical Classifier

    Get PDF
    The effective retrieval of information from scanned and written documents is becoming essential with the increasing amounts of digitized documents, and therefore developing efficient means of analyzing and recognizing these documents is of significant interest. Among these methods is word spotting, which has recently become an active research area. Such systems have been implemented for Latin-based and Chinese languages, while few of them have been implemented for Arabic handwriting. The fact that Arabic writing is cursive by nature and unconstrained, with no clear white space between words, makes the processing of Arabic handwritten documents a more challenging problem. In this thesis, the design and implementation of a learning-based Arabic handwritten word spotting system is presented. This incorporates the aspects of text line extraction, handwritten word recognition, partial segmentation of words, word spotting and finally validation of the spotted words. The Arabic text line is more unconstrained than that of other scripts, essentially since it also includes small connected components such as dots and diacritics that are usually located between lines. Thus, a robust method to extract text lines that takes into consideration the challenges in the Arabic handwriting is proposed. The method is evaluated on two Arabic handwritten documents databases, and the results are compared with those of two other methods for text line extraction. The results show that the proposed method is effective, and compares favorably with the other methods. Word spotting is an automatic process to search for words within a document. Applying this process to handwritten Arabic documents is challenging due to the absence of a clear space between handwritten words. To address this problem, an effective learning-based method for Arabic handwritten word spotting is proposed and presented in this thesis. For this process, sub-words or pieces of Arabic words form the basic components of the search process, and a hierarchical classifier is implemented to integrate statistical language models with the segmentation of an Arabic text line into sub-words. The holistic and analytical paradigms (for word recognition and spotting) are studied, and verification models based on combining these two paradigms have been proposed and implemented to refine the outcomes of the analytical classifier that spots words. Finally, a series of evaluation and testing experiments have been conducted to evaluate the effectiveness of the proposed systems, and these show that promising results have been obtained

    ONLINE ARABIC TEXT RECOGNITION USING STATISTICAL TECHNIQUES

    Get PDF

    A Reevaluation and Benchmark of Hidden Markov Models

    Get PDF
    Hidden Markov models are frequently used in handwriting-recognition applications. While a large number of methodological variants have been developed to accommodate different use cases, the core concepts have not been changed much. In this paper, we develop a number of datasets to benchmark our own implementation as well as various other tool kits. We introduce a gradual scale of difficulty that allows comparison of datasets in terms of separability of classes. Two experiments are performed to review the basic HMM functions, especially aimed at evaluating the role of the transition probability matrix. We found that the transition matrix may be far less important than the observation probabilities. Furthermore, the traditional training methods are not always able to find the proper (true) topology of the transition matrix. These findings support the view that the quality of the features may require more attention than the aspect of temporal modelling addressed by HMMs
    corecore