
    Large vocabulary recognition for online Turkish handwriting with sublexical units

    We present a system for large vocabulary recognition of online Turkish handwriting, using hidden Markov models. While using a traditional approach for the recognizer, we have identified and developed solutions for the main problems specific to Turkish handwriting recognition. First, since large amounts of Turkish handwriting data are not available, the system is trained and optimized on the large UNIPEN dataset of English handwriting before being extended to Turkish with a small Turkish dataset. Delayed strokes, which are a significant source of variation in writing order due to the large number of diacritical marks in Turkish, are removed during preprocessing. Finally, as a solution to the high out-of-vocabulary rates encountered when using a fixed-size lexicon in general-purpose recognition, a lexicon is constructed from sublexical units (stems and endings) learned from a large Turkish corpus. A statistical bigram language model learned from the same corpus is also applied during the decoding process. The system obtains a 91.7% word recognition rate when tested on a small Turkish handwritten word dataset using a medium-sized (1,950-word) lexicon corresponding to the vocabulary of the test set, and 63.8% using a large, general-purpose lexicon (130,000 words). However, with the proposed stem+ending lexicon (12,500 words) and the bigram language model with lattice expansion, a 67.9% word recognition accuracy is obtained, surpassing the results obtained with the general-purpose lexicon while using a much smaller one.
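    As an illustration of the stem+ending lexicon and the bigram language model mentioned above, here is a minimal Python sketch (not the authors' implementation): the stem inventory, the toy corpus, and the longest-matching-stem split are illustrative assumptions, and a real system would learn the sublexical units from a large corpus and apply the bigram model inside the HMM decoder rather than in isolation.

        # Minimal sketch: split words into stem+ending units and score unit pairs
        # with an add-one-smoothed bigram model. All data below is toy data.
        from collections import Counter
        import math

        stems = {"ev", "git", "gel", "okul"}                # assumed stem inventory

        def split(word):
            """Split a word into (stem, ending) using the longest matching stem."""
            for i in range(len(word), 0, -1):
                if word[:i] in stems:
                    return word[:i], word[i:] or "<noend>"
            return word, "<noend>"                          # unknown stem: keep the whole word

        corpus = ["evler", "evde", "okulda", "gitti", "geldi"]      # toy corpus
        units = [u for w in corpus for u in split(w)]               # stream of sublexical units

        unigrams = Counter(units)
        bigrams = Counter(zip(units, units[1:]))
        vocab_size = len(unigrams)

        def bigram_logprob(prev, cur):
            """Add-one-smoothed log P(cur | prev) over sublexical units."""
            return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))

        print(split("evlerde"), bigram_logprob(*split("evlerde")))  # ('ev', 'lerde'), score

    The lattice expansion mentioned in the abstract, which composes stem and ending hypotheses during decoding, is not shown in this sketch.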

    A large vocabulary online handwriting recognition system for Turkish

    Handwriting recognition in general, and online handwriting recognition in particular, has been an active research area for several decades. Most of the research has focused on English and, more recently, on other scripts such as Arabic and Chinese. There is a lack of research on recognition of Turkish text, and this work primarily fills that gap with a state-of-the-art recognizer for the first time. It contains design and implementation details of a complete recognition system for Turkish isolated words. Based on Hidden Markov Models, the system comprises pre-processing, feature extraction, optical modeling and language modeling modules. It first considers the recognition of unconstrained handwriting with a limited vocabulary size and then evolves into a large vocabulary system. Turkish script has many similarities with other Latin scripts, such as English, which makes it possible to adapt strategies that work for them. However, some issues particular to Turkish must be taken into consideration separately. Two of the challenging issues in recognition of Turkish text are delayed strokes, which introduce an extra source of variation in the sequence order of the handwritten input, and the high Out-of-Vocabulary (OOV) rate of Turkish when words are used as vocabulary units in the decoding process. This work examines these problems and alternative solutions in depth and proposes solutions suited to Turkish script in particular. For delayed stroke handling, a clear definition of delayed strokes is first developed, and then several handling methods based on that definition are evaluated extensively on the UNIPEN and Turkish datasets. The best results are obtained by removing all delayed strokes, with recognition accuracy increases of up to 2.13 and 2.03 percentage points over the respective English and Turkish baselines. The overall system performance is 86.1% with a 1,000-word lexicon and 83.0% with a 3,500-word lexicon on the UNIPEN dataset, and 91.7% on the Turkish dataset. Alternative decoding vocabularies are designed with grammatical sub-lexical units in order to solve the problem of the high OOV rate. Additionally, statistical bi-gram and tri-gram language models are applied during the decoding process. The best performance, 67.9%, is obtained with the large stem-ending vocabulary expanded with a bi-gram model on the Turkish dataset. This result is superior to the accuracy of the word-based vocabulary (63.8%) with the same coverage of 95% on the BOUN Web Corpus.
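    The delayed-stroke handling described above can be sketched as a preprocessing filter. The heuristic below (dropping any stroke whose starting x-coordinate jumps back behind the rightmost point written so far) is only an assumed approximation of what the thesis defines much more carefully; the backtrack threshold and the toy strokes are illustrative.

        # Sketch of delayed-stroke removal; a stroke is a list of (x, y) points.
        def remove_delayed_strokes(strokes, backtrack=0.5):
            """Keep strokes whose start does not jump back behind the pen's
            rightmost x position so far by more than `backtrack` (assumed units)."""
            kept, rightmost = [], float("-inf")
            for stroke in strokes:
                start_x = stroke[0][0]
                if kept and start_x < rightmost - backtrack:
                    continue                              # treated as a delayed stroke (e.g. an i-dot)
                kept.append(stroke)
                rightmost = max(rightmost, max(x for x, _ in stroke))
            return kept

        word = [
            [(0.0, 0.0), (0.0, 1.0)],                     # body of "i"
            [(1.0, 0.0), (2.0, 1.0)],                     # next letter
            [(0.0, 1.5), (0.1, 1.6)],                     # dot of "i", written late -> removed
        ]
        print(len(remove_delayed_strokes(word)))          # 2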

    Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey

    Optical character recognition (OCR) is a vital process that involves the extraction of handwritten or printed text from scanned or printed images, converting it into a format that can be understood and processed by machines. This enables further data processing activities such as searching and editing. The automatic extraction of text through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on the prevailing techniques utilized throughout the OCR process, with a dedicated effort to identify the most effective approaches. To ensure a thorough evaluation, a meticulous keyword-search methodology is adopted, encompassing a comprehensive analysis of articles relevant to Arabic OCR, including both backward and forward citation reviews. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR. By highlighting these gaps, we shed light on potential areas for future exploration and development, thereby guiding researchers toward promising avenues in the field of Arabic OCR. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders involved in Arabic OCR, ultimately fostering advancements in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language.

    A limited-size ensemble of homogeneous CNN/LSTMs for high-performance word classification

    The strength of long short-term memory neural networks (LSTMs) lies more in handling sequences of variable length than in handling geometric variability of the image patterns. In this paper, an end-to-end convolutional LSTM neural network is used to handle both geometric variation and sequence variability. The best results for LSTMs are often based on large-scale training of an ensemble of network instances. We show that high performance can be reached on a common benchmark set with just five such networks, using proper data augmentation, a proper coding scheme, and a proper voting scheme. The networks have similar architectures (convolutional neural network (CNN): five layers; bidirectional LSTM (BiLSTM): three layers, followed by a connectionist temporal classification (CTC) processing step). The approach assumes differently scaled input images and different feature map sizes. Three datasets are used: the standard benchmark RIMES dataset (French), a historical handwritten dataset KdK (Dutch), and the standard benchmark George Washington (GW) dataset (English). The final performance obtained on the RIMES word-recognition test was 96.6%, a clear improvement over other state-of-the-art approaches that did not use a pre-trained network. On the KdK and GW datasets, our approach also shows good results. The proposed approach is deployed in the Monk search engine for historical-handwriting collections.
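    Below is a minimal PyTorch sketch of the architecture described above (five convolutional layers, a three-layer BiLSTM, and per-timestep outputs for CTC), plus the plurality voting used over the five ensemble members. The channel counts, image height, character-set size, and the toy voting example are assumptions for illustration, not the paper's exact configuration; CTC training and decoding (torch.nn.CTCLoss plus a decoder) are left out.

        import torch
        import torch.nn as nn
        from collections import Counter

        class CRNN(nn.Module):
            """5 conv layers -> 3-layer BiLSTM -> per-timestep character logits (for CTC)."""
            def __init__(self, n_chars, img_h=64):
                super().__init__()
                chans = [1, 16, 32, 48, 64, 80]
                layers = []
                for i in range(5):
                    layers += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                               nn.ReLU(),
                               nn.MaxPool2d((2, 1) if i >= 3 else 2)]   # keep width in the last pools
                self.cnn = nn.Sequential(*layers)
                feat_h = img_h // 32                          # height shrinks by 2^5
                self.rnn = nn.LSTM(80 * feat_h, 128, num_layers=3,
                                   bidirectional=True, batch_first=True)
                self.fc = nn.Linear(256, n_chars + 1)         # +1 for the CTC blank symbol

            def forward(self, x):                             # x: (batch, 1, H, W)
                f = self.cnn(x)                               # (batch, 80, H/32, W/8)
                f = f.permute(0, 3, 1, 2).flatten(2)          # (batch, W/8, 80 * H/32)
                out, _ = self.rnn(f)
                return self.fc(out).log_softmax(-1)           # (batch, T, n_chars + 1)

        def vote(predictions):
            """Plurality voting over the word labels from several model instances."""
            return Counter(predictions).most_common(1)[0][0]

        models = [CRNN(n_chars=26) for _ in range(5)]         # five independently trained nets
        logits = [m(torch.randn(1, 1, 64, 128)) for m in models]   # decode each with CTC, then vote
        print(vote(["maison", "maison", "mais0n", "maison", "raison"]))   # -> "maison"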

    Learning-Based Arabic Word Spotting Using a Hierarchical Classifier

    The effective retrieval of information from scanned and written documents is becoming essential with the increasing amounts of digitized documents, and therefore developing efficient means of analyzing and recognizing these documents is of significant interest. Among these methods is word spotting, which has recently become an active research area. Such systems have been implemented for Latin-based and Chinese languages, while few have been implemented for Arabic handwriting. The fact that Arabic writing is cursive by nature and unconstrained, with no clear white space between words, makes the processing of Arabic handwritten documents a more challenging problem. In this thesis, the design and implementation of a learning-based Arabic handwritten word spotting system is presented. This incorporates text line extraction, handwritten word recognition, partial segmentation of words, word spotting and, finally, validation of the spotted words. The Arabic text line is less constrained than that of other scripts, essentially because it also includes small connected components such as dots and diacritics that are usually located between lines. Thus, a robust method to extract text lines that takes into consideration the challenges of Arabic handwriting is proposed. The method is evaluated on two Arabic handwritten document databases, and the results are compared with those of two other methods for text line extraction. The results show that the proposed method is effective and compares favorably with the other methods. Word spotting is an automatic process to search for words within a document. Applying this process to handwritten Arabic documents is challenging due to the absence of a clear space between handwritten words. To address this problem, an effective learning-based method for Arabic handwritten word spotting is proposed and presented in this thesis. For this process, sub-words or pieces of Arabic words form the basic components of the search, and a hierarchical classifier is implemented to integrate statistical language models with the segmentation of an Arabic text line into sub-words. The holistic and analytical paradigms (for word recognition and spotting) are studied, and verification models based on combining these two paradigms are proposed and implemented to refine the outcomes of the analytical classifier that spots words. Finally, a series of evaluation and testing experiments is conducted to evaluate the effectiveness of the proposed systems, showing that promising results have been obtained.
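    One concrete step from the pipeline above, the segmentation of a text line into sub-words (pieces of Arabic words), can be approximated with connected-component analysis. The sketch below merges components whose horizontal spans overlap or lie within a small gap; the gap threshold and the toy image are assumptions, and the thesis' actual method (including diacritic handling and the hierarchical classifier on top) is considerably more involved.

        import numpy as np
        from scipy import ndimage

        def subword_candidates(binary_line, gap=5):
            """Group connected components of a binarized text line (ink = 1) into
            sub-word candidates by merging components whose horizontal spans
            overlap or lie within `gap` pixels of each other."""
            labels, _ = ndimage.label(binary_line)
            spans = sorted((sl[1].start, sl[1].stop) for sl in ndimage.find_objects(labels))
            groups = []
            for x0, x1 in spans:
                if groups and x0 <= groups[-1][1] + gap:      # close enough: same sub-word
                    groups[-1][1] = max(groups[-1][1], x1)
                else:
                    groups.append([x0, x1])
            return [tuple(g) for g in groups]

        line = np.zeros((10, 60), dtype=int)                  # toy binarized text line
        line[3:7, 2:10] = 1
        line[3:7, 12:18] = 1                                  # 2-pixel gap -> merged with the first blob
        line[3:7, 40:50] = 1                                  # far away -> separate candidate
        print(subword_candidates(line))                       # [(2, 18), (40, 50)]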

    A new representation for matching words

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Sciences of Bilkent University, 2007. Thesis (Master's), Bilkent University, 2007. Includes bibliographical references (leaves 77-82). Large archives of historical documents present challenges to many researchers all over the world. However, these archives remain inaccessible, since manual indexing and transcription of such a huge volume is difficult. In addition, electronic imaging tools and image processing techniques gain importance with the rapid increase in the digitization of materials in libraries and archives. In this thesis, a language-independent method is proposed for the representation of word images, which leads to retrieval and indexing of documents. While character recognition methods suffer from preprocessing and overtraining, we use another method, based on extracting words from documents and representing each word image with the features of invariant regions. The bag-of-words approach, which has been shown to be successful for classifying objects and scenes, is adapted for matching words. Since curvature or connection points, or the dots, are important visual features for distinguishing two words from each other, we make use of salient points, which have been shown to be successful at representing such distinctive areas and are heavily used for matching. The Difference of Gaussian (DoG) detector, which finds scale-invariant regions, and the Harris Affine detector, which detects affine-invariant regions, are used for detecting such areas, and the detected keypoints are described with Scale Invariant Feature Transform (SIFT) features. Then, each word image is represented by a set of visual terms obtained by vector quantization of the SIFT descriptors, and similar words are matched based on the similarity of these representations using different distance measures. These representations are used both for document retrieval and for word spotting. The experiments are carried out on Arabic, Latin and Ottoman datasets, which include different writing styles and different writers. The results show that the proposed method is successful at retrieval and indexing of documents, even with different scripts and different writers, and since it is language independent, it can easily be adapted to other languages as well. The retrieval performance of the system is comparable to state-of-the-art methods in this field. In addition, the system is successful at capturing semantic similarities, which is useful for indexing, and it does not include any supervision step. Ataer, Esra. M.S.
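    The bag-of-visual-words matching described above can be sketched compactly with off-the-shelf components: DoG keypoints and SIFT descriptors from OpenCV, k-means vector quantization into visual terms, and a histogram distance between word images. The vocabulary size, the cosine distance, and the placeholder image paths are illustrative choices; the thesis also evaluates Harris-Affine regions and other distance measures.

        import cv2
        import numpy as np
        from sklearn.cluster import KMeans
        from scipy.spatial.distance import cosine

        sift = cv2.SIFT_create()                              # DoG keypoints + SIFT descriptors

        def descriptors(img_path):
            img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            _, desc = sift.detectAndCompute(img, None)
            return desc if desc is not None else np.empty((0, 128), np.float32)

        def build_vocabulary(all_descriptors, k=200):
            """Vector-quantize SIFT descriptors into k visual terms."""
            return KMeans(n_clusters=k, n_init=10).fit(np.vstack(all_descriptors))

        def bow_histogram(desc, vocab):
            """Represent one word image as a normalized histogram of visual terms."""
            hist = np.bincount(vocab.predict(desc), minlength=vocab.n_clusters).astype(float)
            return hist / (hist.sum() or 1.0)

        # Usage sketch (file names are placeholders): similar words give a small distance.
        # train = [descriptors(p) for p in ["word1.png", "word2.png", "word3.png"]]
        # vocab = build_vocabulary(train)
        # print(cosine(bow_histogram(train[0], vocab), bow_histogram(train[1], vocab)))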

    Advances in Document Layout Analysis

    Handwritten Text Segmentation (HTS) is a task within the Document Layout Analysis field that aims to detect and extract the different page regions of interest found in handwritten documents. HTS remains an active topic that has gained importance over the years, due to the increasing demand to provide textual access to the myriads of handwritten document collections held by archives and libraries. This thesis considers HTS as a task that must be tackled in two specialized phases: detection and extraction. We see the detection phase fundamentally as a recognition problem that yields the vertical positions of each region of interest as a by-product. The extraction phase consists in calculating the best contour coordinates of the region using the position information provided by the detection phase. Our proposed detection approach allows us to attack both higher-level regions (paragraphs, diagrams, etc.) and lower-level regions such as text lines. In the case of text line detection, we model the problem to ensure that the vertical position yielded by the system approximates the fictitious line that connects the lower part of the grapheme bodies in a text line, commonly known as the baseline. One of the main contributions of this thesis is that the proposed modelling approach allows us to include prior information regarding the layout of the documents being processed. This is performed via a Vertical Layout Model (VLM). We develop a Hidden Markov Model (HMM) based framework to tackle both region detection and classification as an integrated task, and we study the performance and ease of use of the proposed approach on many corpora. We review the modelling simplicity of our approach for processing regions at different levels of information: text lines, paragraphs, titles, etc. We study the impact of adding deterministic and/or probabilistic prior information and restrictions via the VLM that our approach provides. Having a separate phase that accurately yields the detection position (baselines in the case of text lines) of each region greatly simplifies the problem that must be tackled during the extraction phase. In this thesis we propose to use a distance map that takes into consideration the grey-scale information in the image. This allows us to yield extraction frontiers which are equidistant to the adjacent text regions. We study how the accuracy of our approach scales with the quality of the provided detection position. Our extraction approach gives near-perfect results when human-reviewed baselines are provided.
    Bosch Campos, V. (2020). Advances in Document Layout Analysis [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/138397
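    The distance-map-based extraction step described above can be approximated with a simple sketch. Here the grey-scale-aware distance map of the thesis is replaced by a plain Euclidean distance transform of a thresholded image, and the frontier is taken per column as the row between two detected baselines that is farthest from any ink; the threshold, the toy image, and the per-column argmax are illustrative simplifications, not the author's method.

        import numpy as np
        from scipy import ndimage

        def separating_frontier(gray, upper_baseline, lower_baseline, ink_threshold=128):
            """For each column, pick the row between two detected baselines that is
            farthest from any ink pixel, giving a polyline that separates the two
            adjacent text regions."""
            ink = gray < ink_threshold                        # dark pixels = ink
            dist = ndimage.distance_transform_edt(~ink)       # distance to nearest ink pixel
            frontier = []
            for col in range(gray.shape[1]):
                rows = np.arange(upper_baseline[col], lower_baseline[col] + 1)
                frontier.append(int(rows[np.argmax(dist[rows, col])]))
            return frontier                                   # one row index per column

        img = np.full((40, 30), 255, dtype=np.uint8)          # toy page strip
        img[5:10, :] = 0                                      # upper text line (ink)
        img[30:35, :] = 0                                     # lower text line (ink)
        top = np.full(30, 9)                                  # detected upper baseline, per column
        bot = np.full(30, 34)                                 # detected lower baseline, per column
        print(separating_frontier(img, top, bot)[:5])         # rows roughly midway (~19)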