2,571 research outputs found

    Cross-document word matching for segmentation and retrieval of Ottoman divans

    Get PDF
    Cataloged from PDF version of article.Motivated by the need for the automatic indexing and analysis of huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve and make alive this heritage, in this study we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are dif- ficult to segment into words without a prior knowledge of the word. In this study, using the idea that divans have multiple copies (versions) by different writers in different writing styles, and word segmentation in some of those versions may be relatively easier to achieve than in other versions, segmentation of the versions (which are difficult, if not impossible, with traditional techniques) is performed using information carried from the simpler version. One version of a document is used as the source dataset and the other version of the same document is used as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for a novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequence of sub-words. We improve the performance of simple features through considering the words in a context. The method is applied on two versions of Layla and Majnun divan by Fuzuli. The results show that, the proposed word-matching-based segmentation method is promising in finding the word boundaries and in retrieving the words across documents

    A line-based representation for matching words in historical manuscripts

    Get PDF
    Cataloged from PDF version of article.In this study, we propose a new method for retrieving and recognizing words in historical documents. We represent word images with a set of line segments. Then we provide a criterion for word matching based on matching the lines. We carry out experiments on a benchmark dataset consisting of manuscripts by George Washington, as well as on Ottoman manuscripts. (C) 2011 Elsevier B.V. All rights reserved

    End-Shape Analysis for Automatic Segmentation of Arabic Handwritten Texts

    Get PDF
    Word segmentation is an important task for many methods that are related to document understanding especially word spotting and word recognition. Several approaches of word segmentation have been proposed for Latin-based languages while a few of them have been introduced for Arabic texts. The fact that Arabic writing is cursive by nature and unconstrained with no clear boundaries between the words makes the processing of Arabic handwritten text a more challenging problem. In this thesis, the design and implementation of an End-Shape Letter (ESL) based segmentation system for Arabic handwritten text is presented. This incorporates four novel aspects: (i) removal of secondary components, (ii) baseline estimation, (iii) ESL recognition, and (iv) the creation of a new off-line CENPARMI ESL database. Arabic texts include small connected components, also called secondary components. Removing these components can improve the performance of several systems such as baseline estimation. Thus, a robust method to remove secondary components that takes into consideration the challenges in the Arabic handwriting is introduced. The methods reconstruct the image based on some criteria. The results of this method were subsequently compared with those of two other methods that used the same database. The results show that the proposed method is effective. Baseline estimation is a challenging task for Arabic texts since it includes ligature, overlapping, and secondary components. Therefore, we propose a learning-based approach that addresses these challenges. Our method analyzes the image and extracts baseline dependent features. Then, the baseline is estimated using a classifier. Algorithms dealing with text segmentation usually analyze the gaps between connected components. These algorithms are based on metric calculation, finding threshold, and/or gap classification. We use two well-known metrics: bounding box and convex hull to test metric-based method on Arabic handwritten texts, and to include this technique in our approach. To determine the threshold, an unsupervised learning approach, known as the Gaussian Mixture Model, is used. Our ESL-based segmentation approach extracts the final letter of a word using rule-based technique and recognizes these letters using the implemented ESL classifier. To demonstrate the benefit of text segmentation, a holistic word spotting system is implemented. For this system, a word recognition system is implemented. A series of experiments with different sets of features are conducted. The system shows promising results

    OTS: A One-shot Learning Approach for Text Spotting in Historical Manuscripts

    Full text link
    Historical manuscript processing poses challenges like limited annotated training data and novel class emergence. To address this, we propose a novel One-shot learning-based Text Spotting (OTS) approach that accurately and reliably spots novel characters with just one annotated support sample. Drawing inspiration from cognitive research, we introduce a spatial alignment module that finds, focuses on, and learns the most discriminative spatial regions in the query image based on one support image. Especially, since the low-resource spotting task often faces the problem of example imbalance, we propose a novel loss function called torus loss which can make the embedding space of distance metric more discriminative. Our approach is highly efficient and requires only a few training samples while exhibiting the remarkable ability to handle novel characters, and symbols. To enhance dataset diversity, a new manuscript dataset that contains the ancient Dongba hieroglyphics (DBH) is created. We conduct experiments on publicly available VML-HD, TKH, NC datasets, and the new proposed DBH dataset. The experimental results demonstrate that OTS outperforms the state-of-the-art methods in one-shot text spotting. Overall, our proposed method offers promising applications in the field of text spotting in historical manuscripts

    A new representation for matching words

    Get PDF
    Ankara : The Department of Computer Engineering and the Institute of Engineering and Sciences of Bilkent University, 2007.Thesis (Master's) -- Bilkent University, 2007.Includes bibliographical references leaves 77-82.Large archives of historical documents are challenging to many researchers all over the world. However, these archives remain inaccessible since manual indexing and transcription of such a huge volume is difficult. In addition, electronic imaging tools and image processing techniques gain importance with the rapid increase in digitalization of materials in libraries and archives. In this thesis, a language independent method is proposed for representation of word images, which leads to retrieval and indexing of documents. While character recognition methods suffer from preprocessing and overtraining, we make use of another method, which is based on extracting words from documents and representing each word image with the features of invariant regions. The bag-of-words approach, which is shown to be successful to classify objects and scenes, is adapted for matching words. Since the curvature or connection points, or the dots are important visual features to distinct two words from each other, we make use of the salient points which are shown to be successful in representing such distinctive areas and heavily used for matching. Difference of Gaussian (DoG) detector, which is able to find scale invariant regions, and Harris Affine detector, which detects affine invariant regions, are used for detection of such areas and detected keypoints are described with Scale Invariant Feature Transform (SIFT) features. Then, each word image is represented by a set of visual terms which are obtained by vector quantization of SIFT descriptors and similar words are matched based on the similarity of these representations by using different distance measures. These representations are used both for document retrieval and word spotting. The experiments are carried out on Arabic, Latin and Ottoman datasets, which included different writing styles and different writers. The results show that the proposed method is successful on retrieval and indexing of documents even if with different scripts and different writers and since it is language independent, it can be easily adapted to other languages as well. Retrieval performance of the system is comparable to the state of the art methods in this field. In addition, the system is succesfull on capturing semantic similarities, which is useful for indexing, and it does not include any supervising step.Ataer, EsraM.S

    Cross-document word matching for segmentation and retrieval of Ottoman divans

    Get PDF
    Motivated by the need for the automatic indexing and analysis of huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve and make alive this heritage, in this study we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are difficult to segment into words without a prior knowledge of the word. In this study, using the idea that divans have multiple copies (versions) by different writers in different writing styles, and word segmentation in some of those versions may be relatively easier to achieve than in other versions, segmentation of the versions (which are difficult, if not impossible, with traditional techniques) is performed using information carried from the simpler version. One version of a document is used as the source dataset and the other version of the same document is used as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for a novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequence of sub-words. We improve the performance of simple features through considering the words in a context. The method is applied on two versions of Layla and Majnun divan by Fuzuli. The results show that, the proposed word-matching-based segmentation method is promising in finding the word boundaries and in retrieving the words across documents. © 2014, Springer-Verlag London

    A Line-based representation for matching words

    Get PDF
    Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009.Thesis (Master's) -- Bilkent University, 2009.Includes bibliographical references leaves 46-49.With the increase of the number of documents available in the digital environment, efficient access to the documents becomes crucial. Manual indexing of the documents is costly; however, and can be carried out only in limited amounts. Therefore, automatic analysis of documents is crucial. Although plenty of effort has been spent on optical character recognition (OCR), most of the existing OCR systems fail to address the challenge of recognizing characters in historical documents on account of the poor quality of old documents, the high level of noise factors, and the variety of scripts. More importantly, OCR systems are usually language dependent and not available for all languages. Word spotting techniques have been recently proposed to access the historical documents with the idea that humans read whole words at a time. In these studies the words rather than the characters are considered as the basic units. Due to the poor quality of historical documents, the representation and matching of words continue to be challenging problems for word spotting. In this study we address these challenges and propose a simple but effective method for the representation of word images by a set of line descriptors. Then, two different matching criteria making use of the line-based representation are proposed. We apply our methods on the word spotting and redif extraction tasks. The proposed line-based representation does not require any specific pre-processing steps, and is applicable to different languages and scripts. In word spotting task, our results provide higher scores than the existing word spotting studies in terms of retrieval and recognition performances. In the redif extraction task, we obtain promising results providing a motivation for further and advanced studies on Ottoman literary texts.Can, Ethem FatihM.S

    A line-based representation for matching words in historical manuscripts

    Get PDF
    In this study, we propose a new method for retrieving and recognizing words in historical documents. We represent word images with a set of line segments. Then we provide a criterion for word matching based on matching the lines. We carry out experiments on a benchmark dataset consisting of manuscripts by George Washington, as well as on Ottoman manuscripts. © 2011 Elsevier B.V. All rights reserved

    HMM word graph based keyword spotting in handwritten document images

    Full text link
    [EN] Line-level keyword spotting (KWS) is presented on the basis of frame-level word posterior probabilities. These posteriors are obtained using word graphs derived from the recogni- tion process of a full-fledged handwritten text recognizer based on hidden Markov models and N-gram language models. This approach has several advantages. First, since it uses a holistic, segmentation-free technology, it does not require any kind of word or charac- ter segmentation. Second, the use of language models allows the context of each spotted word to be taken into account, thereby considerably increasing KWS accuracy. And third, the proposed KWS scores are based on true posterior probabilities, taking into account all (or most) possible word segmentations of the input image. These scores are properly bounded and normalized. This mathematically clean formulation lends itself to smooth, threshold-based keyword queries which, in turn, permit comfortable trade-offs between search precision and recall. Experiments are carried out on several historic collections of handwritten text images, as well as a well-known data set of modern English handwrit- ten text. According to the empirical results, the proposed approach achieves KWS results comparable to those obtained with the recently-introduced "BLSTM neural networks KWS" approach and clearly outperform the popular, state-of-the-art "Filler HMM" KWS method. Overall, the results clearly support all the above-claimed advantages of the proposed ap- proach.This work has been partially supported by the Generalitat Valenciana under the Prometeo/2009/014 project grant ALMA-MATER, and through the EU projects: HIMANIS (JPICH programme, Spanish grant Ref. PCIN-2015-068) and READ (Horizon 2020 programme, grant Ref. 674943).Toselli, AH.; Vidal, E.; Romero, V.; Frinken, V. (2016). HMM word graph based keyword spotting in handwritten document images. Information Sciences. 370:497-518. https://doi.org/10.1016/j.ins.2016.07.063S49751837
    corecore