32,278 research outputs found

    Extraction of Projection Profile, Run-Histogram and Entropy Features Straight from Run-Length Compressed Text-Documents

    Full text link
    Document Image Analysis, like any Digital Image Analysis requires identification and extraction of proper features, which are generally extracted from uncompressed images, though in reality images are made available in compressed form for the reasons such as transmission and storage efficiency. However, this implies that the compressed image should be decompressed, which indents additional computing resources. This limitation induces the motivation to research in extracting features directly from the compressed image. In this research, we propose to extract essential features such as projection profile, run-histogram and entropy for text document analysis directly from run-length compressed text-documents. The experimentation illustrates that features are extracted directly from the compressed image without going through the stage of decompression, because of which the computing time is reduced. The feature values so extracted are exactly identical to those extracted from uncompressed images.Comment: Published by IEEE in Proceedings of ACPR-2013. arXiv admin note: text overlap with arXiv:1403.778

    The DIGMAP geo-temporal web gazetteer service

    Get PDF
    This paper presents the DIGMAP geo-temporal Web gazetteer service, a system providing access to names of places, historical periods, and associated geo-temporal information. Within the DIGMAP project, this gazetteer serves as the unified repository of geographic and temporal information, assisting in the recognition and disambiguation of geo-temporal expressions over text, as well as in resource searching and indexing. We describe the data integration methodology, the handling of temporal information and some of the applications that use the gazetteer. Initial evaluation results show that the proposed system can adequately support several tasks related to geo-temporal information extraction and retrieval

    Boosting Handwriting Text Recognition in Small Databases with Transfer Learning

    Full text link
    In this paper we deal with the offline handwriting text recognition (HTR) problem with reduced training datasets. Recent HTR solutions based on artificial neural networks exhibit remarkable solutions in referenced databases. These deep learning neural networks are composed of both convolutional (CNN) and long short-term memory recurrent units (LSTM). In addition, connectionist temporal classification (CTC) is the key to avoid segmentation at character level, greatly facilitating the labeling task. One of the main drawbacks of the CNNLSTM-CTC (CLC) solutions is that they need a considerable part of the text to be transcribed for every type of calligraphy, typically in the order of a few thousands of lines. Furthermore, in some scenarios the text to transcribe is not that long, e.g. in the Washington database. The CLC typically overfits for this reduced number of training samples. Our proposal is based on the transfer learning (TL) from the parameters learned with a bigger database. We first investigate, for a reduced and fixed number of training samples, 350 lines, how the learning from a large database, the IAM, can be transferred to the learning of the CLC of a reduced database, Washington. We focus on which layers of the network could be not re-trained. We conclude that the best solution is to re-train the whole CLC parameters initialized to the values obtained after the training of the CLC from the larger database. We also investigate results when the training size is further reduced. The differences in the CER are more remarkable when training with just 350 lines, a CER of 3.3% is achieved with TL while we have a CER of 18.2% when training from scratch. As a byproduct, the learning times are quite reduced. Similar good results are obtained from the Parzival database when trained with this reduced number of lines and this new approach.Comment: ICFHR 2018 Conferenc

    Digitisation Processing and Recognition of Old Greek Manuscipts (the D-SCRIBE Project)

    Get PDF
    After many years of scholar study, manuscript collections continue to be an important source of novel information for scholars, concerning both the history of earlier times as well as the development of cultural documentation over the centuries. D-SCRIBE project aims to support and facilitate current and future efforts in manuscript digitization and processing. It strives toward the creation of a comprehensive software product, which can assist the content holders in turning an archive of manuscripts into a digital collection using automated methods. In this paper, we focus on the problem of recognizing early Christian Greek manuscripts. We propose a novel digital image binarization scheme for low quality historical documents allowing further content exploitation in an efficient way. Based on the existence of closed cavity regions in the majority of characters and character ligatures in these scripts, we propose a novel, segmentation-free, fast and efficient technique that assists the recognition procedure by tracing and recognizing the most frequently appearing characters or character ligatures
    • …
    corecore