Extraction of Projection Profile, Run-Histogram and Entropy Features Straight from Run-Length Compressed Text-Documents
Document Image Analysis, like any Digital Image Analysis, requires
identification and extraction of proper features, which are generally extracted
from uncompressed images, though in reality images are made available in
compressed form for reasons such as transmission and storage efficiency.
However, this implies that the compressed image must first be decompressed, which
demands additional computing resources. This limitation motivates research
into extracting features directly from the compressed image. In this
research, we propose to extract essential features such as projection profile,
run-histogram and entropy for text document analysis directly from run-length
compressed text-documents. The experiments show that the features can be
extracted directly from the compressed image without going through the stage of
decompression, which reduces the computing time. The feature
values so extracted are identical to those extracted from uncompressed
images.
Comment: Published by IEEE in Proceedings of ACPR-2013. arXiv admin note: text
overlap with arXiv:1403.778
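The idea of reading features straight from run-length data can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a hypothetical encoding in which each row is a list of run lengths alternating background, foreground, background, and so on, starting with a (possibly zero-length) background run. The entropy function is the Shannon entropy of the run-length distribution, used here as a stand-in; the paper's own entropy definition may differ.

```python
from collections import Counter
from math import log2


def projection_profile(rows):
    """Foreground pixel count per row, read directly from the runs.

    Assumption: runs alternate background, foreground, ... and the
    first run of every row is background (length 0 if the row starts
    with a foreground pixel).
    """
    return [sum(row[1::2]) for row in rows]


def run_histogram(rows, foreground=True):
    """Histogram of run lengths for foreground (or background) runs."""
    start = 1 if foreground else 0
    return Counter(run for row in rows for run in row[start::2] if run > 0)


def histogram_entropy(hist):
    """Shannon entropy (in bits) of a run-length histogram."""
    total = sum(hist.values())
    return -sum((c / total) * log2(c / total) for c in hist.values())


def decode_row(row):
    """Expand one run-length row into pixels (0 = background, 1 = foreground).

    Only used to check the compressed-domain results against the
    uncompressed image.
    """
    pixels = []
    for index, run in enumerate(row):
        pixels.extend([index % 2] * run)
    return pixels


rows = [[2, 3, 1], [0, 4, 2], [6]]
profile = projection_profile(rows)
hist = run_histogram(rows)
```

Summing the decoded pixels of each row gives the same profile as `projection_profile`, which matches the abstract's claim that the compressed-domain features equal those from the uncompressed image.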
The DIGMAP geo-temporal web gazetteer service
This paper presents the DIGMAP geo-temporal Web gazetteer service, a system providing access to names of places, historical periods, and associated geo-temporal information. Within the DIGMAP project, this gazetteer serves as the unified repository of geographic and temporal information, assisting in the recognition and disambiguation of geo-temporal expressions in text, as well as in resource searching and indexing. We describe the data integration methodology, the handling of temporal information and some of the applications that use the gazetteer. Initial evaluation results show that the proposed system can adequately support several tasks related to geo-temporal information extraction and retrieval.
Boosting Handwriting Text Recognition in Small Databases with Transfer Learning
In this paper we deal with the offline handwriting text recognition (HTR)
problem with reduced training datasets. Recent HTR solutions based on
artificial neural networks exhibit remarkable results on reference
databases. These deep learning neural networks are composed of both
convolutional (CNN) and long short-term memory (LSTM) recurrent units. In
addition, connectionist temporal classification (CTC) is the key to avoiding
segmentation at character level, greatly facilitating the labeling task. One of
the main drawbacks of the CNN-LSTM-CTC (CLC) solutions is that they need a
considerable part of the text to be transcribed for every type of calligraphy,
typically on the order of a few thousand lines. Furthermore, in some
scenarios the text to transcribe is not that long, e.g. in the Washington
database. The CLC typically overfits for this reduced number of training
samples. Our proposal is based on the transfer learning (TL) from the
parameters learned with a bigger database. We first investigate, for a reduced
and fixed number of training samples, 350 lines, how the learning from a large
database, the IAM, can be transferred to the learning of the CLC of a reduced
database, Washington. We focus on which layers of the network need not be
re-trained. We conclude that the best solution is to re-train the whole CLC
parameters initialized to the values obtained after the training of the CLC
from the larger database. We also investigate results when the training size is
further reduced. The differences in the CER are most pronounced when training
with just 350 lines: a CER of 3.3% is achieved with TL, compared with a CER of
18.2% when training from scratch. As a byproduct, the learning times are
considerably reduced. Similarly good results are obtained on the Parzival database when
trained with this reduced number of lines and this new approach.
Comment: ICFHR 2018 Conferenc
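The abstract reports results as character error rate (CER) without defining it. A minimal Python sketch of how CER is commonly computed follows. It assumes the usual definition, the Levenshtein edit distance between the reference transcription and the recognizer output divided by the reference length; the exact normalization used in the paper is not stated here. The function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


example = cer("kitten", "sitting")  # 3 edits over 6 reference characters
```

Under this definition, a CER of 3.3% corresponds to roughly one character edit per 30 reference characters.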
Digitisation Processing and Recognition of Old Greek Manuscripts (the D-SCRIBE Project)
After many years of scholarly study, manuscript collections continue to be an important source of novel
information for scholars, concerning both the history of earlier times and the development of cultural
documentation over the centuries. The D-SCRIBE project aims to support and facilitate current and future efforts in
manuscript digitization and processing. It strives toward the creation of a comprehensive software product, which
can assist the content holders in turning an archive of manuscripts into a digital collection using automated
methods. In this paper, we focus on the problem of recognizing early Christian Greek manuscripts. We propose a
novel digital image binarization scheme for low quality historical documents allowing further content exploitation in
an efficient way. Based on the existence of closed cavity regions in the majority of characters and character
ligatures in these scripts, we propose a novel, segmentation-free, fast and efficient technique that assists the
recognition procedure by tracing and recognizing the most frequently appearing characters or character ligatures
- …