Text Line Segmentation of Historical Documents: a Survey
There are huge numbers of historical documents in libraries and in various
National Archives that have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines), automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.
Comment: 25 pages, submitted version. To appear in International Journal on
Document Analysis and Recognition. Online version available at
http://www.springerlink.com/content/k2813176280456k3
Extraction of Projection Profile, Run-Histogram and Entropy Features Straight from Run-Length Compressed Text-Documents
Document Image Analysis, like any Digital Image Analysis, requires
identification and extraction of proper features, which are generally extracted
from uncompressed images, though in reality images are made available in
compressed form for reasons such as transmission and storage efficiency.
However, this implies that the compressed image must first be decompressed,
which demands additional computing resources. This limitation motivates
research into extracting features directly from the compressed image. In this
research, we propose to extract essential features such as projection profile,
run-histogram and entropy for text document analysis directly from run-length
compressed text-documents. The experiments show that the features can be
extracted directly from the compressed image without going through the stage of
decompression, which reduces the computing time. The feature
values so extracted are identical to those extracted from uncompressed
images.
Comment: Published by IEEE in Proceedings of ACPR-2013. arXiv admin note: text
overlap with arXiv:1403.778
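The idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a hypothetical row format in which each row is a list of alternating run lengths starting with a white run (possibly of length 0), and it covers only the projection profiles and the run histogram, since the abstract does not define the entropy feature. The function names are illustrative.

```python
from collections import Counter


def horizontal_profile(rle_rows):
    """Black-pixel count per row: the sum of the runs at odd positions."""
    return [sum(row[1::2]) for row in rle_rows]


def vertical_profile(rle_rows, width):
    """Black-pixel count per column, built from run boundaries only."""
    # Difference array: +1 where a black run starts, -1 just past its end.
    diff = [0] * (width + 1)
    for row in rle_rows:
        pos = 0
        for i, run in enumerate(row):
            if i % 2 == 1 and run > 0:  # odd positions hold black runs
                diff[pos] += 1
                diff[pos + run] -= 1
            pos += run
    profile, acc = [], 0
    for d in diff[:width]:
        acc += d
        profile.append(acc)
    return profile


def black_run_histogram(rle_rows):
    """Number of black runs of each length, over the whole page."""
    return Counter(run for row in rle_rows for run in row[1::2] if run > 0)


def decode(rle_rows):
    """Reference decoder (1 = black), used only to cross-check the features."""
    pixels = []
    for row in rle_rows:
        line = []
        for i, run in enumerate(row):
            line.extend([i % 2] * run)
        pixels.append(line)
    return pixels
```

Comparing these values with the row and column sums of `decode(rows)` on a small page is a quick way to check the claim that features computed from the runs match those computed from the uncompressed pixels.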
Baseline Detection in Historical Documents using Convolutional U-Nets
Baseline detection is still a challenging task for heterogeneous collections
of historical documents. We present a novel approach to baseline extraction in
such settings, which turned out to be the winning entry in the ICDAR 2017
Competition on Baseline detection (cBAD). It uses deep convolutional nets (CNNs)
both for the actual extraction of baselines and for a simple form of layout
analysis in a pre-processing step. To the best of our knowledge it is the first
CNN-based system for baseline extraction applying a U-net architecture and
sliding window detection, profiting from a high local accuracy of the candidate
lines extracted. Final baseline post-processing complements our approach,
compensating for inaccuracies mainly due to missing context information during
sliding window detection. We experimentally evaluate the components of our
system individually on the cBAD dataset. Moreover, we investigate how it
generalizes to different data by means of the dataset used for the baseline
extraction task of the ICDAR 2017 Competition on Layout Analysis for
Challenging Medieval Manuscripts (HisDoc). A comparison with the results
reported for HisDoc shows that it also outperforms the contestants of the
latter.
Comment: 6 pages, accepted to DAS 201
Visual Representation of Text in Web Documents and Its Interpretation
This paper examines the uses of text and its representation on Web documents in terms of the challenges in its interpretation. Particular attention is paid to the significant problem of non-uniform representation of text. This non-uniformity is mainly due to the presence of semantically important text in image form as opposed to the standard encoded text. The issues surrounding text representation in Web documents are discussed in the context of colour perception and spatial representation. The characteristics of the representation of text in image form are examined and research towards interpreting these images of text is briefly described.