12,609 research outputs found
Text Line Segmentation of Historical Documents: a Survey
There is a huge amount of historical documents in libraries and in various
National Archives that have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines),automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.Comment: 25 pages, submitted version, To appear in International Journal on
Document Analysis and Recognition, On line version available at
http://www.springerlink.com/content/k2813176280456k3
READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents
Text line detection is crucial for any application associated with Automatic
Text Recognition or Keyword Spotting. Modern algorithms perform good on
well-established datasets since they either comprise clean data or
simple/homogeneous page layouts. We have collected and annotated 2036 archival
document images from different locations and time periods. The dataset contains
varying page layouts and degradations that challenge text line segmentation
methods. Well established text line segmentation evaluation schemes such as the
Detection Rate or Recognition Accuracy demand for binarized data that is
annotated on a pixel level. Producing ground truth by these means is laborious
and not needed to determine a method's quality. In this paper we propose a new
evaluation scheme that is based on baselines. The proposed scheme has no need
for binarization and it can handle skewed as well as rotated text lines. The
ICDAR 2017 Competition on Baseline Detection and the ICDAR 2017 Competition on
Layout Analysis for Challenging Medieval Manuscripts used this evaluation
scheme. Finally, we present results achieved by a recently published text line
detection algorithm.Comment: Submitted to DAS201
Word matching using single closed contours for indexing handwritten historical documents
Effective indexing is crucial for providing convenient access to scanned versions of large collections of historically valuable handwritten manuscripts. Since traditional handwriting recognizers based on optical character recognition (OCR) do not perform well on historical documents, recently a holistic word recognition approach has gained in popularity as an attractive and more straightforward solution (Lavrenko et al. in proc. document Image Analysis for Libraries (DIAL’04), pp. 278–287, 2004). Such techniques attempt to recognize words based on scalar and profile-based features extracted from whole word images. In this paper, we propose a new approach to holistic word recognition for historical handwritten manuscripts based on matching word contours instead of whole images or word profiles. The new method consists of robust extraction of closed word contours and the application of an elastic contour matching technique proposed originally for general shapes (Adamek and O’Connor in IEEE Trans Circuits Syst Video Technol 5:2004). We demonstrate that multiscale contour-based descriptors can effectively capture intrinsic word features avoiding any segmentation of words into smaller subunits. Our experiments show a recognition accuracy of 83%, which considerably exceeds the performance of other systems reported in the literature
Implementation of a Human-Computer Interface for Computer Assisted Translation and Handwritten Text Recognition
A human-computer interface is developed to provide services of computer assisted machine translation (CAT) and computer assisted transcription of handwritten text images (CATTI). The back-end machine translation (MT) and handwritten text recognition (HTR) systems are provided by the Pattern Recognition and Human Language Technology (PRHLT) research group. The idea is to provide users with easy to use tools to convert interactive translation and transcription feasible tasks. The assisted service is provided by remote servers with CAT or CATTI capabilities. The interface supplies the user with tools for efficient local edition: deletion, insertion and substitution.Ocampo Sepúlveda, JC. (2009). Implementation of a Human-Computer Interface for Computer Assisted Translation and Handwritten Text Recognition. http://hdl.handle.net/10251/14318Archivo delegad
Handwriting Recognition of Historical Documents with few labeled data
Historical documents present many challenges for offline handwriting
recognition systems, among them, the segmentation and labeling steps. Carefully
annotated textlines are needed to train an HTR system. In some scenarios,
transcripts are only available at the paragraph level with no text-line
information. In this work, we demonstrate how to train an HTR system with few
labeled data. Specifically, we train a deep convolutional recurrent neural
network (CRNN) system on only 10% of manually labeled text-line data from a
dataset and propose an incremental training procedure that covers the rest of
the data. Performance is further increased by augmenting the training set with
specially crafted multiscale data. We also propose a model-based normalization
scheme which considers the variability in the writing scale at the recognition
phase. We apply this approach to the publicly available READ dataset. Our
system achieved the second best result during the ICDAR2017 competition
- …