Text Line Segmentation of Historical Documents: a Survey
A huge number of historical documents in libraries and in various
National Archives have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines), automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.
Comment: 25 pages, submitted version. To appear in International Journal on
Document Analysis and Recognition. Online version available at
http://www.springerlink.com/content/k2813176280456k3
Use of colour for hand-filled form analysis and recognition
Colour information in form analysis is currently underutilised. As technology has advanced and computing costs have fallen, processing forms in colour has become practicable. This paper describes a novel colour-based approach to the extraction of filled data from colour form images. Images are first quantised to reduce the colour complexity, and data is then extracted by examining the colour characteristics of the images. The improved performance of the proposed method has been verified by comparing its processing time, recognition rate, extraction precision and recall rate to those of an equivalent black and white system.
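The quantise-then-extract idea in this abstract can be illustrated with a minimal sketch. The abstract does not say which quantisation scheme or extraction rule the authors use, so everything here is an assumption: uniform per-channel binning, a caller-supplied list of known form colours, and the hypothetical helper names `quantise` and `extract_filled`.

```python
def quantise(pixel, levels=4):
    """Map each 0-255 channel to the centre of one of `levels` equal-width bins."""
    step = 256 // levels
    return tuple((c // step) * step + step // 2 for c in pixel)


def extract_filled(image, form_colours, levels=4):
    """Return a mask with 1 where a pixel's quantised colour is not a form colour.

    `image` is a list of rows of (r, g, b) tuples. `form_colours` lists the
    colours of the blank form (background, pre-printed lines); this input is
    an assumption of the sketch, not something the abstract specifies.
    """
    allowed = {quantise(c, levels) for c in form_colours}
    return [[0 if quantise(p, levels) in allowed else 1 for p in row]
            for row in image]


# Near-white paper, a red pre-printed line, and one blue ink pixel.
image = [
    [(250, 250, 250), (200, 30, 30)],
    [(20, 20, 180), (255, 255, 255)],
]
mask = extract_filled(image, form_colours=[(255, 255, 255), (210, 20, 20)])
print(mask)
```

Quantising first means slightly different shades of paper or printed line fall into the same bin, so only colours that differ substantially from the blank form are kept as filled data.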
CloudScan - A configuration-free invoice analysis system using recurrent neural networks
We present CloudScan, an invoice analysis system that requires zero
configuration or upfront annotation. In contrast to previous work, CloudScan
does not rely on templates of invoice layout; instead, it learns a single global
model of invoices that naturally generalizes to unseen invoice layouts. The
model is trained using data automatically extracted from end-user provided
feedback. This automatic training data extraction removes the requirement for
users to annotate the data precisely. We describe a recurrent neural network
model that can capture long range context and compare it to a baseline logistic
regression model corresponding to the current CloudScan production system. We
train and evaluate the system on 8 important fields using a dataset of 326,471
invoices. The recurrent neural network and baseline model achieve 0.891 and
0.887 average F1 scores respectively on seen invoice layouts. For the harder
task of unseen invoice layouts, the recurrent neural network model outperforms
the baseline with 0.840 average F1 compared to 0.788.
Comment: Presented at ICDAR 201
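For readers unfamiliar with the metric, the average F1 figures above can be read against the following sketch. It assumes an unweighted mean of per-field F1 scores; the abstract does not state how the average over the 8 fields is computed, and the field names and counts below are hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts_per_field):
    """Unweighted mean of per-field F1 (an assumed averaging scheme)."""
    scores = [f1_score(*counts) for counts in counts_per_field.values()]
    return sum(scores) / len(scores)


# Hypothetical counts for two made-up fields, given as (tp, fp, fn).
example = {"total": (8, 2, 2), "date": (5, 0, 0)}
print(macro_f1(example))
```

On the reported numbers, the gap between the two models is 0.004 on seen layouts (0.891 vs 0.887) and 0.052 on unseen layouts (0.840 vs 0.788).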
Handwritten Text Line Detection and Classification based on HMMs
In this paper we present an approach for text line analysis and detection in handwritten
documents based on Hidden Markov Models, a technique widely used in other handwritten
text and speech recognition tasks. It is shown that text line analysis and detection can be
solved using a more formal methodology, in contrast to most of the heuristic approaches
found in the literature. Our approach not only provides the best position coordinates for
each of the vertical page regions but also labels them, in this manner surpassing the
traditional heuristic methods. In our experiments we demonstrate the performance of the
approach (both in line analysis and detection) and study the impact of increasingly
constrained "vertical layout language models" and morphologic models on text line
detection and classification accuracy. Through this experimentation we also show the
improvement in quality of the baselines yielded by our approach in comparison with a
state-of-the-art heuristic method based on vertical projection profiles.
Bosch Campos, V. (2012). Handwritten Text Line Detection and Classification based on HMMs. http://hdl.handle.net/10251/17964
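The heuristic baseline mentioned at the end of this abstract, a vertical projection profile, can be sketched as follows. This is a generic illustration of the projection-profile idea, not the HMM method of the thesis and not the specific baseline it compares against; the `min_ink` threshold and function names are assumptions.

```python
def projection_profile(binary):
    """Count ink pixels in each row of a binary image (1 = ink)."""
    return [sum(row) for row in binary]


def detect_lines(binary, min_ink=1):
    """Return (top_row, bottom_row) spans whose rows hold at least `min_ink` ink pixels."""
    profile = projection_profile(binary)
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text band begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom of the page
        lines.append((start, len(profile) - 1))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
]
print(detect_lines(page))
```

A fixed threshold of this kind is what makes the approach heuristic: touching or skewed handwritten lines leave no clean gap in the profile, which is the situation a model-based method is meant to handle.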
Information Preserving Processing of Noisy Handwritten Document Images
Many pre-processing techniques that normalize artifacts and clean noise induce anomalies due to discretization of the document image. Important information that could be used at later stages may be lost. A proposed composite-model framework takes into account pre-printed information, user-added data, and digitization characteristics. Its benefits are demonstrated by experiments with statistically significant results. Separating pre-printed ruling lines from user-added handwriting shows how ruling lines impact people's handwriting and how they can be exploited for identifying writers. Ruling line detection based on multi-line linear regression reduces the mean error of counting them from 0.10 to 0.03, 6.70 to 0.06, and 0.13 to 0.02, compared to an HMM-based approach on three standard test datasets, thereby reducing human correction time by 50%, 83%, and 72% on average. On 61 page images from 16 rule-form templates, the precision and recall of form cell recognition are increased by 2.7% and 3.7%, compared to a cross-matrix approach. Compensating for and exploiting ruling lines during feature extraction rather than pre-processing raises the writer identification accuracy from 61.2% to 67.7% on a 61-writer noisy Arabic dataset. Similarly, counteracting page-wise skew by subtracting it or transforming contours in a continuous coordinate system during feature extraction improves the writer identification accuracy. An implementation study of contour-hinge features reveals that utilizing the full probabilistic probability distribution function matrix improves the writer identification accuracy from 74.9% to 79.5%.
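The abstract does not define "multi-line linear regression". One plausible reading, sketched below purely as an assumption, is a least-squares fit in which all ruling lines on a page share a single slope (the page skew) while each line keeps its own intercept. The function name and the input format are hypothetical.

```python
def fit_parallel_lines(lines):
    """Least-squares fit of y = m*x + b_j with one shared slope m.

    `lines` is a list of point lists, one per ruling line, each point an
    (x, y) tuple. Returns (m, [b_0, b_1, ...]). Assumes every line has at
    least two points with distinct x values.
    """
    sxy = sxx = 0.0
    means = []
    for pts in lines:
        xbar = sum(x for x, _ in pts) / len(pts)
        ybar = sum(y for _, y in pts) / len(pts)
        means.append((xbar, ybar))
        # Pool the within-line deviations so every line informs the slope.
        sxy += sum((x - xbar) * (y - ybar) for x, y in pts)
        sxx += sum((x - xbar) ** 2 for x, _ in pts)
    m = sxy / sxx
    intercepts = [ybar - m * xbar for xbar, ybar in means]
    return m, intercepts


# Two slightly skewed ruling lines, 20 pixels apart.
ruling = [
    [(0, 10), (10, 11), (20, 12)],
    [(0, 30), (10, 31), (20, 32)],
]
slope, offsets = fit_parallel_lines(ruling)
print(slope, offsets)
```

Sharing the slope lets short or partly occluded ruling lines borrow strength from the clearer ones, which is one reason a joint fit might count lines more reliably than fitting each line on its own.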