2,234 research outputs found

    Text Line Segmentation of Historical Documents: a Survey

    Libraries and various National Archives hold a huge number of historical documents that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, and dedicated to documents of historical interest. Comment: 25 pages, submitted version. To appear in International Journal on Document Analysis and Recognition. Online version available at http://www.springerlink.com/content/k2813176280456k3

    CloudScan - A configuration-free invoice analysis system using recurrent neural networks

    We present CloudScan, an invoice analysis system that requires zero configuration or upfront annotation. In contrast to previous work, CloudScan does not rely on templates of invoice layout. Instead, it learns a single global model of invoices that naturally generalizes to unseen invoice layouts. The model is trained using data automatically extracted from end-user provided feedback. This automatic training data extraction removes the requirement for users to annotate the data precisely. We describe a recurrent neural network model that can capture long-range context and compare it to a baseline logistic regression model corresponding to the current CloudScan production system. We train and evaluate the system on 8 important fields using a dataset of 326,471 invoices. The recurrent neural network and baseline model achieve 0.891 and 0.887 average F1 scores respectively on seen invoice layouts. For the harder task of unseen invoice layouts, the recurrent neural network model outperforms the baseline with 0.840 average F1 compared to 0.788. Comment: Presented at ICDAR 201
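    The "average F1" figures quoted above can be illustrated with a minimal sketch. It assumes the average is an unweighted mean of per-field F1 scores, which the abstract does not state; the field names and counts below are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-field F1 (an assumed averaging scheme)."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


# Hypothetical (tp, fp, fn) counts for two invoice fields.
example_counts = {
    "total": (8, 2, 2),
    "date": (5, 5, 0),
}
print(round(average_f1(example_counts), 3))
```

    Averaging per field gives every field equal weight regardless of how often it appears; pooling counts across fields before computing F1 would be a different, equally plausible reading.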

    Handwritten Text Line Detection and Classification based on HMMs

    In this paper we present an approach for text line analysis and detection in handwritten documents based on Hidden Markov Models, a technique widely used in other handwritten text and speech recognition tasks. We show that text line analysis and detection can be solved using a more formal methodology, in contrast to most of the heuristic approaches found in the literature. Our approach not only provides the best position coordinates for each of the vertical page regions but also labels them, in this manner surpassing the traditional heuristic methods. In our experiments we demonstrate the performance of the approach (both in line analysis and detection) and study the impact of increasingly constrained "vertical layout language models" and morphologic models on text line detection and classification accuracy. Through this experimentation we also show the improvement in quality of the baselines yielded by our approach in comparison with a state-of-the-art heuristic method based on vertical projection profiles. Bosch Campos, V. (2012). Handwritten Text Line Detection and Classification based on HMMs. http://hdl.handle.net/10251/17964 Archivo delegad
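    For orientation, the projection-profile heuristic that the abstract uses as its point of comparison can be sketched roughly as follows. This is a generic illustration, not the method evaluated in the thesis: it assumes a binarized page given as rows of 0/1 pixels, takes the profile to be the ink count per pixel row, and uses a hypothetical `min_ink` threshold.

```python
from typing import List, Tuple


def row_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each pixel row of a binary image."""
    return [sum(row) for row in image]


def find_text_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (top, bottom) row ranges whose ink count is >= min_ink."""
    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(row_profile(image)):
        if ink >= min_ink and start is None:
            start = y  # entering a band of inked rows
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))  # leaving the band
            start = None
    if start is not None:  # band that runs to the bottom edge
        lines.append((start, len(image) - 1))
    return lines


# Tiny hypothetical page: two inked bands separated by blank rows.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
print(find_text_lines(page))  # [(1, 2), (4, 4)]
```

    A fixed threshold like this breaks down when lines touch, overlap or are skewed, which is the kind of weakness that motivates model-based alternatives such as the HMM approach described above.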

    Information Preserving Processing of Noisy Handwritten Document Images

    Many pre-processing techniques that normalize artifacts and clean noise induce anomalies due to discretization of the document image. Important information that could be used at later stages may be lost. A proposed composite-model framework takes into account pre-printed information, user-added data, and digitization characteristics. Its benefits are demonstrated by experiments with statistically significant results. Separating pre-printed ruling lines from user-added handwriting shows how ruling lines impact people's handwriting and how they can be exploited for identifying writers. Ruling line detection based on multi-line linear regression reduces the mean error of counting them from 0.10 to 0.03, 6.70 to 0.06, and 0.13 to 0.02, compared to an HMM-based approach on three standard test datasets, thereby reducing human correction time by 50%, 83%, and 72% on average. On 61 page images from 16 rule-form templates, the precision and recall of form cell recognition are increased by 2.7% and 3.7%, compared to a cross-matrix approach. Compensating for and exploiting ruling lines during feature extraction rather than pre-processing raises the writer identification accuracy from 61.2% to 67.7% on a 61-writer noisy Arabic dataset. Similarly, counteracting page-wise skew by subtracting it or transforming contours in a continuous coordinate system during feature extraction improves the writer identification accuracy. An implementation study of contour-hinge features reveals that utilizing the full probabilistic probability distribution function matrix improves the writer identification accuracy from 74.9% to 79.5%.