
    A joint study of deep learning-based methods for identity document image binarization and its influence on attribute recognition

    Text recognition has benefited considerably from deep learning research, as have the preprocessing methods in its workflow. Identity documents are critical in the field of document analysis and should be thoroughly researched in relation to this workflow. We examine the link between deep learning-based binarization and recognition algorithms for this sort of document on the MIDV-500 and MIDV-2020 datasets. We present a series of experiments illustrating how the capture quality of the collected images affects binarization results, and how binarization output in turn affects final recognition performance. We show that deep learning-based binarization solutions are sensitive to capture quality, which implies that they still need significant improvement. We also show that good binarization results can improve the performance of many recognition methods. Our retrained U-Net-bin outperformed all other binarization methods, and the best recognition result was obtained by PaddlePaddle OCR v2.
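
    As a rough illustration of the binarize-then-recognize pipeline this study evaluates, the Python sketch below substitutes Otsu thresholding for the retrained U-Net-bin (whose weights are not described in the abstract) and uses the published paddleocr package for the recognition step; the input file name is hypothetical.

    ```python
    # A minimal sketch of the binarize-then-recognize pipeline. Otsu
    # thresholding stands in for the learned U-Net-bin binarizer;
    # "midv_sample.png" is a hypothetical document photo.
    import cv2
    from paddleocr import PaddleOCR

    # 1. Binarize: grayscale the capture, then apply a global Otsu threshold.
    gray = cv2.imread("midv_sample.png", cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite("binarized.png", binary)

    # 2. Recognize: run OCR on the binarized image instead of the raw capture.
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    result = ocr.ocr("binarized.png", cls=True)
    print(result)  # nested list of (box, (text, confidence)); nesting varies by version
    ```

    Comparing the recognizer's output on the raw capture versus the binarized image, across captures of varying quality, is the shape of the experiment the abstract describes.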

    Landscape Analysis for the Specimen Data Refinery

    This report reviews the current state of the art in applied approaches, automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems, and identify areas where more development is required. This report was written as part of the SYNTHESYS+ project for software development and informatics teams working on new software-based approaches to improving mass digitisation of natural history specimens.

    Article Segmentation in Digitised Newspapers

    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction; the lack of article segmentation impedes these applications.

    We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches.

    Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces character-level article segmentations nearly as good as those of costly human annotators.

    We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. We propose several techniques for block representation and contribute a novel, highly compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance.

    Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art approach of Bansal et al. (2014). We contribute an innovative 2D Markov model approach that captures reading-order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation.

    Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities.
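
    The headline label propagation step lends itself to a short illustration. The sketch below is not the thesis's implementation: TF-IDF vectors stand in for the proposed similarity embeddings, scikit-learn's LabelPropagation implements the Zhu and Ghahramani (2002) algorithm, and the newspaper blocks are invented examples.

    ```python
    # A hedged sketch of headline label propagation: body blocks that are
    # textually similar to a headline inherit that headline's article label.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.semi_supervised import LabelPropagation

    blocks = [
        "MAYOR OPENS NEW BRIDGE",                  # headline, article 0
        "The mayor cut the ribbon on the new bridge yesterday.",
        "Crowds gathered at the river to watch the bridge opening.",
        "WHEAT PRICES FALL",                       # headline, article 1
        "Grain markets slumped on Tuesday as wheat prices fell.",
    ]
    # Headlines carry known article labels; body blocks start unlabelled (-1).
    labels = np.array([0, -1, -1, 1, -1])

    X = TfidfVectorizer().fit_transform(blocks).toarray()
    model = LabelPropagation(kernel="rbf", gamma=1.0).fit(X, labels)
    print(model.transduction_)  # propagated article label for every block
    ```

    The thesis's contribution lies in what X is: compressed similarity embeddings rather than sparse surface features, which is what lets the propagated labels cross vocabulary gaps between headline and body text.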

    Field typing for improved recognition on heterogeneous handwritten forms

    Offline handwriting recognition has made continuous progress over the past decades. However, existing methods are typically benchmarked on free-form text datasets that are biased towards good-quality images, clean handwriting styles, and homogeneous content. In this paper, we show that state-of-the-art algorithms employing long short-term memory (LSTM) layers do not readily generalize to real-world structured documents, such as forms, because of their highly heterogeneous, out-of-vocabulary content and its inherent ambiguities. To address this, we propose to leverage the content type of each field within an LSTM-based architecture. Furthermore, we introduce a procedure for generating synthetic data to train this architecture without requiring expensive manual annotations. We demonstrate the effectiveness of our approach at transcribing text on a challenging, real-world dataset of European Accident Statements.
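
    To make the field-typing idea concrete, here is a minimal PyTorch sketch (an assumed framework, not necessarily the paper's) in which a learned embedding of the field's content type is concatenated to the visual features at every time step before the LSTM. All layer sizes, the toy CNN, the CTC output head, and the field-type inventory are illustrative, not the paper's architecture.

    ```python
    # Sketch: condition an LSTM recognizer on the form field's content type
    # (e.g. 0 = date, 1 = amount, 2 = name, 3 = free text) so decoding can
    # be biased toward type-appropriate vocabulary.
    import torch
    import torch.nn as nn

    class TypedFieldRecognizer(nn.Module):
        def __init__(self, num_field_types=4, num_chars=80, feat_dim=64, hidden=128):
            super().__init__()
            self.cnn = nn.Sequential(                 # toy feature extractor
                nn.Conv2d(1, feat_dim, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, None)),      # collapse height, keep width as time
            )
            self.type_emb = nn.Embedding(num_field_types, 16)
            self.lstm = nn.LSTM(feat_dim + 16, hidden, bidirectional=True, batch_first=True)
            self.head = nn.Linear(2 * hidden, num_chars + 1)  # +1 for the CTC blank

        def forward(self, image, field_type):
            feats = self.cnn(image).squeeze(2).transpose(1, 2)  # (B, W, feat_dim)
            t = self.type_emb(field_type)                       # (B, 16)
            t = t.unsqueeze(1).expand(-1, feats.size(1), -1)    # repeat over time steps
            out, _ = self.lstm(torch.cat([feats, t], dim=-1))
            return self.head(out)  # per-timestep character logits for CTC decoding

    # e.g. a date field: logits = TypedFieldRecognizer()(line_image, torch.tensor([0]))
    ```

    The synthetic training data the paper describes would pair rendered field images with both a transcription and a field-type label, so the type embedding is learned without manual annotation.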

    Audiovisual Metadata Platform Pilot Development (AMPPD), Final Project Report

    This report documents the experience and findings of the Audiovisual Metadata Platform Pilot Development (AMPPD) project, which has worked to enable more efficient generation of metadata to support discovery and use of digitized and born-digital audio and moving image collections. The AMPPD project was carried out by partners Indiana University Libraries, AVP, the University of Texas at Austin, and the New York Public Library between 2018 and 2021.

    Implementation of a web platform for storing and disseminating old architectural plans using OCR and web technologies

    The objective of this thesis is to integrate OCR technology and web technologies into a single web platform for storing and disseminating old architectural plans. Plan registration is currently done manually using index cards (see Annex 1), so the web platform developed here provides a more secure backup and faster queries. Development followed the agile SCRUM methodology, which produced increasingly complete functional deliverables as the requirements were addressed. Having opted for web technologies, the programming languages used were PHP, JavaScript and HTML; the Laravel framework was ultimately chosen because it provides an MVC architecture and integrates OCR easily. The OCR technology was selected through t-test analysis of two candidates, Tesseract OCR and the OCR Space API, with the OCR Space API selected on the strength of an accuracy of 80.28% against 71.58% for Tesseract. Although its accuracy was within an acceptable range, Tesseract OCR was discarded because, as a set of libraries, it depends on the processing speed of the computer on which it runs, whereas OCR Space, being an API, performs most of the image-scanning work in the cloud, its only limitation being the number of queries allowed per day. The end result is a web platform that uses OCR to autocomplete some text fields by selecting the relevant region of a scanned plan; it also meets the objective of storing that information so it can later be consulted by any registered user through a responsive, user-friendly interface.
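
    The engine comparison described above amounts to a paired significance test over per-document accuracies. Below is a minimal SciPy sketch of that comparison; the per-plan scores are invented placeholders, not the thesis's measurements (it reports overall accuracies of 80.28% for the OCR Space API and 71.58% for Tesseract).

    ```python
    # Paired t-test over per-document OCR accuracy for two engines run on
    # the same scanned plans. Scores below are hypothetical.
    from scipy import stats

    ocr_space = [82.1, 79.4, 81.0, 78.5, 80.9]  # accuracy (%) per plan
    tesseract = [72.3, 70.8, 73.1, 69.9, 71.6]  # same plans, same order

    t_stat, p_value = stats.ttest_rel(ocr_space, tesseract)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    # A small p-value supports preferring OCR Space API on these documents.
    ```

    A paired test is the right choice here because both engines are evaluated on the same set of plans, so per-document difficulty cancels out of the comparison.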

    Analyzing Image Tweets in Microblogs

    Ph.D. thesis (Doctor of Philosophy)