65 research outputs found
A joint study of deep learning-based methods for identity document image binarization and its influence on attribute recognition
Text recognition has benefited considerably from deep learning research, as have the preprocessing methods included in its workflow. Identity documents are critical in the field of document analysis and should be thoroughly researched in relation to this workflow. We examine the link between deep learning-based binarization and recognition algorithms for this type of document on the MIDV-500 and MIDV-2020 datasets. We present a series of experiments illustrating the relation between captured image quality and binarization results, as well as the influence of the binarization output on final recognition performance. We show that deep learning-based binarization solutions are affected by capture quality, which implies that they still need significant improvement. We also show that proper binarization can improve the performance of many recognition methods. Our retrained U-Net-bin outperformed all other binarization methods, and the best recognition result was obtained by PaddleOCR v2
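The binarization step this abstract studies maps a grayscale document image to black-and-white before recognition. As background only (the paper's methods are deep networks such as U-Net-bin, not shown here), a minimal classical baseline is Otsu's global threshold, sketched below with NumPy:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold for an 8-bit grayscale image
    by maximizing between-class variance over all 256 cut points."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    mu_total = (hist * np.arange(256)).sum() / total
    best_t, best_var = 0, -1.0
    cum_w = cum_mu = 0.0
    for t in range(256):
        cum_w += hist[t]
        cum_mu += t * hist[t]
        w0 = cum_w / total
        if w0 == 0 or w0 == 1:
            continue
        mu0 = cum_mu / cum_w                               # mean of background class
        mu1 = (mu_total * total - cum_mu) / (total - cum_w)  # mean of foreground class
        var_between = w0 * (1 - w0) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray: np.ndarray) -> np.ndarray:
    """Binarize an image: pixels above the Otsu threshold become 255, the rest 0."""
    return (gray > otsu_threshold(gray)).astype(np.uint8) * 255
```

Deep learning-based binarizers replace this single global threshold with a learned per-pixel decision, which is why, as the abstract notes, their output quality varies with capture conditions.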
Landscape Analysis for the Specimen Data Refinery
This report reviews the current state of the art in applied automated tools, services, and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems, and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens
Article Segmentation in Digitised Newspapers
Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. 
We propose several techniques for block representation and contribute a novel highly-compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities
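The abstract's 2D Markov approach reduces structured labelling to a Markov chain decoded with the Viterbi algorithm. The specific model is the thesis's own; below is only a generic log-space Viterbi decoder of the kind such an approach relies on:

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, obs):
    """Most likely state path for a discrete HMM.
    log_start: (S,) log initial probs; log_trans: (S, S) log transition probs;
    log_emit: (S, O) log emission probs; obs: list of observation symbol ids."""
    S, T = log_start.shape[0], len(obs)
    score = np.full((T, S), -np.inf)   # best log prob of any path ending in state s at t
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            cand = score[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + log_emit[s, obs[t]]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):       # follow backpointers from the end
        path.append(back[t, path[-1]])
    return path[::-1]
```

In the article-segmentation setting, states would correspond to block labels (e.g. "same article" vs. "new article") and observations to block features, with transitions encoding the reading-order dependencies the abstract describes.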
Field typing for improved recognition on heterogeneous handwritten forms
Offline handwriting recognition has undergone continuous progress over the
past decades. However, existing methods are typically benchmarked on free-form
text datasets that are biased towards good-quality images and handwriting
styles, and homogeneous content. In this paper, we show that state-of-the-art
algorithms, employing long short-term memory (LSTM) layers, do not readily
generalize to real-world structured documents, such as forms, due to their
highly heterogeneous and out-of-vocabulary content, and to the inherent
ambiguities of this content. To address this, we propose to leverage the
content type within an LSTM-based architecture. Furthermore, we introduce a
procedure to generate synthetic data to train this architecture without
requiring expensive manual annotations. We demonstrate the effectiveness of our
approach at transcribing text on a challenging, real-world dataset of European
Accident Statements
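The paper's contribution is leveraging field content type inside an LSTM architecture; that model is not reproduced here. As a loose illustration of the underlying idea, one simple way to exploit a known field type at decoding time is to mask out characters the type cannot produce before greedy decoding. The function, field names, and charsets below are all hypothetical:

```python
import numpy as np

# Hypothetical per-type charsets (illustrative only, not the paper's).
FIELD_CHARSETS = {
    "date": set("0123456789/.-"),
    "name": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ '-"),
}

def decode_typed(logits: np.ndarray, alphabet: str, field_type: str) -> str:
    """Greedy decode a (timesteps, |alphabet|) score matrix,
    masking characters outside the declared field type's charset."""
    allowed = FIELD_CHARSETS[field_type]
    mask = np.array([0.0 if c in allowed else -np.inf for c in alphabet])
    return "".join(alphabet[int(np.argmax(step + mask))] for step in logits)
```

This captures why typing helps with ambiguous handwriting: a shape that could be read as "A" or "4" resolves to "4" once the field is known to be a date.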
Audiovisual Metadata Platform Pilot Development (AMPPD), Final Project Report
This report documents the experience and findings of the Audiovisual Metadata Platform Pilot Development (AMPPD) project, which has worked to enable more efficient generation of metadata to support discovery and use of digitized and born-digital audio and moving image collections. The AMPPD project was carried out by partners Indiana University Libraries, AVP, University of Texas at Austin, and New York Public Library between 2018 and 2021
Steganoscription: exploring techniques for privacy-preserving crowdsourced transcription of handwritten documents
The focus of my research is the historical document format represented by the Central State Hospital (CSH) dataset: handwritten medical records. The specific problem innate to the CSH dataset is how to transcribe sensitive, cursive-handwritten documents via a manual vehicle such as crowdsourcing. Manual methods are necessary, no matter the sophistication of the optical character recognition system used, because of the inconsistencies within cursive script. To address this problem I have developed an application that enables users to transcribe sensitive handwritten document images while preserving the privacy of the context around the transcribed text via random word selection and visual manipulation of the displayed text. This is made possible through several algorithms that process documents with a top-down approach. These system operations detect and segment lines of text in images, reverse the slant common to cursive script, detect and segment words, and finally manipulate word images before they are displayed to users; combinations of color, noise, and geometric manipulations are currently supported and applied randomly. This system, called Steganoscription, combines the concepts of steganography and transcription.
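The final stage described above randomly manipulates word images before display. The abstract names the manipulation families (color, noise, geometric) without giving implementations, so the sketch below is an assumed minimal version of each with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility in this sketch

def obfuscate_word(img: np.ndarray, mode: str) -> np.ndarray:
    """Apply one manipulation family to an 8-bit grayscale word image:
    'noise' perturbs pixel intensities, 'invert' flips colors,
    'mirror' is a simple geometric transform."""
    if mode == "noise":
        noisy = img.astype(int) + rng.integers(-30, 31, size=img.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)
    if mode == "invert":
        return (255 - img).astype(np.uint8)
    if mode == "mirror":
        return img[:, ::-1].copy()
    raise ValueError(f"unknown mode: {mode}")
```

A system like the one described would pick a random combination of such modes per word, so that no transcriber sees enough consistent context to reconstruct the sensitive record.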
Implementation of a web platform for the storage and dissemination of old architectural plans using OCR and web technologies
The objective of this thesis is to integrate OCR technology and web technologies
into a single web platform that enables the storage and dissemination of old
architectural plans. Plan registration is currently performed manually using
index cards (see Annex 1), so the web platform that was developed provides a
safer backup and faster queries.
Overall development followed the agile SCRUM methodology, which produced
increasingly complete functional deliverables as the requirements were
addressed. Having opted for web technologies, the programming languages used
were PHP, JavaScript, and HTML; the Laravel framework was ultimately chosen
because it provides an MVC architecture and integrates OCR in a
straightforward way.
The OCR technology was selected through t-test analysis of two candidates,
Tesseract OCR and the OCR Space API; the final result was the selection of the
OCR Space API, which achieved an accuracy of 80.28% against 71.58% for
Tesseract. Although its accuracy was within an acceptable range, Tesseract OCR
was discarded because, as a set of libraries, it depends on the processing
speed of the computer on which it runs, whereas OCR Space, being an API,
performs most of the image-scanning work in the cloud; its only limitation is
the number of queries allowed per day.
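The t-test comparison of the two OCR engines can be sketched as follows. The thesis does not specify the exact test variant, so this sketch assumes Welch's unequal-variance t statistic over per-document accuracy samples; the sample values are invented for illustration:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples
    (e.g. per-document accuracy percentages of two OCR engines)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Invented accuracy samples for two engines (percent correct per test document).
engine_a = [80.0, 81.0, 79.0, 80.0]
engine_b = [71.0, 72.0, 70.0, 71.0]
t = welch_t(engine_a, engine_b)  # large positive t favors engine_a
```

A large |t| relative to the relevant t distribution indicates that the accuracy difference between the two engines is unlikely to be due to chance, which is the basis on which the thesis selected the OCR Space API.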
The final result is a web platform that uses OCR technology to autocomplete
some text fields by selecting information on a scanned plan; it also meets the
objective of storing that information for later retrieval by any registered
user through a responsive, user-friendly interface