65 research outputs found
A joint study of deep learning-based methods for identity document image binarization and its influence on attribute recognition
Text recognition has benefited considerably from deep learning research, as have the preprocessing methods included in its workflow. Identity documents are critical in the field of document analysis and should be thoroughly researched in relation to this workflow. We examine the link between deep learning-based binarization and recognition algorithms for this type of document on the MIDV-500 and MIDV-2020 datasets. We present a series of experiments illustrating the relation between captured image quality and binarization results, as well as the influence of the binarization output on final recognition performance. We show that deep learning-based binarization solutions are affected by capture quality, which implies that they still need significant improvement. We also show that proper binarization can improve the performance of many recognition methods. Our retrained U-Net-bin outperformed all other binarization methods, and the best recognition result was obtained by PaddleOCR v2
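The binarization step this abstract studies maps a grayscale document image to black-and-white before recognition. As background only (the paper's methods are deep networks such as U-Net-bin, not shown here), a minimal classical baseline is Otsu's global threshold, sketched below with NumPy:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold for an 8-bit grayscale image
    by maximizing between-class variance over all 256 cut points."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    mu_total = (hist * np.arange(256)).sum() / total
    best_t, best_var = 0, -1.0
    cum_w = cum_mu = 0.0
    for t in range(256):
        cum_w += hist[t]
        cum_mu += t * hist[t]
        w0 = cum_w / total
        if w0 == 0 or w0 == 1:
            continue
        mu0 = cum_mu / cum_w                               # mean of background class
        mu1 = (mu_total * total - cum_mu) / (total - cum_w)  # mean of foreground class
        var_between = w0 * (1 - w0) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray: np.ndarray) -> np.ndarray:
    """Binarize an image: pixels above the Otsu threshold become 255, the rest 0."""
    return (gray > otsu_threshold(gray)).astype(np.uint8) * 255
```

Deep learning-based binarizers replace this single global threshold with a learned per-pixel decision, which is why, as the abstract notes, their output quality varies with capture conditions.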
Landscape Analysis for the Specimen Data Refinery
This report reviews the current state of the art in applied automated tools, services, and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems, and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens
Article Segmentation in Digitised Newspapers
Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. 
We propose several techniques for block representation and contribute a novel highly-compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities
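The abstract's 2D Markov approach reduces structured labelling to a Markov chain decoded with the Viterbi algorithm. The specific model is the thesis's own; below is only a generic log-space Viterbi decoder of the kind such an approach relies on:

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, obs):
    """Most likely state path for a discrete HMM.
    log_start: (S,) log initial probs; log_trans: (S, S) log transition probs;
    log_emit: (S, O) log emission probs; obs: list of observation symbol ids."""
    S, T = log_start.shape[0], len(obs)
    score = np.full((T, S), -np.inf)   # best log prob of any path ending in state s at t
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            cand = score[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + log_emit[s, obs[t]]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):       # follow backpointers from the end
        path.append(back[t, path[-1]])
    return path[::-1]
```

In the article-segmentation setting, states would correspond to block labels (e.g. "same article" vs. "new article") and observations to block features, with transitions encoding the reading-order dependencies the abstract describes.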
Field typing for improved recognition on heterogeneous handwritten forms
Offline handwriting recognition has undergone continuous progress over the
past decades. However, existing methods are typically benchmarked on free-form
text datasets that are biased towards good-quality images and handwriting
styles, and homogeneous content. In this paper, we show that state-of-the-art
algorithms, employing long short-term memory (LSTM) layers, do not readily
generalize to real-world structured documents, such as forms, due to their
highly heterogeneous and out-of-vocabulary content, and to the inherent
ambiguities of this content. To address this, we propose to leverage the
content type within an LSTM-based architecture. Furthermore, we introduce a
procedure to generate synthetic data to train this architecture without
requiring expensive manual annotations. We demonstrate the effectiveness of our
approach at transcribing text on a challenging, real-world dataset of European
Accident Statements
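The paper's contribution is leveraging field content type inside an LSTM architecture; that model is not reproduced here. As a loose illustration of the underlying idea, one simple way to exploit a known field type at decoding time is to mask out characters the type cannot produce before greedy decoding. The function, field names, and charsets below are all hypothetical:

```python
import numpy as np

# Hypothetical per-type charsets (illustrative only, not the paper's).
FIELD_CHARSETS = {
    "date": set("0123456789/.-"),
    "name": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ '-"),
}

def decode_typed(logits: np.ndarray, alphabet: str, field_type: str) -> str:
    """Greedy decode a (timesteps, |alphabet|) score matrix,
    masking characters outside the declared field type's charset."""
    allowed = FIELD_CHARSETS[field_type]
    mask = np.array([0.0 if c in allowed else -np.inf for c in alphabet])
    return "".join(alphabet[int(np.argmax(step + mask))] for step in logits)
```

This captures why typing helps with ambiguous handwriting: a shape that could be read as "A" or "4" resolves to "4" once the field is known to be a date.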
Audiovisual Metadata Platform Pilot Development (AMPPD), Final Project Report
This report documents the experience and findings of the Audiovisual Metadata Platform Pilot Development (AMPPD) project, which has worked to enable more efficient generation of metadata to support discovery and use of digitized and born-digital audio and moving image collections. The AMPPD project was carried out by partners Indiana University Libraries, AVP, University of Texas at Austin, and New York Public Library between 2018 and 2021
Steganoscription: exploring techniques for privacy-preserving crowdsourced transcription of handwritten documents
The focus of my research is the historical document format represented by the Central State Hospital (CSH) dataset: handwritten medical records. The specific problem innate to the CSH dataset is how to transcribe sensitive, cursive-handwritten documents via a manual vehicle such as crowdsourcing. Manual methods are necessary, no matter the sophistication of the optical character recognition system used, because of the inconsistencies within cursive script. To address this problem I have developed an application that enables users to transcribe sensitive handwritten document images while preserving the privacy of the context around the transcribed text via random word selection and visual manipulation of the displayed text. This is made possible through several algorithms that process documents with a top-down approach. These system operations detect and segment lines of text in images, reverse the slant common to cursive script, detect and segment words, and finally manipulate word images before they are displayed to users; combinations of color, noise, and geometric manipulations are currently supported and applied randomly. This system, called Steganoscription, combines the concepts of steganography and transcription.
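The final stage described above randomly manipulates word images before display. The abstract names the manipulation families (color, noise, geometric) without giving implementations, so the sketch below is an assumed minimal version of each with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility in this sketch

def obfuscate_word(img: np.ndarray, mode: str) -> np.ndarray:
    """Apply one manipulation family to an 8-bit grayscale word image:
    'noise' perturbs pixel intensities, 'invert' flips colors,
    'mirror' is a simple geometric transform."""
    if mode == "noise":
        noisy = img.astype(int) + rng.integers(-30, 31, size=img.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)
    if mode == "invert":
        return (255 - img).astype(np.uint8)
    if mode == "mirror":
        return img[:, ::-1].copy()
    raise ValueError(f"unknown mode: {mode}")
```

A system like the one described would pick a random combination of such modes per word, so that no transcriber sees enough consistent context to reconstruct the sensitive record.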
Implementation of a web platform for the storage and dissemination of old architectural plans using OCR and web technologies
The objective of this thesis is to integrate OCR technology and web technologies
into a single web platform that enables the storage and dissemination of old
architectural plans. Plan registration is currently performed manually using
index cards (see Annex 1), so the web platform that was developed provides a
safer backup and faster queries.
Overall development followed the agile SCRUM methodology, which produced
increasingly complete functional deliverables as the requirements were
addressed. Having opted for web technologies, the programming languages used
were PHP, JavaScript, and HTML; the Laravel framework was ultimately chosen
because it provides an MVC architecture and integrates OCR in a
straightforward way.
The OCR technology was selected through t-test analysis of two candidates,
Tesseract OCR and the OCR Space API; the final result was the selection of the
OCR Space API, which achieved an accuracy of 80.28% against 71.58% for
Tesseract. Although its accuracy was within an acceptable range, Tesseract OCR
was discarded because, as a set of libraries, it depends on the processing
speed of the computer on which it runs, whereas OCR Space, being an API,
performs most of the image-scanning work in the cloud; its only limitation is
the number of queries allowed per day.
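The t-test comparison of the two OCR engines can be sketched as follows. The thesis does not specify the exact test variant, so this sketch assumes Welch's unequal-variance t statistic over per-document accuracy samples; the sample values are invented for illustration:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples
    (e.g. per-document accuracy percentages of two OCR engines)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Invented accuracy samples for two engines (percent correct per test document).
engine_a = [80.0, 81.0, 79.0, 80.0]
engine_b = [71.0, 72.0, 70.0, 71.0]
t = welch_t(engine_a, engine_b)  # large positive t favors engine_a
```

A large |t| relative to the relevant t distribution indicates that the accuracy difference between the two engines is unlikely to be due to chance, which is the basis on which the thesis selected the OCR Space API.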
The final result is a web platform that uses OCR technology to autocomplete
some text fields by selecting information on a scanned plan; it also meets the
objective of storing that information for later retrieval by any registered
user through a responsive, user-friendly interface