Key-value information extraction from full handwritten pages
We propose a Transformer-based approach for information extraction from
digitized handwritten documents. Our approach combines, in a single model, the
different steps that were so far performed by separate models: feature
extraction, handwriting recognition and named entity recognition. We compare
this integrated approach with traditional two-stage methods that perform
handwriting recognition before named entity recognition, and present results at
different levels: line, paragraph, and page. Our experiments show that
attention-based models are especially well suited to full pages,
as they do not require any prior segmentation step. Finally, we show that they
are able to learn from key-value annotations: a list of important words with
their corresponding named entities. We compare our models to state-of-the-art
methods on three public databases (IAM, ESPOSALLES, and POPP) and improve on
previously reported results on all three datasets.
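A minimal sketch of what key-value annotations could look like as a training target, assuming a tag-based serialization. The tag format, the function names and the example entity names are hypothetical and are not taken from the paper; the abstract only states that the annotation is a list of important words with their named entities.

```python
import re
from typing import List, Tuple

KeyValue = Tuple[str, str]  # (entity name, word or phrase)


def serialize_key_values(pairs: List[KeyValue]) -> str:
    """Join (entity, value) pairs into one tagged target string.

    The <entity>value</entity> format is an assumption made for this sketch.
    """
    return " ".join(f"<{entity}>{value}</{entity}>" for entity, value in pairs)


def parse_key_values(sequence: str) -> List[KeyValue]:
    """Recover (entity, value) pairs from a tagged string.

    Values are assumed not to contain '<', so a non-greedy match suffices.
    """
    return re.findall(r"<(\w+)>(.*?)</\1>", sequence)


pairs = [("surname", "Dupont"), ("date", "1887")]
target = serialize_key_values(pairs)
recovered = parse_key_values(target)
```

A model trained on such targets only needs the page image and the list of pairs, with no line or word positions, which is the property the abstract highlights.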
Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks
In this paper, we introduce a fully convolutional network for the document
layout analysis task. While state-of-the-art methods use models
pre-trained on natural scene images, our method, Doc-UFCN, relies on a U-shaped
model trained from scratch to detect objects in historical documents. We
treat line segmentation, and more generally layout analysis,
as a pixel-wise classification task, so our model outputs a
pixel-level labeling of the input images. We show that Doc-UFCN outperforms
state-of-the-art methods on various datasets and also demonstrate that the
parts pre-trained on natural scene images are not required to reach good
results. In addition, we show that pre-training on multiple document datasets
can improve performance. We evaluate the models using various metrics to
ensure a fair and complete comparison between the methods.
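To illustrate how a pixel-labeling output can be scored, here is a minimal sketch of intersection over union (IoU) computed per label on small label maps. The abstract does not name the metrics used, so IoU is an assumption chosen for illustration, and `pixel_iou` is a hypothetical helper rather than part of Doc-UFCN.

```python
from typing import List

Mask = List[List[int]]  # label map: one integer class per pixel


def pixel_iou(pred: Mask, target: Mask, label: int) -> float:
    """Intersection over union for one label across two label maps.

    Both maps are assumed to have the same height and width.
    Returns 1.0 when the label is absent from both maps.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred = p == label
            in_target = t == label
            intersection += int(in_pred and in_target)
            union += int(in_pred or in_target)
    return intersection / union if union else 1.0


# 0 = background, 1 = text line (hypothetical label ids)
predicted = [[0, 1, 1],
             [0, 1, 0]]
reference = [[0, 1, 0],
             [0, 1, 1]]
line_iou = pixel_iou(predicted, reference, label=1)
```

Pixel-level scores of this kind do not say whether individual lines were found or merged, which is one reason to report several metrics side by side.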
SIMARA: a database for key-value information extraction from full pages
We propose a new database for information extraction from historical
handwritten documents. The corpus includes 5,393 finding aids from six
different series, dating from the 18th-20th centuries. Finding aids are
handwritten documents that contain metadata describing older archives. They are
stored in the National Archives of France and are used by archivists to
identify and find archival documents. Each document is annotated at page level
and contains seven fields to retrieve. The location of each field is not
provided, so this dataset encourages research on
segmentation-free systems for information extraction. We propose a model based
on the Transformer architecture trained for end-to-end information extraction
and provide three sets for training, validation and testing, to ensure fair
comparison with future work. The database is freely accessible at
https://zenodo.org/record/7868059
Large-scale genealogical information extraction from handwritten Quebec parish records
This paper presents a complete workflow designed for extracting information from Quebec handwritten parish registers. The acts in these documents contain individual and family information that is highly valuable for genetic, demographic and social studies of the Quebec population. From an image of parish records, our workflow is able to identify the acts and extract personal information. The workflow is divided into successive steps: page classification, text line detection, handwritten text recognition, named entity recognition, and act detection and classification. For all these steps, different machine learning models are compared. Once the information is extracted, validation rules designed by experts are applied to standardize the extracted information and ensure its consistency with the type of act (birth, marriage or death). This validation step is able to reject records that are considered invalid or merged. The full workflow has been used to process over two million pages of Quebec parish registers from the 19th and 20th centuries. On a sample comprising 65% of registers, 3.2 million acts were recognized. Verification of the birth and death acts from this sample shows that 74% of them are considered complete and valid. These records will be integrated into the BALSAC database and linked together to recreate family and genealogical relations at large scale.
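The validation step can be pictured with a minimal sketch that checks an extracted act against required fields for its type. The expert rules are not given in the abstract, so the field names in `REQUIRED_FIELDS` and the `validate_act` function are hypothetical, and detecting merged acts is not covered here.

```python
from typing import Dict, List

# Hypothetical required fields per act type; the real expert rules are not published here.
REQUIRED_FIELDS = {
    "birth": {"child_name", "birth_date", "father_name", "mother_name"},
    "marriage": {"groom_name", "bride_name", "marriage_date"},
    "death": {"deceased_name", "death_date"},
}


def validate_act(act_type: str, fields: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the act passes these checks."""
    if act_type not in REQUIRED_FIELDS:
        return [f"unknown act type: {act_type}"]
    problems = []
    for name in sorted(REQUIRED_FIELDS[act_type]):
        # A field counts as missing when absent or made only of whitespace.
        if not fields.get(name, "").strip():
            problems.append(f"missing field: {name}")
    return problems


complete = validate_act("death", {"deceased_name": "Marie Tremblay", "death_date": "1887-03-02"})
incomplete = validate_act("death", {"deceased_name": "Marie Tremblay"})
```

Returning the list of problems rather than a boolean keeps the reason for rejection available when records are reviewed later.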
Robust text line detection in historical documents: learning and evaluation methods
HORAE: an annotated dataset of books of hours
Handwritten Text Recognition from Crowdsourced Annotations
Accepted to the 7th International Workshop on Historical Document Imaging and Processing (HIP 23). In this paper, we explore different ways of training a model for handwritten text recognition when multiple imperfect or noisy transcriptions are available. We consider various training configurations, such as selecting a single transcription, retaining all transcriptions, or computing an aggregated transcription from all available annotations. In addition, we evaluate the impact of quality-based data selection, where samples with low agreement are removed from the training set. Our experiments are carried out on municipal registers of the city of Belfort (France) written between 1790 and 1946. The results show that computing a consensus transcription or training on multiple transcriptions are good alternatives. However, selecting training samples based on the degree of agreement between annotators introduces a bias in the training data and does not improve the results. Our dataset is publicly available on Zenodo: https://zenodo.org/record/8041668
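One simple way to obtain a single transcription from several noisy ones is to pick the medoid, meaning the candidate with the smallest total edit distance to all the others. The sketch below assumes this strategy for illustration only; the abstract does not say how the consensus transcription is computed, and `medoid_transcription` is a hypothetical helper.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def medoid_transcription(transcriptions: List[str]) -> str:
    """Return the transcription closest, in total edit distance, to all others.

    Ties are resolved in favour of the earliest candidate in the list.
    """
    if not transcriptions:
        raise ValueError("at least one transcription is required")
    return min(
        transcriptions,
        key=lambda candidate: sum(levenshtein(candidate, other) for other in transcriptions),
    )


annotations = ["Belfort", "Belfart", "Belfort"]
consensus = medoid_transcription(annotations)
```

A medoid always returns one of the existing annotations, so it cannot combine correct parts of different transcriptions; alignment-based aggregation would be needed for that.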