13,460 research outputs found
Text Line Segmentation of Historical Documents: a Survey
There is a huge amount of historical documents in libraries and in various
National Archives that have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines),automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.Comment: 25 pages, submitted version, To appear in International Journal on
Document Analysis and Recognition, On line version available at
http://www.springerlink.com/content/k2813176280456k3
On-the-fly Historical Handwritten Text Annotation
The performance of information retrieval algorithms depends upon the
availability of ground truth labels annotated by experts. This is an important
prerequisite, and difficulties arise when the annotated ground truth labels are
incorrect or incomplete due to high levels of degradation. To address this
problem, this paper presents a simple method to perform on-the-fly annotation
of degraded historical handwritten text in ancient manuscripts. The proposed
method aims at quick generation of ground truth and correction of inaccurate
annotations such that the bounding box perfectly encapsulates the word, and
contains no added noise from the background or surroundings. This method will
potentially be of help to historians and researchers in generating and
correcting word labels in a document dynamically. The effectiveness of the
annotation method is empirically evaluated on an archival manuscript collection
from well-known publicly available datasets
Handwritten and printed text separation in historical documents
Historical documents present many challenges for Optical Character Recognition Systems
(OCR), especially documents of poor quality containing handwritten annotations,
stamps, signatures, and historical fonts. As most OCRs recognize either machine-printed
or handwritten texts, printed and handwritten parts have to be separated before using
the respective recognition system. This thesis addresses the problem of segmentation of
handwritings and printings in historical Latin text documents. To alleviate the problem
of lack of data containing handwritten and machine-printed components located on the
same page or even overlapping each other as well as their pixel-wise annotations, the data
synthesis method proposed in [12] was applied and new datasets were generated. The
newly created images and their pixel-level labels were used to train Fully Convolutional
Model (FCN) introduced in [5]. The newly trained model has shown better results in the
separation of machine-printed and handwritten text in historical documents
The Challenges of HTR Model Training: Feedback from the Project Donner le gout de l'archive a l'ere numerique
The arrival of handwriting recognition technologies offers new possibilities
for research in heritage studies. However, it is now necessary to reflect on
the experiences and the practices developed by research teams. Our use of the
Transkribus platform since 2018 has led us to search for the most significant
ways to improve the performance of our handwritten text recognition (HTR)
models which are made to transcribe French handwriting dating from the 17th
century. This article therefore reports on the impacts of creating transcribing
protocols, using the language model at full scale and determining the best way
to use base models in order to help increase the performance of HTR models.
Combining all of these elements can indeed increase the performance of a single
model by more than 20% (reaching a Character Error Rate below 5%). This article
also discusses some challenges regarding the collaborative nature of HTR
platforms such as Transkribus and the way researchers can share their data
generated in the process of creating or training handwritten text recognition
models
SIMARA: a database for key-value information extraction from full pages
We propose a new database for information extraction from historical
handwritten documents. The corpus includes 5,393 finding aids from six
different series, dating from the 18th-20th centuries. Finding aids are
handwritten documents that contain metadata describing older archives. They are
stored in the National Archives of France and are used by archivists to
identify and find archival documents. Each document is annotated at page-level,
and contains seven fields to retrieve. The localization of each field is not
available in such a way that this dataset encourages research on
segmentation-free systems for information extraction. We propose a model based
on the Transformer architecture trained for end-to-end information extraction
and provide three sets for training, validation and testing, to ensure fair
comparison with future works. The database is freely accessible at
https://zenodo.org/record/7868059
A bimodal crowdsourcing platform for demographic historical manuscripts
Ponència presentada al First International Conference on Digital Access to Textual Cultural Heritage celebrada del 19 al 20 de maig de 2014 a MadridIn this paper we present a crowdsourcing web-based application for extracting information from demographic handwritten document images. The proposed application integrates two points of view: the semantic information for demographic research, and the ground-truthing for document analysis research. Concretely, the application has the contents view, where the information is recorded into forms, and the labeling view, with the word labels for evaluating document analysis techniques. The crowdsourcing architecture allows to accelerate the information extraction (many users can work simultaneously), validate the information, and easily provide feedback to the users. We finally show how the proposed application can be extended to other kind of demographic historical manuscripts
- …