Ground-Truth production in the tranScriptorium Project
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
tranScriptorium is a three-year project that aims to develop innovative, cost-effective solutions for the indexing, search and full transcription of historical handwritten document images, using Handwritten Text Recognition (HTR) technology. The production of ground truth (GT) for a dataset of handwritten document images is among the first tasks. We address novel approaches for the faster production of this GT based on crowd-sourcing and on prior-knowledge methods. We also address a novel low-cost semi-supervised procedure for obtaining correctly aligned line-level pairs of detected/extracted text line images and text line transcripts, especially suitable for training models of the HTR technology employed in tranScriptorium.
Work supported by the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 600707 - tranScriptorium.
Gatos, B.; Louloudis, G.; Caser, T.; Grint, K.; Romero Gómez, V.; Sánchez Peiró, JA.; Toselli, AH.... (2014). Ground-Truth production in the tranScriptorium Project. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on. IEEE Computer Society - Conference Publishing Services (CPS). 237-241. https://doi.org/10.1109/DAS.2014.23
On-the-fly Historical Handwritten Text Annotation
The performance of information retrieval algorithms depends upon the
availability of ground truth labels annotated by experts. This is an important
prerequisite, and difficulties arise when the annotated ground truth labels are
incorrect or incomplete due to high levels of degradation. To address this
problem, this paper presents a simple method to perform on-the-fly annotation
of degraded historical handwritten text in ancient manuscripts. The proposed
method aims at quick generation of ground truth and correction of inaccurate
annotations such that the bounding box perfectly encapsulates the word, and
contains no added noise from the background or surroundings. This method will
potentially be of help to historians and researchers in generating and
correcting word labels in a document dynamically. The effectiveness of the
annotation method is empirically evaluated on an archival manuscript collection
from well-known publicly available datasets.
Handwritten Text Recognition for Historical Documents in the tranScriptorium Project
""© Owner/Author 2014. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM, In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (pp. 111-117) http://dx.doi.org/10.1145/2595188.2595193Transcription of historical handwritten documents is a crucial
problem for making these documents more accessible
to the general public. Currently, huge amounts of historical
handwritten documents are being made available by on-line
portals worldwide. It is not realistic to obtain the transcription
of these documents manually, and therefore automatic
techniques have to be used. tranScriptorium is
a project that aims to research modern Handwritten
Text Recognition (HTR) technology for transcribing historical
handwritten documents. The HTR technology used in
tranScriptorium is based on models that are learnt automatically
from examples. This HTR technology has been
used on a Dutch collection from the 15th century selected for
the tranScriptorium project. This paper provides preliminary
HTR results on this Dutch collection that are very
encouraging, taking into account that minimal resources have
been deployed to develop the transcription system.
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 600707 - tranScriptorium and the Spanish MEC under the STraDa (TIN2012-37475-C02-01) research project.
Sánchez Peiró, JA.; Bosch Campos, V.; Romero Gómez, V.; Depuydt, K.; De Does, J. (2014). Handwritten Text Recognition for Historical Documents in the tranScriptorium Project. ACM. https://doi.org/10.1145/2595188.2595193
ICFHR2014 Competition on Handwritten Text Recognition on tranScriptorium Datasets (HTRtS)
©2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
A contest on Handwritten Text Recognition organised
in the context of the ICFHR 2014 conference is described.
Two tracks with increased freedom on the use of training
data were proposed and three research groups participated
in these two tracks. The handwritten images for this contest
were drawn from an English data set which is currently being
considered in the tranScriptorium project. The goal of
this project is to develop innovative, efficient and cost-effective
solutions for the transcription of historical handwritten document
images, focusing on four languages: English, Spanish,
German and Dutch. For the English language, the so-called
“Bentham collection” is being considered in tranScriptorium.
It encompasses a large set of manuscripts written by the
renowned English philosopher and reformer Jeremy Bentham
(1748-1832). A small subset of this collection has been chosen
for the present HTR competition. The selected subset has been
written by several hands (Bentham himself and his secretaries)
and entails significant variabilities and difficulties regarding the
quality of text images and writing styles. Training and test
data were provided in the form of carefully segmented line
images, along with the corresponding transcripts. The three
participants achieved very good results, with transcription
word error rates ranging from 15.0% down to 8.6%.
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 600707 - tranScriptorium. The authors would like to thank all the tranScriptorium members for their collaboration and the entrants for their participation in this contest.
Sánchez Peiró, JA.; Romero Gómez, V.; Toselli, AH.; Vidal Ruiz, E. (2014). ICFHR2014 Competition on Handwritten Text Recognition on tranScriptorium Datasets (HTRtS). IEEE. https://doi.org/10.1109/ICFHR.2014.137
tranScriptorium: A Guide for the Humanist
This report describes the work of the tranScriptorium project, which intends to produce software applications that will 'read' historical handwritten documents and produce transcripts of these manuscripts. It describes the process in non-technical language, and explains the potential benefits of making available millions of pages of manuscripts that have been digitized but remain untranscribed and relatively inaccessible. The tranScriptorium consortium consists of: Universitat Politècnica de València – UPV (Spain, lead institution); University of Innsbruck – UIBK (Austria); National Center for Scientific Research “Demokritos” – NCSR (Greece); University College London – UCL (UK); Institute for Dutch Lexicology – INL (Netherlands); University London Computer Centre – ULCC (UK). This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no 600707.
Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio -- Episode 1: Machine Transcription of the Manuscripts
In Codice Ratio is a research project to study tools and techniques for
analyzing the contents of historical documents conserved in the Vatican Secret
Archives (VSA). In this paper, we present our efforts to develop a system to
support the transcription of medieval manuscripts. The goal is to provide
paleographers with a tool to reduce their efforts in transcribing large
volumes, such as those stored in the VSA, producing good transcriptions for
significant portions of the manuscripts. We propose an original approach based
on character segmentation. Our solution is able to deal with the dirty
segmentation that inevitably occurs in handwritten documents. We use a
convolutional neural network to recognize characters and language models to
compose word transcriptions. Our approach requires minimal training efforts,
making the transcription process more scalable as the production of training
sets requires a few pages and can be easily crowdsourced. We have conducted
experiments on manuscripts from the Vatican Registers, an unreleased corpus
containing the correspondence of the popes. With training data produced by 120
high school students, our system has been able to produce good transcriptions
that can be used by paleographers as a solid basis to speed up the transcription
process at a large scale.
Comment: Donatella Firmani, Marco Maiorino, Paolo Merialdo, and Elena Nieddu.
2018. Towards Knowledge Discovery from the Vatican Secret Archives. In Codice
Ratio - Episode 1: Machine Transcription of the Manuscripts. In Proceedings
of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining (KDD '18). ACM, New York, NY, USA, 263-27
Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study
Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point of the advanced use of digitised heritage content. The paper aims to discuss these issues.
Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.
Findings: Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.
Research limitations/implications: The paper presents results from projects: further user studies could be undertaken involving interviews, surveys, etc.
Practical implications: Only HTR provided via Transkribus is covered: however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and it represents the current state-of-the-art in this field.
Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.
Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.
Advances in Handwritten Keyword Indexing and Search Technologies
Many extensive manuscript collections are available in archives and libraries all over the world, but their textual contents remain practically inaccessible, buried under thousands of terabytes worth of high-resolution images. If perfect or sufficiently accurate text-image transcripts were available, textual content could be indexed directly for plaintext access using conventional information retrieval systems. But the results of fully automated transcriptions generally lack the level of accuracy needed for reliable text indexing and search purposes. Additionally, manual or even computer-assisted transcription is entirely unsustainable when dealing with the extensive image collections typically considered for indexing. This paper explains how accurate indexing and search commands can be implemented directly on the digital images themselves without the need to explicitly resort to image transcripts. Results obtained using the proposed techniques on several relevant historical data sets are presented, clearly supporting the considerable potential of these technologies.