
52 research outputs found

Ground-Truth production in the tranScriptorium Project

Author: Caser Tim
Gatos Basilis
Grint Kris
Louloudis Georgios
Romero Gómez Verónica
Sánchez Peiró Joan Andreu
Toselli Alejandro Héctor
Vidal Enrique
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/04/2014

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

tranScriptorium is a 3-year project that aims to develop innovative, cost-effective solutions for the indexing, search and full transcription of historical handwritten document images, using Handwritten Text Recognition (HTR) technology. The production of ground truth (GT) for a dataset of handwritten document images is among the first tasks. We address novel approaches for the faster production of this GT based on crowd-sourcing and on prior-knowledge methods. We also address a novel low-cost semi-supervised procedure for obtaining correctly aligned line-level pairs of detected/extracted text line images and text line transcripts, especially suitable for training the models of the HTR technology employed in tranScriptorium.

Work supported by the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 600707 - tranScriptorium.

Gatos, B.; Louloudis, G.; Caser, T.; Grint, K.; Romero Gómez, V.; Sánchez Peiró, JA.; Toselli, AH.... (2014). Ground-Truth production in the tranScriptorium Project. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on. IEEE Computer Society - Conference Publishing Services (CPS). 237-241. https://doi.org/10.1109/DAS.2014.23

RiuNet

On-the-fly Historical Handwritten Text Annotation

Author: Hast Anders
Vats Ekta
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 06/09/2017

The performance of information retrieval algorithms depends upon the availability of ground truth labels annotated by experts. This is an important prerequisite, and difficulties arise when the annotated ground truth labels are incorrect or incomplete due to high levels of degradation. To address this problem, this paper presents a simple method to perform on-the-fly annotation of degraded historical handwritten text in ancient manuscripts. The proposed method aims at quick generation of ground truth and correction of inaccurate annotations, such that the bounding box perfectly encapsulates the word and contains no added noise from the background or surroundings. This method will potentially be of help to historians and researchers in generating and correcting word labels in a document dynamically. The effectiveness of the annotation method is empirically evaluated on an archival manuscript collection from well-known publicly available datasets.
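The idea of a bounding box that tightly encloses a word can be illustrated with a very small sketch. This is not the method of the paper, whose details are not given in the abstract; it is a simple threshold-based shrink, and the function name `tight_bounding_box`, the `ink_threshold` parameter, and the grayscale convention (0 = black ink, 255 = white background) are all assumptions made for illustration.

```python
def tight_bounding_box(image, box, ink_threshold=128):
    """Shrink a rough box so it just encloses the ink pixels inside it.

    image: list of rows of grayscale ints (assumed 0 = ink, 255 = background).
    box: (top, left, bottom, right), with bottom and right exclusive.
    Returns the tightened box, or None if the box contains no ink.
    """
    top, left, bottom, right = box

    # Rows and columns of the rough box that contain at least one ink pixel.
    ink_rows = [
        y for y in range(top, bottom)
        if any(image[y][x] < ink_threshold for x in range(left, right))
    ]
    ink_cols = [
        x for x in range(left, right)
        if any(image[y][x] < ink_threshold for y in range(top, bottom))
    ]
    if not ink_rows or not ink_cols:
        return None

    # Keep the exclusive convention for bottom and right.
    return (ink_rows[0], ink_cols[0], ink_rows[-1] + 1, ink_cols[-1] + 1)


# Hypothetical 5x6 page fragment with three ink pixels.
page = [[255] * 6 for _ in range(5)]
for y, x in [(1, 2), (2, 3), (3, 2)]:
    page[y][x] = 0

rough_box = (0, 0, 5, 6)
print(tight_bounding_box(page, rough_box))
```

A plain threshold like this would also lock onto stains or bleed-through on a degraded page, which is the kind of noise a real annotation tool has to separate from the word itself.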

arXiv.org e-Print Archive

Crossref

Handwritten Text Recognition for Historical Documents in the tranScriptorium Project

Author: Bosch Campos Vicente
De Does Jesse
Depuydt Katrien
Romero Gómez Verónica
Sánchez Peiró Joan Andreu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2014

""© Owner/Author 2014. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM, In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (pp. 111-117) http://dx.doi.org/10.1145/2595188.2595193Transcription of historical handwritten documents is a crucial problem for making easier the access to these documents to the general public. Currently, huge amount of historical handwritten documents are being made available by on-line portals worldwide. It is not realistic to obtain the transcription of these documents manually, and therefore automatic techniques has to be used. tranScriptorium is a project that aims at researching on modern Handwritten Text Recognition (HTR) technology for transcribing historical handwritten documents. The HTR technology used in tranScriptorium is based on models that are learnt automatically from examples. This HTR technology has been used on a Dutch collection from 15th century selected for the tranScriptorium project. This paper provides preliminary HTR results on this Dutch collection that are very encouraging, taken into account that minimal resources have been deployed to develop the transcription system.The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 600707 - tranScriptorium and the Spanish MEC under the STraDa (TIN2012-37475-C02-01) research project.Sánchez Peiró, JA.; Bosch Campos, V.; Romero Gómez, V.; Depuydt, K.; De Does, J. (2014). Handwritten Text Recognition for Historical Documents in the tranScriptorium Project. ACM. https://doi.org/10.1145/2595188.2595193

Crossref

RiuNet

ICFHR2014 Competition on Handwritten Text Recognition on tranScriptorium Datasets (HTRtS)

Author: Romero Gómez Verónica
Sánchez Peiró Joan Andreu
Toselli Alejandro Héctor
Vidal Ruiz Enrique
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/09/2014

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

A contest on Handwritten Text Recognition organised in the context of the ICFHR 2014 conference is described. Two tracks with increased freedom on the use of training data were proposed and three research groups participated in these two tracks. The handwritten images for this contest were drawn from an English data set which is currently being considered in the tranScriptorium project. The goal of this project is to develop innovative, efficient and cost-effective solutions for the transcription of historical handwritten document images, focusing on four languages: English, Spanish, German and Dutch. For the English language, the so-called “Bentham collection” is being considered in tranScriptorium. It encompasses a large set of manuscripts written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). A small subset of this collection has been chosen for the present HTR competition. The selected subset has been written by several hands (Bentham himself and his secretaries) and entails significant variabilities and difficulties regarding the quality of text images and writing styles. Training and test data were provided in the form of carefully segmented line images, along with the corresponding transcripts. The three participants achieved very good results, with transcription word error rates ranging from 15.0% down to 8.6%.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 600707 - tranScriptorium. The authors would like to thank all the tranScriptorium members for their collaboration and the entrants for their participation in this contest.

Sánchez Peiró, JA.; Romero Gómez, V.; Toselli, AH.; Vidal Ruiz, E. (2014). ICFHR2014 Competition on Handwritten Text Recognition on tranScriptorium Datasets (HTRtS). IEEE. https://doi.org/10.1109/ICFHR.2014.137
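Word error rate (WER), the figure quoted for the competition results, is conventionally the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words. The sketch below is a minimal illustration, assuming whitespace tokenization and unit costs for substitutions, insertions and deletions. The function name `word_error_rate` is hypothetical, and the competition's actual scoring protocol (tokenization, case, punctuation handling) is not stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick fox"))
```

Because insertions are counted against the reference length, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.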

Crossref

RiuNet

tranScriptorium: A Guide for the Humanist

Author: Schofield P
Publication date: 01/01/2014

This report describes the work of the tranScriptorium project, which intends to produce software applications which will 'read' historical handwritten documents, and produce transcripts for these manuscripts. It describes the process in non-technical language, and explains the potential benefits of making available millions of pages of hitherto digitized, but untranscribed and relatively inaccessible, manuscripts. The tranScriptorium consortium consists of: Universitat Politècnica de València – UPV (Spain, lead institution); University of Innsbruck – UIBK (Austria); National Center for Scientific Research “Demokritos” – NCSR (Greece); University College London – UCL (UK); Institute for Dutch Lexicology – INL (Netherlands); University London Computer Centre – ULCC (UK). This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no 600707.

UCL Discovery

Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio -- Episode 1: Machine Transcription of the Manuscripts

Author: Firmani Donatella
Maiorino Marco
Merialdo Paolo
Nieddu Elena
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018

In Codice Ratio is a research project to study tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system to support the transcription of medieval manuscripts. The goal is to provide paleographers with a tool to reduce their efforts in transcribing large volumes, such as those stored in the VSA, producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and language models to compose word transcriptions. Our approach requires minimal training efforts, making the transcription process more scalable as the production of training sets requires a few pages and can be easily crowdsourced. We have conducted experiments on manuscripts from the Vatican Registers, an unreleased corpus containing the correspondence of the popes. With training data produced by 120 high school students, our system has been able to produce good transcriptions that can be used by paleographers as a solid basis to speed up the transcription process at a large scale.

Comment: Donatella Firmani, Marco Maiorino, Paolo Merialdo, and Elena Nieddu. 2018. Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio - Episode 1: Machine Transcription of the Manuscripts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 263-27

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Roma 3

Archivio della ricerca- Università di Roma La Sapienza

Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study

Author: Ares Oliveira Sofia
Bryan Maximilian
Colutto Sebastian
Diem Markus
Déjean Hervé
Fiel Stefan
Gatos Basilis
Greinoecker Albert
Grüning Tobias
Hackl Guenter
Haukkovaara Vili
Heyer Gerhard
Hirvonen Lauri
Hodel Tobias
Jokinen Matti
Jokinen Philip
Kallio Mario
Kaplan Frederic
Kleber Florian
Labahn Roger
Lang Eva Maria
Laube Sören
Leifert Gundram
Louloudis Georgios
McNicholl Rory
Meunier Jean-Luc
Michael Johannes
Muehlberger Guenter
Mühlbauer Elena
Philipp Nathanael
Pratikakis Ioannis
Puigcerver Pérez Joan
Putz Hannelore
Retsinas George
Romero Verónica
Sablatnig Robert
Schofield Philip
Seaward Louise
Sfikas Georgios
Sieber Christian
Stamatopoulos Nikolaos
Strauss Tobias
Sánchez Joan Andreu
Terbul Tamara
Terras Melissa
Toselli Alejandro Hector
Ulreich Berthold
Vicente Bosch
Vidal Enrique
Villega Mauricio
Walcher Johanna
Weidemann Max
Wurster Herbert
Zagoris Konstantinos
Publication venue: 'Emerald'
Publication date: 09/09/2019

Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point of the advanced use of digitised heritage content. The paper aims to discuss these issues.

Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.

Findings: Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.

Research limitations/implications: The paper presents results from projects: further user studies could be undertaken involving interviews, surveys, etc.

Practical implications: Only HTR provided via Transkribus is covered: however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and it represents the current state-of-the-art in this field.

Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.

Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.

Infoscience - École polytechnique fédérale de Lausanne

UCL Discovery

Edinburgh Research Explorer

ZORA

Bern Open Repository and Information System (BORIS)

Advances in Handwritten Keyword Indexing and Search Technologies

Author: Vidal Enrique
Publication venue: 'Books on Demand'
Publication date: 01/07/2017

Many extensive manuscript collections are available in archives and libraries all over the world, but their textual contents remain practically inaccessible, buried under thousands of terabytes worth of high-resolution images. If perfect or sufficiently accurate text-image transcripts were available, textual content could be indexed directly for plaintext access using conventional information retrieval systems. But the results of fully automated transcriptions generally lack the level of accuracy needed for reliable text indexing and search purposes. Additionally, manual or even computer-assisted transcription is entirely unsustainable when dealing with the extensive image collections typically considered for indexing. This paper explains how accurate indexing and search commands can be implemented directly on the digital images themselves without the need to explicitly resort to image transcripts. Results obtained using the proposed techniques on several relevant historical data sets are presented, clearly supporting the considerable potential of these technologies.

Kölner UniversitätsPublikationsServer