33 research outputs found

Handwriting recognition in historical documents using very large vocabularies

Author: Fischer Andreas
Frinken Volkmar
Martínez-Hinarejos Carlos-D.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013

© ACM 2013. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in HIP '13 Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, http://dx.doi.org/10.1145/2501115.2501116

Language models are used in automatic transcription systems to resolve ambiguities. This is done by limiting the vocabulary of words that can be recognized as well as estimating the n-gram probability of the words in the given text. In the context of historical documents, non-unified spelling and the limited amount of written text pose a substantial problem for the selection of the recognizable vocabulary as well as the computation of the word probabilities. In this paper we propose, for the transcription of historical Spanish text, to keep the corpus for the n-gram limited to a sample of the target text, but to expand the vocabulary with words gathered from external resources. We analyze the performance of such a transcription system with different sizes of external vocabularies and demonstrate the applicability of the approach and the significant increase in recognition accuracy when using up to 300 thousand external words.

This work has been supported by the European project FP7-PEOPLE-2008-IAPP: 230653, the European Research Council's Advanced Grant ERC-2010-AdG 20100407, the Spanish R&D projects TIN2009-14633-C03-03, RYC-2009-05031, TIN2011-24631, TIN2012-37475-C02-02, MITTRAL (TIN2009-14633-C03-01), Active2Trans (TIN2012-31723) as well as the Swiss National Science Foundation fellowship project PBBEP2_141453.

Frinken, V.; Fischer, A.; Martínez-Hinarejos, C. (2013). Handwriting recognition in historical documents using very large vocabularies. ACM. https://doi.org/10.1145/2501115.2501116

Crossref

RiuNet

HMM word graph based keyword spotting in handwritten document images

Author: Frinken Volkmar
Romero Verónica
Toselli Alejandro Héctor
Vidal Enrique
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

[EN] Line-level keyword spotting (KWS) is presented on the basis of frame-level word posterior probabilities. These posteriors are obtained using word graphs derived from the recognition process of a full-fledged handwritten text recognizer based on hidden Markov models and N-gram language models. This approach has several advantages. First, since it uses a holistic, segmentation-free technology, it does not require any kind of word or character segmentation. Second, the use of language models allows the context of each spotted word to be taken into account, thereby considerably increasing KWS accuracy. And third, the proposed KWS scores are based on true posterior probabilities, taking into account all (or most) possible word segmentations of the input image. These scores are properly bounded and normalized. This mathematically clean formulation lends itself to smooth, threshold-based keyword queries which, in turn, permit comfortable trade-offs between search precision and recall. Experiments are carried out on several historic collections of handwritten text images, as well as a well-known data set of modern English handwritten text. According to the empirical results, the proposed approach achieves KWS results comparable to those obtained with the recently-introduced "BLSTM neural networks KWS" approach and clearly outperforms the popular, state-of-the-art "Filler HMM" KWS method. Overall, the results clearly support all the above-claimed advantages of the proposed approach.

This work has been partially supported by the Generalitat Valenciana under the Prometeo/2009/014 project grant ALMA-MATER, and through the EU projects HIMANIS (JPICH programme, Spanish grant Ref. PCIN-2015-068) and READ (Horizon 2020 programme, grant Ref. 674943).

Toselli, AH.; Vidal, E.; Romero, V.; Frinken, V. (2016). HMM word graph based keyword spotting in handwritten document images. Information Sciences. 370:497-518. https://doi.org/10.1016/j.ins.2016.07.063

Crossref

RiuNet

The ESPOSALLES database: An ancient marriage license corpus for off-line handwriting recognition

Author: Alejandro H. Toselli
Alicia Fornés
Enrique Vidal
Joan Andreu Sánchez
Josep Lladós
Nicolás Serrano
Verónica Romero
Volkmar Frinken
Publication venue: 'Elsevier BV'
Publication date: 01/06/2013

NOTICE: this is the author's version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition, Volume 46, Issue 6, June 2013, Pages 1658–1669, DOI: 10.1016/j.patcog.2012.11.024

[EN] Historical records of daily activities provide intriguing insights into the life of our ancestors, useful for demography studies and genealogical research. Automatic processing of historical documents, however, has mostly been focused on single works of literature and less on social records, which tend to have a distinct layout, structure, and vocabulary. Such information is usually collected by expert demographers who devote a lot of time to transcribing the records manually. This paper presents a new database, compiled from a collection of marriage license books, to support research in automatic handwriting recognition for historical documents containing social records. Marriage license books are documents that were used for centuries by ecclesiastical institutions to register marriage licenses. Books from this collection are handwritten and span nearly half a millennium until the beginning of the 20th century. In addition, a study is presented on the capability of state-of-the-art handwritten text recognition systems when applied to the presented database. Baseline results are reported for reference in future studies. © 2012 Elsevier Ltd. All rights reserved.

Work supported by the EC (FEDER/FSE) and the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018), MITTRAL (TIN2009-14633-C03-01) and KEDIHC (TIN2009-14633-C03-03) projects.

This work has been partially supported by the European Research Council Advanced Grant (ERC-2010-AdG-20100407: 269796-5CofM) and the European seventh framework project (FP7-PEOPLE-2008-IAPP: 230653-ADAO). Also supported by the Generalitat Valenciana under grant Prometeo/2009/014 and FPU AP2007-02867, and by the Universitat Politècnica de València (PAID-05-11). We would also like to thank the Center for Demographic Studies (UAB) and the Cathedral of Barcelona.

Romero Gómez, V.; Fornés, A.; Serrano Martínez-Santos, N.; Sánchez Peiró, JA.; Toselli, AH.; Frinken, V.; Vidal, E.... (2013). The ESPOSALLES database: An ancient marriage license corpus for off-line handwriting recognition. Pattern Recognition. 46(6):1658-1669. https://doi.org/10.1016/j.patcog.2012.11.024

Crossref

RiuNet

Evaluating retraining rules for semi-supervised learning in neural network based cursive word recognition

Author: Bunke Horst
Frinken Volkmar
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

Training a system to recognize handwritten words is a task that requires a large amount of data together with its correct transcription. However, the creation of such a training set, including the generation of the ground truth, is tedious and costly. One way of reducing the high cost of labeled training data acquisition is to exploit unlabeled data, which can be gathered easily. Making use of both labeled and unlabeled data is known as semi-supervised learning. One of the most general versions of semi-supervised learning is self-training, where a recognizer iteratively retrains itself on its own output on new, unlabeled data. In this paper we propose to apply semi-supervised learning, and in particular self-training, to the problem of cursive, handwritten word recognition. The special focus of the paper is on retraining rules that define which data are actually used in the retraining phase. In a series of experiments it is shown that the performance of a neural network based recognizer can be significantly improved through the use of unlabeled data and self-training if appropriate retraining rules are applied.

CiteSeerX

Crossref

Bern Open Repository and Information System (BORIS)

Self-training for handwritten text line recognition

Author: Bunke Horst
Frinken Volkmar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Crossref

Bern Open Repository and Information System (BORIS)

A novel word spotting algorithm using bidirectional long short-term memory neural networks

Author: Bunke Horst
Fischer Andreas
Frinken Volkmar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Bern Open Repository and Information System (BORIS)

Combining neural networks to improve performance of handwritten keyword spotting systems

Author: Bunke Horst
Fischer Andreas
Frinken Volkmar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Bern Open Repository and Information System (BORIS)

Improving graph classification by isomap

Author: Bunke Horst
Frinken Volkmar
Riesen Kaspar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Bern Open Repository and Information System (BORIS)

A Novel Word Spotting Method Based on Recurrent Neural Networks

Author: Bunke Horst
Fischer Andreas
Frinken Volkmar
Manmatha R.
Publication venue: SelectedWorks
Publication date: 01/01/2011

Keyword spotting refers to the process of retrieving all instances of a given keyword from a document. In the present paper, a novel keyword spotting method for handwritten documents is described. It is derived from a neural network based system for unconstrained handwriting recognition. As such it performs template-free spotting, i.e. it is not necessary for a keyword to appear in the training set. The keyword spotting is done using a modification of the CTC Token Passing algorithm in conjunction with a recurrent neural network. We demonstrate that the proposed system outperforms not only a classical dynamic time warping based approach but also a modern keyword spotting system based on hidden Markov models. Furthermore, we analyze the performance of the underlying neural networks when using them in a recognition task followed by keyword spotting on the produced transcription. We point out the advantages of keyword spotting when compared to classic text line recognition.

ScholarWorks@UMass Amherst

HMM-based word spotting in handwritten documents using subword models

Author: Bunke Horst
Fischer Andreas
Frinken Volkmar
Keller Anita
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010

Crossref

Bern Open Repository and Information System (BORIS)