8,983 research outputs found
The ESPOSALLES database: An ancient marriage license corpus for off-line handwriting recognition
NOTICE: this is the author’s version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition, Volume 46, Issue 6, June 2013, Pages 1658–1669.
DOI: 10.1016/j.patcog.2012.11.024
Historical records of daily activities provide intriguing insights into the life of our ancestors, useful for demography studies and genealogical research. Automatic processing of historical documents, however, has mostly been focused on single works of literature and less on social records, which tend to have a distinct layout, structure, and vocabulary. Such information is usually collected by expert demographers who devote a lot of time to transcribing it manually. This paper presents a new database, compiled from a collection of marriage license books, to support research in automatic handwriting recognition for historical documents containing social records. Marriage license books are documents that were used for centuries by ecclesiastical institutions to register marriage licenses. Books from this collection are handwritten and span nearly half a millennium until the beginning of the 20th century. In addition, a study is presented of the capability of state-of-the-art handwritten text recognition systems when applied to the presented database. Baseline results are reported for reference in future studies. © 2012 Elsevier Ltd. All rights reserved.
Work supported by the EC (FEDER/FSE) and the Spanish MEC/MICINN under the MIPRCV “Consolider Ingenio 2010” program (CSD2007-00018), MITTRAL (TIN2009-14633-C03-01) and KEDIHC (TIN2009-14633-C03-03) projects. This work has been partially supported by the European Research Council Advanced Grant (ERC-2010-AdG-20100407: 269796-5CofM) and the European seventh framework project (FP7-PEOPLE-2008-IAPP: 230653-ADAO). Also supported by the Generalitat Valenciana under grant Prometeo/2009/014 and FPU AP2007-02867, and by the Universitat Politècnica de València (PAID-05-11). We would also like to thank the Center for Demographic Studies (UAB) and the Cathedral of Barcelona.
Romero Gómez, V.; Fornés, A.; Serrano Martínez-Santos, N.; Sánchez Peiró, JA.; Toselli, AH.; Frinken, V.; Vidal, E.... (2013). The ESPOSALLES database: An ancient marriage license corpus for off-line handwriting recognition. Pattern Recognition. 46(6):1658-1669. https://doi.org/10.1016/j.patcog.2012.11.024
Using the MGGI Methodology for Category-based Language Modeling in Handwritten Marriage Licenses Books
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Handwritten marriage licenses books have been
used for centuries by ecclesiastical and secular institutions
to register marriages. The information contained in these
historical documents is useful for demography studies and
genealogical research, among others. Despite the generally
simple structure of the text in these documents, automatic transcription
and semantic information extraction are difficult due
to the distinct and evolving vocabulary, which is composed
mainly of proper names that change over time. In previous
works we studied the use of category-based language models to
both improve automatic transcription accuracy and ease
the extraction of semantic information. Here we analyze
the main causes of the semantic errors observed in previous
results and apply a Grammatical Inference technique known
as MGGI to improve the semantic accuracy of the language
model obtained. Using this language model, full handwritten
text recognition experiments have been carried out, with results
supporting the interest of the proposed approach.
This work has been partially supported through the European Union’s H2020 grant READ (Ref: 674943), the European project ERC-2010-AdG-20100407-269796, the MINECO/FEDER, UE projects TIN2015-70924-C2-1-R and TIN2015-70924-C2-2-R, and the Ramon y Cajal Fellowship RYC-2014-16831.
Romero Gómez, V.; Fornes, A.; Vidal Ruiz, E.; Sánchez Peiró, JA. (2016). Using the MGGI Methodology for Category-based Language Modeling in Handwritten Marriage Licenses Books. IEEE. https://doi.org/10.1109/ICFHR.2016.0069
A Tale of Two Transcriptions : Machine-Assisted Transcription of Historical Sources
This article is part of the "Norwegian Historical Population Register" project financed by the Norwegian Research Council (grant # 225950) and the Advanced Grant project "Five Centuries of Marriages" (2011-2016) funded by the European Research Council (# ERC 2010-AdG_20100407).
This article explains how two projects implement semi-automated transcription routines: one for census sheets in Norway and one for marriage protocols from Barcelona. The Spanish system was created to transcribe the marriage license books from 1451 to 1905 for the Barcelona area, one of the world's longest series of preserved vital records. Within the "Five Centuries of Marriages" (5CofM) project at the Autonomous University of Barcelona's Center for Demographic Studies, the Barcelona Historical Marriage Database has been built. More than 600,000 records were transcribed by 150 transcribers working online. The Norwegian material, the 1891 census, is cross-sectional and was recorded on one sheet per person. This format and the underlining of keywords for several variables made it more feasible to semi-automate data entry than when many persons are listed on the same page. While Optical Character Recognition (OCR) for printed text is scientifically mature, computer vision research is now focused on more difficult problems such as handwriting recognition. In the marriage project, document analysis methods have been proposed to automatically recognize the marriage licenses. Fully automatic recognition is still a challenge, but some promising results have been obtained. In Spain, Norway and elsewhere the source material is available as scanned pictures on the Internet, opening up the possibility for further international cooperation on automating the transcription of historical source materials. As in projects to digitize printed materials, the optimal solution for handwritten sources is likely to be a combination of manual transcription and machine-assisted recognition.
A bimodal crowdsourcing platform for demographic historical manuscripts
Paper presented at the First International Conference on Digital Access to Textual Cultural Heritage, held 19-20 May 2014 in Madrid.
In this paper we present a crowdsourcing web-based application for extracting information from demographic handwritten document images. The proposed application integrates two points of view: the semantic information for demographic research, and the ground-truthing for document analysis research. Concretely, the application has the contents view, where the information is recorded into forms, and the labeling view, with the word labels for evaluating document analysis techniques. The crowdsourcing architecture makes it possible to accelerate the information extraction (many users can work simultaneously), validate the information, and easily provide feedback to the users. We finally show how the proposed application can be extended to other kinds of demographic historical manuscripts.
MS-049: The Papers of Jacob M. Yingling, Class of 1952 (1930 - )
The Papers of Jacob M. Yingling, Class of 1952, consist of 23 boxes of processed material, two portraits, one photograph, and a four-volume, bound set of the Maryland Magazine. The collection encompasses 7.80 cubic feet (11.69 linear feet) and is composed of eleven series arranged somewhat chronologically. For example, series II on Gettysburg College (1949-1952) precedes series III, which documents Yingling’s service in the Maryland House of Delegates (1962-1972). Since some of his life activities may coincide with others—Jake served on the Board of Directors of the Maryland School of the Deaf during his appointment as Assistant Secretary to the Department of Economic and Community Development, for instance—it was impossible to construct a pure chronology.
Researchers should note that the collection is rich in local history, particularly Gettysburg College and Adams County, Pennsylvania history, as well as the history of Carroll County, Maryland. Series I, II, and VIII especially reflect this trend. In series I: Genealogy, for example, one not only traces Yingling’s family history, but gains valuable insight into Carroll County history as well.
Special Collections and College Archives Finding Aids are discovery tools used to describe and provide access to our holdings. Finding aids include historical and biographical information about each collection in addition to inventories of their content. More information about our collections can be found on our website http://www.gettysburg.edu/special_collections/collections/.
Occode: an end-to-end machine learning pipeline for transcription of historical population censuses
Machine learning approaches achieve high accuracy for text recognition and
are therefore increasingly used for the transcription of handwritten historical
sources. However, using machine learning in production requires a streamlined
end-to-end machine learning pipeline that scales to the dataset size, and a
model that achieves high accuracy with few manual transcriptions. In addition,
the correctness of the model results must be verified. This paper describes our
lessons learned developing, tuning, and using the Occode end-to-end machine
learning pipeline for transcribing 7.3 million rows with handwritten occupation
codes in the Norwegian 1950 population census. We achieve an accuracy of 97%
for the automatically transcribed codes, and we send 3% of the codes for manual
verification. We verify that the occupation code distribution found in our
result matches the distribution found in our training data, which should be
representative of the census as a whole. We believe our approach and lessons
learned are useful for other transcription projects that plan to use machine
learning in production. The source code is available at:
https://github.com/uit-hdl/rhd-code
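The abstract above describes two checks: a share of the codes is sent for manual verification, and the distribution of transcribed codes is compared with the training data. The sketch below illustrates one way such checks could look. It is not taken from the Occode source code: the confidence threshold, the function names, and the use of total variation distance are all assumptions made for illustration.

```python
from collections import Counter


def route_predictions(predictions, threshold=0.9):
    """Split (code, confidence) pairs into auto-accepted codes and codes for manual review.

    The threshold value is a hypothetical choice, not one reported in the abstract.
    """
    accepted, review = [], []
    for code, confidence in predictions:
        if confidence >= threshold:
            accepted.append(code)
        else:
            review.append(code)
    return accepted, review


def total_variation(sample_a, sample_b):
    """Total variation distance between the empirical label distributions
    of two non-empty lists (0.0 = identical, 1.0 = disjoint)."""
    count_a, count_b = Counter(sample_a), Counter(sample_b)
    labels = set(count_a) | set(count_b)
    return 0.5 * sum(
        abs(count_a[label] / len(sample_a) - count_b[label] / len(sample_b))
        for label in labels
    )


# Hypothetical model output: (occupation code, model confidence).
predictions = [("111", 0.99), ("222", 0.50), ("111", 0.95)]
accepted, review = route_predictions(predictions)
```

A small distance between the accepted codes and the training labels would suggest that the automatic results follow the expected distribution; how small is small enough is a project-specific decision.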
Handwritten Text Recognition for Historical Documents in the tranScriptorium Project
© Owner/Author 2014. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published by ACM in Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (pp. 111-117), http://dx.doi.org/10.1145/2595188.2595193
Transcription of historical handwritten documents is crucial
for making these documents more accessible
to the general public. Currently, huge amounts of historical
handwritten documents are being made available by on-line
portals worldwide. It is not realistic to obtain the transcription
of these documents manually, and therefore automatic
techniques have to be used. tranScriptorium is
a project that aims to research modern Handwritten
Text Recognition (HTR) technology for transcribing historical
handwritten documents. The HTR technology used in
tranScriptorium is based on models that are learnt automatically
from examples. This HTR technology has been
used on a Dutch collection from the 15th century selected for
the tranScriptorium project. This paper provides preliminary
HTR results on this Dutch collection that are very
encouraging, taking into account that minimal resources have
been deployed to develop the transcription system.
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 600707 - tranScriptorium and the Spanish MEC under the STraDa (TIN2012-37475-C02-01) research project.
Sánchez Peiró, JA.; Bosch Campos, V.; Romero Gómez, V.; Depuydt, K.; De Does, J. (2014). Handwritten Text Recognition for Historical Documents in the tranScriptorium Project. ACM. https://doi.org/10.1145/2595188.2595193
LeMoine Gaunce Lewis Papers, 1936-1995
LeMoine Gaunce Lewis Papers, 1936-1995, finding aid
Five centuries of marriages (5CofM). A project of historical demography in the Barcelona area
This document presents the Five Centuries of Marriages (5CofM) project, directed by Professor Anna Cabré and funded by the European Research Council under its Advanced Grant call.