6,832 research outputs found
Transforming scholarship in the archives through handwritten text recognition:Transkribus as a case study
Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the affect HTR may have on scholarship, and evidences this turning point of the advanced use of digitised heritage content. The paper aims to discuss these issues. - Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material. - Findings: Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified. - Research limitations/implications: The paper presents results from projects: further user studies could be undertaken involving interviews, surveys, etc. - Practical implications: Only HTR provided via Transkribus is covered: however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and it represents the current state-of-the-art in this field. - Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals. - Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector
SIMARA: a database for key-value information extraction from full pages
We propose a new database for information extraction from historical
handwritten documents. The corpus includes 5,393 finding aids from six
different series, dating from the 18th-20th centuries. Finding aids are
handwritten documents that contain metadata describing older archives. They are
stored in the National Archives of France and are used by archivists to
identify and find archival documents. Each document is annotated at page-level,
and contains seven fields to retrieve. The localization of each field is not
available in such a way that this dataset encourages research on
segmentation-free systems for information extraction. We propose a model based
on the Transformer architecture trained for end-to-end information extraction
and provide three sets for training, validation and testing, to ensure fair
comparison with future works. The database is freely accessible at
https://zenodo.org/record/7868059
A plea for an upgrade to the digital craft of the historian and digital methodology for discovering the past
This essay aims to bid analogue historians assume that digitisation is the
first step to creating historical heritage based on the new language of
Science: Computer Science. As we know, Humanities disciplines cannot easily be
encapsulated in a few understandable numbers and names. However, historians
must boost Artificial Intelligence (such as Transkribus) and Neural Networks to
let the Machine infer meaning from the digitised historical primary source and
become the most powerful tool to help historians understand what happened in
the Past. Historians (collaborating with data scientists, expert annotators,
librarians, archivists, and others, who are crucial to the successful
management of digital data collection) have to create the primary ontology,
starting from coding manuscripts into digital text, as the Biscari Archive
(Italy) study case
Just Because We Can Doesnât Mean We Should: On Knowing and Protecting Data Produced by the Jewish Consumptivesâ Relief Society
A recent project at the University of Denver Libraries used handwritten text recognition (HTR) software to create transcriptions of records from the Jewish Consumptivesâ Relief Society (JCRS), a tuberculosis sanatorium located in Denver, Colorado from 1904 to 1954. Among a great many other potential uses, these type- and hand-written records give insight into the human experience of disease and epidemic, its treatment, its effect on cultures, and of Jewish immigration to and early life in the American West. Our intent is to provide these transcripts as data so the text may be computationally analyzed, pursuant to a larger effort in developing capacity in services and infrastructure to support digital humanities as a library, and to contribute to the emerging HTR ecosystem in archival work.Just because we can, however, doesnât always mean we should: the realities of publishing large datasets online that contain medical and personal histories of potentially vulnerable people and communities introduce serious ethical considerations. This paper both underscores the value of HTR and frames ethical considerations related to protecting data derived from it. It suggests a terms-of-use intervention perhaps valuable to similar projects, one that balances meeting the research needs of digital scholars with the care and respect of persons, their communities and inheritors, who lives produced the very data now valuable to those researchers
Just Because We Can Doesnât Mean We Should: On Knowing and Protecting Data Produced by the Jewish Consumptivesâ Relief Society
A recent project at the University of Denver Libraries used handwritten text recognition (HTR) software to create transcriptions of records from the Jewish Consumptivesâ Relief Society (JCRS), a tuberculosis sanatorium located in Denver, Colorado from 1904 to 1954. Among a great many other potential uses, these type- and hand-written records give insight into the human experience of disease and epidemic, its treatment, its effect on cultures, and of Jewish immigration to and early life in the American West. Our intent is to provide these transcripts as data so the text may be computationally analyzed, pursuant to a larger effort in developing capacity in services and infrastructure to support digital humanities as a library, and to contribute to the emerging HTR ecosystem in archival work.
Just because we can, however, doesnât always mean we should: the realities of publishing large datasets online that contain medical and personal histories of potentially vulnerable people and communities introduce serious ethical considerations. This paper both underscores the value of HTR and frames ethical considerations related to protecting data derived from it. It suggests a terms-of-use intervention perhaps valuable to similar projects, one that balances meeting the research needs of digital scholars with the care and respect of persons, their communities and inheritors, who lives produced the very data now valuable to those researchers
Towards the Corpus of Latvian Romani Texts : Deciphering the Manuscripts in JÄnis Leimanis' Archive
Publisher Copyright: © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)Latvian Romani is a Northeastern Romani dialect with a limited number of publicly available sources. Two large archival collections of texts in Latvian Romani, compiled primarily in the 1930s in Latvia and Estonia, have been recently digitized as images and made available online for a wider public. In our study, we focus on one of these collections, the Latvian Romani folklore texts collected by JÄnis Leimanis in interwar Latvia. In this paper, we describe how initial manual transcriptions, most of which have been created with the help of a special crowdsourcing platform, were integrated in the handwritten text recognition (HTR) workflow in Transkribus. We present two HTR models trained on the basis of Leimanis' collection and discuss various issues related to the work on these texts.Peer reviewe
- âŠ