7 research outputs found

    Projekt OCR-BW: Automatische Texterkennung von Handschriften [OCR-BW Project: Automatic Text Recognition of Manuscripts]

    After the digitization of historical documents, the next logical step is to enrich the digitized material in the presentation system with a searchable full text, to further increase the accessibility of the texts and to enable new research questions. While many libraries already use various options for automatic text recognition of printed material, there is often much greater reluctance to do so for manuscripts, since handwritten sources pose new challenges for automatic text recognition. With the help of machine learning, however, great progress has been made in automatic handwritten text recognition in recent years. Libraries can use this progress not only to make their own holdings more accessible, but also to establish themselves as service partners for research. As part of the OCR-BW project (https://ocr-bw.bib.uni-mannheim.de/), the transcription platforms Transkribus (since 2019) and eScriptorium (since 2021) have been systematically tested on selected corpora to generate automatic full texts for manuscripts. The results achieved so far are very positive and show that automatic handwritten text recognition with a character error rate of less than 5% is possible and can be expected. Full texts that have already been published have significantly increased the visibility of these materials and the research interest in them. The project also aims to support researchers in preparing and carrying out research projects. Examples ranging from a medieval prayer book to large collections such as legal consilia to a 20th-century expedition diary will be used to show which results can be achieved with which resources.
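    The character error rate (CER) quoted above is conventionally the edit distance between a reference transcription and the recognized text, divided by the length of the reference. The abstract does not say how the project computes it, so the following is only a minimal sketch of that common definition; the function names and the sample strings are hypothetical and are not taken from Transkribus or eScriptorium.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical ground-truth line and recognizer output (one wrong digit).
    ground_truth = "Anno domini 1523"
    recognized = "Anno domini 1528"
    print(f"CER: {character_error_rate(ground_truth, recognized):.4f}")
```

    Under this definition, a CER below 5% means fewer than one character edit per twenty reference characters. Real evaluations usually also fix a text normalization (Unicode form, whitespace handling) before comparing, which this sketch leaves out.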

    Exploring data provenance in handwritten text recognition infrastructure: Sharing and reusing ground truth data, referencing models, and acknowledging contributions. Starting the conversation on how we could get it done

    This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, and ways to reference and acknowledge contributions to the creation and enrichment of data within these Machine Learning systems. We discuss how one can publish Ground Truth data in a repository and, subsequently, inform others. Furthermore, we suggest appropriate citation methods for HTR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of Machine Learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.

    Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done

    This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, as well as ways to reference and acknowledge contributions to the creation and enrichment of data within these systems. We discuss how one can place Ground Truth data in a repository and, subsequently, inform others through HTR-United. Furthermore, we suggest appropriate citation methods for ATR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of machine learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.

    Mobilizing and Enhancing Legacy Biodiversity Data: The case of Karl Wilhelm Verhoeff's correspondence

    A considerable amount of biological data is preserved as physical documents, the legacy of former explorers, collectors, researchers, and others. Mobilizing data from handwritten documents has been considered particularly challenging, with well-known cases such as the manual transcription of specimen labels and herbarium sheets by museum staff, or crowdsourced transcription of data card collections through online platforms. Here we present a pipeline of open-source software that can be used to automatically transcribe handwritten text, make it publicly available, annotate it with e.g., scientific names, extract names in Darwin Core Archive (DwC-A) for third-party reuse, and automatically recognize named entities in the machine-readable text. We based our use case on the correspondence of the German zoologist Karl Wilhelm Verhoeff, related to the Myriapoda collection held at the Musée Zoologique de Strasbourg. The documents were processed with Transkribus (Muehlberger et al. 2019), a mostly open-source virtual research environment (OS VRE), which allows text in images to be converted into machine-readable text amenable to semantic enrichment. We achieved a character error rate as low as 5%, a remarkable result for handwritten material, as an accuracy higher than 95% for printed material is acceptable (Deutsche Forschungsgemeinschaft 2016). We then used Myriatrix (Martínez-Muñoz 2019), an instance of the Scratchpads OS VRE (Smith et al. 2011), to create bibliographic references, publish the full text, and annotate the correspondence with scientific names of myriapods. During the process, we added new scientific name spellings and combinations to the taxonomic backbone of Myriatrix and exported the full taxon classification in DwC-A via the Global Biodiversity Information Facility (GBIF) for reuse by the Global Names Architecture and its open-source tools (Patterson et al. 2016, Mozzherin et al. 2017). As a next step we are planning to subject the corrected text from Transkribus to a specific text-preprocessing workflow combining natural language processing (NLP) and machine learning (ML) techniques (Lücking et al. 2021). This includes, inter alia, a multiple annotation approach for general and bioscientific term classification in order to detect the respective entities automatically. The workflow has been developed in the framework of the Specialized Information Service Biodiversity Research (Koch et al. 2017) to make biodiversity information available via a customized and (bio-)ontology-based semantic search engine (Pachzelt et al. 2021). We recommend our comprehensive approach to natural history institutions seeking to efficiently digitize and mobilize the rich biological data present in their archival documents.

    Outcomes of Patients Presenting with Mild Acute Respiratory Distress Syndrome: Insights from the LUNG SAFE Study

    BACKGROUND: Patients with initial mild acute respiratory distress syndrome are often underrecognized and mistakenly considered to have low disease severity and favorable outcomes. They represent a relatively poorly characterized population that was only classified as having acute respiratory distress syndrome in the most recent definition. Our primary objective was to describe the natural course and the factors associated with worsening and mortality in this population. METHODS: This study analyzed patients from the international prospective Large Observational Study to Understand the Global Impact of Severe Acute Respiratory Failure (LUNG SAFE) who had initial mild acute respiratory distress syndrome in the first day of inclusion. This study defined three groups based on the evolution of severity in the first week: "worsening" if moderate or severe acute respiratory distress syndrome criteria were met, "persisting" if mild acute respiratory distress syndrome criteria were the most severe category, and "improving" if patients did not fulfill acute respiratory distress syndrome criteria any more from day 2. RESULTS: Among 580 patients with initial mild acute respiratory distress syndrome, 18% (103 of 580) continuously improved, 36% (210 of 580) had persisting mild acute respiratory distress syndrome, and 46% (267 of 580) worsened in the first week after acute respiratory distress syndrome onset. Global in-hospital mortality was 30% (172 of 576; specifically 10% [10 of 101], 30% [63 of 210], and 37% [99 of 265] for patients with improving, persisting, and worsening acute respiratory distress syndrome, respectively), and the median (interquartile range) duration of mechanical ventilation was 7 (4, 14) days (specifically 3 [2, 5], 7 [4, 14], and 11 [6, 18] days for patients with improving, persisting, and worsening acute respiratory distress syndrome, respectively). Admissions for trauma or pneumonia, higher nonpulmonary sequential organ failure assessment score, lower partial pressure of alveolar oxygen/fraction of inspired oxygen, and higher peak inspiratory pressure were independently associated with worsening. CONCLUSIONS: Most patients with initial mild acute respiratory distress syndrome continue to fulfill acute respiratory distress syndrome criteria in the first week, and nearly half worsen in severity. Their mortality is high, particularly in patients with worsening acute respiratory distress syndrome, emphasizing the need for close attention to this patient population.