21 research outputs found

    Digitale Kuratierungstechnologien für Bibliotheken

    This article presents smart technologies for digital content curation that are currently under development in a collaborative project coordinated by DFKI and funded by the German Federal Ministry of Education and Research (BMBF). After a brief introduction to the concept of digital curation, the article explains the various technologies and their use cases across a number of sectors. It then goes into greater detail regarding the particular needs and requirements for digital curation technologies in the domain of libraries. Due to their large-scale digitisation activities, libraries are creating vast amounts of digital data. Technologies for digital curation can help leverage the full potential of these data for applications in scholarship, the digital humanities, and the creative industries, and can open up new opportunities for research and industry, as illustrated by a number of examples from the Berlin State Library.

    Volltexte – die Zukunft alter Drucke: Bericht zum Abschlussworkshop des OCR-D-Projekts


    Datasheets for Digital Cultural Heritage Datasets

    Sparked by issues of quality and a lack of proper documentation for datasets, the machine learning community has begun developing standardised processes for establishing datasheets for machine learning datasets, with the intent to provide context and information on provenance, purpose, composition, the collection process, recommended uses, and societal biases reflected in training datasets. This approach fits well with practices and procedures established in GLAM institutions, such as the creation of collection descriptions. However, digital cultural heritage datasets have specific characteristics: they are often the product of multiple layers of selection; they may have been created for purposes other than establishing a statistical sample for a specific research question; and they are heterogeneous and change over time. Punctuated by a series of recommendations for creating datasheets for digital cultural heritage, the paper addresses the scope and characteristics of digital cultural heritage datasets; possible metrics and measures; and lessons from concepts similar to datasheets and from established workflows in the cultural heritage sector. The paper includes a proposal for a datasheet template adapted for use in cultural heritage institutions, which incorporates information on the motivation and selection criteria, the digitisation pipeline, data provenance, the use of linked open data, and version information.
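    The template fields named in the abstract could be sketched as a simple machine-readable structure. The following is a minimal illustration: the categories follow those listed above, but the concrete field names and the validation helper are assumptions, not the paper's actual template.

```python
# Hypothetical sketch of a datasheet template for a digital cultural
# heritage dataset, using the categories named in the abstract. The
# field names are illustrative, not the paper's actual schema.
DATASHEET_TEMPLATE = {
    "motivation": "",             # why the dataset was created
    "selection_criteria": "",     # how items were chosen from the collection
    "digitisation_pipeline": "",  # scanning, OCR, post-processing steps
    "provenance": "",             # origin and custody of the source material
    "linked_open_data": [],       # URIs linking records to external vocabularies
    "version": "",                # version identifier of this dataset release
}

def missing_fields(datasheet: dict) -> list:
    """Return the template fields a given datasheet leaves empty."""
    return [key for key, default in DATASHEET_TEMPLATE.items()
            if datasheet.get(key, default) in ("", [], None)]
```

    A helper like `missing_fields` could flag incomplete documentation before a dataset release is published.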

    Making Europe's Historic Newspapers Searchable

    This poster provides a rare glimpse into the overall approach to refinement, i.e. the enrichment of scanned historical newspapers with text and layout recognition, in the Europeana Newspapers project. Within three years, the project processed more than 10 million pages of historical newspapers from 12 national and major libraries to produce the largest open-access, fully searchable text collection of digital historical newspapers in Europe. In the course of this work, a wide variety of legal, logistical, technical and other challenges were encountered. After introducing the background issues in newspaper digitisation in Europe, the paper discusses the technical aspects of refinement in greater detail. It explains what decisions were taken in the design of the large-scale processing workflow to address these challenges, what results were produced, and what best practices were identified.

    User Collaboration for Improving Access to Historical Texts

    The paper will describe how web-based collaboration tools can engage users in the building of historical printed text resources created by mass digitisation projects. The drivers for developing such tools will be presented, identifying the benefits that can be derived for both the user community and cultural heritage institutions. The perceived risks, such as new errors introduced by the users, and the limitations of engaging with users in this way will be set out, along with the lessons that can be learned from existing activities, such as the National Library of Australia's newspaper website, which supports collaborative correction of Optical Character Recognition (OCR) output. The paper will present the work of the IMPACT (Improving Access to Text) project, a large-scale integrating project funded by the European Commission as part of the Seventh Framework Programme (FP7). One of the aims of the project is to develop tools that help improve OCR results for historical printed texts, specifically those works published before the industrial production of books began in the middle of the 19th century. Technological improvements to image processing and OCR engine technology are vital to improving access to historic text, but engaging the user community also has an important role to play. Utilising the intended users can help achieve the levels of accuracy currently found in born-digital materials. Improving OCR results will allow for better resource discovery and enhance the performance of text mining and accessibility tools. The IMPACT project will specifically develop a tool that supports collaborative correction and validation of OCR results and a tool to allow user involvement in building historical dictionaries which can be used to validate word recognition. The technologies use the characteristics of human perception as a basis for error detection.
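    The dictionary-based validation step described above can be illustrated with a minimal sketch: tokens in OCR output that are not found in a historical lexicon are flagged as candidates for collaborative correction. The toy lexicon and the simple tokeniser below are stand-ins under assumed behaviour, not the IMPACT tools themselves.

```python
import re

# Toy stand-in for a historical dictionary; IMPACT builds such lexicons
# with user involvement and uses them to validate word recognition.
HISTORICAL_LEXICON = {"ye", "olde", "bokke", "booke", "printed", "in", "london"}

def flag_suspect_tokens(ocr_text: str) -> list:
    """Return tokens not found in the lexicon, i.e. likely OCR errors
    to be presented to users for collaborative correction."""
    tokens = re.findall(r"[a-zA-Z]+", ocr_text.lower())
    return [t for t in tokens if t not in HISTORICAL_LEXICON]
```

    In a real workflow the flagged tokens would be queued in a correction interface rather than simply returned as a list.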

    Computational Approaches to Digitised Historical Newspapers (Dagstuhl Seminar 22292)

    Historical newspapers are mirrors of past societies, keeping track of history both small and great and reflecting the political, moral, and economic environments in which they were produced. Highly valued as primary sources by historians and humanities scholars, newspaper archives have been massively digitised in libraries, resulting in large collections of machine-readable documents and, over the past half-decade, in numerous academic research initiatives on their automatic processing. The Dagstuhl Seminar 22292 "Computational Approaches to Digitised Historical Newspapers" gathered researchers and practitioners with backgrounds in natural language processing, computer vision, digital history and digital libraries who are involved in computational approaches to historical newspapers, with the objectives of sharing experiences, analysing successes and shortcomings, deepening our understanding of the interplay between computational aspects and digital scholarship, and discussing future challenges. This report documents the program and the outcomes of the seminar.

    D-OC5.3 Evaluation Report | OC5

    A core piece of work in IMPACT lies in the development of novel software techniques for a number of tasks connected to Optical Character Recognition (OCR), such as image enhancement, segmentation and post-processing, as well as in the improvement of existing OCR engines and experimental prototypes. The variety of platforms used by the developers of these tools makes it necessary to define an overall technical architecture for establishing interoperability between the various software components. The notion of interoperability pursued in IMPACT is defined by ISO/IEC 2382-01, Information Technology Vocabulary, Fundamental Terms, as: “The capability to communicate, execute programs, or transfer data among various functional units in a manner that requires the user to have little or no knowledge of the unique characteristics of those units.” Accordingly, the concept of interoperability in IMPACT is aimed at the implementation of a highly flexible and easy-to-use technical framework and the integration of the software components within that framework. Interoperability of all tools is achieved as a main principle through two abstraction layers: individual software tools are wrapped as web services, which are then wrapped again in a so-called “basic” workflow module. Web services are provided by describing the publicly available interfaces of the service using the Web Service Description Language (WSDL) and by using the Simple Object Access Protocol (SOAP) for data exchange. Basic workflows are provided by wrapping the web services and their documented inputs and outputs in a dataflow description used by the Taverna workflow management system, thereby exposing relevant features of the software tools in the form of ready-to-use components.
    The main advantages of this approach are the guaranteed compatibility of the tools offered as web services – avoiding incompatibility issues between different web service frameworks – and the reusability of a web service implementation. This report is divided into two major sections: the first details the current status of work on the Interoperability Framework architecture, the software components it has been built from, and the functionality it provides, while the second describes how each of the tools and applications developed by IMPACT partners is envisaged to be integrated with the Interoperability Framework, thereby enabling interaction and data exchange between them.
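    The first abstraction layer described above, exposing each tool through a SOAP interface described by WSDL, can be illustrated with a minimal sketch that builds a SOAP 1.1 request envelope for a hypothetical image-binarisation operation. The operation name, service namespace, and parameter below are assumptions for illustration, not part of the actual IMPACT interfaces.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; a real IMPACT tool would declare its
# own namespace in the WSDL that describes its public interface.
TOOL_NS = "http://example.org/impact/binarisation"

def build_soap_request(operation: str, params: dict) -> bytes:
    """Build a SOAP 1.1 request envelope for a single operation call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{TOOL_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{TOOL_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)

# Example call for a hypothetical binariseImage operation.
request = build_soap_request("binariseImage",
                             {"imageUrl": "http://example.org/page.tif"})
```

    In the IMPACT architecture such a request would be posted to the tool's SOAP endpoint, and the Taverna layer would hide this exchange behind a ready-to-use workflow component.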

    Publishing an OCR ground truth data set for reuse in an unclear copyright setting: Two case studies with legal and technical solutions to enable a collective OCR ground truth data set effort

    We present an OCR ground truth data set for historical prints and show improvement of recognition results over baselines with training on this data. We reflect on the reusability of the ground truth data set based on two experiments that look into the legal basis for reuse of digitized document images in the case of 19th-century English and German books. We propose a framework for publishing ground truth data even when digitized document images cannot be easily redistributed.