215 research outputs found

    Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents

    This paper presents an overview of the second edition of HIPE (Identifying Historical People, Places and other Entities), a shared task on named entity recognition and linking in multilingual historical documents. Following the success of the first CLEF-HIPE-2020 evaluation lab, HIPE-2022 confronts systems with the challenges of dealing with more languages, learning domain-specific entities, and adapting to diverse annotation tag sets. This shared task is part of the ongoing efforts of the natural language processing and digital humanities communities to adapt and develop appropriate technologies to efficiently retrieve and explore information from historical texts. On such material, however, named entity processing techniques face the challenges of domain heterogeneity, input noisiness, dynamics of language, and lack of resources. In this context, the main objective of HIPE-2022, run as an evaluation lab of the CLEF 2022 conference, is to gain new insights into the transferability of named entity processing approaches across languages, time periods, document types, and annotation tag sets. Tasks, corpora, and results of participating teams are presented. Compared to the condensed overview [1], this paper contains more refined statistics on the datasets, a breakdown of the results per type of entity, and a discussion of the ‘challenges’ proposed in the shared task.

    Measuring metadata quality

    Named Entity Recognition for early-modern textual sources: a review of capabilities and challenges with strategies for the future

    Purpose: By mapping out the capabilities, challenges and limitations of named-entity recognition (NER), this article aims to synthesise the state of the art of NER in the context of the early modern research field and to inform discussions about the kind of resources, methods and directions that may be pursued to enrich the application of the technique going forward. // Design/methodology/approach: Through an extensive literature review, this article maps out the current capabilities, challenges and limitations of NER and establishes the state of the art of the technique in the context of the early modern, digitally augmented research field. It also presents a new case study of NER research undertaken by Enlightenment Architectures: Sir Hans Sloane's Catalogues of his Collections (2016–2021), a Leverhulme-funded research project and collaboration between the British Museum and University College London, with contributing expertise from the British Library and the Natural History Museum. // Findings: Currently, it is not possible to benchmark the capabilities of NER as applied to documents of the early modern period. The authors also draw attention to the situated nature of authority files, and current conceptualisations of NER, leading them to the conclusion that more robust reporting and critical analysis of NER approaches and findings is required. // Research limitations/implications: This article examines NER as applied to early modern textual sources, which are mostly studied by Humanists. As addressed in this article, detailed reporting of NER processes and outcomes is not necessarily valued by the disciplines of the Humanities, with the result that it can be difficult to locate relevant data and metrics in project outputs. The authors have tried to mitigate this by contacting projects discussed in this paper directly, to further verify the details they report here.
// Practical implications: The authors suggest that a forum is needed where tools are evaluated according to community standards. Within the wider NER community, the MUC and CoNLL corpora are used for such experimental set-ups and are accompanied by a conference series, and may be seen as a useful model for this. The ultimate nature of such a forum must be discussed with the whole research community of the early modern domain. // Social implications: NER is an algorithmic intervention that transforms data according to certain rules, patterns or training data and ultimately affects how the authors interpret the results. The creation, use and promotion of algorithmic technologies like NER is not a neutral process, and neither is their output. This paper calls for a more critical understanding of the role and impact of NER on early modern documents and research, and for a focus on some of the data- and human-centric aspects of NER routines that are currently overlooked. // Originality/value: This article presents a state-of-the-art snapshot of NER, its applications and potential, in the context of early modern research. It also seeks to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of NER going forward. It draws attention to the situated nature of authority files, and current conceptualisations of NER, and concludes that more robust reporting of NER approaches and findings is urgently required. The Appendix sets out a comprehensive summary of digital tools and resources surveyed in this article.

    Patents information for humanities research: Could there be something?

    Latour and co-authors proposed, in the Science and Technology Translation theory, to target the many SHS (Social and Human Science) questions addressed by social studies of sciences by considering, in complement to traditional academic matters, the complete social environment (political, economic or societal). Patents are obviously a potential primary information source for doing so. We propose to extend this approach, given that recent changes have improved our capacity to do so. We propose three preliminary steps: (a) patent documents as providing a structured information source, (b) a patent database as a technical encyclopedia and (c) the recent expansion of the variety of uses and users in patent domains. We underline, furthermore, that only a minority of academic research actually uses patent information, especially in SHS compared to other disciplines. We report an experiment to estimate the amount of data left unconsidered when the huge database of the European Patent Office is not queried. By comparing the terminology of two branches of the Unesco thesaurus, namely the micro-thesauri “Social and Human Sciences” and “Information and Communication Science”, we evaluate the database's response to the whole vocabulary. An in-depth analysis of one selected concept completes the study. Results show that patent information may provide a quantity of documents for a wide range of academic research questions, from strategic to state of the art, and position advances aside from the Social Studies of Science. The free open-source tool is also a way to practice the skills expected in digital humanities on real-world corpora.

    User Interfaces to the Web of Data based on Natural Language Generation

    We explore how Virtual Research Environments based on Semantic Web technologies support research interactions with RDF data in various stages of corpus-based analysis, analyze the Web of Data in terms of human readability, derive labels from variables in SPARQL queries, apply Natural Language Generation to improve user interfaces to the Web of Data by verbalizing SPARQL queries and RDF graphs, and present a method to automatically induce RDF graph verbalization templates via distant supervision.

    Share.TEC Final Project Report

    This report provides an overview of Share.TEC, a three-year project co-funded by the EC that supports access to, exchange and re-use of digital resources and practitioner experiences within Teacher Education at European level. The document comprises a number of sections that can either be read consecutively, to gain the full picture of the project and its outcomes, or in combinations so as to grasp particular aspects, how these were approached and what results were achieved. Section 2 describes the project's overall objectives in terms of both its technological ambitions and its wider mission as part of the overall educational landscape. Section 3 gives brief profiles of the partners who made up the Share.TEC consortium. In Section 4 the results and achievements of the project are reported. This includes a description of the portal and its features; the system architecture, tools and services; the models underpinning the Share.TEC system; and the approach taken to its multilingual dimension. Section 5 addresses the question of Share.TEC's target users and their needs. It describes the strategies and means employed for incorporating the user perspective, and for ensuring that the project direction was in line with users' concerns so that the resulting portal responds suitably to the actual requirements of the people it's designed for. Section 6 examines the critical aspect of underlying content. In keeping with the Share.TEC mission, the focus is largely on aggregated metadata records that describe digital resources for TE and which are expressed in terms defined by the project for TE purposes. Section 7 reports the activities undertaken in the project and thus narrates the processes that unfolded through the project lifetime as the consortium pursued its objectives and generated its outcomes. Section 8 describes the effort to establish the Share.TEC portal within its natural ecosystem. It looks at the global strategy for maximising impact both at regional/national level and internationally, and analyses the conditions and prospects for continuity and growth. Readers interested in the technical/technological dimension of Share.TEC (the system, portal, models, metadata, etc.) are likely to find Sections 4, 5 and 6 to be the ones closest to their concerns. Conversely, those whose interests lie elsewhere could simply consult Section 4.1 to get an idea of the portal from the user's viewpoint and go to Sections 2, 3, 7 and 8 for a vision of the project and how Share.TEC is positioned in the panorama of digital resources and Teacher Education.

    Text Mining the History of Medicine

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.

    Digital Collection Evaluation: A Rubric for Holistic Evaluation

    Digital collections have been a rising trend in library sciences for over a decade. However, analysis of these collections has still largely been limited to the digital specialists and the digital humanists. This paper summarizes the existing evaluation literature to propose a tool for librarians to use for their own individual collections' evaluations. It also examines the difficulties of evaluation and emphasizes the need for further research into librarian-conducted analyses, as their evaluations differ from the evaluations of an expert. It also explains the development of digitization, digital collections and digital evaluation until this point. Master of Science in Library Science