37 research outputs found
Semantic Web Modelling: Challenges and Opportunities in Small and Large Museum Collections
Semantic Web technologies foster connection and contextualization. They can benefit museum collections by disclosing information in a scalable and interoperable way, aggregating previously heterogeneous and siloed data. Based on formal languages such as RDF, RDFS or OWL, they can describe the meaning of, and the connections among, disparate data to define concepts, entities, and relationships and to facilitate multifaceted retrieval, reasoning, data integration and knowledge reuse. Benefits of Semantic Web technologies to the broader DH domain include, but are not limited to, harmonised views of distributed sources and semantic-based content aggregation, enrichment, search, browsing and recommendation. Over the last decades we have witnessed a proliferation of Semantic Web projects in the broader cultural heritage domain at national and European level. Infrastructure programmes such as EUROPEANA, DARIAH, PARTHENOS and ARIADNEplus, to name but a few, have delivered rich interoperable structures and innovations that advanced the tasks of data integration, sharing, analysis, retrieval, and visualisation. As conceptual models mature and expand, and as CIDOC-CRM is becoming an undeniable standard in the domain, we reflect on the challenges and opportunities encountered when Semantic Web technologies are applied both to small regional museum collections and to large, globally renowned ones. The role and application of semantic modelling are examined through two distinct case studies: a) the regional Archaeological Museum of Tripolis (Greece), which has a limited digital presence but a unique collection of regional antiquities and which employed semantic methods to enrich and share its digitised collection holdings, and b) the Sloane Lab (UK), which aims to aggregate a multitude of catalogue records (both historic and current, from multiple disciplines) dispersed across the British Museum, Natural History Museum and British Library.
The presentation offers insight into the opportunities and challenges that both small heritage organisations and large global institutions face when applying high-level semantics to break down the silo barriers between museum items and enable interoperable, multi-layered representations.
Sloane Lab: Domain Vocabularies for Semantic Interoperability of Museum Collections
How do domain vocabularies and terminological resources contribute to semantic harmonisation and enrichment of siloed collections in digital infrastructures? What is the role of industry-standard and bespoke museum-owned authority files and terminologies in the process of "creating a unified virtual 'national collection' by dissolving barriers between different collections and opening UK heritage to the world" (Towards a National Collection, 2022)? The Sloane Lab aims to aggregate a multitude of catalogue records (both historic and current, from multiple disciplines) dispersed across the British Museum, Natural History Museum and British Library. The task of integrating these disparate records and facilitating interoperable access poses significant challenges. The competency of domain-oriented, standardised, high-level ontologies such as CIDOC-CRM to act as a common application layer of data semantics, and their capacity to enable innovative ways of cross-searching, contextual exploration and interrogation, are well documented in the literature. However, their ability to provide a common conceptual layer of high-level semantics for the purposes of unification, alignment and harmonisation comes at the expense of the specialisation of terminological and typological definitions. This can hinder the discovery and interrogation of resources at a finer level of granularity and limit the opportunities for entity enrichment and linking to external definitions from the Linked Data Cloud. We discuss a method for specialising upper-level ontologies by adding a level of vocabulary semantics, drawn from thesauri, glossaries, and authority files, to supplement the CIDOC-CRM with specialised terms. In this process, we highlight the role and contribution of museum-based vocabulary resources towards the realisation of unified collections and the opportunities they offer for semantic enrichment, linking and interoperability.
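The idea of supplementing a high-level ontology with a vocabulary layer can be illustrated with a minimal sketch. It stores triples as plain Python tuples rather than in a triple store. The names crm:E22_Human-Made_Object and crm:P2_has_type follow the CIDOC-CRM and skos:broader follows SKOS; every "ex:" identifier, the sample data and the helper function names are hypothetical and are not taken from the Sloane Lab data model.

```python
# Triples are (subject, predicate, object) tuples.
# All "ex:" identifiers and the sample data are hypothetical.
TRIPLES = {
    # Ontology layer: both items are typed with the same high-level class.
    ("ex:object/1", "rdf:type", "crm:E22_Human-Made_Object"),
    ("ex:object/2", "rdf:type", "crm:E22_Human-Made_Object"),
    # Vocabulary layer: each item is linked to a specialised thesaurus concept.
    ("ex:object/1", "crm:P2_has_type", "ex:concept/oil-lamp"),
    ("ex:object/2", "crm:P2_has_type", "ex:concept/amphora"),
    # Thesaurus hierarchy (narrower concept, skos:broader, broader concept).
    ("ex:concept/oil-lamp", "skos:broader", "ex:concept/lighting-equipment"),
    ("ex:concept/amphora", "skos:broader", "ex:concept/vessel"),
}


def narrower_closure(concept, triples):
    """Return the concept together with all of its narrower concepts."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "skos:broader" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def objects_of_type(concept, triples):
    """Find objects typed with the concept or any narrower concept."""
    concepts = narrower_closure(concept, triples)
    return sorted(
        s for s, p, o in triples if p == "crm:P2_has_type" and o in concepts
    )


print(objects_of_type("ex:concept/vessel", TRIPLES))
```

Querying by the ontology class alone would return both objects without distinction; the vocabulary layer is what allows retrieval at the finer level of granularity the abstract describes.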
A knowledge-based approach to information extraction for semantic interoperability in the archaeology domain
The paper presents a method for automatic semantic indexing of archaeological grey-literature reports using empirical (rule-based) Information Extraction techniques in combination with domain-specific knowledge organization systems. Performance is evaluated via the Gold Standard method. The semantic annotation system (OPTIMA) performs the tasks of Named Entity Recognition, Relation Extraction, Negation Detection and Word Sense Disambiguation using hand-crafted rules and terminological resources for associating contextual abstractions with classes of the standard ontology (ISO 21127:2006) CIDOC Conceptual Reference Model (CRM) for cultural heritage and its archaeological extension, CRM-EH, together with concepts from English Heritage thesauri and glossaries. Relation Extraction performance benefits from a syntactic-based definition of relation extraction patterns derived from domain-oriented corpus analysis. The evaluation also shows clear benefit in the use of assistive NLP modules relating to word-sense disambiguation, negation detection and noun phrase validation, together with controlled thesaurus expansion. The semantic indexing results demonstrate the capacity of rule-based Information Extraction techniques to deliver interoperable semantic abstractions (semantic annotations) with respect to the CIDOC CRM and archaeological thesauri. Major contributions include recognition of relevant entities using shallow parsing NLP techniques driven by a complementary use of ontological and terminological domain resources, and empirical derivation of context-driven relation extraction rules for the recognition of semantic relationships from phrases of unstructured text. The semantic annotations have proven capable of supporting semantic query, document study and cross-searching via the ontology framework.
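A Gold Standard evaluation of the kind mentioned above typically compares system annotations with manually produced ones and reports precision, recall and F1. The sketch below assumes strict matching, where an annotation counts as correct only if its start offset, end offset and label all agree; the paper does not state its matching criteria here, and the sample annotations and label names are invented for illustration.

```python
def evaluate(system, gold):
    """Return (precision, recall, f1) under strict (exact) matching.

    Each annotation is a (start, end, label) tuple.
    """
    system, gold = set(system), set(gold)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: two agree, one is spurious, one is missed.
GOLD = {(0, 7, "PhysicalObject"), (12, 20, "TimeAppellation"), (30, 35, "Place")}
SYSTEM = {(0, 7, "PhysicalObject"), (12, 20, "TimeAppellation"), (40, 45, "PhysicalObject")}

precision, recall, f1 = evaluate(SYSTEM, GOLD)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

A lenient variant would also credit partially overlapping spans; that choice changes the reported scores and should be stated alongside them.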
Semantic technologies for historical collections: A case study from the Sloane Lab Knowledge Base
The founding collection of the British Museum is a rich area in which to explore how we can reconnect dispersed heritage connections using state-of-the-art technologies. In the Sloane Lab, we aim to represent this vast cultural heritage collection made up of disparate objects. At the core of the Sloane Lab resides the Knowledge Base (KB), which provides a homogeneous data environment using formal semantics to allow data integration, semantic enrichment, and knowledge discovery across a disparate environment of resources. The most fundamental challenge of the KB is the provision of a suitable semantic metadata schema for unifying the catalogues and enabling the Knowledge Graph to facilitate resourceful query, visualisation and fact-finding. The Sloane Lab approach to the modelling of the collection is record-centric, meaning that the record is the central entity that we represent. Most museum datasets are object-centric, but in our case this can lead to multiple descriptions of the same object that conflict with each other. The semantic representation of the Sloane Lab knowledge base is based on Semantic Web standards, in particular RDF, RDFS, and OWL, and our data model is built on top of the CIDOC CRM reference ontology. The presentation will provide a rich insight into the design and development of the Sloane Lab knowledge base, the modelling choices and priorities in relation to semantics and vocabularies, and the range of challenges addressed in the process of aggregation in terms of data disparity, integration facility, conflicting information and inconsistency, uncertainty and data absence.
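The record-centric choice can be sketched as follows, under the assumption that each catalogue record is kept as its own entity and merely refers to the object it describes. Conflicting statements then coexist and can be surfaced instead of being overwritten in a single merged object description. The class, field names and sample values are hypothetical and do not reproduce the Sloane Lab schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueRecord:
    """One catalogue entry; the record, not the object, is the central entity."""
    record_id: str
    source: str        # which catalogue the record comes from
    object_ref: str    # identifier of the object the record describes
    fields: tuple      # (field name, value) pairs asserted by this record


# Hypothetical sample data: two records describe the same object differently.
RECORDS = [
    CatalogueRecord("rec-1", "historic catalogue", "obj-A",
                    (("material", "bronze"), ("place", "Egypt"))),
    CatalogueRecord("rec-2", "current database", "obj-A",
                    (("material", "copper alloy"), ("place", "Egypt"))),
    CatalogueRecord("rec-3", "current database", "obj-B",
                    (("material", "paper"),)),
]


def conflicts(records):
    """Map (object_ref, field) to the differing values asserted across records."""
    values = defaultdict(set)
    for record in records:
        for name, value in record.fields:
            values[(record.object_ref, name)].add(value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


print(conflicts(RECORDS))
```

Because every value stays attached to the record that asserted it, provenance is preserved and the disagreement over the material of obj-A remains visible to the researcher.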
Negation detection and word sense disambiguation in digital archaeology reports for the purposes of semantic annotation
The paper presents the role and contribution of Natural Language Processing techniques, in particular Negation Detection and Word Sense Disambiguation, in the process of Semantic Annotation of Archaeological Grey Literature. Archaeological reports contain a great deal of information that conveys facts and findings in different ways. This kind of information is highly relevant to the research and analysis of archaeological evidence but at the same time can be a hindrance for the accurate indexing of documents with respect to positive assertion.
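To illustrate why negation matters for indexing, the sketch below flags a term as negated when a negation cue occurs within a few tokens before it. This is a generic cue-and-window heuristic chosen for brevity, not the rule set used in the paper; the cue list, the window size and the function name are assumptions.

```python
import re

# Assumed cue list; a real system would use a curated, domain-tuned list.
NEGATION_CUES = {"no", "not", "without", "absence", "lack"}


def is_negated(sentence, term, window=4):
    """Return True if any occurrence of the term has a cue in the preceding window."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    term_tokens = term.lower().split()
    n = len(term_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term_tokens:
            preceding = tokens[max(0, i - window):i]
            if any(token in NEGATION_CUES for token in preceding):
                return True
    return False


print(is_negated("No evidence of Roman pottery was found in the trench.", "pottery"))
print(is_negated("A ditch containing Roman pottery was recorded.", "pottery"))
```

An indexer that skips negated mentions would avoid listing the first sentence as a positive report of pottery, which is the kind of indexing error the abstract refers to.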
A pilot investigation of Information Extraction in the semantic annotation of archaeological reports
The paper discusses a prototype investigation of semantic annotation, a form of metadata assigning conceptual entities to textual instances, in the case of archaeological grey literature. The use of Information Extraction (IE), a Natural Language Processing (NLP) technique, is central to the annotation process, while the use of Knowledge Organization Systems (KOS) is explored for the association of semantic annotation with both ontological and terminological references. The annotation process follows a rule-based information extraction approach using the GATE NLP toolkit, together with the CIDOC CRM ontology, its CRM-EH archaeological extension and English Heritage thesauri and glossaries. Results are reported from an initial evaluation and suggest that these information extraction techniques can be applied to archaeological grey literature reports. Further work is discussed, drawing on the evaluation and on consideration of the characteristics of the archaeology domain. Copyright © 2012 Inderscience Enterprises Ltd.
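The vocabulary-driven part of such an annotation process can be approximated by a lookup of thesaurus terms in text. The sketch below is written in plain Python and does not use GATE or its rule language; the mini-thesaurus, the labels, the concept identifiers and the naive plural handling are all hypothetical simplifications.

```python
import re

# Hypothetical mini-thesaurus: surface term -> (label, concept identifier).
THESAURUS = {
    "ditch": ("Context", "ex:ditch"),
    "post hole": ("Context", "ex:post-hole"),
    "pottery": ("Find", "ex:pottery"),
    "roman": ("Period", "ex:roman"),
}


def annotate(text):
    """Return sorted (start, end, label, concept) annotations for thesaurus terms."""
    annotations = []
    # Longest terms first, so multi-word terms win over terms they contain.
    for term in sorted(THESAURUS, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"s?\b"  # naive plural handling
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = match.start(), match.end()
            overlaps = any(s < end and start < e for s, e, *_ in annotations)
            if not overlaps:
                label, concept = THESAURUS[term]
                annotations.append((start, end, label, concept))
    return sorted(annotations)


TEXT = "Roman pottery was recovered from the ditch and two post holes."
for start, end, label, concept in annotate(TEXT):
    print(TEXT[start:end], label, concept)
```

Each annotation carries both a label and a concept identifier, mirroring the association of annotations with ontological and terminological references described above.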
Semantic Indexing via Knowledge Organization Systems: Applying the CIDOC-CRM to Archaeological Grey Literature
The volume of archaeological reports being produced since the introduction of PG16 has significantly increased, as a result of the increased volume of archaeological investigations conducted by academic and commercial archaeology. It is highly desirable to be able to search effectively within and across such reports in order to find information that promotes quality research. A potential dissemination of information via semantic technologies offers the opportunity to improve archaeological practice, not only by enabling access to information but also by changing how information is structured and the way research is conducted.
This thesis presents a method for automatic semantic indexing of archaeological grey-literature reports using rule-based Information Extraction techniques in combination with domain-specific ontological and terminological resources. This semantic annotation of contextual abstractions from archaeological grey-literature is driven by Natural Language Processing (NLP) techniques which are used to identify "rich" meaningful pieces of text, thus overcoming barriers in document indexing and retrieval imposed by the use of natural language. The semantic annotation system (OPTIMA) performs the NLP tasks of Named Entity Recognition, Relation Extraction, Negation Detection and Word Sense Disambiguation using hand-crafted rules and terminological resources for associating contextual abstractions with classes of the ISO Standard (ISO 21127:2006) CIDOC Conceptual Reference Model (CRM) for cultural heritage and its archaeological extension, CRM-EH, together with concepts from English Heritage thesauri and glossaries.
The results demonstrate that the techniques can deliver semantic annotations of archaeological grey literature documents with respect to the domain conceptual models. Such semantic annotations have proven capable of supporting semantic query, document study and cross-searching via web-based applications. The research outcomes have provided semantic annotations for the Semantic Technologies for Archaeological Resources (STAR) project, which explored the potential of semantic technologies in the integration of archaeological digital resources. The thesis represents the first discussion on the employment of CIDOC CRM and CRM-EH in semantic annotation of grey-literature documents using rule-based Information Extraction techniques driven by a supplementary exploitation of domain-specific ontological and terminological resources. It is anticipated that the methods can be generalised in the future to the broader field of Digital Humanities.
A comparison of machine learning and rule-based approaches for text mining in the archaeology domain, across three languages
Archaeology is a destructive process in which the evidence primarily becomes written documentation. As such, the archaeological domain creates huge amounts of text, from books and scholarly articles to unpublished "grey literature" fieldwork reports. We are experiencing a significant increase in archaeological investigations, and easy access to the information hidden in these texts is a substantial problem for the archaeological field, one that was identified as early as 2005 (Falkingham 2005). In the Netherlands alone, it is estimated that 4,000 new grey literature reports are being created each year, as well as numerous books, papers and monographs. Furthermore, as research, such as desk-based assessments, is increasingly being carried out online and remotely, these documents need to be made more easily Findable, Accessible, Interoperable and Reusable. Making these documents searchable and analysing them is a time-consuming task when done by hand, and the results will often lack consistency. Text mining provides methods for disclosing information in large text collections, allowing researchers to locate (parts of) texts relevant to their research questions, as well as to identify patterns of past behaviour in these reports. Furthermore, it enables resources to be searched in meaningful ways using semantically interoperable vocabularies and domain ontologies to answer questions on what, where and when.
The EXALT project at Leiden University is working on creating a semantic search engine for archaeology in and around the Netherlands, indexing all available, open-access texts, which includes Dutch, English and German language documents.
In this context, we are systematically researching and comparing different methods for extracting information from archaeological texts in these three languages. The specific task we are looking at is Named Entity Recognition (NER), which is to find and recognise certain concepts in text, e.g. artefacts, time periods, places, etc. In the archaeology domain, the task of entity recognition is particularly specialised and determined by domain semantics that pose challenges to conventional NER. We develop text mining applications tailored to the archaeological domain, and in this process we will compare a rule-based, knowledge-driven approach (using GATE), a "traditional" machine learning method (Conditional Random Fields), and a deep learning method (BERT).
Previous studies have investigated different applications of text mining in archaeological literature (Richards et al. 2015), but this often occurred at a relatively small scale, in isolated case studies, or as proof-of-concept type work. With this study, we are comparing multiple methods in multiple languages, and we aim to contribute to guidelines and good practice for text mining in archaeology. Specifically, we will compare not only the overall accuracy of each approach, but also the time, digital literacy, hardware, and labelled data needed to run each method. We also pay attention to the energy usage and CO2 output of these machine learning models and the impact on climate change, something that is particularly poignant during the ongoing energy crisis. Besides these more practical aspects, we also aim to describe some general properties of the way we write about archaeology, and how writing in a particular language can make knowledge transfer (and by extension, NER) easier or more difficult.
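Comparing a rule-based system with sequence labellers such as Conditional Random Fields or BERT requires putting their outputs into a common form. One common option, assumed here and not stated in the abstract, is that the learned models emit one BIO tag per token, which can be converted into entity spans and compared with the spans produced by rules. The tag names and the sample sentence are hypothetical.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, label) token spans.

    The end index is exclusive. A stray I- tag opens a new span.
    """
    spans = []
    start = label = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start = label = None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


TOKENS = ["Roman", "oil", "lamp", "found", "near", "Leiden"]
TAGS = ["B-PER", "B-ART", "I-ART", "O", "O", "B-LOC"]

for start, end, label in bio_to_spans(TAGS):
    print(" ".join(TOKENS[start:end]), label)
```

Once every method yields spans in this form, the same entity-level scoring can be applied to all of them.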
The Collection Unit as the Integral Component in Development of a Data Atlas of Complex Cultural Heritage Landscapes
In what form were museum and cultural heritage collections first acquired? How have museum and cultural heritage collections changed over time? And in what way has their cumulative evolution led to their growth over time? These questions sit at the heart of collections history and collections as data research and are likewise core concerns of much social and material culture research. More recently, research is moving beyond the accumulation of collections, seeking a deeper examination of the dynamic movement of objects between individual and institutional agents and actors. This can cast new light on the fluidity of collections, and the intrinsic role that circulation played in their formation and mobilisation (Driver et al., 2021, 3), both in analogue and digital contexts.
Researchers attempting to trace the movement of objects and collections between and within cultural heritage institutions are faced with complex data environments. Challenges relate to the intricacies of scope, size, availability, coverage, legacy attributes, and manifestation of collections, which often persist both within and between institutions. Such attributes cannot be adequately addressed by conceptual data models and metadata mappings that merely address a lower level of data interoperability, supporting aggregation and unification objectives (Dragoni et al. 2017).
In response, we introduce the "Data Atlas". This is both a comprehensive metaphor that provides a collective perspective on cultural heritage collections, and an instrument that offers a means to map the intricacies of the complex landscapes of historical resources that have resulted from the long-term curation, circulation, and accumulation of collections dispersed across, and within, various institutions and systems of varying accessibility status (Vlachidis et al. forthcoming). Central to this metaphor is the Collection Unit, which originates from the Natural History Museum's "Join the dots" collections assessment exercise, where collections are arranged into discrete units that reflect how curators organise, index and work with their collections (Miller 2020). We define a "Collection Unit" as a physical or digital-born entity treated as a coherent item of a curatorial or collection activity, which is not abstract but possesses attributes unique to its form such as size, physical location, level of digitization, transcription type, availability, and access. Our definition is elastic so as to allow use of the Collection Unit as a building block to create a visual representation of the historical and contemporary collections of the physician and collector Sir Hans Sloane (1660-1753).
The Sloane collection is now dispersed across different information systems and infrastructures. Assembled from the 1680s onwards, and in part financed by profits from the transatlantic slave trade and enslavement, Sloane's vast collection of natural history, pharmaceutical specimens, books, manuscripts, prints, drawings, coins, and antiquities from across the world was made as Britain became a global trading and imperial power. Building on a static 2D tabular representation of the Data Atlas, we propose the development of an interactive version to facilitate a dynamic representation of a dispersed collection, allowing for a comprehensive, interlinked and layered view of Collection Units. The Atlas is part of the UKRI-funded Towards a National Collection Discovery programme. The "Sloane Lab: looking back to build future shared collections" works at the intersection of the history of digital humanities and the history of collections (Nyhan et al. 2023).
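The Collection Unit definition lends itself to a simple data structure. The sketch below is one possible reading, not the authors' implementation: the attribute names follow the list in the definition (size, physical location, level of digitization, transcription type, availability, access), while the class name, the extra name and institution fields, the value vocabularies and all sample data are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionUnit:
    """A coherent unit of curatorial activity with form-specific attributes."""
    name: str
    institution: str
    size: int                  # number of items in the unit
    physical_location: str
    digitisation_level: str    # assumed values: "none", "partial", "full"
    transcription_type: str
    availability: str
    access: str


# Invented sample units; the figures do not describe any real collection.
UNITS = [
    CollectionUnit("Herbarium volumes", "Institution A", 300, "Store 1",
                   "full", "manual", "online", "open"),
    CollectionUnit("Manuscript catalogues", "Institution B", 50, "Reading room",
                   "partial", "none", "on request", "restricted"),
    CollectionUnit("Insect drawers", "Institution A", 120, "Store 2",
                   "partial", "none", "on site", "restricted"),
]


def atlas_table(units):
    """Tabulate item counts per (institution, digitisation level)."""
    table = Counter()
    for unit in units:
        table[(unit.institution, unit.digitisation_level)] += unit.size
    return dict(table)


for (institution, level), items in sorted(atlas_table(UNITS).items()):
    print(f"{institution:15} {level:8} {items}")
```

A flat tabulation like this corresponds to the static tabular view; an interactive version could regroup the same units by any other attribute, such as access or availability.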