
    No Pain No Gain: Standards mapping in Latimer Core development

    Latimer Core (LtC) is a new proposed Biodiversity Information Standards (TDWG) data standard that supports the representation and discovery of natural science collections by structuring data about the groups of objects that those collections and their subcomponents encompass (Woodburn et al. 2022). It is designed to be applicable to a range of use cases that include high-level collection registries, rich textual narratives and semantic networks of collections, as well as more granular, quantitative breakdowns of collections to aid collection discovery and digitisation planning.
    As a standard that is (in this first version) focused on natural science collections, LtC has significant intersections with existing data standards and models (Fig. 1) that represent individual natural science objects and occurrences and their associated data (e.g., Darwin Core (DwC), Access to Biological Collection Data (ABCD), the Conceptual Reference Model of the International Committee on Documentation (CIDOC-CRM)). LtC's scope also overlaps with standards for more generic concepts like metadata, organisations, people and activities (e.g., Dublin Core, the World Wide Web Consortium (W3C) ORG and PROV ontologies, Schema.org). LtC represents just one element of this extended network of data standards for the natural sciences and related concepts. Mapping between LtC and intersecting standards is therefore crucial for avoiding duplication of effort in the standard development process, and for ensuring that data stored using the different standards are as interoperable as possible, in alignment with the FAIR (Findable, Accessible, Interoperable, Reusable) principles. In particular, it is vital to make robust associations between records representing groups of objects in LtC and records (where available) that represent the objects within those groups.
    During LtC development, efforts were made to identify and align with relevant standards and vocabularies, and to adopt existing terms from them where possible. During expert review, a more structured approach was proposed and implemented using the Simple Knowledge Organization System (SKOS) mappingRelation vocabulary. This exercise helped to better describe the nature of the mappings between new LtC terms and related terms in other standards, and to validate decisions around the borrowing of existing terms for LtC. A further exercise used elements of the Simple Standard for Sharing Ontological Mappings (SSSOM) to start developing a more comprehensive set of metadata around these mappings. At present, these mappings (Suppl. material 1 and Suppl. material 2) are provisional and not considered to be comprehensive, but they should be further refined and expanded over time.
    Even with the support provided by the SKOS and SSSOM standards, the LtC experience has proven the mapping process to be far from straightforward. Different standards vary in how they are structured: for example, DwC is a 'bag of terms' with informal classes and no structural constraints, while more structured standards and ontologies like ABCD and PROV employ different approaches to how structure is defined and documented. The various standards use different metadata schemas and serialisations (e.g., Resource Description Framework (RDF), XML) for their documentation, and different approaches to providing persistent, resolvable identifiers for their terms. There are also many subtle nuances involved in assessing the alignment between the concepts that the source and target terms represent, particularly when assessing whether a match is exact enough to allow the existing term to be adopted. These factors make the mapping process quite manual and labour-intensive. Approaches and tools, such as decision trees (Fig. 2) that represent the logic involved and further exploration of the SSSOM standard, could help to streamline this process.
    In this presentation, we will discuss the LtC experience of the standards mapping process, the challenges faced and the methods used, and the potential to contribute this experience to collaborative standards mapping within the anticipated TDWG Standards Mapping Interest Group.
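As an illustration of the kind of mapping record this process produces, the sketch below expresses a single mapping between a hypothetical LtC term and the Darwin Core term dwc:scientificName, first as a SKOS mapping-relation triple (via rdflib) and then as an SSSOM-style row. The LtC term IRI, the choice of skos:closeMatch and the metadata values are invented for the example and are not taken from the published LtC mappings.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

# Hypothetical term IRIs, used purely for illustration
LTC = Namespace("http://example.org/ltc/terms/")
DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

g = Graph()
g.bind("skos", SKOS)

# skos:closeMatch records that the concepts are similar but not interchangeable;
# an exact match (skos:exactMatch) would support adopting the existing term outright.
g.add((LTC.objectClassificationName, SKOS.closeMatch, DWC.scientificName))
print(g.serialize(format="turtle"))

# The same assertion as a minimal SSSOM-style record (tab-separated values);
# column names follow the SSSOM specification, values are invented.
sssom_row = {
    "subject_id": str(LTC.objectClassificationName),
    "predicate_id": "skos:closeMatch",
    "object_id": str(DWC.scientificName),
    "mapping_justification": "semapv:ManualMappingCuration",
    "comment": "Concepts overlap, but the LtC term applies to groups of objects.",
}
print("\t".join(sssom_row.keys()))
print("\t".join(sssom_row.values()))
```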

    DiSSCo Prepare WP7 – D7.3 Assessment tools and direction map to the implementation of common DiSSCo policies

    The Distributed System for Scientific Collections (DiSSCo) Research Infrastructure will operate a number of e-services, all of which will have policy requirements for participating institutions. These policies include those related to digital and physical access to specimens, digital image and specimen metadata, and FAIR / Open Data. Previous projects have shown that the policy landscape is complex, and Task 7.3 has developed a policy self-assessment tool that will allow DiSSCo to assess policy alignment across the consortium. This deliverable describes the development of the policy self-assessment tool and provides a walkthrough of the key features. The same technical framework was used to create a digital maturity tool, which was initially proposed by Task 3.1, and this is also described within this document. A set of recommendations are included that outline the future direction for the development of the policy tool.

    What, when and where: 'Broad and thin' specimen digitisation for biodiversity research (RDA Plenary 6, 2015)

    Presentation given in the Biodiversity Integration Interest Group meeting at the 6th RDA Plenary in Paris, 2015. A description of the factors and challenges involved in digitising and publishing data from the huge and heterogeneous collections of the Natural History Museum, London.

    Rethinking Collection Management Data Models

    The data modelling of physical natural history objects has never been trivial, and the need for greater interoperability and adherence to multiple standards and internal requirements has made the task more challenging than ever. The Natural History Museum's internal RECODE (Rethinking Collections Data Ecosystems; see Dupont et al. 2022) programme has taken the approach of creating a data model to fit these internal and external requirements, rather than trying to force an existing data model to work with our next-generation collections management system (CMS) requirements. In this regard, community standards become vitally important, and existing and emerging standards and models like Spectrum, Darwin Core, Access to Biological Collection Data (ABCD) (Extended for Geosciences (EFG)), Latimer Core and the Conceptual Reference Model from the International Committee for Documentation (CIDOC CRM) have been and will be used heavily to inform this work. The poster will provide a starting point for publicly sharing and discussing the work that the RECODE programme has done, and for eliciting ideas that members of the community may have regarding its continuing improvement.
    We have concentrated on creating a backbone for the data model, from collecting, through object curation, to scientific identification. This has yielded two significant outcomes:
    The Collection Object: Traditional CMS data models treat each specimen as a single record in the database. The RECODE model recognises that there are a number of different concepts that need their own entities:
    - Collected material: the specimens collected in the field are not always fully identified or separated into discrete items.
    - Stored object: the aim of the RECODE model is to treat all objects as the same type of entity, with relationships between them enhancing the data. For example, a collection object is defined as a discrete object that can be moved and loaned independently. Its specific type (e.g., specimen, preparation, derivation) is given by its relationships to other collection objects.
    - Identifiable item: what can be taxonomically identified does not necessarily have a 1-to-1 relationship with the stored objects. One item may contain multiple species (e.g., a parasite and host; a rock containing many minerals), or one species may be split across many objects (e.g., long branches on two or more herbarium sheets; large skeletons stored in separate locations).
    The Collection Level Description (CLD): This is a construct to enable the attachment of descriptive and quantitative data to groups of collection objects, rather than to individual collection objects. There will always be a need for an inventory that represents the basic holdings, organisation and indexing of collections, as well as a variety of use cases for grouping collection objects and attaching information at the group level.
    The next challenge is to integrate the concepts more closely with each other to provide the best possible description of the collection and make it as shareable as possible. Some of the current challenges being addressed are:
    - An object group may represent a heterogeneous group of objects.
    - There will be multiple parallel CLD schemes for different purposes.
    - Different attributes and metrics will be relevant to different schemes.
    - For some use cases, we need to be able to quantify relationships between an object group and its attributes, as well as attaching metrics to the object group itself.
    - We also need to be able to reflect relationships between object groups.
    These challenges necessitate a data model that has a considerable degree of flexibility but enables rules and constraints to be introduced as appropriate for the different use cases. It is also important that, wherever possible, the model uses the same attributes as individual collection objects, so that object groups can be implicitly linked to collection object records through common attributes as well as explicitly linked within the model. The aim of the conceptual model is to reflect these requirements.
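A rough sketch of how the entity split described above might look in code is given below. The class and field names are invented for illustration and do not reproduce the actual RECODE model, which is richer and not published here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CollectionObject:
    """A discrete, independently movable and loanable object. Its specific nature
    (specimen, preparation, derivation, ...) comes from its relationships rather
    than a fixed subtype."""
    object_id: str
    related_objects: List["ObjectRelationship"] = field(default_factory=list)


@dataclass
class ObjectRelationship:
    """Typed link between two collection objects, e.g. 'preparedFrom'."""
    relationship_type: str
    target: CollectionObject


@dataclass
class IdentifiableItem:
    """The unit that carries a taxonomic identification; it may span several
    stored objects (one skeleton in many boxes) or sit inside one object
    alongside other items (a parasite and its host)."""
    item_id: str
    scientific_name: Optional[str]
    stored_in: List[CollectionObject] = field(default_factory=list)


@dataclass
class ObjectGroup:
    """Collection Level Description: descriptive and quantitative data attached
    to a group of objects rather than to individual objects."""
    group_id: str
    scheme: str                      # e.g. taxonomy, storage location, stratigraphy
    estimated_item_count: int
    members: List[CollectionObject] = field(default_factory=list)
```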

    Persistent Identifiers at the Natural History Museum

    This report describes the use of identifiers and persistent identifiers (PIDs) at the Natural History Museum (NHM), London. The NHM is a visitor attraction and international science centre for natural history collections. It has an extensive research programme and employs approximately 300 research scientists. It is in the midst of an extensive collection digitisation programme to make all of the specimens in its collections available online; almost 4.8 million of the 80 million specimens are available so far. The NHM's main internal identifier for collection objects is a registration number, and the Museum also uses barcodes in some areas of its collection, which have the registration number encoded within them. The registration number is included in the Museum's collections management system (CMS), EMu, which is in the process of being replaced. Registration numbers were historically assigned independently by the five departments of the Museum, which means they are not always unique and do not have a standard format. As part of the programme of work to replace the CMS, the NHM is creating a new data model to document complex digital objects more effectively, of which identifiers will form a core part. The NHM's Data Portal forms the main external point of access to the NHM's research and specimen collections. The digitised specimens, currently numbering 4.8 million, are assigned Globally Unique Identifiers (GUIDs), which form citable, versioned links to records. These are not true PIDs, as their persistence is not governed and they can occasionally become inaccessible, but they do comply with linked data standards and the CETAF Stable Identifiers initiative. The NHM mints Digital Object Identifiers (DOIs), registered with DataCite, for datasets created by staff and any researchers affiliated with the Museum. These do not have to be based on the specimen collections but in practice often are. As much of the data is tabular, the Data Portal allows DOIs to be minted for each query as needed by a user, so that the retrieved data can then be cited and resolved.
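To illustrate why the distinction between internal registration numbers and resolvable identifiers matters, the sketch below pairs a department-qualified registration number with a separately minted GUID and a versioned record URL. The URL pattern, field names and example values are all hypothetical and are not the Data Portal's actual scheme.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecimenIdentifiers:
    department: str            # registration numbers were assigned per department
    registration_number: str   # human-readable, not guaranteed unique on its own
    guid: str                  # machine-resolvable identifier for the digital record

    @property
    def local_key(self) -> str:
        # Qualify the registration number with its department to disambiguate it
        return f"{self.department}:{self.registration_number}"

    def record_url(self, version: int) -> str:
        # Hypothetical versioned, citable link in the spirit of the Data Portal GUIDs
        return f"https://data.example.org/object/{self.guid}/{version}"


ids = SpecimenIdentifiers("ZOO", "1901.2.3.45", str(uuid.uuid4()))
print(ids.local_key)
print(ids.record_url(version=2))
```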

    Towards Community Collections Management

    Collections Management Systems (CMS) are central to the operation of many natural science collections. Over the past few decades, these have evolved from simple tables and databases recording the contents of our collections to take on multiple roles supporting complex business processes and information management needs within our organisations. These new functional demands have often outpaced the technical development of these systems and the organisational capacity to sustain them. Furthermore, their contents essentially remain institutional silos, managed and controlled by single institutions, despite servicing data and a scientific mission that is shared across the global community. The Natural History Museum, London (NHM) has recently embarked on a journey, working with peer institutions and external consultants, to develop a platform-agnostic set of requirements for a sustainable, scalable and interoperable CMS. The vision of a highly efficient, more flexible and connected solution that can engage with other collections-based organisations and stakeholders is not unique to the NHM, and by working with others, including the developers of other CMS, we seek to generate a better understanding of shared collection management business processes, data and data models. To achieve these goals, a dedicated year-long programme has been formed to address the many facets of a collections management system specification (requirements, processes, standards, models and compliance), and to engage the landscape of internal and external peers, stakeholders and CMS providers. In this presentation, we will provide some background to explain how the NHM has arrived at its current position and discuss our vision for building on the discussion around existing standards, interoperability and data access. We will summarise the programme's structure and plans, report on the progress of the first few months, and highlight any challenges encountered and solutions delivered.

    Join the Dots: Adding collection assessment to collection descriptions

    The natural science collections community has identified an increasing need for shared, structured and interoperable data standards that can be used to describe the totality of institutional collection holdings, whether digitised or not. Major international initiatives - including the Global Biodiversity Information Facility (GBIF), the Distributed System of Scientific Collections (DiSSCo) and the Consortium of European Taxonomic Facilities (CETAF) - consider the current lack of standards to be a major barrier, which must be overcome to further their strategic aims and contribute to an open, discoverable catalogue of global collections. The Biodiversity Information Standards (TDWG) Collection Descriptions (CD) group is looking to address this issue with a new data standard for collection descriptions. At an institutional level, this concept of collection descriptions aligns strongly with the need for a structured and more data-driven approach to assessing and working with collections, both to identify and prioritise investment and effort, and to monitor the impact of the work. Use cases include planning conservation and collection moves, prioritising specimen digitisation activities, and informing collection development strategy. The data can be integrated with the collection description framework for ongoing assessments of the state of the collection. This approach was pioneered with the 'Move the Dots' methodology by the Smithsonian National Museum of Natural History, started in 2009 and run annually since: the collection is broken down into several hundred discrete subcollections, for each of which the number of objects is estimated and a numeric rank allocated according to a range of assessment criteria. This method has since been adopted by several other institutions, including Naturalis Biodiversity Centre, Museum für Naturkunde and the Natural History Museum, London (NHM). First piloted in 2016, and now implemented as a core framework, the NHM's adaptation, 'Join the Dots', divides the collection into approximately 2,600 'collection units'. The breakdown uses formal controlled lists and hierarchies, primarily taxonomy, type of object, storage location and (where relevant) stratigraphy, which are mapped to external authorities such as the Catalogue of Life and the Paleobiology Database. The collection breakdown is enhanced with estimates of the number of items, and ranks from 1 to 5 for each collection unit against 17 different criteria. These are grouped into four categories: 'Condition', 'Information' (including digital records), 'Importance and Significance' and 'Outreach'. Although requiring significant time investment from collections staff to provide the estimates and assessments, this methodology has yielded a rich dataset that supports both discoverability (collection descriptions) and management (collection assessment). Links to further datasets about the building infrastructure and environmental conditions also make it a powerful resource for planning activities such as collection moves, pest monitoring and building work. We have developed dynamic dashboards to provide rich visualisations for exploring, analysing and communicating the data. As an ongoing, embedded activity for collections staff, there will also be a build-up of historical data going forward, enabling us to see trends, track changes to the collection, and measure the impact of projects and events.
    The concept of Join the Dots also offers a generic, institution-agnostic model for enhancing the collection description framework with additional metrics that add value for strategic management and resourcing of the collection. In the design and implementation, we have faced challenges that should be highly relevant to the TDWG CD group, such as managing the dynamic breakdown of collections across multiple dimensions. We also face some that are yet to be resolved, such as a robust model for managing the evolving dataset over time. We intend to contribute these use cases to the development of the new TDWG data standard and to be an early adopter and reference case. We envisage that this could constitute a common model that, where resources are available, provides the ability to add greater depth and utility to the world catalogue of collections.
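As a rough illustration of the data this methodology yields, the sketch below shows an invented 'collection unit' with an estimated item count and 1-5 ranks grouped under the four categories, summarised with a simple mean per category. The criterion names and the averaging step are assumptions for the example; the published methodology defines 17 criteria and does not prescribe a particular aggregation.

```python
from statistics import mean

unit = {
    "name": "Coleoptera: Carabidae (dry pinned)",   # invented example unit
    "estimated_items": 120_000,
    "ranks": {                                       # each rank runs 1 (poor) to 5 (good)
        "Condition": {"physical state": 3, "storage quality": 4},
        "Information": {"label data": 2, "digital records": 1},
        "Importance and Significance": {"type specimens": 5},
        "Outreach": {"display potential": 3},
    },
}

# Summarise each category as the mean of its criterion ranks
summary = {
    category: round(mean(scores.values()), 2)
    for category, scores in unit["ranks"].items()
}
print(unit["name"], summary)
```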

    Collections Digitization and Assessment Dashboard, a Tool for Supporting Informed Decisions

    Natural Science Collections (NSCs) contain specimen-related data from which we extract valuable information for science and policy. Openness of those collections facilitates the development of science. Moreover, virtual accessibility to physical containers by means of their digitization will allow an exponential increase in the level of available information. Digitization of collections will allow us to set up a comprehensive registry of reliable, accurate, updated, comparable and interconnected information. Equally, the scope of interested potential users will largely expand, and so will the different levels of granularity required by researchers, institutions and governmental bodies. Meeting diverse needs entails a special effort in data management and data analysis to extract, digest and present information in a compressed but still precise and objective-oriented format. The Collections Digitisation Dashboard (CDD) underpins such an attempt. The CDD stands as a practical tool that specifically aims to support high-level decisions with a wide coverage of data, by providing a visual, simplified and structured arrangement that allows the discovery of key indicators concerning the digitization of bio- and geodiversity collections. The realm of possible approaches to the CDD covers levels of digitization, collection exceptionality, resource availability and many others. Still, all those different angles need to be aligned and processed at once to provide an overall overview of the status of NSCs in the digitization process and to analyse its further development. The CDD is a powerful mechanism to identify priorities, specialisation lines, regional development, gaps and niches, future capabilities, and strengths and weaknesses across collections, institutions, countries and regions. It can underpin measurable and comparable assessments, with evolution indexes and progress indicators, all under an overarching homogeneous approach. The Distributed System of Scientific Collections (DiSSCo) Research Infrastructure, currently in its preparatory phase, is built on top of the largest ever community of collections-related institutions across Europe and anchored on the Consortium of European Taxonomic Facilities (CETAF). It aims to provide a unique virtual access point to NSCs by facilitating a large and massive digitisation effort throughout Europe. Setting priorities and specialization areas is pivotal to its success. To that end, the DiSSCo CDD will provide a valuation tool to summarize and showcase NSCs' digitization status in a first-hand visualization. Different projects and initiatives will contribute, jointly and on a synergetic basis, to the production of the DiSSCo CDD. The ICEDIG project will address its basic features, terms of classification and tiers of information, and will produce a prototype and a set of recommendations on how best to build such a large-scale dashboard by collating specific collections-based information and defining global strategic representations. CETAF working groups on collections and digitization will provide the desired homogeneity in describing and capturing the different implementation requirements from the users' perspectives, which will be complemented by the contributions made under the umbrella of the COST Action MOBILISE.
    The Action will use networking activities to identify the right standards and policies for enlarging the scope of the DiSSCo CDD and enabling its broader implementation by linking to TDWG criteria and adopted standards. Complementarily, the ELViS platform to be developed under the SYNTHESYS+ project will provide the right virtual environment. Furthermore, SYNTHESYS+ will address the assessment capabilities of the CDD, enabling the visual representation to become a practical assessment mechanism and endowing it with a dynamic feature for analysis over time. The DiSSCo CDD will thus become an instrumental mechanism for decision-making that will be embedded into the clustering initiative of products and services provided to the EOSC by the ENVRI-FAIR project in the environmental domain.
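A toy example of the kind of progress indicator such a dashboard could surface is sketched below: per-institution and overall digitisation percentages computed from holdings and digitised counts. The figures and field names are invented and do not represent the CDD's actual design.

```python
collections = [
    {"institution": "Museum A", "holdings": 80_000_000, "digitised": 4_800_000},
    {"institution": "Museum B", "holdings": 10_000_000, "digitised": 2_500_000},
]

# Per-institution indicator: percentage of holdings digitised so far
for c in collections:
    c["percent_digitised"] = 100 * c["digitised"] / c["holdings"]
    print(f"{c['institution']}: {c['percent_digitised']:.1f}% digitised")

# Overall indicator, weighted by collection size
total_holdings = sum(c["holdings"] for c in collections)
total_digitised = sum(c["digitised"] for c in collections)
print(f"Overall: {100 * total_digitised / total_holdings:.1f}% digitised")
```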

    Technical capacities of digitisation centres within ICEDIG participating institutions

    DiSSCo, the Distributed System of Scientific Collections, is seeking to centralise certain infrastructure and activities relating to the digitisation of natural science collections. Deciding what activities to distribute, what to centralise, and what geographic level of aggregation (e.g. regional, national or pan-European) is most appropriate for each task was one of the challenges set out within the EC-funded ICEDIG project. In this paper we present the results of a survey of several European collections to establish current digitisation capacity, strengths and skills associated with existing digitisation infrastructure. Our results indicate that most of the institutions surveyed are engaged in large-scale digitisation of collections and that this is usually being undertaken by dedicated teams of digitisers within each institution. Some cross-institutional collaboration is happening, but this is still the exception for a variety of funding and practical reasons. These results inform future work that establishes a set of principles to determine how digitisation infrastructure might be most efficiently organised across European organisations in order to maximise progress on the digitisation of the estimated 1.5 billion specimens held within European natural science collections.

    Exposing the Dark Data of Undigitized Collections: A TDWG global standard for collection descriptions

    Aggregating content of museum and scientific collections worldwide offers us the opportunity to realize a virtual museum of our planet and the life upon it through space and time. By mapping specimen-level data records to standards and publishing this information, an increasing number of collections contribute to a digitally accessible wealth of knowledge. Visualizing these digital records by parameters such as collection type and geographic origin helps collections and institutions to better understand their digital holdings and compare them to other such collections, as well as enabling researchers to find specimens and specimen data quickly (Singer et al. 2018). At the higher level of collections, related people and their activities, and especially the great majority of material that is yet to be digitised, we know much less. Many collections hold material not yet digitally discoverable in any form. For those that do publish collection-level data, it is commonly text-based data without the Globally Unique Identifiers (GUIDs) or the controlled vocabularies that would support quantitative collection metrics and aid discovery of related expertise and publications. To best understand and plan for our world's bio- and geodiversity represented in collections, we need standardised, quantitative collections-level metadata. Various groups planet-wide are actively developing tools to capture this much-needed metadata, including information about the backlog, and more detailed information about institutions and their activities (e.g. staffing, space, species-level inventories, geographic and taxonomic expertise, and related publications) (Smith et al. 2018). The Biodiversity Information Standards organization (TDWG) Collection Descriptions (CD) Data Standard Task Group aims to provide a data standard for describing natural scientific collections, which will make it possible to provide automated metrics, using standardised collection descriptions and/or data derived from specimen datasets (e.g., counts of specimens), and a global registry of physical collections (either digitised or non-digitised). The group will also produce a data model to underpin the new standard, and provide guidance and reference implementations for the practical use of the standard in institutional and collaborative data infrastructures. Our task group includes members from a myriad of groups with a stake in mobilizing such data at local, regional, domain-specific and global levels. With such a standard adopted, it will be possible to effectively share data across different community resources. So far, we have carried out landscape analyses of existing collection description frameworks, and amassed a portfolio of use cases from the group as well as from a range of other sources, including the Collection Descriptions Dashboard working group of ICEDIG ("Innovation and consolidation for large scale digitisation of natural heritage"), iDigBio (Integrated Digitized Biocollections), the Smithsonian, Index Herbariorum, the Field Museum, GBIF (Global Biodiversity Information Facility), GRBio (Global Registry of Biodiversity Repositories) and fishfindR.net. These were used to develop a draft data model and, between them, inform the first iteration of the draft CD data standard. A variety of challenges present themselves in developing this standard.
    Some relate to the standard development process itself, such as identifying (and often learning) effective tools and methods for collaborative working and communication across globally distributed volunteers. Others concern defining the scope and gaining consensus from stakeholders across a wide range of disciplines, while maintaining achievable goals. Further challenges arise from the requirement to develop a data model and standard that support such a variety of use cases and priorities, while retaining interoperability and manageability of the data. We will present some of these challenges and methods for addressing them, and summarise the progress and draft outputs of the group so far. We will also discuss the vision of how the new standard may be adopted and its potential impact on collections discoverability across the natural science collections community.
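To illustrate the automated-metrics use case mentioned above, the sketch below derives a simple collection-level metric (specimen counts per family and country) from specimen-level records with Darwin Core-style fields. The records and the choice of grouping are invented for the example and do not reflect the draft standard's actual terms.

```python
from collections import Counter

# Minimal Darwin Core-style rows, invented for the example
specimen_records = [
    {"family": "Carabidae", "country": "United Kingdom"},
    {"family": "Carabidae", "country": "France"},
    {"family": "Orchidaceae", "country": "Brazil"},
]

# Counts of specimens per family/country pair: a simple collection-description metric
metric = Counter((r["family"], r["country"]) for r in specimen_records)
for (family, country), count in metric.items():
    print(f"{family} / {country}: {count} specimens")
```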