‘openDS’ – progress on the new standard for digital specimens.
In a Biodiversity_Next 2019 symposium, a vision of Digital Specimens based on the concept of a Digital Object Architecture (DOA; Kahn and Wilensky 2006) was discussed as a new layer between the data infrastructure of natural science collections and user applications for processing and interacting with information about specimens and collections. This vision would enable the transformation of institutional curatorial practices into joint community curation of the scientific data by providing seamless global access to specimens and collections spanning multiple collection-holding institutions and sources. A DOA-based implementation (Lannom et al. 2020) also offers wider, more flexible, and ‘FAIR’ (Findable, Accessible, Interoperable, Reusable) access for varied research and policy uses: recognising curatorial work, annotating with the latest taxonomic treatments, understanding variations, working with DNA sequences or chemical analyses, supporting regulatory processes for health, food, security, sustainability and environmental change, inventions/products critical to the bio-economy, and educational uses. To make this vision a reality, a specification is needed that describes what a Digital Specimen is and how to technically implement it. This specification is named 'openDS', for open Digital Specimen. It needs to describe how machines and humans can act on a Digital Specimen and gain attribution for their work; how the data can be serialized and packaged; and the object model (the scientific content part and its structure). The object model should describe how to include the specimen data itself as well as all data derived from the specimen, which is in principle the same as what the Extended Specimen model aims to describe. This part will therefore be developed in close collaboration with people working on that model.
After the Biodiversity_Next symposium, the idea of a standard for Digital Specimens was further discussed and detailed in a MOBILISE Workshop in Warsaw in 2020, with stakeholders such as GBIF, iDigBio, CETAF and DiSSCo. The workshop examined the technical basis of the new specification, agreed on its scope and structure, and laid groundwork for future activities in the Research Data Alliance (RDA), Biodiversity Information Standards (TDWG), and technical workshops. A working group in the DiSSCo Prepare project has begun work on the technical specification of the ‘open Digital Specimen’ (openDS). This specification will provide the definition of what a Digital Specimen is, its logical structure and content, and the operations permitted on it. The group is also working on a document with frequently asked questions. Realising the vision of the Digital Specimen at a global level requires openDS to become a new TDWG standard and to be aligned with the vision for Extended Specimens. A TDWG Birds-of-a-Feather working session in September 2020 will discuss and plan this further. The object model will include concepts from ABCD 3.0 and the EFG extension for geo-sciences, and will also extend from bco:MaterialSample in the OBO Foundry’s Biological Collection Ontology (BCO), which is linked to Darwin Core, and from iao:InformationContentEntity in the OBO Foundry's Information Artifact Ontology (IAO). openDS will also make use of the RDA/TDWG attribution metadata recommendation and other RDA recommendations. A publication is in preparation that describes the relationship with RDA recommendations in more detail, which will also be presented in the TDWG symposium.
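The openDS idea of an actionable, serializable Digital Specimen object can be sketched in code. This is a minimal illustration only: openDS is still being specified, so all field names below are assumptions for illustration, not the normative schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch: field names are assumptions, not the openDS schema.
@dataclass
class DigitalSpecimen:
    pid: str                   # globally unique persistent identifier
    physical_specimen_id: str  # link back to the physical object
    scientific_name: str
    institution_code: str
    # Derived data (images, sequences, annotations) referenced by PID,
    # re-uniting fragmented data classes as parts of one object.
    derived_object_pids: list = field(default_factory=list)

def serialize(ds: DigitalSpecimen) -> dict:
    """Package the object for exchange, e.g. as JSON."""
    return {
        "@type": "DigitalSpecimen",
        "pid": ds.pid,
        "physicalSpecimenId": ds.physical_specimen_id,
        "scientificName": ds.scientific_name,
        "institutionCode": ds.institution_code,
        "derivedObjects": ds.derived_object_pids,
    }

ds = DigitalSpecimen(
    pid="20.5000.1025/abc-123",
    physical_specimen_id="RMNH.1234",
    scientific_name="Quercus robur L.",
    institution_code="RMNH",
    derived_object_pids=["20.5000.1025/img-1"],
)
record = serialize(ds)
```

Serializing through one typed structure like this is what lets both machines and humans act on the same object and receive attribution for changes to it.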
Identifications in BioPortals™
BioPortals are a ‘Google-like’ web portal solution tailored to national or thematic biological diversity information needs. The solution provides an efficient route to retrieving information from heterogeneous biological information sources after identification with integrated identification systems. BioPortals can be used in combination with mobile devices and provide options to share biological observation data after identification in the field.
Digital Object Cloud for linking natural science collections information; The case of DiSSCo
DiSSCo (the Distributed System of Scientific Collections) is a Research Infrastructure (RI) aiming to provide unified physical (transnational), remote (loans) and virtual (digital) access to the approximately 1.5 billion biological and geological specimens in collections across Europe. DiSSCo represents the largest ever formal agreement between natural science museums (114 organisations across 21 European countries). With political and financial support across 14 European governments and a robust governance model, DiSSCo will deliver, by 2025, a series of innovative end-user discovery, access, interpretation and analysis services for natural science collections data. As part of DiSSCo's developing data model, we evaluate the application of Digital Objects (DOs), which can act as the centrepiece of its architecture. DOs have bit-sequences representing some content, are identified by globally unique persistent identifiers (PIDs) and are associated with different types of metadata. The PIDs can be used to refer to different types of information such as locations, checksums, types and other metadata to enable immediate operations. In the world of natural science collections, currently fragmented data classes (inter alia genes, traits, occurrences) that have derived from the study of physical specimens can be re-united as parts in a virtual container (i.e., as components of a Digital Object). These typed DOs, when combined with software agents that scan the data offered by repositories, can act as complete digital surrogates of the physical specimens. In this paper we: 1. investigate the architectural and technological applicability of DOs for large-scale data RIs for bio- and geo-diversity, 2. identify benefits and challenges of a DO approach for the DiSSCo RI and 3. describe key specifications (incl. metadata profiles) for a specimen-based new DO type.
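The Digital Object pattern described above (a PID resolving to a typed record carrying location, checksum and component parts, enabling operations before the bit-sequence itself is fetched) can be sketched as follows. The registry, PIDs and record layout here are illustrative assumptions, not DiSSCo's actual services.

```python
# Minimal sketch of PID resolution for typed Digital Objects.
# Registry contents are invented for illustration.
PID_REGISTRY = {
    "20.5000.1025/spec-001": {
        "type": "DigitalSpecimen",
        "location": "https://example.org/objects/spec-001",
        "checksum": "sha256:dummy",
        # Fragmented data classes (genes, images, ...) re-united as parts:
        "parts": ["20.5000.1025/gene-77", "20.5000.1025/img-42"],
    },
}

def resolve(pid: str) -> dict:
    """Return the metadata record a PID points to (KeyError if unknown)."""
    return PID_REGISTRY[pid]

meta = resolve("20.5000.1025/spec-001")
```

A software agent scanning repositories would populate such records, turning the typed DO into a digital surrogate of the physical specimen.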
Requirement analysis for the DiSSCo research infrastructure
DiSSCo – the Distributed System of Scientific Collections – will mobilise, unify and deliver bio- and geo-diversity information at the scale, form and precision required by scientific communities, and thereby transform a fragmented landscape into a coherent and responsive research infrastructure. At present DiSSCo has 115 partners from 21 countries across Europe. The DiSSCo research infrastructure will enable critical new insights from integrated digital data to address some of the world's greatest challenges, such as biodiversity loss, food security and impacts of climate change. A requirement analysis for DiSSCo was conducted, through a large survey using epic user stories, to ensure that all of its envisioned future uses are accommodated. An epic user story has the following format: "As a [scientist] I want to [map the distribution of a species through time] so that I can [analyse the impact of climate change]; for this I need [all georeferenced specimen records through time]". Several consultation rounds within the ICEDIG community resulted in 78 unique user stories that were assigned to one or more of seven recognised stakeholder categories: Research, Collection management, Technical support, Policy, Education, Industry, and External. Each user story was assessed for the level of collection detail it required; four levels of detail were recognised: Collection, Taxonomic, Storage unit, and Specimen level. Furthermore, it was assessed whether the envisioned future uses of digitised natural history collections were possible without the DiSSCo research infrastructure. Subsequently, 1243 identified stakeholders were invited to review the DiSSCo user stories through a Survey Monkey questionnaire. Additionally, an invitation for review was posted in several Facebook groups and announced on Twitter.
A total of 379 stakeholders responded to the invitation, which led to 85 additional user stories for the envisioned use of the DiSSCo research infrastructure. In order to assess which component of the DiSSCo data flow diagram should facilitate each described user story, all user stories were mapped to the five phases of the DiSSCo Data Management Cycle (DMC): data acquisition, curation, publishing, processing, and use. At present, the user stories are being analysed and the results will be presented in this symposium.
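The categorisation step described above (each user story assigned to one or more stakeholder categories and one DMC phase) amounts to a simple tally. The story entries below are invented placeholders, not real survey responses.

```python
from collections import Counter

# Five phases of the DiSSCo Data Management Cycle, as listed above.
DMC_PHASES = ["acquisition", "curation", "publishing", "processing", "use"]

# Placeholder user stories (invented for illustration).
stories = [
    {"stakeholders": ["Research"], "phase": "use"},
    {"stakeholders": ["Collection management", "Policy"], "phase": "curation"},
    {"stakeholders": ["Research"], "phase": "processing"},
]

# Sanity check: every story maps onto exactly one known DMC phase.
assert all(s["phase"] in DMC_PHASES for s in stories)

phase_counts = Counter(s["phase"] for s in stories)
stakeholder_counts = Counter(c for s in stories for c in s["stakeholders"])
```

Counting per phase shows which component of the data flow diagram must carry the most requirements; counting per stakeholder category (where one story may count several times) shows whose needs dominate the backlog.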
DiSSCo Prepare Project: Increasing the Implementation Readiness Levels of the European Research Infrastructure
The Distributed System of Scientific Collections (DiSSCo) is a new world-class Research Infrastructure (RI) for Natural Science Collections. The DiSSCo RI aims to create a new business model for one European collection that digitally unifies all European natural science assets under common access, curation, policies and practices that ensure that all the data is easily Findable, Accessible, Interoperable and Reusable (FAIR principles). DiSSCo represents the largest ever formal agreement between natural history museums, botanic gardens and collection-holding institutions in the world. DiSSCo entered the European Roadmap for Research Infrastructures in 2018 and launched its main preparatory phase project (DiSSCo Prepare) in 2020. DiSSCo Prepare is the primary vehicle through which DiSSCo reaches the overall maturity necessary for its construction and eventual operation. DiSSCo Prepare raises DiSSCo’s implementation readiness level (IRL) across five dimensions: technical, scientific, data, organisational and financial. Each dimension of implementation readiness is separately addressed by specific Work Packages (WP) with distinct targets, actions and tasks that will deliver DiSSCo’s Construction Masterplan. This comprehensive and integrated Masterplan will be the product of the outputs of all of its content-related tasks and will be the project’s final output. It will serve as the blueprint for construction of the DiSSCo RI, including establishing it as a legal entity. DiSSCo Prepare builds on the successful completion of DiSSCo’s design study, ICEDIG, and the outcomes of other DiSSCo-linked projects such as SYNTHESYS+ and MOBILISE. This paper is an abridged version of the original DiSSCo Prepare grant proposal. It contains the overarching scientific case for DiSSCo Prepare, alongside a description of our major activities.
A choice of persistent identifier schemes for the Distributed System of Scientific Collections (DiSSCo)
Persistent identifiers (PIDs) to identify digital representations of physical specimens in natural science collections (i.e., digital specimens) unambiguously and uniquely on the Internet are one of the mechanisms for digitally transforming collections-based science. Digital Specimen PIDs contribute to building and maintaining long-term community trust in the accuracy and authenticity of the scientific data to be managed and presented by the Distributed System of Scientific Collections (DiSSCo) research infrastructure, planned in Europe to commence implementation in 2024. Not only are such PIDs valid over the very long timescales common in the heritage sector, but they can also transcend changes in the underlying technologies of their implementation. They are part of the mechanism for widening access to natural science collections. DiSSCo technical experts previously selected the Handle System as the choice to meet core PID requirements.
Using a two-step approach, this options appraisal captures, characterises and analyses alternative Handle-based PID schemes and their possible operational modes of use. In a first step, a weighting and ranking of the options was applied, followed by a structured qualitative assessment of social and technical compliance across several assessment dimensions: levels of scalability, community trust, persistence, governance, appropriateness of the scheme and suitability for future global adoption. The results are discussed in relation to branding, community perceptions and global context to determine a preferred PID scheme for DiSSCo that also has potential for adoption and acceptance globally.
DiSSCo will adopt a ‘driven-by DOI’ persistent identifier (PID) scheme customised with natural sciences community characteristics. Establishing a new Registration Agency in collaboration with the International DOI Foundation is a practical way forward to support the FAIR (findable, accessible, interoperable, reusable) data architecture of the DiSSCo research infrastructure. This approach is compatible with the policies of the European Open Science Cloud (EOSC) and is aligned with existing practices across the global community of natural science collections.
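The ‘driven-by DOI’ choice rests on the fact that a DOI is itself a Handle: both take the prefix/suffix form, so one resolution pattern can serve either scheme. The sketch below illustrates this; the prefix and suffix values are made up for illustration.

```python
# Sketch: a DOI is a Handle with a "10."-style prefix, so parsing and
# resolution work the same way for both. Example PIDs are invented.
def split_handle(pid: str) -> tuple:
    """Split a Handle or DOI into (prefix, suffix)."""
    prefix, _, suffix = pid.partition("/")
    if not suffix:
        raise ValueError(f"not a valid handle: {pid!r}")
    return prefix, suffix

def resolver_url(pid: str) -> str:
    """Build a resolution URL via the proxy shared by both schemes."""
    return f"https://hdl.handle.net/{pid}"

prefix, suffix = split_handle("10.12345/dissco.specimen.001")
url = resolver_url("10.12345/dissco.specimen.001")
```

Because the suffix is opaque, community characteristics (specimen typing, branding) live in the metadata registered behind the PID rather than in the identifier string itself, which is what keeps the scheme valid across technology changes.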
Machine learning as a service for DiSSCo’s digital specimen architecture
International mass digitization efforts through infrastructures like the European Distributed System of Scientific Collections (DiSSCo), the US resource for Digitization of Biodiversity Collections (iDigBio), the National Specimen Information Infrastructure (NSII) of China, and Australia’s digitization of National Research Collections (NRCA Digital) make geo- and biodiversity specimen data freely, fully and directly accessible.
Complementary, overarching infrastructure initiatives like the European Open Science Cloud (EOSC) were established to enable mutual integration, interoperability and reusability of multidisciplinary data streams including biodiversity, Earth system and life sciences.
Natural Science Collections (NSC) are of particular importance for such multidisciplinary and internationally linked infrastructures, since they provide hard scientific evidence by allowing direct traceability of derived data (e.g., images, sequences, measurements) to physical specimens and material samples in NSC.
To open up the large amounts of trait and habitat data, and to link these data to digital resources like sequence databases (e.g., ENA), taxonomic infrastructures (e.g., GBIF) or environmental repositories (e.g., PANGAEA), proper annotation of specimen data with rich (meta)data early in the digitization process is required, alongside bridging technologies that facilitate the reuse of these data.
This was addressed in recent studies, where we employed computational image processing and artificial intelligence technologies (Deep Learning) for the classification and extraction of features like organs and morphological traits from digitized collection data (with a focus on herbarium sheets).
However, such applications of artificial intelligence are rarely—this applies both for (sub-symbolic) machine learning and (symbolic) ontology-based annotations—integrated in the workflows of NSC’s management systems, which are the essential repositories for the aforementioned integration of data streams.
This was the motivation for the development of a Deep Learning-based trait extraction and coherent Digital Specimen (DS) annotation service providing “Machine learning as a Service” (MLaaS) with a special focus on interoperability with the core services of DiSSCo, notably the DS Repository (nsidr.org) and the Specimen Data Refinery, as well as reusability within the data fabric of EOSC.
Taking up the use case of detecting and classifying regions of interest (ROIs) on herbarium scans, we demonstrate an MLaaS prototype for DiSSCo involving the digital object framework Cordra for the management of DS, as well as instant annotation of digital objects with extracted trait features (and ROIs) based on the DS specification openDS.
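The MLaaS flow described above (a detector returns regions of interest on a herbarium scan, and each accepted ROI is attached to the Digital Specimen as an annotation) can be sketched as follows. The detector is a stub standing in for the actual Deep Learning model, and the annotation layout is an illustrative assumption, not the normative openDS schema.

```python
# Sketch of "Machine learning as a Service" annotating a Digital Specimen.
def detect_rois(image_id: str) -> list:
    """Stub for the trained detector: label, bounding box, confidence."""
    return [
        {"label": "leaf",   "bbox": [120, 80, 340, 260], "score": 0.97},
        {"label": "flower", "bbox": [400, 90, 470, 150], "score": 0.88},
    ]

def annotate_specimen(specimen: dict, image_id: str,
                      min_score: float = 0.9) -> dict:
    """Attach high-confidence ROIs to the digital specimen record."""
    rois = [r for r in detect_rois(image_id) if r["score"] >= min_score]
    specimen.setdefault("annotations", []).extend(
        {"source": "MLaaS", "image": image_id, **r} for r in rois
    )
    return specimen

specimen = annotate_specimen({"pid": "20.5000.1025/spec-001"}, "img-42")
```

Writing the annotation back onto the object itself, rather than into a side database, is what makes the extracted traits immediately visible to every other service that resolves the specimen's PID.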
Envisaging a global infrastructure to exploit the potential of digitised collections
Tens of millions of images from biological collections have become available online over the last two decades. In parallel, there has been a dramatic increase in the capabilities of image analysis technologies, especially those involving machine learning and computer vision. While image analysis has become mainstream in consumer applications, it is still used only on an artisanal basis in the biological collections community, largely because the image corpora are dispersed. Yet, there is massive untapped potential for novel applications and research if images of collection objects could be made accessible in a single corpus. In this paper, we make the case for infrastructure that could support image analysis of collection objects. We show that such infrastructure is entirely feasible and well worth investing in