
    Specimens as research objects: reconciliation across distributed repositories to enable metadata propagation

    Botanical specimens are shared as long-term consultable research objects in a global network of specimen repositories. Multiple specimens are generated from a shared field collection event; these specimens are then managed individually in separate repositories and independently augmented with research and management metadata which could be propagated to their duplicate peers. Establishing a data-derived network for metadata propagation will enable the reconciliation of closely related specimens which are currently dispersed, unconnected and managed independently. Following a data mining exercise applied to an aggregated dataset of 19,827,998 specimen records from 292 separate specimen repositories, 36% (7,102,710 specimens) are assessed to participate in duplication relationships, allowing the propagation of metadata among the participants in these relationships: 93,044 type citations, 1,121,865 georeferences, 1,097,168 images and 2,191,179 scientific name determinations. The results enable the creation of networks to identify which repositories could work in collaboration. Some classes of annotation (particularly scientific name determinations) represent units of scientific work; appropriate management of these data would allow scholarly credit to accrue to individual researchers. Potential further work in this area is discussed.
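A minimal sketch of how such duplicate relationships can be proposed from aggregated records, assuming a CSV export with Darwin Core-style columns (recordedBy, recordNumber, eventDate, institutionCode); the file name and the coarse grouping key are illustrative, not the exact procedure used in this study:

```python
import pandas as pd

# Assumed input: aggregated specimen records with Darwin Core-style columns.
records = pd.read_csv("specimens.csv", dtype=str).fillna("")

def collection_key(row):
    """Coarse key for a field collection event: collector, number, date."""
    return (
        row["recordedBy"].strip().lower(),
        row["recordNumber"].strip(),
        row["eventDate"][:10],  # keep the date part only
    )

records["event_key"] = records.apply(collection_key, axis=1)

# Keys shared by more than one repository are candidate duplicate sets,
# across which metadata (georeferences, determinations, images) could propagate.
candidates = records.groupby("event_key").filter(
    lambda g: g["institutionCode"].nunique() > 1
)
for key, group in candidates.groupby("event_key"):
    print(key, sorted(group["institutionCode"].unique()))
```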

    Automating the construction of higher order data representations from heterogeneous biodiversity datasets

    This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Datasets created from large-scale specimen digitisation drive biodiversity research, but these are often heterogeneous: incomplete and fragmented. As aggregated data volumes increase, there have been calls to develop a “biodiversity knowledge graph” to better interconnect the data and support meta-analysis, particularly relating to the process of species description. This work maps data concepts and inter-relationships, and aims to develop automated approaches to detect the entities required to support these kinds of meta-analyses. An example is given using trends analysis of name publication events and their authors, which shows that despite the implementation and widespread adoption of major changes to the process by which authors can publish new scientific names for plants, the data show no difference in the rates of publication. A novel data-mining process based on unsupervised learning is described, which detects specimen collectors and the events preparatory to species description, allowing a larger set of data to be used in trends analysis. Record linkage techniques are applied to these two datasets to integrate data on authors and collectors to create a generalised agent entity, assessing specialisation and classifying working practices into separate categories. Recognising the role of agents (collectors, authors) in the processes (collection, publication) contributing to the recognition of new species, it is shown that features derived from data-mined aggregations can be used to build a classification model to predict which agent-initiated units of work are particularly valuable for species discovery. Finally, shared collector entities are used to integrate the distributed specimen products of a single collection event across institutional boundaries, maximising the impact of expert annotations. An inferred network of relationships between institutions, based on specimen sharing, allows community analysis and the definition of optimal co-working relationships for efficient specimen digitisation and curation.
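As an illustration of the institution-network analysis mentioned above, the following sketch builds a weighted graph from duplicate clusters and runs community detection with networkx; the duplicate_clusters input and the institution codes are hypothetical placeholders for the output of an earlier duplicate-detection step:

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical input: one list of institution codes per shared collection event.
duplicate_clusters = [["K", "E", "P"], ["K", "E"], ["P", "MO"], ["K", "MO"]]

G = nx.Graph()
for institutions in duplicate_clusters:
    for a, b in combinations(sorted(set(institutions)), 2):
        # Edge weight counts how many collection events the pair of repositories shares.
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)

# Communities suggest groups of repositories that could coordinate digitisation and curation.
for community in greedy_modularity_communities(G, weight="weight"):
    print(sorted(community))
```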

    People are essential to linking biodiversity data

    People are one of the best known and most stable entities in the biodiversity knowledge graph. The wealth of public information associated with people, and the ability to identify them uniquely, open up the possibility to make more use of these data in biodiversity science. Person data are almost always associated with entities such as specimens, molecular sequences, taxonomic names, observations, images, traits and publications. For example, the digitization and aggregation of specimen data from museums and herbaria allow us to view a scientist’s specimen collecting in conjunction with the whole corpus of their works. However, the metadata of these entities are also useful for validating data and integrating data across collections and institutional databases, and can be the basis of future research into biodiversity and science. In addition, the ability to reliably credit collectors for their work has the potential to change the incentive structure to promote improved curation and maintenance of natural history collections.

    Progress in authority management of people names for collections

    The concept of building a network of relationships between entities, a knowledge graph, is one of the most effective methods to understand the relations between data. By organizing data, we facilitate the discovery of complex patterns not otherwise evident in the raw data. Each datum at the nodes of a knowledge graph needs a persistent identifier (PID) to reference it unambiguously. In the biodiversity knowledge graph, people are key elements (Page 2016). They collect and identify specimens, they publish, observe, work with each other and they name organisms. Yet biodiversity informatics has been slow to adopt PIDs for people, and people are currently represented in collection management systems as text strings in various formats. These text strings often do not separate individuals within a collecting team, and little biographical information is collected to disambiguate collectors. In March 2019 we organised an international workshop to find solutions to the problem of PIDs for people in collections, with the aim of identifying people unambiguously across the world's natural history collections in all of their various roles. Stakeholders from 11 countries were represented, covering libraries, collections, publishers, developers and name registers. We want to identify people for many reasons. Cross-validation of information about a specimen with biographical information about its collector can be used to clean data. Mapping specimens from individual collectors across multiple herbaria can help geolocate specimens accurately. By linking literature to specimens through their authors and collectors we can create collaboration networks, leading to a much better understanding of the scientific contribution of collectors and their institutions. For taxonomists, it will be easier to identify nomenclatural type and syntype material, essential for reliable typification. Overall, it will mean that geographically dispersed specimens can be treated much more like a single distributed infrastructure of specimens, as is envisaged in the European Distributed Systems of Scientific Collections Infrastructure (DiSSCo). There are several person identifier systems in use. For example, the Virtual International Authority File (VIAF) is a widely used system for published authors. The International Standard Name Identifier (ISNI) has broader scope and incorporates VIAF. The ORCID identifier system provides self-registration for living researchers. Wikidata also has identifiers for people, which have the advantage of being easy to add to and correct. There are also national systems, such as the French and German authority files, and considerable sharing of identifiers, particularly on Wikidata. This creates an integrated network of identifiers that could act as a brokerage system. Attendees agreed that no one identifier system should be recommended; however, some are more appropriate for particular circumstances. Some difficulties still have to be resolved before these identifier schemes can be used for biodiversity: 1) duplicate entries in the same identifier system; 2) handling collector teams and preserving the order of collectors; 3) how we integrate identifiers with standards such as Darwin Core, ABCD and in the Global Biodiversity Information Facility; and 4) many living and dead collectors are only known from their specimens, so they may not pass the notability standards required by many authority systems.
The participants of the workshop are now working on a number of fronts to make progress on the adoption of PIDs for people in collections. This includes extending pilots that have already been trialled, working with identifier systems to make them more suitable for specimen collectors, and talking to service providers to encourage them to use ORCID iDs to identify their users. It was concluded that resolving the problem of person identifiers for collections is largely not a matter of lacking a solution, but of implementing solutions that already exist.
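As one concrete example of the brokerage role Wikidata can play, a collector name string can be resolved to candidate person identifiers (QIDs) via the public wbsearchentities API; this is a minimal sketch, and disambiguation of candidates against biographical data (dates, collecting localities) would still be required:

```python
import requests

def wikidata_candidates(name, limit=5):
    """Return candidate Wikidata items (QID, label, description) for a name string."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": "en",
            "type": "item",
            "limit": limit,
            "format": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in resp.json().get("search", [])
    ]

# Example lookup for a collector name string as it might appear on a label.
for qid, label, description in wikidata_candidates("George Forrest"):
    print(qid, label, "-", description)
```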

    Recommendations for interoperability among infrastructures

    The BiCIKL project is born from a vision that biodiversity data are most useful if they are presented as a network of data that can be integrated and viewed from different starting points. BiCIKL’s goal is to realise that vision by linking biodiversity data infrastructures, particularly for literature, molecular sequences, specimens, nomenclature and analytics. To make those links we need to better understand the existing infrastructures, their limitations, the nature of the data they hold, the services they provide and particularly how they can interoperate. In light of those aims, in the autumn of 2021, 74 people from the biodiversity data community engaged in a total of twelve hackathon topics with the aim of assessing the current state of interoperability between infrastructures holding biodiversity data. These topics examined interoperability from several angles. Some were research subjects that required interoperability to get results, some examined modalities of access and the use and implementation of standards, while others tested technologies and workflows to improve linkage of different data types. These topics, and the interoperability issues uncovered by the hackathon participants, inspired the following recommendations for infrastructures, relating to (1) the use of data brokers, (2) building communities and trust, (3) cloud computing as a collaborative tool, (4) standards and (5) multiple modalities of access:
    - If direct linking cannot be supported between infrastructures, explore using data brokers to store links.
    - Cooperate with open linkage brokers to provide a simple way to allow two-way links between infrastructures, without having to co-organize between many different organisations.
    - Facilitate and encourage the external reporting of issues related to their infrastructure and its interoperability.
    - Facilitate and encourage requests for new features related to their infrastructure and its interoperability.
    - Provide development roadmaps openly.
    - Provide a mechanism for anyone to ask for help.
    - Discuss issues in an open forum.
    - Provide cloud-based environments to allow external participants to contribute and test changes to features.
    - Consider the opportunities that cloud computing brings as a means to enable shared management of the infrastructure.
    - Promote the sharing of knowledge around big data technologies amongst partners, using cloud computing as a training environment.
    - Invest in standards compliance and work with standards organisations to develop new, and extend existing, standards.
    - Report on and review standards compliance within an infrastructure, with metrics that give credit for work on standards compliance and development.
    - Provide as many different modalities of access as possible.
    - Avoid requiring personal contacts to download data.
    - Provide a full description of an API and the data it serves.
    Finally, the hackathons were an ideal meeting opportunity to build, diversify and extend the BiCIKL community further, and to ensure the alignment of the community with a common vision on how best to link data from specimens, samples, sequences, taxonomic names and taxonomic literature.

    Open science tools: Supporting hands-on creation of the "digital extended specimen".

    As a biodiversity informatics community, we have mobilised and interconnected a wide array of information, including specimen collections, published literature and metadata resources, which compile facts about collections and the people that work with them. We have defined data standards to facilitate data interoperability and tool development. Along with colleagues in allied research disciplines, we have helped to develop training resources, enabling researchers to automate routine tasks like data access and reference management. We have also started to explore how we could realise the vision of the digital extended specimen, which would integrate specimens and associated data across multiple research infrastructures, allowing the investigation of wider-scale research questions. How this would be achieved is still the subject of discussion and experimentation: an open community will support a diverse range of approaches. In the construction of the digital extended specimen, we can envision useful activities operating at very different scales: from large-scale computational processes run at (or between) research infrastructures, to an ecosystem of lightweight tools that support link construction in context, closer to researchers. A toolset enabling in-context link construction could play a similar role to Open Refine (which has been effective at democratizing data linking between different sources), and supply valuable training data for the development of machine learning approaches. We will review the use of Open Refine in the biodiversity informatics (and wider research) communities and examine the resources and working practices that facilitated the adoption of this tool. We will showcase work towards “an extensible notebook for open science”, and we aim to open a discussion on how a link-aware editor for semi-structured data, plus standard open science tools (i.e., those covered by training resources such as software, data and author carpentry), could be viewed as a lightweight alternative to traditional document production, just as Open Refine is a viable alternative to many traditional spreadsheet use cases. Our aim is to enable researchers to develop the digital extended specimen at research time, but without being prescriptive about their workflow. We will conclude by discussing how this effort supports open science, showing how researchers share the data needed to explore their area of study and to form their hypotheses, using well recognised entities (specimens, names, people, institutions, citations etc.), represented in data standards and accessed via open APIs, but also how they organise their work. In the authors' own domain (botany), such a tool will be fundamental to e-taxonomic undertakings to build an online reference system in which all known plant species are described, as well as to significant acceleration of parts of the taxonomic process to address the biodiversity crisis.
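To illustrate the kind of in-context, API-driven linking described here, the sketch below resolves a scientific name string to a stable identifier using GBIF's public species match service; the choice of service and the fields shown are illustrative of the approach rather than part of the proposed tool:

```python
import requests

def match_name(name):
    """Resolve a scientific name string to a GBIF backbone identifier, if any."""
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name},
        timeout=10,
    )
    resp.raise_for_status()
    hit = resp.json()
    # matchType is NONE when the service found nothing usable.
    if hit.get("matchType") == "NONE":
        return None
    return {
        "input": name,
        "usageKey": hit.get("usageKey"),
        "scientificName": hit.get("scientificName"),
        "matchType": hit.get("matchType"),
    }

print(match_name("Quercus robur L."))
```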

    The Role of the OLS Program in the Development of echinopscis (an Extensible Notebook for Open Science on Specimens)

    Starting in early 2022, biodiversity informatics researchers at Kew have been developing echinopscis: an "extensible notebook for open science on specimens". This aims to build on the early experiments that our community conducted with "e-taxonomy": the development of tools and techniques to enable taxonomic research to be conducted online. Early e-taxonomic tools (e.g., Scratchpads; Smith et al. 2011) had to perform a wide range of functions, but in the past decade or so the move towards open science has built better support for generic functionality, such as reference management (Zotero) and document production (pandoc), skills development in automation and revision control to support reproducible science, as documented by the Turing Way (The Turing Way Community 2022), and an awareness of the importance of community building. We have developed echinopscis at Kew via a cross-departmental collaboration between researchers in biodiversity informatics and accelerated taxonomy. We have also benefitted from valuable input and advice from our many colleagues in associated projects and organisations around the world. OLS (originally Open Life Sciences) is a training and mentoring program for Open Science leaders with a focus on community building. The name was recently (2023) changed to the more generic "Open Seeds", whilst retaining the well-known acronym "OLS". OLS is a 16-week cohort-based mentoring program. Participants apply to join a cohort with a project that is developed through the 16 weeks. Each week of the syllabus alternates between time with a dedicated Open Science mentor and cohort calls, which are used to develop skills in project design, community building, open development & licencing, and inclusivity. Over 500 practitioners, experts and learners have participated across the seven completed cohorts of OLS' Open Seeds training and mentoring. Through this programme, over 300 researchers and open leaders from across six continents have designed, launched and supported 200 projects from different disciplines worldwide. The next cohort will run between September 2023 and January 2024, and will be the eighth iteration of the program. This talk will briefly outline the work that we have done to set up and experiment with echinopscis, but will focus on the impact that the OLS program has had on its development. We will also cover the use of techniques learned through OLS in other biodiversity informatics projects. OLS acknowledges that its program receives relatively few applications from project leads in biodiversity, and we hope that this talk will be informative for Biodiversity Information Standards (TDWG) participants and can be used to build productive links between these communities.

    Examining Herbarium Specimen Citation: Developing a literature-based institutional impact measure

    Herbarium specimens are critical components of the research process, providing "what, where, when" evidence for species distributions and, through type designation, providing the basis for unambiguous, standardised nomenclature facilitating the interpretation of scientific names. Specimen references are embedded within research article texts, by convention usually presented in a relatively formalised fashion. As this is a domain-specific practice, general publishers tend not to provide tools for detecting and tracking specimen references to enable bibliometric-style calculations and navigation to the referenced specimen, as is common practice in literature reference management. This means that it is difficult to measure impact, which affects both the individuals responsible for the collection and determination of herbarium specimens (McDade et al. 2011) and the institutions responsible for their long-term management. Specimen digitisation (creating searchable data repositories of metadata and/or images) has enabled many new and larger-scale uses for herbarium specimens and their associated data, and stimulated interest in quantifying usage and measuring institutional impact. To date, these impact measures have been conducted by examining usage statistics for specimen portals, or by text searching for specimen identifier patterns. This research uses text mining and document classification techniques to detect article sections likely to contain specimen references, which are then extracted, classified and counted. A dataset of taxonomic publications categorised into paragraph-level units is used to train a text classifier to predict the presence of specimen references within component units of articles (sections or paragraphs). The input to the classifier is a set of features derived from the text contents of paragraphs, which detect content such as latitude/longitude coordinates, dates and bracketed lists of herbarium codes. Article units classified as containing specimen references are processed to extract a minimal representation of each specimen reference, including the abbreviated codes for the institutional holder(s) of the specimen material. This allows total and per-institution counts to be calculated, which can be compared to datasets of Global Biodiversity Information Facility data citations, to institution-level type citations in nomenclatural acts recorded by the International Plant Names Index, and to usage statistics recorded by institutional data repositories. As well as counting specimen references, distinct specimen reference styles are detected and quantified, including the use of numeric and persistent identifiers (Güntsch et al. 2017) which can be used to access a standardised metadata record for the specimen. We will present an assessment of the classification and detection process and initial results, and discuss future work to extend this approach to different kinds of literature inputs. These techniques have the potential to allow institutions to make better use of existing information to help assess the use and impact of their specimen and data holdings.
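A minimal sketch of the classification step described above, using regex-derived features (coordinates, dates, bracketed herbarium codes, collector numbers) and a linear classifier; the tiny training set, labels and exact feature patterns are illustrative, not those used in this research:

```python
import re

import numpy as np
from sklearn.linear_model import LogisticRegression

# Simple presence/absence features over paragraph text.
FEATURES = {
    "has_coords": re.compile(r"\d{1,3}[°d]\s?\d{1,2}['′]?\s?[NSEW]"),
    "has_date": re.compile(r"\b\d{1,2}\s+[A-Z][a-z]{2,8}\s+\d{4}\b"),
    "has_herb_codes": re.compile(r"\(([A-Z]{1,5})(?:[,;]\s*[A-Z]{1,5})*\)"),
    "has_coll_number": re.compile(r"\b[A-Z][a-z]+\s+\d{2,5}\b"),
}

def featurise(paragraph):
    return [1.0 if rx.search(paragraph) else 0.0 for rx in FEATURES.values()]

# Tiny illustrative training set: 1 = paragraph contains specimen references.
paragraphs = [
    "BRAZIL. Bahia: 12°58'S 38°30'W, 5 Mar 1992, Carvalho 3841 (K, CEPEC).",
    "The new species differs in its longer petioles and acute leaf apex.",
]
labels = [1, 0]

clf = LogisticRegression().fit(np.array([featurise(p) for p in paragraphs]), labels)

# Predict on an unseen paragraph-level unit.
test = "PERU. Cusco: 13°31'S 71°58'W, 2 Jan 2001, Vargas 112 (USM)."
print(clf.predict([featurise(test)]))
```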

    Integrating Collector and Author Roles Across Specimen and Publication Datasets

    This work builds on the outputs of a collector data-mining exercise applied to GBIF-mobilised herbarium specimen metadata, which uses unsupervised learning (clustering) to identify collectors from minimal metadata associated with field-collected specimens (the Darwin Core terms recordedBy, eventDate and recordNumber). Here, we outline methods to integrate these data-mined collector entities (a large-scale dataset, aggregated from multiple sources, created programmatically) with a dataset of author entities from the International Plant Names Index (a smaller-scale, single-source dataset, created via editorial management). The integration process asserts a generic "scientist" entity with activities in different stages of the species description process: collecting and name publication. We present techniques to investigate specialisations, including content (taxa of study) and activity stages, examining whether individuals focus on collecting and/or name publication. Finally, we discuss generalisations of this initially herbarium-focussed data mining and record linkage process to enable applications in a wider context, particularly in zoological datasets.
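A minimal sketch of the record linkage step, matching data-mined collector name strings against author entities; the example names, the difflib similarity measure and the threshold are illustrative assumptions rather than the method used in this work:

```python
from difflib import SequenceMatcher

# Hypothetical inputs: collector name strings from clustering, and author records.
collectors = ["Hooker, J.D.", "Forrest, G.", "Kew expedition staff"]
authors = [
    {"id": "author-1", "name": "Hooker, Joseph Dalton"},
    {"id": "author-2", "name": "Forrest, George"},
]

def normalise(name):
    """Lower-case, drop punctuation, sort tokens so word order is ignored."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name)
    return " ".join(sorted(cleaned.lower().split()))

def best_match(collector, candidates, threshold=0.6):
    """Return the most similar candidate, or None below the threshold."""
    scored = [
        (SequenceMatcher(None, normalise(collector), normalise(c["name"])).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None

for collector in collectors:
    match = best_match(collector, authors)
    print(collector, "->", match["id"] if match else "no confident match")
```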

    Building Your Own Big Data Analysis Infrastructure for Biodiversity Science

    The size of biodiversity data sets, and the size of people’s questions around them, are outgrowing the capabilities of desktop applications, single computers, and single developers. Numerous articles in the corporate sector (Delgado 2016) have been written on how much time professionals spend manipulating and formatting large data sets compared to the time they spend on the important work of doing analysis and modeling. To efficiently move large research questions forward, the biodiversity domain needs to transition towards shared infrastructure with the goal of providing a mise en place for researchers to do research with large data. The GUODA (Global Unified Open Data Access) collaboration was formed to explore tools and use cases for this type of collaborative work on entire biodiversity data sets. Three key parts of that exploration have been: the software and hardware infrastructure needed to be able to work with hundreds of millions of records and terabytes of data quickly, removing the impediment of data formatting and preparation, and workflows centered around GitHub for interacting with peers in an open and collaborative manner. We will describe our experiences building an infrastructure based on Apache Mesos, Apache Spark, HDFS, Jupyter Notebooks, Jenkins, and GitHub. We will also enumerate what resources are needed to do things like join millions of records, visualize patterns in whole data sets like iDigBio and the Biodiversity Heritage Library, build graph structures of billions of nodes, analyze terabytes of images, and use natural language processing to explore gigabytes of text. In addition to the hardware and software, we will describe the kinds of skills needed by staff to design, build, and use this sort of infrastructure and highlight some experiences we have with training students. Our infrastructure is one of many that are possible. We hope that by showing the amount and type of work we have done to the wider community, other organizations can understand what they would need to speed up their research programs by developing their own collaborative computation and development environments.
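As a small illustration of the kind of whole-dataset work this infrastructure supports, the PySpark sketch below joins two large record sets and counts linkages; the HDFS paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("specimen-literature-join").getOrCreate()

# Hypothetical inputs: occurrence records and a name-to-page index from digitised literature.
occurrences = spark.read.option("header", True).csv("hdfs:///data/idigbio/occurrence.csv")
pages = spark.read.option("header", True).csv("hdfs:///data/bhl/name_page_index.csv")

# Join on the scientific name string, then count distinct pages per name:
# a simple cross-dataset linkage that only becomes practical at cluster scale.
linked = (
    occurrences.join(pages, on="scientificname", how="inner")
    .groupBy("scientificname")
    .agg(F.countDistinct("pageid").alias("page_count"))
    .orderBy(F.desc("page_count"))
)
linked.show(20, truncate=False)
```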