10 research outputs found

    A benchmark dataset of herbarium specimen images with label data

    More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons. To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.

    Progress in authority management of people names for collections

    The concept of building a network of relationships between entities, a knowledge graph, is one of the most effective methods to understand the relations between data. By organizing data, we facilitate the discovery of complex patterns not otherwise evident in the raw data. Each datum at the nodes of a knowledge graph needs a persistent identifier (PID) to reference it unambiguously. In the biodiversity knowledge graph, people are key elements (Page 2016). They collect and identify specimens, they publish, observe, work with each other and they name organisms. Yet biodiversity informatics has been slow to adopt PIDs for people and people are currently represented in collection management systems as text strings in various formats. These text strings often do not separate individuals within a collecting team and little biographical information is collected to disambiguate collectors. In March 2019 we organised an international workshop to find solutions to the problem of PIDs for people in collections with the aim of identifying people unambiguously across the world's natural history collections in all of their various roles. Stakeholders were represented from 11 countries, representing libraries, collections, publishers, developers and name registers. We want to identify people for many reasons. Cross-validation of information about a specimen with biographical information on the specimen can be used to clean data. Mapping specimens from individual collectors across multiple herbaria can geolocate specimens accurately. By linking literature to specimens through their authors and collectors we can create collaboration networks leading to a much better understanding of the scientific contribution of collectors and their institutions. For taxonomists, it will be easier to identify nomenclatural type and syntype material, essential for reliable typification. 
Overall, it will mean that geographically dispersed specimens can be treated much more like a single distributed infrastructure of specimens as is envisaged in the European Distributed Systems of Scientific Collections Infrastructure (DiSSCo). There are several person identifier systems in use. For example, the Virtual International Authority File (VIAF) is a widely used system for published authors. The International Standard Name Identifier (ISNI) has broader scope and incorporates VIAF. The ORCID identifier system provides self-registration of living researchers. Also, Wikidata has identifiers of people, which have the advantage of being easy to add to and correct. There are also national systems, such as the French and German authority files, and considerable sharing of identifiers, particularly on Wikidata. This creates an integrated network of identifiers that could act as a brokerage system. Attendees agreed that no one identifier system should be recommended; however, some are more appropriate for particular circumstances. Some difficulties still have to be resolved before those identifier schemes can be used for biodiversity: 1) duplicate entries in the same identifier system; 2) handling collector teams and preserving the order of collectors; 3) how we integrate identifiers with standards such as Darwin Core, ABCD and in the Global Biodiversity Information Facility; and 4) many living and dead collectors are only known from their specimens and so they may not pass notability standards required by many authority systems. The participants of the workshop are now working on a number of fronts to make progress on the adoption of PIDs for people in collections. This includes extending pilots that have already been trialed, working with identifier systems to make them more suitable for specimen collectors and talking to service providers to encourage them to use ORCID iDs to identify their users. 
The workshop concluded that the problem of person identifiers for collections stems largely not from a lack of solutions, but from the need to implement solutions that already exist.
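One of the difficulties listed in this abstract, handling collector teams while preserving the order of collectors, can be illustrated with a minimal Python sketch. It is not a method described by the workshop: the separator set, the function names `split_team` and `attach_ids`, and the `example.org` identifier are all assumptions made for illustration. Commas are deliberately not treated as separators, because they also occur inside "Surname, Initials" forms.

```python
import re

# Assumed team separators: semicolon, pipe, ampersand, or the word "and".
# Commas are NOT separators here, since "Smith, J." is a single person.
_SEPARATORS = re.compile(r"\s*(?:;|\||&|\band\b)\s*", flags=re.IGNORECASE)


def split_team(team):
    """Split a collector-team string into individual names, keeping order."""
    return [name.strip() for name in _SEPARATORS.split(team) if name.strip()]


def attach_ids(names, known_ids):
    """Pair each name with its position in the team and an identifier, if known.

    `known_ids` is a hypothetical lookup table from name string to a
    persistent identifier; names without an entry get None.
    """
    return [
        {"position": i, "name": name, "id": known_ids.get(name)}
        for i, name in enumerate(names, start=1)
    ]


team = split_team("Smith, J.; Dupont, A. & Meyer, K.")
records = attach_ids(team, {"Dupont, A.": "https://example.org/person/1"})
```

Storing an explicit `position` alongside each name keeps the original collector order recoverable even if the records are later sorted or merged.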

    People of Collections: Facilitators of Interoperability?

    In March 2019, the Muséum national d’histoire naturelle, Paris (MNHN) launched the datapoc.mnhn.fr project, funded by the French research infrastructures CollEX-Persée and E-recolnat. This proof of concept was conceived and is supported by a group of partners from different communities working at the Muséum (specimen collection curators, librarians, researchers, data scientists, publishers). The team's initial motivation for getting together was to find a way to link the massive data produced and preserved in the heterogeneous institutional collection databases and repositories of the Muséum in order to improve global access and visibility for the benefit of end-users as well as data curation processes. After a year of sharing and deliberating, the group concluded that focusing on people’s names and identification could be a promising way to explore interoperability and alignment solutions in order to match data hosted in the different systems. The project thus has two main goals: first, to improve biodiversity and taxonomic data quality for the qualification of personal identities, publications and scientific names by resolving frequent ambiguities and issues in the assignment of people’s names; second, to develop and assess machine-driven linking strategies between specimen and authorship metadata and resources derived from various institutional data silos of interest to the research community. In order to test this idea and to experiment with innovative data computing and visualization technologies, all parties involved in the project agreed to develop a proof of concept focused on a dataset of 500 names of major MNHN naturalists from its foundation to the present day. This proof of concept will consist of building a structured authority file for people's names, which could be shared by all services producing and using biodiversity data at MNHN, and reused as open data by external stakeholders and international partners. 
This structured file will strengthen data and database production and maintenance workflows, but could also help improve the quality of the end-user experience by allowing individuals or machines to match, link or otherwise compute and analyse data that is still difficult to handle because of the diversity of IT applications and limited standardisation practices. It is key to the project that this structured file should somehow comply with international interoperability and semantic web standards so as to facilitate global access and data exchanges with similar institutions around the world. Linked datasets and related resources derived from this work will be displayed on a public website designed for researchers as well as for the public via diverse applications and formats (API, RDF). The project will run from April 2019 to April 2020, led by the core team of partners who initiated it, with the support of a private IT and data computing service called Logilab. Some of the challenges of this project include finding an efficient way to build the structured file and then succeeding in aligning and disambiguating names already present in existing databases. One way to approach this issue is to compare and consolidate MNHN biodiversity datasets with external repositories by using people identifier systems such as ISNI, VIAF and IdREF, which are already familiar to libraries, archives and other cultural institutions. How can those various people identifier systems be used to parse MNHN "people of collections" and help disambiguate them? Is there a particular people identifier system which will prove to be most relevant for all types of collections? Which parsing method will give the best results, and how could it scale up and possibly be reused by other institutions or even future European taxonomic infrastructures? Those are some of the questions the MNHN team is eager to address, share and discuss at the Biodiversity Next Symposium.

    TreePics: visualizing trees with pictures

    While many programs are available to edit phylogenetic trees, associating pictures with branch tips in an efficient and automatic way is not an available option. Here, we present TreePics, a standalone program that uses a web browser to visualize phylogenetic trees in Newick format and that associates pictures (typically, pictures of the voucher specimens) with the tip of each branch. Pictures are visualized as thumbnails and can be enlarged by a mouse rollover. Further, several pictures can be selected and displayed in a separate window for visual comparison. TreePics works either online or in a full standalone version, where it can display trees with several thousand pictures (depending on the memory available). We argue that TreePics can be particularly useful in a preliminary stage of research, such as to quickly detect conflicts between a DNA-based phylogenetic tree and morphological variation, which may be due to contamination that needs to be removed prior to final analyses, or to the presence of species complexes.
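The idea of associating pictures with the tips of a Newick tree can be sketched in a few lines of Python. This is not TreePics' own code. It assumes unquoted tip labels, a tree with at least two tips, and picture files named after the tip label (for example `A.jpg`); the function names `tip_labels` and `match_pictures` are hypothetical.

```python
import re
from pathlib import PurePath

# In unquoted Newick, a tip label directly follows "(" or ",".
# Labels that follow ")" belong to internal nodes and are skipped.
_TIP = re.compile(r"[(,]\s*([^(),:;]+)")


def tip_labels(newick):
    """Return tip labels of an unquoted Newick string, in order of appearance."""
    return [label.strip() for label in _TIP.findall(newick) if label.strip()]


def match_pictures(tips, picture_files):
    """Map each tip label to the picture file whose stem equals the label.

    Tips without a matching file map to None, which makes missing
    vouchers easy to spot.
    """
    by_stem = {PurePath(f).stem: f for f in picture_files}
    return {tip: by_stem.get(tip) for tip in tips}


tips = tip_labels("((A:0.1,B:0.2):0.3,C:0.4);")
pictures = match_pictures(tips, ["A.jpg", "specimens/C.png"])
```

Matching on the file stem keeps the association automatic, at the cost of requiring that file names and tip labels agree exactly.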

    L’herbier virtuel A. de Saint-Hilaire, un nouvel outil évolutif pour étudier la botanique du Brésil [The A. de Saint-Hilaire virtual herbarium, a new evolving tool for studying the botany of Brazil].

    The new Franco-Brazilian website “Saint-Hilaire virtual herbarium” offers dynamic online consultation of all specimens and manuscripts of the naturalist Auguste de Saint-Hilaire, providing links between specimen images and associated textual data, including notes available in his field books. This tool aims to facilitate work in taxonomy and systematic botany and to allow a more accurate reconstruction of the routes and time frame of Saint-Hilaire's exploration. All specimens are being digitized by the Paris herbarium (P) and added online. The system will also offer Saint-Hilaire's major publications online. The nomenclature and determinations are automatically updated through dynamic links to the SONNERAT/MNHN database. In this paper, we also propose a standard for the correct citation of Saint-Hilaire specimens.

    Standardised Globally Unique Specimen Identifiers

    A simple, permanent and reliable specimen identifier system is needed to take the informatics of collections into a new era of interoperability. A system of identifiers based on HTTP URI (Uniform Resource Identifiers), endorsed by the Consortium of European Taxonomic Facilities (CETAF), has now been rolled out to 14 member organisations (Güntsch et al. 2017). CETAF Identifiers have a Linked Open Data redirection mechanism for both human- and machine-readable access and, if fully implemented, provide Resource Description Framework (RDF)-encoded specimen data following best practices continuously improved by members of the initiative. To date, more than 20 million physical collection objects have been equipped with CETAF Identifiers (Groom et al. 2017). To facilitate the implementation of stable identifiers, simple redirection scripts and guidelines for deciding on the local identifier syntax have been compiled (http://cetafidentifiers.biowikifarm.net/wiki/Main_Page). Furthermore, a capable "CETAF Specimen URI Tester" (http://herbal.rbge.info/) provides an easy-to-use service for testing whether the existing identifiers are operational. For the usability and potential of any identifier system associated with evolving data objects, active links to the source information are critically important. This is particularly true for natural history collections facing the next wave of industrialised mass digitisation, where specimens come online with only basic, but rapidly evolving label data. Specimen identifier systems must therefore have components for monitoring the availability and correct implementation of individual data objects. Our next implementation steps will involve the development of a "Semantic Specimen Catalogue", which has a list of all existing specimen identifiers together with the latest RDF metadata snapshot. The catalogue will be used for semantic inference across collections as well as the basis for periodic testing of identifiers.
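A redirection mechanism that serves both human- and machine-readable access is commonly built on HTTP content negotiation. The Python sketch below shows one possible way a redirection script could pick a target from the `Accept` header. It is not taken from the CETAF scripts: the function names, the set of RDF media types and the `example.org` URLs are assumptions, and the HTTP status code used for the redirect is left to the surrounding server.

```python
# Assumed media types; a real deployment may support more or fewer.
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_target(accept_header, html_url, rdf_url):
    """Pick the redirect target for a specimen URI; default to the HTML page."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in RDF_TYPES:
            return rdf_url
        if media_type in HTML_TYPES:
            return html_url
    return html_url


# Hypothetical targets for one specimen identifier.
HTML = "https://example.org/specimen/123.html"
RDF = "https://example.org/specimen/123.rdf"
target = choose_target("application/rdf+xml", HTML, RDF)
```

Defaulting to the HTML page means that clients sending no `Accept` header, or only wildcards, still land on a human-readable record.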
