
    Toward a service-based workflow for automated information extraction from herbarium specimens

    Over the past few years, herbarium collections worldwide have started to digitize millions of specimens on an industrial scale. Although imaging costs are steadily falling, capturing the accompanying label information is still predominantly done manually and is becoming the principal cost factor. In order to streamline the process of capturing herbarium specimen metadata, we specified a formal, extensible workflow integrating a wide range of automated specimen image analysis services. We implemented the workflow on the basis of OpenRefine, together with a plugin for handling service calls and responses. The evolving system presently covers the generation of optical character recognition (OCR) output from specimen images, the identification of regions of interest in images, and the extraction of meaningful information items from OCR output. These implementations were developed as part of the Deutsche Forschungsgemeinschaft-funded project "A Standardised and Optimised Process for Data Acquisition from Digital Images of Herbarium Specimens" (StanDAP-Herb).
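    The abstract describes a pipeline of independent image-analysis services (OCR, region-of-interest detection, information extraction) orchestrated from OpenRefine. As a rough illustration only, the sketch below chains two such services over HTTP in Python; the endpoints, payloads, and response fields are hypothetical placeholders, not the actual StanDAP-Herb or OpenRefine plugin APIs.

```python
# Minimal sketch of chaining specimen-image analysis services.
# All URLs and response fields below are hypothetical placeholders.
import requests

OCR_SERVICE = "https://example.org/ocr"          # hypothetical OCR endpoint
EXTRACT_SERVICE = "https://example.org/extract"  # hypothetical extraction endpoint

def process_specimen(image_path: str) -> dict:
    # Step 1: submit the specimen image and receive raw OCR text.
    with open(image_path, "rb") as f:
        ocr = requests.post(OCR_SERVICE, files={"image": f}).json()
    # Step 2: pass the OCR text on for extraction of structured label fields,
    # e.g. {"collector": ..., "country": ..., "date": ...}.
    return requests.post(EXTRACT_SERVICE, json={"text": ocr["text"]}).json()

print(process_specimen("specimen_0001.jpg"))  # illustrative file name
```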

    Semi‐automated workflows for acquiring specimen data from label images in herbarium collections

    Computational workflow environments are an active area of computer science and informatics research; they promise to be effective for automating biological information processing, increasing research efficiency and impact. In this project, semi‐automated data processing workflows were developed to test the efficiency of computerizing information contained in herbarium plant specimen labels. Our test sample consisted of Mexican and Central American plant specimens held in the University of Michigan Herbarium (MICH). The initial data acquisition process consisted of two parts: (1) the capture of digital images of specimen labels and of full‐specimen herbarium sheets, and (2) creation of a minimal field database, or "pre‐catalog", of records that contain only the information necessary to uniquely identify specimens. For entering "pre‐catalog" data, two methods were tested: key‐stroking the information (a) from the specimen labels directly, or (b) from digital images of specimen labels. In a second step, locality and latitude/longitude data fields were filled in if the values were present on the labels or images. If values were not available, geo‐coordinates were assigned based on further analysis of the descriptive locality information on the label. Time and effort for the various steps were measured and recorded. Our analysis demonstrates a clear efficiency benefit of articulating a biological specimen data acquisition workflow into discrete steps, which in turn can be individually optimized. First, we separated the step of capturing data from the specimen from most keystroke data entry tasks. We did this by capturing a digital image of the specimen for the first step, and by limiting initial key‐stroking of data to create only a minimal "pre‐catalog" database for the later tasks. By doing this, specimen handling logistics were streamlined to minimize staff time and cost. Second, by then obtaining most of the specimen data from the label images, the more intellectually challenging task of label data interpretation could be moved electronically out of the herbarium to the location of more highly trained specialists for greater efficiency and accuracy. This project used experts in the plants' country of origin, Mexico, to verify localities and geography and to derive geo‐coordinates. Third, with careful choice of data fields for the "pre‐catalog" database, specimen image files linked to the minimal tracking records could be sorted by collector and date of collection to minimize key‐stroking of redundant data in a continuous series of labels, resulting in improved data entry efficiency and data quality.
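    The third finding above rests on a simple data-structure idea: a minimal tracking record per specimen, sorted so that runs of near-identical labels can be keyed once. The sketch below illustrates that idea in Python; the field names are illustrative, not the project's actual schema.

```python
# A sketch of the minimal "pre-catalog" record idea: capture only what is
# needed to uniquely identify a specimen, then sort by collector and date so
# that consecutive labels from one collecting trip share most of their data.
from dataclasses import dataclass

@dataclass
class PreCatalogRecord:
    barcode: str          # unique specimen identifier
    image_file: str       # linked label/sheet image
    collector: str
    collection_date: str  # ISO 8601, e.g. "1998-07-14"

def sort_for_entry(records: list[PreCatalogRecord]) -> list[PreCatalogRecord]:
    # Sorting minimizes redundant keystrokes during later transcription,
    # since adjacent records then differ in only a few fields.
    return sorted(records, key=lambda r: (r.collector, r.collection_date))
```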

    Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections [Version 1]

    We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen, including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on some of the state-of-the-art technologies.

    Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing, as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing, Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images.

    Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text.

    Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library, to map the named entities to entities in the biodiversity literature (https://www.biodiversitylibrary.org/docs/api3.html).

    We have highlighted the main recommendations for potential pipeline components. The document also provides guidance on selecting appropriate software solutions, including automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
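    The core of the pipeline described above is OCR on a segmented label image followed by NER over the recovered text. A minimal sketch of that two-step combination, using Tesseract (via pytesseract) and spaCy's small English model, is shown below; the file name is illustrative, and this is not the authors' exact implementation.

```python
# Sketch of the label-to-entities pipeline: OCR a pre-segmented label image,
# then run named entity recognition to pull out person and location names.
# Assumes pytesseract and spaCy (with en_core_web_sm) are installed.
import pytesseract
from PIL import Image
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_label_entities(label_image: str) -> dict:
    # OCR only the cropped, text-bearing label region, not the whole sheet,
    # to avoid false positives from non-textual specimen material.
    text = pytesseract.image_to_string(Image.open(label_image))
    doc = nlp(text)
    return {
        "persons": [e.text for e in doc.ents if e.label_ == "PERSON"],
        "locations": [e.text for e in doc.ents if e.label_ in ("GPE", "LOC")],
        "raw_text": text,
    }

print(extract_label_entities("segmented_label.png"))  # illustrative file name
```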

    Landscape Analysis for the Specimen Data Refinery

    This report reviews the current state-of-the-art applied approaches to automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems, and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development teams and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens.

    Categorization of species as native or nonnative using DNA sequence signatures without a complete reference library.

    New genetic diagnostic approaches have greatly aided efforts to document global biodiversity and improve biosecurity. This is especially true for organismal groups in which species diversity has been underestimated historically due to difficulties associated with sampling, the lack of clear morphological characteristics, and/or limited availability of taxonomic expertise. Among these methods, DNA sequence barcoding (also known as "DNA barcoding") and, by extension, meta-barcoding for biological communities has emerged as one of the most frequently utilized methods for DNA-based species identifications. Unfortunately, the use of DNA barcoding is limited by the availability of complete reference libraries (i.e., a collection of DNA sequences from morphologically identified species) and by the fact that the vast majority of species do not have sequences present in reference databases. These limitations are especially critical in tropical locations that are simultaneously biodiversity rich and lacking in exploration and DNA characterization by trained taxonomic specialists. To facilitate efforts to document biodiversity in regions lacking complete reference libraries, we developed a novel statistical approach that categorizes unidentified species as being either likely native or likely nonnative based solely on measures of nucleotide diversity. We demonstrate the utility of this approach by categorizing a large sample of specimens of terrestrial insects and spiders (collected as part of the Moorea BioCode project) using a generalized linear mixed model (GLMM). Using a training data set of known endemic (n = 45) and known introduced species (n = 102), we then estimated the likely native/nonnative status for 4,663 specimens representing an estimated 1,288 species (412 identified species), including specimens that were unidentified or whose endemic/introduced status was uncertain. Using this approach, we were able to increase the number of categorized specimens by a factor of 4.4 (from 794 to 3,497), and the number of categorized species by a factor of 4.8 (from 147 to 707), at a rate much greater than chance (77.6% accuracy). The study identifies phylogenetic signatures of both native and nonnative species and suggests several practical applications for this approach, including monitoring biodiversity and facilitating biosecurity.
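    The statistical idea above is a binomial model predicting native/nonnative status from nucleotide-diversity measures. The paper fits a GLMM; as a simplification for illustration, the sketch below fits a plain binomial GLM in Python with statsmodels, using invented toy numbers and illustrative column names.

```python
# Simplified sketch: predict native (1) vs. nonnative (0) status from a
# nucleotide-diversity measure, then classify an unidentified specimen.
# Toy data only; the paper uses a GLMM rather than this plain GLM.
import pandas as pd
import statsmodels.api as sm

train = pd.DataFrame({
    "pi": [0.021, 0.003, 0.018, 0.002, 0.008, 0.015],  # diversity (invented)
    "native": [1, 0, 1, 0, 1, 0],                      # 1 = endemic, 0 = introduced
})
X = sm.add_constant(train[["pi"]])
model = sm.GLM(train["native"], X, family=sm.families.Binomial()).fit()

# Classify an unidentified specimen from its diversity measure alone.
new = sm.add_constant(pd.DataFrame({"pi": [0.017]}), has_constant="add")
print(model.predict(new))  # estimated probability the specimen is native
```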

    Simple identification tools in FishBase

    Simple identification tools for fish species were included in the FishBase information system from its inception. Early tools made use of the relational model and characters like fin ray meristics. Soon pictures and drawings were added as a further help, similar to a field guide. Later came the computerization of existing dichotomous keys, again in combination with pictures and other information, and the ability to restrict possible species by country, area, or taxonomic group. Today, www.FishBase.org offers four different ways to identify species. This paper describes these tools with their advantages and disadvantages, and suggests various options for further development. It explores the possibility of a holistic and integrated computer-aided strategy.
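    A computerized dichotomous key of the kind mentioned above is, at heart, a binary decision tree walked by user answers. The toy sketch below shows that structure in Python; the characters and species names are invented for illustration and are not FishBase's actual keys.

```python
# A toy computerized dichotomous key: each node poses a yes/no question and
# branches until a leaf (a species name) is reached. All content invented.
key = {
    "question": "Dorsal fin rays > 10?",
    "yes": {
        "question": "Body strongly compressed?",
        "yes": "Species A",
        "no": "Species B",
    },
    "no": "Species C",
}

def run_key(node, answer_fn):
    # Walk the tree, asking each question until a leaf is reached.
    while isinstance(node, dict):
        node = node["yes"] if answer_fn(node["question"]) else node["no"]
    return node

print(run_key(key, lambda q: input(q + " [y/n] ").strip().lower() == "y"))
```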

    A benchmark dataset of herbarium specimen images with label data

    More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons. To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.
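    One common way such a benchmark is used is to score a transcription system against the ground-truth label data with character error rate (CER). The sketch below shows a self-contained Python implementation using a standard Levenshtein distance; the example strings are invented, and the dataset's actual file layout is not assumed.

```python
# Score a predicted transcription against a benchmark ground truth via CER.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(predicted: str, ground_truth: str) -> float:
    # Character error rate: edit distance normalized by reference length.
    return levenshtein(predicted, ground_truth) / max(len(ground_truth), 1)

print(cer("Herbarium of Micigan", "Herbarium of Michigan"))  # ~0.048
```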

    Repositories for Taxonomic Data: Where We Are and What is Missing

    Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000–20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4,113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2,661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term, ideally perpetual, data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasi-facts for most biological disciplines, for alpha-taxonomy they remain hypotheses pertaining to the evolutionary relatedness of individuals. For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach: linking data via unique specimen identifiers and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges, to adapt the existing infrastructure of data centers to a specimen-centered concept, and quantitative challenges, to host and connect an estimated ≤2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000–40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.]
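    The "cyberspecimen" proposal above amounts to a record type keyed by one stable specimen identifier that links all derived data. A minimal Python sketch of that linking idea is shown below; the field names and values are illustrative placeholders, not a published standard.

```python
# A minimal sketch of the cyberspecimen linking idea: one stable specimen
# identifier ties together images, sequences, and publications so the pieces
# stay findable and reusable. All identifiers below are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Cyberspecimen:
    specimen_id: str                                        # stable unique identifier
    images: list[str] = field(default_factory=list)         # image URLs or DOIs
    sequences: list[str] = field(default_factory=list)      # sequence accessions
    publications: list[str] = field(default_factory=list)   # DOIs citing the specimen

record = Cyberspecimen(
    specimen_id="https://coll.example.org/specimen/B100123456",  # hypothetical
    images=["https://example.org/img/B100123456.jpg"],
    sequences=["MK123456"],
    publications=["10.1000/example.doi"],
)
```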