    Location, location, location: utilizing pipelines and services to more effectively georeference the world's biodiversity data

    Background: Increasing the quantity and quality of data is a key goal of biodiversity informatics, leading to increased fitness for use in scientific research and beyond. This goal is impeded by a legacy of geographic locality descriptions associated with biodiversity records that are often heterogeneous and not in a map-ready format. The biodiversity informatics community has developed best practices and tools that provide the means to do retrospective georeferencing (e.g., the BioGeomancer toolkit), a process that converts heterogeneous descriptions into geographic coordinates and a measurement of spatial uncertainty. Even with these methods and tools, data publishers are faced with the immensely time-consuming task of vetting georeferenced localities. Furthermore, it is likely that overlap in georeferencing effort is occurring across data publishers. Solutions are needed that help publishers more effectively georeference their records, verify their quality, and eliminate the duplication of effort across publishers. Results: We have developed a tool called BioGeoBIF, which incorporates the high-throughput and standardized georeferencing methods of BioGeomancer into a beginning-to-end workflow. Custodians who publish their data to the Global Biodiversity Information Facility (GBIF) can use this system to improve the quantity and quality of their georeferences. BioGeoBIF harvests records directly from the publishers' access points, georeferences the records using the BioGeomancer web service, and makes results available to data managers for inclusion at the source. Using a web-based, password-protected, group management system for each data publisher, we leave data ownership, management, and vetting responsibilities with the managers and collaborators of each data set. We also minimize the georeferencing task by combining and storing unique textual localities from all registered data access points, and dynamically linking that information to the password-protected record information for each publisher. Conclusion: We have developed one of the first examples of services that can help create higher quality data for publishers mediated through the Global Biodiversity Information Facility and its data portal. This service is one step towards solving many problems of data quality in the growing field of biodiversity informatics. We envision future improvements to our service that include faster return of results and inclusion of more georeferencing engines.
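    The idea of storing each unique textual locality once and linking it back to every publisher's records can be illustrated with a minimal sketch. This is not BioGeoBIF's implementation: the function names, the sample records, and the normalization rule (lower-casing and collapsing whitespace) are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(locality: str) -> str:
    """Collapse case and whitespace so trivially different strings match (assumed rule)."""
    return " ".join(locality.lower().split())


def group_unique_localities(records):
    """Map each normalized locality string to the (publisher, record_id) pairs that use it."""
    index = defaultdict(list)
    for publisher, record_id, locality in records:
        index[normalize(locality)].append((publisher, record_id))
    return dict(index)


# Hypothetical records from two publishers; two of them share one locality.
records = [
    ("museum_a", "A-1", "5 km N of  Springfield"),
    ("museum_b", "B-7", "5 km n of Springfield"),
    ("museum_b", "B-8", "Lake Tahoe, south shore"),
]
unique = group_unique_localities(records)
print(len(unique), "unique localities for", len(records), "records")
```

    Each unique string would then need to be georeferenced only once, with the result shared by every linked record.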

    LOCALITY UNCERTAINTY AND THE DIFFERENTIAL PERFORMANCE OF FOUR COMMON NICHE-BASED MODELING TECHNIQUES

    We address a poorly understood aspect of ecological niche modeling: its sensitivity to different levels of geographic uncertainty in organism occurrence data. Our primary interest was to assess how accuracy degrades under increasing uncertainty, with performance measured indirectly through model consistency. We used Monte Carlo simulations and a similarity measure to assess model sensitivity across three variables: locality accuracy, niche modeling method, and species. Randomly generated data sets with known levels of locality uncertainty were compared to an original prediction using Fuzzy Kappa. Data sets where locality uncertainty is low were expected to produce similar distribution maps to the original. In contrast, data sets where locality uncertainty is high were expected to produce less similar maps. BIOCLIM, DOMAIN, Maxent and GARP were used to predict the distributions for 1200 simulated data sets (3 species × 4 buffer sizes × 100 randomized data sets). Thus, our experimental design produced a total of 4800 similarity measures, with each of the simulated distributions compared to the prediction of the original data set and corresponding modeling method. We performed a general linear model (GLM) analysis, which enabled us to simultaneously measure the effect of buffer size, modeling method, and species, as well as interactions among all variables. Our results show that modeling method has the largest effect on similarity scores and uniquely accounts for 40% of the total variance in the model. The second most important factor was buffer size, but it uniquely accounts for only 3% of the variation in the model. The newer and currently more popular methods, GARP and Maxent, were shown to produce more inconsistent predictions than the earlier and simpler methods, BIOCLIM and DOMAIN. Understanding the performance of different niche modeling methods under varying levels of geographic uncertainty is an important step toward more productive applications of historical biodiversity collections.
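    The size of the experimental design can be checked by enumerating it, as in the sketch below. The species and buffer labels are placeholders, since the abstract names neither. The `perturb` helper is an assumed way of simulating locality uncertainty (a uniform random displacement within a buffer radius on a plane), not necessarily the authors' procedure.

```python
import itertools
import math
import random

METHODS = ["BIOCLIM", "DOMAIN", "Maxent", "GARP"]
SPECIES = ["species_1", "species_2", "species_3"]            # placeholder names
BUFFERS = ["buffer_1", "buffer_2", "buffer_3", "buffer_4"]   # sizes not given in the abstract
REPLICATES = 100

# One simulated data set per (species, buffer size, replicate) combination.
datasets = list(itertools.product(SPECIES, BUFFERS, range(REPLICATES)))
# Each data set is modeled with every method and compared to the original prediction.
comparisons = list(itertools.product(datasets, METHODS))


def perturb(x, y, radius, rng):
    """Move a point to a uniformly random position within `radius` (planar units, assumed)."""
    r = radius * math.sqrt(rng.random())  # sqrt gives uniform density over the disc
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return x + r * math.cos(theta), y + r * math.sin(theta)


rng = random.Random(0)
px, py = perturb(0.0, 0.0, 10.0, rng)
print(len(datasets), len(comparisons))
```

    The two counts correspond to the 1200 simulated data sets and 4800 similarity measures reported in the abstract.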

    Developing Global Maps of the Dominant Anopheles Vectors of Human Malaria

    Simon Hay and colleagues describe how the Malaria Atlas Project has collated anopheline occurrence data to map the geographic distributions of the dominant mosquito vectors of human malaria.

    Automated Georeferencing of Antarctic Species

    Many text documents in the biological domain contain references to the toponym of specific phenomena (e.g. species sightings) in natural language form: "In Garwood Valley summer activity was 0.2% for Umbilicaria aprina and 1.7% for Caloplaca sp. ..." While methods have been developed to extract place names from documents, and attention has been given to the interpretation of spatial prepositions, the ability to connect toponym mentions in text with the phenomena to which they refer (in this case species) has been given limited attention, but would be of considerable benefit for the task of mapping specific phenomena mentioned in text documents. As part of work to create a pipeline to automate georeferencing of species within legacy documents, this paper proposes a method to: (1) recognise species and toponyms within text and (2) match each species mention to the relevant toponym mention. Our methods find significant promise in a bespoke rules- and dictionary-based approach to recognise species within text (F1 scores up to 0.87 including partial matches) but less success, as yet, recognising toponyms using multiple gazetteers combined with an off-the-shelf natural language processing tool (F1 up to 0.62). Most importantly, we offer a contribution to the relatively nascent area of matching toponym references to the object they locate (in our case species), including cases in which the toponym and species are in different sentences. We use tree-based models to achieve precision as high as 0.88 or an F1 score up to 0.68 depending on the downsampling rate. Initial results outperform previous research on detecting entity relationships that may cross sentence boundaries within biomedical text, and differ from previous work in specifically addressing species mapping.
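    For readers unfamiliar with the metrics quoted above, the standard definitions of precision, recall and F1 can be sketched as follows. The counts in the example are invented for illustration and are not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct species-toponym matches, 2 spurious, 2 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f1, 2))
```

    Because F1 is a harmonic mean, a high precision paired with a low recall yields a much lower F1, which is why the abstract reports precision and F1 separately.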

    Uncertainty matters: ascertaining where specimens in natural history collections come from and its implications for predicting species distributions

    Natural history collections (NHCs) represent an enormous and largely untapped wealth of information on the Earth's biota, made available through GBIF as digital preserved specimen records. Precise knowledge of where the specimens were collected is paramount to rigorous ecological studies, especially in the field of species distribution modelling. Here, we present a first comprehensive analysis of georeferencing quality for all preserved specimen records served by GBIF, and illustrate the impact that coordinate uncertainty may have on predicted potential distributions. We used all GBIF preserved specimen records to analyse the availability of coordinates and associated spatial uncertainty across geography, spatial resolution, taxonomy, publishing institutions and collection time. We used three plant species across their native ranges in different parts of the world to show the impact of uncertainty on predicted potential distributions. We found that 38% of the 180+ million records provide coordinates only and 18% provide both coordinates and uncertainty. Georeferencing quality is determined more by country of collection and publishing than by taxonomic group. Distinct georeferencing practices are more determinant than implicit characteristics and georeferencing difficulty of specimens. Availability and quality of records contrast across world regions. Uncertainty values are not normally distributed but peak at very distinct values, which can be traced back to specific regions of the world. Uncertainty leads to a wide spectrum of range sizes when modelling species distributions, potentially affecting conclusions in biogeographical and climate change studies. In summary, the digitised fraction of the world's NHCs is far from optimal in terms of georeferencing, and quality mainly depends on where the collections are hosted. A collective effort between communities around NHC institutions, ecological research and data infrastructure is needed to bring the data on a par with its importance and relevance for ecological research.