9 research outputs found

    ENM2020: A free online course and set of resources on modeling species niches and distributions

    The field of distributional ecology has seen considerable recent attention, particularly surrounding the theory, protocols, and tools for Ecological Niche Modeling (ENM) or Species Distribution Modeling (SDM). Such analyses have grown steadily over the past two decades, including a maturation of relevant theory and key concepts, but methodological consensus has yet to be reached. In response, and following an online course taught in Spanish in 2018, we designed a comprehensive English-language course covering much of the underlying theory and methods currently applied in this broad field. Here, we summarize that course, ENM2020, and provide links by which the resources produced for it can be accessed into the future. ENM2020 lasted 43 weeks, with presentations from 52 instructors, who engaged with >2500 participants globally through >14,000 hours of viewing and >90,000 views of instructional video and question-and-answer sessions. Each major topic was introduced by an "Overview" talk, followed by more detailed lectures on subtopics. The hierarchical and modular format of the course permits updates, corrections, or alternative viewpoints, and generally facilitates revision and reuse, including the use of only the Overview lectures for introductory courses. All course materials are free and openly accessible (CC-BY license) to ensure these resources remain available to all interested in distributional ecology.

    Integrating data-cleaning with data analysis to enhance usability of biodiversity big-data

    Biodiversity big-data (BBD) has the potential to provide answers to some unresolved questions, at spatial and taxonomic swathes that were previously inaccessible. However, BBDs contain serious errors and biases. Therefore, any study that uses a BBD should ask whether data quality is sufficient to provide a reliable answer to the research question. We propose that the question of data quality and the research question be addressed simultaneously, by binding data-cleaning to data analysis. The change in signal between the pre- and post-cleaning phases, in addition to the signal itself, can be used to evaluate the findings, their implications, and their robustness. This approach includes five steps:
    1. Download raw occurrence data from a BBD.
    2. Analyze the data (statistical and/or simulation modeling) to answer the research question, using the raw data after the necessary basic cleaning; this step resembles common practice.
    3. Clean the data comprehensively.
    4. Repeat the analysis using the cleaned data.
    5. Compare the results of steps 2 and 4 (i.e., before and after data-cleaning); this comparison addresses the issue of data quality as well as the research question itself.
    The results of step 2 alone may be misleading, due to the error and bias in the data. Even the results of step 4 may not be trustworthy, since data-cleaning is never complete and some of the error and much of the bias remain in the data. However, the changes in the results before and after cleaning are important keys to answering the research question. If the cleaned data reveal a stronger and clearer signal than the raw data, then the signal is most likely trustworthy and the respective hypothesis is supported. Conversely, if the cleaned data show a weaker signal than the raw data, then the respective hypothesis, even if supported by the original data, needs to be rejected. Lastly, if there is a mixed trend, whereby in some cases the signal is stronger and in others weaker, the data are probably inadequate and the findings cannot be considered conclusive. Thus, we propose that data-cleaning and data analysis be conducted jointly. We present a case study on the effects of environmental factors on species distribution, using GBIF data for all Australian mammals. We used the performance of a species distribution model (SDM) as a proxy for the strength of environmental factors in determining gradients of species richness. We implemented three different SDM algorithms for 190 species across grid cells that vary in their species richness, and examined the correlations between species richness and 10 different SDM performance indices. Species-environment affinity was weaker in species-rich areas across all SDM algorithms. The results support the notion that the impact of environmental factors on species distribution at a continental scale decreases with increasing species richness. The results also seemingly support the continuum hypothesis, namely that in species-poor areas species have strong affinities to particular niches, but that this structure breaks down in species-rich communities. Furthermore, a much stronger signal was revealed after data-cleaning. Thus, a joint study of a research question and data-cleaning provides a more reliable means of using BBDs.
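The comparison logic of the protocol above can be sketched in a few lines. This is an illustrative sketch in Python (the case study itself used R and SDM algorithms); `compare_signals`, the toy records, and the placeholder "signal" (a simple mean) are all hypothetical, not part of any published package:

```python
# Illustrative sketch of steps 2-5 of the protocol above (step 1, downloading
# raw occurrence data from a BBD such as GBIF, is assumed to have happened).

def compare_signals(raw_records, clean_fn, analyze_fn):
    raw_signal = analyze_fn(raw_records)     # step 2: analysis on raw data
    cleaned = clean_fn(raw_records)          # step 3: comprehensive cleaning
    clean_signal = analyze_fn(cleaned)       # step 4: analysis on cleaned data
    if clean_signal > raw_signal:            # step 5: interpret the change
        verdict = "stronger after cleaning: signal likely trustworthy"
    elif clean_signal < raw_signal:
        verdict = "weaker after cleaning: hypothesis should be rejected"
    else:
        verdict = "unchanged: inconclusive"
    return raw_signal, clean_signal, verdict

# Toy occurrence records; the record with missing coordinates inflates the signal.
records = [
    {"lat": -33.9, "lon": 151.2, "value": 2.0},
    {"lat": None,  "lon": None,  "value": 9.0},   # erroneous record
    {"lat": -27.5, "lon": 153.0, "value": 2.4},
]
drop_missing = lambda recs: [r for r in recs if r["lat"] is not None]
mean_value = lambda recs: sum(r["value"] for r in recs) / len(recs)

raw_s, clean_s, verdict = compare_signals(records, drop_missing, mean_value)
# Here the "signal" weakens once the erroneous record is removed, so a
# hypothesis supported only by the raw data would be rejected.
```

The same three-way interpretation (stronger, weaker, or mixed) carries over unchanged when the "signal" is a real statistic such as a richness-performance correlation.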


    Google Summer of Code: Why TDWG should participate

    Google Summer of Code (GSoC) is a global program, operating since 2005, that brings student developers into open source software development. Students work with open source organizations on summer-long programming projects, closely supervised by mentors from the organization, and Google pays them a stipend for the three-month program. The selection procedure is rigorous: organization mentors post project ideas on their websites, and students select ideas to work on, develop project proposals in consultation with mentors, and submit them on the GSoC website. Mentors evaluate and discuss the proposals and recommend a few to Google for acceptance. Depending on the number of slots available every year (usually 1000–1200), successful projects are announced. During the program, students are evaluated by mentors and, on approval, are paid the stipend directly. Several organizations have hosted projects related to biodiversity informatics over the years. In 2010 and 2011, the Marine Biological Laboratory and the Encyclopedia of Life executed some projects successfully. Since 2012, the R Project organization has hosted various projects related to biodiversity data, such as rgbif (Chamberlain et al. 2017), rvertnet (Chamberlain et al. 2016), and bdvis, the biodiversity data visualization package (Barve and Otegui 2016). Here we review the GSoC projects implemented in the domain of biodiversity informatics so far and explore the involvement of TDWG as a potential mentor organization in future GSoC programs.

    Introducing bdclean: a user friendly biodiversity data cleaning pipeline

    A new R package for biodiversity data cleaning, 'bdclean', was initiated in Google Summer of Code (GSoC) 2017 and is available on GitHub. Several R packages offer good data validation and cleaning functions, but 'bdclean' provides features to manage a complete pipeline for biodiversity data cleaning, from data quality exploration, through cleaning procedures, to reporting. Users are able to go through the quality control process in a structured, intuitive, and effective way. A modular approach to the data cleaning functionality should make the package extensible to many biodiversity data cleaning needs. Under GSoC 2018, 'bdclean' will go through a comprehensive upgrade; the new features will be highlighted in the demonstration.

    Introducing ‘The bdverse’: a family of R packages for biodiversity data

    The bdverse is a collection of packages that form a general framework for facilitating biodiversity science in R. We built it to serve as a sustainable and agile infrastructure that enhances the value of biodiversity data by allowing users to conveniently employ R for data exploration, quality assessment, data cleaning, and standardization. The bdverse supports users with and without programming capabilities. It includes six unique packages in a hierarchical structure, representing different functionality levels (Fig. 1). Major features of three core packages will be highlighted and demonstrated: (i) bdDwC provides an interactive Shiny app and a set of functions for standardizing field names in compliance with the Darwin Core (DwC) format; (ii) bdchecks is an infrastructure for performing, filtering, and managing various biodiversity data checks; (iii) bdclean is a user-friendly data cleaning Shiny app for the inexperienced R user, with features to manage a complete workflow for biodiversity data cleaning: data upload; user input to adjust the cleaning procedures; data cleaning; and, finally, generation of various reports and versions of the data. We are now submitting the bdverse packages for rOpenSci software review, and as soon as the packages meet the core requirements, we will officially release the bdverse. The bdverse project won 2nd prize in the 2018 Ebbe Nielsen Challenge.

    Towards a comprehensive workflow for biodiversity data in R

    An increasing number of scientists are using R for their data analyses; however, the proficiency required to manage biodiversity data in R is considerably rarer. Because users need to retrieve, manage, and assess high-volume data with an inherently complex structure (the Darwin Core standard, DwC), various R packages dealing with biodiversity data, and specifically with data cleaning, have been published. Although numerous new procedures are now available, implementing them requires users to invest considerable effort in exploring and learning each R package. For most users, this task can be daunting. To truly facilitate data cleaning in R, there is an urgent need for a package that fully integrates the functionality of existing packages, enhances that functionality, and simplifies its use. It is also necessary to identify and develop missing crucial functionalities. We are addressing these issues through two projects under Google Summer of Code (GSoC), an international annual program that matches students with open source organizations to develop code during their summer break. The first project deals with the integration challenge by developing a taxonomic cleaning workflow, standardizing various spatial and temporal data quality checks, and enhancing data retrieval and data management techniques. The second project aims at advancing new features, such as a flagging system (HashMap-like) in R, an innovative set of DwC summary tables, and new techniques for outlier analysis. The products of these projects lay crucial infrastructure for data quality assessment in R. This is a work in progress and needs further input. By developing a comprehensive framework for handling biodiversity data, we can fully harness the synergistic qualities of R and supply more holistic and agile solutions for users.
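The "HashMap-like" flagging system mentioned above can be illustrated with a minimal sketch. This is Python rather than R, for brevity, and the check names, record fields, and `run_checks` function are hypothetical, not the actual GSoC implementation:

```python
# Sketch of a record-level flagging system: each record id maps to the list
# of quality checks it fails. Records are flagged, not deleted, so the
# decision of what to filter out can be deferred to the analysis stage.

def run_checks(records, checks):
    flags = {}
    for rec in records:
        failed = [name for name, check in checks.items() if not check(rec)]
        if failed:
            flags[rec["id"]] = failed
    return flags

# Hypothetical checks in the spirit of common spatial/temporal validations.
checks = {
    "has_coordinates": lambda r: r.get("lat") is not None and r.get("lon") is not None,
    "valid_year": lambda r: r.get("year") is not None and 1700 <= r["year"] <= 2025,
}

records = [
    {"id": 1, "lat": -25.3, "lon": 133.8, "year": 2001},
    {"id": 2, "lat": None,  "lon": None,  "year": 2005},
    {"id": 3, "lat": -31.9, "lon": 115.9, "year": 1650},
]
flags = run_checks(records, checks)
# flags == {2: ["has_coordinates"], 3: ["valid_year"]}
```

Keeping flags separate from the data is what makes the before/after-cleaning comparison cheap: the same analysis can be rerun with any subset of flags applied.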

    bddashboard: An infrastructure for biodiversity dashboards in R

    The bdverse is a collection of packages that form a general framework for facilitating biodiversity science in R. Exploratory and diagnostic visualization can unveil hidden patterns and anomalies in data and allow quick and efficient exploration of massive datasets. An interactive yet flexible dashboard that can be easily deployed locally or remotely is therefore a highly valuable biodiversity informatics tool. To this end, we have developed 'bddashboard', which serves as an agile framework for biodiversity dashboard development. The project is built in R with the Shiny package (RStudio, Inc 2021), which helps build interactive web apps in R. The following key components were developed:
    Core Interactive Components: The basic building blocks of every dashboard are interactive plots, maps, and tables. We explored all major visualization libraries in R and concluded that 'plotly' (Sievert 2020) is the most mature and offers the best value for effort, that 'leaflet' (Graul 2016) has the most diverse and highest-quality mapping features, and that 'DT' (the DataTables library) (Xie et al. 2021) is best for rendering tabular data. Each component was modularized to better adjust it to biodiversity data and to enhance its flexibility.
    Field Selector: The field selector is a unique module that makes each interactive component much more versatile. Users have different data and needs; thus, every combination or selection of fields can tell a different story. The field selector allows users to change the X and Y axes of plots, to choose which columns are visible in a table, and to easily control map settings, all in real time, without reloading the page or disturbing the reactivity. It automatically detects how many columns a plot needs and what type of column can be passed to the X or Y axis, and it also displays the completeness of each field.
    Plot Navigation: We developed the plot navigation module to prevent unwanted extreme cases. Technically, drawing 1,000 bars on a single bar plot is possible, but such a visualization is not human-friendly. Navigation allows users to decide how many values they want to see on a single plot. This technique allows fast drawing of extensive datasets without affecting page reactivity, dramatically improving performance and functioning as a fail-safe mechanism.
    Reactivity: Reactivity creates the connection between different components. Changes in input values automatically flow to the plots, text, maps, and tables that use those inputs, causing them to update. Reactivity facilitates drill-down functionality, which enhances the user's ability to explore and investigate the data. We developed a novel and robust reactivity technique that allows us to add a new component and effectively connect it with all existing components within a dashboard tab using only one line of code.
    Generic Biodiversity Tabs: We developed five useful dashboard tabs (Fig. 1): (i) the Data Summary tab gives a quick overview of a dataset; (ii) the Data Completeness tab provides valuable information about missing records and missing Darwin Core fields; (iii) the Spatial tab is dedicated to spatial visualizations; (iv) the Taxonomic tab visualizes taxonomy; and (v) the Temporal tab visualizes time-related aspects of the data.
    Performance and Agility: To make a dashboard work smoothly and react quickly, hundreds of small and large modules, functions, and techniques must work together. Our goal was to minimize dashboard latency and maximize data capacity. We used asynchronous modules to write non-blocking code, clustering in map components, and preprocessing and filtering of data before passing it to plots to reduce the load. The modularized architecture of the 'bddashboard' package allows us to develop completely different interactive and reactive dashboards within mere minutes.
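The plot navigation idea, capping how many values a single plot draws, amounts to paging the data before it reaches the plotting layer. A minimal sketch follows (in Python rather than R for brevity; `page_values` is a hypothetical name, not part of the 'bddashboard' API):

```python
# Minimal sketch of plot navigation: instead of drawing all values at once
# (e.g., 1,000 bars on one bar plot), show one page of values at a time.

def page_values(values, page, page_size=10):
    """Return the slice of values for a zero-based page, so a plot never
    draws more than page_size items regardless of the dataset's size."""
    start = page * page_size
    return values[start:start + page_size]

counts = list(range(1000))                           # e.g., record counts for 1,000 taxa
first_page = page_values(counts, 0, page_size=25)    # items 0..24
last_page = page_values(counts, 39, page_size=25)    # items 975..999
```

Only the current page is handed to the plotting component, which keeps drawing fast and the page reactive even for extensive datasets, and bounds the number of plotted elements as a fail-safe.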