36 research outputs found
ecocomDP: A flexible data design pattern for ecological community survey data
The idea of harmonizing data is not new. Decades of amassing data in databases according to community standards - both locally and globally - have been more successful for some research domains than others. It is particularly difficult to harmonize data across studies where sampling protocols vary greatly and complex environmental conditions need to be understood to apply analytical methods correctly. However, a body of long-term ecological community observations is increasingly becoming publicly available and has been used in important studies. Here, we discuss an approach to preparing harmonized community survey data by an environmental data repository, in collaboration with a national observatory. The workflow framework and repository infrastructure are used to create a decentralized, asynchronous model to reformat data without altering original data through cleaning or aggregation, while retaining metadata about sampling methods and provenance, and enabling programmatic data access. This approach does not create another data ‘silo’ but will allow the repository to contribute subsets of available data to a variety of different analysis-ready data preparation efforts. With certain limitations (e.g., changes to the sampling protocol over time), data updates and downstream processing may be completely automated. In addition to supporting reuse of community observation data by synthesis science, a goal for this harmonization and workflow effort is to contribute these datasets to the Global Biodiversity Information Facility (GBIF) to increase the data’s discovery and use.
A global database of lake surface temperatures collected by in situ and satellite methods from 1985–2009
Global environmental change has influenced lake surface temperatures, a key driver of ecosystem structure and function. Recent studies have suggested significant warming of water temperatures in individual lakes across many different regions around the world. However, the spatial and temporal coherence associated with the magnitude of these trends remains unclear. Thus, a global data set of water temperature is required to understand and synthesize global, long-term trends in surface water temperatures of inland bodies of water. We assembled a database of summer lake surface temperatures for 291 lakes collected in situ and/or by satellites for the period 1985–2009. In addition, corresponding climatic drivers (air temperatures, solar radiation, and cloud cover) and geomorphometric characteristics (latitude, longitude, elevation, lake surface area, maximum depth, mean depth, and volume) that influence lake surface temperatures were compiled for each lake. This unique dataset offers an invaluable baseline perspective on global-scale lake thermal conditions as environmental change continues.
Facilitating and Improving Environmental Research Data Repository Interoperability
Environmental research data repositories provide much-needed services for data preservation and data dissemination to diverse communities with domain-specific or programmatic data needs and standards. Due to independent development, these repositories serve their communities well, but were built with different technologies, data models, and ontologies. Hence, the effectiveness and efficiency of these services can be vastly improved if repositories work together adhering to a shared community platform that focuses on the implementation of agreed-upon standards and best practices for curation and dissemination of data. Such a community platform drives forward the convergence of technologies and practices that will advance cross-domain interoperability. It will also facilitate contributions from investigators through standardized and streamlined workflows and provide increased visibility for the role of data managers and the curation services provided by data repositories, beyond preservation infrastructure. Ten specific suggestions for such standardizations are outlined without any suggestion of priority or technical implementation. Although the recommendations are for repositories to implement, they have been chosen specifically with the data provider/data curator and synthesis scientist in mind.
Long-term ecological research in a human-dominated world
Author Posting. © American Institute of Biological Sciences, 2012. This article is posted here by permission of American Institute of Biological Sciences for personal use, not for redistribution. The definitive version was published in BioScience 62 (2012): 342-353, doi:10.1525/bio.2012.62.4.6. The US Long Term Ecological Research (LTER) Network enters its fourth decade with a distinguished record of achievement in ecological science. The value of long-term observations and experiments has never been more important for testing ecological theory and for addressing today's most difficult environmental challenges. The network's potential for tackling emergent continent-scale questions such as cryosphere loss and landscape change is becoming increasingly apparent on the basis of a capacity to combine long-term observations and experimental results with new observatory-based measurements, to study socioecological systems, to advance the use of environmental cyberinfrastructure, to promote environmental science literacy, and to engage with decision-makers in framing major directions for research. The long-term context of network science, from understanding the past to forecasting the future, provides a valuable perspective for helping to solve many of the crucial environmental problems facing society today.
Generating community-built tools for data sharing and analysis in environmental networks
Rapid data growth in many environmental sectors has necessitated tools to manage and analyze these data. The development of tools often lags behind the proliferation of data, however, which may slow exploratory opportunities and scientific progress. The Global Lake Ecological Observatory Network (GLEON) collaborative model supports an efficient and comprehensive data–analysis–insight life cycle, including implementations of data quality control checks, statistical calculations/derivations, models, and data visualizations. These tools are community-built and openly shared. We discuss the network structure that enables tool development and a culture of sharing, leading to optimized output from limited resources. Specifically, data sharing and a flat collaborative structure encourage the development of tools that enable scientific insights from these data. Here we provide a cross-section of scientific advances derived from global-scale analyses in GLEON. We document enhancements to science capabilities made possible by the development of analytical tools and highlight opportunities to expand this framework to benefit other environmental networks.
BioTIME: A database of biodiversity time series for the Anthropocene.
Motivation: The BioTIME database contains raw data on species identities and abundances in ecological assemblages through time. These data enable users to calculate temporal trends in biodiversity within and amongst assemblages using a broad range of metrics. BioTIME is being developed as a community-led open-source database of biodiversity time series. Our goal is to accelerate and facilitate quantitative analysis of temporal patterns of biodiversity in the Anthropocene.
Main types of variables included: The database contains 8,777,413 species abundance records, from assemblages consistently sampled for a minimum of 2 years, which need not necessarily be consecutive. In addition, the database contains metadata relating to sampling methodology and contextual information about each record.
Spatial location and grain: BioTIME is a global database of 547,161 unique sampling locations spanning the marine, freshwater and terrestrial realms. Grain size varies across datasets from 0.0000000158 km² (158 cm²) to 100 km² (1,000,000,000,000 cm²).
Time period and grain: BioTIME records span from 1874 to 2016. The minimal temporal grain across all datasets in BioTIME is a year.
Major taxa and level of measurement: BioTIME includes data from 44,440 species across the plant and animal kingdoms, ranging from plants, plankton and terrestrial invertebrates to small and large vertebrates.
Software format: .csv and .SQL.
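The temporal biodiversity metrics described above can be sketched from BioTIME-style records. In the example below the column names (study_id, year, species, abundance) are illustrative assumptions, not the database's actual schema:

```python
import pandas as pd

# Toy records in a BioTIME-like shape: one row per species
# observation within an assemblage (study) in a given year.
records = pd.DataFrame({
    "study_id":  [1, 1, 1, 1, 1, 1],
    "year":      [2000, 2000, 2001, 2001, 2001, 2002],
    "species":   ["a", "b", "a", "b", "c", "a"],
    "abundance": [5, 2, 4, 1, 3, 6],
})

# Annual species richness per assemblage, one of the
# within-assemblage trend metrics such data support.
richness = (records.groupby(["study_id", "year"])["species"]
                   .nunique()
                   .rename("richness")
                   .reset_index())
print(richness)
```

Any other per-year metric (total abundance, a diversity index) follows the same groupby pattern over the assemblage and year columns.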
The Tao of open science for ecology
The field of ecology is poised to take advantage of emerging technologies that facilitate the gathering, analyzing, and sharing of data, methods, and results. The concept of transparency at all stages of the research process, coupled with free and open access to data, code, and papers, constitutes “open science.” Despite the many benefits of an open approach to science, a number of barriers to entry exist that may prevent researchers from embracing openness in their own work. Here we describe several key shifts in mindset that underpin the transition to more open science. These shifts in mindset include thinking about data stewardship rather than data ownership, embracing transparency throughout the data life‐cycle and project duration, and accepting critique in public. Though foreign and perhaps frightening at first, these changes in thinking stand to benefit the field of ecology by fostering collegiality and broadening access to data and findings. We present an overview of tools and best practices that can enable these shifts in mindset at each stage of the research process, including tools to support data management planning and reproducible analyses, strategies for soliciting constructive feedback throughout the research process, and methods of broadening access to final research products.
Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse
Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOS-GEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOS-LIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database.
Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.
Keywords: LAGOS, Integrated database, Data harmonization, Data sharing, Ecoinformatics, Landscape limnology, Macrosystems ecology, Water quality, Database documentation, Data reuse
Bivariate Zero-Inflated Regression for Count Data: A Bayesian Approach with Application to Plant Counts
Lately, bivariate zero-inflated (BZI) regression models have been used in many instances in the medical sciences to model excess zeros. Examples include the BZI Poisson (BZIP) and BZI negative binomial (BZINB) models. Such formulations vary in the basic modeling aspect and use the EM algorithm (Dempster, Laird and Rubin, 1977) for parameter estimation. A different modeling formulation in the Bayesian context is given by Dagne (2004). We extend the modeling to a more general setting for multivariate ZIP models for count data with excess zeros as proposed by Li, Lu, Park, Kim, Brinkley and Peterson (1999), focusing on a particular bivariate regression formulation. For the basic formulation in the bivariate case, we assume that the Xi are (latent) independent Poisson random variables with parameters λi, i = 0, 1, 2. A bivariate count response (Y1, Y2) follows a mixture of four distributions: p0 is the mixing probability of a point mass at (0, 0); p1, the mixing probability that Y2 = 0 while Y1 = X0 + X1; p2, the mixing probability that Y1 = 0 while Y2 = X0 + X2; and finally (1 - p0 - p1 - p2), the mixing probability that Yi = Xi + X0, i = 1, 2. The choice of the parameters {pi, λi, i = 0, 1, 2} ensures that the marginal distributions of the Yi are zero-inflated Poisson with rate λ0 + λi. All the parameters thus introduced are allowed to depend on covariates through canonical-link generalized linear models (McCullagh and Nelder, 1989). This flexibility allows for a range of real-life applications, especially in the medical and biological fields, where the counts are bivariate in nature (with strong association between the processes) and where there are excess zeros in one or both processes. Our contribution in this paper is to employ a fully Bayesian approach, consolidating the work of Dagne (2004) and Li et al. (1999) and generalizing the modeling and sampling-based methods described by Ghosh, Mukhopadhyay and Lu (2006), to estimate the parameters and obtain posterior credible intervals both in the case where covariates are not available and in the case where they are. In this context, we provide explicit data augmentation techniques that lend themselves to easier implementation of the Gibbs sampler by giving rise to well-known, closed-form posterior distributions in the bivariate ZIP case. We then use simulations to explore the effectiveness of this estimation using the Bayesian BZIP procedure, comparing the performance to the Bayesian and classical ZIP approaches. Finally, we demonstrate the methodology on bivariate plant count data with excess zeros that was collected on plots in the Phoenix metropolitan area and compare the results with independent ZIP regression models fitted to both processes.
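The four-component mixture described in the abstract can be simulated directly, which makes the marginal ZIP structure easy to check empirically. The sketch below follows the parameterization given there (mixing probabilities p0, p1, p2 and Poisson rates λ0, λ1, λ2); it is an illustration of the model, not the authors' estimation procedure:

```python
import numpy as np

rng = np.random.default_rng(42)

def rbzip(n, p, lam):
    """Draw n samples (Y1, Y2) from the bivariate zero-inflated Poisson mixture.

    Components, with X_i ~ Poisson(lam[i]) independent:
      w.p. p0:              (Y1, Y2) = (0, 0)
      w.p. p1:              Y1 = X0 + X1, Y2 = 0
      w.p. p2:              Y1 = 0,       Y2 = X0 + X2
      w.p. 1-p0-p1-p2:      Y1 = X0 + X1, Y2 = X0 + X2
    The shared X0 induces positive association between Y1 and Y2.
    """
    p0, p1, p2 = p
    comp = rng.choice(4, size=n, p=[p0, p1, p2, 1.0 - p0 - p1 - p2])
    x0, x1, x2 = (rng.poisson(l, size=n) for l in lam)
    y1 = np.where(np.isin(comp, [1, 3]), x0 + x1, 0)
    y2 = np.where(np.isin(comp, [2, 3]), x0 + x2, 0)
    return np.column_stack([y1, y2])

y = rbzip(100_000, p=(0.3, 0.1, 0.1), lam=(1.0, 2.0, 3.0))

# Marginally, Y1 is Poisson(lam0 + lam1) with probability 1 - p0 - p2
# and a structural zero otherwise, i.e. ZIP with rate lam0 + lam1.
print(y[:, 0].mean())          # close to (1 - 0.3 - 0.1) * (1.0 + 2.0) = 1.8
print((y[:, 0] == 0).mean())   # close to 0.4 + 0.6 * exp(-3)
```

In a regression setting each pi and λi would additionally depend on covariates through canonical-link GLMs, as the abstract notes.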