18 research outputs found

    State tagging for improved Earth and environmental data quality assurance

    Environmental data allow us to monitor the constantly changing environment that we live in. They allow us to study trends and help us to develop better models to describe processes in our environment, which in turn can provide information to improve management practices. To ensure that the data are reliable for analysis and interpretation, they must undergo quality assurance procedures. Such procedures generally include standard operating procedures during sampling and laboratory measurement (if applicable), as well as data validation upon entry to databases. The latter usually involves compliance (i.e., format) and conformity (i.e., value) checks, most likely in the form of single-parameter range tests. Such tests take no account of the system state at which each measurement is made, and provide the user with little contextual information on the probable cause when a measurement is flagged as out of range. We propose the use of data science techniques to tag each measurement with an identified system state. The term “state” is defined loosely here, and the states are identified using k-means clustering, an unsupervised machine learning method. The meaning of the states is open to specialist interpretation. Once the states are identified, state-dependent prediction intervals can be calculated for each observational variable. This approach provides the user with more contextual information to resolve out-of-range flags and to derive prediction intervals for observational variables that consider the changes in system states. The users can then apply further analysis and filtering as they see fit. We illustrate our approach with two well-established long-term monitoring datasets in the UK: moth and butterfly data from the UK Environmental Change Network (ECN), and the UK CEH Cumbrian Lakes monitoring scheme. Our work contributes to the ongoing development of a better data science framework that allows researchers and other stakeholders to find and use the data they need more readily.
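
The state-tagging idea in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the driver variables, the toy numbers, the function names (`kmeans`, `nearest_state`, `state_intervals`) and the use of a normal-approximation interval (mean ± 1.96 standard deviations) are all assumptions made for the example; the abstract does not say how the prediction intervals are computed.

```python
import random
import statistics


def nearest_state(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: tag each point with its nearest centroid.
        labels = [nearest_state(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(statistics.fmean(d) for d in zip(*members)))
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def state_intervals(values, labels, z=1.96):
    """Per-state interval mean +/- z * sd (normal approximation, an assumption)."""
    intervals = {}
    for state in set(labels):
        member_values = [v for v, lab in zip(values, labels) if lab == state]
        mean = statistics.fmean(member_values)
        sd = statistics.stdev(member_values) if len(member_values) > 1 else 0.0
        intervals[state] = (mean - z * sd, mean + z * sd)
    return intervals


# Hypothetical driver variables (temperature, rainfall) and an observed count.
features = [(5.0, 2.0), (6.0, 2.5), (5.5, 1.8), (20.0, 0.2), (21.0, 0.1), (19.5, 0.3)]
counts = [10.0, 12.0, 11.0, 80.0, 85.0, 78.0]

centroids, labels = kmeans(features, k=2)
intervals = state_intervals(counts, labels)

# Tag a new measurement with a state, then test it against that state's interval.
new_state = nearest_state((5.2, 2.1), centroids)
low, high = intervals[new_state]
flagged = not (low <= 40.0 <= high)
```

In this toy data, a count of 40 lies inside the overall observed range (10 to 85), so a single-parameter range test would accept it, but it falls outside the interval of the state it is tagged with and is flagged. A real application would standardise the features first and choose the number of states with care.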

    River reach-level machine learning estimation of nutrient concentrations in Great Britain

    Nitrogen (N) and phosphorus (P) are essential nutrients that are necessary for plant growth and support life in aquatic ecosystems. However, excessive N and P can cause algal blooms that deplete oxygen, leading to fish death and the release of toxins that are harmful to humans. Estimates of N and P levels in rivers are typically calculated at station or grid (>1 km) scale; therefore, it is difficult to visualise the evolution of water quality as water travels downstream. Using a high-resolution reach-scale river network and associating each reach with land cover fractions and catchment descriptors, we trained random forest models on aggregated data (2010–2020) from the Environment Agency Open Water Quality Data Archive for 2,343 stations to predict long-term nitrate and orthophosphate concentrations at each river reach in Great Britain (GB). We separated the model training and predictions for different seasons to investigate the potential difference in feature importance. Our model predicted concentrations with an average testing coefficient of determination (R²) of 0.71 for nitrate and 0.58 for orthophosphate using 5-fold cross-validation. Our model showed slightly better performance for higher Strahler stream orders, highlighting the challenges of making predictions in small streams. Our results revealed that arable and horticultural land use is the strongest and most reliable predictor for nitrate, while floodplain extents and standard percentage runoff are stronger predictors for orthophosphate. Nationally, higher orthophosphate concentrations were observed in urbanised areas. This study shows how combining a river network model with machine learning can readily provide an understanding, across the river network, of the spatial distribution of water quality levels.

    Ensemble Kalman inversion of induced polarization data

    This paper explores the applicability of Ensemble Kalman Inversion (EKI) with level-set parameterization for solving geophysical inverse problems. In particular, we focus on its extension to induced polarization (IP) data with uncertainty quantification. IP data may provide rich information on characteristics of geological materials due to their sensitivity to characteristics of the pore-grain interface. In many IP studies, different geological units are juxtaposed and the goal is to delineate these units and obtain estimates of unit properties with uncertainty bounds. Conventional inversion of IP data does not resolve sharp interfaces well and tends to reduce and smooth resistivity variations, while not readily providing uncertainty estimates. Recently, it has been shown for DC resistivity that EKI is an efficient solver for inverse problems which provides uncertainty quantification, and that its combination with level-set parameterization can delineate arbitrary interfaces well. In this contribution, we demonstrate the extension of EKI to IP data using a sequential approach, where the mean field obtained from DC resistivity inversion is used as input for a separate phase angle inversion. We illustrate our workflow using a series of synthetic and field examples. Variations with uncertainty bounds in both DC resistivity and phase angles are recovered by EKI, which provides useful information for hydrogeological site characterization. While phase angles are less well resolved than DC resistivity, partly due to their smaller range and higher percentage data errors, they complement DC resistivity for site characterization. Overall, EKI with level-set parameterization provides a practical way forward for efficient hydrogeophysical imaging under uncertainty.
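
For readers unfamiliar with EKI, the core ensemble update can be sketched in a few lines. This is a deliberately reduced illustration, not the paper's workflow: it uses one scalar parameter, one datum, a hypothetical linear forward model (`forward`), and a deterministic update without perturbed observations. It includes neither the level-set parameterization nor a resistivity or IP forward solver.

```python
import statistics


def forward(u):
    """Hypothetical forward model mapping a parameter to one predicted datum."""
    return 2.0 * u


def eki_step(ensemble, y, noise_var):
    """One deterministic ensemble Kalman inversion update for a scalar parameter."""
    preds = [forward(u) for u in ensemble]
    u_mean = statistics.fmean(ensemble)
    g_mean = statistics.fmean(preds)
    n = len(ensemble)
    # Sample cross-covariance between parameters and predictions.
    c_ug = sum((u - u_mean) * (g - g_mean) for u, g in zip(ensemble, preds)) / (n - 1)
    # Sample variance of the predictions.
    c_gg = sum((g - g_mean) ** 2 for g in preds) / (n - 1)
    # Kalman gain, regularised by the assumed data noise variance.
    gain = c_ug / (c_gg + noise_var)
    # Move every ensemble member towards agreement with the observed datum.
    return [u + gain * (y - g) for u, g in zip(ensemble, preds)]


initial_ensemble = [0.0, 1.0, 3.0]
initial_spread = statistics.stdev(initial_ensemble)

ensemble = list(initial_ensemble)
for _ in range(5):
    ensemble = eki_step(ensemble, y=4.0, noise_var=0.01)

estimate = statistics.fmean(ensemble)  # ensemble mean as the point estimate
spread = statistics.stdev(ensemble)    # ensemble spread as a crude uncertainty proxy
```

Here the observed datum 4.0 corresponds to a true parameter of 2.0 under the assumed forward model, and the ensemble mean moves close to that value while the spread shrinks. No derivatives of the forward model are needed, only forward evaluations of each ensemble member.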

    Advancing reproducible research by publishing R markdown notebooks as interactive sandboxes using the learnr package

    Various R packages and best practices have played a pivotal role in promoting the Findability, Accessibility, Interoperability, and Reuse (FAIR) principles of open science. For example, (1) well-documented R scripts and notebooks with rich narratives are deposited at a trusted data centre, (2) R Markdown interactive notebooks can be run on demand as a web service, and (3) R Shiny web apps provide convenient user interfaces to explore research outputs. However, notebooks require users to go through the entire analysis, while Shiny apps do not expose the underlying code and require extra work for UI design. We propose using the learnr package to expose certain code chunks in R Markdown so that users can readily experiment with them in guided, editable, isolated, executable, and resettable code sandboxes. Our approach does not replace the existing use of notebooks and Shiny apps, but it adds another level of abstraction between them to promote reproducible science.

    Multi-product characterization of surface soil moisture drydowns in the UK

    The persistence or memory of soil moisture (θ) after rainfall has substantial environmental implications. Much work has been done to study soil moisture drydown for in-situ and satellite data separately. In this work, we present a comparison of drydown characteristics across multiple UK soil moisture products, including satellite-merged (i.e. TCM), in-situ (i.e. COSMOS-UK), hydrological model (i.e. G2G), statistical model (i.e. SMUK) and land surface model (LSM) (i.e. CHESS) data. The drydown decay time scale (τ) for all gridded products is computed at an unprecedented resolution of 1-2 km, a scale relevant to weather and climate models. While their ranges of τ differ (except for SMUK and CHESS, which are similar) due to differences such as sensing depths, their spatial patterns are correlated with land cover and soil types. We further analyse the occurrence of drydown events at COSMOS-UK sites. We show that soil moisture drydown regimes exhibit strong seasonal dependencies, whereby the soil dries out more quickly in summer than in winter. These seasonal dependencies are important to consider during model benchmarking and evaluation. We show that fitted τ values based on COSMOS and LSM are well correlated, with a bias towards lower τ for COSMOS. Our findings contribute to a growing body of literature characterizing τ, with the aim of developing a method to systematically validate model soil moisture products at a range of scales.
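
As a rough illustration of what fitting a drydown time scale involves, the sketch below assumes the commonly used exponential form θ(t) = θ_min + Δθ·exp(−t/τ) and estimates τ by linear least squares on log-transformed data. The functional form, the assumption that θ_min is known in advance, the function name `fit_drydown_tau` and the synthetic values are all assumptions for the example; the abstract does not state the fitting procedure used.

```python
import math


def fit_drydown_tau(times, theta, theta_min):
    """Fit tau and amplitude in theta(t) = theta_min + amplitude * exp(-t / tau).

    Uses ordinary least squares on ln(theta - theta_min) versus time, so every
    theta value must be strictly greater than theta_min.
    """
    y = [math.log(th - theta_min) for th in theta]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(y) / n
    slope = sum((t - t_mean) * (v - y_mean) for t, v in zip(times, y)) / sum(
        (t - t_mean) ** 2 for t in times
    )
    intercept = y_mean - slope * t_mean
    # The slope of the log-linear fit is -1 / tau; the intercept is ln(amplitude).
    return -1.0 / slope, math.exp(intercept)


# Synthetic drydown event: days since rainfall and volumetric soil moisture.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
theta = [0.10 + 0.25 * math.exp(-t / 8.0) for t in times]

tau, amplitude = fit_drydown_tau(times, theta, theta_min=0.10)
```

On this noise-free synthetic event the fit recovers the values used to generate it (τ of 8 days, amplitude of 0.25). With real observations, noise near θ_min makes the log transform unstable, so a nonlinear least-squares fit of all three parameters is often preferred.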

    The relative importance of head, flux, and prior information in hydraulic tomography analysis

    Using cross-correlation analysis, we demonstrate that flux measurements at observation locations during hydraulic tomography (HT) surveys carry nonredundant information about heterogeneity that is complementary to head measurements at the same locations. We then hypothesize that a joint interpretation of head and flux data, even when the same observation network is used for both, can enhance the resolution of HT estimates. Subsequently, we use numerical experiments to test this hypothesis and investigate the impact of flux conditioning and prior information (such as correlation lengths and initial mean models, i.e., uniform mean or distributed means) on the HT estimates of a nonstationary, layered medium. We find that the addition of flux conditioning to HT analysis improves the estimates in all of the prior models tested. While prior information on geologic structures could be useful, its influence on the estimates diminishes as more nonredundant data (i.e., flux) are used in the HT analysis. Lastly, recommendations for conducting HT surveys and analysis are presented.

    Reconstructing GRACE-derived terrestrial water storage anomalies with in-situ groundwater level measurements and meteorological forcing data

    Study region: North China Plain (NCP), China, a semi-arid region with intense groundwater withdrawals. Study focus: This paper developed a framework using meteorological data, model-simulated terrestrial water storage anomalies (TWSA), and additional in-situ (groundwater level, GL) data to improve GRACE-TWSA reconstruction, which is unsatisfactory in arid and semi-arid regions due to the intense anthropogenic influence on groundwater. The inconsistency between point-scale data (GL) and grid-scale data (GRACE-TWSA and predictors other than GL) is handled by feature extraction techniques. Moreover, to deal with temporal non-stationarity, the time series are separated into trend and detrended components, the patterns of which are then learned by linear and nonlinear machine learning models, respectively. New hydrological insights for the region: Multi-site GL observations in NCP can serve not only as validation data but also as predictors, providing invaluable information on human effects that improves the reconstructed TWSA (from 6.51 to 3.86 cm for Root Mean Square Error and from 0.56 to 0.82 for Nash-Sutcliffe Efficiency). Our results show that multi-site GL data in NCP are highly inter-correlated and can be represented by several principal components, demonstrating the strong hydraulic connectivity in NCP. We also find a significant one-month lag and a linear relationship between the trends of GRACE-TWSA and GL changes in NCP. This deeper understanding of hydrologic processes has implications for enhancing GRACE-TWSA estimation in other similar regions.
