20 research outputs found

    Benchmarking environmental machine-learning models: methodological progress and an application to forest health

    Geospatial machine learning is a versatile approach to analyzing environmental data and can help to better understand the interactions within, and the current state of, our environment. Because of the artificial intelligence of these algorithms, complex relationships may be discovered that other analysis methods would miss. Modeling the interaction of creatures with their environment is referred to as ecological modeling, which is a subcategory of environmental modeling. A subfield of ecological modeling is species distribution modeling (SDM), which aims to understand the relation between the presence or absence of certain species and their environments. SDM differs from classical mapping/detection analysis: while the latter primarily aims for a visual representation of a species' spatial distribution, the former focuses on using the available data to build models and interpret them. Because no single best option exists for building such models, different settings need to be evaluated and compared against each other. When conducting such model comparisons, commonly referred to as benchmarking, care needs to be taken throughout the analysis steps to achieve meaningful and unbiased results. These steps comprise data preprocessing, model optimization and performance assessment. While these general principles apply to any modeling analysis, their application in an environmental context often requires additional care with respect to data handling, possibly hidden underlying data effects, and model selection. To conduct all of these steps in a programmatic (and efficient) way, toolboxes in the form of programming modules or packages are needed. This work makes methodological contributions that focus on efficient, machine-learning-based analysis of environmental data. In addition, research software to generalize and simplify the described process has been created throughout this work.

    Splitting methods for nonlinear Dirac equations with Thirring type interaction in the nonrelativistic limit regime

    Nonlinear Dirac equations describe the motion of relativistic spin-$\frac{1}{2}$ particles in the presence of external electromagnetic fields, modelled by an electric and a magnetic potential, and taking into account a nonlinear particle self-interaction. In recent years, the construction of numerical splitting schemes for the solution of these systems in the nonrelativistic limit regime, i.e., the speed of light $c$ formally tending to infinity, has gained a lot of attention. In this paper, we consider a nonlinear Dirac equation with Thirring type interaction, where, in contrast to the case of the Soler type nonlinearity, a classical two-term splitting scheme cannot be straightforwardly applied. Thus, we propose and analyze a three-term Strang splitting scheme which relies on splitting the full problem into the free Dirac subproblem, a potential subproblem, and a nonlinear subproblem, where each subproblem can be solved exactly in time. Moreover, our analysis shows that the error of our scheme improves from $\mathcal{O}(r^{2}c^{4})$ to $\mathcal{O}(r^{2}c^{3})$ if the magnetic potential in the system vanishes. Furthermore, we propose an efficient limit approximation scheme for solving nonlinear Dirac systems in the nonrelativistic limit regime $c \gg 1$ which allows errors of $\mathcal{O}(c^{-1})$ without any $c$-dependent time step restriction.
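
The composition pattern of a three-term Strang splitting can be illustrated on a toy problem. The sketch below is not the scheme from the paper: it assumes a linear 2x2 system y' = (A + B + C) y, chosen only because each of the three sub-flows is known in closed form, and all function names are hypothetical. It applies half steps of the first two flows around a full step of the third and compares the result with the exact solution for two step sizes.

```python
import math

# Toy problem (an assumption for illustration, not the Dirac system):
#   y' = (A + B + C) y  with
#   A = [[0, 1], [0, 0]],  B = [[0, 0], [1, 0]],  C = [[0, -1], [1, 0]].
# Each sub-flow exp(t*A), exp(t*B), exp(t*C) is available exactly.

def flow_a(t, v):
    # A is nilpotent, so exp(t*A) = I + t*A
    return (v[0] + t * v[1], v[1])

def flow_b(t, v):
    # B is nilpotent, so exp(t*B) = I + t*B
    return (v[0], v[1] + t * v[0])

def flow_c(t, v):
    # C generates a rotation by angle t
    c, s = math.cos(t), math.sin(t)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def strang3_step(tau, v):
    # Symmetric composition: half A, half B, full C, half B, half A
    v = flow_a(tau / 2, v)
    v = flow_b(tau / 2, v)
    v = flow_c(tau, v)
    v = flow_b(tau / 2, v)
    v = flow_a(tau / 2, v)
    return v

def integrate(tau, n_steps, v):
    for _ in range(n_steps):
        v = strang3_step(tau, v)
    return v

def exact(t, v):
    # A + B + C = [[0, 0], [2, 0]] is nilpotent, so the exact flow is I + t*(A + B + C)
    return (v[0], v[1] + 2.0 * t * v[0])

def error_at(n_steps, t_end=1.0, v0=(1.0, 0.5)):
    tau = t_end / n_steps
    approx = integrate(tau, n_steps, v0)
    ref = exact(t_end, v0)
    return math.hypot(approx[0] - ref[0], approx[1] - ref[1])

err_coarse = error_at(50)
err_fine = error_at(100)
ratio = err_coarse / err_fine  # a second-order method should give a ratio near 4
print(err_coarse, err_fine, ratio)
```

Because the composition is symmetric, halving the step size should reduce the error by a factor of roughly four on this toy problem. The dependence of the error constant on c discussed in the abstract is not reproduced here.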

    Efficient time integration of the Maxwell-Klein-Gordon equation in the nonrelativistic limit regime

    The Maxwell-Klein-Gordon equation describes the interaction of a charged particle with an electromagnetic field. Solving this equation in the non-relativistic limit regime, i.e. the speed of light $c$ formally tending to infinity, is numerically very delicate as the solution becomes highly oscillatory in time. In order to resolve the oscillations, standard numerical time integration schemes require severe time step restrictions depending on the large parameter $c^{2}$. The idea to overcome this numerical challenge is to filter out the high frequencies explicitly by asymptotically expanding the exact solution with respect to the small parameter $c^{-2}$. This allows us to reduce the highly oscillatory problem to its corresponding non-oscillatory Schrödinger-Poisson limit system. On the basis of this expansion we are then able to construct efficient numerical time integration schemes, which do NOT suffer from any $c$-dependent time step restriction.

    mlr3spatiotempcv: Spatiotemporal resampling methods for machine learning in R

    Spatial and spatiotemporal machine-learning models require a suitable framework for their model assessment, model selection, and hyperparameter tuning, in order to avoid error estimation bias and over-fitting. This contribution reviews the state of the art in spatial and spatiotemporal cross-validation, and introduces the R package mlr3spatiotempcv as an extension package of the machine-learning framework mlr3. Currently, various R packages implementing different spatiotemporal partitioning strategies exist: blockCV, CAST, skmeans and sperrorest. The goal of mlr3spatiotempcv is to gather the available spatiotemporal resampling methods in R and make them available to users through a simple and common interface. This is made possible by integrating the package directly into the mlr3 machine-learning framework, which already has support for generic non-spatiotemporal resampling methods such as random partitioning. One advantage is the use of a consistent nomenclature in an overarching machine-learning toolkit instead of a varying package-specific syntax, making it easier for users to choose from a variety of spatiotemporal resampling methods. This package avoids giving recommendations on which method to use in practice, as this decision depends on the predictive task at hand, the autocorrelation within the data, and the spatial structure of the sampling design or geographic objects being studied. Comment: 35 pages, 15 Figures, 1 Table
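
To make the difference from random partitioning concrete, the following minimal sketch shows one simple spatial partitioning idea: leave-one-block-out folds on a regular grid. It is an illustration only and does not reproduce the interface of mlr3spatiotempcv or of any package named above; the function name, the sample coordinates, and the block size are assumptions.

```python
from collections import defaultdict

def spatial_block_folds(coords, block_size):
    """Leave-one-block-out folds: points in the same grid cell share a fold.

    coords is a list of (x, y) tuples; block_size is the grid cell edge length.
    Returns a list of (train_indices, test_indices) pairs, one per occupied cell.
    """
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        key = (int(x // block_size), int(y // block_size))
        blocks[key].append(i)

    folds = []
    for key in sorted(blocks):
        test = blocks[key]
        held_out = set(test)
        train = [i for i in range(len(coords)) if i not in held_out]
        folds.append((train, test))
    return folds

# Hypothetical sample locations, for illustration only
coords = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.1), (1.7, 0.8), (0.3, 1.5), (1.1, 1.9)]
folds = spatial_block_folds(coords, block_size=1.0)
for train, test in folds:
    print("train:", train, "test:", test)
```

Neighbouring observations end up on the same side of each split, which is what limits the optimistic bias caused by spatial autocorrelation. Which block size or partitioning method is appropriate depends on the data, as the abstract notes.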

    Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data

    Machine-learning algorithms have gained popularity in recent years in the field of ecological modeling due to their promising results in predictive performance of classification problems. While the application of such algorithms has been greatly simplified in recent years by their well-documented integration in commonly used statistical programming languages such as R, there are several practical challenges in the field of ecological modeling related to unbiased performance estimation, optimization of algorithms using hyperparameter tuning, and spatial autocorrelation. We address these issues in the comparison of several widely used machine-learning algorithms such as Boosted Regression Trees (BRT), k-Nearest Neighbor (WKNN), Random Forest (RF) and Support Vector Machine (SVM) to traditional parametric algorithms such as logistic regression (GLM) and semi-parametric ones like generalized additive models (GAM). Different nested cross-validation methods, including hyperparameter tuning methods, are used to evaluate model performances with the aim of obtaining bias-reduced performance estimates. As a case study, the spatial distribution of the forest disease Diplodia sapinea in the Basque Country in Spain is investigated using common environmental variables such as temperature, precipitation, soil or lithology as predictors. Results show that GAM and RF (mean AUROC estimates 0.708 and 0.699) outperform all other methods in predictive accuracy. The effect of hyperparameter tuning saturates at around 50 iterations for this data set. The AUROC differences between the bias-reduced (spatial cross-validation) and overoptimistic (non-spatial cross-validation) performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%), respectively. It is recommended to also use spatial partitioning for cross-validation hyperparameter tuning of spatial data.

    Predicting forest cover in distinct ecosystems: the potential of multi-source Sentinel-1 and -2 data fusion

    The fusion of microwave and optical data sets is expected to provide great potential for the derivation of forest cover around the globe. As Sentinel-1 and Sentinel-2 are now both operating in twin mode, they can provide an unprecedented data source to build dense spatial and temporal high-resolution time series across a variety of wavelengths. This study investigates (i) the ability of the individual sensors and (ii) their joint potential to delineate forest cover for study sites in two highly varied landscapes located in Germany (temperate dense mixed forests) and South Africa (open savanna woody vegetation and forest plantations). We used multi-temporal Sentinel-1 and single time steps of Sentinel-2 data in combination to derive accurate forest/non-forest (FNF) information via machine-learning classifiers. The forest classification accuracies were 90.9% and 93.2% for South Africa and Thuringia, respectively, estimated using autocorrelation-corrected spatial cross-validation (CV) for the fused data set. Sentinel-1-only classifications provided the lowest overall accuracy of 87.5%, while Sentinel-2-based classifications led to higher accuracies of 91.9%. Sentinel-2 short-wave infrared (SWIR) channels, biophysical parameters (Leaf Area Index (LAI) and Fraction of Absorbed Photosynthetically Active Radiation (FAPAR)) and the lower spectrum of the Sentinel-1 synthetic aperture radar (SAR) time series were found to be most distinctive in the detection of forest cover. In contrast to homogeneous forest sites, Sentinel-1 time series information improved forest cover predictions in open savanna-like environments with heterogeneous regional features. The presented approach proved to be robust and displayed the benefit of fusing optical and SAR data at high spatial resolution.

    Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data

    Machine-learning algorithms have gained popularity in recent years in the field of ecological modeling due to their promising results in predictive performance of classification problems. While the application of such algorithms has been greatly simplified in recent years by their well-documented integration in commonly used statistical programming languages such as R, there are several practical challenges in the field of ecological modeling related to unbiased performance estimation, optimization of algorithms using hyperparameter tuning, and spatial autocorrelation. We address these issues in the comparison of several widely used machine-learning algorithms such as Boosted Regression Trees (BRT), k-Nearest Neighbor (WKNN), Random Forest (RF) and Support Vector Machine (SVM) to traditional parametric algorithms such as logistic regression (GLM) and semi-parametric ones like Generalized Additive Models (GAM). Different nested cross-validation methods, including hyperparameter tuning methods, are used to evaluate model performances with the aim of obtaining bias-reduced performance estimates. As a case study, the spatial distribution of the forest disease Diplodia sapinea in the Basque Country in Spain is investigated using common environmental variables such as temperature, precipitation, soil or lithology as predictors. Results show that GAM and Random Forest (RF) (mean AUROC estimates 0.708 and 0.699) outperform all other methods in predictive accuracy. The effect of hyperparameter tuning saturates at around 50 iterations for this data set. The AUROC differences between the bias-reduced (spatial cross-validation) and overoptimistic (non-spatial cross-validation) performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%), respectively. It is recommended to also use spatial partitioning for cross-validation hyperparameter tuning of spatial data. The models developed in this study enhance the detection of Diplodia sapinea in the Basque Country compared to previous studies.

    Simplified odds ratio calculation of binomial GAM/GLM models

    Simplified odds ratio calculation of GAM(M)s & GLM(M)s. Provides structured output (data frame) of all predictors and their corresponding odds ratios and confidence intervals for further analyses. It helps to avoid false references of predictors and increments by specifying these parameters in a list instead of using 'exp(coef(model))' (the standard approach to odds ratio calculation for GLMs), which just returns a plain numeric output. For GAM(M)s, odds ratio calculation is greatly simplified with this package since it takes care of the multiple 'predict()' calls for the chosen predictor while holding other predictors constant. This package also allows odds ratio calculation for percentage steps across the whole predictor distribution range for GAM(M)s. In both cases, confidence intervals are returned as well. Calculated odds ratios of GAM(M)s can be inserted into the smooth function plot
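
For the GLM case, the arithmetic behind 'exp(coef(model))' with a user-chosen increment can be sketched as follows. This is a generic illustration of the logistic-regression odds ratio with a Wald confidence interval, not the package's interface; the function name and the coefficient values are hypothetical, and the GAM(M) procedure based on 'predict()' calls is not covered.

```python
import math

def odds_ratio(beta, std_error, increment=1.0, z=1.96):
    """Odds ratio and Wald confidence interval for a logistic-regression term.

    beta and std_error are the coefficient and its standard error on the
    log-odds scale; increment is the change in the predictor being compared.
    z = 1.96 corresponds to an approximate 95% interval.
    """
    estimate = math.exp(beta * increment)
    bound_1 = math.exp((beta - z * std_error) * increment)
    bound_2 = math.exp((beta + z * std_error) * increment)
    return {
        "odds_ratio": estimate,
        "ci_lower": min(bound_1, bound_2),
        "ci_upper": max(bound_1, bound_2),
    }

# Hypothetical coefficient and standard error, for illustration only
result = odds_ratio(beta=0.35, std_error=0.10, increment=2.0)
print(result)
```

Passing the increment explicitly, rather than reading it off a bare vector of exponentiated coefficients, is the kind of bookkeeping the abstract says the package is meant to take over.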

    Modeling the Spatial Distribution of Hail Damage in Pine Plantations of Northern Spain as a Major Risk Factor for Forest Disease

    Hailstorms can cause severe damage to all kinds of goods. While mostly the economic damage of hail events is considered, damage to vegetation is harder to quantify due to its complexity and heterogeneity regarding species and types. Little research exists on this topic, which is due to the complexity of hail as a phenomenon itself: because of its small-scale characteristics, only a few in situ measurement systems exist, making it difficult to gather long time series of reliable data. Furthermore, almost no research has yet been undertaken on the follow-up effects of hail damage to plants. This work aims to contribute to this field of science by analyzing the spatial distribution of hail damage in pine plantations in northern Spain. For this purpose, binomial statistical learning methods (Generalized Linear Mixed Model (GLMM) and Generalized Additive Mixed Model (GAMM)) were applied to surveyed "hail damage to trees" distributed across the Basque Country. Climate variables like precipitation, temperature and Potential Incoming Solar Radiation (PISR), extracted from a long-term climate data set with a spatial resolution of 200 m, were used as predictors in the models to explore the relationship between them and the response. Age of the surveyed trees was used as a biological component in the model. Underlying grouping structures (spatial autocorrelation and random effects) in the data were investigated and accounted for in the models. Additionally, the synoptic weather situation of hail occurrence was analyzed using long-term weather station data for the cities of Bilbao, San Sebastian and Vitoria. The prime time for hail occurrence was found to be between November and April. The analysis of the weather station data revealed non-linear relationships between hail occurrence and climatic variables. The GAMM, accounting for the underlying spatial autocorrelation, did not converge. Hence, these results have to be treated with caution due to a violation of the independence assumption of the residuals. Different risk areas were delineated, with the northeast of the Basque Country being most susceptible to "hail damage to trees" (for both models). A considerable decrease in "hail damage to trees" susceptibility was observed along the Cantabrian Range, with very low estimated probabilities of "hail damage to trees" for areas located further south. This finding runs contrary to the absolute occurrence of hail events, which is highest in areas with low estimated probabilities, implying that most of the hail events in this region happen with low destructive energy. A substantial increase in "hail damage to trees" probability was observed in the Generalized Additive Model (GAM) for the top third of the predictor range of precipitation and minimum temperature, with exemplary odds ratios of 7.9 (0.125 m/mm2 - 0.14 m/mm2) and 3.99 (5°C - 6°C), respectively. Estimated probabilities range between 0%-50% for the GLMM and 0%-100% for the GAM. The latter revealed high uncertainties in areas with low precipitation and/or temperature values, pointing to a likely overfitting of the model, which is also confirmed by the large gap between the (100 repetitions, ten-fold) spatial cross-validation result of the training set (0.87) and the test set (0.62). Further research using more environmental variables explaining hail occurrence (e.g. wind speed) is suggested. Also, the outcomes of this work (risk areas, estimated probabilities) need to be compared to analyses using direct hail observations (in contrast to derived observations as in this work) in the Basque Country.

    sperrorest v1.0.0: Parallelized spatial error estimation and variable importance assessment for geospatial machine learning

    Computational and statistical prediction methods such as the support vector machine have gained popularity in remote-sensing applications in recent years and are often compared to more traditional approaches like maximum-likelihood classification. However, the accuracy assessment of such predictive models in a spatial context needs to account for the presence of spatial autocorrelation in geospatial data by using spatial cross-validation and bootstrap strategies instead of their now more widely used non-spatial equivalents. The R package sperrorest by A. Brenning [IEEE International Geoscience and Remote Sensing Symposium, 1, 374 (2012)] provides a generic interface for performing (spatial) cross-validation of any statistical or machine-learning technique available in R. Since spatial statistical models as well as flexible machine-learning algorithms can be computationally expensive, parallel computing strategies are required to perform cross-validation efficiently. The most recent major release of sperrorest therefore comes with two new features (aside from improved documentation). The first one is the parallelized version of sperrorest(), parsperrorest(). This function features two parallel modes to greatly speed up cross-validation runs. Both parallel modes are platform independent and provide progress information. par.mode = 1 relies on the pbapply package and calls interactively (depending on the platform) parallel::mclapply() or parallel::parApply() in the background. While forking is used on Unix systems, Windows systems use a cluster approach for parallel execution. par.mode = 2 uses the foreach package to perform parallelization. This method uses a different way of cluster parallelization than the parallel package does. In summary, the robustness of parsperrorest() is increased with the implementation of two independent parallel modes. The second feature is a new way of partitioning the data in sperrorest, provided by partition.factor.cv(). This function gives the user the possibility to perform cross-validation at the level of some grouping structure. As an example, in remote sensing of agricultural land uses, pixels from the same field contain nearly identical information and will thus be jointly placed in either the test set or the training set. Other spatial resampling strategies are already available and can be extended by the user.
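
Cross-validation at the level of a grouping structure can be sketched in a few lines. The code below is a generic illustration of the idea, not the implementation or interface of partition.factor.cv(); the function name, the field labels, and the round-robin assignment of groups to folds are assumptions.

```python
import random
from collections import defaultdict

def factor_cv_folds(groups, n_folds, seed=0):
    """Assign whole groups (e.g. fields) to folds so no group is split.

    groups holds one label per observation. Returns a list of
    (train_indices, test_indices) pairs, one per fold.
    """
    by_group = defaultdict(list)
    for i, label in enumerate(groups):
        by_group[label].append(i)

    # Shuffle the group labels, then deal them out to folds round-robin
    levels = sorted(by_group)
    random.Random(seed).shuffle(levels)

    folds = []
    for k in range(n_folds):
        test_levels = levels[k::n_folds]
        test = sorted(i for label in test_levels for i in by_group[label])
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        folds.append((train, test))
    return folds

# Hypothetical field label for each pixel, for illustration only
fields = ["f1", "f1", "f2", "f2", "f2", "f3", "f4", "f4"]
folds = factor_cv_folds(fields, n_folds=2)
for train, test in folds:
    print("train:", train, "test:", test)
```

Because every pixel of a field lands on the same side of each split, near-identical pixels cannot leak from the training set into the test set.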