11 research outputs found

    Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data

    Machine-learning algorithms have gained popularity in recent years in the field of ecological modeling due to their promising predictive performance in classification problems. While the application of such algorithms has been greatly simplified in recent years by their well-documented integration in commonly used statistical programming languages such as R, several practical challenges remain in ecological modeling: unbiased performance estimation, optimization of algorithms through hyperparameter tuning, and spatial autocorrelation. We address these issues by comparing several widely used machine-learning algorithms, namely Boosted Regression Trees (BRT), k-Nearest Neighbor (WKNN), Random Forest (RF) and Support Vector Machine (SVM), with traditional parametric algorithms such as logistic regression (GLM) and semi-parametric ones such as Generalized Additive Models (GAM). Different nested cross-validation methods, including hyperparameter tuning methods, are used to evaluate model performance with the aim of obtaining bias-reduced performance estimates. As a case study, the spatial distribution of the forest disease Diplodia sapinea in the Basque Country in Spain is investigated using common environmental variables such as temperature, precipitation, soil or lithology as predictors. Results show that GAM and RF (mean AUROC estimates 0.708 and 0.699) outperform all other methods in predictive accuracy. The effect of hyperparameter tuning saturates at around 50 iterations for this data set. The AUROC differences between the bias-reduced (spatial cross-validation) and overoptimistic (non-spatial cross-validation) performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%), respectively. It is recommended to also use spatial partitioning for cross-validation hyperparameter tuning of spatial data. The models developed in this study enhance the detection of Diplodia sapinea in the Basque Country compared to previous studies.
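
The contrast between spatial and non-spatial partitioning described in this abstract can be illustrated with a minimal Python sketch. It is not the partitioning method used in the study: the strip-based split, the function names and the random toy coordinates are all assumptions chosen for brevity. The point is only that spatial folds keep test points geographically separated from training points, whereas random folds interleave them.

```python
import random


def random_folds(n_points, k, seed=0):
    """Non-spatial partitioning: shuffle indices and deal them into k folds."""
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]


def spatial_strip_folds(coords, k):
    """Spatial partitioning (illustrative): sort points by x-coordinate and
    cut the ordering into k contiguous strips of near-equal size."""
    order = sorted(range(len(coords)), key=lambda i: coords[i][0])
    size, rest = divmod(len(order), k)
    folds, start = [], 0
    for f in range(k):
        stop = start + size + (1 if f < rest else 0)
        folds.append(order[start:stop])
        start = stop
    return folds


# Hypothetical toy data: 200 random sample locations in a 100 x 100 region.
rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]

spatial = spatial_strip_folds(coords, 5)
nonspatial = random_folds(len(coords), 5)
```

In a nested setup, the same spatial split would be applied again inside each training set for hyperparameter tuning, which is what the abstract's final recommendation refers to.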

    R package qgisprocess: use QGIS processing algorithms

    R package qgisprocess provides seamless access to the QGIS (https://qgis.org/en/site/) processing toolbox using the standalone qgis_process command-line utility. Both native and third-party (plugin) processing providers are supported. Besides referring to data sources by file, common objects from sf, terra and stars are also supported. The native processing algorithms are documented at QGIS.org (https://docs.qgis.org/latest/en/docs/user_manual/processing_algs/). URL: https://r-spatial.github.io/qgisprocess. CRAN landing page: https://CRAN.R-project.org/package=qgisprocess.

    A review of ecological gradient research in the Tropics: identifying research gaps, future directions, and conservation priorities

    The Tropics are global centers of biodiversity. Ecological and land use gradients play a major role in the origin and maintenance of this diversity, yet a comprehensive synthesis of the corresponding large body of literature is still missing. We searched all ISI-listed journals for tropical gradient studies. From the resulting 1023 studies, we extracted study-specific information, and analyzed it using descriptive analytical tools and GLMs. Our results reveal that dry tropical areas are vastly understudied compared to their humid counterparts. The same holds true for large parts of Africa, but also the Philippines and the South Asian region. However, we also found that (applied) research output of developing tropical countries is nowadays on par with the output of developed countries. Vegetation and elevation were the most studied response variable and gradient, respectively. By contrast, inconspicuous organisms such as oribatid mites and edaphic gradients were largely missing in the literature. Regarding biodiversity, tropical gradient studies dealt extensively with species richness and ecosystem diversity, but much less with genetic diversity. We encourage a wider use of modern statistical learning tools such as non-linear (spatio-temporal) regression and classification techniques, and simulations. Finally, we would embrace an even further development of synergies between applied and basic research and between researchers based in developed and in tropical countries. Keywords: Synthesis Tropical ecology Environmental gradient relationships Biodiversit

    Monitoring Forest Health Using Hyperspectral Imagery: Does Feature Selection Improve the Performance of Machine-Learning Techniques?

    This study analyzed highly correlated, feature-rich datasets from hyperspectral remote sensing data using multiple statistical and machine-learning methods. The effect of filter-based feature selection methods on predictive performance was compared. In addition, the effect of multiple expert-based and data-driven feature sets, derived from the reflectance data, was investigated. Defoliation of trees (%), derived from in situ measurements from fall 2016, was modeled as a function of reflectance. Variable importance was assessed using permutation-based feature importance. Overall, the support vector machine (SVM) outperformed other algorithms, such as random forest (RF), extreme gradient boosting (XGBoost), and lasso (L1) and ridge (L2) regressions by at least three percentage points. The combination of certain feature sets showed small increases in predictive performance, while no substantial differences between individual feature sets were observed. For some combinations of learners and feature sets, filter methods achieved better predictive performances than using no feature selection. Ensemble filters did not have a substantial impact on performance. The most important features were located around the red edge. Additional features in the near-infrared region (800–1000 nm) were also essential to achieve the overall best performances. Filter methods have the potential to be helpful in high-dimensional situations and are able to improve the interpretation of feature effects in fitted models, which is an essential constraint in environmental modeling studies. Nevertheless, more training data and replication in similar benchmarking studies are needed to be able to generalize the results
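
Permutation-based feature importance, as mentioned in this abstract, can be sketched as follows. This is a generic illustration and not the study's implementation: the mean squared error measure, the function names, the toy data and the stand-in "fitted model" are all assumptions. The idea is to shuffle one feature at a time and record how much the prediction error increases.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: the response depends on feature 0 only.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(100)]
y = [3.0 * row[0] for row in X]


def predict(row):
    """Stand-in for a fitted model; it ignores feature 1 entirely."""
    return 3.0 * row[0]


importance = permutation_importance(predict, X, y)
```

Here the importance of feature 1 is zero because the stand-in model never uses it, while shuffling feature 0 raises the error.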

    Soil texture and altitude, respectively, largely determine the floristic gradient of the most diverse fog oasis in the Peruvian desert

    Studying species turnover along gradients is a key topic in tropical ecology. Crucial drivers, among others, are fog deposition and soil properties. In northern Peru, a fog-dependent vegetation formation develops on mountains along the hyper-arid coast. Despite their uniqueness, these fog oases are largely uninvestigated. This study addresses the influence of environmental factors on the vegetation of these unique fog oases. Accordingly, vegetation and soil properties were recorded on 66 plots of 4 × 4 m along an altitudinal gradient ranging from 200 to 950 m asl. Ordination and modelling techniques were used to study altitudinal vegetation belts and floristic composition. Four vegetation belts were identified: a low-elevation Tillandsia belt, a herbaceous belt, a bromeliad belt showing highest species richness and an uppermost succulent belt. Different altitudinal levels might reflect water availability, which is highest below the temperature inversion at around 700 m asl. Altitude alone explained 96% of the floristic composition. Soil texture and salinity accounted for 88%. This is in contrast with more humid tropical ecosystems where soil nutrients appear to be more important. In conclusion, this study advances the understanding of tropical gradients in fog-dependent and ENSO-affected ecosystems.