6,920 research outputs found
Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data
Machine-learning algorithms have gained popularity in recent years in the
field of ecological modeling due to their promising results in predictive
performance of classification problems. While the application of such
algorithms has been highly simplified in the last years due to their
well-documented integration in commonly used statistical programming languages
such as R, there are several practical challenges in the field of ecological
modeling related to unbiased performance estimation, optimization of algorithms
using hyperparameter tuning and spatial autocorrelation. We address these
issues in the comparison of several widely used machine-learning algorithms
such as Boosted Regression Trees (BRT), k-Nearest Neighbor (WKNN), Random
Forest (RF) and Support Vector Machine (SVM) to traditional parametric
algorithms such as logistic regression (GLM) and semi-parametric ones like
generalized additive models (GAM). Different nested cross-validation methods
including hyperparameter tuning methods are used to evaluate model performances
with the aim to receive bias-reduced performance estimates. As a case study the
spatial distribution of forest disease Diplodia sapinea in the Basque Country
in Spain is investigated using common environmental variables such as
temperature, precipitation, soil or lithology as predictors. Results show that
GAM and RF (mean AUROC estimates 0.708 and 0.699) outperform all other methods
in predictive accuracy. The effect of hyperparameter tuning saturates at around
50 iterations for this data set. The AUROC differences between the bias-reduced
(spatial cross-validation) and overoptimistic (non-spatial cross-validation)
performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%),
respectively. It is recommended to also use spatial partitioning for
cross-validation hyperparameter tuning of spatial data
Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data
Machine-learning algorithms have gained popularity in recent years in the field of ecological modeling due to their promising results in predictive performance of classification problems. While the application of such algorithms has been highly simplified in the last years due to their well-documented integration in commonly used statistical programming languages such as R, there are several practical challenges in the field of ecological modeling related to unbiased performance estimation, optimization of algorithms using hyperparameter tuning and spatial autocorrelation. We address these issues in the comparison of several widely used machine-learning algorithms such as Boosted Regression Trees (BRT), kNearest Neighbor (WKNN), Random Forest (RF) and Support Vector Machine (SVM) to traditional parametric algorithms such as logistic regression (GLM) and semi-parametric ones like Generalized Additive Models (GAM). Different nested cross-validation methods including hyperparameter tuning methods are used to evaluate model performances with the aim to receive bias-reduced performance estimates. As a case study the spatial distribution of forest disease (Diplodia sapinea) in the Basque Country in Spain is investigated using common environmental variables such as temperature, precipitation, soil or lithology as predictors.
Results show that GAM and Random Forest (RF) (mean AUROC estimates 0.708 and 0.699) outperform all other methods in predictive accuracy. The effect of hyperparameter tuning saturates at around 50 iterations for this data set. The AUROC differences between the bias-reduced (spatial cross-validation) and overoptimistic (non-spatial cross-validation) performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%), respectively. It is recommended to also use spatial partitioning for cross-validation hyperparameter tuning of spatial data. The models developed in this study enhance the detection of Diplodia sapinea in the Basque Country compared to previous studies
Slice sampling covariance hyperparameters of latent Gaussian models
The Gaussian process (GP) is a popular way to specify dependencies between
random variables in a probabilistic model. In the Bayesian framework the
covariance structure can be specified using unknown hyperparameters.
Integrating over these hyperparameters considers different possible
explanations for the data when making predictions. This integration is often
performed using Markov chain Monte Carlo (MCMC) sampling. However, with
non-Gaussian observations standard hyperparameter sampling approaches require
careful tuning and may converge slowly. In this paper we present a slice
sampling approach that requires little tuning while mixing well in both strong-
and weak-data regimes.Comment: 9 pages, 4 figures, 4 algorithms. Minor corrections to previous
version. This version to appear in Advances in Neural Information Processing
Systems (NIPS) 23, 201
- …