The spatial prediction sandbox - Investigating the use of spatially-explicit modelling and cross-validation strategies in spatial interpolation machine learning problems

Abstract

Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies

Machine Learning (ML) methods are increasingly used for spatial interpolation, and different strategies have been proposed to introduce space into the modelling and validation phases. Nevertheless, a comparison of these methods under different landscape autocorrelation ranges and sampling designs is still missing. This Master Thesis investigates under which scenarios spatially-explicit ML modelling and validation strategies are appropriate for spatial interpolation problems. We designed a framework that allowed us to simulate predictor and outcome spatial fields with different autocorrelation ranges, as well as samples with different numbers of points and distributions. With these data, we tested different non-spatial and spatially-explicit (coordinates, EDF, RFsp) Random Forest ML models and evaluated them using the simulated surfaces as well as different standard (Leave-One-Out, LOO) and spatially-explicit (spatial buffer LOO, sbLOO) Cross-Validation (CV) strategies. We developed a new method called Nearest Distance Matching (NDM) to estimate the appropriate radius for sbLOO CV for spatial interpolation based on sample distribution and landscape range, and compared it to state-of-the-art methods for radius search, which are based only on the range. For short ranges, non-spatial models were superior to spatially-explicit models regardless of the sample size and distribution; for long ranges, spatial models performed better under regular and random sampling designs, but not under clustered and non-uniform designs. CV results indicated that although LOO correctly estimated model performance under random designs, it overestimated errors for regular samples and underestimated errors for clustered and non-uniform designs under long ranges.
sbLOO combined with NDM corrected the error underestimation of LOO in clustered and non-uniform samples, whereas sbLOO based solely on the range resulted in error overestimation for all designs under long ranges. This Master Thesis provides important insights to the field of predictive mapping: it elucidates in which cases spatially-explicit methods may be preferred, establishes that state-of-the-art approaches for spatial CV designed to assess model transferability are not suited for spatial interpolation, and proposes an alternative.
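
As an illustration of the spatial buffer LOO idea mentioned in the abstract, the following is a minimal sketch, not the thesis implementation. It assumes planar (projected) coordinates and Euclidean distance; the function name `sb_loo_splits` is hypothetical. How the radius is chosen (by NDM or from the autocorrelation range) is outside the scope of this sketch, so the radius is simply passed in.

```python
import math

def sb_loo_splits(coords, radius):
    """Yield (train_indices, test_index) pairs for spatial buffer LOO CV.

    For each held-out point, every other point lying within `radius`
    of it is excluded from the training set. With radius 0 and no
    duplicated locations this reduces to standard LOO.
    """
    for i, p in enumerate(coords):
        train = [
            j for j, q in enumerate(coords)
            if j != i and math.dist(p, q) > radius
        ]
        yield train, i

# Illustrative sample: two pairs of nearby points along a line.
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]

for train, test in sb_loo_splits(coords, radius=1.5):
    # Each held-out point loses its close neighbour from the training set.
    print(test, train)
```

A model would then be fitted on each `train` subset and used to predict the held-out point. A larger radius removes more nearby training points, which is consistent with the abstract's finding that a radius based solely on the range can lead to overestimated errors.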
