The spatial prediction sandbox - Investigating the use of spatially-explicit modelling and cross-validation strategies in spatial interpolation machine learning problems
Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies

Machine Learning (ML) methods are increasingly used for spatial interpolation and
different strategies have been proposed to introduce space into the modelling and
validation phases. Nevertheless, a comparison of these methods under different
landscape autocorrelation ranges and sampling designs is still missing. This Master
Thesis investigates under which scenarios spatially-explicit ML modelling and
validation strategies are appropriate for spatial interpolation problems.

We designed a framework that allowed us to simulate predictor and outcome spatial
fields with different autocorrelation ranges, as well as samples with different numbers
of points and distributions. With these data, we tested different non-spatial and
spatially-explicit (coordinates, EDF, RFsp) Random Forest ML models and evaluated
them using the simulated surfaces as well as different standard (Leave-One-Out, LOO)
and spatially-explicit (spatial buffer LOO, sbLOO) Cross-Validation
(CV) strategies. We developed a new method called Nearest Distance Matching
(NDM) to estimate the appropriate radius for sbLOO CV for spatial interpolation
based on sample distribution and landscape range, and compared it to state-of-the-art
methods for radius search, which are based only on the range.
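
To illustrate the spatial buffer LOO idea mentioned above, the following is a minimal Python sketch, assuming planar coordinates and a fixed, user-supplied buffer radius. The function names (`sb_loo_splits`, `sb_loo_rmse`, `nearest_neighbour_predict`) and the nearest-neighbour placeholder model are hypothetical and chosen for illustration only; the sketch does not implement NDM or the Random Forest models used in this work.

```python
import math
from typing import Callable, Iterator, List, Sequence, Tuple

Point = Tuple[float, float]


def sb_loo_splits(coords: Sequence[Point], radius: float) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_indices, test_index) pairs for spatial buffer LOO.

    For each held-out point, all other points strictly closer than
    `radius` are excluded from training. With radius = 0 this reduces
    to standard LOO.
    """
    for i, p in enumerate(coords):
        train = [j for j, q in enumerate(coords)
                 if j != i and math.dist(p, q) >= radius]
        yield train, i


def nearest_neighbour_predict(train_coords: Sequence[Point],
                              train_y: Sequence[float],
                              target: Point) -> float:
    """Placeholder model (assumption): value of the closest training point."""
    k = min(range(len(train_coords)),
            key=lambda j: math.dist(train_coords[j], target))
    return train_y[k]


def sb_loo_rmse(coords: Sequence[Point], y: Sequence[float], radius: float,
                fit_predict: Callable[[Sequence[Point], Sequence[float], Point], float]) -> float:
    """RMSE over all folds; folds left with no training points are skipped."""
    sq_errors = []
    for train, test in sb_loo_splits(coords, radius):
        if not train:
            continue
        pred = fit_predict([coords[j] for j in train],
                           [y[j] for j in train],
                           coords[test])
        sq_errors.append((pred - y[test]) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Toy data: three clustered points and one isolated point.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
y = [0.0, 1.0, 2.0, 10.0]

rmse_loo = sb_loo_rmse(coords, y, 0.0, nearest_neighbour_predict)
rmse_sbloo = sb_loo_rmse(coords, y, 1.5, nearest_neighbour_predict)
```

On this toy sample the buffered estimate is larger than the standard LOO estimate, because close neighbours of each held-out point are removed from training. How large the radius should be is exactly the question that radius-search methods such as NDM address; here it is simply passed in as a parameter.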

For short ranges, non-spatial models were superior to spatially-explicit models
regardless of sample size and distribution, whereas for long ranges spatial models performed
better under regular and random sampling designs, but not under clustered and
non-uniform ones. CV results indicated that although LOO correctly estimated model
performance under random designs, it yielded overestimated errors for regular samples
and underestimated errors for clustered and non-uniform designs under long
ranges. Results of sbLOO combined with NDM correctly addressed error underestimation
of LOO in clustered and non-uniform samples, whereas sbLOO based solely
on the range resulted in error overestimation for all designs under long ranges.

This Master Thesis provides important insights to the field of predictive mapping:
it elucidates in which cases spatially-explicit methods may be preferred, establishes
that state-of-the-art approaches for spatial CV designed to assess model
transferability are not suited for spatial interpolation, and proposes an alternative