Improving predictions by nonlinear regression models from outlying input data
When applying machine learning/statistical methods to the environmental
sciences, nonlinear regression (NLR) models often perform only slightly better
and occasionally worse than linear regression (LR). The proposed reason for
this conundrum is that NLR models can give predictions much worse than LR when
given input data which lie outside the domain used in model training.
Continuous unbounded variables are widely used in environmental sciences, so
it is not uncommon for new input data to lie far outside the training domain.
For six environmental datasets, inputs in the test data were classified as
"outliers" and "non-outliers" based on the Mahalanobis distance from the
training input data. The prediction scores (mean absolute error, Spearman
correlation) showed NLR to outperform LR for the non-outliers, but often
underperform LR for the outliers. An approach based on Occam's Razor (OR) was
proposed, where linear extrapolation was used instead of nonlinear
extrapolation for the outliers. The linear extrapolation to the outlier domain
was based on the NLR model within the non-outlier domain. This
OR approach reduced occurrences of very poor extrapolation by
NLR, and it tended to outperform NLR and LR for the outliers. In conclusion,
input test data should be screened for outliers. For outliers, the unreliable
NLR predictions can be replaced by OR or LR predictions, or by issuing a "no
reliable prediction" warning.
Comment: 26 pages, 12 figures. Preprint of a paper accepted for publication by
the Journal of Environmental Informatics
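The outlier-screening step described above can be illustrated with a minimal sketch: compute each test point's Mahalanobis distance from the training-input distribution and flag points beyond a cutoff. The function name, the 3.0 cutoff, and the synthetic data below are illustrative assumptions, not the paper's actual thresholds or datasets.

```python
import numpy as np

def mahalanobis_outliers(X_train, X_test, threshold):
    """Flag test inputs whose Mahalanobis distance from the
    training inputs exceeds `threshold` (illustrative cutoff)."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    diff = X_test - mu
    # Squared Mahalanobis distance for each test point
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.sqrt(d2) > threshold

# Example: 2-D Gaussian training inputs, one test point near the
# training domain and one far outside it
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
X_test = np.array([[0.1, -0.2],   # near the training domain
                   [8.0, 8.0]])   # far outside it
print(mahalanobis_outliers(X_train, X_test, threshold=3.0))
```

Points flagged by such a screen would then receive the OR (linear-extrapolation) or LR prediction, or a "no reliable prediction" warning, rather than the raw NLR output.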