Safe deployment of AI models requires proactive detection of potential
prediction failures to prevent costly errors. While failure detection in
classification problems has received significant attention, characterizing
failure modes in regression tasks is more challenging and remains comparatively under-explored.
Existing approaches rely on epistemic uncertainties or feature inconsistency
with the training distribution to characterize model risk. However, we show
that uncertainties are necessary but insufficient to accurately characterize
failure, since prediction errors arise from multiple distinct sources. In this paper, we propose PAGER
(Principled Analysis of Generalization Errors in Regressors), a framework to
systematically detect and characterize failures in deep regression models.
Built upon the recently proposed idea of anchoring in deep models, PAGER
unifies both epistemic uncertainties and novel, complementary non-conformity
scores to organize samples into different risk regimes, thereby providing a
comprehensive analysis of model errors. Additionally, we introduce novel
metrics for evaluating failure detectors in regression tasks. We demonstrate
the effectiveness of PAGER on synthetic and real-world benchmarks. Our results
highlight the capability of PAGER to identify regions of accurate
generalization and detect failure cases in out-of-distribution and
out-of-support scenarios.