This paper establishes a novel framework for evaluating out-of-distribution
(OOD) detection performance in realistic settings. Our
goal is to expose the shortcomings of existing OOD detection benchmarks and
encourage a necessary shift in research direction toward meeting the
requirements of real-world applications. We expand OOD detection research by
introducing three new OOD test datasets, CIFAR-10-R, CIFAR-100-R, and MVTec-R, which
allow researchers to benchmark OOD detection performance under realistic
distribution shifts. We also introduce a generalizability score to measure a
method's ability to generalize from standard OOD detection test datasets to a
realistic setting. Contrary to existing OOD detection research, we demonstrate
that further performance improvements on standard benchmark datasets do not
increase the usability of such models in the real world. State-of-the-art
(SOTA) methods tested on our realistic, distributionally shifted datasets drop
in performance by up to 45%. This setting is therefore critical for evaluating
the reliability of OOD models before they are deployed in real-world environments.