We present a comprehensive experimental study on pretrained feature
extractors for visual out-of-distribution (OOD) detection. We examine several
setups, based on the availability of labels or image captions and using
different combinations of in- and out-distributions. Intriguingly, we find that
(i) contrastive language-image pretrained models achieve state-of-the-art
unsupervised out-of-distribution detection performance using nearest-neighbor
feature similarity as the OOD detection score, (ii) state-of-the-art supervised OOD
detection performance can be obtained without in-distribution fine-tuning, and
(iii) even top-performing billion-scale vision transformers trained with
natural language supervision fail at detecting adversarially manipulated OOD
images. Finally, based on our experiments, we discuss whether new benchmarks for
visual anomaly detection are needed. Using the largest publicly available
vision transformer, we achieve state-of-the-art performance across all 18
reported OOD benchmarks, including an AUROC of 87.6\% (9.2\% gain,
unsupervised) and 97.4\% (1.2\% gain, supervised) for the challenging task of
CIFAR100 $\rightarrow$ CIFAR10 OOD detection. The code will be open-sourced.
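
As context for the unsupervised score in (i), the following is a minimal sketch of nearest-neighbor feature-similarity OOD scoring on frozen pretrained features; all function and variable names are illustrative and not taken from the paper, and in practice the features would come from a pretrained encoder such as a CLIP vision transformer rather than random placeholders.

\begin{verbatim}
import numpy as np

def knn_ood_score(test_feats, train_feats, k=1):
    """Higher score means more in-distribution.

    Scores each test feature by its cosine similarity to the k-th nearest
    in-distribution training feature (k=1 reduces to the maximum similarity).
    """
    # L2-normalize so that dot products equal cosine similarities.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                 # (n_test, n_train) cosine similarities
    topk = np.sort(sims, axis=1)[:, -k:]  # k largest similarities per test sample
    return topk[:, 0]                     # similarity to the k-th nearest neighbor

# Illustrative usage with placeholder features.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 512))   # frozen features of in-distribution data
test_feats = rng.normal(size=(10, 512))      # frozen features of test images
scores = knn_ood_score(test_feats, train_feats, k=1)
# Thresholding these scores yields the OOD decision.
\end{verbatim}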