Removing out-of-distribution (OOD) images from noisy images scraped from the
Internet is an important preprocessing step for constructing datasets, and it
can be addressed by zero-shot OOD detection with vision-language foundation
models such as CLIP. The existing zero-shot OOD detection setting does not consider the
realistic case where an image has both in-distribution (ID) objects and OOD
objects. However, it is important to identify such images as ID images when
collecting images of rare classes or ethically inappropriate classes that
must not be missed. In this paper, we propose a novel problem setting called
in-distribution (ID) detection, where we identify images containing ID objects
as ID images, even if they contain OOD objects, and images lacking ID objects
as OOD images. To solve this problem, we present a new approach,
\textbf{G}lobal-\textbf{L}ocal \textbf{M}aximum \textbf{C}oncept
\textbf{M}atching (GL-MCM), based on both global and local visual-text
alignments of CLIP features, which can identify any image containing ID objects
as an ID image. Extensive experiments demonstrate that GL-MCM outperforms
comparison methods on both multi-object datasets and single-object ImageNet
benchmarks.
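
As a rough illustration of how global and local visual-text alignments could be combined into a single ID score, consider the following sketch. It is stated under assumptions rather than as a definition: the symbols $x$, $x_p$, $t_k$, $K$, $P$, and $\tau$ are introduced here for illustration, and the additive combination of the two terms is one plausible choice. Let $x$ be the global CLIP image feature, $x_p$ the feature of the $p$-th of $P$ local regions, $t_k$ the text feature of the $k$-th of $K$ ID class prompts, $\mathrm{sim}(\cdot,\cdot)$ the cosine similarity, and $\tau$ a softmax temperature.
\begin{equation}
S(x) =
\max_{1 \le k \le K}
\frac{\exp\bigl(\mathrm{sim}(x, t_k)/\tau\bigr)}
     {\sum_{c=1}^{K} \exp\bigl(\mathrm{sim}(x, t_c)/\tau\bigr)}
+
\max_{1 \le p \le P,\; 1 \le k \le K}
\frac{\exp\bigl(\mathrm{sim}(x_p, t_k)/\tau\bigr)}
     {\sum_{c=1}^{K} \exp\bigl(\mathrm{sim}(x_p, t_c)/\tau\bigr)}
\end{equation}
Under this reading, the first term captures the dominant concept in the whole image, while the second term can remain high when only a single region matches an ID class, so an image with both ID and OOD objects may still receive a high score and be kept as ID once $S(x)$ exceeds a chosen threshold.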