Referring camouflaged object detection (Ref-COD) is a recently proposed
problem that aims to segment specified camouflaged objects matching a
textual or visual reference. The task involves two major challenges: COD
domain-specific perception and multimodal reference-image alignment. Our
motivation is to make full use of the semantic intelligence and intrinsic
knowledge of recent Multimodal Large Language Models (MLLMs) to decompose this
complex task in a human-like way. Because language is highly condensed and
inductive, linguistic expression is the main medium through which humans learn,
and knowledge is transmitted in a multi-level progression from simple to
complex. In this paper, we propose a large-model-based
Multi-Level Knowledge-Guided multimodal method for Ref-COD, termed MLKG, in
which multi-level knowledge descriptions from an MLLM are organized to guide a
large vision segmentation model to progressively perceive the camouflaged
targets and the camouflage scene while deeply aligning the textual
references with the camouflaged images. To our knowledge, our contributions mainly
include: (1) This is the first study of MLLM knowledge for Ref-COD and COD.
(2) We are the first to propose decomposing Ref-COD into two main
perspectives, perceiving the target and perceiving the scene, by integrating
MLLM knowledge, and we contribute a multi-level knowledge-guided method. (3) Our method
achieves state-of-the-art performance on the Ref-COD benchmark, outperforming numerous
strong competitors. Moreover, thanks to the injected rich knowledge, it
demonstrates zero-shot generalization ability on uni-modal COD datasets. We
will release our code soon.