Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts
composed of known knowledge without training samples. Standard CZSL either
identifies visual primitives or enhances unseen composed entities, and as a
result, entanglement between state and object primitives cannot be fully
utilized. Admittedly, vision-language models (VLMs) could naturally cope with
CZSL through tuning prompts, while uneven entanglement leads prompts to be
dragged into local optimum. In this paper, we take a further step to introduce
a novel Disentangled and Recurrent Prompt Tuning framework termed DRPT to
better tap the potential of VLMs in CZSL. Specifically, the state and object
primitives are deemed as learnable tokens of vocabulary embedded in prompts and
tuned on seen compositions. Instead of jointly tuning state and object, we
devise a disentangled and recurrent tuning strategy to suppress the traction
force caused by entanglement and gradually optimize the token parameters,
leading to a better prompt space. Notably, we develop a progressive fine-tuning
procedure that allows for incremental updates to the prompts, optimizing the
object first, then the state, and vice versa. Meanwhile, the optimization of
state and object is independent, thus clearer features can be learned to
further alleviate the issue of entangling misleading optimization. Moreover, we
quantify and analyze the entanglement in CZSL and supplement entanglement
rebalancing optimization schemes. DRPT surpasses representative
state-of-the-art methods on extensive benchmark datasets, demonstrating
superiority in both accuracy and efficiency