ProCC: Progressive Cross-primitive Consistency for Open-World Compositional Zero-Shot Learning
Open-World Compositional Zero-shot Learning (OW-CZSL) aims to recognize novel
compositions of state and object primitives in images with no priors on the
compositional space, which induces a tremendously large output space containing
all possible state-object compositions. Existing works either learn the joint
compositional state-object embedding or predict simple primitives with separate
classifiers. However, the former relies heavily on external word embedding
methods, while the latter ignores the interactions between interdependent
primitives. In this paper, we revisit the primitive prediction approach and
propose a novel method, termed Progressive Cross-primitive Consistency (ProCC),
to mimic the human learning process for OW-CZSL tasks. Specifically, the
cross-primitive consistency module explicitly learns to model the interactions
of state and object features with trainable memory units, which efficiently
acquires cross-primitive visual attention and avoids cross-primitive
feasibility scores. Moreover, considering the partial-supervision setting
(pCZSL) as well as the imbalance issue in multi-task prediction, we design
a progressive training paradigm to enable the primitive classifiers to interact
to obtain discriminative information in an easy-to-hard manner. Extensive
experiments on three widely used benchmark datasets demonstrate that our method
outperforms other representative methods on both OW-CZSL and pCZSL settings by
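To make the size of the open-world output space concrete, here is a minimal sketch of the separate-classifier baseline the abstract contrasts against: every state-object pair is scored as the product of independently predicted primitive probabilities. This is not ProCC itself. The function name `open_world_scores` and the toy probabilities are hypothetical, chosen only for illustration.

```python
from itertools import product


def open_world_scores(state_probs, object_probs):
    """Score every state-object pair from two independent primitive classifiers.

    Assumption: the pair score is the plain product of the two probabilities,
    so no interaction between state and object is modeled.
    """
    return {
        (state, obj): p_state * p_obj
        for (state, p_state), (obj, p_obj) in product(
            state_probs.items(), object_probs.items()
        )
    }


# Toy classifier outputs (illustrative values, not from any dataset).
state_probs = {"wet": 0.7, "dry": 0.3}
object_probs = {"dog": 0.6, "road": 0.4}

scores = open_world_scores(state_probs, object_probs)
best = max(scores, key=scores.get)  # highest-scoring composition
```

With |S| states and |O| objects the dictionary holds |S| x |O| entries, which is why the open-world setting scales poorly, and the product form shows why independent classifiers cannot express that some pairs are more plausible than others.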
DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning
Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts
composed of known knowledge without training samples. Standard CZSL either
identifies visual primitives or enhances unseen composed entities, and as a
result, entanglement between state and object primitives cannot be fully
utilized. Admittedly, vision-language models (VLMs) can naturally cope with
CZSL through prompt tuning, but uneven entanglement drags the prompts into a
local optimum. In this paper, we take a further step and introduce
a novel Disentangled and Recurrent Prompt Tuning framework termed DRPT to
better tap the potential of VLMs in CZSL. Specifically, the state and object
primitives are treated as learnable vocabulary tokens embedded in prompts and
tuned on seen compositions. Instead of jointly tuning state and object, we
devise a disentangled and recurrent tuning strategy to suppress the traction
force caused by entanglement and gradually optimize the token parameters,
leading to a better prompt space. Notably, we develop a progressive fine-tuning
procedure that allows for incremental updates to the prompts, optimizing the
object first, then the state, and vice versa. Meanwhile, the state and object
are optimized independently, so clearer features can be learned, further
alleviating the problem of entanglement misleading the optimization. Moreover, we
quantify and analyze the entanglement in CZSL and supplement entanglement
rebalancing optimization schemes. DRPT surpasses representative
state-of-the-art methods on extensive benchmark datasets, demonstrating
superiority in both accuracy and efficiency.
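The alternating schedule described above (tune one primitive while the other is frozen, then swap) can be sketched as plain block coordinate descent on a toy objective. This is only an illustration under stated assumptions: the quadratic `loss`, its coupling term, and the hyperparameters `rounds`, `steps`, and `lr` are hypothetical and do not come from the DRPT paper, where the tuned parameters are prompt tokens of a vision-language model.

```python
def loss(state, obj):
    # Toy coupled objective; the 0.5 * state * obj term stands in for entanglement.
    return (state - 1.0) ** 2 + (obj + 2.0) ** 2 + 0.5 * state * obj


def grad_state(state, obj):
    # Partial derivative of loss with respect to the state parameter.
    return 2.0 * (state - 1.0) + 0.5 * obj


def grad_obj(state, obj):
    # Partial derivative of loss with respect to the object parameter.
    return 2.0 * (obj + 2.0) + 0.5 * state


def alternate_tuning(state, obj, rounds=20, steps=10, lr=0.1):
    """Update the object parameter first, then the state, and keep alternating.

    Within a round only one parameter moves; the other is held fixed.
    """
    for round_idx in range(rounds):
        tune_object = round_idx % 2 == 0
        for _ in range(steps):
            if tune_object:
                obj -= lr * grad_obj(state, obj)
            else:
                state -= lr * grad_state(state, obj)
    return state, obj


initial = (0.0, 0.0)
state, obj = alternate_tuning(*initial)
```

On this convex toy problem the alternation converges to the joint minimizer (state = 1.6, obj = -2.4); on a real, non-convex prompt-tuning loss no such guarantee holds.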
Graph Knows Unknowns: Reformulate Zero-Shot Learning as Sample-Level Graph Recognition
Zero-shot learning (ZSL) is an extreme case of transfer learning that aims to recognize samples (e.g., images) of unseen classes using a training set that covers only seen classes together with auxiliary knowledge (e.g., semantic descriptors). Existing methods usually construct a visual-to-semantics mapping based on features extracted from each whole sample. However, since the visual and semantic spaces are inherently independent and may lie on different manifolds, these methods can easily suffer from the domain bias problem due to the knowledge transfer from seen to unseen classes. Unlike existing works, this paper investigates fine-grained ZSL from the novel perspective of a sample-level graph. Specifically, we decompose an input into several fine-grained elements and construct a graph structure per sample to measure and utilize element-granularity relations within each sample. Taking advantage of recently developed graph neural networks (GNNs), we formulate the ZSL problem as a graph-to-semantics mapping task, which can better exploit element-semantics correlation and local sub-structural information in samples. Experimental results on widely used benchmark datasets demonstrate that the proposed method can mitigate the domain bias problem and achieve competitive performance against other representative methods.
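The per-sample graph idea can be illustrated with a small dependency-free sketch: element features become nodes, edges connect sufficiently similar elements, one round of neighbor averaging mixes local structure, and a mean readout gives a single vector that a graph-to-semantics mapping could consume. Everything specific here is an assumption rather than the paper's design: the cosine-similarity edge rule, the `threshold` value, mean aggregation in place of a trained GNN layer, and the toy two-dimensional element features.

```python
import math


def cosine(u, v):
    # Cosine similarity between two feature vectors (0.0 if either is all zeros).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_sample_graph(elements, threshold=0.5):
    # Adjacency list: connect element i to every other element j that is similar enough.
    n = len(elements)
    return [
        [j for j in range(n) if j != i and cosine(elements[i], elements[j]) >= threshold]
        for i in range(n)
    ]


def message_pass(elements, adjacency):
    # One round of mean aggregation over each node and its neighbors.
    updated = []
    for i, feat in enumerate(elements):
        group = [feat] + [elements[j] for j in adjacency[i]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated


def readout(node_feats):
    # Mean-pool node features into one graph-level vector.
    return [sum(col) / len(node_feats) for col in zip(*node_feats)]


# Three toy "fine-grained elements" of one sample (illustrative values).
elements = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
adjacency = build_sample_graph(elements)
graph_vec = readout(message_pass(elements, adjacency))
```

In this toy sample the first two elements are linked and the third stays isolated, so the graph-level vector reflects which elements cluster together rather than only a whole-sample average of raw features.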