5 research outputs found
Compositional Zero-shot Learning via Progressive Language-based Observations
Compositional zero-shot learning aims to recognize unseen state-object
compositions by leveraging known primitives (state and object) during training.
However, effectively modeling interactions between primitives and generalizing
knowledge to novel compositions remains a perennial challenge. There are two
key factors: object-conditioned and state-conditioned variance, i.e., the
appearance of states (or objects) can vary significantly when combined with
different objects (or states). For instance, the state "old" can signify a
vintage design for a "car" or an advanced age for a "cat". In this paper, we
argue that these variances can be mitigated by predicting composition
categories based on a pre-observed primitive. To this end, we propose Progressive
Language-based Observations (PLO), which can dynamically determine a better
observation order of primitives. These observations comprise a series of
concepts or languages that allow the model to understand image content in a
step-by-step manner. Specifically, PLO adopts pre-trained vision-language
models (VLMs) to empower the model with observation capabilities. We further
devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing
classifier dynamically determines the observation order of two primitives. 2)
PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to
craft composition-specific prompts for step-by-step observing. Extensive
ablations on three challenging datasets demonstrate the superiority of PLO
compared with state-of-the-art methods, affirming its ability in compositional
recognition.
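
As a rough illustration of the two-step PLO-VLM variant, the sketch below assumes precomputed CLIP-style image and primitive text embeddings and a small pre-observing classifier; the conditioning rule and all names are simplified placeholders rather than the paper's implementation.

import torch
import torch.nn.functional as F

def plo_vlm_step(img_feat, state_text_feats, obj_text_feats, pre_observer):
    """img_feat: (D,) image embedding from a frozen VLM.
    state_text_feats: (S, D) and obj_text_feats: (O, D) primitive text embeddings.
    pre_observer: small module scoring which primitive to observe first."""
    img = F.normalize(img_feat, dim=-1)
    # step 1: decide the observation order (0 -> state first, 1 -> object first)
    order = pre_observer(img).argmax(dim=-1).item()

    def score(text_feats, context=None):
        # cosine similarity of each primitive text against the (conditioned) image query
        query = img if context is None else F.normalize(img + context, dim=-1)
        return F.normalize(text_feats, dim=-1) @ query

    if order == 0:
        state_idx = score(state_text_feats).argmax()
        # step 2: condition the object prediction on the observed state
        obj_idx = score(obj_text_feats, context=state_text_feats[state_idx]).argmax()
    else:
        obj_idx = score(obj_text_feats).argmax()
        state_idx = score(state_text_feats, context=obj_text_feats[obj_idx]).argmax()
    return state_idx.item(), obj_idx.item()

# toy usage with random embeddings
D, S, O = 16, 5, 7
pre_observer = torch.nn.Linear(D, 2)
print(plo_vlm_step(torch.randn(D), torch.randn(S, D), torch.randn(O, D), pre_observer))
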
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Recent LLM-driven visual agents mainly focus on solving image-based tasks,
which limits their ability to understand dynamic scenes and keeps them far from
real-life applications such as guiding students in laboratory experiments and
identifying their mistakes. Hence, this paper explores DoraemonGPT, a
comprehensive and conceptually elegant system driven by LLMs to understand
dynamic scenes. Considering the video modality better reflects the
ever-changing nature of real-world scenarios, we exemplify DoraemonGPT as a
video agent. Given a video with a question/task, DoraemonGPT begins by
converting the input video into a symbolic memory that stores task-related
attributes. This structured representation allows for spatial-temporal querying
and reasoning by well-designed sub-task tools, resulting in concise
intermediate results. Recognizing that LLMs have limited internal knowledge
when it comes to specialized domains (e.g., analyzing the scientific principles
underlying experiments), we incorporate plug-and-play tools to access external
knowledge and address tasks across different domains. Moreover, a novel
LLM-driven planner based on Monte Carlo Tree Search is introduced to explore
the large planning space for scheduling various tools. The planner iteratively
finds feasible solutions by backpropagating the result's reward, and multiple
solutions can be summarized into an improved final answer. We extensively
evaluate DoraemonGPT's effectiveness on three benchmarks and several
in-the-wild scenarios. The code will be released at
https://github.com/z-x-yang/DoraemonGPT
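
The tool-scheduling idea can be pictured with a toy Monte Carlo Tree Search over tool sequences, as in the sketch below; the tool names and the hand-written reward are invented placeholders, whereas the real planner lets an LLM propose expansions and judge rewards.

import math, random

class Node:
    def __init__(self, plan, parent=None):
        self.plan, self.parent = plan, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(child, parent, c=1.4):
    # standard UCB1 score; unvisited children are explored first
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts_plan(tools, reward_fn, max_len=3, iters=200):
    root = Node(plan=[])
    for _ in range(iters):
        # selection: walk down the tree by UCB
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # expansion: append one more tool to the partial plan
        if len(node.plan) < max_len:
            node.children = [Node(node.plan + [t], parent=node) for t in tools]
            node = random.choice(node.children)
        # evaluation + backpropagation of the plan's reward
        reward = reward_fn(node.plan)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # read out the most-visited branch as the final plan
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.plan

tools = ["detect_objects", "track_motion", "query_knowledge", "summarize"]
# toy reward: prefer diverse plans that end with a summary
reward_fn = lambda p: len(set(p)) + (1.0 if p and p[-1] == "summarize" else 0.0)
print(mcts_plan(tools, reward_fn))
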
Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models
Pretrained vision-language models, such as CLIP, have demonstrated strong
generalization capabilities, making them promising tools in the realm of
zero-shot visual recognition. Visual relation detection (VRD) is a typical task
that identifies relationship (or interaction) types between object pairs within
an image. However, naively utilizing CLIP with prevalent class-based prompts
for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish
between different fine-grained relation types and it neglects essential spatial
information of the two objects. To this end, we propose a novel method for
zero-shot VRD: RECODE, which solves RElation detection via COmposite
DEscription prompts. Specifically, RECODE first decomposes each predicate
category into subject, object, and spatial components. Then, it leverages large
language models (LLMs) to generate description-based prompts (or visual cues)
for each component. Different visual cues enhance the discriminability of
similar relation categories from different perspectives, which significantly
boosts performance in VRD. To dynamically fuse different cues, we further
introduce a chain-of-thought method that prompts LLMs to generate reasonable
weights for different visual cues. Extensive experiments on four VRD benchmarks
have demonstrated the effectiveness and interpretability of RECODE.
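
The composite-cue scoring can be sketched roughly as follows; the region features, description embeddings, and per-cue weights are random placeholders standing in for CLIP features, LLM-generated descriptions, and chain-of-thought weights.

import torch
import torch.nn.functional as F

def relation_scores(region_feats, cue_feats, cue_weights):
    """region_feats: dict of (D,) features for 'subject', 'object', 'spatial' regions.
    cue_feats: {relation: {component: (K, D) description embeddings}}.
    cue_weights: {relation: {component: float}} fusion weights."""
    scores = {}
    for rel, comps in cue_feats.items():
        total = 0.0
        for comp, descs in comps.items():
            img = F.normalize(region_feats[comp], dim=-1)
            txt = F.normalize(descs, dim=-1)
            # average similarity over the K descriptions of this visual cue
            total = total + cue_weights[rel][comp] * (txt @ img).mean()
        scores[rel] = total
    return scores

# toy usage with random embeddings for two candidate relations
D, K = 16, 3
regions = {c: torch.randn(D) for c in ("subject", "object", "spatial")}
cues = {r: {c: torch.randn(K, D) for c in ("subject", "object", "spatial")}
        for r in ("riding", "holding")}
weights = {r: {"subject": 0.4, "object": 0.4, "spatial": 0.2} for r in cues}
print({r: float(s) for r, s in relation_scores(regions, cues, weights).items()})
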
Compositional Feature Augmentation for Unbiased Scene Graph Generation
Scene Graph Generation (SGG) aims to detect all the visual relation triplets
in a given image. With the emergence of various advanced
techniques for better utilizing both the intrinsic and extrinsic information in
each relation triplet, SGG has achieved great progress in recent years.
However, due to the ubiquitous long-tailed predicate distributions, today's SGG
models are still easily biased to the head predicates. Currently, the most
prevalent debiasing solutions for SGG are re-balancing methods, e.g., changing
the distributions of original training samples. In this paper, we argue that
all existing re-balancing strategies fail to increase the diversity of the
relation triplet features of each predicate, which is critical for robust SGG.
To this end, we propose a novel Compositional Feature Augmentation (CFA)
strategy, which is the first unbiased SGG work to mitigate the bias issue from
the perspective of increasing the diversity of triplet features. Specifically,
we first decompose each relation triplet feature into two components: intrinsic
feature and extrinsic feature, which correspond to the intrinsic
characteristics and extrinsic contexts of a relation triplet, respectively.
Then, we design two different feature augmentation modules to enrich the
feature diversity of original relation triplets by replacing or mixing up
either their intrinsic or extrinsic features from other samples. Due to its
model-agnostic nature, CFA can be seamlessly incorporated into various SGG
frameworks. Extensive ablations have shown that CFA achieves a new
state-of-the-art performance on the trade-off between different metrics.
Comment: Accepted by ICCV 2023
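
A minimal sketch of the augmentation step, assuming each relation triplet feature has already been split into an intrinsic and an extrinsic vector; the mixing modes, shapes, and names below are illustrative placeholders, not the paper's exact design.

import torch

def cfa_augment(intrinsic, extrinsic, donor_intrinsic, donor_extrinsic,
                mode="mix_extrinsic", lam=0.7):
    """Each argument is a (D,) feature; the donor comes from another sample of
    the same (typically tail) predicate; `lam` is the mixup coefficient."""
    if mode == "replace_intrinsic":
        return donor_intrinsic, extrinsic
    if mode == "replace_extrinsic":
        return intrinsic, donor_extrinsic
    if mode == "mix_intrinsic":
        return lam * intrinsic + (1 - lam) * donor_intrinsic, extrinsic
    # default: mix the extrinsic (context) part
    return intrinsic, lam * extrinsic + (1 - lam) * donor_extrinsic

# toy usage: recompose an augmented triplet feature for the SGG head
D = 32
aug_int, aug_ext = cfa_augment(torch.randn(D), torch.randn(D),
                               torch.randn(D), torch.randn(D))
triplet_feat = torch.cat([aug_int, aug_ext])
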
Neural Clustering based Visual Representation Learning
We investigate a fundamental aspect of machine vision, the measurement of
features, by revisiting clustering, one of the most classic approaches in
machine learning and data analysis. Existing visual feature extractors,
including ConvNets, ViTs, and MLPs, represent an image as rectangular regions.
Though prevalent, such a grid-style paradigm is built upon engineering practice
and lacks explicit modeling of data distribution. In this work, we propose
feature extraction with clustering (FEC), a conceptually elegant yet
surprisingly ad-hoc interpretable neural clustering framework, which views
feature extraction as a process of selecting representatives from data and thus
automatically captures the underlying data distribution. Given an image, FEC
alternates between grouping pixels into individual clusters to abstract
representatives and updating the deep features of pixels with current
representatives. Such an iterative working mechanism is implemented in the form
of several neural layers and the final representatives can be used for
downstream tasks. The cluster assignments across layers, which can be viewed
and inspected by humans, make the forward process of FEC fully transparent and
empower it with promising ad-hoc interpretability. Extensive experiments on
various visual recognition models and tasks verify the effectiveness,
generality, and interpretability of FEC. We expect this work will provoke a
rethink of the current de facto grid-style paradigm.
Comment: CVPR 2024. Code: https://github.com/guikunchen/FEC
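
One such layer can be pictured as an alternation between soft pixel-to-representative assignment and representative/feature updates; the sketch below uses a simplified softmax assignment and residual refresh as stand-ins for the paper's neural layers.

import torch
import torch.nn.functional as F

def neural_clustering_layer(pixel_feats, centers, iters=3, tau=0.1):
    """pixel_feats: (N, D) pixel embeddings; centers: (C, D) initial representatives."""
    for _ in range(iters):
        # 1) group pixels: soft assignment by similarity to current representatives
        assign = F.softmax(pixel_feats @ centers.t() / tau, dim=-1)  # (N, C)
        # 2) abstract representatives: assignment-weighted mean of pixel features
        centers = (assign.t() @ pixel_feats) / (assign.sum(dim=0, keepdim=True).t() + 1e-6)
        # 3) update pixel features from their representatives (residual refresh)
        pixel_feats = pixel_feats + assign @ centers
    return pixel_feats, centers, assign  # the assignments are human-inspectable

# toy usage: 64 "pixels", 4 representatives
feats, centers, assign = neural_clustering_layer(torch.randn(64, 16), torch.randn(4, 16))
print(assign.shape)  # torch.Size([64, 4])
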