CUDA-GR: Controllable Unsupervised Domain Adaptation for Gaze Redirection
The aim of gaze redirection is to manipulate the gaze in an image to the
desired direction. However, existing methods are inadequate at generating
perceptually plausible images. Advances in generative adversarial networks
have shown excellent results in generating photo-realistic images; still,
they lack the ability to provide fine-grained control over individual image
attributes. Enabling such control requires ground-truth annotations for the
training data, which can be very expensive to obtain. In this paper,
we propose an unsupervised domain adaptation framework, called CUDA-GR, that
learns to disentangle gaze representations from the labeled source domain and
transfers them to an unlabeled target domain. Our method enables fine-grained
control over gaze directions while preserving the appearance information of the
person. We show that the generated image-label pairs in the target domain are
effective in knowledge transfer and can boost the performance of the downstream
tasks. Extensive experiments on benchmark datasets show that the proposed
method outperforms state-of-the-art techniques in both quantitative and
qualitative evaluation.
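To make the disentangle-and-transfer idea concrete, below is a minimal
PyTorch sketch: a gaze encoder supervised only on the labeled source domain,
plus a reconstruction loss on the unlabeled target domain. The toy
architectures, loss terms, and synthetic batches are illustrative
assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    IMG = 3 * 32 * 32  # flattened toy image size (assumption)

    # Hypothetical modules: a gaze code, an appearance code, and a
    # generator that rebuilds an image from the two disentangled codes.
    gaze_enc = nn.Sequential(nn.Linear(IMG, 64), nn.ReLU(), nn.Linear(64, 2))
    app_enc = nn.Sequential(nn.Linear(IMG, 64), nn.ReLU(), nn.Linear(64, 32))
    generator = nn.Sequential(nn.Linear(32 + 2, 64), nn.ReLU(), nn.Linear(64, IMG))

    params = (list(gaze_enc.parameters()) + list(app_enc.parameters())
              + list(generator.parameters()))
    opt = torch.optim.Adam(params, lr=1e-4)

    for step in range(100):
        src_img = torch.randn(8, IMG)   # labeled source batch (synthetic here)
        src_gaze = torch.randn(8, 2)    # ground-truth pitch/yaw for source
        tgt_img = torch.randn(8, IMG)   # unlabeled target batch

        # Source domain: the gaze code is directly supervised by labels.
        loss_gaze = F.mse_loss(gaze_enc(src_img), src_gaze)

        # Target domain: no labels, so train by reconstruction -- the
        # generator must rebuild the image from its disentangled codes.
        z = torch.cat([app_enc(tgt_img), gaze_enc(tgt_img)], dim=1)
        loss_recon = F.l1_loss(generator(z), tgt_img)

        # Redirection at test time would swap the gaze code for a desired
        # direction; adversarial/consistency terms are omitted here.
        loss = loss_gaze + loss_recon
        opt.zero_grad()
        loss.backward()
        opt.step()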
Evaluating Multi-Agent Coordination Abilities in Large Language Models
A pivotal aim in contemporary AI research is to develop agents proficient in
multi-agent coordination, enabling effective collaboration with both humans and
other systems. Large Language Models (LLMs), with their notable ability to
understand, generate, and interpret language in a human-like manner, stand out
as promising candidates for the development of such agents. In this study, we
build and assess the effectiveness of agents crafted using LLMs in various
coordination scenarios. We introduce the LLM-Coordination (LLM-Co) Framework,
specifically designed to enable LLMs to play coordination games. With the
LLM-Co framework, we conduct our evaluation with three game environments and
organize the evaluation into five aspects: Theory of Mind, Situated Reasoning,
Sustained Coordination, Robustness to Partners, and Explicit Assistance. First,
the evaluation of Theory of Mind and Situated Reasoning reveals the
capability of LLMs to infer a partner's intentions and reason about actions
accordingly. Then, the evaluation of Sustained Coordination and Robustness
to Partners further showcases the ability of LLMs to coordinate with an unknown
partner in complex long-horizon tasks, outperforming Reinforcement Learning
baselines. Lastly, to test Explicit Assistance, which refers to the ability of
an agent to offer help proactively, we introduce two novel layouts into the
Overcooked-AI benchmark, examining whether agents can prioritize helping
their partners, sacrificing time that could have been spent on their own
tasks. This research underscores the promising capabilities of LLMs in
sophisticated coordination environments and reveals their potential for
building strong real-world agents for multi-agent coordination.
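As a rough illustration of how such a framework can let an LLM play a
coordination game, here is a hedged Python sketch of one decision step: the
agent sees a textual state and the partner's last action, and an LLM picks
among legal moves. The prompt wording and the query_llm stub are
assumptions, not the paper's code.

    from dataclasses import dataclass

    @dataclass
    class GameState:
        description: str           # textual rendering of the environment
        legal_actions: list[str]   # actions the agent may take this turn

    def query_llm(prompt: str) -> str:
        """Stub for a chat-completion call; swap in a real LLM client."""
        return "wait"  # canned placeholder answer

    def choose_action(state: GameState, partner_last_action: str) -> str:
        prompt = (
            "You are cooperating with a partner on a shared task.\n"
            f"Current state: {state.description}\n"
            f"Partner's last action: {partner_last_action}\n"
            f"Legal actions: {', '.join(state.legal_actions)}\n"
            "First infer what your partner is trying to do, then reply with "
            "the single legal action that best helps the team."
        )
        answer = query_llm(prompt).strip()
        # Guard against free-form output: fall back to a safe default.
        return answer if answer in state.legal_actions else state.legal_actions[0]

    state = GameState("Pot needs two onions; partner is holding an onion.",
                      ["fetch onion", "fetch plate", "wait"])
    print(choose_action(state, "moved toward the pot"))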
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
Large Language Models (LLMs) have garnered significant attention for their
advancements in natural language processing, demonstrating unparalleled prowess
in text comprehension and generation. Yet, the simultaneous generation of
images with coherent textual narratives remains an evolving frontier. In
response, we introduce an innovative interleaved vision-and-language generation
technique anchored by the concept of "generative vokens," acting as the bridge
for harmonized image-text outputs. Our approach is characterized by a
distinctive two-stage training strategy focusing on description-free
multimodal generation, where the training requires no comprehensive
descriptions of images. To bolster model integrity, classifier-free guidance is
incorporated, enhancing the effectiveness of vokens on image generation. Our
model, MiniGPT-5, exhibits substantial improvement over the baseline Divter
model on the MMDialog dataset and consistently delivers superior or comparable
multimodal outputs in human evaluations on the VIST dataset, highlighting its
efficacy across diverse benchmarks.
Comment: 20 pages, 9 figures
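The classifier-free guidance mentioned above follows the standard recipe of
blending conditional and unconditional denoiser predictions at sampling
time. Below is a minimal Python sketch with a toy denoiser; the guidance
scale and tensor shapes are assumptions, not MiniGPT-5's actual
configuration.

    import torch

    def denoise(x_t, cond):
        """Stand-in for a diffusion denoiser; cond=None means unconditional."""
        bias = 0.0 if cond is None else cond.mean()
        return x_t * 0.9 + bias  # toy prediction, not a real U-Net

    def guided_prediction(x_t, voken_cond, scale=3.0):
        # Classifier-free guidance:
        # eps_hat = eps_uncond + scale * (eps_cond - eps_uncond)
        eps_uncond = denoise(x_t, None)
        eps_cond = denoise(x_t, voken_cond)
        return eps_uncond + scale * (eps_cond - eps_uncond)

    x_t = torch.randn(1, 4, 8, 8)    # noisy latent (toy shape)
    voken = torch.randn(1, 8, 768)   # "generative voken" features (assumed shape)
    print(guided_prediction(x_t, voken).shape)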
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
In our work, we explore the synergistic capabilities of pre-trained
vision-and-language models (VLMs) and large language models (LLMs) for visual
commonsense reasoning (VCR). We categorize the problem of VCR into visual
commonsense understanding (VCU) and visual commonsense inference (VCI). For
VCU, which involves perceiving the literal visual content, pre-trained VLMs
exhibit strong cross-dataset generalization. On the other hand, in VCI, where
the goal is to infer conclusions beyond image content, VLMs face difficulties.
We find that a baseline where VLMs provide perception results (image captions)
to LLMs leads to improved performance on VCI. However, we identify a challenge
with VLMs' passive perception, which often misses crucial context information,
leading to incorrect or uncertain reasoning by LLMs. To mitigate this issue, we
suggest a collaborative approach where LLMs, when uncertain about their
reasoning, actively direct VLMs to concentrate on and gather relevant visual
elements to support potential commonsense inferences. In our method, named
ViCor, pre-trained LLMs serve as problem classifiers to analyze the problem
category, VLM commanders to leverage VLMs differently based on the problem
classification, and visual commonsense reasoners to answer the question,
while VLMs perform visual recognition and understanding. We evaluate our
framework on
two VCR benchmark datasets and outperform all other methods that do not require
in-domain supervised fine-tuning.
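To illustrate the described control flow, here is a hedged Python sketch: an
LLM first classifies the question as VCU or VCI, answers VCU questions from
the caption, and for VCI questions directs the VLM to gather specific
evidence first. The llm/vlm stubs and prompt wording are illustrative
assumptions, not ViCor's actual prompts.

    def llm(prompt: str) -> str:
        """Stub for a chat LLM; replace with a real client."""
        return "VCI" if "VCU" in prompt else "stub answer"

    def vlm_caption(image) -> str:
        return "a person standing under an awning"  # stub VLM perception

    def vlm_inspect(image, query: str) -> str:
        return "the street is wet"  # stub directed perception

    def vicor_answer(image, question: str) -> str:
        caption = vlm_caption(image)
        kind = llm("Is this literal perception (VCU) or commonsense "
                   f"inference (VCI)? Question: {question}")
        if kind == "VCU":
            # Literal content: a caption-grounded answer usually suffices.
            return llm(f"Caption: {caption}\nAnswer: {question}")
        # VCI: the LLM names the missing visual evidence and directs the
        # VLM to gather it before committing to an answer.
        evidence_query = llm(f"Caption: {caption}\nQuestion: {question}\n"
                             "Which visual detail would settle this?")
        evidence = vlm_inspect(image, evidence_query)
        return llm(f"Caption: {caption}\nEvidence: {evidence}\n"
                   f"Answer: {question}")

    print(vicor_answer(None, "Did it rain recently?"))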
ComCLIP: Training-Free Compositional Image and Text Matching
Contrastive Language-Image Pretraining (CLIP) has demonstrated great
zero-shot performance for image-text matching because of its holistic use of
natural language supervision that covers large-scale, open-world visual
concepts. However, it is still challenging to adapt CLIP to compositional
image and text matching -- a more demanding matching task that requires the
model to understand compositional word concepts and visual components.
Towards better compositional generalization in zero-shot image and text
matching, in this paper, we study the problem from a causal perspective: the
erroneous semantics of individual entities are essentially confounders that
cause the matching failure. Therefore, we propose a novel training-free
compositional CLIP model (ComCLIP). ComCLIP disentangles input images into
subjects, objects, and action sub-images and composes CLIP's vision encoder and
text encoder to perform evolving matching over compositional text embedding and
sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations
introduced by the pretrained CLIP models and dynamically assess the
contribution of each entity when performing image and text matching.
Experiments on compositional image-text matching on SVO and ComVG and general
image-text retrieval on Flickr8K demonstrate the effectiveness of our
plug-and-play method, which boosts CLIP's zero-shot inference ability even
without further training or fine-tuning.
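A rough sketch of the scoring idea: embed the subject, object, and action
sub-images alongside the matching text pieces and aggregate per-entity
similarities into one matching score. The stub encoder and the simple mean
aggregation are assumptions; ComCLIP composes the actual CLIP encoders and
weights entity contributions dynamically.

    import numpy as np

    rng = np.random.default_rng(0)

    def encode(x: str) -> np.ndarray:
        """Stub for a CLIP image/text encoder; returns a unit vector."""
        v = rng.standard_normal(16)
        return v / np.linalg.norm(v)

    def comclip_score(image_parts: dict, text_parts: dict) -> float:
        # Each role ("full", "subject", "action", "object") is scored
        # independently, then the per-entity similarities are averaged.
        sims = [float(encode(image_parts[r]) @ encode(text_parts[r]))
                for r in text_parts if r in image_parts]
        return float(np.mean(sims))

    image_parts = {"full": "img.png", "subject": "img_subj.png",
                   "action": "img_act.png", "object": "img_obj.png"}
    text_parts = {"full": "a dog chasing a ball", "subject": "a dog",
                  "action": "chasing", "object": "a ball"}
    print(comclip_score(image_parts, text_parts))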