Learning Human-Human Interactions in Images from Weak Textual Supervision
Interactions between humans are diverse and context-dependent, but previous
works have treated them as categorical, disregarding the heavy tail of possible
interactions. We propose a new paradigm of learning human-human interactions as
free text from a single still image, allowing for flexibility in modeling the
unlimited space of situations and relationships between people. To overcome the
absence of data labelled specifically for this task, we use knowledge
distillation applied to synthetic caption data produced by a large language
model without explicit supervision. We show that the pseudo-labels produced by
this procedure can be used to train a captioning model to effectively
understand human-human interactions in images, as measured by a variety of
metrics capturing textual and semantic faithfulness and factual groundedness
of our predictions. We further show that our approach outperforms SOTA image
captioning and situation recognition models on this task. We will release our
code and pseudo-labels along with Waldo and Wenda, a manually-curated test set
for still image human-human interaction understanding.
Comment: To be presented at ICCV 2023. Project webpage:
https://learning-interactions.github.io
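A minimal sketch of the weak-supervision step described above, assuming pseudo-labels are produced by prompting an LLM to rewrite generic captions into free-text interaction descriptions. The prompt template and the llm_generate interface are illustrative assumptions, not the authors' code.

from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt template for distilling interaction descriptions.
PROMPT = (
    "Rewrite the following image caption as a short free-text description "
    "of the interaction between the people in it:\n{caption}"
)

@dataclass
class PseudoLabeledExample:
    image_path: str
    pseudo_label: str  # LLM-distilled interaction description

def build_pseudo_labels(
    samples: list[tuple[str, str]],      # (image_path, generic caption) pairs
    llm_generate: Callable[[str], str],  # any text-generation backend
) -> list[PseudoLabeledExample]:
    """Distill interaction pseudo-labels from generic captions via an LLM."""
    out = []
    for image_path, caption in samples:
        pseudo = llm_generate(PROMPT.format(caption=caption))
        out.append(PseudoLabeledExample(image_path, pseudo.strip()))
    return out

The resulting (image, pseudo-label) pairs can then be used to fine-tune a standard image-captioning model with its usual cross-entropy objective, which is the distillation-to-captioner step the abstract describes.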
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Key to tasks that require reasoning about natural language in visual contexts
is grounding words and phrases to image regions. However, observing this
grounding in contemporary models is complex, even if it is generally expected
to take place if the task is addressed in a way that is conducive to
generalization. We propose a framework to jointly study task performance and
phrase grounding, and propose three benchmarks to study the relation between
the two. Our results show that contemporary models demonstrate inconsistency
between their ability to ground phrases and solve tasks. We show how this can
be addressed through brute-force training on phrase grounding annotations, and
analyze the dynamics it creates. Code and data are available at
https://github.com/lil-lab/phrase_grounding
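One minimal way to realize such a joint study, assuming a model that returns both a task answer and a predicted region per phrase: score the two side by side. The predict interface and the IoU-based grounding score are illustrative assumptions, not the benchmarks' actual protocol.

from typing import Callable

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)

def area(r: Box) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union between two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def joint_eval(examples: list[dict], predict: Callable) -> tuple[float, float]:
    """Return (task accuracy, mean grounding IoU) over one benchmark split."""
    correct, ious = 0, []
    for ex in examples:
        answer, boxes = predict(ex["image"], ex["text"], ex["phrases"])
        correct += int(answer == ex["gold_answer"])
        ious += [iou(b, g) for b, g in zip(boxes, ex["gold_boxes"])]
    return correct / len(examples), sum(ious) / len(ious)

Comparing the two returned numbers across models exposes the inconsistency the paper reports: high task accuracy need not come with high grounding quality, and vice versa.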
Vox-E: Text-guided Voxel Editing of 3D Objects
Large scale text-guided diffusion models have garnered significant attention
due to their ability to synthesize diverse images that convey complex visual
concepts. This generative power has more recently been leveraged to perform
text-to-3D synthesis. In this work, we present a technique that harnesses the
power of latent diffusion models for editing existing 3D objects. Our method
takes oriented 2D images of a 3D object as input and learns a grid-based
volumetric representation of it. To guide the volumetric representation to
conform to a target text prompt, we follow unconditional text-to-3D methods and
optimize a Score Distillation Sampling (SDS) loss. However, we observe that
combining this diffusion-guided loss with an image-based regularization loss
that encourages the representation not to deviate too strongly from the input
object is challenging, as it requires achieving two conflicting goals while
viewing only structure-and-appearance coupled 2D projections. Thus, we
introduce a novel volumetric regularization loss that operates directly in 3D
space, utilizing the explicit nature of our 3D representation to enforce
correlation between the global structure of the original and edited object.
Furthermore, we present a technique that optimizes cross-attention volumetric
grids to refine the spatial extent of the edits. Extensive experiments and
comparisons demonstrate the effectiveness of our approach in creating a myriad
of edits which cannot be achieved by prior works.
Comment: Project webpage: https://tau-vailab.github.io/Vox-E
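A minimal PyTorch sketch of the two objectives described above: an SDS-style loss that pulls rendered views toward the target prompt, and a regularizer that operates directly on the 3D density grids. The predict_noise denoiser interface and the correlation form of the regularizer are illustrative assumptions (the timestep weighting w(t) is omitted), not the authors' implementation.

import torch
import torch.nn.functional as F

def sds_loss(predict_noise, latents, text_emb, alphas_cumprod):
    """Score Distillation Sampling surrogate: its gradient w.r.t. `latents`
    is proportional to (eps_hat - noise), the standard SDS update direction."""
    t = torch.randint(0, len(alphas_cumprod), (1,))
    a = alphas_cumprod[t]
    noise = torch.randn_like(latents)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    eps_hat = predict_noise(noisy, t, text_emb)  # assumed denoiser interface
    return F.mse_loss(latents, (latents - (eps_hat - noise)).detach())

def volumetric_reg_loss(density_orig: torch.Tensor,
                        density_edit: torch.Tensor) -> torch.Tensor:
    """Operate directly in 3D: keep the edited density grid correlated with
    the frozen original grid, preserving the object's global structure."""
    d_o = density_orig.flatten()
    d_e = density_edit.flatten()
    d_o = (d_o - d_o.mean()) / (d_o.std() + 1e-8)
    d_e = (d_e - d_e.mean()) / (d_e.std() + 1e-8)
    return 1.0 - (d_o * d_e).mean()  # 1 - Pearson correlation

Because the regularizer compares the explicit grids rather than 2D renders, it sidesteps the conflict the abstract notes between the diffusion-guided loss and image-space regularization, which only ever sees structure-and-appearance coupled projections.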