Unified Visual Relationship Detection with Vision and Language Models
This work focuses on training a single visual relationship detector that
predicts over the union of the label spaces of multiple datasets. Merging
labels that span different datasets can be challenging due to inconsistent
taxonomies. The issue is exacerbated in visual relationship detection when
second-order visual semantics are introduced between pairs of objects. To
address this challenge, we propose UniVRD, a novel bottom-up method for Unified
Visual Relationship Detection by leveraging vision and language models (VLMs).
VLMs provide well-aligned image and text embeddings, where similar
relationships are optimized to be close to each other for semantic unification.
Our bottom-up design enables the model to enjoy the benefit of training with
both object detection and visual relationship datasets. Empirical results on
both human-object interaction detection and scene-graph generation demonstrate
the competitive performance of our model. UniVRD achieves 38.07 mAP on
HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP.
More importantly, we show that our unified detector performs as well as
dataset-specific models in mAP, and achieves further improvements when we scale
up the model. Our code will be made publicly available on GitHub.
Comment: Accepted to ICCV 2023. Code is available at
https://github.com/google-research/scenic/tree/main/scenic/projects/univr
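The semantic-unification idea in the abstract, mapping relationship labels from different datasets into a shared text-embedding space where near-synonyms land close together, can be illustrated with a small sketch. The embeddings, label names, and similarity threshold below are made up for illustration; the actual model relies on a pre-trained VLM text encoder rather than hand-written vectors.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def unify_labels(label_embeddings, threshold=0.9):
    """Greedily merge labels whose text embeddings are near-duplicates.

    label_embeddings: dict mapping label name -> embedding vector.
    Returns a dict mapping each label to a canonical representative,
    so e.g. "rides" and "riding" collapse to one unified label.
    """
    canonical = {}
    reps = []  # (name, embedding) of representatives chosen so far
    for name, emb in label_embeddings.items():
        match = next((r for r, e in reps if cosine_sim(emb, e) >= threshold), None)
        if match is None:
            reps.append((name, emb))
            canonical[name] = name
        else:
            canonical[name] = match
    return canonical

# Toy embeddings (hypothetical): "riding"/"rides" nearly parallel, "holding" distinct.
embs = {
    "riding":  np.array([1.0, 0.0, 0.1]),
    "rides":   np.array([0.98, 0.02, 0.12]),
    "holding": np.array([0.0, 1.0, 0.0]),
}
mapping = unify_labels(embs)  # "rides" collapses onto "riding"
```

With real VLM text embeddings, labels from different taxonomies that describe the same relationship would similarly end up merged, which is what lets one detector train over the union of label spaces.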
Energy-based Self-attentive Learning of Abstractive Communities for Spoken Language Understanding
Abstractive community detection is an important spoken language understanding
task, whose goal is to group utterances in a conversation according to whether
they can be jointly summarized by a common abstractive sentence. This paper
provides a novel approach to this task. We first introduce a neural contextual
utterance encoder featuring three types of self-attention mechanisms. We then
train it using the siamese and triplet energy-based meta-architectures.
Experiments on the AMI corpus show that our system outperforms multiple
energy-based and non-energy-based baselines from the state of the art. Code and
data are publicly available.
Comment: Update baseline
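The siamese/triplet energy-based training mentioned in the abstract can be sketched as a standard triplet margin objective over utterance embeddings: utterances from the same abstractive community are pulled together, those from different communities pushed at least a margin apart. The distance function, margin value, and toy vectors below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def energy(a, b):
    # Euclidean distance used as the energy between two utterance embeddings.
    return float(np.linalg.norm(a - b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet energy loss: low energy for same-community pairs,
    high energy (separated by at least `margin`) for cross-community pairs."""
    return max(0.0, energy(anchor, positive) - energy(anchor, negative) + margin)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # utterance from the same abstractive community
negative = np.array([3.0, 0.0])   # utterance from a different community
loss = triplet_loss(anchor, positive, negative)  # well-separated triplet -> 0.0
```

During training, the encoder's parameters would be updated to drive this loss toward zero across many sampled triplets; the siamese variant is the analogous two-input contrastive formulation.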
Vision-by-Language for Training-Free Compositional Image Retrieval
Given an image and a target modification (e.g., an image of the Eiffel Tower
and the text "without people and at night-time"), Compositional Image Retrieval
(CIR) aims to retrieve the relevant target image from a database. While
supervised approaches rely on costly triplet annotations (i.e., query
image, textual modification, and target image), recent research sidesteps this
need by using large-scale vision-language models (VLMs), performing Zero-Shot
CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require
training task-specific, customized models over large amounts of image-text
pairs. In this work, we propose to tackle CIR in a training-free manner via our
Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple,
yet human-understandable and scalable pipeline that effectively recombines
large-scale VLMs with large language models (LLMs). By captioning the reference
image using a pre-trained generative VLM and asking an LLM to recompose the
caption based on the textual target modification for subsequent retrieval via
e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we
find competitive, in part state-of-the-art performance, improving even over
supervised methods. Moreover, the modularity of CIReVL offers simple
scalability without re-training, allowing us to investigate scaling laws
and bottlenecks for ZS-CIR while easily scaling up, in parts to more than double
the previously reported results. Finally, we show that CIReVL makes CIR
human-understandable by composing image and text in a modular fashion in the
language domain, thereby making it intervenable and allowing failure cases to be
re-aligned post hoc. Code will be released upon acceptance.
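The training-free pipeline described above (caption the reference image with a generative VLM, have an LLM recompose the caption according to the modification text, then retrieve with a text-to-image model such as CLIP) can be sketched with stand-in components. Every function body below is a hypothetical stub: in practice each step would call a real captioner, a real LLM, and CLIP similarity scoring, and the database entries here are invented for illustration.

```python
# Hypothetical stand-ins for the three frozen components of the pipeline.
def caption_image(image):
    # A pre-trained captioning VLM would describe the reference image.
    return "a photo of the Eiffel tower with people during the day"

def llm_recompose(caption, modification):
    # An LLM would rewrite the caption to reflect the requested modification.
    return "a photo of the Eiffel tower at night-time with no people"

def clip_retrieve(query_text, database):
    # CLIP would rank database images by text-image similarity; here we
    # fake the scoring with keyword overlap against stored descriptions.
    def score(desc):
        return len(set(query_text.split()) & set(desc.split()))
    return max(database, key=lambda item: score(item["description"]))

def cirevl(image, modification, database):
    """Training-free CIR: caption -> LLM recomposition -> text-to-image retrieval."""
    caption = caption_image(image)
    target_text = llm_recompose(caption, modification)
    return clip_retrieve(target_text, database)

# Toy database standing in for a gallery of candidate images.
db = [
    {"id": 1, "description": "the Eiffel tower crowded with people at noon"},
    {"id": 2, "description": "the Eiffel tower at night-time with no people"},
]
result = cirevl(None, "without people and at night-time", db)  # retrieves id 2
```

Because all reasoning happens in the language domain, the intermediate caption and recomposed query are human-readable, which is what makes the pipeline inspectable and lets a user intervene on failure cases.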
- …