Weakly-supervised learning of visual relations
This paper introduces a novel approach for modeling visual relations between
pairs of objects. We define a relation as a triplet of the form (subject, predicate,
object), where the predicate is typically a preposition (e.g. 'under', 'in front
of') or a verb ('hold', 'ride') that links a pair of objects (subject, object).
Learning such relations is challenging as the objects have different spatial
configurations and appearances depending on the relation in which they occur.
Another major challenge is the difficulty of obtaining annotations,
especially at box-level, for all possible triplets, which makes both learning
and evaluation difficult. The contributions of this paper are threefold. First,
we design strong yet flexible visual features that encode the appearance and
spatial configuration for pairs of objects. Second, we propose a
weakly-supervised discriminative clustering model to learn relations from
image-level labels only. Third, we introduce a new challenging dataset of
unusual relations (UnRel) together with an exhaustive annotation that enables
accurate evaluation of visual relation retrieval. We show experimentally that
our model achieves state-of-the-art results on the visual relationship
dataset, significantly improving performance on previously unseen relations
(zero-shot learning), and confirm this observation on our newly introduced
UnRel dataset.
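
To make the idea of a spatial configuration for a pair of objects concrete, the following is a minimal sketch of one possible encoding. It is not the feature used in the paper: the Box class, the spatial_features function, and the choice of components (normalized center offset, log size ratio, and overlap) are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box given by its corners (hypothetical helper)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def width(self) -> float:
        return self.x2 - self.x1

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    @property
    def center(self) -> tuple:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)

    @property
    def area(self) -> float:
        return self.width * self.height


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def spatial_features(subj: Box, obj: Box) -> list:
    """Assumed encoding of the (subject, object) box pair.

    Returns [dx, dy, log width ratio, log height ratio, IoU], where the
    offsets are measured between box centers and normalized by the
    subject box size.
    """
    (sx, sy), (ox, oy) = subj.center, obj.center
    dx = (ox - sx) / subj.width
    dy = (oy - sy) / subj.height
    log_w = math.log(obj.width / subj.width)
    log_h = math.log(obj.height / subj.height)
    return [dx, dy, log_w, log_h, iou(subj, obj)]


# A relation is a (subject, predicate, object) triplet, as in the abstract.
relation = ("person", "ride", "horse")
features = spatial_features(Box(0, 0, 2, 2), Box(1, 1, 3, 3))
```

In a sketch like this, the same pair of object categories yields different feature vectors for 'under' and 'in front of', which is the property the abstract points to when it says spatial configurations depend on the relation.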
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Recent work in vision-and-language pretraining has investigated supervised
signals from object detection data to learn better, fine-grained multimodal
representations. In this work, we take a step further and explore how we can
tap into supervision from small-scale visual relation data. In particular, we
propose two pretraining approaches to contextualise visual entities in a
multimodal setup. With verbalised scene graphs, we transform visual relation
triplets into structured captions, and treat them as additional image
descriptions. With masked relation prediction, we further encourage relating
entities from image regions with visually masked contexts. When applied to
strong baselines pretrained on large amounts of Web data, zero-shot evaluations
on both coarse-grained and fine-grained tasks show the efficacy of our methods
in learning multimodal representations from weakly-supervised relations data.
Comment: EMNLP 202
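
As an illustration of the verbalised scene graph idea described in this abstract, the sketch below turns relation triplets into a single structured caption. The function name verbalise, the "; " separator, and the plain "subject predicate object" template are assumptions for illustration; the abstract does not specify the exact caption format.

```python
from typing import Iterable, Tuple

# A visual relation as a (subject, predicate, object) triplet.
Triplet = Tuple[str, str, str]


def verbalise(triplets: Iterable[Triplet], sep: str = "; ") -> str:
    """Join triplets into one caption-like string.

    Each triplet becomes "subject predicate object"; the separator is a
    hypothetical choice, not taken from the paper.
    """
    return sep.join(f"{s} {p} {o}" for s, p, o in triplets)


scene_graph = [("person", "ride", "horse"), ("cup", "under", "table")]
caption = verbalise(scene_graph)
```

A caption produced this way could then be paired with the image as one more description, alongside whatever captions the pretraining data already provides.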