Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Many visual recognition models are evaluated only on their classification
accuracy, a metric for which they obtain strong performance. In this paper, we
investigate whether computer vision models can also provide correct rationales
for their predictions. We propose a "doubly right" object recognition
benchmark, where the metric requires the model to simultaneously produce both
the right labels as well as the right rationales. We find that state-of-the-art
visual models, such as CLIP, often provide incorrect rationales for their
categorical predictions. However, by transferring the rationales from language
models into visual representations through a tailored dataset, we show that we
can learn a "why prompt," which adapts large visual representations to
produce correct rationales. Visualizations and empirical experiments show that
our prompts significantly improve performance on doubly right object
recognition, in addition to zero-shot transfer to unseen tasks and datasets.
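
To make the why-prompt idea concrete, here is a minimal PyTorch sketch of contrastive prompt tuning against rationale text. Everything in it is an illustrative assumption rather than the paper's released code: the frozen encoders are stand-ins for a pretrained vision-language model such as CLIP, and the single learnable vector stands in for the learned prompt.

import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 512  # assumed shared embedding dimension of the frozen model

class FrozenEncoder(nn.Module):
    # Placeholder for one tower of a frozen pretrained vision-language
    # model (e.g. CLIP's image or text encoder); hypothetical, shapes only.
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, DIM)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

image_encoder = FrozenEncoder(in_dim=2048)  # stands in for the image tower
text_encoder = FrozenEncoder(in_dim=768)    # stands in for the text tower

# The learnable "why prompt": a vector that shifts image embeddings toward
# the region of text space occupied by correct rationale sentences.
why_prompt = nn.Parameter(torch.zeros(DIM))
optimizer = torch.optim.Adam([why_prompt], lr=1e-3)

def train_step(images, rationales, temperature=0.07):
    # InfoNCE step: the prompted embedding of image i should match the
    # embedding of its own "label because rationale" sentence, row i.
    img = F.normalize(image_encoder(images) + why_prompt, dim=-1)
    txt = text_encoder(rationales)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    loss = F.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy call with random tensors standing in for image features and
# encoded rationale text.
print(train_step(torch.randn(8, 2048), torch.randn(8, 768)))

At evaluation time, under the benchmark described above, a prediction would count as doubly right only when both the label and the retrieved rationale match the ground truth.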
RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces
We present RELATE, a model that learns to generate physically plausible
scenes and videos of multiple interacting objects. Similar to other generative
approaches, RELATE is trained end-to-end on raw, unlabeled data. RELATE
combines an object-centric GAN formulation with a model that explicitly
accounts for correlations between individual objects. This allows the model to
generate realistic scenes and videos from a physically-interpretable
parameterization. Furthermore, we show that modeling the object correlation is
necessary to learn to disentangle object positions and identity. We find that
RELATE is also amenable to physically realistic scene editing and that it
significantly outperforms prior art in object-centric scene generation on both
synthetic (CLEVR, ShapeStacks) and real-world (cars) data. In addition, in
contrast to state-of-the-art methods in object-centric generative modeling,
RELATE also extends naturally to dynamic scenes and generates videos of high
visual fidelity. Source code, datasets and more results are available at
http://geometry.cs.ucl.ac.uk/projects/2020/relate/
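
The sketch below illustrates the core architectural point the abstract makes: object positions are not sampled independently but are corrected by a module conditioned on all object latents before the generator composes the scene. The module sizes, object count, and crude additive compositing are assumptions for illustration, not the released architecture (see the project page above for the real code).

import torch
import torch.nn as nn

N_OBJ, Z_DIM = 3, 32  # assumed object count and per-object latent size

class CorrelationModule(nn.Module):
    # Refines independently sampled positions using the full set of object
    # latents, so placements can respect inter-object structure.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_OBJ * (Z_DIM + 2), 128), nn.ReLU(),
            nn.Linear(128, N_OBJ * 2),
        )

    def forward(self, z_obj, pos_init):
        # z_obj: (B, N_OBJ, Z_DIM), pos_init: (B, N_OBJ, 2)
        flat = torch.cat([z_obj, pos_init], dim=-1).flatten(1)
        offsets = self.net(flat).view(-1, N_OBJ, 2)
        return pos_init + offsets  # correlated corrections to each position

class Generator(nn.Module):
    # Decodes each (latent, position) pair and composes a scene; a real
    # model would use a convolutional decoder and learned compositing.
    def __init__(self):
        super().__init__()
        self.correlate = CorrelationModule()
        self.decode = nn.Sequential(
            nn.Linear(Z_DIM + 2, 256), nn.ReLU(),
            nn.Linear(256, 3 * 16 * 16), nn.Tanh(),
        )

    def forward(self, batch_size):
        z_obj = torch.randn(batch_size, N_OBJ, Z_DIM)    # object identities
        pos0 = torch.rand(batch_size, N_OBJ, 2) * 2 - 1  # independent draws
        pos = self.correlate(z_obj, pos0)                # correlated positions
        feats = torch.cat([z_obj, pos], dim=-1)
        parts = self.decode(feats).view(batch_size, N_OBJ, 3, 16, 16)
        return parts.sum(dim=1).clamp(-1, 1)             # crude compositing

fake = Generator()(batch_size=4)  # (4, 3, 16, 16); trained adversarially
print(fake.shape)

Keeping identity (z_obj) separate from position (pos), with the correlation step between sampling and rendering, is the disentanglement the abstract argues is necessary, and it is what makes the scene-editing and dynamic extensions described above possible.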