ViP-CNN: Visual Phrase Guided Convolutional Neural Network
As an intermediate-level task connecting image captioning and object
detection, visual relationship detection has started to attract researchers'
attention because of its descriptive power and clear structure. It detects the
objects and captures their pair-wise interactions with a
subject-predicate-object triplet, e.g. person-ride-horse. In this paper, each
visual relationship is considered as a phrase with three components. We
formulate visual relationship detection as three inter-connected
recognition problems and propose a Visual Phrase guided Convolutional Neural
Network (ViP-CNN) to address them simultaneously. In ViP-CNN, we present a
Phrase-guided Message Passing Structure (PMPS) to establish the connection
among relationship components and help the model consider the three problems
jointly. A corresponding non-maximum suppression method and model training
strategy are also proposed. Experimental results show that our ViP-CNN
outperforms the state-of-the-art method in both speed and accuracy. We further
pretrain ViP-CNN on our cleansed Visual Genome Relationship dataset, which is
found to perform better than pretraining on ImageNet for this task.
Comment: 10 pages, 5 figures, accepted by CVPR 201
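The abstract describes relationships as subject-predicate-object triplets and mentions a corresponding non-maximum suppression method without giving its details. The following is a minimal Python sketch of one plausible form, assuming a greedy scheme in which two triplets with identical labels count as duplicates when the product of their subject-box and object-box IoUs exceeds a threshold. The `Triplet` class, the `triplet_nms` function, the overlap rule, and the default threshold are illustrative assumptions, not the paper's confirmed implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one detected relationship."""
    subject: str
    predicate: str
    obj: str
    subject_box: Box
    object_box: Box
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def triplet_nms(triplets: List[Triplet], threshold: float = 0.25) -> List[Triplet]:
    """Greedy suppression: keep the highest-scoring triplet of each duplicate group.

    Assumed overlap rule: same labels and IoU(subject) * IoU(object) > threshold.
    """
    kept: List[Triplet] = []
    for cand in sorted(triplets, key=lambda t: t.score, reverse=True):
        is_duplicate = any(
            (k.subject, k.predicate, k.obj) == (cand.subject, cand.predicate, cand.obj)
            and iou(k.subject_box, cand.subject_box)
            * iou(k.object_box, cand.object_box) > threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(cand)
    return kept


# Two near-identical person-ride-horse detections and one distinct relationship.
detections = [
    Triplet("person", "ride", "horse", (0, 0, 10, 10), (5, 5, 20, 20), 0.9),
    Triplet("person", "ride", "horse", (1, 0, 10, 10), (5, 5, 20, 20), 0.8),
    Triplet("person", "wear", "hat", (0, 0, 10, 10), (2, 0, 6, 3), 0.7),
]
survivors = triplet_nms(detections)
```

Under these assumptions, the lower-scoring person-ride-horse detection is suppressed and the person-wear-hat triplet survives because its labels differ.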
Modularized Zero-shot VQA with Pre-trained Models
Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In
this paper, we study how to leverage them for zero-shot visual question
answering (VQA). Our approach is motivated by a few observations. First, VQA
questions often require multiple steps of reasoning, which is still a
capability that most PTMs lack. Second, different steps in VQA reasoning chains
require different skills such as object detection and relational reasoning, but
a single PTM may not possess all these skills. Third, recent work on zero-shot
VQA does not explicitly consider multi-step reasoning chains, which makes it
less interpretable than a decomposition-based approach. We propose a
modularized zero-shot network that explicitly decomposes questions into
sub-reasoning steps and is highly interpretable. We convert sub-reasoning tasks
into objectives that PTMs accept and assign each task to a suitable PTM without
any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting
demonstrate the effectiveness of our method and its better interpretability
compared with several baselines.
Comment: accepted as Findings in ACL 202
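The decomposition idea in this abstract can be illustrated with a toy Python sketch: a question is mapped to a chain of sub-reasoning steps, each step is dispatched to a module, and the intermediate results form an inspectable trace. Everything here is a hypothetical stand-in. The hand-written plan table, the `detect`/`relate`/`query` stubs, and the dictionary scene replace the pre-trained models and real images of the paper, whose actual decomposition procedure and module assignment are not described in the abstract.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (skill name, argument)

# Hypothetical scene description standing in for an image.
SCENE = {
    "objects": {"person": {}, "horse": {"color": "brown"}},
    "relations": [("person", "rides", "horse")],
}

# Hypothetical hand-written decomposition; a real system would derive this.
PLANS: Dict[str, List[Step]] = {
    "What color is the animal the person rides?": [
        ("detect", "person"),
        ("relate", "rides"),
        ("query", "color"),
    ],
}


def detect(scene: dict, focus: Optional[str], arg: str) -> Optional[str]:
    """Stub for an object-detection module: return the object if present."""
    return arg if arg in scene["objects"] else None


def relate(scene: dict, focus: Optional[str], arg: str) -> Optional[str]:
    """Stub for a relational module: follow a relation from the focus object."""
    for subj, pred, obj in scene["relations"]:
        if subj == focus and pred == arg:
            return obj
    return None


def query(scene: dict, focus: Optional[str], arg: str) -> Optional[str]:
    """Stub for an attribute module: read an attribute of the focus object."""
    return scene["objects"].get(focus, {}).get(arg)


MODULES: Dict[str, Callable[[dict, Optional[str], str], Optional[str]]] = {
    "detect": detect,
    "relate": relate,
    "query": query,
}


def answer(question: str, scene: dict) -> Tuple[Optional[str], List[Tuple[str, str, Optional[str]]]]:
    """Run each sub-step in order and record a trace of intermediate results."""
    focus: Optional[str] = None
    trace = []
    for skill, arg in PLANS[question]:
        focus = MODULES[skill](scene, focus, arg)
        trace.append((skill, arg, focus))
    return focus, trace


result, trace = answer("What color is the animal the person rides?", SCENE)
```

The trace records which module handled each step and what it returned, which is the kind of step-level visibility a decomposition-based approach offers over a single end-to-end prediction.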