Revisiting Visual Question Answering Baselines
Visual question answering (VQA) is an interesting learning setting for
evaluating the abilities and shortcomings of current systems for image
understanding. Many of the recently proposed VQA systems include attention or
memory mechanisms designed to support "reasoning". For multiple-choice VQA,
nearly all of these systems train a multi-class classifier on image and
question features to predict an answer. This paper questions the value of these
common practices and develops a simple alternative model based on binary
classification. Instead of treating answers as competing choices, our model
receives the answer as input and predicts whether or not an
image-question-answer triplet is correct. We evaluate our model on the Visual7W
Telling and the VQA Real Multiple Choice tasks, and find that even simple
versions of our model perform competitively. Our best model achieves
state-of-the-art performance on the Visual7W Telling task and compares
surprisingly well with the most complex systems proposed for the VQA Real
Multiple Choice task. We explore variants of the model and study its
transferability between both datasets. We also present an error analysis of our
model that suggests a key problem of current VQA systems lies in the lack of
visual grounding of concepts that occur in the questions and answers. Overall,
our results suggest that the performance of current VQA systems is not
significantly better than that of systems designed to exploit dataset biases.
Comment: European Conference on Computer Visio
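The binary formulation described in this abstract can be illustrated with a minimal sketch. It assumes precomputed feature vectors for the image, question, and each candidate answer, and it uses a plain logistic scorer over their concatenation as a stand-in; the function names, the weights, and the linear scorer are hypothetical and are not taken from the paper.

```python
import math


def score_triplet(image_feat, question_feat, answer_feat, weights, bias=0.0):
    """Return an estimated probability that the (image, question, answer) triplet is correct.

    Assumption: a single logistic unit over the concatenated features.
    This is only an illustrative stand-in for whatever classifier is trained.
    """
    x = list(image_feat) + list(question_feat) + list(answer_feat)
    if len(x) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def pick_answer(image_feat, question_feat, candidate_answers, weights, bias=0.0):
    """Score every candidate answer independently and return (best_index, scores)."""
    scores = [
        score_triplet(image_feat, question_feat, answer, weights, bias)
        for answer in candidate_answers
    ]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy usage with made-up features: only the answer feature carries weight here.
image = [1.0, 0.0]
question = [0.5]
candidates = [[1.0], [-1.0]]
toy_weights = [0.0, 0.0, 0.0, 2.0]
best, all_scores = pick_answer(image, question, candidates, toy_weights)
```

Because each candidate is scored on its own, the answer enters as an input rather than as an output class, which is the contrast with multi-class classification that the abstract draws.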
MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
In this work, we tackle the problem of instance segmentation, the task of
simultaneously solving object detection and semantic segmentation. Towards this
goal, we present a model, called MaskLab, which produces three outputs: box
detection, semantic segmentation, and direction prediction. Building on top of
the Faster-RCNN object detector, the predicted boxes provide accurate
localization of object instances. Within each region of interest, MaskLab
performs foreground/background segmentation by combining semantic and direction
prediction. Semantic segmentation assists the model in distinguishing between
objects of different semantic classes including background, while the direction
prediction, estimating each pixel's direction towards its corresponding center,
allows separating instances of the same semantic class. Moreover, we explore
the effect of incorporating recent successful methods from both segmentation
and detection (i.e. atrous convolution and hypercolumn). Our proposed model is
evaluated on the COCO instance segmentation benchmark and shows performance
comparable to other state-of-the-art models.
Comment: 10 pages including referenc
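The direction idea in this abstract, each pixel pointing towards the center of its own instance, can be sketched as follows. This is a rough illustration under stated assumptions: the instance center is taken to be the centroid of its pixels, directions are quantized into `num_bins` equal angular bins, and label 0 marks background. The function name and these choices are hypothetical and not confirmed by the abstract.

```python
import math


def direction_targets(instance_ids, num_bins=8):
    """Return a grid of quantized direction bins, one per pixel.

    instance_ids: 2D list of ints, 0 for background (assumption), >0 for instances.
    Background pixels get -1. A pixel lying exactly on its centroid gets bin 0,
    an arbitrary choice because its direction is undefined.
    """
    rows, cols = len(instance_ids), len(instance_ids[0])

    # Accumulate row/column sums per instance to compute centroids.
    sums = {}
    for r, row in enumerate(instance_ids):
        for c, inst in enumerate(row):
            if inst == 0:
                continue
            sum_r, sum_c, count = sums.get(inst, (0.0, 0.0, 0))
            sums[inst] = (sum_r + r, sum_c + c, count + 1)
    centers = {k: (sr / n, sc / n) for k, (sr, sc, n) in sums.items()}

    bin_width = 2 * math.pi / num_bins
    targets = [[-1] * cols for _ in range(rows)]
    for r, row in enumerate(instance_ids):
        for c, inst in enumerate(row):
            if inst == 0:
                continue
            center_r, center_c = centers[inst]
            dy, dx = center_r - r, center_c - c
            if dy == 0 and dx == 0:
                targets[r][c] = 0
                continue
            # Angle of the vector from the pixel to its center, mapped to [0, 2*pi).
            angle = math.atan2(dy, dx) % (2 * math.pi)
            targets[r][c] = int(angle // bin_width) % num_bins
    return targets


# Two instances of the same class side by side: their pixels get different bins.
grid = [[1, 0, 2, 2]]
bins = direction_targets(grid)
```

Two adjacent instances of the same class receive different direction patterns around their respective centers, which is the cue the abstract says allows them to be separated.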