Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models. Comment: 25 pages
Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets
Visual question answering (Visual QA) has attracted a lot of attention
lately, seen essentially as a form of (visual) Turing test that artificial
intelligence should strive to achieve. In this paper, we study a crucial
component of this task: how can we design good datasets for the task? We focus
on the design of multiple-choice based datasets where the learner has to select
the right answer from a set of candidate ones including the target (i.e., the
correct one) and the decoys (i.e., the incorrect ones). Through careful analysis
of the results attained by state-of-the-art learning models and human
annotators on existing datasets, we show that the design of the decoy answers
has a significant impact on how and what the learning models learn from the
datasets. In particular, the resulting learner can ignore the visual
information, the question, or both while still doing well on the task. Inspired
by this, we propose automatic procedures to remedy such design deficiencies. We
apply the procedures to re-construct decoy answers for two popular Visual QA
datasets as well as to create a new Visual QA dataset from the Visual Genome
project, resulting in the largest dataset for this task. Extensive empirical
studies show that the design deficiencies have been alleviated in the remedied
datasets and the performance on them is likely a more faithful indicator of the
difference among learning models. The datasets are released and publicly
available via http://www.teds.usc.edu/website_vqa/. Comment: Accepted for Oral Presentation at NAACL-HLT 201
Learning to segment with image-level supervision
Deep convolutional networks have achieved the state-of-the-art for semantic
image segmentation tasks. However, training these networks requires access to
densely labeled images, which are known to be very expensive to obtain. On the
other hand, the web provides an almost unlimited source of images annotated at
the image level. How can one utilize this much larger weakly annotated set for
tasks that require dense labeling? Prior work often relied on localization
cues, such as saliency maps, objectness priors, bounding boxes etc., to address
this challenging problem. In this paper, we propose a model that generates
auxiliary labels for each image, while simultaneously forcing the output of the
CNN to satisfy the mean-field constraints imposed by a conditional random
field. We show that one can enforce the CRF constraints by forcing the
distribution at each pixel to be close to the distribution of its neighbors.
This is in stark contrast with methods that compute a recursive expansion of
the mean-field distribution using a recurrent architecture and train the
resultant distribution. Instead, the proposed model adds an extra loss term to
the output of the CNN, and hence, is faster than recursive implementations. We
achieve the state-of-the-art for weakly supervised semantic image segmentation
on the VOC 2012 dataset, assuming no manually labeled pixel-level information is
available. Furthermore, the incorporation of conditional random fields in the CNN
incurs little extra time during training. Comment: Published in WACV 201
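The abstract above describes enforcing CRF-style constraints by keeping the class distribution at each pixel close to that of its neighbors, added as an extra loss term. A minimal sketch of such a term follows. The function name, the 4-connected neighborhood, and the use of KL divergence against the neighbors' mean distribution are assumptions made for illustration; the abstract does not specify the paper's actual formulation.

```python
import math


def neighbor_consistency_loss(probs):
    """Average divergence between each pixel's class distribution and the
    mean distribution of its 4-connected neighbors.

    probs: nested list of shape H x W x C, where probs[i][j] sums to 1.
    Hypothetical illustration only, not the paper's loss.
    """
    height, width = len(probs), len(probs[0])
    eps = 1e-12  # avoids log(0) for hard (one-hot) distributions
    total = 0.0
    for i in range(height):
        for j in range(width):
            # Collect the in-bounds 4-connected neighbors (assumed neighborhood).
            neighbors = [
                probs[ni][nj]
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                if 0 <= ni < height and 0 <= nj < width
            ]
            if not neighbors:
                continue  # a 1x1 image has no neighbors to compare against
            num_classes = len(probs[i][j])
            mean = [
                sum(n[c] for n in neighbors) / len(neighbors)
                for c in range(num_classes)
            ]
            # KL(pixel || neighbor mean); the choice of KL is an assumption.
            total += sum(
                p * math.log((p + eps) / (q + eps))
                for p, q in zip(probs[i][j], mean)
            )
    return total / (height * width)


# A spatially smooth prediction versus a checkerboard of hard labels.
smooth = [[[0.7, 0.3], [0.7, 0.3]], [[0.7, 0.3], [0.7, 0.3]]]
noisy = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
smooth_loss = neighbor_consistency_loss(smooth)
noisy_loss = neighbor_consistency_loss(noisy)
```

In this sketch a spatially uniform prediction incurs no penalty, while a prediction that flips class between adjacent pixels is penalized heavily. Because the term is computed in a single pass over the output, it needs no recurrent unrolling, which matches the speed argument made in the abstract.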
Evaluation campaigns and TRECVid
The TREC Video Retrieval Evaluation (TRECVid) is an international benchmarking
activity to encourage research in video information retrieval by providing a
large test collection, uniform scoring procedures, and a forum for
organizations interested in comparing their results. TRECVid completed its
fifth annual cycle at the end of 2005 and in 2006 TRECVid will involve almost
70 research organizations, universities and other consortia. Throughout its
existence, TRECVid has benchmarked both interactive and automatic/manual
searching for shots from within a video corpus, automatic detection of a
variety of semantic and low-level video features, shot boundary detection and
the detection of story boundaries in broadcast TV news. This paper will give
an introduction to information retrieval (IR) evaluation from both a user and
a system perspective, highlighting that system evaluation is by far the most
prevalent type of evaluation carried out. We also include a summary of TRECVid
as an example of a system evaluation benchmarking campaign and this allows us
to discuss whether such campaigns are a good thing or a bad thing. There are
arguments for and against these campaigns and we present some of them in the
paper concluding that on balance they have had a very positive impact on
research progress
- …