Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets
Visual question answering (Visual QA) has recently attracted considerable
attention and is often viewed as a form of (visual) Turing test that artificial
intelligence should strive to pass. In this paper, we study a crucial
component of this task: how can we design good datasets for it? We focus
on the design of multiple-choice based datasets where the learner has to select
the right answer from a set of candidate answers that includes the target (i.e.,
the correct one) and the decoys (i.e., the incorrect ones). Through careful analysis
of the results attained by state-of-the-art learning models and human
annotators on existing datasets, we show that the design of the decoy answers
has a significant impact on how and what the learning models learn from the
datasets. In particular, the resulting learner can ignore the visual
information, the question, or both while still doing well on the task. Inspired
by this, we propose automatic procedures to remedy such design deficiencies. We
apply the procedures to re-construct decoy answers for two popular Visual QA
datasets as well as to create a new Visual QA dataset from the Visual Genome
project, resulting in the largest dataset for this task. Extensive empirical
studies show that the design deficiencies have been alleviated in the remedied
datasets and the performance on them is likely a more faithful indicator of the
difference among learning models. The datasets are released and publicly
available via http://www.teds.usc.edu/website_vqa/.
Comment: Accepted for Oral Presentation at NAACL-HLT 2018
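The failure mode described in the abstract can be probed directly: if a model that never looks at the image (or at the question) still picks the right candidate far above chance, the decoys are doing too little work. Below is a minimal sketch of such a "blind" baseline probe; the dataset layout, the scoring interface, and the answer-prior lookup are hypothetical illustrations, not the authors' released code.

    from typing import Callable, Dict, List

    # Placeholder answer prior; a real probe would estimate answer frequencies
    # from the training split.
    ANSWER_PRIOR: Dict[str, float] = {}

    def multiple_choice_accuracy(
        examples: List[Dict],                  # each: {"image", "question", "candidates", "target_idx"}
        score: Callable[[Dict, str], float],   # scores one candidate answer for one example
    ) -> float:
        """Fraction of examples where the highest-scoring candidate is the target."""
        correct = 0
        for ex in examples:
            scores = [score(ex, cand) for cand in ex["candidates"]]
            if max(range(len(scores)), key=scores.__getitem__) == ex["target_idx"]:
                correct += 1
        return correct / max(len(examples), 1)

    def answer_only_score(ex: Dict, candidate: str) -> float:
        # "Blind" probe: ignores both the image and the question.
        return ANSWER_PRIOR.get(candidate, 0.0)

    # If multiple_choice_accuracy(dev_set, answer_only_score) lands far above
    # chance (1 / number of candidates), the decoys can be ruled out without
    # ever looking at the image or the question.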
Evaluating Text-to-Image Matching using Binary Image Selection (BISON)
Providing systems with the ability to relate linguistic and visual content is one
of the hallmarks of computer vision. Tasks such as text-based image retrieval
and image captioning were designed to test this ability but come with
evaluation measures that have a high variance or are difficult to interpret. We
study an alternative task for systems that match text and images: given a text
query, the system is asked to select the image that best matches the query from
a pair of semantically similar images. The system's accuracy on this Binary
Image SelectiON (BISON) task is interpretable, eliminates the reliability
problems of retrieval evaluations, and focuses on the system's ability to
understand fine-grained visual structure. We gather a BISON dataset that
complements the COCO dataset and use it to evaluate modern text-based image
retrieval and image captioning systems. Our results provide novel insights into
the performance of these systems. The COCO-BISON dataset and corresponding
evaluation code are publicly available from http://hexianghu.com/bison/.
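Because each BISON example pairs one caption with exactly two semantically similar images, evaluation reduces to a forced choice with a chance level of 0.5. The sketch below shows that computation, assuming a generic match_score(caption, image) function that stands in for whatever retrieval or captioning model is being evaluated; the field names are hypothetical, not the released COCO-BISON format.

    from typing import Callable, Dict, List

    def bison_accuracy(
        pairs: List[Dict],                         # each: {"caption", "image_pos", "image_neg"}
        match_score: Callable[[str, str], float],  # higher score = better caption-image match
    ) -> float:
        """Fraction of pairs where the annotated image outscores its distractor."""
        hits = sum(
            match_score(p["caption"], p["image_pos"]) > match_score(p["caption"], p["image_neg"])
            for p in pairs
        )
        return hits / max(len(pairs), 1)

    # Chance level is 0.5 regardless of dataset size, which is what makes the
    # number easier to interpret than recall@k over a partially labelled pool.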
Learning Structured Inference Neural Networks with Label Relations
Images of scenes contain various objects as well as abundant attributes, and
they admit diverse levels of visual categorization. A natural image can be
assigned fine-grained labels that describe major components,
coarse-grained labels that depict high-level abstractions, or a set of labels
that reveal attributes. Such categorization at different concept layers can be
modeled with label graphs encoding label information. In this paper, we exploit
this rich information with a state-of-the-art deep learning framework, and propose
a generic structured model that leverages diverse label relations to improve
image classification performance. Our approach employs a novel stacked label
prediction neural network, capturing both inter-level and intra-level label
semantics. We evaluate our method on benchmark image datasets, and empirical
results illustrate the efficacy of our model.
Comment: Conference on Computer Vision and Pattern Recognition (CVPR) 2016
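One way to read the stacked label prediction network with inter-level and intra-level semantics is as message passing over a label-relation graph: each label's score is refined using the scores of related labels before the final prediction. The sketch below illustrates that idea only, under assumed tensor shapes and a made-up layer name; it is not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class LabelRelationLayer(nn.Module):
        """Refines per-label logits with messages from related labels."""

        def __init__(self, num_labels: int, relation: torch.Tensor):
            super().__init__()
            # relation[i, j] = 1 if label j is related to label i in the label graph
            self.register_buffer("relation", relation.float())
            self.self_weight = nn.Linear(num_labels, num_labels)

        def forward(self, logits: torch.Tensor) -> torch.Tensor:
            # logits: (batch, num_labels) initial scores from an image CNN
            messages = torch.sigmoid(logits) @ self.relation.t()
            return self.self_weight(logits) + messages

    # Usage: map CNN image features to initial per-label logits, then apply one
    # such layer per concept level, feeding refined coarse-level scores into the
    # finer levels before the final sigmoid/softmax.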
Compressed Video Action Recognition
Training robust deep video representations has proven to be much more
challenging than learning deep image representations. This is in part due to
the enormous size of raw video streams and the high temporal redundancy; the
true and interesting signal is often drowned in too much irrelevant data.
Motivated by the fact that video compression (using H.264, HEVC, etc.) reduces
this superfluous information by up to two orders of magnitude, we propose
to train a deep network directly on the compressed video.
This representation has a higher information density, and we found the
training to be easier. In addition, the signals in a compressed video provide
free, albeit noisy, motion information. We propose novel techniques to use them
effectively. Our approach is about 4.6 times faster than Res3D and 2.7 times
faster than ResNet-152. On the task of action recognition, our approach
outperforms all the other methods on the UCF-101, HMDB-51, and Charades
datasets.
Comment: CVPR 2018 (Selected for spotlight presentation)
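The key observation is that a compressed stream already separates appearance from motion: sparse I-frames carry full images, while motion vectors and residuals describe how the remaining frames differ from them. Below is a schematic sketch of a per-modality network with late fusion; the tiny backbones, module names, and fusion choice are assumptions for illustration, not the released implementation.

    import torch
    import torch.nn as nn

    class CompressedVideoNet(nn.Module):
        """Classifies actions from I-frames, motion vectors, and residuals."""

        def __init__(self, num_classes: int):
            super().__init__()
            # Tiny placeholder backbones; the real system uses large 2D CNNs per modality.
            self.iframe_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
            self.mv_net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
            self.res_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
            self.classifier = nn.Linear(16 * 3, num_classes)

        def forward(self, iframe, motion_vectors, residual):
            feats = [
                net(x).flatten(1)
                for net, x in ((self.iframe_net, iframe),
                               (self.mv_net, motion_vectors),
                               (self.res_net, residual))
            ]
            return self.classifier(torch.cat(feats, dim=1))

    # The motion vectors come "for free" from the codec (noisy, block-level motion),
    # so no optical flow needs to be computed at training or inference time.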