How a General-Purpose Commonsense Ontology can Improve Performance of Learning-Based Image Retrieval
The knowledge representation community has built general-purpose ontologies
which contain large amounts of commonsense knowledge over relevant aspects of
the world, including useful visual information, e.g.: "a ball is used by a
football player", "a tennis player is located at a tennis court". Current
state-of-the-art approaches for visual recognition do not exploit these
rule-based knowledge sources. Instead, they learn recognition models directly
from training examples. In this paper, we study how general-purpose
ontologies---specifically, MIT's ConceptNet ontology---can improve the
performance of state-of-the-art vision systems. As a testbed, we tackle the
problem of sentence-based image retrieval. Our retrieval approach incorporates
knowledge from ConceptNet on top of a large pool of object detectors derived
from a deep learning technique. In our experiments, we show that ConceptNet can
improve performance on a common benchmark dataset. Key to our performance is
the use of the ESPGAME dataset to select visually relevant relations from
ConceptNet. Consequently, a main conclusion of this work is that
general-purpose commonsense ontologies improve performance on visual reasoning
tasks when properly filtered to select meaningful visual relations.
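As a rough illustration of the filtering idea sketched in this abstract, the snippet below keeps only ConceptNet-style relations whose two concepts co-occur as tags on ESP Game images, then uses the surviving relations to expand a sentence query over object-detector outputs. This is our own sketch under assumed data structures (a tag set per image, (head, relation, tail) triples, a co-occurrence threshold), not the authors' implementation.

```python
# Illustrative sketch only: filter relations by visual relevance using
# ESP Game tag co-occurrence, then expand a query for retrieval over
# detector outputs. All names and thresholds here are assumptions.
from collections import Counter
from itertools import combinations

def visually_relevant_relations(relations, espgame_image_tags, min_cooccurrence=5):
    """Keep only relations whose two concepts co-occur as tags on ESP Game images."""
    cooc = Counter()
    for tags in espgame_image_tags:                # tags: set of labels for one image
        for a, b in combinations(sorted(set(tags)), 2):
            cooc[(a, b)] += 1
    kept = []
    for head, rel, tail in relations:              # e.g. ("tennis player", "AtLocation", "tennis court")
        key = tuple(sorted((head, tail)))
        if cooc.get(key, 0) >= min_cooccurrence:
            kept.append((head, rel, tail))
    return kept

def expand_query_terms(query_terms, kept_relations):
    """Add concepts linked to the query terms via the filtered relations."""
    expanded = set(query_terms)
    for head, _, tail in kept_relations:
        if head in query_terms:
            expanded.add(tail)
        if tail in query_terms:
            expanded.add(head)
    return expanded

def score_image(detected_objects, expanded_terms):
    """Score an image by overlap between detector outputs and the expanded query."""
    return len(set(detected_objects) & set(expanded_terms))
```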
FVQA: Fact-based Visual Question Answering
Visual Question Answering (VQA) has attracted a lot of attention in both
Computer Vision and Natural Language Processing communities, not least because
it offers insight into the relationships between two important sources of
information. Current datasets, and the models built upon them, have focused on
questions which are answerable by direct analysis of the question and image
alone. The set of such questions that require no external information to answer
is interesting, but very limited. It excludes questions which require common
sense, or basic factual knowledge to answer, for example. Here we introduce
FVQA, a VQA dataset which requires, and supports, much deeper reasoning. FVQA
only contains questions which require external information to answer.
We thus extend a conventional visual question answering dataset, which
contains image-question-answer triplets, through additional
image-question-answer-supporting fact tuples. The supporting fact is
represented as a structural triplet, such as &lt;Cat, CapableOf, ClimbingTrees&gt;.
We evaluate several baseline models on the FVQA dataset, and describe a novel
model which is capable of reasoning about an image on the basis of supporting
facts.
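The image-question-answer-supporting-fact tuples described above can be pictured with a small data-structure sketch. The field names below are assumptions chosen for illustration, not the dataset's actual schema.

```python
# Illustrative sketch only: one possible representation of an FVQA-style
# example with its supporting fact. Field names are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class SupportingFact:
    head: str        # e.g. "Cat"
    relation: str    # e.g. "CapableOf"
    tail: str        # e.g. "ClimbingTrees"

@dataclass(frozen=True)
class FVQAExample:
    image_path: str
    question: str
    answer: str
    fact: SupportingFact

# Hypothetical example instance (file path and wording are made up):
example = FVQAExample(
    image_path="images/cat_in_tree.jpg",
    question="Which animal in this image can climb trees?",
    answer="cat",
    fact=SupportingFact("Cat", "CapableOf", "ClimbingTrees"),
)
```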
What value do explicit high level concepts have in vision to language problems?
Much of the recent progress in Vision-to-Language (V2L) problems has been
achieved through a combination of Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs). This approach does not explicitly represent
high-level semantic concepts, but rather seeks to progress directly from image
features to text. We propose here a method of incorporating high-level concepts
into the very successful CNN-RNN approach, and show that it achieves a
significant improvement on the state-of-the-art performance in both image
captioning and visual question answering. We also show that the same mechanism
can be used to introduce external semantic information and that doing so
further improves performance. In doing so we provide an analysis of the value
of high level semantic information in V2L problems.
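One plausible way to wire explicit high-level concepts into a CNN-RNN pipeline, in the spirit of the abstract above, is to predict a multi-label concept vector from CNN features and use it to condition the RNN caption decoder. The PyTorch sketch below is an assumption-laden illustration of that idea, not the paper's model; all layer sizes and names are ours.

```python
# Illustrative sketch only: condition an LSTM caption decoder on a
# predicted vector of high-level concept probabilities instead of
# feeding raw CNN features directly.
import torch
import torch.nn as nn

class ConceptConditionedCaptioner(nn.Module):
    def __init__(self, cnn_feat_dim=2048, num_concepts=256,
                 vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Multi-label concept predictor on top of pooled CNN features.
        self.concept_head = nn.Sequential(
            nn.Linear(cnn_feat_dim, num_concepts), nn.Sigmoid())
        # Map the concept vector to the decoder's initial LSTM state.
        self.init_h = nn.Linear(num_concepts, hidden_dim)
        self.init_c = nn.Linear(num_concepts, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_feats, captions):
        # cnn_feats: (B, cnn_feat_dim); captions: (B, T) token ids
        concepts = self.concept_head(cnn_feats)      # (B, num_concepts)
        h0 = self.init_h(concepts).unsqueeze(0)      # (1, B, hidden_dim)
        c0 = self.init_c(concepts).unsqueeze(0)
        emb = self.embed(captions)                   # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                      # (B, T, vocab_size) logits
```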