Exploring Nearest Neighbor Approaches for Image Captioning
We explore a variety of nearest neighbor baseline approaches for image
captioning. These approaches find a set of nearest neighbor images in the
training set from which a caption may be borrowed for the query image. We
select a caption for the query image by finding the caption that best
represents the "consensus" of the set of candidate captions gathered from the
nearest neighbor images. When measured by automatic evaluation metrics on the
MS COCO caption evaluation server, these approaches perform as well as many
recent approaches that generate novel captions. However, human studies show
that a method that generates novel captions is still preferred over the nearest
neighbor approach.
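As a rough illustration of the consensus step, the sketch below picks the candidate caption with the highest average similarity to the rest of the candidate pool. The paper scores candidates with caption metrics such as BLEU or CIDEr; a simple word-overlap similarity stands in for those here, so this is a sketch of the selection rule rather than the paper's exact scoring.

def similarity(a: str, b: str) -> float:
    # Hypothetical stand-in similarity: Jaccard overlap of word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def consensus_caption(candidates: list[str]) -> str:
    # Pick the candidate whose average similarity to the other
    # candidates is highest -- the "consensus" of the set.
    def avg_sim(c):
        others = [o for o in candidates if o is not c]
        return sum(similarity(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=avg_sim)

# Example: captions borrowed from the nearest neighbor images.
pool = ["a dog runs on the beach",
        "a dog playing on a sandy beach",
        "two people riding horses"]
print(consensus_caption(pool))  # "a dog runs on the beach"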
Blindfold Baselines for Embodied QA
We explore blindfold (question-only) baselines for Embodied Question
Answering. The EmbodiedQA task requires an agent to answer a question by
intelligently navigating in a simulated environment, gathering necessary visual
information only through first-person vision before finally answering.
Consequently, a blindfold baseline which ignores the environment and visual
information is a degenerate solution, yet we show through our experiments on
the EQAv1 dataset that a simple question-only baseline achieves
state-of-the-art results on the EmbodiedQA task in all cases except when the
agent is spawned extremely close to the object.
Comment: NIPS 2018 Visually-Grounded Interaction and Language (ViGIL) Workshop
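A question-only baseline of this kind can be very small. The PyTorch sketch below encodes the question with an LSTM and classifies the answer without ever receiving visual input; the encoder choice and dimensions are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class BlindfoldBaseline(nn.Module):
    # A question-only model: the answer is predicted from question
    # tokens alone, with no visual input whatsoever.
    def __init__(self, vocab_size, num_answers, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, question_tokens):            # (batch, seq_len)
        x = self.embed(question_tokens)
        _, (h, _) = self.lstm(x)                   # final hidden state
        return self.classifier(h[-1])              # answer logits

model = BlindfoldBaseline(vocab_size=1000, num_answers=50)
print(model(torch.randint(0, 1000, (4, 12))).shape)  # torch.Size([4, 50])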
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?
We conduct large-scale studies on "human attention" in Visual Question
Answering (VQA) to understand where humans choose to look to answer questions
about images. We design and test multiple game-inspired novel
attention-annotation interfaces that require the subject to sharpen regions of
a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human
ATtention) dataset. We evaluate attention maps generated by state-of-the-art
VQA models against human attention both qualitatively (via visualizations) and
quantitatively (via rank-order correlation). Overall, our experiments show that
current attention models in VQA do not seem to be looking at the same regions
as humans.
Comment: 5 pages, 4 figures, 3 tables, presented at the 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY. arXiv admin note: substantial text overlap with arXiv:1606.03556
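The quantitative comparison reduces to a rank-order correlation between flattened attention maps. A minimal sketch, assuming both maps live on the same spatial grid (the 14x14 size is illustrative):

import numpy as np
from scipy.stats import spearmanr

# Stand-ins for a model attention map and a human attention map over
# the same spatial grid of image regions.
model_att = np.random.rand(14, 14)
human_att = np.random.rand(14, 14)

# Spearman rank-order correlation over the flattened maps: 1.0 means
# both rank the regions identically, 0.0 means no agreement.
rho, _ = spearmanr(model_att.ravel(), human_att.ravel())
print(f"rank-order correlation: {rho:.3f}")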
Simple Baseline for Visual Question Answering
We describe a very simple bag-of-words baseline for visual question
answering. This baseline concatenates the word features from the question and
CNN features from the image to predict the answer. When evaluated on the
challenging VQA dataset [2], it shows comparable performance to many recent
approaches using recurrent neural networks. To explore the strengths and
weaknesses of the trained model, we also provide an interactive web demo and
open-source code.
Comment: One comparison method's scores are put into the correct column, and a new experiment on generating attention maps is added.
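The described baseline is essentially one linear layer over concatenated features. A minimal sketch, with illustrative vocabulary and feature sizes rather than the paper's exact ones:

import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    # Bag-of-words baseline: a word-count vector for the question is
    # concatenated with precomputed CNN image features and passed to a
    # single softmax (linear) layer over the answer vocabulary.
    def __init__(self, vocab_size=5000, img_dim=2048, num_answers=1000):
        super().__init__()
        self.classifier = nn.Linear(vocab_size + img_dim, num_answers)

    def forward(self, bow_question, img_features):
        joint = torch.cat([bow_question, img_features], dim=1)
        return self.classifier(joint)              # answer logits

model = BowImgBaseline()
logits = model(torch.zeros(2, 5000), torch.zeros(2, 2048))
print(logits.shape)  # torch.Size([2, 1000])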
Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
We demonstrate the surprising strength of unimodal baselines in multimodal
domains, and make concrete recommendations for best practices in future
research. Where existing work often compares against random or majority class
baselines, we argue that unimodal approaches better capture and reflect dataset
biases and therefore provide an important comparison when assessing the
performance of multimodal techniques. We present unimodal ablations on three
recent datasets in visual navigation and QA, observing up to a 29% absolute gain
in performance over published baselines.
Comment: Published at the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) 2019
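One cheap way to approximate such an ablation is to zero out a modality at evaluation time, so any remaining accuracy must come from the other modality and from dataset bias; ablations that remove a modality during training as well, as in the paper, are the stronger variant. A minimal sketch assuming a generic two-input model:

import torch

def unimodal_eval(model, vision, language, drop="vision"):
    # Replace one modality's features with zeros; whatever accuracy
    # survives is attributable to the remaining modality and to bias
    # in the dataset itself.
    if drop == "vision":
        vision = torch.zeros_like(vision)
    elif drop == "language":
        language = torch.zeros_like(language)
    with torch.no_grad():
        return model(vision, language)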
Visual Referring Expression Recognition: What Do Systems Actually Learn?
We present an empirical analysis of the state-of-the-art systems for
referring expression recognition -- the task of identifying the object in an
image referred to by a natural language expression -- with the goal of gaining
insight into how these systems reason about language and vision. Surprisingly,
we find strong evidence that even sophisticated and linguistically-motivated
models for this task may ignore the linguistic structure, instead relying on
shallow correlations introduced by unintended biases in the data selection and
annotation process. For example, we show that a system trained and tested on
the input image without the input referring expression can achieve a
precision of 71.2% in top-2 predictions. Furthermore, a system that predicts
only the object category given the input referring expression can achieve a precision of 84.2% in
top-2 predictions. These surprisingly positive results for what should be
deficient prediction scenarios suggest that careful analysis of what our models
are learning -- and further, how our data is constructed -- is critical as we
seek to make substantive progress on grounded language tasks.
Comment: NAACL 2018 short paper
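To make the quoted numbers concrete, the metric is precision in the top-2 predictions: a hit whenever the ground-truth object appears among the two highest-scoring candidates. A minimal sketch with an assumed data format:

def precision_at_2(ranked_predictions, ground_truths):
    # A prediction counts as correct if the ground truth is among the
    # model's two highest-ranked candidates.
    hits = sum(gt in preds[:2]
               for preds, gt in zip(ranked_predictions, ground_truths))
    return hits / len(ground_truths)

preds = [["dog", "cat", "sofa"], ["car", "truck", "bus"]]
gold = ["cat", "bus"]
print(precision_at_2(preds, gold))  # 0.5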
Contrastive Learning for Image Captioning
Image captioning, a popular topic in computer vision, has achieved
substantial progress in recent years. However, the distinctiveness of natural
descriptions is often overlooked in previous work. It is closely related to the
quality of captions, as distinctive captions are more likely to describe images
with their unique aspects. In this work, we propose a new learning method,
Contrastive Learning (CL), for image captioning. Specifically, via two
constraints formulated on top of a reference model, the proposed method can
encourage distinctiveness, while maintaining the overall quality of the
generated captions. We test our method on two challenging datasets, where it
improves the baseline model by significant margins. Our studies also show
that the proposed method is generic and can be used with models of
various structures.
Comment: Accepted to the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA
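The contrastive idea can be sketched as follows: relative to a frozen reference model, the target captioner should raise the probability of true (image, caption) pairs and lower it for mismatched ones. This simplified loss illustrates that principle and is not the paper's exact formulation; the log-probability inputs would come from the two captioning models.

import torch

def contrastive_loss(logp_target_pos, logp_ref_pos,
                     logp_target_neg, logp_ref_neg):
    # Squash the difference of log-probabilities (target vs. frozen
    # reference) through a sigmoid, then reward being above the
    # reference on matched pairs and below it on mismatched pairs.
    d_pos = torch.sigmoid(logp_target_pos - logp_ref_pos)
    d_neg = torch.sigmoid(logp_target_neg - logp_ref_neg)
    return -(torch.log(d_pos) + torch.log(1.0 - d_neg)).mean()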
Search Engine Guided Non-Parametric Neural Machine Translation
In this paper, we extend an attention-based neural machine translation (NMT)
model by allowing it to access an entire training set of parallel sentence
pairs even after training. The proposed approach consists of two stages. In the
first stage (retrieval), an off-the-shelf, black-box search engine is
used to retrieve a small subset of sentence pairs from the training set given a
source sentence. These pairs are further filtered using a fuzzy matching
score derived from edit distance. In the second stage (translation), a novel
translation model, called translation memory enhanced NMT (TM-NMT), seamlessly
uses both the source sentence and a set of retrieved sentence pairs to perform
the translation. Empirical evaluation on three language pairs (En-Fr, En-De,
and En-Es) shows that the proposed approach significantly outperforms the
baseline approach, and the improvement is more pronounced when more relevant
sentence pairs are retrieved.
Comment: Accepted by AAAI 2018
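The retrieval-stage filtering can be sketched as re-scoring the search engine's candidates with an edit-distance-based fuzzy match score and keeping only close matches; the score definition and threshold below are illustrative assumptions.

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_score(query: str, candidate: str) -> float:
    # 1.0 means identical strings; 0.0 means entirely different.
    return 1.0 - edit_distance(query, candidate) / max(len(query), len(candidate), 1)

def filter_pairs(query, retrieved_pairs, threshold=0.5):
    # Keep only retrieved (source, target) pairs whose source side is
    # a close fuzzy match to the query sentence.
    return [(src, tgt) for src, tgt in retrieved_pairs
            if fuzzy_score(query, src) >= threshold]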
A Neural Compositional Paradigm for Image Captioning
Mainstream captioning models often follow a sequential structure to generate
captions, leading to issues such as the introduction of irrelevant semantics, a lack
of diversity in the generated captions, and inadequate generalization
performance. In this paper, we present an alternative paradigm for image
captioning, which factorizes the captioning procedure into two stages: (1)
extracting an explicit semantic representation from the given image; and (2)
constructing the caption based on a recursive compositional procedure in a
bottom-up manner. Compared to conventional models, our paradigm better preserves
the semantic content through an explicit factorization of semantics and syntax.
By using the compositional generation procedure, caption construction follows a
recursive structure, which naturally fits the properties of human language.
Moreover, the proposed compositional procedure requires less data to train,
generalizes better, and yields more diverse captions.
Comment: 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada
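The factorization itself can be caricatured in a few lines: stage 1 produces explicit semantic phrases, and stage 2 folds them together bottom-up. In this toy sketch the connectives and the stand-in phrase extractor are hand-specified placeholders for the paper's learned modules.

def extract_semantics(image):
    # Stage 1 would run a visual model over the image; a fixed
    # stand-in output is shown here.
    return ["a brown dog", ("with", "a red frisbee"), ("in", "the park")]

def compose(units):
    # Stage 2: recursively attach modifier phrases to the head noun
    # phrase, building the caption bottom-up.
    if len(units) == 1:
        return units[0]
    conn, phrase = units[-1]
    return compose(units[:-1]) + f" {conn} {phrase}"

print(compose(extract_semantics(None)))
# a brown dog with a red frisbee in the park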
Image Captioning with Semantic Attention
Automatically generating a natural language description of an image has
attracted interest recently, both because of its importance in practical
applications and because it connects two major artificial intelligence fields:
computer vision and natural language processing. Existing approaches are either
top-down, which start from a gist of an image and convert it into words, or
bottom-up, which come up with words describing various aspects of an image and
then combine them. In this paper, we propose a new algorithm that combines both
approaches through a model of semantic attention. Our algorithm learns to
selectively attend to semantic concept proposals and fuse them into hidden
states and outputs of recurrent neural networks. The selection and fusion form
a feedback connecting the top-down and bottom-up computation. We evaluate our
algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental
results show that our algorithm significantly outperforms the state-of-the-art
approaches consistently across different evaluation metrics.
Comment: 10 pages, 5 figures, CVPR 2016
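The attention step itself is compact: at each decoding step the hidden state scores the embeddings of the detected concept proposals, and the weighted sum is fused back into the recurrent computation. A minimal sketch with assumed dot-product scoring and illustrative dimensions, not the paper's exact fusion scheme:

import torch
import torch.nn.functional as F

def attend_concepts(hidden, concepts):
    # hidden: (batch, dim) decoder state; concepts: (batch, k, dim)
    # embeddings of k detected concept proposals (e.g. "dog", "beach").
    scores = torch.bmm(concepts, hidden.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=1)                 # (batch, k)
    # Weighted sum of concept embeddings, to be fed back into the RNN.
    return (weights.unsqueeze(-1) * concepts).sum(1)   # (batch, dim)

fused = attend_concepts(torch.randn(2, 256), torch.randn(2, 5, 256))
print(fused.shape)  # torch.Size([2, 256])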