Generalized Max Pooling
State-of-the-art patch-based image representations involve a pooling
operation that aggregates statistics computed from local descriptors. Standard
pooling operations include sum- and max-pooling. Sum-pooling lacks
discriminability because the resulting representation is strongly influenced by
frequent yet often uninformative descriptors, but only weakly influenced by
rare yet potentially highly-informative ones. Max-pooling equalizes the
influence of frequent and rare descriptors but is only applicable to
representations that rely on count statistics, such as the bag-of-visual-words
(BOV) and its soft- and sparse-coding extensions. We propose a novel pooling
mechanism that achieves the same effect as max-pooling but is applicable beyond
the BOV and especially to the state-of-the-art Fisher Vector -- hence the name
Generalized Max Pooling (GMP). It involves equalizing the similarity between
each patch and the pooled representation, which is shown to be equivalent to
re-weighting the per-patch statistics. We show on five public image
classification benchmarks that the proposed GMP can lead to significant
performance gains with respect to heuristic alternatives.
Comment: (to appear) CVPR 2014 - IEEE Conference on Computer Vision & Pattern
Recognition (2014)
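The equalization idea in the abstract above can be illustrated with a minimal sketch. It assumes a ridge-regularized least-squares formulation: find a pooled vector whose dot product with every patch descriptor is close to 1, solved in its dual form so that the pooled vector is a re-weighted sum of the per-patch statistics. The function name `generalized_max_pool`, the regularization parameter `lam`, and the toy descriptors are illustrative assumptions, not taken from the listing.

```python
import numpy as np

def generalized_max_pool(phi, lam=1e-3):
    """Pool patch descriptors so each patch is about equally similar to the result.

    phi: array of shape (n_patches, dim), one descriptor per row.
    lam: ridge regularization strength (hypothetical default).
    Returns (xi, alpha): the pooled vector and the per-patch weights.
    """
    n = phi.shape[0]
    # Patch-to-patch similarity (linear kernel).
    K = phi @ phi.T
    # Dual ridge regression: solve (K + lam * I) alpha = 1.
    alpha = np.linalg.solve(K + lam * np.eye(n), np.ones(n))
    # The pooled vector is a weighted sum of the descriptors.
    xi = phi.T @ alpha
    return xi, alpha

# Toy data: one descriptor occurring nine times, another occurring once.
phi = np.array([[1.0, 0.0]] * 9 + [[0.0, 1.0]])
sum_pooled = phi.sum(axis=0)
gmp_pooled, weights = generalized_max_pool(phi)
print("similarity to sum-pooled vector:", phi @ sum_pooled)
print("similarity to GMP vector:", phi @ gmp_pooled)
```

In this toy case the sum-pooled vector is far more similar to the frequent descriptor than to the rare one, while the sketch gives the rare descriptor a larger weight so that both similarities end up close to 1.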
Evaluation of Output Embeddings for Fine-Grained Image Classification
Image classification has advanced significantly in recent years with the
availability of large-scale image sets. However, fine-grained classification
remains a major challenge due to the annotation cost of large numbers of
fine-grained categories. This project shows that compelling classification
performance can be achieved on such categories even without labeled training
data. Given image and class embeddings, we learn a compatibility function such
that matching embeddings are assigned a higher score than mismatching ones;
zero-shot classification of an image proceeds by finding the label yielding the
highest joint compatibility score. We use state-of-the-art image features and
focus on different supervised attributes and unsupervised output embeddings
either derived from hierarchies or learned from unlabeled text corpora. We
establish a substantially improved state-of-the-art on the Animals with
Attributes and Caltech-UCSD Birds datasets. Most encouragingly, we demonstrate
that purely unsupervised output embeddings (learned from Wikipedia and improved
with fine-grained text) achieve compelling results, even outperforming the
previous supervised state-of-the-art. By combining different output embeddings,
we further improve results.
Comment: @inproceedings {ARWLS15, title = {Evaluation of Output Embeddings for
Fine-Grained Image Classification}, booktitle = {IEEE Computer Vision and
Pattern Recognition}, year = {2015}, author = {Zeynep Akata and Scott Reed
and Daniel Walter and Honglak Lee and Bernt Schiele}}
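The zero-shot prediction step described in the abstract above can be sketched as follows. The sketch assumes a bilinear compatibility function, scoring an image embedding against each class embedding through a matrix `W`; the abstract does not state the functional form, so the bilinear choice, the names `compatibility` and `zero_shot_predict`, and the toy values are all illustrative assumptions. Learning `W` is not shown.

```python
import numpy as np

def compatibility(image_emb, W, class_embs):
    """Score one image against every class.

    image_emb: (d_img,) image embedding.
    W: (d_img, d_cls) compatibility matrix (assumed already learned).
    class_embs: (n_classes, d_cls) output embeddings, one row per class.
    Returns an array of n_classes scores.
    """
    return image_emb @ W @ class_embs.T

def zero_shot_predict(image_emb, W, class_embs):
    """Return the index of the class with the highest compatibility score."""
    return int(np.argmax(compatibility(image_emb, W, class_embs)))

# Toy usage with hypothetical 3-dimensional embeddings.
W = np.eye(3)
class_embs = np.eye(3)
image_emb = np.array([0.1, 0.9, 0.2])
print(zero_shot_predict(image_emb, W, class_embs))
```

Because the class embeddings come from attributes, hierarchies, or text rather than labeled images, the same scoring function can rank classes that have no training images.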
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural
language explicitly grounded in entities that object detectors find in the
image. Our approach reconciles classical slot filling approaches (that are
generally better grounded in images) with modern neural captioning approaches
(that are generally more natural sounding and accurate). Our approach first
generates a sentence `template' with slot locations explicitly tied to specific
image regions. These slots are then filled in by visual concepts identified in
the regions by object detectors. The entire architecture (sentence template
generation and slot filling with object detectors) is end-to-end
differentiable. We verify the effectiveness of our proposed model on different
image captioning tasks. On standard image captioning and novel object
captioning, our model reaches state-of-the-art on both COCO and Flickr30k
datasets. We also demonstrate that our model has unique advantages when the
train and test distributions of scene compositions -- and hence language priors
of associated captions -- are different. Code has been made available at:
https://github.com/jiasenlu/NeuralBabyTalk
Comment: 12 pages, 7 figures, CVPR 201
Feedback-prop: Convolutional Neural Network Inference under Partial Evidence
We propose an inference procedure for deep convolutional neural networks
(CNNs) when partial evidence is available. Our method consists of a general
feedback-based propagation approach (feedback-prop) that boosts the prediction
accuracy for an arbitrary set of unknown target labels when the values for a
non-overlapping arbitrary set of target labels are known. We show that existing
models trained in a multi-label or multi-task setting can readily take
advantage of feedback-prop without any retraining or fine-tuning. Our
feedback-prop inference procedure is general, simple, reliable, and works on
different challenging visual recognition tasks. We present two variants of
feedback-prop based on layer-wise and residual iterative updates. We experiment
using several multi-task models and show that feedback-prop is effective in all
of them. Our results unveil a previously unreported but interesting dynamic
property of deep CNNs. We also present an associated technical approach that
takes advantage of this property for inference under partial evidence in
general visual recognition tasks.
Comment: Accepted to CVPR 201
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
Existing visual reasoning datasets, such as Visual Question Answering (VQA),
often suffer from biases conditioned on the question, image or answer
distributions. The recently proposed CLEVR dataset addresses these limitations
and requires fine-grained reasoning, but it is synthetic and consists
of similar objects and sentence structures throughout.
In this paper, we introduce a new inference task, Visual Entailment (VE),
consisting of image-sentence pairs whereby a premise is defined by an image,
rather than a natural language sentence as in traditional Textual Entailment
tasks. The goal of a trained VE model is to predict whether the image
semantically entails the text. To realize this task, we build a dataset SNLI-VE
based on the Stanford Natural Language Inference corpus and Flickr30k dataset.
We evaluate various existing VQA baselines and build a model called Explainable
Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71%
accuracy and outperforms several other state-of-the-art VQA-based models.
Finally, we demonstrate the explainability of EVE through cross-modal attention
visualizations. The SNLI-VE dataset is publicly available at
https://github.com/necla-ml/SNLI-VE