OBJ2TEXT: Generating Visually Descriptive Language from Object Layouts
Generating captions for images is a task that has recently received
considerable attention. In this work we focus on caption generation for
abstract scenes, or object layouts where the only information provided is a set
of objects and their locations. We propose OBJ2TEXT, a sequence-to-sequence
model that encodes a set of objects and their locations as an input sequence
using an LSTM network, and decodes this representation using an LSTM language
model. We show that our model, despite encoding object layouts as a sequence,
can represent spatial relationships between objects, and generate descriptions
that are globally coherent and semantically relevant. We test our approach on
the task of object-layout captioning, using only object annotations as inputs. We
additionally show that our model, combined with a state-of-the-art object
detector, improves an image captioning model from 0.863 to 0.950 (CIDEr score)
on the test benchmark of the standard MS-COCO Captioning task.
Comment: Accepted at EMNLP 201
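A minimal sketch of the kind of sequence-to-sequence model described above, assuming a PyTorch implementation; the class, argument, and tensor names (Obj2TextSketch, num_objects, obj_boxes, etc.) are illustrative and not the authors' released code.

```python
import torch
import torch.nn as nn

class Obj2TextSketch(nn.Module):
    """Encode (object category, location) tokens with an LSTM, decode a caption."""
    def __init__(self, num_objects, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.obj_embed = nn.Embedding(num_objects, embed_dim)
        # Each input token = object-category embedding + normalized box (x, y, w, h).
        self.encoder = nn.LSTM(embed_dim + 4, hidden_dim, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, obj_ids, obj_boxes, caption_in):
        # obj_ids: (B, N) category indices; obj_boxes: (B, N, 4) normalized boxes.
        enc_in = torch.cat([self.obj_embed(obj_ids), obj_boxes], dim=-1)
        _, state = self.encoder(enc_in)
        # The LSTM language model decodes conditioned on the final encoder state.
        dec_out, _ = self.decoder(self.word_embed(caption_in), state)
        return self.out(dec_out)  # (B, T, vocab_size) logits

model = Obj2TextSketch(num_objects=80, vocab_size=10000)
logits = model(torch.randint(0, 80, (2, 5)), torch.rand(2, 5, 4),
               torch.randint(0, 10000, (2, 12)))
```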
Feedback-prop: Convolutional Neural Network Inference under Partial Evidence
We propose an inference procedure for deep convolutional neural networks
(CNNs) when partial evidence is available. Our method consists of a general
feedback-based propagation approach (feedback-prop) that boosts the prediction
accuracy for an arbitrary set of unknown target labels when the values for a
non-overlapping, arbitrary set of target labels are known. We show that existing
models trained in a multi-label or multi-task setting can readily take
advantage of feedback-prop without any retraining or fine-tuning. Our
feedback-prop inference procedure is general, simple, reliable, and works on
different challenging visual recognition tasks. We present two variants of
feedback-prop based on layer-wise and residual iterative updates. We experiment
using several multi-task models and show that feedback-prop is effective in all
of them. Our results unveil a previously unreported but interesting dynamic
property of deep CNNs. We also present an associated technical approach that
takes advantage of this property for inference under partial evidence in
general visual recognition tasks.
Comment: Accepted to CVPR 201
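A hedged sketch of the layer-wise feedback-prop idea, assuming a trained multi-label PyTorch model split into placeholder `backbone` and `head` modules: an intermediate activation is treated as a free variable, refined with gradients from the known labels only, and the unknown labels are then re-read from the updated predictions.

```python
import torch
import torch.nn.functional as F

def feedback_prop(backbone, head, image, known_idx, known_targets,
                  steps=10, lr=0.1):
    # Forward pass up to the chosen intermediate layer; no weights are updated.
    with torch.no_grad():
        act = backbone(image)
    act = act.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([act], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = head(act)
        # Only the known (partial-evidence) labels contribute to the loss.
        loss = F.binary_cross_entropy_with_logits(
            logits[:, known_idx], known_targets)
        loss.backward()
        optimizer.step()  # updates the activation, not the model parameters
    with torch.no_grad():
        return head(act)  # refreshed predictions; read off the unknown labels
```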
Where and Who? Automatic Semantic-Aware Person Composition
Image compositing is a method used to generate realistic yet fake imagery by
inserting content from one image into another. Previous work in compositing has
focused on improving the appearance compatibility of a user-selected foreground
segment and a background image (i.e., color and illumination consistency). In
this work, we instead develop a fully automated compositing model that
additionally learns to select and transform compatible foreground segments from
a large collection given only an input image background. To simplify the task,
we restrict our problem by focusing on human instance composition, because
human segments exhibit strong correlations with their backgrounds and because
large annotated datasets are available. We develop a novel branching
Convolutional Neural Network (CNN) that jointly predicts candidate person
locations given a background image. We then use pre-trained deep feature
representations to retrieve person instances from a large segment database.
Experimental results show that our model can generate composite images that
look visually convincing. We also develop a user interface to demonstrate the
potential application of our method.
Comment: 10 pages, 9 figures
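A small sketch of the retrieval step under stated assumptions: candidate person segments are ranked by cosine similarity between pretrained deep features of the predicted composition region and of each segment in the database; the 2048-dimensional features and the database tensor are placeholders.

```python
import torch
import torch.nn.functional as F

def retrieve_segments(context_feat, segment_feats, k=5):
    # context_feat: (D,) feature of the predicted composition region;
    # segment_feats: (M, D) features of M candidate person segments.
    sims = F.cosine_similarity(context_feat.unsqueeze(0), segment_feats, dim=1)
    return sims.topk(k).indices  # indices of the k most compatible segments

ctx = torch.randn(2048)           # e.g., a pre-trained CNN feature of the region
db = torch.randn(10000, 2048)     # pre-computed features for the segment database
best = retrieve_segments(ctx, db)
```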
Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
Language is increasingly being used to define rich visual recognition
problems with supporting image collections sourced from the web. Structured
prediction models are used in these tasks to take advantage of correlations
between co-occurring labels and visual input but risk inadvertently encoding
social biases found in web corpora. In this work, we study data and models
associated with multilabel object classification and visual semantic role
labeling. We find that (a) datasets for these tasks contain significant gender
bias and (b) models trained on these datasets further amplify existing bias.
For example, the activity cooking is over 33% more likely to involve females
than males in a training set, and a trained model further amplifies the
disparity to 68% at test time. We propose to inject corpus-level constraints
for calibrating existing structured prediction models and design an algorithm
based on Lagrangian relaxation for collective inference. Our method results in
almost no performance loss for the underlying recognition task but decreases
the magnitude of bias amplification by 47.5% and 40.5% for multilabel
classification and visual semantic role labeling, respectively.
Comment: 11 pages, published in EMNLP 201
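An illustrative sketch, not the paper's exact algorithm, of Lagrangian-relaxation-style calibration for a single activity with two gendered labels: a multiplier shifts the prediction scores until the corpus-level gender ratio at test time falls within a margin of the training-set ratio. All names and the two-label setup are assumptions.

```python
import numpy as np

def calibrate(scores_man, scores_woman, target_ratio, margin=0.05,
              steps=100, lr=0.1):
    # scores_*: (N,) model scores for the two gendered variants of one activity.
    lam = 0.0
    for _ in range(steps):
        # Decode with the current multiplier added to the "woman" score.
        pred_woman = (scores_woman + lam) > scores_man
        ratio = pred_woman.mean()
        # Subgradient step on the violated corpus-level constraint
        # target_ratio - margin <= ratio <= target_ratio + margin.
        if ratio > target_ratio + margin:
            lam -= lr * (ratio - (target_ratio + margin))
        elif ratio < target_ratio - margin:
            lam += lr * ((target_ratio - margin) - ratio)
        else:
            break
    return pred_woman, lam

preds, lam = calibrate(np.random.rand(1000), np.random.rand(1000),
                       target_ratio=0.6)
```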
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders
We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and
contrastive learning. ViC-MAE is trained using a global feature obtained by
pooling the local representations learned under an MAE reconstruction loss and
leveraging this representation under a contrastive objective across images and
video frames. We show that visual representations learned under ViC-MAE
generalize well to both video and image classification tasks. In particular,
ViC-MAE obtains state-of-the-art transfer-learning performance from video to
images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a
top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same
data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the
same time, ViC-MAE outperforms most other methods on video benchmarks,
obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video
benchmark. When training on videos and images from a diverse combination of
datasets, our method maintains a balanced transfer-learning performance between
video and image classification benchmarks, coming only as a close second to the
best supervised method.
Comment: More results on video and image datasets; ViC-MAE now supports
training on videos and images
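A minimal sketch of the combined objective, assuming placeholder `encoder`/`decoder` modules and pre-masked inputs: an MAE reconstruction loss on one view plus an InfoNCE-style contrastive loss between pooled global features of two views (e.g., two frames of the same video); the exact masking and loss weighting in ViC-MAE may differ.

```python
import torch
import torch.nn.functional as F

def vic_mae_loss(encoder, decoder, view1, view2, masked_target, temperature=0.1):
    # MAE branch: encode the (masked) first view and reconstruct its patches.
    tokens1 = encoder(view1)                        # (B, N, D) local tokens
    recon_loss = F.mse_loss(decoder(tokens1), masked_target)

    # Contrastive branch: pool local tokens into global features and align the
    # two views against the rest of the batch (InfoNCE).
    tokens2 = encoder(view2)
    z1 = F.normalize(tokens1.mean(dim=1), dim=-1)   # (B, D) pooled global feature
    z2 = F.normalize(tokens2.mean(dim=1), dim=-1)
    logits = z1 @ z2.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    contrastive_loss = F.cross_entropy(logits, labels)

    return recon_loss + contrastive_loss
```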
Estimating and Maximizing Mutual Information for Knowledge Distillation
In this work, we propose Mutual Information Maximization Knowledge
Distillation (MIMKD). Our method uses a contrastive objective to simultaneously
estimate and maximize a lower bound on the mutual information of local and
global feature representations between a teacher and a student network. We
demonstrate through extensive experiments that this can be used to improve the
performance of low capacity models by transferring knowledge from more
performant but computationally expensive models. This can be used to produce
better models that can be run on devices with low computational resources. Our
method is flexible: we can distill knowledge from teachers with arbitrary
network architectures to arbitrary student networks. Our empirical results show
that MIMKD outperforms competing approaches across a wide range of
student-teacher pairs with different capacities and architectures, and even
when student networks have extremely low capacity. We are able to obtain
74.55% accuracy on CIFAR-100 with a ShuffleNetV2 student from a baseline
accuracy of 69.8% by distilling knowledge from a ResNet-50 teacher. On
ImageNet, we improve a ResNet-18 network from 68.88% to 70.32% accuracy
(+1.44%) using a ResNet-34 teacher network.
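A minimal sketch of a contrastive (InfoNCE-style) lower bound on the mutual information between teacher and student global features; the linear projection critic, the dimensions, and the point at which features are extracted are assumptions, not the exact MIMKD formulation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIBound(nn.Module):
    def __init__(self, student_dim, teacher_dim, proj_dim=128):
        super().__init__()
        self.proj_s = nn.Linear(student_dim, proj_dim)
        self.proj_t = nn.Linear(teacher_dim, proj_dim)

    def forward(self, student_feat, teacher_feat, temperature=0.1):
        # Matching (student_i, teacher_i) pairs are positives; all other pairs
        # in the batch serve as negatives.
        s = F.normalize(self.proj_s(student_feat), dim=-1)
        t = F.normalize(self.proj_t(teacher_feat), dim=-1)
        logits = s @ t.t() / temperature
        labels = torch.arange(s.size(0), device=s.device)
        loss = F.cross_entropy(logits, labels)
        # InfoNCE bound: I(s; t) >= log(B) - loss, with B the batch size.
        return math.log(s.size(0)) - loss

bound = MIBound(student_dim=1024, teacher_dim=2048)
mi_lb = bound(torch.randn(32, 1024), torch.randn(32, 2048))
distill_loss = -mi_lb  # maximizing the bound transfers knowledge to the student
```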