Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
An important goal of computer vision is to build systems that learn visual
representations over time that can be applied to many tasks. In this paper, we
investigate a vision-language embedding as a core representation and show that
it leads to better cross-task transfer than standard multi-task learning. In
particular, the task of visual recognition is aligned to the task of visual
question answering by forcing each to use the same word-region embeddings. We
show this leads to greater inductive transfer from recognition to VQA than
standard multi-task learning. Visual recognition also improves, especially for
categories that have relatively few recognition training labels but appear
often in the VQA setting. Thus, our paper takes a small step towards creating
more general vision systems by showing the benefit of interpretable, flexible,
and trainable core representations.
Comment: Accepted in ICCV 2017. The arXiv version has an extra analysis on correlation with human attention.
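To make the shared word-region embedding idea concrete, here is a minimal sketch, not the authors' implementation; the module names, dimensions, and the max-pooling recognition head are assumptions. One word-embedding table and one region projection are reused by a recognition scorer and, in principle, by a VQA attention module, so both tasks are forced to align regions with the same word vectors.

```python
import torch
import torch.nn as nn


class SharedWordRegionEmbedding(nn.Module):
    """Projects image-region features and word indices into one embedding space."""

    def __init__(self, num_words=10000, region_dim=2048, embed_dim=300):
        super().__init__()
        self.word_embed = nn.Embedding(num_words, embed_dim)  # word vectors shared by both tasks
        self.region_proj = nn.Linear(region_dim, embed_dim)   # region projection shared by both tasks

    def score(self, region_feats, word_ids):
        # region_feats: (batch, regions, region_dim); word_ids: (labels,)
        regions = self.region_proj(region_feats)               # (B, R, D)
        words = self.word_embed(word_ids)                      # (L, D)
        return torch.einsum("brd,ld->brl", regions, words)     # region-word similarity scores


embed = SharedWordRegionEmbedding()
feats = torch.randn(2, 36, 2048)          # e.g. 36 region proposals per image (hypothetical)
label_ids = torch.arange(80)              # hypothetical label vocabulary indices
# Recognition head: pool region-word scores over regions to get per-label logits.
recognition_logits = embed.score(feats, label_ids).max(dim=1).values   # (2, 80)
# A VQA head would reuse the same region-word scores as attention over regions.
```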
What is Holding Back Convnets for Detection?
Convolutional neural networks have recently shown excellent results in
general object detection and many other tasks. Albeit very effective, they
involve many user-defined design choices. In this paper we want to better
understand these choices by inspecting two key aspects: "what did the network
learn?" and "what can the network learn?". We exploit new annotations
(Pascal3D+) to enable a new empirical analysis of the R-CNN detector. Despite
common belief, our results indicate that existing state-of-the-art convnet
architectures are not invariant to various appearance factors. In fact, all
considered networks have similar weak points which cannot be mitigated by
simply increasing the training data (architectural changes are needed). We show
that overall performance can improve when using image renderings for data
augmentation. We report the best known results on the Pascal3D+ detection and
viewpoint estimation tasks.
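As a rough illustration of rendering-based data augmentation, the following sketch mixes synthetic renderings into a real detection training set. It is a hypothetical dataset wrapper, not the paper's pipeline; the class name, the two underlying datasets, and the mixing ratio are assumptions.

```python
import random
from torch.utils.data import Dataset


class MixedRealSyntheticDetection(Dataset):
    """Serves real annotated images most of the time, rendered views otherwise."""

    def __init__(self, real_dataset, rendered_dataset, synthetic_fraction=0.3):
        self.real = real_dataset            # e.g. real images with box annotations
        self.rendered = rendered_dataset    # e.g. CAD renderings with generated boxes
        self.synthetic_fraction = synthetic_fraction

    def __len__(self):
        return len(self.real)

    def __getitem__(self, idx):
        # With some probability, substitute a rendering for the real sample so the
        # detector also sees appearance factors (viewpoint, truncation, size) that
        # are rare in the real training data.
        if random.random() < self.synthetic_fraction:
            return self.rendered[random.randrange(len(self.rendered))]
        return self.real[idx]
```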
3D-PRNN: Generating Shape Primitives with Recurrent Neural Networks
The success of various applications, including robotics, digital content
creation, and visualization, demands a structured and abstract representation of
the 3D world from limited sensor data. Inspired by the nature of human
perception of 3D shapes as a collection of simple parts, we explore such an
abstract shape representation based on primitives. Given a single depth image
of an object, we present 3D-PRNN, a generative recurrent neural network that
synthesizes multiple plausible shapes composed of a set of primitives. Our
generative model encodes symmetry characteristics of common man-made objects,
preserves long-range structural coherence, and describes objects of varying
complexity with a compact representation. We also propose a method based on
Gaussian Fields to generate a large scale dataset of primitive-based shape
representations to train our network. We evaluate our approach on a wide range
of examples and show that it outperforms nearest-neighbor based shape retrieval
methods and is on-par with voxel-based generative models while using a
significantly reduced parameter space.
Comment: ICCV 2017
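A minimal sketch of the sequential primitive-generation idea, assuming a simple LSTM decoder over box parameters conditioned on a depth-image encoding. The dimensions, the 9-parameter scale/translation/rotation encoding, and the stop head are assumptions and this is not the released 3D-PRNN code.

```python
import torch
import torch.nn as nn


class PrimitiveRNN(nn.Module):
    """LSTM decoder that emits one box primitive per step from a depth-image code."""

    def __init__(self, depth_feat_dim=256, hidden_dim=256, prim_dim=9):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.prim_dim = prim_dim
        self.encoder = nn.Sequential(                       # toy encoder for a 64x64 depth map
            nn.Flatten(), nn.Linear(64 * 64, depth_feat_dim), nn.ReLU())
        self.lstm = nn.LSTMCell(prim_dim + depth_feat_dim, hidden_dim)
        self.prim_head = nn.Linear(hidden_dim, prim_dim)    # scale (3) + translation (3) + rotation (3)
        self.stop_head = nn.Linear(hidden_dim, 1)           # probability that the shape is complete

    def forward(self, depth_image, max_prims=10):
        feat = self.encoder(depth_image)                    # (B, depth_feat_dim)
        batch = depth_image.size(0)
        prim = feat.new_zeros(batch, self.prim_dim)         # start-of-sequence primitive
        h = feat.new_zeros(batch, self.hidden_dim)
        c = feat.new_zeros(batch, self.hidden_dim)
        prims, stops = [], []
        for _ in range(max_prims):
            h, c = self.lstm(torch.cat([prim, feat], dim=1), (h, c))
            prim = self.prim_head(h)                        # parameters of the next primitive
            prims.append(prim)
            stops.append(torch.sigmoid(self.stop_head(h)))
        return torch.stack(prims, dim=1), torch.stack(stops, dim=1)


model = PrimitiveRNN()
boxes, stop_probs = model(torch.randn(1, 1, 64, 64))        # one synthetic depth image
```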
