10,385 research outputs found
A Discriminative Representation of Convolutional Features for Indoor Scene Recognition
Indoor scene recognition is a multi-faceted and challenging problem due to
the diverse intra-class variations and the confusing inter-class similarities.
This paper presents a novel approach which exploits rich mid-level
convolutional features to categorize indoor scenes. Traditionally used
convolutional features preserve the global spatial structure, which is a
desirable property for general object recognition. However, we argue that this
structuredness is not much helpful when we have large variations in scene
layouts, e.g., in indoor scenes. We propose to transform the structured
convolutional activations to another highly discriminative feature space. The
representation in the transformed space not only incorporates the
discriminative aspects of the target dataset, but it also encodes the features
in terms of the general object categories that are present in indoor scenes. To
this end, we introduce a new large-scale dataset of 1300 object categories
which are commonly present in indoor scenes. Our proposed approach achieves a
significant performance boost over previous state of the art approaches on five
major scene classification datasets
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Visual dialog (VisDial) is a task which requires an AI agent to answer a
series of questions grounded in an image. Unlike in visual question answering
(VQA), the series of questions should be able to capture a temporal context
from a dialog history and exploit visually-grounded information. A problem
called visual reference resolution involves these challenges, requiring the
agent to resolve ambiguous references in a given question and find the
references in a given image. In this paper, we propose Dual Attention Networks
(DAN) for visual reference resolution. DAN consists of two kinds of attention
networks, REFER and FIND. Specifically, REFER module learns latent
relationships between a given question and a dialog history by employing a
self-attention mechanism. FIND module takes image features and reference-aware
representations (i.e., the output of REFER module) as input, and performs
visual grounding via bottom-up attention mechanism. We qualitatively and
quantitatively evaluate our model on VisDial v1.0 and v0.9 datasets, showing
that DAN outperforms the previous state-of-the-art model by a significant
margin.Comment: EMNLP 201
Semantically Guided Depth Upsampling
We present a novel method for accurate and efficient up- sampling of sparse
depth data, guided by high-resolution imagery. Our approach goes beyond the use
of intensity cues only and additionally exploits object boundary cues through
structured edge detection and semantic scene labeling for guidance. Both cues
are combined within a geodesic distance measure that allows for
boundary-preserving depth in- terpolation while utilizing local context. We
model the observed scene structure by locally planar elements and formulate the
upsampling task as a global energy minimization problem. Our method determines
glob- ally consistent solutions and preserves fine details and sharp depth
bound- aries. In our experiments on several public datasets at different levels
of application, we demonstrate superior performance of our approach over the
state-of-the-art, even for very sparse measurements.Comment: German Conference on Pattern Recognition 2016 (Oral
DA-RNN: Semantic Mapping with Data Associated Recurrent Neural Networks
3D scene understanding is important for robots to interact with the 3D world
in a meaningful way. Most previous works on 3D scene understanding focus on
recognizing geometrical or semantic properties of the scene independently. In
this work, we introduce Data Associated Recurrent Neural Networks (DA-RNNs), a
novel framework for joint 3D scene mapping and semantic labeling. DA-RNNs use a
new recurrent neural network architecture for semantic labeling on RGB-D
videos. The output of the network is integrated with mapping techniques such as
KinectFusion in order to inject semantic information into the reconstructed 3D
scene. Experiments conducted on a real world dataset and a synthetic dataset
with RGB-D videos demonstrate the ability of our method in semantic 3D scene
mapping.Comment: Published in RSS 201
- …