Edge-Semantic Learning Strategy for Layout Estimation in Indoor Environment
Visual cognition of the indoor environment can benefit from spatial
layout estimation, which represents an indoor scene with a 2D box on a
monocular image. In this paper, we propose to fully exploit the edge and
semantic information of a room image for layout estimation. More specifically,
we present an encoder-decoder network with shared encoder and two separate
decoders, which are composed of multiple deconvolution (transposed convolution)
layers, to jointly learn the edge maps and semantic labels of a room image. We
combine these two network predictions in a scoring function to evaluate the
quality of the layouts, which are generated by ray sampling and from a
predefined layout pool. Guided by the scoring function, we apply a novel
refinement strategy to further optimize the layout hypotheses. Experimental
results show that the proposed network can yield accurate estimates of edge
maps and semantic labels. By fully utilizing the two different types of labels,
the proposed method achieves state-of-the-art layout estimation performance on
benchmark datasets.
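To make the shared-encoder / two-decoder design concrete, here is a minimal PyTorch sketch. The layer counts, channel widths, and the choice of five semantic layout classes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class EdgeSemanticNet(nn.Module):
    """Illustrative shared-encoder / two-decoder network (not the paper's exact layers)."""
    def __init__(self, num_semantic_classes=5):
        super().__init__()
        # Shared encoder: downsample the room image to a compact feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        def decoder(out_channels):
            # Separate decoders built from transposed convolutions, as in the abstract.
            return nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, out_channels, 4, stride=2, padding=1),
            )
        self.edge_decoder = decoder(1)                          # per-pixel edge logits
        self.semantic_decoder = decoder(num_semantic_classes)   # per-pixel layout-label logits

    def forward(self, x):
        feats = self.encoder(x)
        return self.edge_decoder(feats), self.semantic_decoder(feats)

edge_logits, sem_logits = EdgeSemanticNet()(torch.randn(1, 3, 256, 256))
print(edge_logits.shape, sem_logits.shape)  # (1, 1, 256, 256) (1, 5, 256, 256)
```

The two prediction maps would then feed the scoring function that ranks sampled layout hypotheses, as described above.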
RoomNet: End-to-End Room Layout Estimation
This paper focuses on the task of room layout estimation from a monocular RGB
image. Prior works break the problem into two sub-tasks: semantic segmentation
of floor, walls, and ceiling to produce layout hypotheses, followed by an iterative
optimization step to rank these hypotheses. In contrast, we adopt a more direct
formulation of this problem as one of estimating an ordered set of room layout
keypoints. The room layout and the corresponding segmentation are completely
specified given the locations of these ordered keypoints. We predict the
locations of the room layout keypoints using RoomNet, an end-to-end trainable
encoder-decoder network. On the challenging benchmark datasets Hedau and LSUN,
we achieve state-of-the-art performance along with 200x to 600x speedup
compared to the most recent work. Additionally, we present optional extensions
to the RoomNet architecture such as including recurrent computations and memory
units to refine the keypoint locations under the same parametric capacity.
Comment: accepted at ICCV 2017
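The keypoint formulation reduces layout estimation to peak-finding in per-keypoint heatmaps. Below is a small sketch of that decoding step; the heatmap shape and keypoint count are placeholders, since RoomNet's actual output depends on the predicted room type.

```python
import torch

def decode_keypoints(heatmaps):
    """Recover ordered (x, y) keypoint locations from per-keypoint heatmaps.

    heatmaps: (K, H, W) tensor, one channel per ordered layout keypoint
    (shapes here are illustrative, not RoomNet's exact head).
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.view(K, -1)
    idx = flat.argmax(dim=1)  # peak location per keypoint channel
    ys = torch.div(idx, W, rounding_mode="floor")
    xs = idx % W
    return torch.stack([xs, ys], dim=1)  # (K, 2) ordered keypoints

kps = decode_keypoints(torch.rand(8, 40, 40))
print(kps.shape)  # torch.Size([8, 2])
```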
OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas
Work on depth estimation has so far focused only on projective images,
ignoring 360 content, which is now increasingly and more easily produced.
We show that monocular depth estimation models trained on traditional images
produce sub-optimal results on omnidirectional images, showcasing the need for
training directly on 360 datasets, which, however, are hard to acquire. In this
work, we circumvent the challenges associated with acquiring high quality 360
datasets with ground truth depth annotations, by re-using recently released
large scale 3D datasets and re-purposing them to 360 via rendering. This
dataset, which is considerably larger than similar projective datasets, is
publicly offered to the community to enable future research in this direction.
We use this dataset to learn in an end-to-end fashion the task of depth
estimation from 360 images. We show promising results in our synthesized data
as well as in unseen realistic images.
Comment: Pre-print to appear in ECCV 2018
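Re-purposing 3D datasets to 360 via rendering hinges on the equirectangular camera model, where each pixel maps to a viewing ray on the sphere. The sketch below assumes a common convention (longitude left-to-right, y up); the paper's exact rendering pipeline is not specified here.

```python
import numpy as np

def equirect_rays(height, width):
    """Unit viewing ray for every pixel of an equirectangular (360) image.

    Convention assumed here: longitude spans [-pi, pi] left-to-right,
    latitude spans [pi/2, -pi/2] top-to-bottom; y is up.
    """
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height
    lon = (u - 0.5) * 2.0 * np.pi  # longitude per column
    lat = (0.5 - v) * np.pi        # latitude per row
    lon, lat = np.meshgrid(lon, lat)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)  # (H, W, 3) unit vectors

rays = equirect_rays(256, 512)
print(rays.shape, np.allclose(np.linalg.norm(rays, axis=-1), 1.0))
```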
Monocular Object and Plane SLAM in Structured Environments
In this paper, we present a monocular Simultaneous Localization and Mapping
(SLAM) algorithm using high-level object and plane landmarks. The built map is
denser, more compact, and semantically meaningful compared to feature-point-based
SLAM. We first propose a high order graphical model to jointly infer the 3D
object and layout planes from single images considering occlusions and semantic
constraints. The extracted objects and planes are further optimized with camera
poses in a unified SLAM framework. Objects and planes can provide more semantic
constraints such as Manhattan plane and object supporting relationships
compared to points. Experiments on various public and collected datasets
including ICL-NUIM and TUM Mono show that our algorithm can improve camera
localization accuracy compared to state-of-the-art SLAM especially when there
is no loop closure, and also generate dense maps robustly in many structured
environments.
Comment: IEEE Robotics and Automation Letters
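One of the semantic constraints mentioned here is the Manhattan-plane assumption: layout plane normals should align with one of three dominant directions. The snippet below is a simplified stand-in for such a constraint, not the paper's actual factor-graph formulation.

```python
import numpy as np

def manhattan_residual(normal, axes=np.eye(3)):
    """Angular residual between a plane normal and its closest Manhattan axis.

    A simplified stand-in for the Manhattan-plane constraint the abstract
    mentions; the paper's actual cost term may differ.
    """
    n = normal / np.linalg.norm(normal)
    # Closest of the three dominant directions (sign-invariant).
    cosines = np.abs(axes @ n)
    return float(np.arccos(np.clip(cosines.max(), -1.0, 1.0)))

print(manhattan_residual(np.array([0.05, 0.99, 0.02])))  # small angle: nearly axis-aligned
```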
HorizonNet: Learning Room Layout with 1D Representation and Pano Stretch Data Augmentation
We present a new approach to the problem of estimating the 3D room layout
from a single panoramic image. We represent room layout as three 1D vectors
that encode, at each image column, the positions of the floor-wall and
ceiling-wall boundaries, and the existence of a wall-wall boundary. The proposed network,
HorizonNet, trained for predicting 1D layout, outperforms previous
state-of-the-art approaches. The designed post-processing procedure for
recovering 3D room layouts from 1D predictions can automatically infer the room
shape with low computation cost - it takes less than 20ms for a panorama image
while prior works might need dozens of seconds. We also propose Pano Stretch
Data Augmentation, which can diversify panorama data and be applied to other
panorama-related learning tasks. Due to the limited data available for
non-cuboid layouts, we relabel 65 general layouts from the current dataset for
fine-tuning. Our approach shows good performance on general layouts in both
qualitative results and cross-validation.
Comment: CVPR 2019
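Pano Stretch can be sketched directly in direction space: stretching the scene by S = diag(kx, 1, kz) means a target ray d' samples the source panorama at normalize(S^-1 d'). The nearest-neighbor remap below is a simplified illustration (the method also transforms the ground-truth boundaries accordingly); the axis convention and interpolation are assumptions.

```python
import numpy as np

def pano_stretch(pano, kx=1.5, kz=1.0):
    """Pano-Stretch-style augmentation: stretch the scene along the x/z axes.

    For a camera at the origin, a target ray d' sees the stretched point S p,
    so we sample the source panorama at normalize(S^-1 d'). Nearest-neighbor
    sampling only; a simplified sketch of the augmentation in the abstract.
    """
    H, W = pano.shape[:2]
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lon = (u + 0.5) / W * 2 * np.pi - np.pi
    lat = np.pi / 2 - (v + 0.5) / H * np.pi
    # Target ray directions, then inverse stretch (y is up).
    x = np.cos(lat) * np.sin(lon) / kx
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon) / kz
    src_lon = np.arctan2(x, z)
    src_lat = np.arcsin(y / np.sqrt(x * x + y * y + z * z))
    su = np.clip(((src_lon + np.pi) / (2 * np.pi) * W).astype(int), 0, W - 1)
    sv = np.clip(((np.pi / 2 - src_lat) / np.pi * H).astype(int), 0, H - 1)
    return pano[sv, su]

aug = pano_stretch(np.random.rand(256, 512, 3), kx=2.0)
print(aug.shape)  # (256, 512, 3)
```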
VisualEchoes: Spatial Image Representation Learning through Echolocation
Several animal species (e.g., bats, dolphins, and whales) and even visually
impaired humans have the remarkable ability to perform echolocation: a
biological sonar used to perceive spatial layout and locate objects in the
world. We explore the spatial cues contained in echoes and how they can benefit
vision tasks that require spatial reasoning. First we capture echo responses in
photo-realistic 3D indoor scene environments. Then we propose a novel
interaction-based representation learning framework that learns useful visual
features via echolocation. We show that the learned image features are useful
for multiple downstream vision tasks requiring spatial reasoning---monocular
depth estimation, surface normal estimation, and visual navigation---with
results comparable to or even better than heavily supervised pre-training. Our
work opens a new path for representation learning for embodied agents, where
supervision comes from interacting with the physical world.
Comment: Appears in ECCV 2020
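As a schematic of interaction-based pretraining, one can train an image encoder whose features predict the echo recorded at the same viewpoint, then reuse the encoder downstream. The regression objective, network sizes, and spectrogram shape below are placeholders; the paper's actual pretext task may be formulated differently.

```python
import torch
import torch.nn as nn

# Minimal sketch: learn image features by predicting the echo response
# captured at the same viewpoint. Sizes are placeholders, not the paper's setup.
image_encoder = nn.Sequential(
    nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 5, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
echo_head = nn.Linear(64, 2 * 64 * 16)  # predicts a flattened binaural echo spectrogram

rgb = torch.randn(8, 3, 128, 128)
echo = torch.randn(8, 2 * 64 * 16)      # target echoes recorded in the 3D scenes
loss = nn.functional.mse_loss(echo_head(image_encoder(rgb)), echo)
loss.backward()
# After pretraining, image_encoder is reused for depth / normals / navigation.
```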
Taskonomy: Disentangling Task Transfer Learning
Do visual tasks have a relationship, or are they unrelated? For instance,
could having surface normals simplify estimating the depth of an image?
Intuition answers these questions positively, implying existence of a structure
among visual tasks. Knowing this structure has notable values; it is the
concept underlying transfer learning and provides a principled way for
identifying redundancies across tasks, e.g., to seamlessly reuse supervision
among related tasks or solve many tasks in one system without piling up the
complexity.
We propose a fully computational approach for modeling the structure of the
space of visual tasks. This is done by finding (first and higher-order)
transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D,
and semantic tasks in a latent space. The product is a computational taxonomic
map for task transfer learning. We study the consequences of this structure,
e.g., nontrivial emergent relationships, and exploit them to reduce the demand
for labeled data. For example, we show that the total number of labeled
datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3
(compared to training independently) while keeping the performance nearly the
same. We provide a set of tools for computing and probing this taxonomical
structure including a solver that users can employ to devise efficient
supervision policies for their use cases.
Comment: CVPR 2018 (Oral). See project website and live demos at
http://taskonomy.vision
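At its core, the taxonomy turns measured transfer performance into a source-selection problem: given an affinity matrix, choose a budget of source tasks that best serves all targets. The paper solves a Boolean Integer Program; the greedy version below is only an illustrative approximation.

```python
import numpy as np

def pick_sources(affinity, budget):
    """Greedy stand-in for Taskonomy's supervision-policy solver.

    affinity[s, t]: benefit of transferring from source s to target t
    (the paper estimates these from actual transfer performance and
    solves a Boolean Integer Program; greedy selection is illustrative).
    """
    num_sources = affinity.shape[0]
    chosen = []
    for _ in range(budget):
        best, best_gain = None, -np.inf
        for s in range(num_sources):
            if s in chosen:
                continue
            # Coverage gain: each target is served by its best chosen source.
            gain = affinity[chosen + [s]].max(axis=0).sum()
            if gain > best_gain:
                best, best_gain = s, gain
        chosen.append(best)
    return chosen

aff = np.random.rand(26, 26)  # 26 tasks, as in the paper's dictionary
print(pick_sources(aff, budget=4))
```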
Structural and object detection for phosphene images
Prosthetic vision based on phosphenes is a promising way to provide visual
perception to some blind people. However, phosphenic images are very limited in
terms of spatial resolution (e.g.: 32 x 32 phosphene array) and luminance
levels (e.g.: 8 gray levels), which results in the subject receiving very
limited information about the scene. This requires using high-level processing
to extract more information from the scene and present it to the subject with
the phosphenes limitations. In this work, we study the recognition of indoor
environments under simulated prosthetic vision. Most research in simulated
prosthetic vision is based on static images, while very few
researchers have addressed the problem of scene recognition through video
sequences. We propose a new approach to build a schematic representation of
indoor environments for phosphene images. Our schematic representation relies
on two parallel CNNs for the extraction of the structurally informative edges of
room and the relevant object silhouettes based on mask segmentation. We have
performed a study with twelve normally sighted subjects to evaluate how well our
methods support room recognition when presenting phosphene images and
videos. We show that our method increases the recognition ability of
the user from 75% with alternative methods to 90% with our approach.
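The stated prosthetic limits (a 32 x 32 phosphene array with 8 gray levels) are straightforward to simulate for sighted subjects. A minimal sketch, using average pooling and quantization as assumed stand-ins for the actual phosphene rendering:

```python
import numpy as np

def simulate_phosphenes(gray_image, grid=32, levels=8):
    """Reduce an image to the stated prosthetic limits: a grid x grid
    phosphene array with a few luminance levels (32 and 8 are the
    abstract's example figures).

    gray_image: 2D float array in [0, 1].
    """
    H, W = gray_image.shape
    # Average-pool the image onto the phosphene grid.
    ph = gray_image[: H - H % grid, : W - W % grid]
    ph = ph.reshape(grid, ph.shape[0] // grid, grid, ph.shape[1] // grid).mean(axis=(1, 3))
    # Quantize to the available gray levels.
    return np.round(ph * (levels - 1)) / (levels - 1)

img = np.random.rand(256, 256)
print(simulate_phosphenes(img).shape)  # (32, 32), 8 distinct levels
```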
Structured Depth Prediction in Challenging Monocular Video Sequences
In this paper, we tackle the problem of estimating the depth of a scene from
a monocular video sequence. In particular, we handle challenging scenarios,
such as non-translational camera motion and dynamic scenes, where traditional
structure from motion and motion stereo methods do not apply. To this end, we
first study the problem of depth estimation from a single image. In this
context, we exploit the availability of a pool of images for which the depth is
known, and formulate monocular depth estimation as a discrete-continuous
optimization problem, where the continuous variables encode the depth of the
superpixels in the input image, and the discrete ones represent relationships
between neighboring superpixels. The solution to this discrete-continuous
optimization problem is obtained by performing inference in a graphical model
using particle belief propagation. To handle video sequences, we then extend
our single image model to a two-frame one that naturally encodes short-range
temporal consistency and inherently handles dynamic objects. Based on the
prediction of this model, we then introduce a fully-connected pairwise CRF that
accounts for longer range spatio-temporal interactions throughout a video. We
demonstrate the effectiveness of our model in both the indoor and outdoor
scenarios.
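A stripped-down version of the discrete-continuous model helps fix ideas: continuous depths live on superpixels, discrete labels live on edges and switch the smoothness term off across occlusions. The potentials below are toy assumptions; the paper's unaries come from the image pool and inference uses particle belief propagation.

```python
def energy(depths, relations, unary, edges):
    """Evaluate a simplified discrete-continuous layout energy.

    depths:    continuous depth per superpixel (the continuous variables)
    relations: per-edge discrete label, 0 = smooth, 1 = occlusion
    unary:     unary(i, d) -> data cost of depth d at superpixel i
    edges:     list of neighboring superpixel pairs (i, j)

    A schematic version of the model in the abstract; the actual
    potentials and particle-BP inference are richer.
    """
    e = sum(unary(i, d) for i, d in enumerate(depths))
    for (i, j), r in zip(edges, relations):
        if r == 0:  # smooth relation: neighbors should share similar depth
            e += (depths[i] - depths[j]) ** 2
        # r == 1 (occlusion): no smoothness penalty across this edge
    return e

unary = lambda i, d: (d - 2.0) ** 2  # toy data term pulling depths to 2 m
print(energy([1.9, 2.1, 3.0], [0, 1], unary, [(0, 1), (1, 2)]))
```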
GeoLayout: Geometry Driven Room Layout Estimation Based on Depth Maps of Planes
The task of room layout estimation is to locate the wall-floor, wall-ceiling,
and wall-wall boundaries. Most recent methods solve this problem based on
edge/keypoint detection or semantic segmentation. However, these approaches
have paid limited attention to the geometry of the dominant planes and the
intersections between them, which have a significant impact on the room layout. In this
work, we propose to incorporate geometric reasoning to deep learning for layout
estimation. Our approach learns to infer the depth maps of the dominant planes
in the scene by predicting the pixel-level surface parameters, and the layout
can be generated by the intersection of the depth maps. Moreover, we present a
new dataset with pixel-level depth annotation of dominant planes. It is larger
than the existing datasets and contains both cuboid and non-cuboid rooms.
Experimental results show that our approach produces considerable performance
gains on both 2D and 3D datasets.
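The geometric core here is that a plane in camera coordinates yields inverse depth that is linear in pixel coordinates, so per-pixel surface parameters directly define a plane's depth map, and the layout follows from per-pixel plane selection. A toy sketch under that assumption (the parameter values are made up, not learned predictions):

```python
import numpy as np

def layout_from_planes(params, H, W):
    """Compose a room layout depth map from per-plane surface parameters.

    params: list of (a, b, c) per dominant plane, with inverse depth
    modeled as 1/Z = a*u + b*v + c at pixel (u, v); a plane in camera
    coordinates gives exactly this linear form. For a camera inside a
    convex room, the visible plane per pixel is the nearest one, i.e.
    the per-pixel maximum of inverse depth.
    """
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    inv_depths = np.stack([a * u + b * v + c for a, b, c in params])
    label = inv_depths.argmax(axis=0)     # which plane each pixel belongs to
    depth = 1.0 / inv_depths.max(axis=0)  # layout depth map
    return depth, label

planes = [(0.0, 0.001, 0.2), (0.0008, 0.0, 0.25), (-0.0008, 0.0, 0.6)]
depth, label = layout_from_planes(planes, 240, 320)
print(depth.shape, np.unique(label))
```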