17,144 research outputs found
Night-to-Day Image Translation for Retrieval-based Localization
Visual localization is a key step in many robotics pipelines, allowing the
robot to (approximately) determine its position and orientation in the world.
An efficient and scalable approach to visual localization is to use image
retrieval techniques. These approaches identify the image most similar to a
query photo in a database of geo-tagged images and approximate the query's pose
via the pose of the retrieved database image. However, image retrieval across
drastically different illumination conditions, e.g. day and night, is still a
problem with unsatisfactory results, even in this age of powerful neural
models. This is due to a lack of a suitably diverse dataset with true
correspondences to perform end-to-end learning. A recent class of neural models
allows for realistic translation of images among visual domains with relatively
little training data and, most importantly, without ground-truth pairings. In
this paper, we explore the task of accurately localizing images captured from
two traversals of the same area in both day and night. We propose ToDayGAN - a
modified image-translation model to alter nighttime driving images to a more
useful daytime representation. We then compare the daytime and translated night
images to obtain a pose estimate for the night image using the known 6-DOF
position of the closest day image. Our approach improves localization
performance by over 250% compared the current state-of-the-art, in the context
of standard metrics in multiple categories.Comment: Published in ICRA 201
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
Estimating the 6D pose of known objects is important for robots to interact
with the real world. The problem is challenging due to the variety of objects
as well as the complexity of a scene caused by clutter and occlusions between
objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network
for 6D object pose estimation. PoseCNN estimates the 3D translation of an
object by localizing its center in the image and predicting its distance from
the camera. The 3D rotation of the object is estimated by regressing to a
quaternion representation. We also introduce a novel loss function that enables
PoseCNN to handle symmetric objects. In addition, we contribute a large scale
video dataset for 6D object pose estimation named the YCB-Video dataset. Our
dataset provides accurate 6D poses of 21 objects from the YCB dataset observed
in 92 videos with 133,827 frames. We conduct extensive experiments on our
YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is
highly robust to occlusions, can handle symmetric objects, and provide accurate
pose estimation using only color images as input. When using depth data to
further refine the poses, our approach achieves state-of-the-art results on the
challenging OccludedLINEMOD dataset. Our code and dataset are available at
https://rse-lab.cs.washington.edu/projects/posecnn/.Comment: Accepted to RSS 201
Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression
This paper addresses the problem of localizing audio sources using binaural
measurements. We propose a supervised formulation that simultaneously localizes
multiple sources at different locations. The approach is intrinsically
efficient because, contrary to prior work, it relies neither on source
separation, nor on monaural segregation. The method starts with a training
stage that establishes a locally-linear Gaussian regression model between the
directional coordinates of all the sources and the auditory features extracted
from binaural measurements. While fixed-length wide-spectrum sounds (white
noise) are used for training to reliably estimate the model parameters, we show
that the testing (localization) can be extended to variable-length
sparse-spectrum sounds (such as speech), thus enabling a wide range of
realistic applications. Indeed, we demonstrate that the method can be used for
audio-visual fusion, namely to map speech signals onto images and hence to
spatially align the audio and visual modalities, thus enabling to discriminate
between speaking and non-speaking faces. We release a novel corpus of real-room
recordings that allow quantitative evaluation of the co-localization method in
the presence of one or two sound sources. Experiments demonstrate increased
accuracy and speed relative to several state-of-the-art methods.Comment: 15 pages, 8 figure
Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions
Visual localization enables autonomous vehicles to navigate in their
surroundings and augmented reality applications to link virtual to real worlds.
Practical visual localization approaches need to be robust to a wide variety of
viewing condition, including day-night changes, as well as weather and seasonal
variations, while providing highly accurate 6 degree-of-freedom (6DOF) camera
pose estimates. In this paper, we introduce the first benchmark datasets
specifically designed for analyzing the impact of such factors on visual
localization. Using carefully created ground truth poses for query images taken
under a wide variety of conditions, we evaluate the impact of various factors
on 6DOF camera pose estimation accuracy through extensive experiments with
state-of-the-art localization approaches. Based on our results, we draw
conclusions about the difficulty of different conditions, showing that
long-term localization is far from solved, and propose promising avenues for
future work, including sequence-based localization approaches and the need for
better local features. Our benchmark is available at visuallocalization.net.Comment: Accepted to CVPR 2018 as a spotligh
- …