26 research outputs found
Learning to Navigate the Energy Landscape
In this paper, we present a novel and efficient architecture for addressing
computer vision problems that use `Analysis by Synthesis'. Analysis by
synthesis involves the minimization of the reconstruction error which is
typically a non-convex function of the latent target variables.
State-of-the-art methods adopt a hybrid scheme where discriminatively trained
predictors like Random Forests or Convolutional Neural Networks are used to
initialize local search algorithms. While these methods have been shown to
produce promising results, they often get stuck in local optima. Our method
goes beyond the conventional hybrid architecture by not only proposing multiple
accurate initial solutions but by also defining a navigational structure over
the solution space that can be used for extremely efficient gradient-free local
search. We demonstrate the efficacy of our approach on the challenging problem
of RGB Camera Relocalization. To make the RGB camera relocalization problem
particularly challenging, we introduce a new dataset of 3D environments which
are significantly larger than those found in other publicly-available datasets.
Our experiments reveal that the proposed method is able to achieve
state-of-the-art camera relocalization results. We also demonstrate the
generalizability of our approach on Hand Pose Estimation and Image Retrieval
tasks
Exploiting Points and Lines in Regression Forests for RGB-D Camera Relocalization
Camera relocalization plays a vital role in many robotics and computer vision
tasks, such as global localization, recovery from tracking failure and loop
closure detection. Recent random forests based methods exploit randomly sampled
pixel comparison features to predict 3D world locations for 2D image locations
to guide the camera pose optimization. However, these image features are only
sampled randomly in the images, without considering the spatial structures or
geometric information, leading to large errors or failure cases with the
existence of poorly textured areas or in motion blur. Line segment features are
more robust in these environments. In this work, we propose to jointly exploit
points and lines within the framework of uncertainty driven regression forests.
The proposed approach is thoroughly evaluated on three publicly available
datasets against several strong state-of-the-art baselines in terms of several
different error metrics. Experimental results prove the efficacy of our method,
showing superior or on-par state-of-the-art performance.Comment: published as a conference paper at 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS
Learning Less is More - 6D Camera Localization via 3D Surface Regression
Popular research areas like autonomous driving and augmented reality have
renewed the interest in image-based camera localization. In this work, we
address the task of predicting the 6D camera pose from a single RGB image in a
given 3D environment. With the advent of neural networks, previous works have
either learned the entire camera localization process, or multiple components
of a camera localization pipeline. Our key contribution is to demonstrate and
explain that learning a single component of this pipeline is sufficient. This
component is a fully convolutional neural network for densely regressing
so-called scene coordinates, defining the correspondence between the input
image and the 3D scene space. The neural network is prepended to a new
end-to-end trainable pipeline. Our system is efficient, highly accurate, robust
in training, and exhibits outstanding generalization capabilities. It exceeds
state-of-the-art consistently on indoor and outdoor datasets. Interestingly,
our approach surpasses existing techniques even without utilizing a 3D model of
the scene during training, since the network is able to discover 3D scene
geometry automatically, solely from single-view constraints.Comment: CVPR 201
Matterport3D: Learning from RGB-D Data in Indoor Environments
Access to large, diverse RGB-D datasets is critical for training RGB-D scene
understanding algorithms. However, existing datasets still cover only a limited
number of views or a restricted scale of spaces. In this paper, we introduce
Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views
from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided
with surface reconstructions, camera poses, and 2D and 3D semantic
segmentations. The precise global alignment and comprehensive, diverse
panoramic set of views over entire buildings enable a variety of supervised and
self-supervised computer vision tasks, including keypoint matching, view
overlap prediction, normal prediction from color, semantic segmentation, and
region classification
InLoc: Indoor Visual Localization with Dense Matching and View Synthesis
We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph
with respect to a large indoor 3D map. The contributions of this work are
three-fold. First, we develop a new large-scale visual localization method
targeted for indoor environments. The method proceeds along three steps: (i)
efficient retrieval of candidate poses that ensures scalability to large-scale
environments, (ii) pose estimation using dense matching rather than local
features to deal with textureless indoor scenes, and (iii) pose verification by
virtual view synthesis to cope with significant changes in viewpoint, scene
layout, and occluders. Second, we collect a new dataset with reference 6DoF
poses for large-scale indoor localization. Query photographs are captured by
mobile phones at a different time than the reference 3D map, thus presenting a
realistic indoor localization scenario. Third, we demonstrate that our method
significantly outperforms current state-of-the-art indoor localization
approaches on this new challenging data
Kpsnet: Keypoint Detection and Feature Extraction for Point Cloud Registration
© 2019 IEEE. This paper presents the KPSNet, a KeyPoint Siamese Network to simultaneously learn task-desirable keypoint detector and feature extractor. The keypoint detector is optimized to predict a score vector, which signifies the probability of each candidate being a keypoint. The feature extractor is optimized to learn robust features of keypoints by exploiting the correspondence between the keypoints generated from two inputs, respectively. For training, the KPSNet does not require to manually annotate keypoints and local patches pairwise. Instead, we design an alignment module to establish the correspondence between the two inputs and generate positive and negative samples on-the-fly. Therefore, our method can be easily extended to new scenes. We test the proposed method on the open-source benchmark and experiments show the validity of our method
Global Localization: Utilizing Relative Spatio-Temporal Geometric Constraints from Adjacent and Distant Cameras
Re-localizing a camera from a single image in a previously mapped area is
vital for many computer vision applications in robotics and augmented/virtual
reality. In this work, we address the problem of estimating the 6 DoF camera
pose relative to a global frame from a single image. We propose to leverage a
novel network of relative spatial and temporal geometric constraints to guide
the training of a Deep Network for localization. We employ simultaneously
spatial and temporal relative pose constraints that are obtained not only from
adjacent camera frames but also from camera frames that are distant in the
spatio-temporal space of the scene. We show that our method, through these
constraints, is capable of learning to localize when little or very sparse
ground-truth 3D coordinates are available. In our experiments, this is less
than 1% of available ground-truth data. We evaluate our method on 3 common
visual localization datasets and show that it outperforms other direct pose
estimation methods.Comment: To be published in the proceedings of IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS) 202
Distinctive 3D local deep descriptors
We present a simple but yet effective method for learning distinctive 3D
local deep descriptors (DIPs) that can be used to register point clouds without
requiring an initial alignment. Point cloud patches are extracted,
canonicalised with respect to their estimated local reference frame and encoded
into rotation-invariant compact descriptors by a PointNet-based deep neural
network. DIPs can effectively generalise across different sensor modalities
because they are learnt end-to-end from locally and randomly sampled points.
Because DIPs encode only local geometric information, they are robust to
clutter, occlusions and missing regions. We evaluate and compare DIPs against
alternative hand-crafted and deep descriptors on several indoor and outdoor
datasets consisting of point clouds reconstructed using different sensors.
Results show that DIPs (i) achieve comparable results to the state-of-the-art
on RGB-D indoor scenes (3DMatch dataset), (ii) outperform state-of-the-art by a
large margin on laser-scanner outdoor scenes (ETH dataset), and (iii)
generalise to indoor scenes reconstructed with the Visual-SLAM system of
Android ARCore. Source code: https://github.com/fabiopoiesi/dip.Comment: IEEE International Conference on Pattern Recognition 202