9 research outputs found
Leveraging Deep Visual Descriptors for Hierarchical Efficient Localization
Many robotics applications require precise pose estimates despite operating
in large and changing environments. This can be addressed by visual
localization, using a pre-computed 3D model of the surroundings. The pose
estimation then amounts to finding correspondences between 2D keypoints in a
query image and 3D points in the model using local descriptors. However,
computational power is often limited on robotic platforms, making this task
challenging in large-scale environments. Binary feature descriptors
significantly speed up this 2D-3D matching, and have become popular in the
robotics community, but also strongly impair the robustness to perceptual
aliasing and changes in viewpoint, illumination and scene structure. In this
work, we propose to leverage recent advances in deep learning to perform an
efficient hierarchical localization. We first localize at the map level using
learned image-wide global descriptors, and subsequently estimate a precise pose
from 2D-3D matches computed in the candidate places only. This restricts the
local search and thus makes it possible to efficiently exploit powerful
non-binary descriptors usually dismissed on resource-constrained devices. Our
approach
results in state-of-the-art localization performance while running in real-time
on a popular mobile platform, enabling new prospects for robotics research.
Comment: CoRL 2018 Camera-ready (fix typos and update citations)
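The coarse-to-fine pipeline described above can be sketched in a few lines. The following is a minimal NumPy illustration under our own function names (not the authors' implementation): first shortlist candidate places by global-descriptor similarity, then match local descriptors only within those candidates; the pose would then be estimated from the resulting 2D-3D matches, e.g. with PnP.

```python
import numpy as np

def retrieve_candidates(query_global, db_globals, k=3):
    """Stage 1: rank database images by global-descriptor cosine similarity."""
    q = query_global / np.linalg.norm(query_global)
    db = db_globals / np.linalg.norm(db_globals, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]  # indices of the top-k candidate places

def match_local(query_desc, cand_desc, ratio=0.8):
    """Stage 2: nearest-neighbour 2D-3D matching with Lowe's ratio test,
    restricted to the descriptors of one candidate place."""
    matches = []
    for i, d in enumerate(query_desc):
        dists = np.linalg.norm(cand_desc - d, axis=1)
        j, j2 = np.argsort(dists)[:2]
        if dists[j] < ratio * dists[j2]:
            matches.append((i, j))
    return matches
```

Because matching runs only against the shortlisted places, expensive non-binary descriptors stay affordable even on a large map.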
Delta Descriptors: Change-Based Place Representation for Robust Visual Localization
Visual place recognition is challenging because there are so many factors
that can cause the appearance of a place to change, from day-night cycles to
seasonal change to atmospheric conditions. In recent years a wide range of
approaches has been developed to address this challenge, including deep-learnt
image descriptors, domain translation, and sequential filtering, all with
shortcomings such as limited generality and sensitivity to camera velocity. In
this paper we
propose a novel descriptor derived from tracking changes in any learned global
descriptor over time, dubbed Delta Descriptors. Delta Descriptors mitigate the
offsets induced in the original descriptor matching space in an unsupervised
manner by considering temporal differences across places observed along a
route. Like all other approaches, Delta Descriptors have a shortcoming -
volatility on a frame-to-frame basis - which can be overcome by combining them
with sequential filtering methods. Using two benchmark datasets, we first
demonstrate the high performance of Delta Descriptors in isolation, before
showing new state-of-the-art performance when combined with sequence-based
matching. We also present results demonstrating the approach working with four
different underlying descriptor types, and two other beneficial properties of
Delta Descriptors in comparison to existing techniques: their increased
inherent robustness to variations in camera motion and a reduced rate of
performance degradation as dimensional reduction is applied. Source code is
made available at https://github.com/oravus/DeltaDescriptors.
Comment: 8 pages and 7 figures. Published in 2020 IEEE Robotics and Automation Letters (RA-L)
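A simplified reading of the construction can be written down directly: take the difference between a short look-ahead average and a look-back average of the global descriptor along the route, and normalise per frame. This is our own sketch (window handling and normalisation are assumptions), not the released implementation.

```python
import numpy as np

def delta_descriptors(desc_seq, window=4):
    """Change-based descriptors: for each frame, subtract the mean
    descriptor over a look-back window from the mean over a look-ahead
    window, then L2-normalise. desc_seq: (T, D) array of global
    descriptors observed sequentially along a route."""
    T, _ = desc_seq.shape
    half = window // 2
    out = np.zeros_like(desc_seq)
    for t in range(T):
        past = desc_seq[max(0, t - half):t + 1].mean(axis=0)
        future = desc_seq[t:min(T, t + half) + 1].mean(axis=0)
        d = future - past
        n = np.linalg.norm(d)
        out[t] = d / n if n > 0 else d
    return out
```

Appearance offsets that shift all descriptors of a traverse by a similar amount cancel in the difference, which is the intuition behind the robustness claim; the per-frame volatility noted above is also visible here, since each output depends on only a few neighbouring frames.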
Augmenting Visual Place Recognition with Structural Cues
In this paper, we propose to augment image-based place recognition with
structural cues. Specifically, these structural cues are obtained using
structure-from-motion, such that no additional sensors are needed for place
recognition. This is achieved by augmenting the 2D convolutional neural network
(CNN) typically used for image-based place recognition with a 3D CNN that takes
as input a voxel grid derived from the structure-from-motion point cloud. We
evaluate different methods for fusing the 2D and 3D features and obtain best
performance with global average pooling and simple concatenation. On the Oxford
RobotCar dataset, the resulting descriptor exhibits superior recognition
performance compared to descriptors extracted from only one of the input
modalities, including state-of-the-art image-based descriptors. Especially at
low descriptor dimensionalities, we outperform state-of-the-art descriptors by
up to 90%.
Comment: 8 pages, published in RA-L & IROS 202
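The best-performing fusion reported above (global average pooling followed by simple concatenation) is easy to make concrete. The sketch below is ours, with assumed tensor layouts; the real system obtains `feat2d` from a 2D CNN on the image and `feat3d` from a 3D CNN on the structure-from-motion voxel grid.

```python
import numpy as np

def fuse_descriptors(feat2d, feat3d):
    """Fuse image and structure features: global average pooling over the
    spatial dimensions of each branch, then concatenation.
    feat2d: (C2, H, W) CNN feature map of the image.
    feat3d: (C3, X, Y, Z) 3D-CNN feature volume of the voxel grid."""
    g2 = feat2d.mean(axis=(1, 2))      # (C2,) pooled image descriptor
    g3 = feat3d.mean(axis=(1, 2, 3))   # (C3,) pooled structure descriptor
    fused = np.concatenate([g2, g3])
    return fused / np.linalg.norm(fused)  # unit-norm joint place descriptor
```

Since structure-from-motion already runs for mapping, the 3D branch adds no sensors, only compute.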
Learning camera localization via dense scene matching
This thesis presents a method for camera localization. Given a set of reference images with known camera poses, camera localization aims to estimate the 6 DoF camera pose for an arbitrary query image captured in the same environment. It can also be generalized to recover the 6 DoF pose of each frame of an input query video. Traditional methods detect and match interest points between the query image and a pre-built 3D model, and then solve for the camera pose using the PnP algorithm combined with RANSAC. The recent development of deep learning has motivated end-to-end approaches for camera localization. Those methods encode scene structures into the parameters of a specific convolutional neural network (CNN) and are thus able to predict, for a query image, a dense coordinate map whose pixels record 3D scene coordinates. This dense coordinate map can be used to estimate camera poses in the same way as traditional methods. However, most of these learning-based methods require re-training or re-adaptation for a new scene and have difficulties handling large-scale scenes due to limited network capacity. In this thesis, we present a new method for scene-agnostic camera localization that can be applied to a novel scene without retraining. This scene-agnostic localization is achieved with our dense scene matching (DSM) technique, where a cost volume is constructed between a query image and a scene. The cost volume is fed to a CNN to predict the dense coordinate map, from which the 6 DoF camera pose is computed. In addition, our method can be directly applied to query video clips, which yields an extra performance boost at test time by exploiting temporal constraints between neighboring frames. Our method achieves state-of-the-art performance on several benchmarks.
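The core data structure here is the cost volume between query pixels and scene points. As a toy stand-in (in DSM the cost volume is processed by a CNN, not a fixed operator), one can correlate each query feature against all scene features and take a softmax-weighted average of the scene's 3D coordinates to get a dense coordinate prediction:

```python
import numpy as np

def dense_scene_matching(query_feats, scene_feats, scene_coords, temp=0.1):
    """Toy version of the cost-volume idea: for every query pixel,
    correlate its feature against all scene features and soft-select
    among the scene's 3D coordinates.
    query_feats: (N, D) per-pixel query features (flattened image).
    scene_feats: (M, D) features of scene points; scene_coords: (M, 3)."""
    qn = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sn = scene_feats / np.linalg.norm(scene_feats, axis=1, keepdims=True)
    cost = qn @ sn.T                       # (N, M) cosine cost volume
    w = np.exp(cost / temp)
    w /= w.sum(axis=1, keepdims=True)      # softmax over scene points
    return w @ scene_coords                # (N, 3) predicted coordinate map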
Robust and Accurate Camera Localisation at a Large Scale
The task of camera-based localization is to quickly and precisely pinpoint the location (and viewing direction) at which an image was taken, against a pre-stored large-scale map of the environment. This technique can be used in many 3D computer vision applications, e.g., AR/VR and autonomous driving.
Mapping the world is the first step in enabling camera-based localization, since a pre-stored map serves as a reference for a query image or sequence. In this thesis, we exploit three readily available sources: (i) satellite images; (ii) ground-view images; (iii) 3D point clouds. Based on these three sources, we propose solutions to localize a query camera both effectively and efficiently, i.e., to accurately localize a query camera under a variety of lighting and viewing conditions within a small amount of time. The main contributions are summarized as follows.
In chapter 3, we present minimal 4-point and 2-point solvers to estimate relative and absolute camera poses, respectively. The core idea is to exploit the vertical direction, obtained from an IMU or a vanishing point, to derive closed-form solutions: a quartic equation for the relative pose and a quadratic equation for the absolute pose.
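The derivations themselves are involved, but the enabling step is simple to illustrate: once both camera frames are rotated so the measured vertical coincides with the canonical up-axis, the remaining unknown rotation is a single yaw angle, which is why fewer point correspondences suffice. A minimal sketch of that alignment (our own helper, using Rodrigues' formula):

```python
import numpy as np

def gravity_align(v):
    """Rotation taking the measured vertical v (e.g. from an IMU) to the
    canonical up-axis [0, 0, 1]. After applying such a rotation to both
    frames, the remaining relative rotation is a pure 1-DoF yaw about z."""
    v = v / np.linalg.norm(v)
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(v, z)
    s, c = np.linalg.norm(axis), v @ z   # sin and cos of the angle to z
    if s < 1e-12:                        # already (anti-)aligned with z
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = axis / s
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + s * K + (1 - c) * (K @ K)  # Rodrigues' formula
```

With pitch and roll fixed by gravity, the pose polynomials collapse to the low-degree equations mentioned above.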
In chapter 4, we localize a ground-view query image against a satellite map. Inspired by the insight that humans commonly use orientation information as an important cue for spatial localization, we propose a method that endows deep neural networks with the 'commonsense' of orientation. We design a Siamese network that explicitly encodes the orientation of each pixel in the ground-view and satellite images. Our method boosts the discriminative power of the learned deep features, outperforming all previous methods.
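One plausible way to feed per-pixel orientation to such a network, shown purely as a hypothetical sketch (the thesis's exact encoding may differ): for a panoramic ground-view image, each column corresponds to an azimuth, and sine/cosine maps of that azimuth can be appended as extra input channels.

```python
import numpy as np

def orientation_channels(h, w):
    """Hypothetical per-pixel orientation encoding: map each image column
    of a panorama to an azimuth in [0, 2*pi) and return sin/cos maps,
    shaped (2, h, w), to concatenate with the RGB input channels."""
    az = np.linspace(0.0, 2 * np.pi, w, endpoint=False)  # one angle per column
    az = np.broadcast_to(az, (h, w))
    return np.stack([np.sin(az), np.cos(az)], axis=0)
```

Using sin/cos rather than the raw angle avoids the 0/2π discontinuity at the panorama seam.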
In chapter 5, we localize a ground-view query image against a ground-view image database. We propose a representation learning method with higher location-discriminating power. The core idea is to learn a discriminative image embedding: similarities among intra-place images (viewing the same landmarks) are maximized, while similarities among inter-place images (viewing different landmarks) are minimized. The method is easy to implement and pluggable into any CNN. Experiments show that our method outperforms all previous methods.
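The stated objective (pull intra-place embeddings together, push inter-place embeddings apart) is the pattern behind metric-learning losses. As a hedged illustration only, and not necessarily the thesis's exact loss, a standard triplet formulation of that goal looks like this:

```python
import numpy as np

def triplet_place_loss(anchor, positive, negative, margin=0.3):
    """Triplet-style objective: the anchor should be closer to an
    intra-place embedding (positive) than to an inter-place embedding
    (negative) by at least `margin`. Embeddings are unit-normalised."""
    def unit(x):
        return x / np.linalg.norm(x)
    a, p, ng = unit(anchor), unit(positive), unit(negative)
    d_pos = np.linalg.norm(a - p)   # distance to same-place embedding
    d_neg = np.linalg.norm(a - ng)  # distance to different-place embedding
    return max(0.0, d_pos - d_neg + margin)
```

Because the loss operates purely on the embedding vectors, it can indeed be bolted onto any CNN backbone, which matches the "pluggable" claim above.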
In chapter 6, we localize a ground-view query image against a large-scale 3D point cloud with visual descriptors. To address the ambiguities in direct 2D--3D feature matching, we introduce a global matching method that harnesses the global contextual information exhibited both within the query image and among all the 3D points in the map. The core idea is to find the optimal matching between the 2D set and the 3D set. Tests on standard benchmark datasets show the effectiveness of our method.
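The contrast with greedy per-feature matching can be made concrete. The toy below (our own stand-in, not the thesis's algorithm, and brute-force, so only for tiny sets) finds the one-to-one assignment minimising the *total* descriptor distance over the whole sets rather than matching each 2D feature independently:

```python
import numpy as np
from itertools import permutations

def global_2d3d_match(desc2d, desc3d):
    """Set-to-set 2D-3D matching: enumerate one-to-one assignments and
    keep the one with minimal total descriptor distance. Practical
    systems use efficient solvers (e.g. the Hungarian algorithm)."""
    cost = np.linalg.norm(desc2d[:, None, :] - desc3d[None, :, :], axis=2)
    n = len(desc2d)
    best, best_cost = None, np.inf
    for perm in permutations(range(cost.shape[1]), n):
        c = sum(cost[i, j] for i, j in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return list(enumerate(best))
```

A globally optimal assignment can overrule a locally ambiguous nearest neighbour, which is exactly the ambiguity the chapter targets.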
In chapter 7, we localize a ground-view query image against a 3D point cloud with only coordinates (no visual descriptors). The problem is also known as blind Perspective-n-Point. We propose a deep CNN model that simultaneously solves for both the 6-DoF absolute camera pose and the 2D--3D correspondences. The core idea is to extract point-wise 2D and 3D features from the respective coordinates and to match the 2D and 3D features effectively in a global feature matching module. Extensive tests on both real and simulated data show that our method substantially outperforms existing approaches.
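Inside a learned pipeline, the matching step must be differentiable. A common way to realise a global feature matching module, shown here as an assumption rather than the thesis's exact design, is Sinkhorn normalisation: alternately normalise the rows and columns of an exponentiated score matrix so it approaches a doubly-stochastic soft assignment between the 2D and 3D point features.

```python
import numpy as np

def sinkhorn_match(score, iters=20):
    """Differentiable soft matching: Sinkhorn iterations on exp(score)
    push the matrix towards a doubly-stochastic assignment between
    point-wise 2D and 3D features."""
    P = np.exp(score)
    for _ in range(iters):
        P /= P.sum(axis=1, keepdims=True)  # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)  # columns sum to 1
    return P
```

The resulting soft correspondences (and the pose computed from them) remain differentiable in the feature scores, so the matcher and feature extractors can be trained end-to-end.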
Last, in chapter 8, we study the potential of using 3D lines. Specifically, we study the problem of aligning two partially overlapping 3D line reconstructions in Euclidean space. This technique can be used for localization with respect to a 3D line database when query 3D line reconstructions are available (e.g., from stereo triangulation). We propose a neural network that takes Pluecker representations of lines as input and solves for line-to-line matches and a 6-DoF rigid transformation. Experiments on indoor and outdoor datasets show that our method's registration (rotation and translation) precision significantly outperforms baselines.
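For reference, the Pluecker representation mentioned above encodes an infinite 3D line as a unit direction together with its moment about the origin; the moment is the same no matter which point on the line is used, which makes the representation a natural network input. A minimal sketch:

```python
import numpy as np

def pluecker(p, q):
    """Pluecker representation (d, m) of the 3D line through points p, q:
    unit direction d = (q - p)/|q - p| and moment m = p x d. The moment
    is invariant to the choice of point on the line."""
    d = (q - p) / np.linalg.norm(q - p)
    m = np.cross(p, d)
    return d, m
```

Two segments sampled from the same infinite line therefore map to the same (d, m) pair, so the network sees the line itself rather than its (arbitrary) endpoints.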