Towards CNN map representation and compression for camera relocalisation
This paper presents a study on the use of Convolutional Neural Networks for
camera relocalisation and its application to map compression. We follow state-of-the-art visual relocalisation results and evaluate the response to different
data inputs. We use a CNN map representation and introduce the notion of map
compression under this paradigm by using smaller CNN architectures without
sacrificing relocalisation performance. We evaluate this approach in a series
of publicly available datasets over a number of CNN architectures with
different sizes, both in complexity and number of layers. This formulation
allows us to improve relocalisation accuracy by increasing the number of
training trajectories while maintaining a constant-size CNN.
Comment: Submitted to the 1st International Workshop on Deep Learning for Visual SLAM, at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
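To make the CNN-as-map idea concrete, the sketch below shows a hypothetical PoseNet-style pose-regression network in PyTorch whose width parameter trades network (i.e., map) size against capacity; the architecture and sizes are illustrative assumptions, not the paper's exact networks.

```python
import torch
import torch.nn as nn

class PoseRegressionCNN(nn.Module):
    """Regresses a 6-DoF camera pose (translation + quaternion) from an image,
    so the network weights themselves act as the scene 'map'."""

    def __init__(self, width: int = 32):
        super().__init__()
        # 'width' scales every channel count: a smaller width yields a smaller
        # network, i.e. a more compressed map representation.
        self.features = nn.Sequential(
            nn.Conv2d(3, width, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(width, 2 * width, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(2 * width, 4 * width, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.translation = nn.Linear(4 * width, 3)  # x, y, z
        self.rotation = nn.Linear(4 * width, 4)     # quaternion, normalised below

    def forward(self, image: torch.Tensor):
        f = self.features(image).flatten(1)
        q = self.rotation(f)
        return self.translation(f), q / q.norm(dim=1, keepdim=True)

# Two 'maps' of different sizes for the same scene: compression amounts to
# training the smaller network without sacrificing relocalisation accuracy.
large_map = PoseRegressionCNN(width=64)
small_map = PoseRegressionCNN(width=16)
t, q = small_map(torch.randn(1, 3, 224, 224))
```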
To Learn or Not to Learn: Visual Localization from Essential Matrices
Visual localization is the problem of estimating the pose of a camera within a scene and is a key component of computer vision applications such as self-driving cars and
Mixed Reality. State-of-the-art approaches for accurate visual localization use
scene-specific representations, resulting in the overhead of constructing these
models when applying the techniques to new scenes. Recently, deep
learning-based approaches based on relative pose estimation have been proposed,
carrying the promise of easily adapting to new scenes. However, it has been shown that such approaches are currently significantly less accurate than
state-of-the-art approaches. In this paper, we are interested in analyzing this
behavior. To this end, we propose a novel framework for visual localization
from relative poses. Using a classical feature-based approach within this
framework, we show state-of-the-art performance. Replacing the classical
approach with learned alternatives at various levels, we then identify the
reasons why deep-learned approaches do not perform well. Based on our analysis, we make recommendations for future work.
Comment: Accepted to ICRA 202
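The geometric core of localizing from relative poses can be illustrated with a small sketch: assuming that essential-matrix decompositions yield, for each retrieved database image with known pose, the direction from that camera towards the query, the query position follows by triangulating those rays. The least-squares formulation below is an illustration under these assumptions, not the paper's implementation.

```python
import numpy as np

def triangulate_camera_center(centers, directions):
    """Least-squares intersection of the rays c_i + s * d_i.

    centers: known database camera centers; directions: unit vectors towards
    the query camera, as recovered from relative pose (essential matrix) estimates.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector onto the plane normal to d
        A += P
        b += P @ c
    return np.linalg.solve(A, b)        # point minimising distance to all rays

# Toy example: two database cameras observing a query camera at (1, 1, 3).
centers = [np.array([0.0, 0.0, 0.0]), np.array([2.0, 0.0, 0.0])]
true_query = np.array([1.0, 1.0, 3.0])
directions = [true_query - c for c in centers]  # would come from E-matrices
print(triangulate_camera_center(centers, directions))  # ~ [1. 1. 3.]
```

The query rotation follows directly by chaining a database camera's known absolute rotation with the estimated relative rotation.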
Is Geometry Enough for Matching in Visual Localization?
In this paper, we propose to go beyond the well-established approach to
vision-based localization that relies on visual descriptor matching between a
query image and a 3D point cloud. While matching keypoints via visual
descriptors makes localization highly accurate, it has significant storage
demands, raises privacy concerns, and requires updating the descriptors in the long term. To address those practical challenges for large-scale
localization, we present GoMatch, an alternative to visual-based matching that
solely relies on geometric information for matching image keypoints to maps,
represented as sets of bearing vectors. Our novel bearing-vector representation of 3D points significantly relieves the cross-modal challenge in geometric-based matching that prevented prior work from tackling localization in
a realistic environment. With additional careful architecture design, GoMatch
improves over prior geometric-based matching work with a reduction of
(10.67m, 95.7deg) and (1.43m, 34.7deg) in average median pose errors on
Cambridge Landmarks and 7-Scenes, while requiring as little as 1.5/1.7% of
storage capacity in comparison to the best visual-based matching methods. This
confirms its potential and feasibility for real-world localization and opens
the door to future efforts in advancing city-scale visual localization methods that do not require storing visual descriptors.
Comment: ECCV 2022 Camera Ready
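To illustrate the representation at the heart of GoMatch, the sketch below converts map points and query keypoints into bearing vectors; the function names and pinhole-camera assumption are ours, but this is the standard construction that places both modalities in a common geometric space.

```python
import numpy as np

def points_to_bearings(points_3d, R, t):
    """Map side: unit directions of world points in a reference camera frame (R, t)."""
    rays = (R @ points_3d.T + t[:, None]).T  # world -> camera coordinates
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

def keypoints_to_bearings(keypoints_2d, K):
    """Query side: unit rays through pixel keypoints, given intrinsics K."""
    ones = np.ones((len(keypoints_2d), 1))
    rays = (np.linalg.inv(K) @ np.hstack([keypoints_2d, ones]).T).T
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
points = np.random.rand(100, 3) * 5.0 + np.array([0.0, 0.0, 5.0])
map_bearings = points_to_bearings(points, np.eye(3), np.zeros(3))
query_bearings = keypoints_to_bearings(np.random.rand(100, 2) * [640, 480], K)
# GoMatch then matches query_bearings against map_bearings purely geometrically,
# with no visual descriptors stored in the map.
```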
Using Image Sequences for Long-Term Visual Localization
Estimating the pose of a camera in a known scene, i.e., visual localization, is a core task for applications such as self-driving cars. In many scenarios, image sequences are available, and existing work on combining single-image localization with odometry offers to unlock their potential for improving localization performance. Still, most of the literature focuses on single-image localization and ignores the availability of sequence data. The goal of this paper is to demonstrate the potential of image sequences in challenging scenarios, e.g., under day-night or seasonal changes. Combining ideas from the literature, we describe a sequence-based localization pipeline that combines odometry with both a coarse and a fine localization module. Experiments on long-term localization datasets show that combining single-image global localization against a prebuilt map with a visual odometry / SLAM pipeline improves performance to a level where the extended CMU Seasons dataset can be considered solved. We show that SIFT features can perform on par with modern state-of-the-art features in our framework, despite being much weaker and an order of magnitude faster to compute. Our code is publicly available at github.com/rulllars
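The pipeline structure described above reduces to a simple loop: attempt global localization against the map for each frame and fall back on odometry when it fails. The runnable toy sketch below, with stand-in stubs for the coarse/fine localization and VO/SLAM modules, shows only the control flow, not the paper's system.

```python
import numpy as np

def localize_sequence(num_frames, global_localize, odometry):
    """Fuse per-frame global localization with odometry propagation."""
    poses, last = [], None
    for i in range(num_frames):
        absolute = global_localize(i)               # 4x4 pose or None on failure
        if last is not None:
            predicted = last @ odometry(i - 1, i)   # propagate by relative motion
            # Simplest possible fusion: prefer the global estimate when available.
            last = absolute if absolute is not None else predicted
        else:
            last = absolute
        poses.append(last)
    return poses

def make_pose(x):
    """Homogeneous 4x4 pose translated by x metres along the x-axis."""
    T = np.eye(4)
    T[0, 3] = x
    return T

# Toy scene: the camera moves 1 m per frame; global localization fails on odd
# frames (think night-time queries), yet odometry carries the estimate through.
global_loc = lambda i: make_pose(float(i)) if i % 2 == 0 else None
odometry = lambda i, j: make_pose(1.0)
print([p[0, 3] for p in localize_sequence(5, global_loc, odometry)])  # [0..4]
```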
ImPosing: Implicit Pose Encoding for Efficient Visual Localization
We propose a novel learning-based formulation for visual localization of
vehicles that can operate in real-time in city-scale environments. Visual
localization algorithms determine the position and orientation from which an
image has been captured, using a set of geo-referenced images or a 3D scene
representation. Our new localization paradigm, named Implicit Pose Encoding
(ImPosing), embeds images and camera poses into a common latent representation
with two separate neural networks, such that we can compute a similarity score
for each image-pose pair. By evaluating candidates through the latent space in
a hierarchical manner, the camera position and orientation are not directly
regressed but incrementally refined. Very large environments force competitors
to store gigabytes of map data, whereas our method remains very compact regardless of the reference database size. In this paper, we describe how to
effectively optimize our learned modules, how to combine them to achieve
real-time localization, and demonstrate results on diverse large-scale scenarios that significantly outperform prior work in accuracy and computational efficiency.
Comment: Accepted at WACV 202
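A minimal sketch of the image-pose similarity idea, under assumed architectures and dimensions: two separate networks embed an image and candidate poses into a shared latent space, a similarity score ranks the pairs, and candidates are refined hierarchically around the best scorers rather than regressing the pose directly.

```python
import torch
import torch.nn as nn

image_encoder = nn.Sequential(  # stand-in backbone: image -> 64-d latent
    nn.Conv2d(3, 16, 7, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))
pose_encoder = nn.Sequential(   # (x, y, z, qw, qx, qy, qz) -> 64-d latent
    nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 64))

def score(image, candidate_poses):
    """Cosine similarity between one image embedding and many pose embeddings."""
    z_img = nn.functional.normalize(image_encoder(image), dim=1)      # (1, 64)
    z_pose = nn.functional.normalize(pose_encoder(candidate_poses), dim=1)
    return (z_pose @ z_img.T).squeeze(1)                              # (N,)

# Hierarchical refinement: keep the best-scoring candidates and resample new
# candidates around them instead of regressing the pose in one shot.
image = torch.randn(1, 3, 128, 128)
candidates = torch.randn(256, 7)
with torch.no_grad():
    for _ in range(3):
        top = candidates[score(image, candidates).topk(16).indices]
        candidates = top.repeat_interleave(16, dim=0) + 0.1 * torch.randn(256, 7)
```

Note that only the two encoders need storing; the map itself never grows with the reference database, which is the source of the compactness claim.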
SACReg: Scene-Agnostic Coordinate Regression for Visual Localization
Scene coordinate regression (SCR), i.e., predicting 3D coordinates for every
pixel of a given image, has recently shown promising potential. However,
existing methods remain mostly scene-specific or limited to small scenes and
thus hardly scale to realistic datasets. In this paper, we propose a new
paradigm where a single generic SCR model is trained once to be then deployed
to new test scenes, regardless of their scale and without further finetuning.
For a given query image, it collects inputs from off-the-shelf image retrieval
techniques and Structure-from-Motion databases: a list of relevant database
images with sparse pointwise 2D-3D annotations. The model is based on the
transformer architecture and can take a variable number of images and sparse
2D-3D annotations as input. It is trained on a few diverse datasets and
significantly outperforms other scene coordinate regression approaches, including scene-specific models, on several visual localization benchmarks. In
particular, we set a new state of the art on the Cambridge localization
benchmark, even outperforming feature-matching-based approaches.
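To make the interface of such a generic SCR model concrete, here is a hedged sketch: query-image tokens cross-attend, via a transformer decoder, to a variable-length set of database tokens that each fuse a local descriptor with its sparse 3D annotation, and a head regresses one 3D coordinate per query token. All dimensions and the token encoding are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GenericSCR(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.db_proj = nn.Linear(64 + 3, dim)   # descriptor (64-d) + 3D point (3-d)
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 3)           # one 3D coordinate per query token

    def forward(self, query_tokens, db_descriptors, db_points):
        # The database tensors may have any length N: the model is trained once
        # and deployed to new scenes by swapping in new retrieval/SfM inputs.
        memory = self.db_proj(torch.cat([db_descriptors, db_points], dim=-1))
        return self.head(self.decoder(query_tokens, memory))

model = GenericSCR()
query = torch.randn(1, 196, 64)        # e.g. 14x14 patch tokens of the query image
coords = model(query, torch.randn(1, 500, 64), torch.randn(1, 500, 3))
print(coords.shape)                    # (1, 196, 3): a 3D point per patch
```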