Efficient 2D-3D Matching for Multi-Camera Visual Localization
Visual localization, i.e., determining the position and orientation of a
vehicle with respect to a map, is a key problem in autonomous driving. We
present a multi-camera visual-inertial localization algorithm for large-scale
environments. To efficiently and effectively match features against a pre-built
global 3D map, we propose a prioritized feature matching scheme for
multi-camera systems. In contrast to existing works, designed for monocular
cameras, we (1) tailor the prioritization function to the multi-camera setup
and (2) run feature matching and pose estimation in parallel. This
significantly accelerates the matching and pose estimation stages and allows us
to dynamically adapt the matching efforts based on the surrounding environment.
In addition, we show how pose priors can be integrated into the localization
system to increase efficiency and robustness. Finally, we extend our algorithm
by fusing the absolute pose estimates with motion estimates from a multi-camera
visual-inertial odometry (VIO) pipeline. This results in a system that provides
reliable and drift-less pose estimation. Extensive experiments show that our
localization runs quickly and robustly under varying conditions, and that our
extended algorithm enables reliable real-time pose estimation.
Comment: 7 pages, 5 figures
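As a rough illustration of the prioritized matching the abstract describes, the sketch below processes features from all cameras of the rig through a single priority queue and stops as soon as enough 2D-3D matches are found, which is what lets the matching effort adapt to the environment. The data structures, cost heuristic, and ratio-test threshold are illustrative assumptions, not the authors' scheme.

```python
# Sketch of prioritized 2D-3D matching with early termination for a multi-camera rig.
import heapq
import itertools
import numpy as np

def prioritized_matching(features_per_camera, map_descriptors, max_matches=100):
    """features_per_camera: {cam_id: [(keypoint_xy, descriptor, prior_cost), ...]}.
    Lower prior_cost means the feature is matched earlier."""
    counter = itertools.count()          # tie-breaker so descriptors are never compared
    queue = []
    for cam_id, feats in features_per_camera.items():
        for kp, desc, cost in feats:
            heapq.heappush(queue, (cost, next(counter), cam_id, kp, desc))

    matches = []
    while queue and len(matches) < max_matches:   # early termination once enough matches exist
        _, _, cam_id, kp, desc = heapq.heappop(queue)
        dists = np.linalg.norm(map_descriptors - desc, axis=1)
        best, second = np.partition(dists, 1)[:2]
        if best < 0.8 * second:                   # Lowe-style ratio test against the 3D map
            matches.append((cam_id, kp, int(np.argmin(dists))))
    return matches

# Toy usage with random descriptors standing in for real features and map points.
rng = np.random.default_rng(0)
cams = {c: [((u, v), rng.standard_normal(32), rng.random())
            for u, v in rng.integers(0, 640, (20, 2))]
        for c in ("front", "left", "right", "rear")}
print(len(prioritized_matching(cams, rng.standard_normal((500, 32)))))
```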
View Consistent Purification for Accurate Cross-View Localization
This paper proposes a fine-grained self-localization method for outdoor
robotics that utilizes a flexible number of onboard cameras and readily
accessible satellite images. The proposed method addresses limitations in
existing cross-view localization methods that struggle to handle noise sources
such as moving objects and seasonal variations. It is the first sparse
visual-only method that enhances perception in dynamic environments by
detecting view-consistent key points and their corresponding deep features from
ground and satellite views, while removing off-the-ground objects and
establishing homography transformation between the two views. Moreover, the
proposed method incorporates a spatial embedding approach that leverages camera
intrinsic and extrinsic information to reduce the ambiguity of purely visual
matching, leading to improved feature matching and overall pose estimation
accuracy. The method exhibits strong generalization and is robust to
environmental changes, requiring only geo-poses as ground truth. Extensive
experiments on the KITTI and Ford Multi-AV Seasonal datasets demonstrate that
our proposed method outperforms existing state-of-the-art methods, achieving
median spatial accuracy errors below meters along the lateral and
longitudinal directions, and a median orientation accuracy error below 2
degrees.
Comment: Accepted for ICCV 202
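To make the homography step concrete, here is a minimal sketch of how camera intrinsics and extrinsics define a ground-plane homography between a satellite image and a ground-level camera; the calibration values and the axis-aligned geo-reference are toy assumptions, not the paper's spatial-embedding formulation.

```python
# Ground-plane homography between a satellite image and a ground camera (toy setup).
import numpy as np

def ground_plane_homography(K, R, t, meters_per_pixel, sat_origin_xy):
    """Returns H with x_cam ~ H @ [u_sat, v_sat, 1]^T for points on the z = 0 plane."""
    # A world point on the ground plane projects as K [r1 r2 t] [X, Y, 1]^T.
    P = K @ np.column_stack((R[:, 0], R[:, 1], t))
    # Axis-aligned geo-reference: satellite pixel (u, v) -> metric world (X, Y).
    S = np.array([[meters_per_pixel, 0.0, sat_origin_xy[0]],
                  [0.0, -meters_per_pixel, sat_origin_xy[1]],  # image v grows downward
                  [0.0, 0.0, 1.0]])
    return P @ S

# Toy calibration: a camera viewing the ground plane from 1.6 m away.
K = np.array([[720.0, 0.0, 640.0], [0.0, 720.0, 360.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.6])
H = ground_plane_homography(K, R, t, meters_per_pixel=0.2, sat_origin_xy=(-50.0, 50.0))
p = H @ np.array([250.0, 250.0, 1.0])
print(p[:2] / p[2])   # pixel in the ground image corresponding to this satellite pixel
```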
Semantic Visual Localization
Robust visual localization under a wide range of viewing conditions is a
fundamental problem in computer vision. Handling the difficult cases of this
problem is not only very challenging but also of high practical relevance,
e.g., in the context of life-long localization for augmented reality or
autonomous robots. In this paper, we propose a novel approach based on a joint
3D geometric and semantic understanding of the world, enabling it to succeed
under conditions where previous approaches failed. Our method leverages a novel
generative model for descriptor learning, trained on semantic scene completion
as an auxiliary task. The resulting 3D descriptors are robust to missing
observations by encoding high-level 3D geometric and semantic information.
Experiments on several challenging large-scale localization datasets
demonstrate reliable localization under extreme viewpoint, illumination, and
geometry changes.
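A toy sketch of the general idea follows: a 3D encoder-decoder whose bottleneck acts as the localization descriptor, with the decoder supervised by semantic scene completion as an auxiliary task. The architecture, voxel resolution, and class count are assumptions, not the paper's generative model.

```python
# Descriptor learning with semantic scene completion as an auxiliary task (toy model).
import torch
import torch.nn as nn

class SemanticDescriptorNet(nn.Module):
    def __init__(self, num_classes=8, descriptor_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                 # incomplete semantic voxel grid in
            nn.Conv3d(num_classes, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 8 * 8 * 8, descriptor_dim))
        self.decoder = nn.Sequential(                 # completed semantic voxel grid out
            nn.Linear(descriptor_dim, 32 * 8 * 8 * 8), nn.Unflatten(1, (32, 8, 8, 8)),
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, num_classes, 4, stride=2, padding=1))

    def forward(self, x):
        descriptor = self.encoder(x)                  # used for matching at test time
        completion_logits = self.decoder(descriptor)  # auxiliary completion head
        return descriptor, completion_logits

# Training-step sketch: learn descriptors while reconstructing the full semantic volume.
net = SemanticDescriptorNet()
partial = torch.randn(2, 8, 32, 32, 32)              # observed (incomplete) volume, toy input
full_labels = torch.randint(0, 8, (2, 32, 32, 32))   # ground-truth completed semantics
desc, logits = net(partial)
loss = nn.functional.cross_entropy(logits, full_labels)
```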
OREOS: Oriented Recognition of 3D Point Clouds in Outdoor Scenarios
We introduce a novel method for oriented place recognition with 3D LiDAR
scans. A Convolutional Neural Network is trained to extract compact descriptors
from single 3D LiDAR scans. These can be used both to retrieve nearby place
candidates from a map, and to estimate the yaw discrepancy needed for
bootstrapping local registration methods. We employ a triplet loss function for
training and use a hard-negative mining strategy to further increase the
performance of our descriptor extractor. In an evaluation on the NCLT and KITTI
datasets, we demonstrate that our method outperforms related state-of-the-art
approaches based on both data-driven and handcrafted data representations in
challenging long-term outdoor conditions.
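The training objective can be sketched as follows; the in-batch mining strategy, margin, and descriptor size are illustrative assumptions rather than the OREOS implementation.

```python
# Triplet loss with hard-negative mining for place-recognition descriptors (sketch).
import torch
import torch.nn.functional as F

def triplet_loss_hard_negatives(anchor, positive, candidates, margin=0.5):
    """anchor, positive: (B, D) descriptors of the same places; candidates: (B, D)
    descriptors of other places. The hardest (closest) candidate is taken as the negative."""
    d_pos = F.pairwise_distance(anchor, positive)    # (B,)
    d_all = torch.cdist(anchor, candidates)          # (B, B)
    d_neg = d_all.min(dim=1).values                  # hardest negative per anchor
    return F.relu(d_pos - d_neg + margin).mean()

# Random descriptors standing in for the CNN outputs on LiDAR scans.
a, p, n = torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64)
print(triplet_loss_hard_negatives(a, p, n).item())
```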
Camera Pose Estimation from Street-view Snapshots and Point Clouds
This PhD thesis targets two research problems: (1) how to efficiently and robustly estimate the camera pose of a query image against a map that contains street-view snapshots and point clouds; (2) given the estimated camera pose of a query image, how to create meaningful and intuitive applications with the map data.
To address the first research problem, we systematically investigated indirect, direct, and hybrid camera pose estimation strategies. We implemented state-of-the-art methods and performed comprehensive experiments on two public benchmark datasets covering outdoor environmental changes from ideal to extremely challenging cases. Our key findings are: (1) the indirect method is usually more accurate than the direct method when there are enough consistent feature correspondences; (2) the direct method is sensitive to initialization, but under extreme outdoor environmental changes, the mutual-information-based direct method is more robust than the feature-based methods; (3) the hybrid method combines the strengths of both the direct and indirect methods and outperforms them on challenging datasets.
To explore the second research problem, we considered inspiring and useful applications that exploit the camera pose together with the map data. Firstly, we developed a 3D-map-augmented photo gallery application, where images' geo-metadata are extracted with an indirect camera pose estimation method and the photo-sharing experience is improved by augmenting it with the 3D map. Secondly, we designed an interactive video playback application, where an indirect method estimates the camera pose of video frames and playback is augmented with a 3D map. Thirdly, we proposed a 3D-visual-primitive-based indoor object and outdoor scene recognition method, where the 3D primitives are accumulated from multi-view images.
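For reference, the "indirect" strategy discussed above is typically realized as descriptor matching followed by PnP with RANSAC; the snippet below is a generic sketch of that step with toy correspondences, not the thesis code.

```python
# Indirect pose estimation: 2D-3D matches -> PnP + RANSAC (toy data).
import numpy as np
import cv2

object_points = np.random.rand(50, 3).astype(np.float32) * 10     # matched 3D map points
image_points = np.random.rand(50, 2).astype(np.float32) * 500     # their 2D detections
K = np.array([[700, 0, 320], [0, 700, 240], [0, 0, 1]], dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, distCoeffs=None,
    reprojectionError=4.0, iterationsCount=1000)
if ok:
    R, _ = cv2.Rodrigues(rvec)                      # world-to-camera rotation
    camera_center = (-R.T @ tvec).ravel()           # camera position in the map frame
    print("inliers:", 0 if inliers is None else len(inliers), "center:", camera_center)
```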
Beyond Controlled Environments: 3D Camera Re-Localization in Changing Indoor Scenes
Long-term camera re-localization is an important task with numerous computer
vision and robotics applications. Whilst various outdoor benchmarks exist that
target lighting, weather and seasonal changes, far less attention has been paid
to appearance changes that occur indoors. This has led to a mismatch between
popular indoor benchmarks, which focus on static scenes, and indoor
environments that are of interest for many real-world applications. In this
paper, we adapt 3RScan - a recently introduced indoor RGB-D dataset designed
for object instance re-localization - to create RIO10, a new long-term camera
re-localization benchmark focused on indoor scenes. We propose new metrics for
evaluating camera re-localization and explore how state-of-the-art camera
re-localizers perform according to these metrics. We also examine in detail how
different types of scene change affect the performance of different methods,
based on novel ways of detecting such changes in a given RGB-D frame. Our
results clearly show that long-term indoor re-localization is an unsolved
problem. Our benchmark and tools are publicly available at
waldjohannau.github.io/RIO10
Comment: ECCV 2020, project website https://waldjohannau.github.io/RIO10
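For context, re-localization estimates are commonly scored by translation and rotation error against ground-truth poses; the sketch below computes these generic errors and is not the specific metric proposed with RIO10.

```python
# Translation error (meters) and rotation error (degrees) between two camera poses.
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    t_err = np.linalg.norm(t_est - t_gt)                       # meters
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0               # geodesic distance on rotations
    r_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))     # degrees
    return t_err, r_err

print(pose_errors(np.eye(3), np.zeros(3), np.eye(3), np.array([0.1, 0.0, 0.0])))
```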
Towards Robust Visual Localization in Challenging Conditions
Visual localization is a fundamental problem in computer vision, with a multitude of applications in robotics, augmented reality, and structure-from-motion. The basic problem is to determine, from one or more images, the position and orientation of the camera that captured them relative to some model of the environment. Current visual localization approaches typically work well when the images to be localized are captured under conditions similar to those encountered during mapping. However, when the environment exhibits large changes in visual appearance, due to, e.g., variations in weather, season, time of day, or viewpoint, the traditional pipelines break down. The reason is that the local image features used are based on low-level pixel-intensity information, which is not invariant to these transformations: when the environment changes, a different set of keypoints will be detected and their descriptors will differ, making long-term visual localization a challenging problem. This thesis includes five papers that present work towards solving the long-term visual localization problem. Two of the articles present ideas for how semantic information may be included to aid the localization process: one approach relies only on the semantic information for visual localization, and the other shows how semantics can be used to detect outlier feature correspondences. The third paper considers how the output of a monocular depth-estimation network can be used to extract features that are less sensitive to viewpoint changes. The fourth article is a benchmark paper, in which we present three new benchmark datasets aimed at evaluating localization algorithms in the context of long-term visual localization. Lastly, the fifth article considers how to perform convolutions on spherical imagery, which in the future might be applied to learning local image features for the localization problem.
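As an illustration of the second semantic idea mentioned above, a correspondence can be rejected as an outlier when its two endpoints carry different semantic labels; the sketch below assumes per-pixel segmentations and a simple match format, and is not the thesis implementation.

```python
# Semantic outlier rejection of feature correspondences (toy data).
import numpy as np

def semantic_inlier_mask(matches, labels_query, labels_map):
    """matches: (N, 4) integer array of [u_q, v_q, u_m, v_m] pixel pairs.
    A match is kept only if both pixels carry the same semantic class."""
    q = labels_query[matches[:, 1], matches[:, 0]]
    m = labels_map[matches[:, 3], matches[:, 2]]
    return q == m

labels_q = np.random.randint(0, 5, (480, 640))     # toy semantic segmentations
labels_m = np.random.randint(0, 5, (480, 640))
matches = np.random.randint(0, 480, (100, 4))      # toy integer pixel matches
print(semantic_inlier_mask(matches, labels_q, labels_m).sum(), "matches survive")
```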
Visual SLAM in Changing Environments
This thesis investigates the problem of Visual Simultaneous Localization and Mapping (vSLAM) in
changing environments. The vSLAM problem is to sequentially estimate the pose of a device with
mounted cameras in a map generated based on images taken with those cameras. vSLAM algorithms
face two main challenges in changing environments: moving objects and temporal appearance
changes. Moving objects cause problems in pose estimation if they are mistaken for static objects.
Moving objects also cause problems for loop closure detection (LCD), which is the problem of
detecting whether a previously visited place has been revisited. The same moving object observed
in two different places may cause false loop closures to be detected. Temporal appearance changes
such as those brought about by time of day or weather changes cause long-term data association
errors for LCD. These cause difficulties in recognizing previously visited places after they have
undergone appearance changes. Focus is placed on LCD, which turns out to be the part of vSLAM
that changing environments affect the most. In addition, several techniques and algorithms for
Visual Place Recognition (VPR) in challenging conditions that could be used in the context of
LCD are surveyed, and the performance of two state-of-the-art VPR algorithms in changing
environments is assessed in an experiment in order to measure their applicability for LCD. The
most severe performance-degrading appearance changes are found to be those caused by changes in
season and illumination. Several algorithms and techniques that perform well in loop-closure-related
tasks in specific environmental conditions are identified as a result of the survey. Finally, a limited
experiment on the Nordland dataset implies that the tested VPR algorithms are usable as is or can
be modified for use in long-term LCD. As a part of the experiment, a new simple neighborhood
consistency check was also developed, evaluated, and found to be effective at reducing false positives
output by the tested VPR algorithms.
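A neighborhood consistency check of the kind described can be sketched as follows; the windowing scheme, thresholds, and match representation are assumptions, not the thesis implementation. A loop-closure candidate is kept only if most temporally neighbouring queries match reference frames close to the candidate.

```python
# Neighborhood consistency check for loop-closure candidates (toy data).
import numpy as np

def consistent_matches(best_ref_for_query, window=2, tolerance=3, min_support=3):
    """best_ref_for_query[i] = index of the best-matching reference frame for query i."""
    accepted = []
    n = len(best_ref_for_query)
    for i in range(window, n - window):
        j = best_ref_for_query[i]
        neighbours = np.delete(best_ref_for_query[i - window:i + window + 1], window)
        # Neighbouring queries should map to references near j if motion is roughly continuous.
        support = np.sum(np.abs(neighbours - j) <= tolerance + window)
        if support >= min_support:
            accepted.append((i, int(j)))
    return accepted

matches = np.array([10, 11, 12, 13, 14, 40, 16, 17, 18])   # query 5 is an isolated false match
print(consistent_matches(matches))                          # query 5 is rejected, others kept
```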
AFT-VO: Asynchronous Fusion Transformers for Multi-View Visual Odometry Estimation
Motion estimation approaches typically employ sensor fusion techniques, such
as the Kalman Filter, to handle individual sensor failures. More recently, deep
learning-based fusion approaches have been proposed, increasing performance
while requiring less model-specific implementation effort. However, current deep fusion
approaches often assume that sensors are synchronised, which is not always
practical, especially for low-cost hardware. To address this limitation, in
this work, we propose AFT-VO, a novel transformer-based sensor fusion
architecture to estimate VO from multiple sensors. Our framework combines
predictions from asynchronous multi-view cameras and accounts for the time
discrepancies of measurements coming from different sources.
Our approach first employs a Mixture Density Network (MDN) to estimate the
probability distributions of the 6-DoF poses for every camera in the system.
Then a novel transformer-based fusion module, AFT-VO, is introduced, which
combines these asynchronous pose estimations, along with their confidences.
More specifically, we introduce Discretiser and Source Encoding techniques
which enable the fusion of multi-source asynchronous signals.
We evaluate our approach on the popular nuScenes and KITTI datasets. Our
experiments demonstrate that multi-view fusion for VO estimation provides
robust and accurate trajectories, outperforming the state of the art under both
challenging weather and lighting conditions.
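As a rough illustration of the mixture-density step described above, the sketch below predicts a diagonal Gaussian mixture over 6-DoF pose parameters from a per-camera feature vector; the layer sizes and parameterization are assumptions, not AFT-VO's architecture.

```python
# Mixture density network head over 6-DoF poses (toy sketch).
import torch
import torch.nn as nn

class PoseMDNHead(nn.Module):
    def __init__(self, feat_dim=256, num_components=5, pose_dim=6):
        super().__init__()
        self.k, self.d = num_components, pose_dim
        self.pi = nn.Linear(feat_dim, num_components)                    # mixture weights
        self.mu = nn.Linear(feat_dim, num_components * pose_dim)         # component means
        self.log_sigma = nn.Linear(feat_dim, num_components * pose_dim)  # diagonal std devs

    def forward(self, f):
        pi = torch.softmax(self.pi(f), dim=-1)
        mu = self.mu(f).view(-1, self.k, self.d)
        sigma = torch.exp(self.log_sigma(f)).view(-1, self.k, self.d)
        return pi, mu, sigma       # downstream fusion can weight poses by their confidence

head = PoseMDNHead()
pi, mu, sigma = head(torch.randn(4, 256))    # one feature vector per camera/timestep
print(pi.shape, mu.shape, sigma.shape)       # (4, 5), (4, 5, 6), (4, 5, 6)
```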
- …