Image-based Geolocalization by Ground-to-2.5D Map Matching
We study the image-based geolocalization problem, aiming to localize
ground-view query images on cartographic maps. Current methods often utilize
cross-view localization techniques to match ground-view query images with 2D
maps. However, the performance of these methods is unsatisfactory due to
significant cross-view appearance differences. In this paper, we lift
cross-view matching to a 2.5D space, where heights of structures (e.g., trees
and buildings) provide geometric information to guide the cross-view matching.
We propose a new approach to learning representative embeddings from
multi-modal data. Specifically, we establish a projection relationship between
2.5D space and 2D aerial-view space. The projection is further used to combine
multi-modal features from the 2.5D and 2D maps using an effective
pixel-to-point fusion method. By encoding crucial geometric cues, our method
learns discriminative location embeddings for matching panoramic images and
maps. Additionally, we construct the first large-scale ground-to-2.5D map
geolocalization dataset to validate our method and facilitate future research.
Both single-image-based and route-based localization experiments are conducted
to test our method. Extensive experiments demonstrate that the proposed method
achieves significantly higher localization accuracy and faster convergence than
previous 2D map-based approaches.
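To make the pixel-to-point fusion idea concrete, the following minimal Python sketch projects 2.5D map points onto a 2D aerial feature grid and concatenates the aerial features found there with the per-point features. All array shapes, names, and the simple orthographic projection are illustrative assumptions, not the paper's implementation.

# Illustrative sketch only: one plausible reading of "pixel-to-point fusion"
# between a 2.5D point map and a 2D aerial feature map. Shapes, names, and
# the orthographic projection below are assumptions, not the paper's method.
import numpy as np

def fuse_pixel_to_point(points_xyh, point_feats, aerial_feats, metres_per_pixel, origin_xy):
    """points_xyh:  (N, 3) map points as (x, y, height) in metres.
    point_feats:    (N, Cp) features of the 2.5D points.
    aerial_feats:   (H, W, Ca) features of the 2D aerial map.
    Returns (N, Cp + Ca) fused per-point features."""
    H, W, _ = aerial_feats.shape
    # Orthographic projection of every 2.5D point onto the aerial grid
    # (height is ignored by the projection itself; it lives in point_feats).
    cols = ((points_xyh[:, 0] - origin_xy[0]) / metres_per_pixel).astype(int)
    rows = ((points_xyh[:, 1] - origin_xy[1]) / metres_per_pixel).astype(int)
    cols = np.clip(cols, 0, W - 1)
    rows = np.clip(rows, 0, H - 1)
    # Gather the aerial feature under each projected point and concatenate
    # it with the point's own feature vector.
    gathered = aerial_feats[rows, cols]              # (N, Ca)
    return np.concatenate([point_feats, gathered], axis=1)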
SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation
This work addresses cross-view camera pose estimation, i.e., determining the
3-Degrees-of-Freedom camera pose of a given ground-level image w.r.t. an aerial
image of the local area. We propose SliceMatch, which consists of ground and
aerial feature extractors, feature aggregators, and a pose predictor. The
feature extractors extract dense features from the ground and aerial images.
Given a set of candidate camera poses, the feature aggregators construct a
single ground descriptor and a set of pose-dependent aerial descriptors.
Notably, our novel aerial feature aggregator has a cross-view attention module
for ground-view guided aerial feature selection and utilizes the geometric
projection of the ground camera's viewing frustum on the aerial image to pool
features. The efficient construction of aerial descriptors is achieved using
precomputed masks. SliceMatch is trained using contrastive learning and pose
estimation is formulated as a similarity comparison between the ground
descriptor and the aerial descriptors. Compared to the state-of-the-art,
SliceMatch achieves a 19% lower median localization error on the VIGOR
benchmark using the same VGG16 backbone at 150 frames per second, and a 50%
lower error when using a ResNet50 backbone.
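The similarity-based pose selection described above can be sketched in a few lines of Python: one ground descriptor is compared against K pose-dependent aerial descriptors and the best-scoring candidate pose is returned. The descriptor construction itself (frustum masks, cross-view attention) is not reproduced; the shapes and names below are assumptions.

# Illustrative sketch only: pose estimation as a similarity search between a
# single ground descriptor and K pose-dependent aerial descriptors.
import numpy as np

def select_pose(ground_desc, aerial_descs, candidate_poses):
    """ground_desc:   (D,) descriptor of the ground image.
    aerial_descs:     (K, D) one descriptor per candidate camera pose.
    candidate_poses:  length-K list of (x, y, yaw) candidates.
    Returns the candidate with the highest cosine similarity, plus all scores."""
    g = ground_desc / np.linalg.norm(ground_desc)
    a = aerial_descs / np.linalg.norm(aerial_descs, axis=1, keepdims=True)
    scores = a @ g                      # (K,) cosine similarities
    return candidate_poses[int(np.argmax(scores))], scores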
Semantic Cross-View Matching
Matching cross-view images is challenging because the appearance and
viewpoints are significantly different. While low-level features based on
gradient orientations or filter responses can vary drastically with such
changes in viewpoint, the semantic information of an image remains largely
invariant to them. Consequently, semantically labeled regions can
be used for performing cross-view matching. In this paper, we therefore explore
this idea and propose an automatic method for detecting and representing the
semantic information of an RGB image with the goal of performing cross-view
matching with a (non-RGB) geographic information system (GIS). A segmented
image forms the input to our system with segments assigned to semantic concepts
such as traffic signs, lakes, roads, foliage, etc. We design a descriptor to
robustly capture both the presence of semantic concepts and the spatial layout
of those segments. Pairwise distances between the descriptors extracted from
the GIS map and the query image are then used to generate a shortlist of the
most promising locations with similar semantic concepts in a consistent spatial
layout. An experimental evaluation with challenging query images and a large
urban area shows promising results.
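One simple way to realize a descriptor that captures both the presence of semantic concepts and their spatial layout is a per-cell class histogram over a coarse grid, followed by a nearest-neighbour shortlist; the Python sketch below illustrates this reading. The grid size, distance metric, and label encoding are assumptions, not the paper's actual descriptor.

# Illustrative sketch only: spatially binned semantic histograms as a
# layout-aware descriptor, plus a top-k shortlist by Euclidean distance.
# Assumes semantic labels are integers in [0, num_classes).
import numpy as np

def semantic_layout_descriptor(label_map, num_classes, grid=4):
    """label_map: (H, W) integer semantic labels.
    Returns a (grid * grid * num_classes,) descriptor: one class histogram
    per spatial cell, L1-normalised per cell and concatenated."""
    H, W = label_map.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            patch = label_map[i * H // grid:(i + 1) * H // grid,
                              j * W // grid:(j + 1) * W // grid]
            hist = np.bincount(patch.ravel(), minlength=num_classes).astype(float)
            cells.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(cells)

def shortlist(query_desc, map_descs, k=10):
    """Return indices of the k map locations closest to the query descriptor."""
    dists = np.linalg.norm(map_descs - query_desc, axis=1)
    return np.argsort(dists)[:k]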
Semantic Visual Localization
Robust visual localization under a wide range of viewing conditions is a
fundamental problem in computer vision. Handling the difficult cases of this
problem is not only very challenging but also of high practical relevance,
e.g., in the context of life-long localization for augmented reality or
autonomous robots. In this paper, we propose a novel approach based on a joint
3D geometric and semantic understanding of the world, enabling it to succeed
under conditions where previous approaches failed. Our method leverages a novel
generative model for descriptor learning, trained on semantic scene completion
as an auxiliary task. The resulting 3D descriptors are robust to missing
observations by encoding high-level 3D geometric and semantic information.
Experiments on several challenging large-scale localization datasets
demonstrate reliable localization under extreme viewpoint, illumination, and
geometry changes.
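A toy PyTorch sketch of descriptor learning with scene completion as an auxiliary task is given below: a small 3D encoder produces the descriptor, and a decoder is supervised to predict a completed semantic voxel grid from it. The architecture and layer sizes are illustrative assumptions and do not reproduce the paper's generative model.

# Illustrative sketch only: an encoder-decoder in the spirit of "descriptor
# learning with semantic scene completion as an auxiliary task". The bottleneck
# code serves as the 3D descriptor; the decoder predicts a completed semantic
# voxel grid (fixed 16x16x16 here). Layer sizes are assumptions.
import torch
import torch.nn as nn

class CompletionDescriptor(nn.Module):
    def __init__(self, num_classes, desc_dim=128):
        super().__init__()
        # Encoder: partial semantic voxel grid -> compact descriptor.
        self.encoder = nn.Sequential(
            nn.Conv3d(num_classes, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(64, desc_dim),
        )
        # Decoder: descriptor -> completed semantic voxel grid (auxiliary task).
        self.decoder = nn.Sequential(
            nn.Linear(desc_dim, 64 * 4 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (64, 4, 4, 4)),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, partial_voxels):
        desc = self.encoder(partial_voxels)   # (B, desc_dim) localization descriptor
        completed = self.decoder(desc)        # (B, num_classes, 16, 16, 16) logits
        return desc, completed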