Optimal Feature Transport for Cross-View Image Geo-Localization
This paper addresses the problem of cross-view image geo-localization, where
the geographic location of a ground-level street-view query image is estimated
by matching it against a large-scale aerial map (e.g., a high-resolution
satellite image). State-of-the-art deep-learning-based methods tackle this
problem as deep metric learning, which aims to learn global feature
representations of the scene as seen from the two different views. Although
such deep metric learning methods obtain promising results, they fail to
exploit a crucial cue for localization, namely, the spatial layout of local
features. Moreover, little attention is paid to the obvious domain gap
(between aerial view and ground view) in the context of cross-view
localization. This paper proposes a novel Cross-View Feature Transport (CVFT)
technique to explicitly establish cross-view domain transfer that facilitates
feature alignment between ground and aerial images. Specifically, we implement
the CVFT as network layers, which transport features from one domain to the
other, leading to more meaningful feature-similarity comparisons. Our model is
differentiable and can be learned end-to-end. Experiments on large-scale
datasets have demonstrated that our method has remarkably boosted the
state-of-the-art cross-view localization performance, e.g., on the CVUSA
dataset, with significant improvements for top-1 recall from 40.79% to 61.43%,
and for top-10 recall from 76.36% to 90.49%. We expect that the key insight of
the paper (i.e., explicitly handling the domain difference via domain
transport) will prove useful for other similar problems in computer vision as
well.
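At the heart of such a transport layer is entropy-regularized optimal transport, commonly solved with Sinkhorn iterations. The sketch below is an illustration of that underlying step only, not the authors' implementation; the function name, the Euclidean cost, and the toy data are assumptions made for the example.

```python
import numpy as np

def sinkhorn_transport(cost, reg=0.1, n_iters=50):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost : (m, n) pairwise cost between ground and aerial feature cells.
    Returns a transport plan P of shape (m, n) whose row/column sums
    approximate uniform marginals.
    """
    m, n = cost.shape
    K = np.exp(-cost / reg)                 # Gibbs kernel
    r = np.full(m, 1.0 / m)                 # source (ground) marginal
    c = np.full(n, 1.0 / n)                 # target (aerial) marginal
    v = np.ones(n)
    for _ in range(n_iters):                # alternate marginal scalings
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]      # P = diag(u) K diag(v)

# Toy usage: align 64 ground feature cells with 64 aerial feature cells.
rng = np.random.default_rng(0)
ground = rng.standard_normal((64, 16))
aerial = rng.standard_normal((64, 16))
cost = np.linalg.norm(ground[:, None] - aerial[None, :], axis=-1)
P = sinkhorn_transport(cost)
# Re-arrange ground features toward the aerial layout (illustrative scaling).
ground_in_aerial_layout = (P.shape[0] * P).T @ ground
```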
View Consistent Purification for Accurate Cross-View Localization
This paper proposes a fine-grained self-localization method for outdoor
robotics that utilizes a flexible number of onboard cameras and readily
accessible satellite images. The proposed method addresses limitations in
existing cross-view localization methods that struggle to handle noise sources
such as moving objects and seasonal variations. It is the first sparse
visual-only method that enhances perception in dynamic environments by
detecting view-consistent key points and their corresponding deep features from
ground and satellite views, while removing off-the-ground objects and
establishing a homography transformation between the two views. Moreover, the
proposed method incorporates a spatial embedding approach that leverages camera
intrinsic and extrinsic information to reduce the ambiguity of purely visual
matching, leading to improved feature matching and overall pose estimation
accuracy. The method exhibits strong generalization and is robust to
environmental changes, requiring only geo-poses as ground truth. Extensive
experiments on the KITTI and Ford Multi-AV Seasonal datasets demonstrate that
our proposed method outperforms existing state-of-the-art methods, achieving
median spatial accuracy errors below 0.5 meters along the lateral and
longitudinal directions, and a median orientation accuracy error below 2
degrees.
Comment: Accepted for ICCV 2023
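One geometric step this abstract describes is fitting a homography between matched ground and satellite keypoints, with off-the-ground outliers discarded. A minimal sketch of that step using OpenCV's RANSAC homography estimator on already-matched points; the feature matching itself is omitted, and the helper name, threshold, and synthetic data are assumptions, not the paper's pipeline.

```python
import cv2
import numpy as np

def fit_ground_to_satellite_homography(ground_pts, sat_pts):
    """Robustly fit a planar homography between matched ground and
    satellite keypoints. RANSAC inliers approximate view-consistent
    on-the-ground points; outliers (e.g., off-the-ground structure or
    moving objects) are rejected.

    ground_pts, sat_pts : (N, 2) arrays of matched pixel coordinates.
    """
    H, inlier_mask = cv2.findHomography(
        ground_pts.astype(np.float32),
        sat_pts.astype(np.float32),
        cv2.RANSAC,
        ransacReprojThreshold=3.0,
    )
    return H, inlier_mask.ravel().astype(bool)

# Toy usage with synthetic correspondences on a ground plane.
rng = np.random.default_rng(0)
pts = rng.uniform(0, 500, size=(100, 2))
H_true = np.array([[0.9, 0.1, 20.0], [-0.1, 0.9, 40.0], [0.0, 0.0, 1.0]])
homog = np.c_[pts, np.ones(len(pts))] @ H_true.T
proj = homog[:, :2] / homog[:, 2:3]        # projectively warped points
H_est, inliers = fit_ground_to_satellite_homography(pts, proj)
```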
SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation
This work addresses cross-view camera pose estimation, i.e., determining the
3-Degrees-of-Freedom camera pose of a given ground-level image w.r.t. an aerial
image of the local area. We propose SliceMatch, which consists of ground and
aerial feature extractors, feature aggregators, and a pose predictor. The
feature extractors extract dense features from the ground and aerial images.
Given a set of candidate camera poses, the feature aggregators construct a
single ground descriptor and a set of pose-dependent aerial descriptors.
Notably, our novel aerial feature aggregator has a cross-view attention module
for ground-view guided aerial feature selection and utilizes the geometric
projection of the ground camera's viewing frustum on the aerial image to pool
features. The efficient construction of aerial descriptors is achieved using
precomputed masks. SliceMatch is trained using contrastive learning and pose
estimation is formulated as a similarity comparison between the ground
descriptor and the aerial descriptors. Compared to the state-of-the-art,
SliceMatch achieves a 19% lower median localization error on the VIGOR
benchmark using the same VGG16 backbone at 150 frames per second, and a 50%
lower error when using a ResNet50 backbone.
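As described, SliceMatch's pose prediction reduces to a similarity comparison between one ground descriptor and many pose-dependent aerial descriptors. Below is a minimal sketch of that final scoring step only; descriptor construction is omitted, and the function name, cosine scoring, and shapes are illustrative assumptions.

```python
import numpy as np

def predict_pose(ground_desc, aerial_descs, candidate_poses):
    """Score each candidate pose by cosine similarity between the single
    ground descriptor and its pose-dependent aerial descriptor, then
    return the best-scoring (x, y, yaw) pose.

    ground_desc     : (D,)   descriptor of the ground image.
    aerial_descs    : (P, D) one aerial descriptor per candidate pose.
    candidate_poses : (P, 3) candidate (x, y, yaw) poses.
    """
    g = ground_desc / np.linalg.norm(ground_desc)
    a = aerial_descs / np.linalg.norm(aerial_descs, axis=1, keepdims=True)
    scores = a @ g                          # cosine similarity per pose
    return candidate_poses[np.argmax(scores)], scores

# Toy usage: a random set of 3-DoF candidate poses over an aerial patch.
rng = np.random.default_rng(0)
D, P = 128, 1000
poses = np.c_[rng.uniform(0, 100, (P, 2)), rng.uniform(-np.pi, np.pi, P)]
best_pose, scores = predict_pose(
    rng.standard_normal(D), rng.standard_normal((P, D)), poses
)
```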
Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer
Image retrieval-based cross-view localization methods often lead to very
coarse camera pose estimation, due to the limited sampling density of the
database satellite images. In this paper, we propose a method to increase the
accuracy of a ground camera's location and orientation by estimating the
relative rotation and translation between the ground-level image and its
matched/retrieved satellite image. Our approach designs a geometry-guided
cross-view transformer that combines the benefits of conventional geometry and
learnable cross-view transformers to map the ground-view observations to an
overhead view. Given the synthesized overhead view and observed satellite
feature maps, we construct a neural pose optimizer with strong global
information embedding ability to estimate the relative rotation between them.
After aligning their rotations, we develop an uncertainty-guided spatial
correlation to generate a probability map of the vehicle locations, from which
the relative translation can be determined. Experimental results demonstrate
that our method significantly outperforms the state-of-the-art. Notably, the
likelihood of restricting the vehicle lateral pose to be within 1m of its
Ground Truth (GT) value on the cross-view KITTI dataset has been improved from
35.54% to 76.44%, and the likelihood of restricting the vehicle
orientation to be within 1 degree of its GT value has been improved from
19.64% to 99.10%.
Comment: Accepted to ICCV 2023
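The translation step described here amounts to a dense spatial correlation between the rotation-aligned overhead features and the satellite feature map, modulated by predicted uncertainty and normalized into a probability map over locations. The sketch below illustrates that idea under stated assumptions: the loop-based correlation is written for clarity, and the inverse-uncertainty weighting is one plausible choice, not the paper's exact formulation.

```python
import numpy as np

def location_probability_map(overhead_feat, sat_feat, uncertainty):
    """Uncertainty-weighted spatial correlation between a synthesized
    overhead feature map and a satellite feature map, normalized into a
    probability map over candidate vehicle locations.

    overhead_feat : (C, h, w) rotation-aligned ground-derived features.
    sat_feat      : (C, H, W) satellite features, with H >= h, W >= w.
    uncertainty   : (h, w) positive values; larger = less reliable cell.
    """
    C, h, w = overhead_feat.shape
    _, H, W = sat_feat.shape
    weight = 1.0 / uncertainty              # confidence per overhead cell
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):              # sliding-window correlation
        for j in range(W - w + 1):
            patch = sat_feat[:, i:i + h, j:j + w]
            scores[i, j] = np.sum(weight * (overhead_feat * patch).sum(0))
    p = np.exp(scores - scores.max())       # softmax -> probability map
    p /= p.sum()
    dy, dx = np.unravel_index(np.argmax(p), p.shape)
    return p, (dy, dx)                      # map + most likely translation

# Toy usage on random feature maps with uniform (uninformative) uncertainty.
rng = np.random.default_rng(0)
over = rng.standard_normal((8, 16, 16))
sat = rng.standard_normal((8, 64, 64))
p_map, (dy, dx) = location_probability_map(over, sat, np.ones((16, 16)))
```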