Precise estimation of global orientation and location is critical to ensure a
compelling outdoor Augmented Reality (AR) experience. We address the problem of
geo-pose estimation by cross-view matching of query ground images to a
geo-referenced aerial satellite image database. Recently, neural network-based
methods have shown state-of-the-art performance in cross-view matching.
However, most prior work focuses only on location estimation and ignores
orientation, and so cannot meet the requirements of outdoor AR applications. We
propose a new transformer neural network-based model and a modified triplet
ranking loss for joint location and orientation estimation. Experiments on
several benchmark cross-view geo-localization datasets show that our model
achieves state-of-the-art performance. Furthermore, we present a method that
extends the single-image, query-based geo-localization approach by using
temporal information from a navigation pipeline for robust continuous
geo-localization. Experiments on several large-scale real-world video
sequences demonstrates that our approach enables high-precision and stable AR
insertion.

Comment: IEEE VR 202