212 research outputs found

    Deep Probabilistic Models for Camera Geo-Calibration

    Get PDF
    The ultimate goal of image understanding is to transfer visual images into numerical or symbolic descriptions of the scene that are helpful for decision making. Knowing when, where, and in which direction a picture was taken, the task of geo-calibration makes it possible to use imagery to understand the world and how it changes in time. Current models for geo-calibration are mostly deterministic, which in many cases fails to model the inherent uncertainties when the image content is ambiguous. Furthermore, without a proper modeling of the uncertainty, subsequent processing can yield overly confident predictions. To address these limitations, we propose a probabilistic model for camera geo-calibration using deep neural networks. While our primary contribution is geo-calibration, we also show that learning to geo-calibrate a camera allows us to implicitly learn to understand the content of the scene

    Robust and Accurate Camera Localisation at a Large Scale

    Get PDF
    The task of camera-based localization aims to quickly and precisely pinpoint at which location (and viewing direction) the image was taken, against a pre-stored large-scale map of the environment. This technique can be used in many 3D computer vision applications, e.g., AR/VR and autonomous driving. Mapping the world is the first step to enable camera-based localization since a pre-stored map serves as a reference for a query image/sequence. In this thesis, we exploit three readily available sources: (i) satellite images; (ii) ground-view images; (iii) 3D points cloud. Based on the above three sources, we propose solutions to localize a query camera both effectively and efficiently, i.e., accurately localizing a query camera under a variety of lighting and viewing conditions within a small amount of time. The main contributions are summarized as follows. In chapter 3, we separately present a minimal 4-point and 2-point solver to estimate a relative and absolute camera pose. The core idea is exploiting the vertical direction from IMU or vanishing point to derive a closed-form solution of a quartic equation and a quadratic equation for the relative and absolute camera pose, respectively. In chapter 4, we localize a ground-view query image against a satellite map. Inspired by the insight that humans commonly use orientation information as an important cue for spatial localization, we propose a method that endows deep neural networks with the 'commonsense' of orientation. We design a Siamese network that explicitly encodes each pixel's orientation of the ground-view and satellite images. Our method boosts the learned deep features' discriminative power, outperforming all previous methods. In chapter 5, we localize a ground-view query image against a ground-view image database. We propose a representation learning method having higher location-discriminating power. The core idea is learning discriminative image embedding. Similarities among intra-place images (viewing the same landmarks) are maximized while similarities among inter-place images (viewing different landmarks) are minimized. The method is easy to implement and pluggable into any CNN. Experiments show that our method outperforms all previous methods. In chapter 6, we localize a ground-view query image against a large-scale 3D points cloud with visual descriptors. To address the ambiguities in direct 2D--3D feature matching, we introduce a global matching method that harnesses global contextual information exhibited both within the query image and among all the 3D points in the map. The core idea is to find the optimal 2D set to 3D set matching. Tests on standard benchmark datasets show the effectiveness of our method. In chapter 7, we localize a ground-view query image against a 3D points cloud with only coordinates. The problem is also known as blind Perspective-n-Point. We propose a deep CNN model that simultaneously solves for both the 6-DoF absolute camera pose and 2D--3D correspondences. The core idea is extracting point-wise 2D and 3D features from their coordinates and matching 2D and 3D features effectively in a global feature matching module. Extensive tests on both real and simulated data have shown that our method substantially outperforms existing approaches. Last, in chapter 8, we study the potential of using 3D lines. Specifically, we study the problem of aligning two partially overlapping 3D line reconstructions in Euclidean space. This technique can be used for localization with respect to a 3D line database when query 3D line reconstructions are available (e.g., from stereo triangulation). We propose a neural network, taking Pluecker representations of lines as input, and solving for line-to-line matches and estimate a 6-DoF rigid transformation. Experiments on indoor and outdoor datasets show that our method's registration (rotation and translation) precision outperforms baselines significantly

    UNav: An Infrastructure-Independent Vision-Based Navigation System for People with Blindness and Low vision

    Full text link
    Vision-based localization approaches now underpin newly emerging navigation pipelines for myriad use cases from robotics to assistive technologies. Compared to sensor-based solutions, vision-based localization does not require pre-installed sensor infrastructure, which is costly, time-consuming, and/or often infeasible at scale. Herein, we propose a novel vision-based localization pipeline for a specific use case: navigation support for end-users with blindness and low vision. Given a query image taken by an end-user on a mobile application, the pipeline leverages a visual place recognition (VPR) algorithm to find similar images in a reference image database of the target space. The geolocations of these similar images are utilized in downstream tasks that employ a weighted-average method to estimate the end-user's location and a perspective-n-point (PnP) algorithm to estimate the end-user's direction. Additionally, this system implements Dijkstra's algorithm to calculate a shortest path based on a navigable map that includes trip origin and destination. The topometric map used for localization and navigation is built using a customized graphical user interface that projects a 3D reconstructed sparse map, built from a sequence of images, to the corresponding a priori 2D floor plan. Sequential images used for map construction can be collected in a pre-mapping step or scavenged through public databases/citizen science. The end-to-end system can be installed on any internet-accessible device with a camera that hosts a custom mobile application. For evaluation purposes, mapping and localization were tested in a complex hospital environment. The evaluation results demonstrate that our system can achieve localization with an average error of less than 1 meter without knowledge of the camera's intrinsic parameters, such as focal length

    Camera Pose Estimation from Street-view Snapshots and Point Clouds

    Get PDF
    This PhD thesis targets on two research problems: (1) How to efficiently and robustly estimate the camera pose of a query image with a map that contains street-view snapshots and point clouds; (2) Given the estimated camera pose of a query image, how to create meaningful and intuitive applications with the map data. To conquer the first research problem, we systematically investigated indirect, direct and hybrid camera pose estimation strategies. We implemented state-of-the-art methods and performed comprehensive experiments in two public benchmark datasets considering outdoor environmental changes from ideal to extremely challenging cases. Our key findings are: (1) the indirect method is usually more accurate than the direct method when there are enough consistent feature correspondences; (2) The direct method is sensitive to initialization, but under extreme outdoor environmental changes, the mutual-information-based direct method is more robust than the feature-based methods; (3) The hybrid method combines the strength from both direct and indirect method and outperforms them in challenging datasets. To explore the second research problem, we considered inspiring and useful applications by exploiting the camera pose together with the map data. Firstly, we invented a 3D-map augmented photo gallery application, where images’ geo-meta data are extracted with an indirect camera pose estimation method and photo sharing experience is improved with the augmentation of 3D map. Secondly, we designed an interactive video playback application, where an indirect method estimates video frames’ camera pose and the video playback is augmented with a 3D map. Thirdly, we proposed a 3D visual primitive based indoor object and outdoor scene recognition method, where the 3D primitives are accumulated from the multiview images

    ECO: Egocentric Cognitive Mapping

    Get PDF
    We present a new method to localize a camera within a previously unseen environment perceived from an egocentric point of view. Although this is, in general, an ill-posed problem, humans can effortlessly and efficiently determine their relative location and orientation and navigate into a previously unseen environments, e.g., finding a specific item in a new grocery store. To enable such a capability, we design a new egocentric representation, which we call ECO (Egocentric COgnitive map). ECO is biologically inspired, by the cognitive map that allows human navigation, and it encodes the surrounding visual semantics with respect to both distance and orientation. ECO possesses three main properties: (1) reconfigurability: complex semantics and geometry is captured via the synthesis of atomic visual representations (e.g., image patch); (2) robustness: the visual semantics are registered in a geometrically consistent way (e.g., aligning with respect to the gravity vector, frontalizing, and rescaling to canonical depth), thus enabling us to learn meaningful atomic representations; (3) adaptability: a domain adaptation framework is designed to generalize the learned representation without manual calibration. As a proof-of-concept, we use ECO to localize a camera within real-world scenes---various grocery stores---and demonstrate performance improvements when compared to existing semantic localization approaches
    corecore