
    Real-Time RGB-D Camera Pose Estimation in Novel Scenes using a Relocalisation Cascade

    Camera pose estimation is an important problem in computer vision. Common techniques either match the current image against keyframes with known poses, directly regress the pose, or establish correspondences between keypoints in the image and points in the scene to estimate the pose. In recent years, regression forests have become a popular alternative to establish such correspondences. They achieve accurate results, but have traditionally needed to be trained offline on the target scene, preventing relocalisation in new environments. Recently, we showed how to circumvent this limitation by adapting a pre-trained forest to a new scene on the fly. The adapted forests achieved relocalisation performance that was on par with that of offline forests, and our approach was able to estimate the camera pose in close to real time. In this paper, we present an extension of this work that achieves significantly better relocalisation performance whilst running fully in real time. To achieve this, we make several changes to the original approach: (i) instead of accepting the camera pose hypothesis without question, we make it possible to score the final few hypotheses using a geometric approach and select the most promising; (ii) we chain several instantiations of our relocaliser together in a cascade, allowing us to try faster but less accurate relocalisation first, only falling back to slower, more accurate relocalisation as necessary; and (iii) we tune the parameters of our cascade to achieve effective overall performance. These changes allow us to significantly improve upon the performance our original state-of-the-art method was able to achieve on the well-known 7-Scenes and Stanford 4 Scenes benchmarks. As additional contributions, we present a way of visualising the internal behaviour of our forests and show how to entirely circumvent the need to pre-train a forest on a generic scene. Comment: Tommaso Cavallari, Stuart Golodetz, Nicholas Lord and Julien Valentin assert joint first authorship.
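To make the cascade idea in (i) and (ii) concrete, the following is a minimal, hypothetical sketch of the control flow: each stage proposes pose hypotheses, a geometric score ranks them, and the cascade falls back to a slower stage only if the best score so far does not clear that stage's threshold. The function names, the `stages` structure and the scoring callback are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Pose = object  # stand-in for whatever 6-DoF pose type the relocalisers return

def relocalise_with_cascade(
    frame,
    stages: Sequence[Tuple[Callable[[object], List[Pose]], float]],
    score_fn: Callable[[Pose, object], float],
) -> Pose:
    """Run relocalisers ordered fast -> slow; each stage is (relocalise_fn,
    min_score). Accept the best hypothesis as soon as its geometric score
    clears the stage's threshold, otherwise fall back to the next stage."""
    best_pose, best_score = None, float("-inf")
    for relocalise, min_score in stages:
        for pose in relocalise(frame):       # candidate camera poses
            s = score_fn(pose, frame)        # e.g. model-vs-frame agreement
            if s > best_score:
                best_pose, best_score = pose, s
        if best_score >= min_score:          # good enough: stop early
            return best_pose
    return best_pose                         # best guess from the slowest stage
```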

    Enhancing RGB-D SLAM Using Deep Learning


    Beyond Controlled Environments: 3D Camera Re-Localization in Changing Indoor Scenes

    Long-term camera re-localization is an important task with numerous computer vision and robotics applications. Whilst various outdoor benchmarks exist that target lighting, weather and seasonal changes, far less attention has been paid to appearance changes that occur indoors. This has led to a mismatch between popular indoor benchmarks, which focus on static scenes, and indoor environments that are of interest for many real-world applications. In this paper, we adapt 3RScan - a recently introduced indoor RGB-D dataset designed for object instance re-localization - to create RIO10, a new long-term camera re-localization benchmark focused on indoor scenes. We propose new metrics for evaluating camera re-localization and explore how state-of-the-art camera re-localizers perform according to these metrics. We also examine in detail how different types of scene change affect the performance of different methods, based on novel ways of detecting such changes in a given RGB-D frame. Our results clearly show that long-term indoor re-localization is an unsolved problem. Our benchmark and tools are publicly available at waldjohannau.github.io/RIO10. Comment: ECCV 2020, project website https://waldjohannau.github.io/RIO10
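The paper proposes its own, change-aware evaluation metrics, which are not reproduced here. As a point of reference, the sketch below shows the basic pose-error computation (translation and rotation error checked against fixed thresholds, e.g. the common 5 cm / 5° criterion) that indoor re-localization benchmarks typically build on; function names and thresholds are illustrative assumptions.

```python
import numpy as np

def pose_errors(T_est: np.ndarray, T_gt: np.ndarray):
    """Translation error (metres) and rotation error (degrees) between two
    4x4 camera-to-world pose matrices."""
    t_err = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    R_rel = T_est[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.degrees(np.arccos(cos_angle))
    return t_err, r_err

def within_threshold(T_est, T_gt, max_t=0.05, max_r=5.0) -> bool:
    # Common "5 cm / 5 degree" acceptance criterion used on indoor benchmarks;
    # RIO10 itself defines additional, change-aware metrics.
    t_err, r_err = pose_errors(T_est, T_gt)
    return t_err <= max_t and r_err <= max_r
```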

    Graph Attention Network for Camera Relocalization on Dynamic Scenes

    We devise a graph attention network-based approach for learning a scene triangle mesh representation in order to estimate the camera position of an image in a dynamic environment. Previous approaches built a scene-dependent model that explicitly or implicitly embeds the structure of the scene. They use convolutional neural networks or decision trees to establish 2D/3D-3D correspondences. Such a mapping overfits the target scene and does not generalize well to dynamic changes in the environment. Our work introduces a novel approach to solve the camera relocalization problem by using the available triangle mesh. Our 3D-3D matching framework consists of three blocks: (1) a graph neural network to compute the embedding of mesh vertices, (2) a convolutional neural network to compute the embedding of grid cells defined on the RGB-D image, and (3) a neural network model to establish the correspondence between the two embeddings. These three components are trained end-to-end. To predict the final pose, we run the RANSAC algorithm to generate camera pose hypotheses, and we refine the prediction using the point-cloud representation. Our approach significantly improves the camera pose accuracy of the state-of-the-art method from 0.358 to 0.506 on the RIO10 benchmark for dynamic indoor camera relocalization.
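As background for the final pose step, here is a minimal sketch of how a rigid 6-DoF pose can be recovered from putative 3D-3D matches with RANSAC and a Kabsch (SVD) fit, the standard hypothesise-and-verify machinery behind this kind of correspondence-based relocalization. It is not the paper's implementation; the function names, inlier threshold and iteration count are illustrative assumptions.

```python
import numpy as np

def kabsch(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Best-fit rigid transform (4x4) mapping src points onto dst (both Nx3)."""
    c_src, c_dst = src.mean(0), dst.mean(0)
    H = (src - c_src).T @ (dst - c_dst)                 # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ S @ U.T
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, c_dst - R @ c_src
    return T

def ransac_pose(src, dst, iters=1000, inlier_thresh=0.05, rng=None):
    """RANSAC over putative 3D-3D matches (e.g. camera-space vs mesh-space points)."""
    rng = rng or np.random.default_rng(0)
    best_T, best_inliers = np.eye(4), 0
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)   # minimal sample
        T = kabsch(src[idx], dst[idx])
        pred = (T[:3, :3] @ src.T).T + T[:3, 3]
        inliers = int(np.sum(np.linalg.norm(pred - dst, axis=1) < inlier_thresh))
        if inliers > best_inliers:
            best_T, best_inliers = T, inliers
    return best_T
```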

    Using Image Sequences for Long-Term Visual Localization

    Estimating the pose of a camera in a known scene, i.e., visual localization, is a core task for applications such as self-driving cars. In many scenarios, image sequences are available and existing work on combining single-image localization with odometry offers to unlock their potential for improving localization performance. Still, the largest part of the literature focuses on single-image localization and ignores the availability of sequence data. The goal of this paper is to demonstrate the potential of image sequences in challenging scenarios, e.g., under day-night or seasonal changes. Combining ideas from the literature, we describe a sequence-based localization pipeline that combines odometry with both a coarse and a fine localization module. Experiments on long-term localization datasets show that combining single-image global localization against a prebuilt map with a visual odometry / SLAM pipeline improves performance to a level where the extended CMU Seasons dataset can be considered solved. We show that SIFT features can perform on par with modern state-of-the-art features in our framework, despite being much weaker and an order of magnitude faster to compute. Our code is publicly available at github.com/rulllars
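The core mechanism of such a pipeline, propagating a global (map-frame) pose fix through a sequence using relative odometry, can be sketched as below. The pose conventions and the function name are assumptions for illustration; a full pipeline would also re-anchor the chain whenever a new confident global localization arrives.

```python
import numpy as np

def chain_poses(world_from_cam0: np.ndarray, odometry: list) -> list:
    """Propagate one global pose fix through an image sequence.
    world_from_cam0: 4x4 camera-to-world pose of frame 0 from the global
    localizer; odometry[i]: 4x4 pose of camera i+1 expressed in camera i's
    frame, as estimated by VO/SLAM. Returns a camera-to-world pose per frame."""
    poses = [world_from_cam0]
    for T_prev_from_curr in odometry:
        # world_from_cam(i+1) = world_from_cam(i) @ cam(i)_from_cam(i+1)
        poses.append(poses[-1] @ T_prev_from_curr)
    return poses
```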

    How Geometry Meets Learning in Pose Estimation

    This thesis focuses on one of the fundamental problems in computer vision, six-degree-of-freedom (6dof) pose estimation, whose task is to predict the geometric transformation from the camera to a target of interest using only RGB inputs. Solutions to this problem have been proposed using image retrieval or sparse 2D-3D correspondence matching with geometric verification. Thanks to the development of deep learning, direct regression-based approaches (computing the pose via image-to-pose regression) and indirect reconstruction-based approaches (solving for the pose via dense matching between the image and a 3D reconstruction) using neural networks have recently drawn growing attention in the community. Although deep models have been proposed for both camera relocalisation and object pose estimation, there are still open questions. In this thesis, we investigate several problems in pose estimation concerning end-to-end object pose inference, the uncertainty of pose estimates in regression-based methods, and self-supervision for reconstruction-based learning, both for scenes and for objects.

    We focus on end-to-end 6dof pose regression for objects in the first part of this thesis. Traditional methods that predict the 6dof pose of objects usually rely on a 3D CAD model and require a multi-step scheme to compute the pose. We instead apply direct pose regression to objects, building on the region-proposal network Mask R-CNN, which is well known for object detection and instance segmentation. Our newly proposed network head regresses a 4D vector from the RoI feature map of each object: a 3D vector from the Lie algebra represents the rotation, and a further scalar for the z-axis of the translation is predicted to recover the full 3D translation together with the position of the bounding box. This simplification avoids the spatial ambiguity for objects in the 2D image caused by RoIPooling. Our method is accurate at inference time, and faster than methods that require 3D models and refinement in their pipeline.

    We estimate the uncertainty of the pose regressed by a deep model in the second part. A CNN is combined with Gaussian Process Regression (GPR) to build a framework that directly produces a predictive distribution over the camera pose. The combination is achieved by exploiting the CNN to extract discriminative features and using the GPR to perform probabilistic inference. To prevent the complexity of uncertainty estimation from growing with the number of training images, we use pseudo inducing CNN feature points to represent the whole dataset and learn their representations using Stochastic Variational Inference (SVI). This makes the GPR a parametric model, which can be learnt together with the CNN backbone. We test the proposed hybrid framework on the problem of camera relocalisation.

    The third and fourth parts of the thesis have similar objectives: seeking self-supervision for learning dense reconstructions for pose estimation from images, without using ground-truth 3D models of scenes (part 3) or objects (part 4). We explore an alternative supervisory signal from multi-view geometry: photometric and/or featuremetric consistency between image pairs from different viewpoints is used to constrain the learning of world-centric coordinates (part 3) and object-centric coordinates (part 4). The dense reconstruction model is then used to establish 2D-3D correspondences at inference time, and the 6dof pose is computed with PnP plus RANSAC. Our 3D-model-free methods for pose estimation eliminate the dependency on 3D models used in state-of-the-art approaches. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
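Since the final pose computation in parts 3 and 4 is standard PnP with RANSAC over predicted 2D-3D correspondences, a minimal sketch of that step using OpenCV is shown below. The function name, reprojection threshold and solver flag are illustrative assumptions, not the thesis implementation.

```python
import cv2
import numpy as np

def pose_from_scene_coordinates(pixels_2d: np.ndarray,
                                coords_3d: np.ndarray,
                                K: np.ndarray):
    """Recover the 6-DoF camera pose from predicted 2D-3D correspondences
    (pixels_2d: Nx2 pixel locations, coords_3d: Nx3 predicted scene
    coordinates, K: 3x3 intrinsics) using PnP inside a RANSAC loop."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        coords_3d.astype(np.float64),
        pixels_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        iterationsCount=1000,
        reprojectionError=3.0,        # pixels
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # world-to-camera rotation matrix
    return R, tvec, inliers
```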

    Active Visual Localization in Partially Calibrated Environments

    Humans who get lost can robustly localize themselves without a map by following prominent visual cues or landmarks. In this work, we aim to endow autonomous agents with the same ability. Such an ability is important in robotics applications, yet very challenging when an agent is exposed to partially calibrated environments, where camera images with accurate 6 Degree-of-Freedom pose labels cover only part of the scene. To address this challenge, we explore using Reinforcement Learning to search for a policy that generates intelligent motions so as to actively localize the agent given visual information in partially calibrated environments. Our core contribution is to formulate the active visual localization problem as a Partially Observable Markov Decision Process and to propose an algorithmic framework based on Deep Reinforcement Learning to solve it. We further propose an indoor scene dataset, ACR-6, which consists of both synthetic and real data and simulates challenging scenarios for active visual localization. We benchmark our algorithm against handcrafted baselines for localization and demonstrate that our approach significantly outperforms them in localization success rate. Comment: https://www.youtube.com/watch?v=DIH-GbytCPM&feature=youtu.be
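The interaction loop behind this POMDP formulation can be sketched roughly as follows: the agent only observes images (never its true pose), tries to relocalise at each step, and otherwise moves to gather a more informative view. The environment, policy and relocaliser interfaces here are hypothetical, gym-style assumptions rather than the paper's framework.

```python
def run_episode(env, policy, relocaliser, max_steps=50):
    """One active-localization episode in a partially calibrated environment."""
    obs = env.reset()                       # RGB image observation only
    for step in range(max_steps):
        pose, confident = relocaliser(obs)  # try to localise in the partial map
        if confident:
            return pose, step               # success: localisation accepted
        action = policy(obs)                # choose a motion (e.g. turn, move)
        obs, _reward, done, _info = env.step(action)
        if done:
            break
    return None, max_steps                  # failed to localise in this episode
```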