    Real-Time RGB-D Camera Pose Estimation in Novel Scenes using a Relocalisation Cascade

    Camera pose estimation is an important problem in computer vision. Common techniques either match the current image against keyframes with known poses, directly regress the pose, or establish correspondences between keypoints in the image and points in the scene to estimate the pose. In recent years, regression forests have become a popular alternative to establish such correspondences. They achieve accurate results, but have traditionally needed to be trained offline on the target scene, preventing relocalisation in new environments. Recently, we showed how to circumvent this limitation by adapting a pre-trained forest to a new scene on the fly. The adapted forests achieved relocalisation performance that was on par with that of offline forests, and our approach was able to estimate the camera pose in close to real time. In this paper, we present an extension of this work that achieves significantly better relocalisation performance whilst running fully in real time. To achieve this, we make several changes to the original approach: (i) instead of accepting the camera pose hypothesis without question, we make it possible to score the final few hypotheses using a geometric approach and select the most promising; (ii) we chain several instantiations of our relocaliser together in a cascade, allowing us to try faster but less accurate relocalisation first, only falling back to slower, more accurate relocalisation as necessary; and (iii) we tune the parameters of our cascade to achieve effective overall performance. These changes allow us to significantly improve upon the performance our original state-of-the-art method was able to achieve on the well-known 7-Scenes and Stanford 4 Scenes benchmarks. 
As additional contributions, we present a way of visualising the internal behaviour of our forests and show how to entirely circumvent the need to pre-train a forest on a generic scene. Comment: Tommaso Cavallari, Stuart Golodetz, Nicholas Lord and Julien Valentin assert joint first authorship.
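
The cascade idea in (ii) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Stage` type, the `score_pose` callback, the per-stage `max_score` thresholds and the fallback to the best-scoring pose when no stage passes are all assumptions made for illustration. The sketch tries stages from fastest to slowest and stops at the first pose whose geometric score (lower is better here) is good enough.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Pose = Tuple[float, ...]  # placeholder pose type (hypothetical)


@dataclass
class Stage:
    """One relocaliser in the cascade (hypothetical structure)."""
    name: str
    relocalise: Callable[[object], Optional[Pose]]  # returns a pose, or None on failure
    max_score: float  # accept the pose if its score is <= this threshold


def run_cascade(
    frame: object,
    stages: Sequence[Stage],
    score_pose: Callable[[object, Pose], float],
) -> Optional[Tuple[Pose, float, str]]:
    """Try stages in order (fastest first) and return the first acceptable pose.

    Assumption: if no stage meets its threshold, the best-scoring pose seen
    so far is returned; None is returned only if every stage fails outright.
    """
    best: Optional[Tuple[Pose, float, str]] = None
    for stage in stages:
        pose = stage.relocalise(frame)
        if pose is None:
            continue  # this stage produced no hypothesis; fall back to the next
        score = score_pose(frame, pose)
        if best is None or score < best[1]:
            best = (pose, score, stage.name)
        if score <= stage.max_score:
            return pose, score, stage.name  # good enough: skip the slower stages
    return best
```

In this sketch the slower stages run only when a faster stage fails or scores poorly, which is where the average-case speed-up of a cascade would come from.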

    InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure

    Volumetric models have become a popular representation for 3D scenes in recent years. One breakthrough leading to their popularity was KinectFusion, which focuses on 3D reconstruction using RGB-D sensors. However, monocular SLAM has since also been tackled with very similar approaches. Representing the reconstruction volumetrically as a truncated signed distance function (TSDF) leads to most of the simplicity and efficiency that can be achieved with GPU implementations of these systems. However, this representation is memory-intensive and limits applicability to small-scale reconstructions. Several avenues have been explored to overcome this. With the aim of summarizing them and providing a fast, flexible 3D reconstruction pipeline, we propose a new, unifying framework called InfiniTAM. The idea is that steps like camera tracking, scene representation and integration of new data can easily be replaced and adapted to the user's needs. This report describes the technical implementation details of InfiniTAM v3, the third version of our InfiniTAM system. We have added various new features, as well as making numerous enhancements to the low-level code that significantly improve our camera tracking performance. The new features that we expect to be of most interest are (i) a robust camera tracking module; (ii) an implementation of Glocker et al.'s keyframe-based random ferns camera relocaliser; (iii) a novel approach to globally-consistent TSDF-based reconstruction, based on dividing the scene into rigid submaps and optimising the relative poses between them; and (iv) an implementation of Keller et al.'s surfel-based reconstruction approach. Comment: This article largely supersedes arXiv:1410.0925 (it describes version 3 of the InfiniTAM framework).

    Image-Based Localization Using Context

    The image-based localization problem consists of estimating the 6-DoF camera pose by matching an image to a 3D point cloud (or equivalent) representing a 3D environment. The robustness and accuracy of current solutions have not been objectively quantified. We have completed a comparative analysis of the main state-of-the-art approaches, namely Brute Force Matching, Approximate Nearest Neighbour Matching, Embedded Ferns Classification, ACG Localizer (using a visual vocabulary) and the Keyframe Matching Approach. The results of the study revealed major deficiencies in each approach, mainly in search space reduction, clustering, feature matching and sensitivity to where the query image was taken. We then chose to focus on one common major problem: reducing the search space. We propose a new image-based localization approach that reduces the search space by using global descriptors to find candidate keyframes in the database and then searching only against the 3D points seen from these candidates, using local descriptors stored in a 3D cloud map.
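
The two-step search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed system: the function names, the dictionary-based map layout, the Euclidean distance and the top-`k` selection rule are all hypothetical choices. Global descriptors pick candidate keyframes, and local descriptors are then matched only against the 3D points visible from those candidates.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_keyframes(
    query_global: Sequence[float],
    keyframe_globals: Dict[int, Sequence[float]],
    k: int,
) -> List[int]:
    """Return the ids of the k keyframes whose global descriptors are closest."""
    ranked = sorted(keyframe_globals,
                    key=lambda kf: l2(query_global, keyframe_globals[kf]))
    return ranked[:k]


def reduced_point_set(
    candidates: Sequence[int],
    visible_points: Dict[int, Set[int]],
) -> Set[int]:
    """Union of the 3D point ids seen from the candidate keyframes."""
    points: Set[int] = set()
    for kf in candidates:
        points |= visible_points.get(kf, set())
    return points


def match_local(
    query_locals: Sequence[Sequence[float]],
    point_descriptors: Dict[int, Sequence[float]],
    allowed_points: Set[int],
) -> List[Tuple[int, int]]:
    """Match each query feature to its nearest 3D point within the reduced set."""
    matches: List[Tuple[int, int]] = []
    if not allowed_points:
        return matches
    ordered = sorted(allowed_points)  # fixed order so ties resolve deterministically
    for qi, q in enumerate(query_locals):
        best = min(ordered, key=lambda p: l2(q, point_descriptors[p]))
        matches.append((qi, best))
    return matches
```

The resulting 2D-3D matches would then be passed to a pose solver, which is outside the scope of this sketch. The nearest-neighbour search is brute force here for clarity; the point is only that it runs over the reduced point set rather than the whole map.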

    Exploiting Points and Lines in Regression Forests for RGB-D Camera Relocalization

    Camera relocalization plays a vital role in many robotics and computer vision tasks, such as global localization, recovery from tracking failure and loop closure detection. Recent random forest-based methods exploit randomly sampled pixel comparison features to predict 3D world locations for 2D image locations, which then guide the camera pose optimization. However, these image features are only sampled randomly in the images, without considering spatial structure or geometric information, leading to large errors or failures in poorly textured areas or under motion blur. Line segment features are more robust in these environments. In this work, we propose to jointly exploit points and lines within the framework of uncertainty-driven regression forests. The proposed approach is thoroughly evaluated on three publicly available datasets against several strong state-of-the-art baselines, using several different error metrics. Experimental results demonstrate the efficacy of our method, showing performance superior to or on par with the state of the art. Comment: published as a conference paper at 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

    Efficient Image-Based Localization Using Context

    Image-Based Localization (IBL) is the problem of computing the position and orientation of a camera with respect to a geometric representation of the scene. A fundamental building block of IBL is searching the space of a saved 3D representation of the scene for correspondences to a query image. The robustness and accuracy of the IBL approaches in the literature have not been objectively quantified. First, this thesis presents a detailed description and study of three different 3D modeling packages based on Structure-from-Motion (SFM) to reconstruct a 3D map of an environment. The packages tested are VSFM, Bundler and PTAM. The objective is to assess the mapping ability of each of the techniques and choose the best one to use for reconstructing the IBL 3D map. The study results show that image matching, which is the bottleneck of SFM, SLAM and IBL, plays the major role in favour of VSFM. Poor image matching results in wrong matches being used to build the 3D map. It is crucial for IBL to choose the software that provides the best quality of points, i.e. the largest number of correct 3D points. For this reason, VSFM is chosen to reconstruct the 3D maps for IBL. Second, this work presents a comparative study of the main approaches, namely Brute Force Matching, Tree-Based Approach, Embedded Ferns Classification, ACG Localizer, Keyframe Approach, Decision Forest, Worldwide Pose Estimation and MPEG Search Space Reduction. The objective of the comparative analysis was to uncover the specifics of each of these techniques and thereby understand their advantages and disadvantages. The testing was performed on the Dubrovnik dataset, where the localization is determined with respect to a 3D cloud map computed using a Structure-from-Motion approach. The study results show that current state-of-the-art IBL solutions still face challenges in search space reduction, feature matching and clustering, and that the quality of the solution is not consistent across all query images.
Third, this work addresses the search space problem in order to solve the IBL problem. The Gist-based Search Space Reduction (GSSR), an efficient alternative to the available search space solutions, is proposed. It relies on GIST descriptors to considerably reduce the search space and computational time, while at the same time exceeding the state of the art in localization accuracy. Experiments on the 7-Scenes dataset of Microsoft Research reveal considerable speedups for GSSR versus tree-based approaches, reaching a fourfold speedup on the Heads scene and reducing the search space by an average of 92% while maintaining better accuracy.

    Collaborative large-scale dense 3D reconstruction with online inter-agent pose optimisation

    Reconstructing dense, volumetric models of real-world 3D scenes is important for many tasks, but capturing large scenes can take significant time, and the risk of transient changes to the scene goes up as the capture time increases. These are good reasons to want instead to capture several smaller sub-scenes that can be joined to make the whole scene. Achieving this has traditionally been difficult: joining sub-scenes that may never have been viewed from the same angle requires a high-quality camera relocaliser that can cope with novel poses, and tracking drift in each sub-scene can prevent them from being joined to make a consistent overall scene. Recent advances, however, have significantly improved our ability to capture medium-sized sub-scenes with little to no tracking drift: real-time globally consistent reconstruction systems can close loops and re-integrate the scene surface on the fly, whilst new visual-inertial odometry approaches can significantly reduce tracking drift during live reconstruction. Moreover, high-quality regression forest-based relocalisers have recently been made more practical by the introduction of a method to allow them to be trained and used online. In this paper, we leverage these advances to present what is, to our knowledge, the first system to allow multiple users to collaborate interactively to reconstruct dense, voxel-based models of whole buildings using only consumer-grade hardware, a task that has traditionally been both time-consuming and dependent on the availability of specialised hardware. Using our system, an entire house or lab can be reconstructed in under half an hour and at a far lower cost than was previously possible.