227 research outputs found

    Efficient 2D-3D Matching for Multi-Camera Visual Localization

    Full text link
    Visual localization, i.e., determining the position and orientation of a vehicle with respect to a map, is a key problem in autonomous driving. We present a multicamera visual inertial localization algorithm for large scale environments. To efficiently and effectively match features against a pre-built global 3D map, we propose a prioritized feature matching scheme for multi-camera systems. In contrast to existing works, designed for monocular cameras, we (1) tailor the prioritization function to the multi-camera setup and (2) run feature matching and pose estimation in parallel. This significantly accelerates the matching and pose estimation stages and allows us to dynamically adapt the matching efforts based on the surrounding environment. In addition, we show how pose priors can be integrated into the localization system to increase efficiency and robustness. Finally, we extend our algorithm by fusing the absolute pose estimates with motion estimates from a multi-camera visual inertial odometry pipeline (VIO). This results in a system that provides reliable and drift-less pose estimation. Extensive experiments show that our localization runs fast and robust under varying conditions, and that our extended algorithm enables reliable real-time pose estimation.Comment: 7 pages, 5 figure

    Understanding the Limitations of CNN-based Absolute Camera Pose Regression

    Full text link
    Visual localization is the task of accurate camera pose estimation in a known scene. It is a key problem in computer vision and robotics, with applications including self-driving cars, Structure-from-Motion, SLAM, and Mixed Reality. Traditionally, the localization problem has been tackled using 3D geometry. Recently, end-to-end approaches based on convolutional neural networks have become popular. These methods learn to directly regress the camera pose from an input image. However, they do not achieve the same level of pose accuracy as 3D structure-based methods. To understand this behavior, we develop a theoretical model for camera pose regression. We use our model to predict failure cases for pose regression techniques and verify our predictions through experiments. We furthermore use our model to show that pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure. A key result is that current approaches do not consistently outperform a handcrafted image retrieval baseline. This clearly shows that additional research is needed before pose regression algorithms are ready to compete with structure-based methods.Comment: Initial version of a paper accepted to CVPR 201

    MeshLoc: Mesh-Based Visual Localization

    Full text link
    Visual localization, i.e., the problem of camera pose estimation, is a central component of applications such as autonomous robots and augmented reality systems. A dominant approach in the literature, shown to scale to large scenes and to handle complex illumination and seasonal changes, is based on local features extracted from images. The scene representation is a sparse Structure-from-Motion point cloud that is tied to a specific local feature. Switching to another feature type requires an expensive feature matching step between the database images used to construct the point cloud. In this work, we thus explore a more flexible alternative based on dense 3D meshes that does not require features matching between database images to build the scene representation. We show that this approach can achieve state-of-the-art results. We further show that surprisingly competitive results can be obtained when extracting features on renderings of these meshes, without any neural rendering stage, and even when rendering raw scene geometry without color or texture. Our results show that dense 3D model-based representations are a promising alternative to existing representations and point to interesting and challenging directions for future research.Comment: to be published in the proceedings of ECCV 2022, code repository: https://github.com/tsattler/meshloc_releas

    Towards Robust Visual Localization in Challenging Conditions

    Get PDF
    Visual localization is a fundamental problem in computer vision, with a multitude of applications in robotics, augmented reality and structure-from-motion. The basic problem is to, based on one or more images, figure out the position and orientation of the camera which captured these images relative to some model of the environment. Current visual localization approaches typically work well when the images to be localized are captured under similar conditions compared to those captured during mapping. However, when the environment exhibits large changes in visual appearance, due to e.g. variations in weather, seasons, day-night or viewpoint, the traditional pipelines break down. The reason is that the local image features used are based on low-level pixel-intensity information, which is not invariant to these transformations: when the environment changes, this will cause a different set of keypoints to be detected, and their descriptors will be different, making the long-term visual localization problem a challenging one. In this thesis, four papers are included, which present work towards solving the problem of long-term visual localization. Three of the articles present ideas for how semantic information may be included to aid in the localization process: one approach relies only on the semantic information for visual localization, another shows how the semantics can be used to detect outlier feature correspondences, while the third presents a sequential localization algorithm which relies on the consistency of the reprojection of a semantic model, instead of traditional features. The final article is a benchmark paper, where we present three new benchmark datasets aimed at evaluating localization algorithms in the context of long-term visual localization

    View synthesis for pose computation

    Get PDF
    International audienceGeometrical registration of a query image with respect to a 3D model, or pose estimation, is the cornerstone of many computer vision applications. It is often based on the matching of local photometric descriptors invariant to limited viewpoint changes. However, when the query image has been acquired from a camera position not covered by the model images, pose estimation is often not accurate and sometimes even fails, precisely because of the limited invariance of descriptors. In this paper, we propose to add descriptors to the model, obtained from synthesized views associated with virtual cameras completing the covering of the scene by the real cameras. We propose an efficient strategy to localize the virtual cameras in the scene and generate valuable descriptors from synthetic views. We also discuss a guided sampling strategy for registration in this context. Experiments show that the accuracy of pose estimation is dramatically improved when large viewpoint changes makes the matching of classic descriptors a challenging task

    A multisensor SLAM for dense maps of large scale environments under poor lighting conditions

    Get PDF
    This thesis describes the development and implementation of a multisensor large scale autonomous mapping system for surveying tasks in underground mines. The hazardous nature of the underground mining industry has resulted in a push towards autonomous solutions to the most dangerous operations, including surveying tasks. Many existing autonomous mapping techniques rely on approaches to the Simultaneous Localization and Mapping (SLAM) problem which are not suited to the extreme characteristics of active underground mining environments. Our proposed multisensor system has been designed from the outset to address the unique challenges associated with underground SLAM. The robustness, self-containment and portability of the system maximize the potential applications.The multisensor mapping solution proposed as a result of this work is based on a fusion of omnidirectional bearing-only vision-based localization and 3D laser point cloud registration. By combining these two SLAM techniques it is possible to achieve some of the advantages of both approaches – the real-time attributes of vision-based SLAM and the dense, high precision maps obtained through 3D lasers. The result is a viable autonomous mapping solution suitable for application in challenging underground mining environments.A further improvement to the robustness of the proposed multisensor SLAM system is a consequence of incorporating colour information into vision-based localization. Underground mining environments are often dominated by dynamic sources of illumination which can cause inconsistent feature motion during localization. Colour information is utilized to identify and remove features resulting from illumination artefacts and to improve the monochrome based feature matching between frames.Finally, the proposed multisensor mapping system is implemented and evaluated in both above ground and underground scenarios. The resulting large scale maps contained a maximum offset error of ±30mm for mapping tasks with lengths over 100m

    From Coarse to Fine: Robust Hierarchical Localization at Large Scale

    Full text link
    Robust and accurate visual localization is a fundamental capability for numerous applications, such as autonomous driving, mobile robotics, or augmented reality. It remains, however, a challenging task, particularly for large-scale environments and in presence of significant appearance changes. State-of-the-art methods not only struggle with such scenarios, but are often too resource intensive for certain real-time applications. In this paper we propose HF-Net, a hierarchical localization approach based on a monolithic CNN that simultaneously predicts local features and global descriptors for accurate 6-DoF localization. We exploit the coarse-to-fine localization paradigm: we first perform a global retrieval to obtain location hypotheses and only later match local features within those candidate places. This hierarchical approach incurs significant runtime savings and makes our system suitable for real-time operation. By leveraging learned descriptors, our method achieves remarkable localization robustness across large variations of appearance and sets a new state-of-the-art on two challenging benchmarks for large-scale localization.Comment: Camera-ready for CVPR 201

    Semantic Validation in Structure from Motion

    Full text link
    The Structure from Motion (SfM) challenge in computer vision is the process of recovering the 3D structure of a scene from a series of projective measurements that are calculated from a collection of 2D images, taken from different perspectives. SfM consists of three main steps; feature detection and matching, camera motion estimation, and recovery of 3D structure from estimated intrinsic and extrinsic parameters and features. A problem encountered in SfM is that scenes lacking texture or with repetitive features can cause erroneous feature matching between frames. Semantic segmentation offers a route to validate and correct SfM models by labelling pixels in the input images with the use of a deep convolutional neural network. The semantic and geometric properties associated with classes in the scene can be taken advantage of to apply prior constraints to each class of object. The SfM pipeline COLMAP and semantic segmentation pipeline DeepLab were used. This, along with planar reconstruction of the dense model, were used to determine erroneous points that may be occluded from the calculated camera position, given the semantic label, and thus prior constraint of the reconstructed plane. Herein, semantic segmentation is integrated into SfM to apply priors on the 3D point cloud, given the object detection in the 2D input images. Additionally, the semantic labels of matched keypoints are compared and inconsistent semantically labelled points discarded. Furthermore, semantic labels on input images are used for the removal of objects associated with motion in the output SfM models. The proposed approach is evaluated on a data-set of 1102 images of a repetitive architecture scene. This project offers a novel method for improved validation of 3D SfM models
    • …
    corecore