75 research outputs found

    3D Visual Perception for Self-Driving Cars using a Multi-Camera System: Calibration, Mapping, Localization, and Obstacle Detection

    Cameras are a crucial exteroceptive sensor for self-driving cars: they are low-cost and small, provide appearance information about the environment, and work in various weather conditions. They can be used for multiple purposes such as visual navigation and obstacle detection. A surround multi-camera system can cover the full 360-degree field-of-view around the car, avoiding blind spots which can otherwise lead to accidents. To minimize the number of cameras needed for surround perception, we utilize fisheye cameras. Consequently, standard vision pipelines for 3D mapping, visual localization, obstacle detection, etc. need to be adapted to take full advantage of the availability of multiple cameras rather than treat each camera individually. In addition, processing of fisheye images has to be supported. In this paper, we describe the camera calibration and subsequent processing pipeline for multi-fisheye-camera systems developed as part of the V-Charge project. This project seeks to enable automated valet parking for self-driving cars. Our pipeline is able to precisely calibrate multi-camera systems, build sparse 3D maps for visual navigation, visually localize the car with respect to these maps, generate accurate dense maps, and detect obstacles based on real-time depth map extraction.
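The fisheye adaptation mentioned in the abstract can be illustrated with a minimal sketch. Assuming a simple equidistant fisheye model (r = f * theta; the V-Charge pipeline uses its own calibrated model, so this is only illustrative, and all intrinsic values below are made up), a fisheye pixel is unprojected to a 3D viewing ray as follows:

```python
import numpy as np

def unproject_equidistant(u, v, fx, fy, cx, cy):
    """Map a fisheye pixel to a unit-length 3D viewing ray (equidistant model)."""
    mx = (u - cx) / fx
    my = (v - cy) / fy
    r = np.hypot(mx, my)               # angular radius theta in radians
    if r < 1e-12:                      # the principal point looks straight ahead
        return np.array([0.0, 0.0, 1.0])
    s = np.sin(r) / r                  # distribute sin(theta) over the pixel offset
    return np.array([s * mx, s * my, np.cos(r)])

# Hypothetical intrinsics purely for illustration:
ray = unproject_equidistant(640.0, 480.0, 300.0, 300.0, 640.0, 480.0)
```

Downstream modules (mapping, localization, obstacle detection) can then operate on such rays instead of assuming a pinhole projection.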

    Structureless Camera Motion Estimation of Unordered Omnidirectional Images

    This work presents a novel camera motion estimation pipeline for large collections of unordered omnidirectional images. To keep the pipeline as general and flexible as possible, cameras are modelled as unit spheres, allowing any central camera type to be incorporated. For each camera, an unprojection lookup called a P2S-map (Pixel-to-Sphere map) is generated from its intrinsics, mapping pixels to their corresponding positions on the unit sphere. Consequently, the camera geometry becomes independent of the underlying projection model. The pipeline also generates P2S-maps from world map projections with fewer distortion effects, as known from cartography. Using P2S-maps from camera calibration and world map projection allows converting omnidirectional camera images to an appropriate world map projection in order to apply standard feature extraction and matching algorithms for data association. The proposed estimation pipeline combines the flexibility of SfM (Structure from Motion), which handles unordered image collections, with the efficiency of PGO (Pose Graph Optimization), which is used as the back-end in graph-based Visual SLAM (Simultaneous Localization and Mapping) approaches to optimize camera poses from large image sequences. SfM uses BA (Bundle Adjustment) to jointly optimize camera poses (motion) and 3D feature locations (structure), which becomes computationally expensive for large-scale scenarios. In contrast, PGO solves for camera poses (motion) from measured transformations between cameras, keeping the optimization manageable. The proposed estimation algorithm combines both worlds. It obtains up-to-scale transformations between image pairs using two-view constraints, which are jointly scaled using trifocal constraints. A pose graph is generated from the scaled two-view transformations and solved by PGO to obtain camera motion efficiently, even for large image collections.
The obtained results can be used as input to provide initial pose estimates for further 3D reconstruction purposes, e.g. to build a sparse structure from feature correspondences in an SfM or SLAM framework with further refinement via BA. The pipeline also incorporates fixed extrinsic constraints from multi-camera setups as well as depth information provided by RGBD sensors. The entire camera motion estimation pipeline does not need to generate a sparse 3D structure of the captured environment and is thus called SCME (Structureless Camera Motion Estimation).

Table of Contents:
1 Introduction
  1.1 Motivation
    1.1.1 Increasing Interest of Image-Based 3D Reconstruction
    1.1.2 Underground Environments as Challenging Scenario
    1.1.3 Improved Mobile Camera Systems for Full Omnidirectional Imaging
  1.2 Issues
    1.2.1 Directional versus Omnidirectional Image Acquisition
    1.2.2 Structure from Motion versus Visual Simultaneous Localization and Mapping
  1.3 Contribution
  1.4 Structure of this Work
2 Related Work
  2.1 Visual Simultaneous Localization and Mapping
    2.1.1 Visual Odometry
    2.1.2 Pose Graph Optimization
  2.2 Structure from Motion
    2.2.1 Bundle Adjustment
    2.2.2 Structureless Bundle Adjustment
  2.3 Corresponding Issues
  2.4 Proposed Reconstruction Pipeline
3 Cameras and Pixel-to-Sphere Mappings with P2S-Maps
  3.1 Types
  3.2 Models
    3.2.1 Unified Camera Model
    3.2.2 Polynomial Camera Model
    3.2.3 Spherical Camera Model
  3.3 P2S-Maps - Mapping onto Unit Sphere via Lookup Table
    3.3.1 Lookup Table as Color Image
    3.3.2 Lookup Interpolation
    3.3.3 Depth Data Conversion
4 Calibration
  4.1 Overview of Proposed Calibration Pipeline
  4.2 Target Detection
  4.3 Intrinsic Calibration
    4.3.1 Selected Examples
  4.4 Extrinsic Calibration
    4.4.1 3D-2D Pose Estimation
    4.4.2 2D-2D Pose Estimation
    4.4.3 Pose Optimization
    4.4.4 Uncertainty Estimation
    4.4.5 Pose Graph Representation
    4.4.6 Bundle Adjustment
    4.4.7 Selected Examples
5 Full Omnidirectional Image Projections
  5.1 Panoramic Image Stitching
  5.2 World Map Projections
  5.3 World Map Projection Generator for P2S-Maps
  5.4 Conversion between Projections based on P2S-Maps
    5.4.1 Proposed Workflow
    5.4.2 Data Storage Format
    5.4.3 Real World Example
6 Relations between Two Camera Spheres
  6.1 Forward and Backward Projection
  6.2 Triangulation
    6.2.1 Linear Least Squares Method
    6.2.2 Alternative Midpoint Method
  6.3 Epipolar Geometry
  6.4 Transformation Recovery from Essential Matrix
    6.4.1 Cheirality
    6.4.2 Standard Procedure
    6.4.3 Simplified Procedure
    6.4.4 Improved Procedure
  6.5 Two-View Estimation
    6.5.1 Evaluation Strategy
    6.5.2 Error Metric
    6.5.3 Evaluation of Estimation Algorithms
    6.5.4 Concluding Remarks
  6.6 Two-View Optimization
    6.6.1 Epipolar-Based Error Distances
    6.6.2 Projection-Based Error Distances
    6.6.3 Comparison between Error Distances
  6.7 Two-View Translation Scaling
    6.7.1 Linear Least Squares Estimation
    6.7.2 Non-Linear Least Squares Optimization
    6.7.3 Comparison between Initial and Optimized Scaling Factor
  6.8 Homography to Identify Degeneracies
    6.8.1 Homography for Spherical Cameras
    6.8.2 Homography Estimation
    6.8.3 Homography Optimization
    6.8.4 Homography and Pure Rotation
    6.8.5 Homography in Epipolar Geometry
7 Relations between Three Camera Spheres
  7.1 Three-View Geometry
  7.2 Crossing Epipolar Planes Geometry
  7.3 Trifocal Geometry
  7.4 Relation between Trifocal, Three-View and Crossing Epipolar Planes
  7.5 Translation Ratio between Up-To-Scale Two-View Transformations
    7.5.1 Structureless Determination Approaches
    7.5.2 Structure-Based Determination Approaches
    7.5.3 Comparison between Proposed Approaches
8 Pose Graphs
  8.1 Optimization Principle
  8.2 Solvers
    8.2.1 Additional Graph Solvers
    8.2.2 False Loop Closure Detection
  8.3 Pose Graph Generation
    8.3.1 Generation of Synthetic Pose Graph Data
    8.3.2 Optimization of Synthetic Pose Graph Data
9 Structureless Camera Motion Estimation
  9.1 SCME Pipeline
  9.2 Determination of Two-View Translation Scale Factors
  9.3 Integration of Depth Data
  9.4 Integration of Extrinsic Camera Constraints
10 Camera Motion Estimation Results
  10.1 Directional Camera Images
  10.2 Omnidirectional Camera Images
11 Conclusion
  11.1 Summary
  11.2 Outlook and Future Work
Appendices
  A.1 Additional Extrinsic Calibration Results
  A.2 Linear Least Squares Scaling
  A.3 Proof Rank Deficiency
  A.4 Alternative Derivation Midpoint Method
  A.5 Simplification of Depth Calculation
  A.6 Relation between Epipolar and Circumferential Constraint
  A.7 Covariance Estimation
  A.8 Uncertainty Estimation from Epipolar Geometry
  A.9 Two-View Scaling Factor Estimation: Uncertainty Estimation
  A.10 Two-View Scaling Factor Optimization: Uncertainty Estimation
  A.11 Depth from Adjoining Two-View Geometries
  A.12 Alternative Three-View Derivation
    A.12.1 Second Derivation Approach
    A.12.2 Third Derivation Approach
  A.13 Relation between Trifocal Geometry and Alternative Midpoint Method
  A.14 Additional Pose Graph Generation Examples
  A.15 Pose Graph Solver Settings
  A.16 Additional Pose Graph Optimization Examples
Bibliography
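The P2S-map idea described above can be sketched in a few lines: a per-pixel lookup table of unit-sphere points hides the projection model behind plain array indexing. The example below builds such a table for an equirectangular world-map projection; the function name and resolution are illustrative, not taken from the thesis code.

```python
import numpy as np

def p2s_equirectangular(width, height):
    """Build a P2S lookup table for an equirectangular world-map projection."""
    u = (np.arange(width) + 0.5) / width           # normalized column in [0, 1)
    v = (np.arange(height) + 0.5) / height         # normalized row in [0, 1)
    lon = (u - 0.5) * 2.0 * np.pi                  # longitude in (-pi, pi)
    lat = (0.5 - v) * np.pi                        # latitude in (-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)
    # One unit-sphere point per pixel: shape (H, W, 3)
    return np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

p2s = p2s_equirectangular(64, 32)
# Any algorithm can now fetch a viewing ray by table lookup: ray = p2s[v, u],
# independent of the projection model that produced the table.
```

A calibrated fisheye or catadioptric camera would fill the same kind of table from its own intrinsics, which is what makes the downstream geometry projection-agnostic.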

    Continuous-Time Fixed-Lag Smoothing for LiDAR-Inertial-Camera SLAM

    Localization and mapping with heterogeneous multi-sensor fusion have become prevalent in recent years. To adequately fuse multi-modal sensor measurements received at different time instants and different frequencies, we estimate the continuous-time trajectory by fixed-lag smoothing within a factor-graph optimization framework. With the continuous-time formulation, we can query poses at the time instant of any sensor measurement. To bound the computational complexity of the continuous-time fixed-lag smoother, we maintain temporal and keyframe sliding windows of constant size, and probabilistically marginalize out control points of the trajectory and other states, which preserves prior information for future sliding-window optimization. Based on continuous-time fixed-lag smoothing, we design tightly-coupled multi-modal SLAM algorithms for a variety of sensor combinations, such as LiDAR-inertial and LiDAR-inertial-camera SLAM systems, in which online time-offset calibration is also naturally supported. More importantly, benefiting from the marginalization and our derived analytical Jacobians for optimization, the proposed continuous-time SLAM systems achieve real-time performance despite the high complexity of the continuous-time formulation. The proposed multi-modal SLAM systems have been extensively evaluated on three public datasets and on self-collected datasets. The results demonstrate that the proposed continuous-time SLAM systems achieve high-accuracy pose estimates and outperform existing state-of-the-art methods. To benefit the research community, we will open-source our code at https://github.com/APRIL-ZJU/clic.
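The key convenience of the continuous-time formulation, querying a pose at an arbitrary timestamp, is commonly realized with a spline over control points. The sketch below evaluates a uniform cubic B-spline (positions only, for brevity); the knot spacing and control points are made up for illustration and this is not the paper's implementation.

```python
import numpy as np

def spline_position(ctrl, t, dt=0.1):
    """Evaluate a uniform cubic B-spline trajectory at an arbitrary time t."""
    i = int(t / dt)                    # index of the active spline segment
    u = t / dt - i                     # normalized time within the segment
    p0, p1, p2, p3 = ctrl[i:i + 4]     # four control points influence t
    # Standard uniform cubic B-spline basis (weights sum to 1):
    b0 = (1 - u) ** 3
    b1 = 3 * u**3 - 6 * u**2 + 4
    b2 = -3 * u**3 + 3 * u**2 + 3 * u + 1
    b3 = u ** 3
    return (b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3) / 6.0

# Illustrative 2D control points at 0.1 s knot spacing:
ctrl = np.array([[0, 0], [1, 0], [2, 1], [3, 3], [4, 6]], dtype=float)
pos = spline_position(ctrl, 0.05)      # pose query between control-point times
```

A fixed-lag smoother over such a trajectory optimizes only the control points inside the sliding window, marginalizing the rest, which is what bounds the cost.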

    Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs

    Humans are able to form a complex mental model of the environment they move in. This mental model captures geometric and semantic aspects of the scene, describes the environment at multiple levels of abstraction (e.g., objects, rooms, buildings), and includes static and dynamic entities and their relations (e.g., a person is in a room at a given time). In contrast, current robots' internal representations still provide a partial and fragmented understanding of the environment, either in the form of a sparse or dense set of geometric primitives (e.g., points, lines, planes, voxels) or as a collection of objects. This paper attempts to reduce the gap between robot and human perception by introducing a novel representation, a 3D Dynamic Scene Graph (DSG), that seamlessly captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatio-temporal relations among nodes. Our second contribution is Kimera, the first fully automatic method to build a DSG from visual-inertial data. Kimera includes state-of-the-art techniques for visual-inertial SLAM, metric-semantic 3D reconstruction, object localization, human pose and shape estimation, and scene parsing. Our third contribution is a comprehensive evaluation of Kimera on real-life datasets and photo-realistic simulations, including a newly released dataset, uHumans2, which simulates a collection of crowded indoor and outdoor scenes. Our evaluation shows that Kimera achieves state-of-the-art performance in visual-inertial SLAM, estimates an accurate 3D metric-semantic mesh model in real-time, and builds a DSG of a complex indoor environment with tens of objects and humans in minutes. Our final contribution shows how to use a DSG for real-time hierarchical semantic path-planning. The core modules in Kimera are open-source.
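The layered-graph structure of a DSG can be sketched as a tiny data structure: nodes tagged with an abstraction layer, edges carrying a relation label. The layer names below follow the paper's object/room abstraction levels; the class itself and its attributes are illustrative, not Kimera's API.

```python
class SceneGraph:
    """Toy layered scene graph: nodes live in named layers, edges hold relations."""

    def __init__(self):
        self.nodes = {}                 # node id -> (layer name, attributes)
        self.edges = []                 # (id_a, id_b, relation label)

    def add_node(self, node_id, layer, **attrs):
        self.nodes[node_id] = (layer, attrs)

    def add_edge(self, a, b, relation):
        self.edges.append((a, b, relation))

    def layer(self, name):
        """All node ids in one abstraction layer."""
        return [i for i, (l, _) in self.nodes.items() if l == name]

g = SceneGraph()
g.add_node("room_1", "rooms")
g.add_node("chair_1", "objects", position=(1.0, 0.5, 0.0))
g.add_node("person_1", "agents", time=12.3)       # dynamic entity with a timestamp
g.add_edge("chair_1", "room_1", "contained_in")   # cross-layer spatial relation
g.add_edge("person_1", "room_1", "is_in")         # spatio-temporal relation
```

Hierarchical path-planning can then reason coarsely over the room layer before refining within the object layer.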

    Event-Based Algorithms For Geometric Computer Vision

    Event cameras are novel bio-inspired sensors which mimic the function of the human retina. Rather than directly capturing intensities to form synchronous images as in traditional cameras, event cameras asynchronously detect changes in log image intensity. When such a change is detected at a given pixel, it is immediately sent to the host computer, where each event consists of the (x, y) pixel position of the change, a timestamp accurate to tens of microseconds, and a polarity indicating whether the pixel got brighter or darker. These cameras provide a number of useful benefits over traditional cameras, including the ability to track extremely fast motions, high dynamic range, and low power consumption. However, with a new sensing modality comes the need to develop novel algorithms. As these cameras do not capture photometric intensities, novel loss functions must be developed to replace the photoconsistency assumption which serves as the backbone of many classical computer vision algorithms. In addition, because these sensors are relatively new, the wealth of data available for traditional images, with which learning-based methods such as deep neural networks can be trained, does not yet exist. In this work, we address both of these issues with two foundational principles. First, we show that the motion blur induced when the events are projected into the 2D image plane can be used as a suitable substitute for the classical photometric loss function. Second, we develop self-supervised learning methods which allow us to train convolutional neural networks to estimate motion without any labeled training data. We apply these principles to solve classical perception problems such as feature tracking, visual inertial odometry, optical flow and stereo depth estimation, as well as recognition tasks such as object detection and human pose estimation. We show that these solutions are able to exploit the benefits of event cameras, allowing us to operate in fast-moving scenes with challenging lighting which would be incredibly difficult for traditional cameras.
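The first principle, using event motion blur as a loss, can be illustrated with a minimal contrast-maximization sketch: warp each event backwards along a candidate motion, accumulate the warped events into an image, and score its sharpness by variance. The event layout, image size, and synthetic data below are made up for illustration.

```python
import numpy as np

def contrast(events, flow, shape=(32, 32)):
    """Score a candidate flow: warp events by it and measure image sharpness.

    events: array of rows (x, y, t); flow: (vx, vy) in pixels per second.
    """
    img = np.zeros(shape)
    for x, y, t in events:
        xw = int(round(x - flow[0] * t))    # undo the candidate motion
        yw = int(round(y - flow[1] * t))
        if 0 <= xw < shape[1] and 0 <= yw < shape[0]:
            img[yw, xw] += 1.0              # accumulate the warped event
    return img.var()                        # sharp edges -> high variance

# Synthetic events from an edge moving at 10 px/s along x:
events = np.array([(5.0 + 10.0 * t, 8.0, t) for t in np.linspace(0.0, 1.0, 50)])
sharp = contrast(events, (10.0, 0.0))       # warping with the true flow
blurry = contrast(events, (0.0, 0.0))       # no motion compensation
```

Maximizing this score over candidate motions recovers the flow without ever needing photometric intensities, which is what replaces the photoconsistency assumption.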

    A Comprehensive Introduction of Visual-Inertial Navigation

    In this article, a tutorial introduction to visual-inertial navigation (VIN) is presented. Visual and inertial perception are two complementary sensing modalities, and cameras and inertial measurement units (IMUs) are the corresponding sensors. The low cost and light weight of camera-IMU sensor combinations make them ubiquitous in robotic navigation. Visual-inertial navigation is a state estimation problem that estimates the ego-motion and local environment of the sensor platform. This paper presents visual-inertial navigation in the classical state estimation framework, first illustrating the estimation problem in terms of state variables and system models, including the representations (parameterizations) of related quantities, IMU dynamic and camera measurement models, and the corresponding general probabilistic graphical models (factor graphs). Secondly, we survey the existing model-based estimation methodologies, which include filter-based and optimization-based frameworks and the related on-manifold operations. We also discuss the calibration of relevant parameters, as well as the initialization of the states of interest in optimization-based frameworks. Then the evaluation and improvement of VIN in terms of accuracy, efficiency, and robustness are discussed. Finally, we briefly mention recent developments in learning-based methods that may become alternatives to traditional model-based methods.
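The IMU dynamic model mentioned above is, at its core, dead-reckoning: rotate body-frame acceleration into the world frame and integrate. The planar (2D) sketch below, with a synthetic constant measurement stream, is only a minimal illustration of that prediction step, not any particular VIN system.

```python
import numpy as np

def propagate(state, gyro_z, accel_body, dt):
    """One planar IMU prediction step; state = (x, y, yaw, vx, vy) in the world frame."""
    x, y, yaw, vx, vy = state
    c, s = np.cos(yaw), np.sin(yaw)
    ax = c * accel_body[0] - s * accel_body[1]   # rotate acceleration to world frame
    ay = s * accel_body[0] + c * accel_body[1]
    return (x + vx * dt,                          # integrate position
            y + vy * dt,
            yaw + gyro_z * dt,                    # integrate angular rate
            vx + ax * dt,                         # integrate acceleration
            vy + ay * dt)

state = (0.0, 0.0, 0.0, 1.0, 0.0)                 # moving at 1 m/s along x
for _ in range(100):                              # 1 s of IMU data at 100 Hz
    state = propagate(state, 0.0, (0.0, 0.0), 0.01)
```

In a full VIN system this prediction runs between camera frames, and the visual measurements correct the drift that pure integration accumulates.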

    Visual SLAM for Autonomous Navigation of MAVs

    This thesis focuses on developing onboard visual simultaneous localization and mapping (SLAM) systems to enable autonomous navigation of micro aerial vehicles (MAVs), a topic that remains challenging given the limited payload and computational capability of a typical MAV. In MAV applications, visual SLAM systems must be very efficient, especially when other visual tasks have to be performed in parallel. Furthermore, robust pose tracking is highly desirable to enable safe autonomous navigation of an MAV in three-dimensional (3D) space. These challenges motivate the work in this thesis in the following respects. Firstly, the problem of visual pose estimation for MAVs using an artificial landmark is addressed. An artificial neural network (ANN) is used to robustly recognize this visual marker in cluttered environments. Then a computational projective-geometry method is implemented for relative pose computation based on the retrieved geometry information of the visual marker. The presented vision system can be used not only for pose control of MAVs, but also to provide accurate pose estimates to a monocular visual SLAM system, serving as an automatic initialization module for both indoor and outdoor environments. Secondly, autonomous landing on an arbitrarily textured landing site during autonomous navigation of an MAV is achieved. By integrating an efficient local-feature-based object detection algorithm within a monocular visual SLAM system, the MAV is able to search for the landing site autonomously along a predefined path, and land on it once it has been found. Thus, the proposed monocular visual solution enables autonomous navigation of an MAV in parallel with landing site detection. This solution relaxes the assumption made in conventional vision-guided landing systems that the landing site must be located inside the field of view (FOV) of the vision system before initiating the landing task.
The third problem addressed in this thesis is multi-camera visual SLAM for robust pose tracking of MAVs. Due to the limited FOV of a single camera, pose tracking using monocular visual SLAM may easily fail when the MAV navigates in unknown environments. Previous work addresses this problem mainly by fusing information from other sensors, such as an inertial measurement unit (IMU), which improves the robustness of the whole system but not of visual SLAM itself. This thesis investigates solutions for improving the pose tracking robustness of a visual SLAM system by utilizing multiple cameras. A mathematical analysis of how measurements from multiple cameras should be integrated into the optimization of visual SLAM is provided. The resulting theory allows those measurements to be used for both robust pose tracking and map updating of the visual SLAM system. Furthermore, such a multi-camera visual SLAM system is modified into a robust constant-time visual odometry. By integrating this visual odometry with an efficient back-end consisting of loop-closure detection and pose-graph optimization, a near-constant-time multi-camera visual SLAM system is achieved for autonomous navigation of MAVs in large-scale environments.
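The pose-graph back-end referenced above can be illustrated with a deliberately tiny example: a 1D chain of odometry constraints plus one loop closure, solved in closed form by linear least squares. All measurement values are toy numbers; real systems work on SE(3) poses and iterate, but the structure of the problem is the same.

```python
import numpy as np

# Poses p0..p3 with p0 fixed at 0; each measurement says z_ij = p_j - p_i.
odometry = [(0, 1, 1.05), (1, 2, 0.95), (2, 3, 1.10)]   # slightly drifting chain
loop = [(0, 3, 3.00)]                                    # loop closure p3 - p0

A, b = [], []
for i, j, z in odometry + loop:
    row = np.zeros(3)              # unknowns are p1, p2, p3 (p0 is the fixed gauge)
    if j > 0:
        row[j - 1] = 1.0
    if i > 0:
        row[i - 1] = -1.0
    A.append(row)
    b.append(z)

# Least-squares solution spreads the loop-closure correction over the chain:
p, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
```

Here the raw odometry chain would put p3 at 3.10, while the optimized solution is pulled toward the loop-closure measurement of 3.00, with the residual distributed over the intermediate poses.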