8 research outputs found

    Real-time RGB-D semantic keyframe SLAM based on image segmentation learning from industrial CAD models

    This paper presents methods for performing real-time semantic SLAM aimed at autonomous navigation and control of a humanoid robot in a manufacturing scenario. A novel multi-keyframe approach is proposed that simultaneously minimizes a semantic cost based on class-level features in addition to common photometric and geometric costs. The approach is shown to robustly construct a 3D map with associated class labels relevant to robotic tasks. In contrast to existing approaches, the segmentation of these semantic classes has been learnt using RGB-D sensor data aligned with an industrial CAD manufacturing model to obtain noisy pixel-wise labels. This dataset confronts the proposed approach with a complicated real-world setting and provides insight into practical use case scenarios. The semantic segmentation network was fine-tuned for the given use case and trained in a semi-supervised manner using noisy labels. The developed software runs in real time and is integrated with ROS to obtain a complete semantic reconstruction for the control and navigation of the HRP4 robot. In-situ experiments at the Airbus manufacturing site in Saint-Nazaire validate the proposed approach.

    Object-level dynamic SLAM

    Visual Simultaneous Localisation and Mapping (SLAM) can estimate a camera's pose in an unknown environment and reconstruct an online map of it. Despite the advances in many real-time dense SLAM systems, most still assume a static environment, which is not a valid assumption in many real-world scenarios. This thesis aims to enable dense visual SLAM to run robustly in a dynamic environment, knowing where the sensor is in the environment and, equally importantly, what and where objects are in the surrounding environment for better scene understanding. The contributions in this thesis are threefold. The first presents one of the first object-level dynamic SLAM systems that robustly tracks camera pose while detecting, tracking, and reconstructing all the objects in dynamic scenes. It can continuously fuse geometric, semantic, and motion information for each object into an octree-based volumetric representation. One of the challenges in tracking moving objects is that the object motion can easily break the illumination constancy assumption. In our second contribution, we address this issue by proposing a dense feature-metric alignment to robustly estimate camera and object poses. We show how to learn dense feature maps and feature-metric uncertainties in a self-supervised way. These define a probabilistic feature-metric residual, which can be efficiently minimised using Gauss-Newton optimisation and easily coupled with other residuals. Up to this point, only the objects' geometry is reconstructed from the sensor data. Our third contribution further incorporates a category-level shape prior into the object mapping. Conditioned on the depth measurement, the learned implicit function completes the unseen part while reconstructing the observed part accurately. It can yield better reconstruction completeness and more accurate object pose estimation. These three contributions have advanced the state of the art in visual SLAM.
We hope such object-level dynamic SLAM systems will help robots interact intelligently with the world that humans inhabit.

    Deeply Learned Priors for Geometric Reconstruction

    This thesis comprises a body of work that investigates the use of deeply learned priors for dense geometric reconstruction of scenes. A typical image captured by a 2D camera sensor is a lossy two-dimensional (2D) projection of our three-dimensional (3D) world. Geometric reconstruction approaches usually recreate the lost structural information by taking in multiple images observing a scene from different views and solving a problem known as Structure from Motion (SfM) or Simultaneous Localization and Mapping (SLAM). Remarkably, by establishing correspondences across images and using geometric models, these methods (under reasonable conditions) can reconstruct a scene's 3D structure as well as precisely localise the observed views relative to the scene. The success of dense every-pixel multi-view reconstruction is, however, limited by matching ambiguities that commonly arise due to uniform texture, occlusion, and appearance distortion, among several other factors. The standard approach to dealing with matching ambiguities is to handcraft priors based on assumptions like piecewise smoothness or planarity in the 3D map, in order to "fill in" map regions supported by little or ambiguous matching evidence. In this thesis we propose learned priors that, in comparison, more closely model the true structure of the scene and are based on geometric information predicted from the images. The motivation stems from recent advancements in deep learning algorithms and the availability of massive datasets, which have allowed Convolutional Neural Networks (CNNs) to predict geometric properties of a scene, such as point-wise surface normals and depths, from just a single image, more reliably than was possible using previous machine learning-based or hand-crafted methods.
In particular, we first explore how single image-based surface normals from a CNN trained on a massive amount of indoor data can benefit the accuracy of dense reconstruction given input images from a moving monocular camera. Here we propose a novel surface normal based inverse depth regularizer and compare its performance against the inverse depth smoothness prior that is typically used to regularize textureless regions in the reconstruction. We also propose the first real-time CNN-based framework for live dense monocular reconstruction using our learned normal prior. Next, we look at how we can use deep learning to learn features in order to improve the pixel matching process itself, which is at the heart of multi-view geometric reconstruction. We propose a self-supervised feature learning scheme using RGB-D data from a 3D sensor (that does not require any manual labelling) and a multi-scale CNN architecture for feature extraction that is fast and efficient to run inside our proposed real-time monocular reconstruction framework. We extensively analyze the combined benefits of using learned normals and deep features that are good-for-matching in the context of dense reconstruction, both quantitatively and qualitatively on large real world datasets. Lastly, we explore how learned depths, also predicted on a per-pixel basis from a single image using a CNN, can be used to inpaint sparse 3D maps obtained from monocular SLAM or a 3D sensor. We propose a novel model that uses predicted depths and confidences from CNNs as priors to inpaint maps with arbitrary scale and sparsity. We obtain more reliable reconstructions than those of traditional depth inpainting methods such as the cross-bilateral filter, which in comparison offer few learnable parameters. Here we advocate the idea of "just-in-time reconstruction", where a higher level of scene understanding reliably inpaints the corresponding portion of a sparse map on demand and in real time.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201

    Semantic Texture for Robust Dense Tracking

    We argue that robust dense SLAM systems can make valuable use of the layers of features coming from a standard CNN as a pyramid of 'semantic texture' which is suitable for dense alignment while being much more robust to nuisance factors such as lighting than raw RGB values. We use a straightforward Lucas-Kanade formulation of image alignment, with a schedule of iterations over the coarse-to-fine levels of a pyramid, and simply replace the usual image pyramid by the hierarchy of convolutional feature maps from a pre-trained CNN. The resulting dense alignment performance is much more robust to lighting and other variations, as we show by camera rotation tracking experiments on time-lapse sequences captured over many hours. Looking towards the future of scene representation for real-time visual SLAM, we further demonstrate that selecting, by simple criteria, a small subset of the features output by a CNN gives equally accurate but much more efficient tracking.
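
The coarse-to-fine Lucas-Kanade schedule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a translation-only warp (the paper tracks camera rotation), and it uses average-pooled arrays as a stand-in for the CNN feature maps. All function names (`bilinear_sample`, `lk_translation`, `average_pool`, `align_pyramid`) are hypothetical.

```python
import numpy as np


def bilinear_sample(img, dx, dy):
    """Sample img (H, W, C) at (x + dx, y + dy), clamping at the borders."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + dx, 0, w - 1)
    y = np.clip(ys + dy, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = (x - x0)[..., None]
    fy = (y - y0)[..., None]
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bottom = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bottom * fy


def lk_translation(template, image, p, iters=20):
    """Forward-additive Lucas-Kanade for a 2D translation p = (dx, dy).

    Minimises sum over pixels and channels of (image(x + p) - template(x))^2
    with Gauss-Newton steps.
    """
    p = np.asarray(p, dtype=np.float64).copy()
    for _ in range(iters):
        warped = bilinear_sample(image, p[0], p[1])
        # Finite-difference gradients of the warped map along y and x.
        gy, gx = np.gradient(warped, axis=(0, 1))
        jac = np.stack([gx.ravel(), gy.ravel()], axis=1)
        residual = (template - warped).ravel()
        hessian = jac.T @ jac + 1e-9 * np.eye(2)  # tiny damping for safety
        dp = np.linalg.solve(hessian, jac.T @ residual)
        p += dp
        if np.linalg.norm(dp) < 1e-4:
            break
    return p


def average_pool(img, s):
    """Assumed stand-in for a coarser feature map: s x s average pooling."""
    h, w, c = img.shape
    img = img[: h - h % s, : w - w % s]
    return img.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))


def align_pyramid(levels_t, levels_i, factors):
    """Run LK from the coarsest level to the finest.

    factors[k] is the downsampling factor of level k relative to the finest
    level; the estimate p is kept in finest-level pixels between levels.
    """
    p = np.zeros(2)
    for feat_t, feat_i, s in zip(levels_t, levels_i, factors):
        p = lk_translation(feat_t, feat_i, p / s) * s
    return p
```

To follow the abstract more closely, the lists passed to `align_pyramid` would hold feature maps taken from successive layers of a pre-trained CNN, ordered from the deepest (coarsest) layer to the shallowest, with `factors` set to each layer's stride.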