
    OctNetFusion: Learning Depth Fusion from Data

    In this paper, we present a learning-based approach to depth fusion, i.e., dense 3D reconstruction from multiple depth images. The most common approach to depth fusion is based on averaging truncated signed distance functions, which was originally proposed by Curless and Levoy in 1996. While this method is simple and provides great results, it is not able to reconstruct (partially) occluded surfaces and requires a large number of frames to filter out sensor noise and outliers. Motivated by the availability of large 3D model repositories and recent advances in deep learning, we present a novel 3D CNN architecture that learns to predict an implicit surface representation from the input depth maps. Our learning-based method significantly outperforms the traditional volumetric fusion approach in terms of noise reduction and outlier suppression. By learning the structure of real-world 3D objects and scenes, our approach is further able to reconstruct occluded regions and to fill in gaps in the reconstruction. We demonstrate that our learning-based approach outperforms both vanilla TSDF fusion and TV-L1 fusion on the task of volumetric fusion. Further, we demonstrate state-of-the-art 3D shape completion results.
    Comment: 3DV 2017, https://github.com/griegler/octnetfusio
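    The baseline this paper improves on is worth making concrete. The sketch below shows the classic weighted-average TSDF update of Curless and Levoy that vanilla volumetric fusion performs for each depth frame; it is not the learned OctNetFusion model, and the voxel-grid layout, intrinsics K, and pose T_cam_world are illustrative assumptions.

    import numpy as np

    def integrate_depth(tsdf, weights, voxel_centers, depth, K, T_cam_world, trunc=0.05):
        """Fuse one depth map into a TSDF volume by running-average updates.

        tsdf, weights  : (N,) current TSDF values and fusion weights per voxel
        voxel_centers  : (N, 3) voxel centers in world coordinates
        depth          : (H, W) depth image in meters (0 = invalid)
        K, T_cam_world : (3, 3) intrinsics and (4, 4) world-to-camera pose
        trunc          : truncation distance in meters
        """
        H, W = depth.shape
        # Transform voxel centers into the camera frame and project into the image.
        pts_h = np.c_[voxel_centers, np.ones(len(voxel_centers))]
        pts_cam = (T_cam_world @ pts_h.T).T[:, :3]
        z = pts_cam[:, 2]
        z_safe = np.where(z > 1e-6, z, np.inf)           # avoid dividing by zero depth
        u = np.round(K[0, 0] * pts_cam[:, 0] / z_safe + K[0, 2]).astype(int)
        v = np.round(K[1, 1] * pts_cam[:, 1] / z_safe + K[1, 2]).astype(int)
        valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

        # Signed distance between the observed surface and each voxel along the ray.
        d = depth[v[valid], u[valid]]
        sdf = d - z[valid]
        keep = (d > 0) & (sdf > -trunc)                  # skip voxels far behind the surface
        tsdf_obs = np.clip(sdf[keep], -trunc, trunc) / trunc

        # Running weighted average, as in Curless and Levoy.
        idx = np.flatnonzero(valid)[keep]
        w_old = weights[idx]
        tsdf[idx] = (w_old * tsdf[idx] + tsdf_obs) / (w_old + 1.0)
        weights[idx] = w_old + 1.0
        return tsdf, weights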

    Learning Birds-Eye View Representations for Autonomous Driving

    Over the past few years, progress towards the ambitious goal of widespread fully-autonomous vehicles on our roads has accelerated dramatically. This progress has been spurred largely by the success of highly accurate LiDAR sensors, as well as the use of detailed high-resolution maps, which together allow a vehicle to navigate its surroundings effectively. Often, however, one or both of these resources may be unavailable, whether due to cost, sensor failure, or the need to operate in an unmapped environment. The aim of this thesis is therefore to demonstrate that it is possible to build detailed three-dimensional representations of traffic scenes using only 2D monocular camera images as input. Such an approach faces many challenges: most notably that 2D images do not provide explicit 3D structure. We overcome this limitation by applying a combination of deep learning and geometry to transform image-based features into an orthographic birds-eye view representation of the scene, allowing algorithms to reason in a metric, 3D space. This approach is applied to solving two challenging perception tasks central to autonomous driving. The first part of this thesis addresses the problem of monocular 3D object detection, which involves determining the size and location of all objects in the scene. Our solution was based on a novel convolutional network architecture that processed features in both the image and birds-eye view perspectives. Results on the KITTI dataset showed that this network outperformed existing works at the time, and although more recent works have improved on these results, extensive analysis showed that our solution performed well in many difficult edge-case scenarios, such as objects close to or distant from the camera. In the second part of the thesis, we consider the related problem of semantic map prediction. This consists of estimating a birds-eye view map of the world visible from a given camera, encoding both static elements of the scene, such as pavement and road layout, and dynamic objects, such as vehicles and pedestrians. This was accomplished using a second network that built on the experience from the previous work and achieved convincing performance on two real-world driving datasets. By formulating the maps as occupancy grid maps (a widely used representation from robotics), we were able to demonstrate how predictions can be accumulated across multiple frames, and that doing so further improves the robustness of the maps produced by our system.
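    The accumulation step mentioned at the end is a standard occupancy-grid technique from robotics. As a hedged illustration (not necessarily the exact scheme used in the thesis), the sketch below fuses per-frame bird's-eye-view occupancy probabilities with log-odds Bayesian updates; it assumes the frames have already been aligned to a common world frame by ego-motion compensation.

    import numpy as np

    def logit(p, eps=1e-6):
        p = np.clip(p, eps, 1.0 - eps)
        return np.log(p / (1.0 - p))

    def accumulate_occupancy(frame_probs, prior=0.5):
        """Fuse a sequence of per-frame occupancy probability maps.

        frame_probs : iterable of (H, W) arrays, each holding P(occupied) per BEV
                      cell for one frame, already aligned to a common world frame.
        Returns the fused (H, W) probability map.
        """
        log_odds = None
        prior_l = logit(prior)
        for probs in frame_probs:
            update = logit(probs) - prior_l           # evidence contributed by this frame
            log_odds = update if log_odds is None else log_odds + update
        return 1.0 / (1.0 + np.exp(-(prior_l + log_odds)))

    # Example: three noisy observations of one cell become more confident when fused.
    frames = [np.full((1, 1), p) for p in (0.7, 0.8, 0.65)]
    print(accumulate_occupancy(frames))    # fused probability rises above 0.9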

    Driving among Flatmobiles: Bird-Eye-View occupancy grids from a monocular camera for holistic trajectory planning

    Camera-based end-to-end driving neural networks bring the promise of a low-cost system that maps camera images to driving control commands. These networks are appealing because they replace laborious hand-engineered building blocks, but their black-box nature makes them difficult to delve into in case of failure. Recent works have shown the importance of using an explicit intermediate representation, which has the benefit of increasing both the interpretability and the accuracy of a network's decisions. Nonetheless, these camera-based networks reason in camera view, where scale is not homogeneous and hence not directly suitable for motion forecasting. In this paper, we introduce a novel monocular camera-only holistic end-to-end trajectory planning network with a Bird-Eye-View (BEV) intermediate representation that comes in the form of binary Occupancy Grid Maps (OGMs). To ease the prediction of OGMs in BEV from camera images, we introduce a novel scheme where the OGMs are first predicted as semantic masks in camera view and then warped into BEV using the homography between the two planes. The key element allowing this transformation to be applied to 3D objects such as vehicles is to predict solely their footprints in camera view, hence respecting the flat-world hypothesis implied by the homography.
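    To make the flat-world mechanism concrete, the sketch below warps a camera-view footprint mask into a bird's-eye-view grid with a ground-plane homography using OpenCV. It only illustrates the general technique: the intrinsics, camera height, grid resolution, and the assumption of a camera axis parallel to the ground are illustrative choices, not values from the paper.

    import numpy as np
    import cv2

    def image_to_bev_homography(K, cam_height_m, m_per_px, bev_w_px, bev_h_px):
        """Homography taking image pixels of ground-plane points to BEV grid pixels.

        Assumes a camera looking straight ahead, parallel to a flat ground plane
        located cam_height_m below the optical center (camera axes: X right,
        Y down, Z forward). BEV grid: columns = lateral X, rows = forward Z,
        with the camera at the bottom-center of the grid.
        """
        # Ground point (X, Z) -> homogeneous camera point (X, cam_height_m, Z).
        M = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, cam_height_m],
                      [0.0, 1.0, 0.0]])
        H_ground_to_img = K @ M
        # Ground meters (X, Z) -> BEV pixels (col, row), camera at bottom-center.
        S = np.array([[1.0 / m_per_px, 0.0, bev_w_px / 2.0],
                      [0.0, -1.0 / m_per_px, float(bev_h_px)],
                      [0.0, 0.0, 1.0]])
        return S @ np.linalg.inv(H_ground_to_img)

    # Illustrative usage: warp a predicted footprint mask (values in [0, 1]) into a
    # 40 m x 40 m BEV occupancy grid at 0.1 m per cell.
    K = np.array([[720.0, 0.0, 640.0],
                  [0.0, 720.0, 360.0],
                  [0.0, 0.0, 1.0]])
    footprint_mask = np.zeros((720, 1280), np.float32)    # stand-in for a network output
    footprint_mask[500:560, 600:700] = 1.0                # hypothetical vehicle footprint
    H = image_to_bev_homography(K, cam_height_m=1.5, m_per_px=0.1, bev_w_px=400, bev_h_px=400)
    bev_ogm = cv2.warpPerspective(footprint_mask, H, (400, 400))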