
    Self-supervised Multi-view Stereo via Effective Co-Segmentation and Data-Augmentation

    Recent studies have shown that self-supervised methods based on view synthesis make clear progress on multi-view stereo (MVS). However, existing methods rely on the assumption that corresponding points across different views share the same color, which does not always hold in practice. This can produce unreliable self-supervision signals and harm the final reconstruction performance. To address this issue, we propose a framework with more reliable supervision guided by semantic co-segmentation and data augmentation. Specifically, we extract mutual semantic information from multi-view images to enforce semantic consistency, and we devise an effective data-augmentation mechanism that ensures transformation robustness by treating the predictions on regular samples as pseudo ground truth to regularize the predictions on augmented samples. Experimental results on the DTU dataset show that the proposed method achieves state-of-the-art performance among unsupervised methods and even competes on par with supervised methods. Furthermore, extensive experiments on the Tanks&Temples dataset demonstrate the strong generalization ability of the proposed method.
    Comment: This paper is accepted by AAAI-21 with a Distinguished Paper Award.
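
    As an illustration of the data-augmentation mechanism described above, the following PyTorch-style sketch treats the prediction on a regular sample as a pseudo label for the prediction on its augmented counterpart. The model and augment callables are hypothetical stand-ins (the augmentation is assumed to be purely photometric so the two depth maps stay pixel-aligned); this is a minimal sketch, not the paper's implementation.

        import torch
        import torch.nn.functional as F

        def augmentation_consistency_loss(model, images, augment):
            # Pseudo ground truth: depth predicted on the regular (unaugmented) samples.
            with torch.no_grad():
                pseudo_depth = model(images)
            # Prediction on the augmented samples should stay consistent with it.
            aug_depth = model(augment(images))
            return F.l1_loss(aug_depth, pseudo_depth)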

    S³M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving

    Semantic segmentation and stereo matching are two essential components of 3D environmental perception systems for autonomous driving. Nevertheless, conventional approaches often address these two problems independently, employing separate models for each task. This poses practical limitations in real-world scenarios, particularly when computational resources are scarce or real-time performance is imperative. Hence, in this article, we introduce S³M-Net, a novel joint learning framework developed to perform semantic segmentation and stereo matching simultaneously. Specifically, S³M-Net shares the features extracted from RGB images between both tasks, resulting in improved overall scene understanding. This feature sharing is realized by a feature fusion adaption (FFA) module, which transforms the shared features into the semantic space and then fuses them with the encoded disparity features. The entire joint learning framework is trained by minimizing a novel semantic consistency-guided (SCG) loss, which emphasizes the structural consistency of both tasks. Extensive experimental results on the vKITTI2 and KITTI datasets demonstrate the effectiveness of the proposed joint learning framework and its superior performance compared with other state-of-the-art single-task networks. Our project webpage is accessible at mias.group/S3M-Net.
    Comment: accepted to IEEE Trans. on Intelligent Vehicles (T-IV).
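
    The feature-sharing idea can be sketched as a shared RGB encoder feeding a segmentation head and a disparity head, with a small fusion layer standing in for the FFA module. Layer sizes, the fusion step, and the output heads below are illustrative assumptions rather than the actual S³M-Net architecture.

        import torch
        import torch.nn as nn

        class JointSegStereoNet(nn.Module):
            def __init__(self, num_classes=19):
                super().__init__()
                # Shared feature extractor applied to both RGB views.
                self.encoder = nn.Sequential(
                    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                )
                self.seg_head = nn.Conv2d(64, num_classes, 1)  # semantic logits
                self.fuse = nn.Conv2d(128, 64, 1)              # stand-in for the FFA module
                self.disp_head = nn.Conv2d(64, 1, 1)           # disparity regression

            def forward(self, left, right):
                f_left, f_right = self.encoder(left), self.encoder(right)
                seg = self.seg_head(f_left)
                disp = self.disp_head(self.fuse(torch.cat([f_left, f_right], dim=1)))
                return seg, disp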

    Learning End-To-End Scene Flow by Distilling Single Tasks Knowledge

    Scene flow is a challenging task aimed at jointly estimating the 3D structure and motion of the sensed environment. Although deep learning solutions achieve outstanding accuracy, these approaches divide the whole problem into standalone tasks (stereo and optical flow) and address them with independent networks. Such a strategy dramatically increases the complexity of the training procedure and requires power-hungry GPUs to infer scene flow at barely 1 FPS. Conversely, we propose DWARF, a novel and lightweight architecture able to infer full scene flow by jointly reasoning about depth and optical flow, easily and elegantly trainable end-to-end from scratch. Moreover, since ground-truth images for full scene flow are scarce, we propose to leverage the knowledge learned by networks specialized in stereo or flow, for which much more data are available, to distill proxy annotations. Exhaustive experiments show that i) DWARF runs at about 10 FPS on a single high-end GPU and about 1 FPS on an NVIDIA Jetson TX2 embedded board at KITTI resolution, with a moderate drop in accuracy compared to 10x deeper models, and ii) learning from many distilled samples is more effective than learning from the few annotated ones available. Code available at: https://github.com/FilippoAleotti/Dwarf-Tensorflow
    Comment: Accepted to AAAI 2020. Project page: https://vision.disi.unibo.it/~faleotti/dwarf.htm
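
    The proxy-annotation idea amounts to supervising the joint network with predictions distilled from frozen, task-specific teachers. The sketch below assumes hypothetical stereo_teacher and flow_teacher networks and simple L1 losses; it illustrates the distillation principle rather than DWARF's actual training objective.

        import torch
        import torch.nn.functional as F

        def proxy_distillation_loss(student_disp, student_flow,
                                    left, right, frame_t0, frame_t1,
                                    stereo_teacher, flow_teacher):
            # Frozen single-task teachers provide proxy ground truth.
            with torch.no_grad():
                proxy_disp = stereo_teacher(left, right)
                proxy_flow = flow_teacher(frame_t0, frame_t1)
            # The joint student is regressed toward both proxy labels.
            return F.l1_loss(student_disp, proxy_disp) + F.l1_loss(student_flow, proxy_flow)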

    SyntCities: A Large Synthetic Remote Sensing Dataset for Disparity Estimation

    Studies in recent years have demonstrated the outstanding performance of deep learning for computer vision tasks in the remote sensing field, such as disparity estimation. However, available datasets mostly focus on close-range applications such as autonomous driving or robot manipulation. To reduce the domain gap during training, we present SyntCities, a synthetic dataset resembling aerial imagery of urban areas. The pipeline used to render the images is based on 3-D modeling, which avoids acquisition costs, provides subpixel-accurate dense ground truth, and simulates different illumination conditions. The dataset additionally provides multiclass semantic maps and can be converted to point-cloud format to benefit a wider research community. We focus on the task of disparity estimation and evaluate the performance of traditional semiglobal matching and state-of-the-art architectures, trained with SyntCities and other datasets, on real aerial and satellite images. A comparison with the widely used SceneFlow dataset is also presented, and strategies using a mixture of real and synthetic samples are studied as well. Results show significant improvements in accuracy for the disparity maps.
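
    One simple way to realize the synthetic/real mixing strategy mentioned above is to draw each training batch from both datasets with a controllable ratio. The sketch below uses standard PyTorch utilities; synthetic_ds and real_ds are hypothetical datasets yielding stereo pairs with disparity ground truth, and real_ratio is an assumed tuning knob, not a value from the paper.

        import torch
        from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

        def mixed_loader(synthetic_ds, real_ds, real_ratio=0.5, batch_size=4):
            mixed = ConcatDataset([synthetic_ds, real_ds])
            # Per-sample weights so that roughly real_ratio of draws come from real data.
            weights = torch.cat([
                torch.full((len(synthetic_ds),), (1.0 - real_ratio) / len(synthetic_ds)),
                torch.full((len(real_ds),), real_ratio / len(real_ds)),
            ])
            sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
            return DataLoader(mixed, batch_size=batch_size, sampler=sampler)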