Self-supervised Multi-view Stereo via Effective Co-Segmentation and Data-Augmentation
Recent studies have shown that self-supervised methods based on view
synthesis achieve clear progress on multi-view stereo (MVS). However, existing
methods rely on the assumption that corresponding points across different
views share the same color, which may not always hold in practice. This can
lead to unreliable self-supervision signals and harm the final reconstruction
performance. To address this issue, we propose a framework with more
reliable supervision guided by semantic co-segmentation and data augmentation.
Specifically, we extract mutual semantic information from multi-view images to
guide semantic consistency, and we devise an effective data-augmentation
mechanism that ensures transformation robustness by treating the predictions
on regular samples as pseudo ground truth to regularize the predictions on
augmented samples. Experimental results on the DTU dataset show that the
proposed method achieves state-of-the-art performance among unsupervised
methods and even competes on par with supervised methods. Furthermore,
extensive experiments on the Tanks&Temples dataset demonstrate the strong
generalization ability of the proposed method.
Comment: Accepted to AAAI-21 with a Distinguished Paper Award.
S3M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving
Semantic segmentation and stereo matching are two essential components of 3D
environmental perception systems for autonomous driving. Nevertheless,
conventional approaches often address these two problems independently,
employing separate models for each task. This approach poses practical
limitations in real-world scenarios, particularly when computational resources
are scarce or real-time performance is imperative. Hence, in this article, we
introduce S3M-Net, a novel joint learning framework developed to perform
semantic segmentation and stereo matching simultaneously. Specifically,
S3M-Net shares the features extracted from RGB images between both tasks,
resulting in an improved overall scene understanding capability. This feature
sharing process is realized using a feature fusion adaption (FFA) module, which
effectively transforms the shared features into semantic space and subsequently
fuses them with the encoded disparity features. The entire joint learning
framework is trained by minimizing a novel semantic consistency-guided (SCG)
loss, which places emphasis on the structural consistency in both tasks.
Extensive experimental results conducted on the vKITTI2 and KITTI datasets
demonstrate the effectiveness of our proposed joint learning framework and its
superior performance compared to other state-of-the-art single-task networks.
Our project webpage is accessible at mias.group/S3M-Net.
Comment: Accepted to IEEE Trans. on Intelligent Vehicles (T-IV).
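As a rough illustration of the feature-sharing idea described above, the
sketch below (PyTorch) projects shared RGB features into the semantic space
and fuses them with encoded disparity features. The layer sizes and the
1x1-convolution projection are assumptions for illustration, not the paper's
exact FFA design.

    import torch
    import torch.nn as nn

    class FeatureFusionAdaption(nn.Module):
        def __init__(self, shared_ch, disp_ch, out_ch):
            super().__init__()
            # Transform shared RGB features into the semantic feature space.
            self.adapt = nn.Sequential(
                nn.Conv2d(shared_ch, out_ch, kernel_size=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            # Fuse adapted semantic features with encoded disparity features.
            self.fuse = nn.Conv2d(out_ch + disp_ch, out_ch,
                                  kernel_size=3, padding=1)

        def forward(self, shared_feat, disp_feat):
            sem = self.adapt(shared_feat)
            return self.fuse(torch.cat([sem, disp_feat], dim=1))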
Learning End-To-End Scene Flow by Distilling Single Tasks Knowledge
Scene flow is a challenging task aimed at jointly estimating the 3D structure
and motion of the sensed environment. Although deep learning solutions achieve
outstanding performance in terms of accuracy, these approaches divide the whole
problem into standalone tasks (stereo and optical flow) addressing them with
independent networks. Such a strategy dramatically increases the complexity of
the training procedure and requires power-hungry GPUs to infer scene flow
barely at 1 FPS. In contrast, we propose DWARF, a novel lightweight
architecture able to infer full scene flow by jointly reasoning about depth
and optical flow, and easily and elegantly trainable end-to-end from scratch.
Moreover, since ground-truth images for full scene flow are scarce, we propose
to leverage the knowledge learned by networks specialized in stereo or flow,
for which much more data are available, to distill proxy annotations.
Exhaustive experiments show that i) DWARF runs at about 10 FPS on a single
high-end GPU and at about 1 FPS on an NVIDIA Jetson TX2 embedded board at
KITTI resolution, with a moderate drop in accuracy compared to 10x deeper
models, and ii) learning from many distilled samples is more effective than
learning from the few annotated ones available. Code available at:
https://github.com/FilippoAleotti/Dwarf-Tensorflow
Comment: Accepted to AAAI 2020. Project page:
https://vision.disi.unibo.it/~faleotti/dwarf.htm
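The sketch below (PyTorch) illustrates the proxy-label distillation described
above: frozen task-specific teachers label unannotated stereo pairs and image
sequences, and the joint student is trained against those proxy annotations.
All network handles are hypothetical placeholders.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student, stereo_teacher, flow_teacher,
                          left_t0, right_t0, left_t1):
        # The specialized teachers run in inference mode and provide
        # proxy ground truth where no real annotations exist.
        with torch.no_grad():
            proxy_disp = stereo_teacher(left_t0, right_t0)  # disparity proxy
            proxy_flow = flow_teacher(left_t0, left_t1)     # optical-flow proxy

        # The student jointly predicts both quantities from the same inputs.
        pred_disp, pred_flow = student(left_t0, right_t0, left_t1)

        return (F.l1_loss(pred_disp, proxy_disp)
                + F.l1_loss(pred_flow, proxy_flow))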
SyntCities: A Large Synthetic Remote Sensing Dataset for Disparity Estimation
Studies in recent years have proven the outstanding performance of deep learning on computer vision tasks in the remote sensing field, such as disparity estimation. However, available datasets mostly focus on close-range applications like autonomous driving or robot manipulation. To reduce the domain gap during training, we present SyntCities, a synthetic dataset resembling aerial imagery of urban areas. The pipeline used to render the images is based on 3-D modeling, which helps to avoid acquisition costs, provides subpixel-accurate dense ground truth, and simulates different illumination conditions. The dataset additionally provides multi-class semantic maps and can be converted to point cloud format to benefit a wider research community. We focus on the task of disparity estimation and evaluate the performance of traditional semi-global matching and state-of-the-art architectures, trained with SyntCities and other datasets, on real aerial and satellite images. A comparison with the widely used SceneFlow dataset is also presented. Strategies using a mixture of real and synthetic samples are studied as well. Results show significant improvements in accuracy for the disparity maps.
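For reference, the classical semi-global matching baseline mentioned above can
be run with OpenCV's StereoSGBM implementation; the sketch below uses
illustrative file names and parameter values, not those of the paper's
evaluation.

    import cv2

    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    block_size = 5
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,          # must be divisible by 16
        blockSize=block_size,
        P1=8 * block_size ** 2,      # penalty for small disparity changes
        P2=32 * block_size ** 2,     # penalty for large disparity changes
    )

    # compute() returns fixed-point disparities scaled by 16.
    disparity = sgbm.compute(left, right).astype("float32") / 16.0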