Semi-supervised Deep Multi-view Stereo
Significant progress has been witnessed in learning-based Multi-view Stereo
(MVS) under both supervised and unsupervised settings. To combine their
respective merits in accuracy and completeness while reducing the demand for
expensive labeled data, this paper explores learning-based MVS in a
semi-supervised setting, in which only a small fraction of the MVS data is
annotated with dense depth ground truth. However, the huge variation of
scenarios and the flexible view settings may break the basic assumption of
classic semi-supervised learning, namely that unlabeled and labeled data share
the same label space and data distribution; we term this the semi-supervised
distribution-gap ambiguity of the MVS problem. To handle these issues, we propose a novel
semi-supervised distribution-augmented MVS framework, namely SDA-MVS. For the
simple case in which the basic assumption holds for the MVS data, consistency
regularization encourages the model predictions to be consistent between the
original sample and a randomly augmented sample. For the more troublesome case
in which the basic assumption is violated, we propose a novel style
consistency loss to alleviate the negative effect caused by the distribution
gap. The visual style of an unlabeled sample is transferred to a labeled sample
to shrink the gap, and the model prediction on the generated sample is further
supervised with the label of the original labeled sample. The experimental results
in semi-supervised settings of multiple MVS datasets show the superior
performance of the proposed method. With the same backbone network settings,
our proposed SDA-MVS outperforms its fully supervised and unsupervised
baselines. Comment: This paper was accepted at ACM MM 2023. The code is released at:
https://github.com/ToughStoneX/Semi-MV
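The consistency-regularization term described in the abstract can be sketched as a simple agreement loss between predictions on an original and an augmented sample. This is a minimal illustration in numpy; the function name and the choice of an L2 penalty are our assumptions, not details from the paper:

```python
import numpy as np

def consistency_loss(pred_original, pred_augmented):
    """L2 consistency between depth predictions on an original sample
    and a randomly augmented version of the same sample (sketch only;
    the paper's exact formulation may differ)."""
    pred_original = np.asarray(pred_original, dtype=np.float64)
    pred_augmented = np.asarray(pred_augmented, dtype=np.float64)
    # Mean squared disagreement: zero when the two predictions match.
    return np.mean((pred_original - pred_augmented) ** 2)
```

In practice such a term is added to the supervised loss on the labeled subset, encouraging the network to be invariant to the augmentation.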
S-VolSDF: Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces
Neural rendering of implicit surfaces performs well in 3D vision
applications. However, it requires dense input views as supervision. When only
sparse input images are available, output quality drops significantly due to
the shape-radiance ambiguity problem. We note that this ambiguity can be
constrained when a 3D point is visible in multiple views, as is the case in
multi-view stereo (MVS). We thus propose to regularize neural rendering
optimization with an MVS solution. The use of an MVS probability volume and a
generalized cross entropy loss leads to a noise-tolerant optimization process.
In addition, neural rendering provides global consistency constraints that
guide MVS depth hypothesis sampling and thus improve MVS performance.
Given only three sparse input views, experiments show that our method not only
outperforms generic neural rendering models by a large margin but also
significantly increases the reconstruction quality of MVS models. Project
webpage: https://hao-yu-wu.github.io/s-volsdf/
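The generalized cross-entropy mentioned above is a standard noise-tolerant loss family that interpolates between cross-entropy and a bounded MAE-like loss. A generic sketch follows (the function name and default q are our assumptions; the exact form used in S-VolSDF may differ):

```python
import numpy as np

def generalized_cross_entropy(probs, q=0.7):
    """Generalized cross-entropy L_q(p) = (1 - p^q) / q, applied to the
    probability p assigned to the correct depth hypothesis.
    As q -> 0 it approaches the standard cross-entropy -log(p);
    at q = 1 it becomes the noise-robust loss 1 - p."""
    probs = np.asarray(probs, dtype=np.float64)
    return np.mean((1.0 - probs ** q) / q)
```

Because the loss is bounded for q > 0, outlier hypotheses in a noisy MVS probability volume contribute a limited gradient, which is what makes the optimization noise-tolerant.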
ATLAS-MVSNet: Attention Layers for Feature Extraction and Cost Volume Regularization in Multi-View Stereo
We present ATLAS-MVSNet, an end-to-end deep learning architecture that relies on local attention layers for depth map inference from multi-view images. Distinct from existing works, we introduce a novel module design for neural networks, which we term the hybrid attention block, that utilizes the latest insights into attention in vision models. We are able to reap the benefits of attention in both the carefully designed multi-stage feature extraction network and the cost volume regularization network. Our new approach displays significant improvement over its counterpart based purely on convolutions. While many state-of-the-art methods need multiple high-end GPUs in the training phase, we are able to train our network on a single consumer-grade GPU. ATLAS-MVSNet exhibits excellent performance, especially in terms of accuracy, on the DTU dataset.
Furthermore, ATLAS-MVSNet ranks among the top published methods on the online Tanks and Temples benchmark.
Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells
Learning-based multi-view stereo (MVS) methods aim to predict accurate
depth maps that yield an accurate and complete 3D representation. Despite their
excellent performance, existing methods ignore the fact that a suitable depth
geometry is also critical in MVS. In this paper, we demonstrate that different
depth geometries have significant performance gaps, even using the same depth
prediction error. Therefore, we introduce an ideal depth geometry composed of
Saddle-Shaped Cells, whose predicted depth map oscillates upward and downward
around the ground-truth surface, rather than maintaining a continuous and
smooth depth plane. To achieve this, we develop a coarse-to-fine framework
called Dual-MVSNet (DMVSNet), which can produce an oscillating depth plane.
Technically, we predict two depth values for each pixel (Dual-Depth), and
propose a novel loss function and a checkerboard-shaped selection strategy to
constrain the predicted depth geometry. Compared to existing methods, DMVSNet
achieves a high rank on the DTU benchmark and obtains the top performance on
challenging scenes of Tanks and Temples, demonstrating its strong performance
and generalization ability. Our method also points to a new research direction
for considering depth geometry in MVS. Comment: Accepted by ICCV 202
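The checkerboard-shaped selection of the two per-pixel depth predictions can be sketched as follows. This is our own minimal illustration of the idea of alternating between the two predictions on a checkerboard pattern so the fused map can oscillate around the true surface; the function name and the exact selection rule are assumptions:

```python
import numpy as np

def checkerboard_select(depth_a, depth_b):
    """Fuse two per-pixel depth predictions by taking depth_a on
    'white' checkerboard cells and depth_b on 'black' cells
    (sketch of the Dual-Depth idea, not the paper's exact rule)."""
    depth_a = np.asarray(depth_a, dtype=np.float64)
    depth_b = np.asarray(depth_b, dtype=np.float64)
    h, w = depth_a.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Cells where row + column is even take depth_a, the rest depth_b.
    mask = (yy + xx) % 2 == 0
    return np.where(mask, depth_a, depth_b)
```

If one prediction tends to lie slightly above the surface and the other slightly below, the fused map forms the saddle-shaped cells described in the abstract.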
V-FUSE: Volumetric Depth Map Fusion with Long-Range Constraints
We introduce a learning-based depth map fusion framework that accepts a set
of depth and confidence maps generated by a Multi-View Stereo (MVS) algorithm
as input and improves them. This is accomplished by integrating volumetric
visibility constraints that encode long-range surface relationships across
different views into an end-to-end trainable architecture. We also introduce a
depth search window estimation sub-network trained jointly with the larger
fusion sub-network to reduce the depth hypothesis search space along each ray.
Our method learns to model depth consensus and violations of visibility
constraints directly from the data, effectively removing the need to
fine-tune fusion parameters. Extensive experiments on MVS datasets show
substantial improvements in the accuracy of the output fused depth and
confidence maps. Comment: ICCV 202
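The depth search window estimation described above amounts to restricting the per-ray depth hypotheses to a predicted interval rather than the full scene depth range. A minimal sketch, with our own function name and a uniform sampling assumption:

```python
import numpy as np

def depth_hypotheses(center, half_width, num_samples=8):
    """Build a per-ray depth hypothesis set restricted to a predicted
    search window [center - half_width, center + half_width],
    instead of sampling the full depth range (illustrative only)."""
    return np.linspace(center - half_width, center + half_width, num_samples)
```

Shrinking the window concentrates the same number of hypotheses in a narrower interval, which is what reduces the search space along each ray.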
Simultaneous Localization and Mapping (SLAM) for Autonomous Driving: Concept and Analysis
The Simultaneous Localization and Mapping (SLAM) technique has achieved astonishing progress over the last few decades and has generated considerable interest in the autonomous driving community. With its conceptual roots in navigation and mapping, SLAM outperforms some traditional positioning and localization techniques, since it can support more reliable and robust localization, planning, and control that meet key criteria for autonomous driving. In this study the authors first give an overview of the different SLAM implementation approaches and then discuss the applications of SLAM for autonomous driving with respect to different driving scenarios, vehicle system components, and the characteristics of the SLAM approaches. The authors then discuss some challenging issues and current solutions when applying SLAM to autonomous driving. Quantitative quality analysis methods for evaluating the characteristics and performance of SLAM systems and for monitoring the risk in SLAM estimation are reviewed. In addition, this study describes a real-world road test to demonstrate a multi-sensor-based modernized SLAM procedure for autonomous driving. The numerical results show that a high-precision 3D point cloud map can be generated by the SLAM procedure with the integration of Lidar and GNSS/INS, and that an online localization solution with four to five cm accuracy can be achieved based on this pre-generated map and online Lidar scan matching with a tightly fused inertial system.
RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo
Learning-based multi-view stereo (MVS) has thus far centered around 3D
convolution on cost volumes. Due to the high computation and memory consumption
of 3D CNNs, the resolution of the output depth is often considerably limited.
Different from most existing works dedicated to adaptive refinement of cost
volumes, we opt to directly optimize the depth value along each camera ray,
mimicking the range finding of a laser scanner. This reduces the MVS problem to
ray-based depth optimization, which is much more lightweight than full cost
volume optimization. In particular, we propose RayMVSNet, which learns
sequential prediction of a 1D implicit field along each camera ray with the
zero-crossing point indicating scene depth. This sequential modeling, conducted
based on transformer features, essentially learns the epipolar line search in
traditional multi-view stereo. We devise a multi-task learning scheme for
better optimization convergence and depth accuracy. We find that the
monotonicity of the SDF along each ray greatly benefits depth estimation. Our method
ranks top on both the DTU and the Tanks & Temples datasets over all previous
learning-based methods, achieving an overall reconstruction score of 0.33mm on
DTU and an F-score of 59.48% on Tanks & Temples. It is able to produce
high-quality depth estimation and point cloud reconstruction in challenging
scenarios such as objects/scenes with non-textured surfaces, severe occlusion,
and highly varying depth range. Further, we propose RayMVSNet++ to enhance
contextual feature aggregation for each ray by designing an attentional
gating unit that selects semantically relevant neighboring rays within the
local frustum around that ray. RayMVSNet++ achieves state-of-the-art
performance on the ScanNet dataset. In particular, it attains an AbsRel of
0.058m and produces accurate results on the two subsets of textureless regions
and large depth variation. Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv
admin note: substantial text overlap with arXiv:2204.0132
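The zero-crossing readout described in the RayMVSNet abstract can be sketched as locating the sign flip of a sampled 1D implicit field along a ray and linearly interpolating the depth there. The function name and the linear-interpolation readout are our assumptions for illustration:

```python
import numpy as np

def zero_crossing_depth(depths, sdf_values):
    """Locate scene depth as the zero-crossing of a sampled 1D signed
    field along a camera ray (sketch only). Assumes the field is
    positive in front of the surface and negative behind it."""
    depths = np.asarray(depths, dtype=np.float64)
    s = np.asarray(sdf_values, dtype=np.float64)
    for i in range(len(s) - 1):
        if s[i] >= 0.0 and s[i + 1] < 0.0:  # sign flip: surface crossed
            t = s[i] / (s[i] - s[i + 1])    # linear interpolation weight
            return depths[i] + t * (depths[i + 1] - depths[i])
    return None  # no crossing found along this ray
```

The monotonicity noted in the abstract matters here: if the field decreases monotonically along the ray, the first (and only) sign flip unambiguously identifies the surface depth.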