Learning Inverse Depth Regression for Multi-View Stereo with Correlation Cost Volume
Deep learning has been shown to be effective for depth inference in multi-view
stereo (MVS). However, scalability and accuracy remain open problems in this
domain. This can be attributed to the memory-consuming cost
volume representation and inappropriate depth inference. Inspired by the
group-wise correlation in stereo matching, we propose an average group-wise
correlation similarity measure to construct a lightweight cost volume. This
reduces not only memory consumption but also the computational burden
of cost volume filtering. Based on our effective cost volume
representation, we propose a cascade 3D U-Net module to regularize the cost
volume to further boost the performance. Unlike the previous methods that treat
multi-view depth inference as a depth regression problem or an inverse depth
classification problem, we recast multi-view depth inference as an inverse
depth regression task. This allows our network to achieve sub-pixel estimation
and be applicable to large-scale scenes. Through extensive experiments on the
DTU and Tanks and Temples datasets, we show that our proposed network with
Correlation cost volume and Inverse DEpth Regression (CIDER) achieves
state-of-the-art results, demonstrating its superior scalability
and accuracy.
Comment: Accepted by AAAI-202
RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo
Learning-based multi-view stereo (MVS) has so far centered on 3D
convolution on cost volumes. Due to the high computation and memory consumption
of 3D CNN, the resolution of output depth is often considerably limited.
Different from most existing works dedicated to adaptive refinement of cost
volumes, we opt to directly optimize the depth value along each camera ray,
mimicking the range finding of a laser scanner. This reduces the MVS problem to
ray-based depth optimization which is much more light-weight than full cost
volume optimization. In particular, we propose RayMVSNet which learns
sequential prediction of a 1D implicit field along each camera ray with the
zero-crossing point indicating scene depth. This sequential modeling, conducted
based on transformer features, essentially learns the epipolar line search in
traditional multi-view stereo. We devise a multi-task learning scheme for
better optimization convergence and depth accuracy. We find that the
monotonicity property of the SDFs along each ray greatly benefits depth
estimation. Our method
ranks top on both the DTU and the Tanks & Temples datasets over all previous
learning-based methods, achieving an overall reconstruction score of 0.33mm on
DTU and an F-score of 59.48% on Tanks & Temples. It is able to produce
high-quality depth estimation and point cloud reconstruction in challenging
scenarios such as objects/scenes with non-textured surfaces, severe occlusion,
and highly varying depth range. Further, we propose RayMVSNet++ to enhance
contextual feature aggregation for each ray by designing an attentional
gating unit to select semantically relevant neighboring rays within the local
frustum around that ray. RayMVSNet++ achieves state-of-the-art performance on
the ScanNet dataset. In particular, it attains an AbsRel of 0.058m and produces
accurate results on the two subsets of textureless regions and large depth
variation.
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv
admin note: substantial text overlap with arXiv:2204.0132
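The abstract describes depth as the zero-crossing point of a 1D implicit field along each camera ray. A minimal sketch of that readout step follows. It is an illustration only, not the RayMVSNet model: the function name is hypothetical, the signed distance samples are assumed to be given (the paper predicts them with a learned sequential model), and the sign convention (positive in front of the surface, negative behind it) is an assumption.

```python
def zero_crossing_depth(depths, sdf):
    """Locate the first positive-to-negative crossing of a sampled 1D field.

    depths: increasing sample depths along one camera ray.
    sdf:    signed distance value at each sample (assumed positive in front
            of the surface and negative behind it).
    Returns the linearly interpolated depth, or None if no crossing exists.
    """
    if len(depths) != len(sdf) or len(depths) < 2:
        raise ValueError("need at least two samples of matching length")
    for k in range(len(depths) - 1):
        a, b = sdf[k], sdf[k + 1]
        if a == 0.0:
            return depths[k]  # sample lies exactly on the surface
        if a > 0.0 > b:
            # Linear interpolation between the two bracketing samples.
            t = a / (a - b)
            return depths[k] + t * (depths[k + 1] - depths[k])
    if sdf[-1] == 0.0:
        return depths[-1]
    return None
```

If the field decreases monotonically along the ray, as the abstract notes for the SDFs, at most one crossing exists, so scanning for the first sign change is sufficient in this sketch.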