Self-supervised depth learning from monocular images normally relies on the
2D pixel-wise photometric relation between temporally adjacent image frames.
However, such methods neither fully exploit 3D point-wise geometric
correspondences nor effectively tackle the ambiguities in photometric
warping caused by occlusions or illumination inconsistency. To address these
problems, this work proposes the Density Volume Construction Network (DevNet), a
novel self-supervised monocular depth learning framework that considers 3D
spatial information and exploits stronger geometric constraints among adjacent
camera frustums. Instead of directly regressing the pixel-wise depth value from a
single image, our DevNet divides the camera frustum into multiple parallel planes
and predicts the point-wise occlusion probability density on each plane. The final
depth map is generated by integrating the density along corresponding rays.
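The abstract does not spell out the integration step; the following is a minimal PyTorch sketch of one plausible reading, assuming a volume-rendering-style discretization over the parallel planes. The function name `depth_from_density`, the plane spacing, and the alpha-compositing weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def depth_from_density(sigma, plane_depths):
    """Render a depth map from per-plane occlusion densities (hypothetical sketch).

    sigma:        (K, H, W) non-negative occlusion probability density per plane
    plane_depths: (K,) depth of each fronto-parallel plane, ascending
    returns:      (H, W) expected depth map along each pixel's ray
    """
    # Spacing between adjacent planes (last interval repeated to keep K entries).
    deltas = plane_depths[1:] - plane_depths[:-1]
    deltas = torch.cat([deltas, deltas[-1:]])                    # (K,)

    # Opacity of each slab: alpha_k = 1 - exp(-sigma_k * delta_k).
    alpha = 1.0 - torch.exp(-sigma * deltas[:, None, None])      # (K, H, W)

    # Transmittance: probability the ray reaches plane k unoccluded
    # (exclusive cumulative product, shifted by one plane).
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]])  # (K, H, W)

    # Expected depth = density-weighted sum of plane depths along the ray.
    weights = trans * alpha                                      # (K, H, W)
    return (weights * plane_depths[:, None, None]).sum(dim=0)    # (H, W)

# Toy usage: 8 planes between 1 m and 10 m on a 4x4 image.
K, H, W = 8, 4, 4
planes = torch.linspace(1.0, 10.0, K)
density = torch.rand(K, H, W)
depth = depth_from_density(density, planes)
print(depth.shape)  # torch.Size([4, 4])
```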
During the training process, novel regularization strategies and loss functions
are introduced to mitigate photometric ambiguities and overfitting. Without
noticeably increasing the model size or running time, DevNet outperforms
several representative baselines on both the KITTI-2015 outdoor dataset and
NYU-V2 indoor dataset. In particular, the root-mean-square deviation is reduced
by around 4% with DevNet on both KITTI-2015 and NYU-V2 in the task of depth
estimation. Code is available at https://github.com/gitkaichenzhou/DevNet.

Accepted by European Conference on Computer Vision 2022 (ECCV 2022).