Recently, self-supervised monocular depth estimation has gained popularity
with numerous applications in autonomous driving and robotics. However,
existing solutions primarily estimate depth from immediate visual features;
they struggle to recover fine-grained scene details and generalize poorly.
In this paper, we introduce SQLdepth, a novel approach that can
effectively learn fine-grained scene structures from motion. In SQLdepth, we
propose a novel Self Query Layer (SQL) to build a self-cost volume and infer
depth from it, rather than inferring depth from feature maps. The self-cost
volume implicitly captures the intrinsic geometry of the scene within a single
frame. Each individual slice of the volume signifies the relative distances
between points and objects within a latent space. Ultimately, this volume is
compressed to the depth map via a novel decoding approach. Experimental results
on KITTI and Cityscapes show that our method attains state-of-the-art
performance (AbsRel = 0.082 on KITTI, 0.052 on KITTI with improved
ground truth, and 0.106 on Cityscapes), reducing error by 9.9%, 5.5%, and
4.5% relative to the previous best. In addition, our approach
showcases reduced training complexity, computational efficiency, improved
generalization, and the ability to recover fine-grained scene details.
Moreover, the self-supervised pre-trained and metric fine-tuned SQLdepth can
surpass existing supervised methods by significant margins (AbsRel = 0.043,
14% error reduction). The self-matching-oriented relative distance querying in
SQL improves the robustness and zero-shot generalization capability of
SQLdepth. Code is available at
\href{https://github.com/hisfog/SQLdepth-Impl}{https://github.com/hisfog/SQLdepth-Impl},
and the pre-trained weights will be made publicly available.

Comment: 14 pages, 9 figures
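
To make the self-cost volume idea concrete, the following is a minimal NumPy sketch, not the SQLdepth implementation. The abstract does not specify how queries are formed or how they are compared with pixels, so two choices here are assumptions made purely for illustration: queries are obtained by average-pooling non-overlapping patches of the same frame's feature map, and the comparison is a dot product. The function name `self_cost_volume` and the `patch` parameter are hypothetical.

```python
import numpy as np


def self_cost_volume(features: np.ndarray, patch: int = 4) -> np.ndarray:
    """Build a (Q, H, W) volume from a single (C, H, W) feature map.

    Assumption: queries are patch-averaged features of the same frame,
    and each slice stores dot-product similarity to every pixel.
    """
    c, h, w = features.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")

    # Hypothetical query construction: average-pool non-overlapping patches.
    pooled = features.reshape(c, h // patch, patch, w // patch, patch)
    queries = pooled.mean(axis=(2, 4)).reshape(c, -1).T  # shape (Q, C)

    # Slice q compares query q with the feature vector at every pixel.
    return np.einsum("qc,chw->qhw", queries, features)


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 32))   # C=8, H=16, W=32
vol = self_cost_volume(feats, patch=4)
print(vol.shape)  # (32, 16, 32): one slice per query
```

Each slice of `vol` relates one coarse query to all pixels of the same frame, which is the sense in which the volume is built from a single image. The decoding step that compresses the volume into a depth map is not sketched here, since the abstract gives no details about it.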