Multi-view depth estimation has achieved impressive performance over various
benchmarks. However, almost all current multi-view systems rely on given ideal
camera poses, which are unavailable in many real-world scenarios, such as
autonomous driving. In this work, we propose a new robustness benchmark to
evaluate depth estimation systems under various noisy pose settings.
Surprisingly, we find that current multi-view depth estimation methods, as well
as single-view and multi-view fusion methods, fail under noisy pose settings.
To address this challenge, we propose a single-view and multi-view
fused depth estimation system that adaptively integrates high-confidence
multi-view and single-view results for both robust and accurate depth
estimations. The adaptive fusion module performs fusion by dynamically
selecting high-confidence regions between the two branches based on a warping
confidence map. Thus, the system tends to choose the more reliable branch when
facing textureless scenes, inaccurate calibration, dynamic objects, and other
degradation or challenging conditions. Our method outperforms state-of-the-art
multi-view and fusion methods under robustness testing. Furthermore, we achieve
state-of-the-art performance on challenging benchmarks (KITTI and DDAD) when
given accurate pose estimations. Project website:
https://github.com/Junda24/AFNet/.

Comment: Accepted to CVPR 2024.
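
The confidence-guided fusion idea described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name `AdaptiveDepthFusion`, the small convolutional confidence head, and the soft per-pixel blending are illustrative assumptions about how a warping-based confidence map could arbitrate between the single-view and multi-view branches.

```python
import torch
import torch.nn as nn


class AdaptiveDepthFusion(nn.Module):
    """Hypothetical sketch: fuse single-view and multi-view depth maps by
    weighting each pixel toward the branch with higher warping confidence."""

    def __init__(self, err_channels=3):
        super().__init__()
        # Small head that turns a photometric reprojection (warping) error map
        # into a per-pixel confidence for the multi-view branch (assumed design).
        self.conf_head = nn.Sequential(
            nn.Conv2d(err_channels, 16, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, depth_sv, depth_mv, reproj_error):
        # depth_sv, depth_mv: (B, 1, H, W) single-view / multi-view depth maps.
        # reproj_error: (B, C, H, W) photometric error from warping a source view
        # into the reference view using depth_mv and the (possibly noisy) pose.
        conf_mv = self.conf_head(reproj_error)  # near 1 where warping is consistent
        # Soft selection: trust multi-view where warping confidence is high,
        # fall back to single-view in textureless / dynamic / miscalibrated regions.
        fused = conf_mv * depth_mv + (1.0 - conf_mv) * depth_sv
        return fused, conf_mv


if __name__ == "__main__":
    fusion = AdaptiveDepthFusion()
    d_sv = torch.rand(2, 1, 64, 96)
    d_mv = torch.rand(2, 1, 64, 96)
    err = torch.rand(2, 3, 64, 96)
    fused, conf = fusion(d_sv, d_mv, err)
    print(fused.shape, conf.shape)  # torch.Size([2, 1, 64, 96]) twice
```

Under this reading, noisy poses corrupt the warping step and drive the confidence toward zero, so the fused output degrades gracefully to the pose-free single-view prediction rather than failing outright.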