We propose a novel multi-stage depth super-resolution network that
progressively reconstructs high-resolution depth maps from explicit and
implicit high-frequency features. The former are extracted by an efficient
transformer processing both local and global contexts, while the latter are
obtained by projecting color images into the frequency domain. Both are
combined with depth features through a fusion strategy within a
multi-stage, multi-scale framework. Experiments on standard benchmarks,
namely NYUv2, Middlebury, DIML, and RGBDD, show that our approach outperforms
existing methods by a large margin (~20% on NYUv2 and DIML against the
concurrent work DADA at 16x upsampling), establishing a new
state of the art for guided depth super-resolution.
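
To make the frequency-domain projection mentioned above concrete, the following is a minimal sketch, assuming a standard FFT high-pass filtering of the color guide; the function name highpass_residual and the cutoff parameter are hypothetical illustrations of ours, not the paper's exact operator.

    import numpy as np

    def highpass_residual(rgb, cutoff=0.1):
        """Illustrative high-frequency map from an RGB guide image via a
        2D FFT high-pass mask (the paper's actual operator may differ).

        rgb:    (H, W, 3) float array in [0, 1]
        cutoff: fraction of the spectrum radius below which frequencies
                are suppressed
        """
        gray = rgb.mean(axis=-1)                   # collapse color channels
        spec = np.fft.fftshift(np.fft.fft2(gray))  # centered spectrum
        h, w = gray.shape
        yy, xx = np.ogrid[:h, :w]
        dist = np.hypot(yy - h / 2, xx - w / 2)    # distance from the DC term
        mask = dist > cutoff * min(h, w) / 2       # keep high frequencies only
        high = np.fft.ifft2(np.fft.ifftshift(spec * mask))
        return np.real(high)                       # high-frequency guidance map

    # Example: edges and texture survive the mask, smooth content does not
    img = np.random.rand(64, 64, 3)
    hf = highpass_residual(img, cutoff=0.15)
    print(hf.shape)  # (64, 64)

Such a map would serve as the implicit high-frequency cue fused with depth features at each stage of the framework.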