Two-in-One Depth: Bridging the Gap Between Monocular and Binocular Self-supervised Depth Estimation
Monocular and binocular self-supervised depth estimations are two important
and related tasks in computer vision, which aim to predict scene depths from
single images and stereo image pairs respectively. In the literature, the two
tasks are usually tackled separately by two different kinds of models: binocular
models generally fail to predict depth from single images, while the prediction
accuracy of monocular models is generally inferior to that of binocular models. In this
paper, we propose a Two-in-One self-supervised depth estimation network, called
TiO-Depth, which not only handles both tasks with a single model but also
improves prediction accuracy. TiO-Depth employs a Siamese architecture, and
each of its sub-networks can be used on its own as a monocular depth estimation model. For
binocular depth estimation, a Monocular Feature Matching module is proposed to
incorporate the stereo knowledge between the two images, and the full
TiO-Depth is used to predict depths. We also design a multi-stage
joint-training strategy that improves the performance of TiO-Depth on both
tasks by combining their relative advantages. Experimental results on the
KITTI, Cityscapes, and DDAD datasets demonstrate that TiO-Depth outperforms
both the monocular and binocular state-of-the-art methods in most cases, and
further verify the feasibility of a two-in-one network for monocular and
binocular depth estimation. The code is available at
https://github.com/ZM-Zhou/TiO-Depth_pytorch.
Comment: Accepted to ICCV 202
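The two-in-one idea in the abstract above, a single shared encoder that predicts depth alone in monocular mode and matches left/right features in binocular mode, can be sketched in a few lines. Everything below (the toy linear "encoder", the correlation cost volume, the soft-argmin head, the array shapes) is an illustrative assumption, not the authors' TiO-Depth implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoInOneDepth:
    """Toy sketch: one shared encoder serves both modes.

    Monocular mode uses the encoder alone; binocular mode additionally
    correlates left/right features in a cost volume (a crude stand-in for
    the paper's Monocular Feature Matching module).
    """

    def __init__(self, channels=8, max_disp=16):
        self.W = rng.standard_normal((3, channels))  # fake encoder weights
        self.max_disp = max_disp

    def encode(self, img):                      # img: (H, W, 3)
        return np.tanh(img @ self.W)            # features: (H, W, C)

    def monocular(self, img):
        # depth from a single image, using only the shared encoder
        feat = self.encode(img)
        return 1.0 / (1e-3 + feat.mean(axis=-1) ** 2)   # toy depth head

    def binocular(self, left, right):
        fl, fr = self.encode(left), self.encode(right)
        d_axis = np.arange(1, self.max_disp + 1)
        # correlation cost volume over candidate disparities
        cost = np.stack(
            [(fl * np.roll(fr, d, axis=1)).mean(-1) for d in d_axis], -1
        )
        p = np.exp(cost - cost.max(-1, keepdims=True))
        p /= p.sum(-1, keepdims=True)
        disp = (p * d_axis).sum(-1)             # soft-argmin disparity
        return 1.0 / disp                       # depth is inverse disparity

left = rng.random((4, 8, 3))
right = np.roll(left, 2, axis=1)                # fake horizontal stereo shift
net = TwoInOneDepth()
mono = net.monocular(left)                      # works from one image alone
stereo = net.binocular(left, right)             # uses both images
```

The point of the sketch is structural: the same `encode` weights are reused in both code paths, which is what lets one network serve both tasks.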
The Second Monocular Depth Estimation Challenge
This paper discusses the results for the second edition of the Monocular
Depth Estimation Challenge (MDEC). This edition was open to methods using any
form of supervision, including fully-supervised, self-supervised, multi-task or
proxy depth. The challenge was based around the SYNS-Patches dataset, which
features a wide diversity of environments with high-quality dense ground-truth.
This includes complex natural environments, e.g. forests or fields, which are
greatly underrepresented in current benchmarks.
The challenge received eight unique submissions that outperformed the
provided SotA baseline on any of the pointcloud- or image-based metrics. The
top supervised submission improved relative F-Score by 27.62%, while the top
self-supervised improved it by 16.61%. Supervised submissions generally
leveraged large collections of datasets to improve data diversity.
Self-supervised submissions instead updated the network architecture and
pretrained backbones. These results represent significant progress in the
field, while highlighting avenues for future research, such as reducing
interpolation artifacts at depth boundaries, improving self-supervised indoor
performance, and improving overall natural-image accuracy.
Comment: Published at CVPRW202
DevNet: Self-supervised Monocular Depth Learning via Density Volume Construction
Self-supervised depth learning from monocular images normally relies on the
2D pixel-wise photometric relation between temporally adjacent image frames.
However, such methods neither fully exploit the 3D point-wise geometric
correspondences, nor effectively tackle the ambiguities in photometric
warping caused by occlusions or illumination inconsistency. To address these
problems, this work proposes the Density Volume Construction Network (DevNet), a
novel self-supervised monocular depth learning framework that considers 3D
spatial information and exploits stronger geometric constraints among adjacent
camera frustums. Instead of directly regressing the pixel value from a single
image, our DevNet divides the camera frustum into multiple parallel planes and
predicts the pointwise occlusion probability density on each plane. The final
depth map is generated by integrating the density along corresponding rays.
During the training process, novel regularization strategies and loss functions
are introduced to mitigate photometric ambiguities and overfitting. Without
noticeably increasing the model size or running time, DevNet outperforms
several representative baselines on both the KITTI-2015 outdoor dataset and the
NYU-V2 indoor dataset. In particular, DevNet reduces the root-mean-square
deviation by around 4% on both KITTI-2015 and NYU-V2 in the depth estimation
task. Code is available at https://github.com/gitkaichenzhou/DevNet.
Comment: Accepted by European Conference on Computer Vision 2022 (ECCV2022)
On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: a Survey
Stereo matching is one of the longest-standing problems in computer vision,
with close to 40 years of study and research. Over the years, the
paradigm has shifted from local, pixel-level decisions through various forms of
discrete and continuous optimization to data-driven, learning-based methods.
Recently, the rise of machine learning and the rapid proliferation of deep
learning have enhanced stereo matching with exciting new trends and applications
unthinkable until a few years ago. Interestingly, the relationship between
these two worlds is two-way. While machine, and especially deep, learning
advanced the state-of-the-art in stereo matching, stereo itself enabled new
ground-breaking methodologies such as self-supervised monocular depth
estimation based on deep networks. In this paper, we review recent research in
the field of learning-based depth estimation from single and binocular images
highlighting the synergies, the successes achieved so far and the open
challenges the community is going to face in the immediate future.
Comment: Accepted to TPAMI. Paper version of our CVPR 2019 tutorial:
"Learning-based depth estimation from stereo and monocular images: successes,
limitations and future challenges"
(https://sites.google.com/view/cvpr-2019-depth-from-image/home)
The Monocular Depth Estimation Challenge
This paper summarizes the results of the first Monocular Depth Estimation
Challenge (MDEC) organized at WACV2023. This challenge evaluated the progress
of self-supervised monocular depth estimation on the challenging SYNS-Patches
dataset. The challenge was organized on CodaLab and received submissions from 4
valid teams. Participants were provided a devkit containing updated reference
implementations for 16 State-of-the-Art algorithms and 4 novel techniques. The
threshold for acceptance for novel techniques was to outperform every one of
the 16 SotA baselines. All participants outperformed the baseline in
traditional metrics such as MAE or AbsRel. However, pointcloud reconstruction
metrics were challenging to improve upon. We found predictions were
characterized by interpolation artefacts at object boundaries and errors in
relative object positioning. We hope this challenge is a valuable contribution
to the community and encourage authors to participate in future editions.
Comment: WACV-Workshops 202
Single View 3D Reconstruction using Deep Learning
One of the major challenges in the field of Computer Vision has been the reconstruction of a 3D object or scene from a single 2D image. While there are many notable examples, traditional methods for single view reconstruction often fail to generalise due to their reliance on many brittle hand-crafted engineering solutions, limiting their applicability to real-world problems. Recently, deep learning has taken over the field of Computer Vision, and "learning to reconstruct" has become the dominant technique for addressing the limitations of traditional methods when performing single view 3D reconstruction. Deep learning allows our reconstruction methods to learn generalisable image features and monocular cues that would otherwise be difficult to engineer through ad-hoc hand-crafted approaches. However, it can often be difficult to efficiently integrate the various 3D shape representations within the deep learning framework. In particular, 3D volumetric representations can be adapted to work with Convolutional Neural Networks, but they are computationally expensive and memory-inefficient when using local convolutional layers. Also, the successful learning of generalisable feature representations for 3D reconstruction requires large amounts of diverse training data. In practice, this is challenging for 3D training data, as it entails a costly and time-consuming manual data collection and annotation process. Researchers have attempted to address these issues by utilising self-supervised learning and generative modelling techniques; however, these approaches often produce suboptimal results when compared with models trained on larger datasets. This thesis addresses several key challenges incurred when using deep learning for "learning to reconstruct" 3D shapes from single view images.
We observe that it is possible to learn a compressed representation for multiple categories of the 3D ShapeNet dataset, improving the computational and memory efficiency when working with 3D volumetric representations. To address the challenge of data acquisition, we leverage deep generative models to "hallucinate" hidden or latent novel viewpoints for a given input image. Combining these images with depths estimated by a self-supervised depth estimator and the known camera properties allowed us to reconstruct textured 3D point clouds without any ground-truth 3D training data. Furthermore, we show that it is possible to improve upon the previous self-supervised monocular depth estimator by adding self-attention and a discrete volumetric representation, significantly improving accuracy on the KITTI 2015 dataset and enabling the estimation of uncertainty in depth predictions.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
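The point-cloud reconstruction step described above, combining estimated depths with known camera properties, is classical pinhole backprojection. A minimal sketch, assuming a simple (3, 3) intrinsics matrix (the focal length and principal point below are illustrative numbers, not values from the thesis):

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map to a 3D point cloud using pinhole intrinsics K.

    depth: (H, W) per-pixel depth; K: (3, 3) intrinsics matrix.
    Returns points: (H*W, 3) in camera coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # homogeneous pixel coordinates [u, v, 1]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # normalized camera rays (z = 1)
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth

# a flat depth map at 2.0 m backprojects to a fronto-parallel plane at z = 2
K = np.array([[50.0, 0.0, 16.0],
              [0.0, 50.0, 12.0],
              [0.0, 0.0, 1.0]])
pts = backproject(np.full((24, 32), 2.0), K)
```

Texturing the cloud then amounts to attaching each pixel's RGB value to its backprojected point.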
MonoDiffusion: Self-Supervised Monocular Depth Estimation Using Diffusion Model
Over the past few years, self-supervised monocular depth estimation that does
not depend on ground-truth during the training phase has received widespread
attention. Most efforts focus on designing different types of network
architectures and loss functions or handling edge cases, e.g., occlusion and
dynamic objects. In this work, we introduce a novel self-supervised depth
estimation framework, dubbed MonoDiffusion, by formulating it as an iterative
denoising process. Because the depth ground-truth is unavailable in the
training phase, we develop a pseudo ground-truth diffusion process to assist
the diffusion in MonoDiffusion. The pseudo ground-truth diffusion gradually
adds noise to the depth map generated by a pre-trained teacher model.
Moreover, the teacher model allows applying a distillation loss to guide the
denoised depth. Furthermore, we develop a masked visual condition mechanism to
enhance the denoising ability of the model. Extensive experiments are conducted on
the KITTI and Make3D datasets, and the proposed MonoDiffusion outperforms prior
state-of-the-art competitors. The source code will be available at
https://github.com/ShuweiShao/MonoDiffusion.
Comment: 10 pages, 8 figure
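The pseudo ground-truth diffusion described above, gradually adding noise to a teacher's depth map, matches the standard DDPM forward process q(d_t | d_0). A minimal sketch under that assumption (the noise schedule, step count, and the random "teacher" depth map are placeholders, not MonoDiffusion's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    # linear beta schedule; return the cumulative alpha_bar_t products
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def diffuse_teacher_depth(teacher_depth, t, alpha_bar):
    """Sample q(d_t | d_0): noise the teacher's depth map at step t.

    d_t = sqrt(alpha_bar_t) * d_0 + sqrt(1 - alpha_bar_t) * eps
    """
    eps = rng.standard_normal(teacher_depth.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * teacher_depth + np.sqrt(1.0 - a) * eps, eps

alpha_bar = make_schedule()
d0 = rng.random((8, 8))      # stand-in for a pre-trained teacher's depth map
d_t, eps = diffuse_teacher_depth(d0, t=50, alpha_bar=alpha_bar)
```

During training, the denoiser would be asked to recover `d0` (or predict `eps`) from `d_t`, with the teacher output additionally supervising the result through a distillation loss.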