DEVIANT: Depth EquiVarIAnt NeTwork for Monocular 3D Object Detection
Modern neural networks use building blocks such as convolutions that are
equivariant to arbitrary 2D translations. However, these vanilla blocks are not
equivariant to arbitrary 3D translations in the projective manifold. Even then,
all monocular 3D detectors use vanilla blocks to obtain the 3D coordinates, a
task for which the vanilla blocks are not designed. This paper takes the
first step towards convolutions equivariant to arbitrary 3D translations in the
projective manifold. Since the depth is the hardest to estimate for monocular
detection, this paper proposes Depth EquiVarIAnt NeTwork (DEVIANT) built with
existing scale equivariant steerable blocks. As a result, DEVIANT is
equivariant to the depth translations in the projective manifold whereas
vanilla networks are not. The additional depth equivariance forces the DEVIANT
to learn consistent depth estimates, and therefore, DEVIANT achieves
state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in
the image-only category and performs competitively to methods using extra
information. Moreover, DEVIANT works better than vanilla networks in
cross-dataset evaluation. Code and models at
https://github.com/abhi1kumar/DEVIANT
Comment: ECCV 202
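The first claim of the abstract, that a vanilla convolution is equivariant to 2D translations, can be checked numerically. The following is a minimal sketch and not the authors' code: it assumes circular (wrap-around) boundary handling so that the property holds exactly, and the helper names `circular_correlate` and `shift` are hypothetical. It does not cover the depth (scale) equivariance that DEVIANT adds.

```python
def circular_correlate(image, kernel):
    """2D cross-correlation with wrap-around borders (an assumption of this sketch)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[(y + i) % h][(x + j) % w]
            out[y][x] = total
    return out


def shift(image, dy, dx):
    """Translate an image by (dy, dx) pixels with wrap-around."""
    h, w = len(image), len(image[0])
    return [[image[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


# Small integer test image and kernel (illustrative values only).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 8, 7, 6],
         [5, 4, 3, 2]]
kernel = [[1, 0],
          [-1, 2]]

# Equivariance: filtering a shifted image equals shifting the filtered image.
shift_then_filter = circular_correlate(shift(image, 1, 2), kernel)
filter_then_shift = shift(circular_correlate(image, kernel), 1, 2)
equivariant = shift_then_filter == filter_then_shift
```

With zero padding instead of wrap-around, the equality holds only away from the image border.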
SGM3D: Stereo Guided Monocular 3D Object Detection
Monocular 3D object detection aims to predict the object location, dimension
and orientation in 3D space alongside the object category given only a
monocular image. It poses a great challenge because the problem is ill-posed:
the 2D image plane critically lacks depth information. While there
exist approaches leveraging off-the-shelf depth estimation or relying on LiDAR
sensors to mitigate this problem, the dependence on the additional depth model
or expensive equipment severely limits their scalability to generic 3D
perception. In this paper, we propose a stereo-guided monocular 3D object
detection framework, dubbed SGM3D, adapting the robust 3D features learned from
stereo inputs to enhance the feature for monocular detection. We innovatively
present a multi-granularity domain adaptation (MG-DA) mechanism to exploit the
network's ability to generate stereo-mimicking features given only monocular
cues. Both coarse BEV feature-level and fine anchor-level domain
adaptation are leveraged for guidance in the monocular domain. In
addition, we introduce an IoU matching-based alignment (IoU-MA) method for
object-level domain adaptation between the stereo and monocular predictions to
alleviate the mismatches while adopting the MG-DA. Extensive experiments
demonstrate state-of-the-art results on KITTI and Lyft datasets.
Comment: 8 pages, 5 figure
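The abstract does not spell out how IoU-MA pairs stereo and monocular predictions. As a rough illustration of IoU-based matching in general, the sketch below greedily pairs axis-aligned 2D boxes by descending IoU. The box format `(x1, y1, x2, y2)`, the greedy strategy, the `threshold` value, and the function names are all assumptions of this sketch, not details from the paper.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def match_by_iou(stereo_boxes, mono_boxes, threshold=0.5):
    """Greedy one-to-one matching: highest-IoU pairs are assigned first."""
    candidates = []
    for i, s in enumerate(stereo_boxes):
        for j, m in enumerate(mono_boxes):
            overlap = iou(s, m)
            if overlap >= threshold:
                candidates.append((overlap, i, j))
    candidates.sort(reverse=True)

    used_stereo, used_mono, matches = set(), set(), []
    for overlap, i, j in candidates:
        if i in used_stereo or j in used_mono:
            continue  # each prediction is matched at most once
        used_stereo.add(i)
        used_mono.add(j)
        matches.append((i, j))
    return matches


# Illustrative boxes only; not data from the paper.
stereo = [(0, 0, 2, 2), (10, 10, 12, 12)]
mono = [(10, 10, 12, 13), (0, 0, 2, 2), (50, 50, 51, 51)]
matches = match_by_iou(stereo, mono)
```

Unmatched predictions, such as the third monocular box here, would simply be excluded from any object-level alignment term.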
Sparsity Invariant CNNs
In this paper, we consider convolutional neural networks operating on sparse
inputs with an application to depth upsampling from sparse laser scan data.
First, we show that traditional convolutional networks perform poorly when
applied to sparse data even when the location of missing data is provided to
the network. To overcome this problem, we propose a simple yet effective sparse
convolution layer which explicitly considers the location of missing data
during the convolution operation. We demonstrate the benefits of the proposed
network architecture in synthetic and real experiments with respect to various
baseline approaches. Compared to dense baselines, the proposed sparse
convolution network generalizes well to novel datasets and is invariant to the
level of sparsity in the data. For our evaluation, we derive a novel dataset
from the KITTI benchmark, comprising 93k depth annotated RGB images. Our
dataset allows for training and evaluating depth upsampling and depth
prediction techniques in challenging real-world settings and will be made
available upon publication.
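One way to make a convolution "explicitly consider the location of missing data" is to weight inputs by a binary validity mask and normalize by the number of valid inputs in each window. The sketch below assumes that formulation; the abstract itself does not give the layer's equations, and the function name `sparse_conv2d`, the zero padding, and the mask propagation rule are choices made for this example.

```python
def sparse_conv2d(values, mask, kernel):
    """Mask-normalized 2D convolution (sketch).

    values: 2D list of inputs, e.g. sparse depth; entries at invalid pixels are ignored.
    mask:   2D list of 0/1 flags, 1 where a measurement exists.
    kernel: 2D list of weights with odd height and width.
    Returns (output, output_mask).
    """
    h, w = len(values), len(values[0])
    kh, kw = len(kernel), len(kernel[0])
    pad_h, pad_w = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    out_mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, valid = 0.0, 0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - pad_h, x + j - pad_w
                    inside = 0 <= yy < h and 0 <= xx < w
                    if inside and mask[yy][xx]:
                        acc += kernel[i][j] * values[yy][xx]
                        valid += 1
            if valid:
                # Normalizing by the count of observed pixels keeps the
                # response independent of how sparse the window is.
                out[y][x] = acc / valid
                out_mask[y][x] = 1  # window saw at least one measurement
    return out, out_mask


# A constant depth of 5 observed at only two pixels (illustrative data).
depth = [[5, 0, 0],
         [0, 0, 0],
         [0, 0, 5]]
valid = [[1, 0, 0],
         [0, 0, 0],
         [0, 0, 1]]
ones = [[1, 1, 1]] * 3
dense, dense_mask = sparse_conv2d(depth, valid, ones)
```

With an all-ones kernel, every output pixel whose window contains at least one measurement recovers the constant depth, whether the window holds one valid pixel or two. An unnormalized convolution would instead scale with the number of observed pixels.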
On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: a Survey
Stereo matching is one of the longest-standing problems in computer vision
with close to 40 years of studies and research. Throughout the years the
paradigm has shifted from local, pixel-level decision to various forms of
discrete and continuous optimization to data-driven, learning-based methods.
Recently, the rise of machine learning and the rapid proliferation of deep
learning enhanced stereo matching with new exciting trends and applications
unthinkable until a few years ago. Interestingly, the relationship between
these two worlds is two-way. While machine, and especially deep, learning
advanced the state-of-the-art in stereo matching, stereo itself enabled new
ground-breaking methodologies such as self-supervised monocular depth
estimation based on deep networks. In this paper, we review recent research in
the field of learning-based depth estimation from single and binocular images
highlighting the synergies, the successes achieved so far and the open
challenges the community is going to face in the immediate future.
Comment: Accepted to TPAMI. Paper version of our CVPR 2019 tutorial:
"Learning-based depth estimation from stereo and monocular images: successes,
limitations and future challenges"
(https://sites.google.com/view/cvpr-2019-depth-from-image/home
- …