Deep learning methods for 360 monocular depth estimation and point cloud semantic segmentation
Monocular depth estimation and point cloud segmentation are essential tasks for 3D scene understanding in computer vision. Depth estimation for omnidirectional images is challenging due to spherical distortion and the scarcity of large-scale labeled datasets. We present two works on 360 monocular depth estimation. In the first, we propose a novel, model-agnostic, two-stage pipeline for omnidirectional monocular depth estimation. Our framework, PanoDepth, takes one 360 image as input, produces one or more synthesized views in the first stage, and feeds the original image and the synthesized images into a subsequent stereo matching stage. By exploiting explicit stereo-based geometric constraints, PanoDepth generates dense, high-quality depth. In the second work, we propose a 360 monocular depth estimation pipeline, OmniFusion, to tackle the spherical distortion issue. Our pipeline transforms a 360 image into less-distorted perspective patches (i.e., tangent images), obtains patch-wise predictions via a CNN, and then merges the patch-wise results into the final output. To handle the discrepancy between patch-wise predictions, which is the major issue affecting merging quality, we propose a new framework with (i) a geometry-aware feature fusion mechanism that combines 3D geometric features with 2D image features, (ii) a self-attention-based transformer architecture that performs global aggregation of patch-wise information, and (iii) an iterative depth refinement mechanism that further refines the estimated depth based on the more accurate geometric features. Experiments show that both PanoDepth and OmniFusion achieve state-of-the-art performance on several 360 monocular depth estimation benchmarks. For point cloud analysis, we focus on defining effective local point convolution operators and propose two approaches: SPNet and Point-Voxel CNN.
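The patch-then-merge idea behind OmniFusion can be sketched in a few lines of numpy. This is a simplification, not the authors' pipeline: it splits an equirectangular image into overlapping vertical strips instead of gnomonic tangent projections, uses a stub in place of the CNN predictor, and merges patch-wise outputs by center-weighted averaging:

```python
import numpy as np

def split_into_patches(img, patch_w, stride):
    """Split an equirectangular image into overlapping vertical strips.
    (The real pipeline uses gnomonic/tangent projections; horizontal
    crops are a simplification for illustration.)"""
    h, w = img.shape[:2]
    patches, offsets = [], []
    for x0 in range(0, w, stride):
        idx = np.arange(x0, x0 + patch_w) % w   # wrap around the 360 seam
        patches.append(img[:, idx])
        offsets.append(x0)
    return patches, offsets

def merge_patches(preds, offsets, w):
    """Merge overlapping patch-wise predictions by weighted averaging."""
    h, patch_w = preds[0].shape
    acc, wsum = np.zeros((h, w)), np.zeros((h, w))
    weight = np.hanning(patch_w) + 1e-3          # favor patch centers
    for pred, x0 in zip(preds, offsets):
        idx = np.arange(x0, x0 + patch_w) % w
        acc[:, idx] += pred * weight
        wsum[:, idx] += weight
    return acc / wsum

rgb = np.random.rand(64, 256)                     # stand-in 360 image
patches, offsets = split_into_patches(rgb, patch_w=64, stride=32)
depth_patches = [p * 0.5 + 1.0 for p in patches]  # stub per-patch "CNN"
depth = merge_patches(depth_patches, offsets, w=256)
```

Because the stub predictor is consistent across patches, the merged map here equals the per-patch output; the paper's geometry-aware fusion, transformer aggregation, and iterative refinement exist precisely because real patch-wise predictions disagree in the overlap regions.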
For the former, we propose a novel point convolution operator named Shell Point Convolution (SPConv) as the building block for shape encoding and local context learning. Specifically, SPConv splits the 3D neighborhood space into shells, aggregates local features on manually designed kernel points, and performs convolution on the shells. For the latter, we present a novel lightweight convolutional neural network that uses the point-voxel convolution (PVC) layer as its building block. Each PVC layer has two parallel branches: a voxel branch and a point branch. In the voxel branch, we aggregate local features on non-empty voxel centers to reduce the geometric information loss caused by voxelization, then apply volumetric convolutions to enhance local neighborhood geometry encoding. In the point branch, we use a Multi-Layer Perceptron (MLP) to extract fine-grained point-wise features. Outputs from the two branches are adaptively fused via a feature selection module. Experimental results show that SPConv and PVC layers are effective in local shape encoding, and our proposed networks perform well on semantic segmentation tasks.
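A minimal numpy sketch of the two-branch PVC idea. The grid resolution, the one-layer stand-in for the MLP, and the sigmoid gate are illustrative assumptions, not the paper's architecture; the point is the structure: voxel-level aggregation broadcast back to points, a per-point branch, and an adaptive fusion of the two:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((100, 3))            # N points in a unit cube
feat = rng.random((100, 8))           # per-point input features

# --- voxel branch: average features of points falling in the same voxel ---
res = 4                               # toy 4x4x4 grid
vidx = np.minimum((pts * res).astype(int), res - 1)
keys = vidx[:, 0] * res * res + vidx[:, 1] * res + vidx[:, 2]
voxel_feat = np.zeros_like(feat)
for k in np.unique(keys):             # only non-empty voxels are touched
    m = keys == k
    voxel_feat[m] = feat[m].mean(axis=0)   # broadcast voxel feature back

# --- point branch: a toy per-point "MLP" (one linear layer + ReLU) ---
W = rng.standard_normal((8, 8)) * 0.1
point_feat = np.maximum(feat @ W, 0.0)

# --- adaptive fusion: a sigmoid gate selects between the two branches ---
g = 1.0 / (1.0 + np.exp(-(voxel_feat - point_feat)))   # toy gate
fused = g * voxel_feat + (1.0 - g) * point_feat
```

In the actual network the voxel branch additionally runs volumetric convolutions on the aggregated grid, and the gate is a learned feature selection module rather than this fixed sigmoid.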
Equivariant Light Field Convolution and Transformer
3D reconstruction and novel view rendering can greatly benefit from geometric
priors when the input views are not sufficient in terms of coverage and
inter-view baselines. Deep learning of geometric priors from 2D images often
requires each image to be represented in a canonical frame and the prior
to be learned in a given or learned canonical frame. In this paper, given
only the relative poses of the cameras, we show how to learn priors from
multiple views equivariant to coordinate frame transformations by proposing an
SE(3)-equivariant convolution and transformer in the space of rays in 3D.
This enables the creation of a light field that remains equivariant to the
choice of coordinate frame. The light field, as defined in our work, refers both
to the radiance field and the feature field defined on the ray space. We model
the ray space, the domain of the light field, as a homogeneous space of SE(3)
and introduce the SE(3)-equivariant convolution in ray space. Depending on
the output domain of the convolution, we present convolution-based
SE(3)-equivariant maps from ray space to ray space and to R^3. Our
mathematical framework allows us to go beyond convolution to
SE(3)-equivariant attention in the ray space. We demonstrate how to tailor
and adapt the equivariant convolution and transformer in the tasks of
equivariant neural rendering and reconstruction from multiple views. We
demonstrate SE(3)-equivariance by obtaining robust results on roto-translated
datasets without performing transformation augmentation.
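The equivariance property the paper verifies on roto-translated data can be illustrated with a toy point-based layer. The centroid-based map below is a hypothetical example of an SE(3)-equivariant operation, far simpler than convolution on ray space: applying a rotation and translation before or after the layer gives the same result:

```python
import numpy as np

def random_rotation(rng):
    """Random 3D rotation via QR decomposition (a proper rotation)."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1                 # ensure det = +1
    return q

def equivariant_layer(x, alpha=0.3):
    """A toy SE(3)-equivariant map: push each point away from the
    centroid. Rotating/translating the input rotates/translates the
    output, since the centroid transforms along with the points."""
    return x + alpha * (x - x.mean(axis=0))

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 3))
R, t = random_rotation(rng), rng.standard_normal(3)

out_then_transform = equivariant_layer(x) @ R.T + t
transform_then_out = equivariant_layer(x @ R.T + t)
```

Equivariant networks chain many such maps, so the whole pipeline commutes with the choice of coordinate frame and no rotation augmentation is needed during training.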
Boosting Deep Neural Networks with Geometrical Prior Knowledge: A Survey
While Deep Neural Networks (DNNs) achieve state-of-the-art results in many
different problem settings, they are affected by some crucial weaknesses. On
the one hand, DNNs depend on exploiting a vast amount of training data, whose
labeling process is time-consuming and expensive. On the other hand, DNNs are
often treated as black box systems, which complicates their evaluation and
validation. Both problems can be mitigated by incorporating prior knowledge
into the DNN.
One promising field, inspired by the success of convolutional neural networks
(CNNs) in computer vision tasks, is to incorporate knowledge about symmetric
geometrical transformations of the problem to be solved. This promises increased
data efficiency and more easily interpretable filter responses. In
this survey, we give a concise overview of different approaches to
incorporating geometrical prior knowledge into DNNs. Additionally, we
connect these methods to the field of 3D object detection for autonomous
driving, where we expect promising results from applying them.
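The survey's central example, extending the translation equivariance of CNNs to other symmetry groups, can be demonstrated with a toy p4 (90-degree rotation) lifting correlation in numpy. The helper `corr2d` and the check below are illustrative, not from the survey: correlating with all four rotated copies of a kernel makes rotation of the input act predictably on the output:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def corr2d(img, k):
    """Valid 2D cross-correlation."""
    windows = sliding_window_view(img, k.shape)
    return np.einsum('ijkl,kl->ij', windows, k)

def p4_conv(img, base_kernel):
    """Lifting correlation for the p4 group: correlate the image with the
    base kernel and its three 90-degree rotations, one channel each."""
    return np.stack([corr2d(img, np.rot90(base_kernel, r)) for r in range(4)])

rng = np.random.default_rng(2)
img = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

out = p4_conv(img, k)                     # shape (4, 6, 6)
out_rot = p4_conv(np.rot90(img), k)       # rotate the input instead

# Equivariance: rotating the input rotates each channel spatially and
# cyclically shifts the rotation channels.
expected = np.rot90(np.roll(out, 1, axis=0), axes=(1, 2))
```

Because the rotated input only permutes and rotates the output channels, a pooling over the group dimension yields rotation-invariant features, which is the data-efficiency argument the survey makes for geometric priors.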