OCTraN: 3D Occupancy Convolutional Transformer Network in Unstructured Traffic Scenarios
Modern approaches for vision-centric environment perception for autonomous
navigation make extensive use of self-supervised monocular depth estimation
algorithms that output disparity maps. However, when this disparity map is
projected onto 3D space, the errors in disparity are magnified, resulting in a
depth estimation error that increases quadratically as the distance from the
camera increases. Though Light Detection and Ranging (LiDAR) can solve this
issue, it is expensive and not feasible for many applications. To address the
challenge of accurate ranging with low-cost sensors, we propose OCTraN, a
transformer architecture that uses iterative attention to convert 2D image
features into 3D occupancy features and makes use of convolution and transpose
convolution to efficiently operate on spatial information. We also develop a
self-supervised training pipeline that generalizes the model to any scene,
replacing LiDAR ground truth with
pseudo-ground truth labels obtained from boosted monocular depth estimation.
Comment: This work was accepted as a spotlight presentation at the
Transformers for Vision Workshop @CVPR 202
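
The abstract's claim that depth error grows quadratically with distance can be illustrated with a small sketch. It assumes the standard pinhole stereo relation Z = f * B / d (depth Z, focal length f in pixels, baseline B, disparity d), from which a first-order error estimate |dZ| ≈ Z^2 / (f * B) * |dd| follows. The function names and the focal length and baseline values below are illustrative assumptions, not values taken from the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, disparity_error_px, focal_px, baseline_m):
    """First-order depth error for a given disparity error.

    Differentiating Z = f * B / d gives |dZ| = f * B / d^2 * |dd|,
    and substituting d = f * B / Z yields |dZ| = Z^2 / (f * B) * |dd|.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


if __name__ == "__main__":
    # Hypothetical camera parameters, chosen only for illustration.
    focal_px = 720.0
    baseline_m = 0.54
    disparity_error_px = 0.5

    for depth_m in (10.0, 20.0, 40.0):
        err = depth_error(depth_m, disparity_error_px, focal_px, baseline_m)
        print(f"depth {depth_m:5.1f} m -> approx. error {err:.3f} m")
```

Under this model, doubling the distance quadruples the depth error for the same disparity error, which is the effect the abstract describes.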