The Right (Angled) Perspective: Improving the Understanding of Road Scenes Using Boosted Inverse Perspective Mapping
Many tasks performed by autonomous vehicles, such as road marking detection,
object tracking, and path planning, are simpler in bird's-eye view. Hence,
Inverse Perspective Mapping (IPM) is often applied to remove the perspective
effect from a vehicle's front-facing camera and to remap its images into a 2D
domain, resulting in a top-down view. Unfortunately, this leads to unnatural
blurring and stretching of objects at farther distances, due to the limited
resolution of the camera, which restricts its applicability. In this paper, we present an
adversarial learning approach for generating a significantly improved IPM from
a single camera image in real time. The generated bird's-eye-view images
contain sharper features (e.g. road markings) and a more homogeneous
illumination, while (dynamic) objects are automatically removed from the scene,
thus revealing the underlying road layout in an improved fashion. We
demonstrate our framework using real-world data from the Oxford RobotCar
Dataset and show that scene understanding tasks directly benefit from our
boosted IPM approach.
Comment: equal contribution of first two authors, 8 full pages, 6 figures, accepted at IV 201
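For reference, the classical IPM this paper improves upon is typically implemented as a planar homography that remaps front-camera pixels onto an assumed flat ground plane. A minimal sketch of that baseline (not the paper's learned approach) is shown below; the calibration point values are hypothetical placeholders.

# Minimal sketch of classical IPM via a planar homography (illustrative only;
# the src/dst calibration points below are hypothetical, not from the paper).
import cv2
import numpy as np

# Four points on the road plane in the front-facing camera image (pixels)...
src = np.float32([[520, 460], [760, 460], [1180, 700], [100, 700]])
# ...and where they should land in the top-down (bird's-eye-view) image.
dst = np.float32([[300, 0], [500, 0], [500, 600], [300, 600]])

H = cv2.getPerspectiveTransform(src, dst)        # 3x3 homography
frame = cv2.imread("front_camera.png")           # placeholder input image
bev = cv2.warpPerspective(frame, H, (800, 600))  # remapped top-down view
cv2.imwrite("bev.png", bev)

Because distant pixels cover large ground areas after this warp, the resulting stretching and blurring is exactly the artifact the adversarial approach above is designed to remove.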
Semantic Foreground Inpainting from Weak Supervision
Semantic scene understanding is an essential task for self-driving vehicles
and mobile robots. In our work, we aim to estimate a semantic segmentation map,
in which the foreground objects are removed and semantically inpainted with
background classes, from a single RGB image. This semantic foreground
inpainting task is performed by a single-stage convolutional neural network
(CNN) that contains our novel max-pooling as inpainting (MPI) module, which is
trained with weak supervision, i.e., it does not require manual background
annotations for the foreground regions to be inpainted. Our approach is
inherently more efficient than the previous two-stage state-of-the-art method
and outperforms it by a margin of 3% IoU for the inpainted foreground regions
on Cityscapes. The performance margin increases to 6% IoU when tested on the
unseen KITTI dataset. The code and the manually annotated datasets for testing
are shared with the research community at
https://github.com/Chenyang-Lu/semantic-foreground-inpainting.
Comment: RA-L and ICRA'2
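The abstract does not detail the MPI module, but the general idea of filling masked foreground regions with pooled features from surrounding background can be sketched as below. This is an illustrative simplification under assumed tensor shapes, not the authors' released implementation.

# Simplified sketch of "fill foreground with pooled background features";
# NOT the authors' exact max-pooling-as-inpainting (MPI) module.
import torch
import torch.nn.functional as F

def mpi_like_fill(features, fg_mask, kernel=15):
    # features: (B, C, H, W) feature map; fg_mask: (B, 1, H, W), 1 = foreground.
    bg = features.masked_fill(fg_mask.bool(), float("-inf"))
    # Max-pool so each location receives the strongest nearby background response.
    pooled = F.max_pool2d(bg, kernel_size=kernel, stride=1, padding=kernel // 2)
    pooled = torch.where(torch.isinf(pooled), torch.zeros_like(pooled), pooled)
    # Keep original features in background regions, pooled ones inside the foreground.
    return torch.where(fg_mask.bool(), pooled, features)

Because the fill is driven by pooling rather than by extra annotated labels, such a module can be trained with the weak supervision described above.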
InstaGraM: Instance-level Graph Modeling for Vectorized HD Map Learning
Inferring traffic objects such as lane information is of foremost importance
for the deployment of autonomous driving. Previous approaches focus on offline
construction of HD maps inferred with GPS localization, which is insufficient
for globally scalable autonomous driving. To alleviate these issues, we propose
an online HD map learning framework that detects HD map elements from onboard
sensor observations. We represent the map elements as a graph and propose
InstaGraM, an instance-level graph modeling of HD maps that enables accurate and
fast end-to-end vectorized HD map learning. Along with this graph modeling
strategy, we propose an end-to-end neural network composed of three stages: a
unified BEV feature extraction, map graph component detection, and association
via graph neural networks. Comprehensive experiments on a public open dataset
show that our proposed network outperforms previous models by up to 13.7 mAP
with up to 33.8X faster computation time.
Comment: Workshop on Vision-Centric Autonomous Driving (VCAD) at Conference on Computer Vision and Pattern Recognition (CVPR) 202
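The abstract only names the graph formulation, so the sketch below merely illustrates the general pattern of treating detected map points as graph vertices and predicting their associations (edges) from learned embeddings. Shapes and module names are hypothetical, not InstaGraM's code.

# Illustrative sketch: predict a soft adjacency matrix over detected map-element
# points from their embeddings; hypothetical, not the InstaGraM implementation.
import torch
import torch.nn as nn

class VertexAssociation(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Linear(dim, dim)

    def forward(self, vertex_feats):
        # vertex_feats: (N, dim) features of N detected map points (e.g. from BEV features).
        e = self.embed(vertex_feats)
        # Pairwise similarity scores; an edge links points belonging to the same map instance.
        scores = e @ e.t() / e.shape[-1] ** 0.5
        return scores.softmax(dim=-1)

# Usage with dummy data:
adj = VertexAssociation()(torch.randn(50, 64))  # (50, 50) soft adjacency

Grouping vertices by such predicted edges yields vectorized polylines directly, which is what makes the end-to-end formulation fast compared with dense rasterized map decoding.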
Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation
Video amodal segmentation is a particularly challenging task in computer
vision, which requires deducing the full shape of an object from its visible
parts. Recently, some studies have achieved promising performance by using
motion flow to integrate information across frames under a self-supervised
setting. However, motion flow is clearly limited by two factors: moving cameras
and object deformation. This paper presents a rethinking of previous works. We
particularly leverage supervised signals with object-centric representation in
real-world scenarios. The underlying idea is that the supervision signal of a
specific object and the features from different views can mutually benefit the
deduction of its full mask in any specific frame. We thus propose Efficient
object-centric Representation amodal Segmentation (EoRaS). Specifically, beyond solely relying on
supervision signals, we design a translation module to project image features
into the Bird's-Eye View (BEV), which introduces 3D information to improve
current feature quality. Furthermore, we propose a temporal module based on
multi-view fusion layers, which is equipped with a set of object slots and
interacts with features from different views via an attention mechanism to
achieve sufficient object representation completion. As a result, the full mask of the
object can be decoded from image features updated by object slots. Extensive
experiments on both real-world and synthetic benchmarks demonstrate the
superiority of our proposed method, achieving state-of-the-art performance. Our
code will be released at https://github.com/kfan21/EoRaS.
Comment: Accepted by ICCV 202
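As a rough illustration of the object-slot idea described above, the sketch below shows a set of learnable slots attending to flattened features from different views via cross-attention. Dimensions and module structure are assumptions for illustration, not the released EoRaS code.

# Simplified sketch of object slots attending to multi-view features via
# cross-attention; hypothetical dimensions, not the EoRaS implementation.
import torch
import torch.nn as nn

class SlotFusion(nn.Module):
    def __init__(self, num_slots=8, dim=128):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, view_feats):
        # view_feats: (B, T, dim) flattened features from image and BEV views.
        slots = self.slots.unsqueeze(0).expand(view_feats.shape[0], -1, -1)
        updated, _ = self.attn(query=slots, key=view_feats, value=view_feats)
        return updated  # (B, num_slots, dim) object-centric representations

out = SlotFusion()(torch.randn(2, 196, 128))

The updated slots would then condition a mask decoder so that the full (amodal) mask is recovered from image features enriched with the slot information.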