Multimodal Scale Consistency and Awareness for Monocular Self-Supervised Depth Estimation
Dense depth estimation is essential to scene-understanding for autonomous
driving. However, recent self-supervised approaches on monocular videos suffer
from scale-inconsistency across long sequences. Utilizing data from the
ubiquitously copresent global positioning systems (GPS), we tackle this
challenge by proposing a dynamically-weighted GPS-to-Scale (g2s) loss to
complement the appearance-based losses. We emphasize that the GPS is needed
only during the multimodal training, and not at inference. The relative
distance between frames captured through the GPS provides a scale signal that
is independent of the camera setup and scene distribution, resulting in richer
learned feature representations. Through extensive evaluation on multiple
datasets, we demonstrate scale-consistent and -aware depth estimation during
inference, improving the performance even when training with low-frequency GPS
data.
Comment: Accepted at 2021 IEEE International Conference on Robotics and
Automation (ICRA)
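The idea of using the GPS distance between frames as a scale signal can be illustrated with a minimal sketch. This is not the authors' g2s loss, whose exact form and dynamic weighting are not given in the abstract. The function names, the haversine distance, the absolute-difference penalty, and the constant `weight` parameter are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius used by the haversine formula


def gps_distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS fixes given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def g2s_loss(pred_translations, gps_fixes, weight=1.0):
    """Hypothetical scale loss: weighted mean of | ||t_pred|| - d_gps |.

    pred_translations: one predicted (tx, ty, tz) per consecutive frame pair.
    gps_fixes: one (lat, lon) per frame, so len(gps_fixes) == len(pred) + 1.
    weight: stand-in for the dynamic weighting, which the abstract does not specify.
    """
    if len(pred_translations) != len(gps_fixes) - 1:
        raise ValueError("need exactly one translation per consecutive frame pair")
    total = 0.0
    for t, (a, b) in zip(pred_translations, zip(gps_fixes, gps_fixes[1:])):
        d_gps = gps_distance_m(a[0], a[1], b[0], b[1])
        d_pred = math.sqrt(sum(c * c for c in t))
        total += abs(d_pred - d_gps)
    return weight * total / len(pred_translations)
```

Because the penalty depends only on the magnitude of the predicted translation, a term of this kind needs GPS during training only, which matches the abstract's statement that GPS is not required at inference.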
Monocular Vision based Crowdsourced 3D Traffic Sign Positioning with Unknown Camera Intrinsics and Distortion Coefficients
Autonomous vehicles and driver assistance systems utilize maps of 3D semantic
landmarks for improved decision making. However, scaling the mapping process as
well as regularly updating such maps come with a huge cost. Crowdsourced
mapping of these landmarks such as traffic sign positions provides an appealing
alternative. The state-of-the-art approaches to crowdsourced mapping use ground
truth camera parameters, which may not always be known or may change over time.
In this work, we demonstrate an approach to computing 3D traffic sign positions
without knowing the camera focal lengths, principal point, and distortion
coefficients a priori. We validate our proposed approach on a public dataset of
traffic signs in KITTI. Using only a monocular color camera and GPS, we achieve
an average single journey relative and absolute positioning accuracy of 0.26 m
and 1.38 m, respectively.
Comment: Accepted at 2020 IEEE 23rd International Conference on Intelligent
Transportation Systems (ITSC)
Adversarial Attacks on Monocular Pose Estimation
Advances in deep learning have resulted in steady progress in computer vision
with improved accuracy on tasks such as object detection and semantic
segmentation. Nevertheless, deep neural networks are vulnerable to adversarial
attacks, thus presenting a challenge in reliable deployment. Two of the
prominent tasks in 3D scene-understanding for robotics and advanced driver
assistance systems are monocular depth and pose estimation, often learned
together in an unsupervised manner. While studies evaluating the impact of
adversarial attacks on monocular depth estimation exist, a systematic
demonstration and analysis of adversarial perturbations against pose estimation
are lacking. We show how additive imperceptible perturbations can not only
change predictions to increase the trajectory drift but also catastrophically
alter its geometry. We also study the relation between adversarial
perturbations targeting monocular depth and pose estimation networks, as well
as the transferability of perturbations to other networks with different
architectures and losses. Our experiments show how the generated perturbations
lead to notable errors in relative rotation and translation predictions and
elucidate vulnerabilities of the networks.
Comment: Accepted at the 2022 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS 2022)
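Why small errors in relative pose predictions matter for trajectory drift can be shown with a toy example. The sketch below is not the attack from the paper. It assumes a simplified planar (2D) motion model and a hypothetical constant yaw bias of 0.01 rad per step standing in for the effect of a perturbation, and it only illustrates how per-step errors accumulate when relative poses are chained.

```python
import math


def integrate(steps):
    """Chain relative (forward_distance_m, yaw_change_rad) steps into 2D positions."""
    x = y = theta = 0.0
    path = [(x, y)]
    for dist, dyaw in steps:
        theta += dyaw                 # apply the relative rotation first
        x += dist * math.cos(theta)   # then move forward along the new heading
        y += dist * math.sin(theta)
        path.append((x, y))
    return path


def endpoint_drift(clean_steps, perturbed_steps):
    """Euclidean distance between the final positions of two step sequences."""
    xa, ya = integrate(clean_steps)[-1]
    xb, yb = integrate(perturbed_steps)[-1]
    return math.hypot(xa - xb, ya - yb)


# Assumed toy data: 100 one-metre steps straight ahead, and the same steps
# with a small constant yaw error added to every relative rotation.
clean = [(1.0, 0.0)] * 100
biased = [(1.0, 0.01)] * 100

drift_after_50 = endpoint_drift(clean[:50], biased[:50])
drift_after_100 = endpoint_drift(clean, biased)
```

In this toy setting the drift after 100 steps is larger than after 50, and the biased path bends into an arc rather than staying straight, which is a simple analogue of a perturbation altering the geometry of the trajectory.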
Crowdsourced 3D Mapping: A Combined Multi-View Geometry and Self-Supervised Learning Approach
The ability to efficiently utilize crowdsourced visual data carries immense
potential for the domains of large scale dynamic mapping and autonomous
driving. However, state-of-the-art methods for crowdsourced 3D mapping assume
prior knowledge of camera intrinsics. In this work, we propose a framework that
estimates the 3D positions of semantically meaningful landmarks such as traffic
signs without assuming known camera intrinsics, using only a monocular color
camera and GPS. We utilize multi-view geometry as well as deep-learning-based
self-calibration, depth, and ego-motion estimation for traffic sign
positioning, and show that combining their strengths is important for
increasing the map coverage. To facilitate research on this task, we construct
and make available a KITTI-based 3D traffic sign ground-truth positioning
dataset. Using our proposed framework, we achieve an average single-journey
relative and absolute positioning accuracy of 39 cm and 1.26 m, respectively, on
this dataset.
Comment: Accepted at 2020 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS)
Practical Auto-Calibration for Spatial Scene-Understanding from Crowdsourced Dashcamera Videos
Spatial scene-understanding, including dense depth and ego-motion estimation,
is an important problem in computer vision for autonomous vehicles and advanced
driver assistance systems. Thus, it is beneficial to design perception modules
that can utilize crowdsourced videos collected from arbitrary vehicular onboard
or dashboard cameras. However, the intrinsic parameters corresponding to such
cameras are often unknown or change over time. Typical manual calibration
approaches require objects such as a chessboard or additional scene-specific
information. On the other hand, automatic camera calibration does not have such
requirements. Yet, the automatic calibration of dashboard cameras is
challenging as forward and planar navigation results in critical motion
sequences with reconstruction ambiguities. Structure reconstruction of complete
visual sequences that may contain tens of thousands of images is also
computationally untenable. Here, we propose a system for practical monocular
onboard camera auto-calibration from crowdsourced videos. We show the
effectiveness of our proposed system on the KITTI raw, Oxford RobotCar, and the
crowdsourced D-City datasets in varying conditions. Finally, we demonstrate
its application for accurate monocular dense depth and ego-motion estimation on
uncalibrated videos.
Comment: Accepted at 16th International Conference on Computer Vision Theory
and Applications (VISAPP, 2021)