54 research outputs found
VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry
Semantic understanding and localization are fundamental enablers of robot
autonomy that have for the most part been tackled as disjoint problems. While
deep learning has enabled recent breakthroughs across a wide spectrum of scene
understanding tasks, its applicability to state estimation tasks has been
limited due to the direct formulation that renders it incapable of encoding
scene-specific constraints. In this work, we propose the VLocNet++ architecture
that employs a multitask learning approach to exploit the inter-task
relationship between learning semantics, regressing 6-DoF global pose and
odometry, for the mutual benefit of each of these tasks. Our network overcomes
the aforementioned limitation by simultaneously embedding geometric and
semantic knowledge of the world into the pose regression network. We propose a
novel adaptive weighted fusion layer to aggregate motion-specific temporal
information and to fuse semantic features into the localization stream based on
region activations. Furthermore, we propose a self-supervised warping technique
that uses the relative motion to warp intermediate network representations in
the segmentation stream for learning consistent semantics. Finally, we
introduce a first-of-a-kind urban outdoor localization dataset with pixel-level
semantic labels and multiple loops for training deep networks. Extensive
experiments on the challenging Microsoft 7-Scenes benchmark and our DeepLoc
dataset demonstrate that our approach exceeds the state of the art,
outperforming local feature-based methods while simultaneously performing
multiple tasks and exhibiting substantial robustness in challenging scenarios.
Comment: Demo and dataset available at http://deeploc.cs.uni-freiburg.d
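As a rough illustration of the adaptive weighted fusion idea described above, the following sketch shows how per-location weights predicted from region activations could gate how much semantic information enters the localization stream. It is a minimal PyTorch example with assumed layer sizes and names, not the authors' implementation.

```python
# Hypothetical sketch of an adaptive weighted fusion layer; channel sizes and
# the single-conv gating network are assumptions for illustration only.
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    """Fuses two feature maps with weights predicted from their activations."""
    def __init__(self, channels):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, loc_feat, sem_feat):
        # w in [0, 1] decides, per location and channel, how much semantic
        # information is injected into the localization stream.
        w = self.weight_net(torch.cat([loc_feat, sem_feat], dim=1))
        return w * sem_feat + (1.0 - w) * loc_feat

# Usage with dummy tensors (batch of 2, 256 channels, 28x28 feature maps).
fusion = AdaptiveWeightedFusion(256)
fused = fusion(torch.randn(2, 256, 28, 28), torch.randn(2, 256, 28, 28))
print(fused.shape)  # torch.Size([2, 256, 28, 28])
```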
CMRNet++: Map and Camera Agnostic Monocular Visual Localization in LiDAR Maps
Localization is a critical enabler of autonomous robots. While deep learning
has made significant strides in many computer vision tasks, it has yet to make
a sizeable impact on improving the capabilities of metric visual localization.
One of the major hindrances has
been the inability of existing Convolutional Neural Network (CNN)-based pose
regression methods to generalize to previously unseen places. Our recently
introduced CMRNet effectively addresses this limitation by enabling
map-independent monocular localization in LiDAR maps. In this paper, we now take it
a step further by introducing CMRNet++, which is a significantly more robust
model that not only generalizes to new places effectively, but is also
independent of the camera parameters. We enable this capability by combining
deep learning with geometric techniques, and by moving the metric reasoning
outside the learning process. In this way, the weights of the network are not
tied to a specific camera. Extensive evaluations of CMRNet++ on three
challenging autonomous driving datasets, i.e., KITTI, Argoverse, and Lyft5,
show that CMRNet++ outperforms CMRNet as well as other baselines by a large
margin. More importantly, for the first time, we demonstrate the ability of a
deep learning approach to accurately localize without any retraining or
fine-tuning in a completely new environment and independent of the camera
parameters.
Comment: Spotlight talk at IEEE ICRA 2020 Workshop on Emerging Learning and
Algorithmic Methods for Data Association in Robotic
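To illustrate what moving the metric reasoning outside the learning process can look like in practice, the sketch below recovers a camera pose from network-predicted 2D-3D correspondences with standard PnP + RANSAC, so the learned weights never depend on the intrinsics. The correspondences and camera matrix here are synthetic placeholders, not the CMRNet++ pipeline itself.

```python
# Hedged sketch: pose recovery from predicted image-to-map correspondences via
# PnP + RANSAC (OpenCV). All data below is randomly generated for illustration.
import numpy as np
import cv2

def recover_pose(map_points_3d, predicted_pixels_2d, K, dist_coeffs=None):
    """map_points_3d: (N, 3) LiDAR map points; predicted_pixels_2d: (N, 2)
    image locations predicted by a matching network; K: 3x3 intrinsics."""
    if dist_coeffs is None:
        dist_coeffs = np.zeros(4)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        map_points_3d.astype(np.float64),
        predicted_pixels_2d.astype(np.float64),
        K.astype(np.float64),
        dist_coeffs,
        reprojectionError=2.0,
    )
    return ok, rvec, tvec, inliers

# Synthetic example: project random map points with a generic pinhole camera.
K = np.array([[718.0, 0.0, 620.0], [0.0, 718.0, 188.0], [0.0, 0.0, 1.0]])
pts3d = np.random.uniform(-10, 10, size=(100, 3)) + np.array([0.0, 0.0, 20.0])
pts2d = (K @ (pts3d / pts3d[:, 2:3]).T).T[:, :2]
print(recover_pose(pts3d, pts2d, K)[0])  # True if RANSAC finds a pose
```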
Vision-based Autonomous Landing in Catastrophe-Struck Environments
Unmanned Aerial Vehicles (UAVs) equipped with bioradars are a life-saving
technology that can enable identification of survivors under collapsed
buildings in the aftermath of natural disasters such as earthquakes or gas
explosions. However, these UAVs have to be able to autonomously land on debris
piles in order to accurately locate the survivors. This problem is extremely
challenging as the structure of these debris piles is often unknown and no
prior knowledge can be leveraged. In this work, we propose a computationally
efficient system that is able to reliably identify safe landing sites and
autonomously perform the landing maneuver. Specifically, our algorithm computes
costmaps based on several hazard factors including terrain flatness, steepness,
depth accuracy and energy consumption information. We first estimate dense
candidate landing sites from the resulting costmap and then employ clustering
to group neighboring sites into a safe landing region. Finally, a minimum-jerk
trajectory is computed for landing considering the surrounding obstacles and
the UAV dynamics. We demonstrate the efficacy of our system using experiments
from a city-scale hyperrealistic simulation environment and in real-world
scenarios with collapsed buildings.
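The costmap construction can be pictured as a weighted combination of normalized hazard layers; the sketch below is a toy version with assumed weights and layer names, not the onboard implementation.

```python
# Illustrative costmap combination; hazard-layer names and weights are assumed.
import numpy as np

def landing_costmap(flatness, steepness, depth_uncertainty, energy,
                    weights=(0.4, 0.3, 0.2, 0.1)):
    """Each input is an HxW array normalized to [0, 1]; lower cost = safer."""
    layers = [1.0 - flatness, steepness, depth_uncertainty, energy]
    return sum(w * layer for w, layer in zip(weights, layers))

# Dummy 100x100 hazard maps; pick the lowest-cost cell as a candidate site.
rng = np.random.default_rng(0)
cost = landing_costmap(*[rng.random((100, 100)) for _ in range(4)])
best = np.unravel_index(np.argmin(cost), cost.shape)
print("safest cell:", best, "cost:", round(float(cost[best]), 3))
```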
Deep Auxiliary Learning for Visual Localization and Odometry
Localization is an indispensable component of a robot's autonomy stack that
enables it to determine where it is in the environment, essentially making it a
precursor for any action execution or planning. Although convolutional neural
networks have shown promising results for visual localization, they are still
grossly outperformed by state-of-the-art local feature-based techniques. In
this work, we propose VLocNet, a new convolutional neural network architecture
for 6-DoF global pose regression and odometry estimation from consecutive
monocular images. Our multitask model incorporates hard parameter sharing, thus
being compact and enabling real-time inference, in addition to being end-to-end
trainable. We propose a novel loss function that utilizes auxiliary learning to
leverage relative pose information during training, thereby constraining the
search space to obtain consistent pose estimates. We evaluate our proposed
VLocNet on indoor as well as outdoor datasets and show that even our single
task model exceeds the performance of state-of-the-art deep architectures for
global localization, while achieving competitive performance for visual
odometry estimation. Furthermore, we present extensive experimental evaluations
utilizing our proposed Geometric Consistency Loss that show the effectiveness
of multitask learning and demonstrate that our model is the first deep learning
technique to be on par with, and in some cases outperform, state-of-the-art
SIFT-based approaches.
Comment: Accepted for ICRA 201
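The auxiliary idea of constraining global pose estimates with relative motion can be sketched as follows; this is a simplified, translation-only illustration of a consistency term, not the paper's full Geometric Consistency Loss (which also handles rotations and loss weighting).

```python
# Simplified consistency term: penalize the mismatch between the relative
# motion implied by consecutive global predictions and the ground-truth
# relative motion. Translation-only; alpha is an assumed weighting factor.
import torch

def consistency_loss(pred_t, pred_t_prev, gt_t, gt_t_prev, alpha=1.0):
    """All inputs are (B, 3) translation vectors for frames t and t-1."""
    global_term = torch.norm(pred_t - gt_t, dim=1).mean()
    pred_rel = pred_t - pred_t_prev          # relative motion from predictions
    gt_rel = gt_t - gt_t_prev                # ground-truth relative motion
    relative_term = torch.norm(pred_rel - gt_rel, dim=1).mean()
    return global_term + alpha * relative_term

loss = consistency_loss(torch.randn(8, 3), torch.randn(8, 3),
                        torch.randn(8, 3), torch.randn(8, 3))
print(loss.item())
```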
Robust Vision Challenge 2020 -- 1st Place Report for Panoptic Segmentation
In this technical report, we present key details of our winning panoptic
segmentation architecture EffPS_b1bs4_RVC. Our network is a lightweight version
of our state-of-the-art EfficientPS architecture that consists of our proposed
shared backbone with a modified EfficientNet-B5 model as the encoder, followed
by the 2-way FPN to learn semantically rich multi-scale features. It consists
of two task-specific heads, a modified Mask R-CNN instance head and our novel
semantic segmentation head that processes features of different scales with
specialized modules for coherent feature refinement. Finally, our proposed
panoptic fusion module adaptively fuses logits from each of the heads to yield
the panoptic segmentation output. The Robust Vision Challenge 2020 benchmarking
results show that our model is ranked #1 on Microsoft COCO, VIPER and WildDash,
and is ranked #2 on Cityscapes and Mapillary Vistas, thereby achieving the
overall rank #1 for the panoptic segmentation task.
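The adaptive fusion of instance and semantic logits can be pictured as requiring agreement between the two heads; the snippet below is a simplified per-instance version of such a fusion and is not the exact published formulation.

```python
# Rough sketch of adaptive logit fusion for one detected instance: the
# instance mask logits and the corresponding class's semantic logits modulate
# each other before being summed. Simplification for illustration only.
import torch

def fuse_logits(instance_mask_logits, semantic_class_logits):
    """Both tensors are (H, W) logits cropped to one instance's region."""
    a, b = instance_mask_logits, semantic_class_logits
    return (torch.sigmoid(a) + torch.sigmoid(b)) * (a + b)

fused = fuse_logits(torch.randn(64, 64), torch.randn(64, 64))
print(fused.shape)  # torch.Size([64, 64])
```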
Deep Spatiotemporal Models for Robust Proprioceptive Terrain Classification
Terrain classification is a critical component of any autonomous mobile robot
system operating in unknown real-world environments. Over the years, several
proprioceptive terrain classification techniques have been introduced to
increase robustness or act as a fallback for traditional vision-based
approaches. However, they lack widespread adoption due to various factors
that include inadequate accuracy and robustness as well as slow run-times. In
this paper,
we use vehicle-terrain interaction sounds as a proprioceptive modality and
propose a deep Long Short-Term Memory (LSTM) based recurrent model that
captures both the spatial and temporal dynamics of such a problem, thereby
overcoming these past limitations. Our model consists of a new Convolutional
Neural Network (CNN) architecture that learns deep spatial features,
complemented with LSTM units that learn complex temporal dynamics. Experiments
on two extensive datasets collected with different microphones on various
indoor and outdoor terrains demonstrate state-of-the-art performance compared
to existing techniques. We additionally evaluate the performance in adverse
acoustic conditions with high ambient noise and propose a noise-aware training
scheme that enables learning of more generalizable models that are essential
for robust real-world deployments.
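A minimal sketch of the CNN-plus-LSTM idea and of noise-aware training by mixing ambient noise at a random signal-to-noise ratio is shown below; layer sizes, the number of terrain classes, and the spectrogram shape are assumptions.

```python
# Assumed-shape sketch: per-frame convolutional features feed an LSTM that
# models temporal dynamics; add_noise mixes ambient noise at a target SNR.
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=9):
        super().__init__()
        self.cnn = nn.Sequential(               # spectral features per frame
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, hidden, batch_first=True)  # temporal dynamics
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, spec):                    # spec: (B, n_mels, T)
        feats = self.cnn(spec).transpose(1, 2)  # (B, T, 128)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])            # classify from the last state

def add_noise(signal, noise, snr_db):
    """Mix ambient noise into a waveform at the requested SNR in dB."""
    sig_p, noise_p = signal.pow(2).mean(), noise.pow(2).mean()
    scale = torch.sqrt(sig_p / (noise_p * 10 ** (snr_db / 10)))
    return signal + scale * noise

model = CNNLSTMClassifier()
print(model(torch.randn(4, 64, 100)).shape)     # torch.Size([4, 9])
```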
Dynamic Object Removal and Spatio-Temporal RGB-D Inpainting via Geometry-Aware Adversarial Learning
Dynamic objects have a significant impact on the robot's perception of the
environment, which degrades the performance of essential tasks such as
localization and mapping. In this work, we address this problem by synthesizing
plausible color, texture and geometry in regions occluded by dynamic objects.
We propose the novel geometry-aware DynaFill architecture that follows a
coarse-to-fine topology and incorporates our gated recurrent feedback mechanism
to adaptively fuse information from previous timesteps. We optimize our
architecture using adversarial training to synthesize fine, realistic textures,
which enables it to hallucinate color and depth structure in occluded regions
online in a spatially and temporally coherent manner, without relying on future
frame information. Casting our inpainting problem as an image-to-image
translation task, our model also corrects regions correlated with the presence
of dynamic objects in the scene, such as shadows or reflections. We introduce a
large-scale hyperrealistic dataset with RGB-D images, semantic segmentation
labels, camera poses as well as groundtruth RGB-D information of occluded
regions. Extensive quantitative and qualitative evaluations show that our
approach achieves state-of-the-art performance, even in challenging weather
conditions. Furthermore, we present results for retrieval-based visual
localization with the synthesized images that demonstrate the utility of our
approach.
Comment: Dataset, code and models are available at
http://rl.uni-freiburg.de/research/rgbd-inpaintin
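The adversarial objective can be sketched with stand-in networks: an L1 reconstruction term restricted to the occluded region plus a generator adversarial term. Everything below (network shapes, loss weight, adversarial term) is an assumption for illustration; the actual coarse-to-fine architecture and temporal warping are omitted.

```python
# Toy adversarial inpainting step with stand-in generator/discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

gen = nn.Conv2d(5, 4, 3, padding=1)   # stand-in: masked RGB-D + mask -> RGB-D
disc = nn.Sequential(nn.Conv2d(4, 1, 3, padding=1),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)

rgbd = torch.randn(2, 4, 64, 64)                 # input RGB-D frame
mask = torch.rand(2, 1, 64, 64).round()          # 1 = dynamic-object pixel
gt = torch.randn(2, 4, 64, 64)                   # ground-truth static scene

fake = gen(torch.cat([rgbd * (1 - mask), mask], dim=1))
rec = F.l1_loss(fake * mask, gt * mask)          # reconstruct occluded pixels
adv = -disc(fake).mean()                         # generator adversarial term
loss = rec + 0.01 * adv                          # assumed loss weighting
opt_g.zero_grad(); loss.backward(); opt_g.step()
print(float(loss))
```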
Multimodal Interaction-aware Motion Prediction for Autonomous Street Crossing
For mobile robots navigating on sidewalks, it is essential to be able to
safely cross street intersections. Most existing approaches rely on the
recognition of the traffic light signal to make an informed crossing decision.
Although these approaches have been crucial enablers for urban navigation, the
capabilities of robots employing such approaches are still limited to
navigating only on streets containing signalized intersections. In this paper,
we address this challenge and propose a multimodal convolutional neural network
framework to predict the safety of a street intersection for crossing. Our
architecture consists of two subnetworks: an interaction-aware trajectory
estimation stream, IA-TCNN, that predicts the future states of all observed
traffic participants in the scene, and a traffic light recognition stream,
AtteNet. Our IA-TCNN utilizes dilated causal convolutions to model the behavior
of the observable dynamic agents in the scene without explicitly assigning
priorities to the interactions among them, while AtteNet utilizes
Squeeze-Excitation blocks to learn a content-aware mechanism for selecting the
relevant features from the data, thereby improving noise robustness.
Learned representations from the traffic light recognition stream are fused
with the estimated trajectories from the motion prediction stream to learn the
crossing decision. Furthermore, we extend our previously introduced Freiburg
Street Crossing dataset with sequences captured at different types of
intersections, demonstrating complex interactions among the traffic
participants. Extensive experimental evaluations on public benchmark datasets
and our proposed dataset demonstrate that our network achieves state-of-the-art
performance for each of the subtasks, as well as for the crossing safety
prediction.
Comment: The International Journal of Robotics Research (2020
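The dilated causal convolution idea behind the trajectory stream can be sketched as follows: padding only on the past side keeps the prediction causal, and growing dilation enlarges the temporal receptive field cheaply. Channel sizes and the input encoding are assumptions.

```python
# Assumed sketch of a dilated causal Conv1d stack for trajectory prediction.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # pad the past side only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))   # no future leakage
        return self.conv(x)

net = nn.Sequential(
    CausalConv1d(2, 64, dilation=1), nn.ReLU(),
    CausalConv1d(64, 64, dilation=2), nn.ReLU(),
    CausalConv1d(64, 2, dilation=4),              # predict (x, y) offsets
)
traj = torch.randn(8, 2, 30)                      # 8 agents, 30 observed steps
print(net(traj).shape)                            # torch.Size([8, 2, 30])
```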
Learning Kinematic Feasibility for Mobile Manipulation through Deep Reinforcement Learning
Mobile manipulation tasks remain one of the critical challenges for the
widespread adoption of autonomous robots in both service and industrial
scenarios. While planning approaches are good at generating feasible whole-body
robot trajectories, they struggle with dynamic environments as well as the
incorporation of constraints given by the task and the environment. On the
other hand, dynamic motion models in the action space struggle with generating
kinematically feasible trajectories for mobile manipulation actions. We propose
a deep reinforcement learning approach to learn feasible dynamic motions for a
mobile base while the end-effector follows a trajectory in task space generated
by an arbitrary system to fulfill the task at hand. This modular formulation
has several benefits: it enables us to readily transform a broad range of
end-effector motions into mobile applications, it allows us to use the
kinematic feasibility of the end-effector trajectory as a dense reward signal,
and it generalises to unseen end-effector motions at test time. We demonstrate
the capabilities of our approach on
multiple mobile robot platforms with different kinematic abilities and
different types of wheeled platforms in extensive simulated as well as
real-world experiments.
Comment: Accepted for publication in RA-L. Code and Models:
http://rl.uni-freiburg.de/research/kinematic-feasibility-r
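A dense reward built from kinematic feasibility might be shaped as in the toy sketch below: an IK stand-in checks whether the commanded base pose still allows the arm to reach the next end-effector waypoint, failures are penalized, and successes are shaped by a (here entirely toy) joint-limit margin. The solve_ik interface and all constants are hypothetical.

```python
# Hypothetical dense-reward sketch; 'solve_ik' is a placeholder, not a real API.
import numpy as np

def feasibility_reward(base_pose, ee_waypoint, solve_ik, penalty=-1.0):
    """base_pose: (x, y, yaw); ee_waypoint: desired 6-DoF end-effector pose."""
    q, success = solve_ik(base_pose, ee_waypoint)  # joint solution + flag
    if not success:
        return penalty                             # infeasible configuration
    margin = float(np.min(np.abs(q)))              # toy joint-limit margin
    return 0.1 * margin                            # dense shaping term

# Toy IK stand-in that always succeeds with a random 7-DoF configuration.
toy_ik = lambda base, ee: (np.random.uniform(-1, 1, size=7), True)
print(feasibility_reward((0.0, 0.0, 0.0), np.zeros(6), toy_ik))
```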
LCDNet: Deep Loop Closure Detection and Point Cloud Registration for LiDAR SLAM
Loop closure detection is an essential component of Simultaneous Localization
and Mapping (SLAM) systems, which reduces the drift accumulated over time. Over
the years, several deep learning approaches have been proposed to address this
task; however, their performance has been subpar compared to handcrafted
techniques, especially when dealing with reverse loops. In this paper, we
introduce the novel LCDNet that effectively detects loop closures in LiDAR
point clouds by simultaneously identifying previously visited places and
estimating the 6-DoF relative transformation between the current scan and the
map. LCDNet is composed of a shared encoder, a place recognition head that
extracts global descriptors, and a relative pose head that estimates the
transformation between two point clouds. We introduce a novel relative pose
head based on unbalanced optimal transport theory, which we implement in a
differentiable manner to allow for end-to-end training. Extensive evaluations
of LCDNet on multiple real-world autonomous driving datasets show that our
approach outperforms state-of-the-art loop closure detection and point cloud
registration techniques by a large margin, especially while dealing with
reverse loops. Moreover, we integrate our proposed loop closure detection
approach into a LiDAR SLAM library to provide a complete mapping system and
demonstrate its generalization ability using a different sensor setup in an
unseen city.
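The two ingredients, descriptor-based loop retrieval and differentiable matching, can be sketched as below. The matching step is a plain (balanced) Sinkhorn normalization over a feature cost matrix for illustration only; the paper uses an unbalanced optimal transport formulation, and the similarity threshold here is an assumption.

```python
# Sketch: loop candidates via global-descriptor similarity, soft point
# correspondences via Sinkhorn-style normalization of a cost matrix.
import torch

def sinkhorn(cost, n_iters=20, eps=0.1):
    """cost: (N, M) matching cost; returns a soft assignment matrix."""
    K = torch.exp(-cost / eps)
    u, v = torch.ones(cost.shape[0]), torch.ones(cost.shape[1])
    for _ in range(n_iters):
        u = 1.0 / (K @ v)        # scale rows
        v = 1.0 / (K.t() @ u)    # scale columns
    return torch.diag(u) @ K @ torch.diag(v)

# Loop detection by comparing global descriptors (assumed 0.8 threshold).
desc_query, desc_map = torch.randn(256), torch.randn(256)
is_loop = torch.cosine_similarity(desc_query, desc_map, dim=0) > 0.8

# Soft matches between two small sets of local point features.
cost = torch.cdist(torch.randn(50, 32), torch.randn(50, 32))
P = sinkhorn(cost)
print(bool(is_loop), P.shape)    # e.g. False torch.Size([50, 50])
```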