6D Object Pose Estimation from Approximate 3D Models for Orbital Robotics
We present a novel technique to estimate the 6D pose of objects from single images where the 3D geometry of the object is only given approximately and not as a precise 3D model. To achieve this, we employ a dense 2D-to-3D correspondence predictor that regresses 3D model coordinates for every pixel. In addition to the 3D coordinates, our model also estimates the pixel-wise coordinate error to discard correspondences that are likely wrong. This allows us to generate multiple 6D pose hypotheses of the object, which we then refine iteratively using a highly efficient region-based approach. We also introduce a novel pixel-wise posterior formulation by which we can estimate the probability of each hypothesis and select the most likely one. As we show in experiments, our approach is capable of dealing with extreme visual conditions including overexposure, high contrast, and low signal-to-noise ratio. This makes it a powerful technique for the particularly challenging task of estimating the pose of tumbling satellites for in-orbit robotic applications. Our method achieves state-of-the-art performance on the SPEED+ dataset and has won the SPEC2021 post-mortem competition.
Comment: preprint
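As an illustration of the correspondence-based hypothesis generation described above, the following sketch turns dense per-pixel 3D model coordinates and the predicted coordinate error into a single pose hypothesis via RANSAC-based PnP. It is a hedged sketch, not the authors' implementation: the input names, the error threshold, and the use of OpenCV's solvePnPRansac are assumptions, and the paper's region-based refinement and pixel-wise posterior selection are not shown.

```python
# Illustrative sketch (not the authors' code): turn dense 2D-to-3D
# correspondences and a predicted per-pixel error into one 6D pose
# hypothesis. Assumed inputs: coords_3d (H x W x 3 model coordinates
# regressed by the network), pixel_error (H x W predicted coordinate
# error), and the camera intrinsics K.
import numpy as np
import cv2

def pose_hypothesis(coords_3d, pixel_error, K, error_thresh=0.02):
    # Keep only pixels whose predicted coordinate error is small.
    ys, xs = np.where(pixel_error < error_thresh)
    if len(xs) < 6:
        return None  # not enough reliable correspondences
    img_pts = np.stack([xs, ys], axis=1).astype(np.float32)   # (u, v)
    obj_pts = coords_3d[ys, xs].astype(np.float32)             # (X, Y, Z)
    # Robust PnP over the filtered correspondences; RANSAC discards the
    # remaining outliers before any (not shown) region-based refinement.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        obj_pts, img_pts, K, None, reprojectionError=3.0)
    return (rvec, tvec) if ok else None
```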
Self-Supervised Object-in-Gripper Segmentation from Robotic Motions
Accurate object segmentation is a crucial task in the context of robotic
manipulation. However, creating sufficient annotated training data for neural
networks is particularly time consuming and often requires manual labeling. To
this end, we propose a simple, yet robust solution for learning to segment
unknown objects grasped by a robot. Specifically, we exploit motion and
temporal cues in RGB video sequences. Using optical flow estimation we first
learn to predict segmentation masks of our given manipulator. Then, these
annotations are used in combination with motion cues to automatically
distinguish between background, manipulator and unknown, grasped object. In
contrast to existing systems our approach is fully self-supervised and
independent of precise camera calibration, 3D models or potentially imperfect
depth data. We perform a thorough comparison with alternative baselines and
approaches from literature. The object masks and views are shown to be suitable
training data for segmentation networks that generalize to novel environments
and also allow for watertight 3D reconstruction.Comment: 15 pages, 11 figures. Video:
https://www.youtube.com/watch?v=srEwuuIIgz
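To make the motion-cue idea concrete, here is a minimal sketch of how a coarse motion mask could be obtained from dense optical flow between two consecutive RGB frames: pixels that move with the arm become candidates for manipulator/object labels, everything else is treated as background. This is an assumption-laden stand-in (classical Farneback flow and a fixed magnitude threshold), not the paper's learned pipeline.

```python
# Minimal sketch, assuming two consecutive BGR frames of the manipulation
# sequence. A classical dense flow estimator stands in for the learned one.
import numpy as np
import cv2

def motion_mask(frame_prev, frame_next, mag_thresh=1.0):
    g0 = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    # Dense Farneback optical flow between the two frames.
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    mask = (mag > mag_thresh).astype(np.uint8)
    # Small morphological opening to suppress spurious flow responses.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```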
"What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator. Our method successively learns an agnostic foreground segmentation followed by a distinction between manipulator and object, solely by observing the motion between consecutive RGB frames. In contrast to previous approaches, we propose a single, end-to-end trainable architecture that jointly incorporates motion cues and semantic knowledge. Furthermore, while the motions of the manipulator and the object are substantial cues for our algorithm, we present means to robustly deal with distracting objects moving in the background, as well as with completely static scenes. Our method depends neither on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data. Through extensive experimental evaluation we demonstrate the superiority of our framework and provide detailed insights into its capability of dealing with the aforementioned extreme cases of motion. We also show that training a semantic segmentation network with the automatically labeled data achieves results on par with manually annotated training data. Code and pretrained models will be made publicly available.
Comment: 8 pages, 6 figures
Robust Probabilistic Robot Arm Keypoint Detection Exploiting Kinematic Knowledge
We propose PK-ROKED, a novel probabilistic deep-learning algorithm to detect keypoints of a robotic manipulator in camera images and to robustly estimate the positioning inaccuracies w.r.t. the camera frame. Our algorithm uses monocular images as its primary input source and augments them with prior knowledge about the keypoint locations based on the robot's forward kinematics. As output, the network provides 2D image coordinates of the keypoints and an associated uncertainty measure, where the latter is obtained using Monte Carlo dropout. In experiments on two different robotic systems, we show that our network provides superior detection results compared to the state of the art. We furthermore analyze the precision of different estimation approaches to obtain an uncertainty measure.
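The uncertainty measure mentioned above relies on Monte Carlo dropout. The sketch below shows the generic recipe of keeping dropout active at test time and sampling several forward passes; the `model` interface, its output shape, and the number of samples are assumptions rather than PK-ROKED's actual architecture.

```python
# Hedged sketch of Monte Carlo dropout for per-keypoint uncertainty,
# assuming model(image) returns a (num_keypoints, 2) tensor of 2D coords.
import torch

def mc_dropout_keypoints(model, image, num_samples=25):
    model.eval()
    # Re-enable dropout layers only, keeping e.g. batch norm in eval mode.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(image) for _ in range(num_samples)])
    mean_kpts = samples.mean(dim=0)   # averaged keypoint coordinates
    uncertainty = samples.std(dim=0)  # per-keypoint spread across samples
    return mean_kpts, uncertainty
```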
Bayesian Active Learning for Sim-to-Real Robotic Perception
While learning from synthetic training data has recently gained increased attention, in real-world robotic applications there are still performance deficiencies due to the so-called Sim-to-Real gap. In practice, this gap is hard to close with synthetic data alone. Therefore, we focus on an efficient acquisition of real data within a Sim-to-Real learning pipeline. Concretely, we employ deep Bayesian active learning to minimize manual annotation efforts and devise an autonomous learning paradigm to select the data that is considered useful for the human expert to annotate. To achieve this, a Bayesian Neural Network (BNN) object detector providing reliable uncertainty estimates is adapted to infer the informativeness of the unlabeled data. Furthermore, to cope with misalignments of the label distribution in uncertainty-based sampling, we develop an effective randomized sampling strategy that performs favorably compared to other, more complex alternatives. In our experiments on object classification and detection, we show the benefits of our approach and provide evidence that labeling efforts can be reduced significantly. Finally, we demonstrate the practical effectiveness of this idea in a grasping task on an assistive robot.
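The following sketch illustrates one way such an uncertainty-driven acquisition step could look: rank the unlabeled pool by predictive entropy and then draw the next annotation batch at random from the most informative fraction. The `predict_probs` callable, the pool format, and the top-fraction heuristic are assumptions; the paper's randomized sampling strategy is not reproduced here.

```python
# Illustrative acquisition sketch under the stated assumptions.
import numpy as np

def acquire(pool_images, predict_probs, batch_size=32, top_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for img in pool_images:
        p = predict_probs(img)                    # (num_classes,) MC-averaged
        entropy = -np.sum(p * np.log(p + 1e-12))  # predictive entropy
        scores.append(entropy)
    order = np.argsort(scores)[::-1]              # most uncertain first
    top_k = max(batch_size, int(top_frac * len(pool_images)))
    candidates = order[:top_k]
    # Randomizing over the candidate set mitigates the label-distribution
    # skew of purely greedy uncertainty sampling.
    return rng.choice(candidates, size=batch_size, replace=False)
```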
Towards Robust Perception of Unknown Objects in the Wild
To be able to interact in dynamic and cluttered environments, detection and instance segmentation of only known objects is often not sufficient. Our recently proposed Instance Stereo Transformer (INSTR) addresses this problem by yielding pixel-wise instance masks of unknown items on dominant horizontal surfaces without requiring potentially noisy depth maps. To further boost the application of INSTR in a robotic domain, we propose two improvements: First, we extend the network to semantically label all non-object pixels, and experimentally validate that the additional explicit semantic information further enhances the object instance predictions. Second, knowledge about some detected objects might often be readily available, and we utilize dropout as an approximation of Bayesian inference to robustly classify the detected instances into known and unknown categories. The overall framework is well suited for various robotic applications, e.g., stone segmentation in planetary environments or an unknown-object grasping setting.
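As a rough illustration of the dropout-based known/unknown decision, the sketch below thresholds the entropy of dropout-averaged class probabilities for a detected instance; the interface and the threshold value are assumptions, not the INSTR code.

```python
# Rough sketch under assumed inputs: mc_class_probs is a
# (num_mc_samples, num_known_classes) array of softmax outputs for one
# detected instance, collected with dropout active.
import numpy as np

def classify_known_unknown(mc_class_probs, entropy_thresh=1.0):
    mean_p = mc_class_probs.mean(axis=0)
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12))
    if entropy > entropy_thresh:
        return "unknown", entropy       # uncertain: treat as unknown object
    return int(np.argmax(mean_p)), entropy  # confident: known class index
```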
ReSyRIS: A Real-Synthetic Rock Instance Segmentation Dataset for Training and Benchmarking
The exploration of our solar system to understand its creation and to investigate potential chances of life on other celestial bodies is a fundamental drive of humankind. After early telescope-based observation, Apollo 11 was the first space mission able to collect samples on the lunar surface and take them back to Earth for analysis. In recent years this trend has accelerated again, and many successors have been (or are in the process of being) launched into space for extra-terrestrial sample extraction. Yet, the abundance of potential failures makes these missions extremely challenging. For operations aimed at deeper parts of the solar system, the operational working distance extends even further, and communication delay and limited bandwidth increase the complexity. Consequently, sample extraction missions are designed to be more autonomous in order to carry out large parts of the operation without human intervention. One specific sub-task particularly suitable for automation is the identification of relevant extraction candidates. While several approaches for rock sample identification exist, they are often limited by the available training data, by the lack of suitable annotations for it, and by unclear performance of the algorithms in extra-terrestrial environments due to inadequate test data. To address these issues, we present ReSyRIS (Real-Synthetic Rock Instance Segmentation Dataset), which consists of real-world images together with their manually created synthetic counterparts. The real-world part was collected in a quasi-extra-terrestrial environment on Mt. Etna in Sicily and focuses on recordings of several rock sample sites. Every scene is re-created in OAISYS, a Blender-based data generation pipeline for unstructured outdoor environments, for which the required meshes and textures are extracted from the volcano site. This allows not only precise reconstruction of the scenes in a synthetic environment, but also the generation of highly realistic training data with automatic annotations, in a similar fashion to the real recordings. We finally investigate the generalization capability of a neural network trained on incrementally altered versions of the synthetic data to explore potential sim-to-real gaps. The real-world dataset, together with the OAISYS config files to create its synthetic counterpart, is publicly available at https://rm.dlr.de/resyris_en. With this novel benchmark on extra-terrestrial rock instance segmentation we hope to further push the boundaries of autonomous rock sample extraction.
Multi-path Learning for Object Pose Estimation Across Domains
We introduce a scalable approach for object pose estimation trained on simulated RGB views of multiple 3D models together. We learn an encoding of object views that not only describes an implicit orientation of all objects seen during training, but can also relate views of untrained objects. Our single-encoder-multi-decoder network is trained using a technique we denote "multi-path learning": while the encoder is shared by all objects, each decoder only reconstructs views of a single object. Consequently, views of different instances do not have to be separated in the latent space and can share common features. The resulting encoder generalizes well from synthetic to real data and across various instances, categories, model types, and datasets. We systematically investigate the learned encodings, their generalization, and iterative refinement strategies on the ModelNet40 and T-LESS datasets. Despite training jointly on multiple objects, our 6D object detection pipeline achieves state-of-the-art results on T-LESS at much lower runtimes than competing approaches.
Comment: To appear at CVPR 2020; code will be available here: https://github.com/DLR-RM/AugmentedAutoencoder/tree/multipat
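A schematic sketch of the single-encoder-multi-decoder idea is given below: one shared encoder, one small decoder per training object, and the reconstruction routed only through the decoder of the object in the current batch. Layer sizes and the toy 32x32 resolution are placeholders, not the released implementation linked above.

```python
# Schematic PyTorch sketch of the multi-path idea under assumed sizes.
import torch
import torch.nn as nn

class MultiPathAE(nn.Module):
    def __init__(self, num_objects, latent_dim=128):
        super().__init__()
        # Encoder shared by all objects.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        # One lightweight decoder per training object.
        self.decoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(latent_dim, 32 * 32 * 3), nn.Sigmoid())
             for _ in range(num_objects)]
        )

    def forward(self, x, obj_id):
        z = self.encoder(x)               # shared latent code
        recon = self.decoders[obj_id](z)  # per-object decoder path
        return recon.view(-1, 3, 32, 32), z
```

During training, only the decoder belonging to the object in the batch receives gradients, so latent codes of different objects can share features, which is what lets the shared encoder relate views of untrained objects.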