Implicit 3D Orientation Learning for 6D Object Detection from RGB Images
We propose a real-time RGB-based pipeline for object detection and 6D pose
estimation. Our novel 3D orientation estimation is based on a variant of the
Denoising Autoencoder that is trained on simulated views of a 3D model using
Domain Randomization. This so-called Augmented Autoencoder has several
advantages over existing methods: It does not require real, pose-annotated
training data, generalizes to various test sensors and inherently handles
object and view symmetries. Instead of learning an explicit mapping from input
images to object poses, it provides an implicit representation of object
orientations defined by samples in a latent space. Our pipeline achieves
state-of-the-art performance on the T-LESS dataset both in the RGB and RGB-D
domain. We also evaluate on the LineMOD dataset where we can compete with other
synthetically trained approaches. We further increase performance by correcting
3D orientation estimates to account for perspective errors when the object
deviates from the image center and show extended results.
Comment: Code available at: https://github.com/DLR-RM/AugmentedAutoencode
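The abstract describes the mechanism concretely enough for a sketch. Below is a
minimal, hypothetical PyTorch sketch of the Augmented Autoencoder training
idea, not the authors' implementation: the encoder sees an augmented view
(domain randomization is stood in for by additive noise), while the decoder
must reconstruct the clean render, so the latent code retains orientation and
discards nuisance factors; at test time the code is matched against a codebook
built from views with known rotations. Layer sizes, the augmentation, and the
lookup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AugmentedAutoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: 3x128x128 augmented view -> latent orientation code.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 32 * 32, latent_dim),
        )
        # Decoder: latent code -> clean 3x128x128 render of the same pose.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 32 * 32), nn.ReLU(),
            nn.Unflatten(1, (64, 32, 32)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def train_step(model, opt, clean_views):
    # Additive noise stands in for domain randomization; in practice this
    # would be random backgrounds, lighting, and occlusions on the render.
    augmented = (clean_views + 0.3 * torch.randn_like(clean_views)).clamp(0, 1)
    recon, _ = model(augmented)
    loss = ((recon - clean_views) ** 2).mean()  # reconstruct the clean view
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def estimate_orientation(model, query_img, codebook_z, codebook_rots):
    # Test time: nearest neighbor in latent space over pre-rendered views
    # with known rotations; the latent samples *implicitly* define the
    # orientation, as the abstract states.
    with torch.no_grad():
        z = model.encoder(query_img.unsqueeze(0))
    sims = torch.nn.functional.cosine_similarity(z, codebook_z)
    return codebook_rots[sims.argmax()]
```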
Multi-path Learning for Object Pose Estimation Across Domains
We introduce a scalable approach for object pose estimation trained on
simulated RGB views of multiple 3D models together. We learn an encoding of
object views that not only describes an implicit orientation of all objects
seen during training but can also relate views of untrained objects. Our
single-encoder-multi-decoder network is trained using a technique we denote
"multi-path learning": While the encoder is shared by all objects, each decoder
only reconstructs views of a single object. Consequently, views of different
instances do not have to be separated in the latent space and can share common
features. The resulting encoder generalizes well from synthetic to real data
and across various instances, categories, model types and datasets. We
systematically investigate the learned encodings, their generalization, and
iterative refinement strategies on the ModelNet40 and T-LESS datasets. Despite
training jointly on multiple objects, our 6D Object Detection pipeline achieves
state-of-the-art results on T-LESS at much lower runtimes than competing
approaches.
Comment: To appear at CVPR 2020; Code will be available here: https://github.com/DLR-RM/AugmentedAutoencoder/tree/multipat
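A minimal sketch of the "multi-path learning" scheme as described, with a toy
MLP in place of the paper's convolutional architecture and assumed shapes: the
encoder is shared across all objects, each decoder reconstructs views of one
object only, and every training sample is routed through its own decoder.

```python
import torch
import torch.nn as nn

class MultiPathAutoencoder(nn.Module):
    def __init__(self, num_objects, img_dim=3 * 64 * 64, latent_dim=128):
        super().__init__()
        # Single encoder, shared across all objects.
        self.encoder = nn.Sequential(
            nn.Linear(img_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        # One reconstruction path (decoder) per object.
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                          nn.Linear(512, img_dim))
            for _ in range(num_objects))

    def forward(self, views, obj_ids):
        z = self.encoder(views)
        # Route each latent code through its object's decoder only, so codes
        # of different objects may overlap and share latent features.
        recon = torch.stack(
            [self.decoders[i](z[b]) for b, i in enumerate(obj_ids.tolist())])
        return recon, z

model = MultiPathAutoencoder(num_objects=10)
views = torch.rand(4, 3 * 64 * 64)    # flattened target views (toy data)
obj_ids = torch.tensor([0, 3, 3, 7])  # which object each view shows
recon, z = model(views, obj_ids)
loss = ((recon - views) ** 2).mean()  # per-object reconstruction loss
```

Because no decoder ever has to tell objects apart, latent codes of different
instances are free to share features, which is what the abstract credits for
generalization to untrained objects.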
DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field
Estimating 6D poses and reconstructing 3D shapes of objects in open-world
scenes from RGB-depth image pairs is challenging. Many existing methods rely on
learning geometric features that correspond to specific templates while
disregarding shape variations and pose differences among objects in the same
category. As a result, these methods underperform when handling unseen object
instances in complex environments. In contrast, other approaches aim to achieve
category-level estimation and reconstruction by leveraging normalized geometric
structure priors, but the static prior-based reconstruction struggles with
substantial intra-class variations. To solve these problems, we propose the
DTF-Net, a novel framework for pose estimation and shape reconstruction based
on implicit neural fields of object categories. In DTF-Net, we design a
deformable template field to represent the general category-wise shape latent
features and intra-category geometric deformation features. The field
establishes continuous shape correspondences, deforming the category template
into arbitrary observed instances to accomplish shape reconstruction. We
introduce a pose regression module that shares the deformation features and
template codes from the fields to estimate the accurate 6D pose of each object
in the scene. We integrate a multi-modal representation extraction module to
extract object features and semantic masks, enabling end-to-end inference.
Moreover, during training, we implement a shape-invariant training strategy and
a viewpoint sampling method to further enhance the model's capability to
extract object pose features. Extensive experiments on the REAL275 and CAMERA25
datasets demonstrate the superiority of DTF-Net in both synthetic and real
scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks
with a real robot arm.
Comment: The first two authors contributed equally. Paper accepted by ACM MM 202
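As a rough illustration of the deformable-template idea, here is a minimal,
hypothetical sketch (network sizes and conditioning are assumptions, not
DTF-Net's design): a category-level template SDF is queried at points
displaced by an instance-conditioned deformation network, so the predicted
offsets double as continuous correspondences between the observed instance
and the category template.

```python
import torch
import torch.nn as nn

class DeformableTemplateField(nn.Module):
    def __init__(self, code_dim=64):
        super().__init__()
        # Category-level template: 3D point -> signed distance.
        self.template_sdf = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 1))
        # Instance deformation: (point, instance code) -> offset that warps
        # the observed instance into template coordinates.
        self.deform = nn.Sequential(
            nn.Linear(3 + code_dim, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, pts, instance_code):
        code = instance_code.expand(pts.shape[0], -1)
        offset = self.deform(torch.cat([pts, code], dim=-1))
        # SDF of the instance = template SDF at the deformed query point.
        return self.template_sdf(pts + offset), offset

field = DeformableTemplateField()
pts = torch.rand(1024, 3) * 2 - 1  # query points in a unit cube
code = torch.randn(1, 64)          # latent code of one observed instance
sdf, offset = field(pts, code)     # offsets serve as shape correspondences
```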
Generalizable Pose Estimation Using Implicit Scene Representations
6-DoF pose estimation is an essential component of robotic manipulation
pipelines. However, it usually suffers from a lack of generalization to new
instances and object types. Most widely used methods learn to infer the object
pose in a discriminative setup where the model filters useful information to
infer the exact pose of the object. While such methods offer accurate poses,
the model does not store enough information to generalize to new objects. In
this work, we address the generalization capability of pose estimation using
models that contain enough information about the object to render it in
different poses. We follow the line of work that inverts neural renderers to
infer the pose. We propose i-SRN to maximize the information flowing
from the input pose to the rendered scene and invert them to infer the pose
given an input image. Specifically, we extend Scene Representation Networks
(SRNs) by incorporating a separate network for density estimation and introduce
a new way of obtaining a weighted scene representation. We investigate several
ways of obtaining initial pose estimates and losses for the neural renderer. Our final
evaluation shows a significant improvement in inference performance and speed
compared to existing approaches.
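The render-and-compare core the abstract describes can be sketched
independently of the specific scene representation. Below is a minimal,
hypothetical example of inverting a differentiable renderer to recover pose;
the `ToyRenderer` is a stand-in, not i-SRN's scene representation network, and
the pose parameterization and optimizer settings are assumptions.

```python
import torch
import torch.nn as nn

def invert_renderer(renderer, target_img, init_pose, steps=200, lr=1e-2):
    # Pose parameterized as translation + axis-angle rotation (6 numbers);
    # any differentiable parameterization would do.
    pose = init_pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        rendered = renderer(pose)
        loss = ((rendered - target_img) ** 2).mean()
        opt.zero_grad()
        loss.backward()   # gradients flow from pixels back to the pose
        opt.step()
    return pose.detach(), loss.item()

class ToyRenderer(nn.Module):
    # Stand-in so the sketch runs end to end; a real pipeline would plug in
    # a trained scene representation network here.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(6, 256), nn.ReLU(),
                                 nn.Linear(256, 32 * 32))
    def forward(self, pose):
        return self.net(pose).view(32, 32)

renderer = ToyRenderer()
target = renderer(torch.randn(6)).detach()  # "observed" image
pose, err = invert_renderer(renderer, target, torch.zeros(6))
```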
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
Estimating the 6D pose of known objects is important for robots to interact
with the real world. The problem is challenging due to the variety of objects
as well as the complexity of a scene caused by clutter and occlusions between
objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network
for 6D object pose estimation. PoseCNN estimates the 3D translation of an
object by localizing its center in the image and predicting its distance from
the camera. The 3D rotation of the object is estimated by regressing to a
quaternion representation. We also introduce a novel loss function that enables
PoseCNN to handle symmetric objects. In addition, we contribute a large scale
video dataset for 6D object pose estimation named the YCB-Video dataset. Our
dataset provides accurate 6D poses of 21 objects from the YCB dataset observed
in 92 videos with 133,827 frames. We conduct extensive experiments on our
YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is
highly robust to occlusions, can handle symmetric objects, and provides accurate
pose estimation using only color images as input. When using depth data to
further refine the poses, our approach achieves state-of-the-art results on the
challenging OccludedLINEMOD dataset. Our code and dataset are available at
https://rse-lab.cs.washington.edu/projects/posecnn/.
Comment: Accepted to RSS 201
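The symmetry-aware loss the abstract mentions (PoseCNN's ShapeMatch-Loss) can
be sketched as nearest-point matching between the model under the predicted
and ground-truth rotations; this is a minimal, hypothetical version written
from that description, with point sampling and weighting as assumptions.

```python
import torch

def quat_to_rotmat(q):
    # Unit quaternion (w, x, y, z) -> 3x3 rotation matrix.
    w, x, y, z = q / q.norm()
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

def shapematch_loss(q_pred, q_gt, model_pts):
    # model_pts: (N, 3) points sampled from the object's 3D model.
    pred = model_pts @ quat_to_rotmat(q_pred).T  # points under predicted pose
    gt = model_pts @ quat_to_rotmat(q_gt).T      # points under true pose
    d2 = torch.cdist(pred, gt) ** 2              # pairwise squared distances
    # Match each predicted point to its *nearest* ground-truth point, so
    # rotations equivalent under symmetry incur no penalty.
    return d2.min(dim=1).values.mean()

pts = torch.rand(500, 3) - 0.5
loss = shapematch_loss(torch.randn(4), torch.randn(4), pts)
```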
Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild
While 6D object pose estimation has wide applications across computer vision
and robotics, it remains far from being solved due to the lack of annotations.
The problem becomes even more challenging when moving to category-level 6D
pose, which requires generalization to unseen instances. Current approaches are
restricted by their reliance on annotations from simulation or collected from humans.
In this paper, we overcome this barrier by introducing a self-supervised
learning approach trained directly on large-scale real-world object videos for
category-level 6D pose estimation in the wild. Our framework reconstructs the
canonical 3D shape of an object category and learns dense correspondences
between input images and the canonical shape via surface embedding. For
training, we propose novel geometrical cycle-consistency losses which construct
cycles across 2D-3D spaces, across different instances and different time
steps. The learned correspondence can be applied for 6D pose estimation and
other downstream tasks such as keypoint transfer. Surprisingly, our method,
without any human annotations or simulators, can achieve on-par or even better
performance than previous supervised or semi-supervised methods on in-the-wild
images. Our project page is: https://kywind.github.io/self-pose.
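A minimal, hypothetical sketch of a 2D-3D cycle-consistency loss in the spirit
the abstract describes (the soft matching via embedding similarity and the
temperature are assumptions): pixels are matched to canonical surface points
and back, and the loss asks each pixel's round trip to land on itself.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(pix_emb, surf_emb, pix_xy, tau=0.07):
    # pix_emb:  (P, D) embeddings of P image pixels
    # surf_emb: (S, D) embeddings of S canonical surface points
    # pix_xy:   (P, 2) pixel coordinates
    to_3d = F.softmax(pix_emb @ surf_emb.T / tau, dim=1)  # pixel -> surface
    to_2d = F.softmax(surf_emb @ pix_emb.T / tau, dim=1)  # surface -> pixel
    cycle = to_3d @ to_2d                   # (P, P) soft round trip
    returned_xy = cycle @ pix_xy            # where each pixel lands
    return ((returned_xy - pix_xy) ** 2).mean()

pix = F.normalize(torch.randn(256, 32), dim=1)
surf = F.normalize(torch.randn(512, 32), dim=1)
xy = torch.rand(256, 2)
loss = cycle_consistency_loss(pix, surf, xy)
```

The paper additionally closes cycles across instances and time steps; the same
soft-matching pattern extends to those cases by chaining more hops.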
FSD: Fast Self-Supervised Single RGB-D to Categorical 3D Objects
In this work, we address the challenging task of 3D object recognition
without the reliance on real-world 3D labeled data. Our goal is to predict the
3D shape, size, and 6D pose of objects within a single RGB-D image, operating
at the category level and eliminating the need for CAD models during inference.
While existing self-supervised methods have made strides in this field, they
often suffer from inefficiencies arising from non-end-to-end processing,
reliance on separate models for different object categories, and slow surface
extraction during the training of implicit reconstruction models, thus
hindering both the speed and real-world applicability of the 3D recognition
process. Our proposed method leverages a multi-stage training pipeline,
designed to efficiently transfer synthetic performance to the real-world
domain. This is achieved through a combination of 2D and 3D supervised
losses during the synthetic domain training, followed by the incorporation of
2D supervised and 3D self-supervised losses on real-world data in two
additional learning stages. By adopting this comprehensive strategy, our method
successfully overcomes the aforementioned limitations and outperforms existing
self-supervised 6D pose and size estimation baselines on the NOCS test-set with
a 16.4% absolute improvement in mAP for 6D pose estimation while running in
near real-time at 5 Hz.
Comment: Project page: https://fsd6d.github.i
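The staged schedule the abstract outlines can be summarized as a training
skeleton. This is a minimal, hypothetical sketch with toy stand-ins for the
model and losses, not FSD's implementation: full 2D+3D supervision on
synthetic data first, then 2D supervision plus 3D self-supervision on real
data in the later stages.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the skeleton runs; a real pipeline would compute mask,
# pose, and shape losses from RGB-D batches here.
model = nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def loss_2d(out, batch):       return ((out - batch["mask"]) ** 2).mean()
def loss_3d_sup(out, batch):   return ((out - batch["gt3d"]) ** 2).mean()
def loss_3d_self(out, batch):  return out.var()  # e.g. a depth/chamfer surrogate

synth = [{"rgbd": torch.randn(4, 8), "mask": torch.zeros(4, 8),
          "gt3d": torch.zeros(4, 8)} for _ in range(2)]
real = [{"rgbd": torch.randn(4, 8), "mask": torch.zeros(4, 8)}
        for _ in range(2)]

# One synthetic stage, then two real-data stages, per the abstract.
for stage, loader in [("synthetic", synth), ("real", real), ("real", real)]:
    for batch in loader:
        out = model(batch["rgbd"])
        loss = loss_2d(out, batch)                      # 2D supervision always
        loss = loss + (loss_3d_sup(out, batch) if stage == "synthetic"
                       else loss_3d_self(out, batch))   # 3D: GT vs. self-sup
        opt.zero_grad()
        loss.backward()
        opt.step()
```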