YOLOPose V2: Understanding and Improving Transformer-based 6D Pose Estimation
6D object pose estimation is a crucial prerequisite for autonomous robot
manipulation applications. The state-of-the-art models for pose estimation are
convolutional neural network (CNN)-based. Lately, Transformers, an architecture originally proposed for natural language processing, have been achieving state-of-the-art results in many computer vision tasks as well. Equipped with
the multi-head self-attention mechanism, Transformers enable simple
single-stage end-to-end architectures for learning object detection and 6D
object pose estimation jointly. In this work, we propose YOLOPose (short for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression, together with an improved variant of the model. In contrast to the standard heatmap representation for predicting keypoints in an image, we regress the keypoint coordinates directly. Additionally, we
employ a learnable orientation estimation module to predict the orientation
from the keypoints. Along with a separate translation estimation module, our
model is end-to-end differentiable. Our method is suitable for real-time
applications and achieves results comparable to state-of-the-art methods. We
analyze the role of object queries in our architecture and reveal that the
object queries specialize in detecting objects in specific image regions.
Furthermore, we quantify the accuracy trade-off of using datasets of smaller
sizes to train our model.
Comment: Robotics and Autonomous Systems Journal, Elsevier, to appear 2023. arXiv admin note: substantial text overlap with arXiv:2205.0253
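To illustrate the keypoint-to-orientation idea described above, here is a minimal sketch of a learnable orientation-estimation head that maps regressed 2D keypoints to a rotation matrix via a 6D rotation representation with Gram-Schmidt orthogonalization. The module structure and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): an MLP maps regressed 2D
# keypoints to a 6D rotation representation, then Gram-Schmidt
# orthogonalization (Zhou et al., 2019) produces a rotation matrix,
# keeping the whole head differentiable for end-to-end training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrientationFromKeypoints(nn.Module):
    def __init__(self, num_keypoints: int = 8, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_keypoints * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 6),   # 6D rotation representation
        )

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        # keypoints: (B, K, 2) image-plane keypoints regressed by the network.
        a, b = self.mlp(keypoints.flatten(1)).split(3, dim=-1)
        x = F.normalize(a, dim=-1)
        y = F.normalize(b - (x * b).sum(-1, keepdim=True) * x, dim=-1)
        z = torch.cross(x, y, dim=-1)
        return torch.stack([x, y, z], dim=-1)   # (B, 3, 3) rotation matrices

R = OrientationFromKeypoints()(torch.randn(4, 8, 2))
print(R.shape)  # torch.Size([4, 3, 3])
```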
DR-Pose: A Two-stage Deformation-and-Registration Pipeline for Category-level 6D Object Pose Estimation
Category-level object pose estimation involves estimating the 6D pose and the
3D metric size of objects from predetermined categories. While recent
approaches take categorical shape priors as reference to improve pose estimation accuracy, their single-stage network design and training scheme lead to sub-optimal performance, since the pipeline involves two distinct tasks. In this paper, we discuss the advantage of a two-stage pipeline over a single-stage design. To this end, we propose a two-stage deformation-and-registration pipeline called DR-Pose, which consists of a completion-aided deformation stage and a scaled registration stage. The first stage uses a point cloud completion method to generate the unseen parts of the target object, guiding the subsequent deformation of the shape prior. In the second stage, a novel registration network extracts pose-sensitive features and predicts the canonical-space representation of the object's partial point cloud based on the deformation results from the first stage. DR-Pose produces superior results
to the state-of-the-art shape prior-based methods on both CAMERA25 and REAL275
benchmarks. Code is available at https://github.com/Zray26/DR-Pose.git.
Comment: Camera-ready version accepted to IROS 202
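To make the "scaled registration" stage concrete: once the deformed prior and the observed points are in correspondence, a similarity transform (scale, rotation, translation) has a closed-form least-squares solution, the Umeyama algorithm. The sketch below illustrates only this geometric step under an assumption of known correspondences; DR-Pose itself predicts the canonical representation with a learned registration network.

```python
# Minimal sketch: closed-form scaled registration via the Umeyama
# algorithm, assuming point-to-point correspondences are given.
import numpy as np

def umeyama(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity transform mapping src -> dst, both (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)          # cross-covariance (3, 3)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # reflection guard: keep det(R) = +1
    R = U @ S @ Vt
    var = (src_c ** 2).sum() / len(src)
    scale = np.trace(np.diag(D) @ S) / var
    t = mu_d - scale * R @ mu_s
    return scale, R, t

# Synthetic check: recover a known similarity transform.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1                        # force a proper rotation
Q = 1.7 * P @ R_true.T + np.array([0.1, -0.2, 0.3])
s, R, t = umeyama(P, Q)
print(np.isclose(s, 1.7), np.allclose(R, R_true, atol=1e-6))
```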
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
Current 6D object pose methods consist of deep CNN models fully optimized for a single object, but with an architecture standardized across objects of different shapes. In contrast to previous works, we explicitly exploit each object's distinct topological information, i.e., its dense 3D mesh, in the pose estimation model, through an automated process and prior to any post-processing refinement stage. To achieve this, we propose a learning framework in
which a Graph Convolutional Neural Network reconstructs a pose conditioned 3D
mesh of the object. A robust estimation of the allocentric orientation is
recovered by computing, in a differentiable manner, the Procrustes' alignment
between the canonical and reconstructed dense 3D meshes. 6D egocentric pose is
then lifted using additional mask and 2D centroid projection estimations. Our
method is capable of self validating its pose estimation by measuring the
quality of the reconstructed mesh, which is invaluable in real life
applications. In our experiments on the LINEMOD, OCCLUSION and YCB-Video
benchmarks, the proposed method outperforms state-of-the-arts
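A minimal sketch of the differentiable Procrustes step mentioned above: the rotation between corresponding canonical and reconstructed vertex sets follows from an SVD of their cross-covariance, and since torch.linalg.svd supports backpropagation, gradients flow through the alignment. Function and variable names are illustrative assumptions.

```python
# Minimal sketch: differentiable Procrustes (Kabsch) alignment between
# two vertex sets in correspondence; the SVD is differentiable, so this
# step can sit inside an end-to-end trained network.
import torch

def procrustes_rotation(canonical: torch.Tensor, reconstructed: torch.Tensor) -> torch.Tensor:
    # canonical, reconstructed: (B, N, 3) vertex sets in correspondence.
    A = canonical - canonical.mean(1, keepdim=True)
    B = reconstructed - reconstructed.mean(1, keepdim=True)
    M = B.transpose(1, 2) @ A                  # (B, 3, 3) cross-covariance
    U, _, Vt = torch.linalg.svd(M)
    # Flip the last singular vector if needed so that det(R) = +1.
    d = torch.sign(torch.det(U @ Vt))
    D = torch.diag_embed(torch.stack([torch.ones_like(d), torch.ones_like(d), d], dim=-1))
    return U @ D @ Vt                          # rotation mapping canonical -> reconstructed

verts = torch.randn(2, 500, 3, requires_grad=True)
R = procrustes_rotation(verts, verts @ torch.eye(3))   # identity case
R.sum().backward()                                     # gradients flow through the SVD
print(R.shape, verts.grad is not None)
```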
Learning Implicit Probability Distribution Functions for Symmetric Orientation Estimation from RGB Images Without Pose Labels
Object pose estimation is a necessary prerequisite for autonomous robotic
manipulation, but the presence of symmetry increases the complexity of the pose
estimation task. Existing methods for object pose estimation output a single 6D
pose. Thus, they lack the ability to reason about symmetries. Lately, modeling object orientation with neural networks as a non-parametric probability distribution on the SO(3) manifold has shown impressive results. However, acquiring
large-scale datasets to train pose estimation models remains a bottleneck. To
address this limitation, we introduce an automatic pose labeling scheme. Given
RGB-D images without object pose annotations and 3D object models, we design a
two-stage pipeline consisting of point cloud registration and
render-and-compare validation to generate multiple symmetrical
pseudo-ground-truth pose labels for each image. Using the generated pose
labels, we train an ImplicitPDF model to estimate the likelihood of an
orientation hypothesis given an RGB image. An efficient hierarchical sampling
of the SO(3) manifold enables tractable generation of the complete set of
symmetries at multiple resolutions. During inference, the most likely
orientation of the target object is estimated using gradient ascent. We
evaluate the proposed automatic pose labeling scheme and the ImplicitPDF model
on a photorealistic dataset and the T-Less dataset, demonstrating the
advantages of the proposed method.
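The scoring idea can be sketched as follows: a small network assigns an unnormalized log-likelihood to each (image feature, rotation hypothesis) pair, and a softmax over a grid of SO(3) samples approximates the distribution. In this sketch, random rotations stand in for the paper's hierarchical equivolumetric grid, an argmax over the grid stands in for gradient-ascent refinement, and all module names are assumptions.

```python
# Minimal sketch of implicit-PDF orientation scoring.
import torch
import torch.nn as nn

class ImplicitPDFHead(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 9, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feat: torch.Tensor, rotations: torch.Tensor) -> torch.Tensor:
        # feat: (B, F) image features; rotations: (Q, 3, 3) hypotheses.
        B, Q = feat.shape[0], rotations.shape[0]
        f = feat.unsqueeze(1).expand(B, Q, feat.shape[1])
        r = rotations.flatten(1).unsqueeze(0).expand(B, Q, 9)
        logits = self.mlp(torch.cat([f, r], dim=-1)).squeeze(-1)   # (B, Q)
        # Log-softmax over the grid approximates log p(R | image) on SO(3).
        return torch.log_softmax(logits, dim=-1)

def random_rotations(n: int) -> torch.Tensor:
    # Stand-in for the paper's hierarchical equivolumetric SO(3) grid.
    q, _ = torch.linalg.qr(torch.randn(n, 3, 3))
    sign = torch.det(q).sign().view(n, 1, 1)
    return q * torch.cat([sign, torch.ones(n, 1, 2)], dim=2)  # force det(R) = +1

head = ImplicitPDFHead()
grid = random_rotations(2048)
log_p = head(torch.randn(4, 128), grid)    # (4, 2048) log-probabilities
best = grid[log_p.argmax(dim=-1)]          # most likely orientation per image
print(best.shape)                          # torch.Size([4, 3, 3])
```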
FSD: Fast Self-Supervised Single RGB-D to Categorical 3D Objects
In this work, we address the challenging task of 3D object recognition
without the reliance on real-world 3D labeled data. Our goal is to predict the
3D shape, size, and 6D pose of objects within a single RGB-D image, operating
at the category level and eliminating the need for CAD models during inference.
While existing self-supervised methods have made strides in this field, they
often suffer from inefficiencies arising from non-end-to-end processing,
reliance on separate models for different object categories, and slow surface
extraction during the training of implicit reconstruction models, all of which hinder both the speed and the real-world applicability of the 3D recognition process. Our proposed method leverages a multi-stage training pipeline,
designed to efficiently transfer synthetic performance to the real-world
domain. This approach is achieved through a combination of 2D and 3D supervised
losses during the synthetic domain training, followed by the incorporation of
2D supervised and 3D self-supervised losses on real-world data in two
additional learning stages. By adopting this comprehensive strategy, our method
successfully overcomes the aforementioned limitations and outperforms existing
self-supervised 6D pose and size estimation baselines on the NOCS test-set with
a 16.4% absolute improvement in mAP for 6D pose estimation while running in
near real-time at 5 Hz.
Comment: Project page: https://fsd6d.github.i
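A hedged sketch of the staged supervision described above: synthetic batches receive full 2D and 3D supervision, while real batches replace the unavailable 3D labels with a self-supervised chamfer-style term against back-projected depth points. The loss names, shapes, and weighting below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: stage-dependent loss mixing for synthetic-to-real transfer.
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a, b: (B, N, 3) point sets; symmetric nearest-neighbour distance.
    d = torch.cdist(a, b)                     # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def training_loss(stage: str, pred: dict, target: dict) -> torch.Tensor:
    # 2D terms (e.g., masks) are supervised in every stage.
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        pred["mask_logits"], target["mask"])
    if stage == "synthetic":
        # Synthetic data has exact 3D labels: direct shape supervision.
        loss = loss + torch.nn.functional.l1_loss(pred["nocs"], target["nocs"])
    else:
        # Real data: align the predicted shape to back-projected depth
        # points instead of using (unavailable) 3D labels.
        loss = loss + chamfer(pred["shape_points"], target["depth_points"])
    return loss

B, N = 2, 64
pred = {"mask_logits": torch.randn(B, 1, 32, 32),
        "nocs": torch.rand(B, 3, 32, 32),
        "shape_points": torch.rand(B, N, 3)}
target = {"mask": torch.rand(B, 1, 32, 32),
          "nocs": torch.rand(B, 3, 32, 32),
          "depth_points": torch.rand(B, N, 3)}
print(training_loss("synthetic", pred, target).item())
print(training_loss("real", pred, target).item())
```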
CASAPose: Class-Adaptive and Semantic-Aware Multi-Object Pose Estimation
Applications in the field of augmented reality or robotics often require
joint localisation and 6D pose estimation of multiple objects. However, most algorithms require training one network per object class to achieve the best results. Analysing all visible objects then demands multiple inferences, which is memory- and time-consuming. We present a new single-stage architecture
called CASAPose that determines 2D-3D correspondences for pose estimation of
multiple different objects in RGB images in one pass. It is fast and memory
efficient, and achieves high accuracy for multiple objects by exploiting the
output of a semantic segmentation decoder as control input to a keypoint
recognition decoder via local class-adaptive normalisation. Our new
differentiable regression of keypoint locations significantly contributes to a
faster closing of the domain gap between real test and synthetic training data.
We apply segmentation-aware convolutions and upsampling operations to increase
the focus inside the object mask and to reduce mutual interference of occluding
objects. For each inserted object, the network grows by only one output
segmentation map and a negligible number of parameters. We outperform
state-of-the-art approaches in challenging multi-object scenes with
inter-object occlusion and synthetic training.
Comment: BMVC 2022, camera-ready version (this submission includes the paper and supplementary material)
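A minimal sketch of local class-adaptive normalisation, the mechanism by which the segmentation decoder controls the keypoint decoder: per-class scale and shift parameters are blended at each pixel through the predicted class probabilities. Layer and parameter names are assumptions, not the CASAPose implementation.

```python
# Minimal sketch: segmentation-conditioned, class-adaptive normalisation.
import torch
import torch.nn as nn

class ClassAdaptiveNorm(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # One (gamma, beta) pair per class, applied spatially via the mask.
        self.gamma = nn.Embedding(num_classes, channels)
        self.beta = nn.Embedding(num_classes, channels)

    def forward(self, x: torch.Tensor, seg_probs: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features; seg_probs: (B, K, H, W) class probabilities.
        h = self.norm(x)
        # Blend per-class parameters at every pixel: (B, K, H, W) x (K, C).
        gamma = torch.einsum("bkhw,kc->bchw", seg_probs, self.gamma.weight)
        beta = torch.einsum("bkhw,kc->bchw", seg_probs, self.beta.weight)
        return h * (1 + gamma) + beta

layer = ClassAdaptiveNorm(channels=64, num_classes=5)
feats = torch.randn(2, 64, 32, 32)
seg = torch.softmax(torch.randn(2, 5, 32, 32), dim=1)
print(layer(feats, seg).shape)   # torch.Size([2, 64, 32, 32])
```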
iPose: Instance-Aware 6D Pose Estimation of Partly Occluded Objects
We address the task of 6D pose estimation of known rigid objects from single
input images in scenarios where the objects are partly occluded. Recent
RGB-D-based methods are robust to moderate degrees of occlusion. For RGB
inputs, no previous method works well for partly occluded objects. Our main
contribution is to present the first deep learning-based system that estimates
accurate poses for partly occluded objects from RGB-D and RGB input. We achieve
this with a new instance-aware pipeline that decomposes 6D object pose
estimation into a sequence of simpler steps, where each step removes specific
aspects of the problem. The first step localizes all known objects in the image
using an instance segmentation network, and hence eliminates surrounding
clutter and occluders. The second step densely maps pixels to 3D object surface
positions, so-called object coordinates, using an encoder-decoder network, and
hence eliminates object appearance. The third, and final, step predicts the 6D
pose using geometric optimization. We demonstrate that we significantly
outperform the state-of-the-art for pose estimation of partly occluded objects
for both RGB and RGB-D input.
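The final geometric-optimization step can be sketched with a standard tool: given the dense 2D-3D correspondences from the object-coordinate step, PnP with RANSAC recovers the 6D pose. OpenCV's solvePnPRansac is used below as a stand-in; the paper's exact optimization may differ.

```python
# Minimal sketch: 6D pose from dense object-coordinate predictions
# via PnP + RANSAC.
import cv2
import numpy as np

def pose_from_object_coordinates(obj_coords, mask, K):
    """obj_coords: (H, W, 3) predicted 3D surface positions per pixel;
    mask: (H, W) bool instance mask; K: (3, 3) camera intrinsics."""
    v, u = np.nonzero(mask)
    pts_3d = obj_coords[v, u].astype(np.float64)           # (N, 3) object coords
    pts_2d = np.stack([u, v], axis=1).astype(np.float64)   # (N, 2) pixel coords
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, distCoeffs=None,
        reprojectionError=3.0, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # axis-angle -> rotation matrix
    return R, tvec                    # pose: x_cam = R @ x_obj + t
```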