3,984 research outputs found
Learning Sampling-Based 6D Object Pose Estimation
The task of 6D object pose estimation, i.e. of estimating an object position (three degrees of freedom) and orientation (three degrees of freedom) from images is an essential building block of many modern applications, such as robotic grasping, autonomous driving, or augmented reality. Automatic pose estimation systems have to overcome a variety of visual ambiguities, including texture-less objects, clutter, and occlusion. Since many applications demand real time performance the efficient use of computational resources is an additional challenge.
In this thesis, we will take a probabilistic stance on trying to overcome said issues. We build on a highly successful automatic pose estimation framework based on predicting pixel-wise correspondences between the camera coordinate system and the local coordinate system of the object. These dense correspondences are used to generate a pool of hypotheses, which in turn serve as a starting point in a final search procedure. We will present three systems that each use probabilistic modeling and sampling to improve upon different aspects of the framework.
The goal of the first system, System I, is to enable pose tracking, i.e. estimating the pose of an object in a sequence of frames instead of a single image. By including information from previous frames tracking systems can resolve many visual ambiguities and reduce computation time. System I is a particle filter (PF) approach. The PF represents its belief about the pose in each frame by propagating a set of samples through time. Our system uses the process of hypothesis generation from the original framework as part of a proposal distribution that efficiently concentrates samples in the appropriate areas.
In System II, we focus on the problem of evaluating the quality of pose hypotheses. This task plays an essential role in the final search procedure of the original framework. We use a convolutional neural network (CNN) to assess the quality of an hypothesis by comparing rendered and observed images. To train the CNN we view it as part of an energy-based probability distribution in pose space. This probabilistic perspective allows us to train the system under the maximum likelihood paradigm. We use a sampling approach to approximate the required gradients. The resulting system for pose estimation yields superior results in particular for highly occluded objects.
In System III, we take the idea of machine learning a step further. Instead of learning to predict an hypothesis quality measure, to be used in a search procedure, we present a way of learning the search procedure itself. We train a reinforcement learning (RL) agent, termed PoseAgent, to steer the search process and make optimal use of a given computational budget. PoseAgent dynamically decides which hypothesis should be refined next, and which one should ultimately be output as final estimate. Since the search procedure includes discrete non-differentiable choices, training of the system via gradient descent is not easily possible. To solve the problem, we model behavior of PoseAgent as non-deterministic stochastic policy, which is ultimately governed by a CNN. This allows us to use a sampling-based stochastic policy gradient training procedure.
We believe that some of the ideas developed in this thesis,
such as the sampling-driven probabilistically motivated training of a CNN for the comparison of images or the search procedure implemented by PoseAgent have the potential to be applied in fields beyond pose estimation as well
E2EK: End-to-end regression network based on keypoint for 6d pose estimation
The methods based on deep learning are the mainstream of 6D object pose estimation, which mainly include direct regression and two-stage pipelines. The former are keen by many scholars at first due to their simplicity and differentiability to poses, but they usually lack in accuracy when compared with the latter that estimate the intermediate variables relating to geometries such as object keypoints or 2D-3D correspondence before PnP/RANSAC algorithm. However, the loss function of the two-stage method is non-differentiable to the 6D pose, which is hard to apply in the tasks requiring the differentiable poses. To overcome the disadvantages of the above methods, we propose an end-to-end regression network based on keypoints for 6D pose estimation. Specifically, we supervise the point-wise keypoint offsets that help the network to learn the geometric information and directly regress the 6D pose through aggregating keypoints to achieve differentiability to the pose. Furthermore, we improve the sampling method by sampling points around objects that benefits the small object and design a unit loss function that helps the learning of the keypoints. Experimental results show that our approach outperforms most methods on LM, LM-O and YCB-V datasets
DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field
Estimating 6D poses and reconstructing 3D shapes of objects in open-world
scenes from RGB-depth image pairs is challenging. Many existing methods rely on
learning geometric features that correspond to specific templates while
disregarding shape variations and pose differences among objects in the same
category. As a result, these methods underperform when handling unseen object
instances in complex environments. In contrast, other approaches aim to achieve
category-level estimation and reconstruction by leveraging normalized geometric
structure priors, but the static prior-based reconstruction struggles with
substantial intra-class variations. To solve these problems, we propose the
DTF-Net, a novel framework for pose estimation and shape reconstruction based
on implicit neural fields of object categories. In DTF-Net, we design a
deformable template field to represent the general category-wise shape latent
features and intra-category geometric deformation features. The field
establishes continuous shape correspondences, deforming the category template
into arbitrary observed instances to accomplish shape reconstruction. We
introduce a pose regression module that shares the deformation features and
template codes from the fields to estimate the accurate 6D pose of each object
in the scene. We integrate a multi-modal representation extraction module to
extract object features and semantic masks, enabling end-to-end inference.
Moreover, during training, we implement a shape-invariant training strategy and
a viewpoint sampling method to further enhance the model's capability to
extract object pose features. Extensive experiments on the REAL275 and CAMERA25
datasets demonstrate the superiority of DTF-Net in both synthetic and real
scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks
with a real robot arm.Comment: The first two authors are with equal contributions. Paper accepted by
ACM MM 202
iPose: Instance-Aware 6D Pose Estimation of Partly Occluded Objects
We address the task of 6D pose estimation of known rigid objects from single
input images in scenarios where the objects are partly occluded. Recent
RGB-D-based methods are robust to moderate degrees of occlusion. For RGB
inputs, no previous method works well for partly occluded objects. Our main
contribution is to present the first deep learning-based system that estimates
accurate poses for partly occluded objects from RGB-D and RGB input. We achieve
this with a new instance-aware pipeline that decomposes 6D object pose
estimation into a sequence of simpler steps, where each step removes specific
aspects of the problem. The first step localizes all known objects in the image
using an instance segmentation network, and hence eliminates surrounding
clutter and occluders. The second step densely maps pixels to 3D object surface
positions, so called object coordinates, using an encoder-decoder network, and
hence eliminates object appearance. The third, and final, step predicts the 6D
pose using geometric optimization. We demonstrate that we significantly
outperform the state-of-the-art for pose estimation of partly occluded objects
for both RGB and RGB-D input
PoseAgent: Budget-Constrained 6D Object Pose Estimation via Reinforcement Learning
State-of-the-art computer vision algorithms often achieve efficiency by
making discrete choices about which hypotheses to explore next. This allows
allocation of computational resources to promising candidates, however, such
decisions are non-differentiable. As a result, these algorithms are hard to
train in an end-to-end fashion. In this work we propose to learn an efficient
algorithm for the task of 6D object pose estimation. Our system optimizes the
parameters of an existing state-of-the art pose estimation system using
reinforcement learning, where the pose estimation system now becomes the
stochastic policy, parametrized by a CNN. Additionally, we present an efficient
training algorithm that dramatically reduces computation time. We show
empirically that our learned pose estimation procedure makes better use of
limited resources and improves upon the state-of-the-art on a challenging
dataset. Our approach enables differentiable end-to-end training of complex
algorithmic pipelines and learns to make optimal use of a given computational
budget
- …