This paper presents a novel approach to estimating the continuous
six-degree-of-freedom (6-DoF) pose (3D translation and rotation) of an object from a
single RGB image. The approach combines semantic keypoints predicted by a
convolutional network (convnet) with a deformable shape model. Unlike prior
work, the approach is agnostic to whether the object is textured or textureless, as the
convnet learns the optimal representation from the available training image
data. Furthermore, the approach can be applied to instance- and class-based
pose recovery. Empirically, we show that the proposed approach can accurately
recover the 6-DoF object pose for both instance- and class-based scenarios with
a cluttered background. For class-based object pose estimation,
state-of-the-art accuracy is shown on the large-scale PASCAL3D+ dataset.

Comment: IEEE International Conference on Robotics and Automation (ICRA), 201