Object pose estimation is a core perception task that enables, for example,
object grasping and scene understanding. The widely available, inexpensive and
high-resolution RGB sensors and CNNs that allow for fast inference based on
this modality make monocular approaches especially well suited for robotics
applications. We observe that previous surveys on object pose estimation
establish the state of the art for varying modalities, single- and multi-view
settings, and datasets and metrics that consider a multitude of applications.
We argue, however, that those works' broad scope hinders the identification of
open challenges that are specific to monocular approaches and the derivation of
promising future challenges for their application in robotics. By providing a
unified view on recent publications from both robotics and computer vision, we
find that occlusion handling, novel pose representations, and formalizing and
improving category-level pose estimation are still fundamental challenges that
are highly relevant for robotics. Moreover, to further improve robotic
performance, large object sets, novel objects, refractive materials, and
uncertainty estimates are central, largely unsolved open challenges. In order
to address them, ontological reasoning, deformability handling, scene-level
reasoning, realistic datasets, and the ecological footprint of algorithms need
to be improved.Comment: arXiv admin note: substantial text overlap with arXiv:2302.1182