Most self-supervised 6D object pose estimation methods either require
additional depth information or rely on accurate annotations of 2D
segmentation masks, which limits their applicability. In this paper, we propose
a 6D object pose estimation method that can be trained with pure RGB images
without any auxiliary information. We first obtain a rough pose initialization
from networks trained on synthetic images rendered from the target's 3D mesh.
Then, we introduce a refinement strategy that leverages the geometric
constraint in synthetic-to-real image pairs from multiple views. We formulate
this constraint as pixel-level flow consistency between the training images,
using dynamically generated pseudo labels. We evaluate our method on three
challenging datasets and demonstrate that it significantly outperforms
state-of-the-art self-supervised methods, using neither 2D annotations nor
additional depth images.

Comment: Accepted by ICCV 202
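
As a rough illustration of the flow-consistency idea only, and not the implementation used in this work, the sketch below checks that the 2D flow from one synthetic view to the real image agrees with the flow composed through a second synthetic view. Everything in it is an assumption made for illustration: the pinhole camera model, the helper names (`rot_z`, `project`, `geometric_flow`, `consistency_residual`), and the use of sparse per-point flows computed from known poses as a stand-in for dense flows predicted by a network.

```python
import numpy as np


def rot_z(angle):
    """Rotation matrix about the camera z-axis (hypothetical helper)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def project(points, R, t, K):
    """Project Nx3 object points to Nx2 pixels with a pinhole camera."""
    cam = points @ R.T + t          # object frame -> camera frame
    uv = cam @ K.T                  # camera frame -> homogeneous pixels
    return uv[:, :2] / uv[:, 2:3]   # perspective division


def geometric_flow(points, pose_src, pose_dst, K):
    """Per-point 2D displacement when the pose changes from src to dst."""
    R_s, t_s = pose_src
    R_d, t_d = pose_dst
    return project(points, R_d, t_d, K) - project(points, R_s, t_s, K)


def consistency_residual(flow_a_real, flow_b_real, flow_a_b):
    """Mean end-point error between the direct flow a->real and the
    composed flow a->b->real, evaluated at corresponding points."""
    composed = flow_a_b + flow_b_real
    return float(np.linalg.norm(flow_a_real - composed, axis=1).mean())


# Toy setup: all values below are made up for illustration.
rng = np.random.default_rng(0)
points = rng.uniform(-0.1, 0.1, size=(50, 3))
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pose_a = (rot_z(0.00), np.array([0.00, 0.00, 1.0]))     # synthetic view a
pose_b = (rot_z(0.05), np.array([0.01, 0.00, 1.0]))     # synthetic view b
pose_real = (rot_z(0.10), np.array([0.02, 0.01, 1.0]))  # stand-in for the real image

flow_a_b = geometric_flow(points, pose_a, pose_b, K)
flow_a_real = geometric_flow(points, pose_a, pose_real, K)
flow_b_real = geometric_flow(points, pose_b, pose_real, K)

residual = consistency_residual(flow_a_real, flow_b_real, flow_a_b)
print(f"consistency residual: {residual:.2e}")
```

In this toy setting the residual vanishes up to floating-point error because all three flows come from exact geometry. With flows predicted from images, a nonzero residual of this kind is the sort of signal that could drive refinement without 2D annotations or depth.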