Acquiring labeled 6D poses from real images is an expensive and
time-consuming task. Though massive amounts of synthetic RGB images are easy to
obtain, the models trained on them suffer from noticeable performance
degradation due to the synthetic-to-real domain gap. To mitigate this
degradation, we propose a practical self-supervised domain adaptation approach
that takes advantage of real RGB(-D) data without needing real pose labels. We
first pre-train the model with synthetic RGB images and then utilize real
RGB(-D) images to fine-tune the pre-trained model. Fine-tuning is
self-supervised by an RGB-based pose-aware consistency and a depth-guided
object distance pseudo-label, and it does not require time-consuming online
differentiable rendering. We build our domain adaptation method on the
recent pose estimator SC6D and evaluate it on the YCB-Video dataset. We
experimentally demonstrate that our method achieves performance comparable to
that of its fully supervised counterpart while outperforming existing
state-of-the-art approaches.

Comment: SCIA202
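
As a rough illustration of what a depth-guided object distance pseudo-label could look like, the sketch below back-projects the depth pixels inside an object mask and takes the norm of their median 3D point. This is an assumption made for illustration, not the procedure described in the paper: the function name `distance_pseudo_label`, the availability of an object mask, depth in meters, and the use of the median are all hypothetical choices. The visible surface lies closer to the camera than the object center, so a real pipeline would likely need some correction for that bias.

```python
import numpy as np

def distance_pseudo_label(depth, mask, K):
    """Hypothetical helper: estimate camera-to-object distance from depth.

    depth: (H, W) depth map in meters, 0 where invalid (assumed convention)
    mask:  (H, W) object mask (assumed to be given, e.g. by a detector)
    K:     (3, 3) pinhole camera intrinsics
    """
    valid = mask.astype(bool) & (depth > 0)
    if not valid.any():
        return None  # no usable depth, so no pseudo-label for this object

    v, u = np.nonzero(valid)          # pixel rows (v) and columns (u)
    z = depth[v, u]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Back-project the masked pixels to 3D camera coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)

    # The per-axis median is robust to a few outlying depth readings.
    center = np.median(points, axis=0)
    return float(np.linalg.norm(center))
```

During fine-tuning, such a value could serve as a target for the distance the network predicts on real images, while rotation is constrained separately by the RGB-based consistency term.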