We introduce a scalable approach for object pose estimation trained on
simulated RGB views of multiple 3D models together. We learn an encoding of
object views that not only describes an implicit orientation of all objects
seen during training, but can also relate views of untrained objects. Our
single-encoder-multi-decoder network is trained using a technique we denote
"multi-path learning": While the encoder is shared by all objects, each decoder
only reconstructs views of a single object. Consequently, views of different
instances do not have to be separated in the latent space and can share common
features. The resulting encoder generalizes well from synthetic to real data
and across various instances, categories, model types and datasets. We
systematically investigate the learned encodings, their generalization, and
iterative refinement strategies on the ModelNet40 and T-LESS datasets. Despite
training jointly on multiple objects, our 6D Object Detection pipeline achieves
state-of-the-art results on T-LESS at much lower runtimes than competing
approaches.

Comment: To appear at CVPR 2020; code will be available here:
https://github.com/DLR-RM/AugmentedAutoencoder/tree/multipat
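
The routing idea behind "multi-path learning" can be illustrated with a minimal sketch. It is not the authors' implementation: single linear layers stand in for the encoder and decoder networks, and the names MultiPathAutoencoder, multi_path_loss, input_dim and latent_dim are hypothetical, chosen only for this illustration. The sketch shows that every view passes through the same encoder, while its reconstruction error is computed only through the decoder that belongs to its object.

```python
import random


def init_weights(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Bias-free dense layer: one dot product per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class MultiPathAutoencoder:
    """One shared encoder, one decoder per object (illustrative assumption:
    linear layers replace the real networks)."""

    def __init__(self, num_objects, input_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        # Shared by all objects.
        self.encoder = init_weights(latent_dim, input_dim, rng)
        # decoders[i] only ever reconstructs views of object i.
        self.decoders = [
            init_weights(input_dim, latent_dim, rng) for _ in range(num_objects)
        ]

    def encode(self, view):
        return linear(self.encoder, view)

    def reconstruct(self, view, object_id):
        # Route the shared latent code through the decoder of this object.
        return linear(self.decoders[object_id], self.encode(view))


def multi_path_loss(model, views, object_ids, targets):
    """Mean squared reconstruction error, each sample using its own decoder."""
    total = 0.0
    for view, object_id, target in zip(views, object_ids, targets):
        recon = model.reconstruct(view, object_id)
        total += sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)
    return total / len(views)


if __name__ == "__main__":
    rng = random.Random(1)
    model = MultiPathAutoencoder(num_objects=3, input_dim=8, latent_dim=4)
    views = [[rng.random() for _ in range(8)] for _ in range(5)]
    object_ids = [0, 2, 1, 0, 2]
    print(multi_path_loss(model, views, object_ids, views))
```

Because each sample's loss touches only one decoder, the encoder is never asked to tell objects apart, which is consistent with the statement above that views of different instances can share features in the latent space.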