Applications in the field of augmented reality or robotics often require
joint localisation and 6D pose estimation of multiple objects. However, most
algorithms require a separate network to be trained per object class to achieve
the best results. Analysing all visible objects demands multiple inferences,
which is memory- and time-consuming. We present a new single-stage architecture
called CASAPose that determines 2D-3D correspondences for pose estimation of
multiple different objects in RGB images in one pass. It is fast and
memory-efficient, and achieves high accuracy for multiple objects by exploiting the
output of a semantic segmentation decoder as control input to a keypoint
recognition decoder via local class-adaptive normalisation. Our new
differentiable regression of keypoint locations significantly contributes to a
faster closing of the domain gap between real test and synthetic training data.
We apply segmentation-aware convolutions and upsampling operations to increase
the focus inside the object mask and to reduce mutual interference of occluding
objects. For each inserted object, the network grows by only one output
segmentation map and a negligible number of parameters. We outperform
state-of-the-art approaches in challenging multi-object scenes with
inter-object occlusion and synthetic training.

Comment: BMVC 2022, camera-ready version (this submission includes the paper
and supplementary material)
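
The abstract describes local class-adaptive normalisation, in which the segmentation output steers the keypoint decoder. The general idea can be illustrated with a minimal NumPy sketch. This is not the CASAPose implementation: the function name `class_adaptive_norm`, the array shapes, and the choice of per-channel normalisation over spatial positions are all assumptions made for illustration. Each pixel's normalised feature vector is scaled and shifted by parameters looked up from that pixel's predicted class.

```python
import numpy as np

def class_adaptive_norm(features, seg_labels, gamma, beta, eps=1e-5):
    """Hypothetical sketch of class-adaptive normalisation.

    features:   (H, W, C) float feature map
    seg_labels: (H, W) integer class index per pixel (e.g. argmax of a
                segmentation output)
    gamma, beta: (K, C) per-class scale and shift, one row per class
    """
    # Parameter-free normalisation of each channel over all spatial positions
    # (an assumption; other normalisation statistics are possible).
    mean = features.mean(axis=(0, 1), keepdims=True)
    var = features.var(axis=(0, 1), keepdims=True)
    normalised = (features - mean) / np.sqrt(var + eps)

    # Integer indexing picks, for every pixel, the scale and shift row of its
    # class, giving (H, W, C) modulation maps.
    return gamma[seg_labels] * normalised + beta[seg_labels]
```

In this sketch, adding one more object class only appends one row to `gamma` and `beta`, which is in the spirit of the abstract's statement that the network grows by a negligible number of parameters per object.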