Estimating the relative pose of a new object without prior knowledge is a
hard problem, yet it is an ability much needed in robotics and Augmented
Reality. We present a method for tracking the 6D motion of objects in RGB video
sequences when neither training images nor the 3D geometry of the objects
are available. Unlike previous work, our method can therefore handle
unknown objects in an open-world setting immediately, without requiring any prior
information or a specific training phase. We consider two architectures, one
based on two frames, and the other relying on a Transformer Encoder, which can
exploit an arbitrary number of past frames. We train our architectures using
only synthetic renderings with domain randomization. Our results on challenging
datasets are on par with previous works that require much more information
(training images of the target objects, 3D models, and/or depth data). Our
source code is available at https://github.com/nv-nguyen/pizza

Comment: 3DV Ora