Object localization, and more specifically object pose estimation, in large
industrial spaces such as warehouses and production facilities, is essential
for material flow operations. Traditional approaches rely on artifacts
installed in the environment or on excessively expensive equipment, neither of
which is suitable at scale. A more practical approach is to use existing
cameras in such spaces in order to address the underlying pose estimation
problem and to localize objects of interest. In order to leverage
state-of-the-art methods in deep learning for object pose estimation, large
amounts of data need to be collected and annotated. In this work, we provide an
approach to the annotation of large datasets of monocular images without the
need for manual labor. Our approach localizes cameras in space, aligns their
coordinate frames with that of a motion capture system, and uses a set of
linear mappings to project 3D models of objects of interest into the images at
their ground-truth 6D poses. We test our pipeline on a custom dataset collected from a system of
eight cameras in an industrial setting that mimics the intended area of
operation. Our approach provided annotations of consistent quality for our
dataset of 26,482 object instances in a fraction of the time required by
human annotators.
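
As an illustration of the projection step described above, the following is a minimal sketch, not the pipeline's actual implementation. It assumes an ideal pinhole camera without lens distortion, and that the camera extrinsics have already been expressed in the motion capture frame. The names `project_model`, `T_mocap_obj`, `T_cam_mocap`, and `K` are hypothetical and chosen for this example only.

```python
import numpy as np


def to_homogeneous(points):
    """Append a column of ones: (N, 3) -> (N, 4)."""
    return np.hstack([points, np.ones((points.shape[0], 1))])


def project_model(points_obj, T_mocap_obj, T_cam_mocap, K):
    """Project 3D model points (object frame) to pixel coordinates.

    points_obj  : (N, 3) model vertices in the object's own frame.
    T_mocap_obj : (4, 4) object pose in the motion capture frame
                  (assumed here to be the ground-truth 6D pose).
    T_cam_mocap : (4, 4) transform from motion capture frame to camera frame
                  (assumed known from camera localization).
    K           : (3, 3) pinhole intrinsic matrix (no distortion modeled).
    """
    # Chain the rigid transforms: object -> motion capture -> camera.
    pts_cam = (T_cam_mocap @ T_mocap_obj @ to_homogeneous(points_obj).T)[:3]
    # Apply intrinsics, then divide by depth to obtain pixel coordinates.
    uvw = K @ pts_cam
    return (uvw[:2] / uvw[2]).T  # shape (N, 2)


# Hypothetical example values: object 4 m in front of the camera.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_mocap = np.eye(4)
T_mocap_obj = np.eye(4)
T_mocap_obj[:3, 3] = [0.0, 0.0, 4.0]
points_obj = np.array([[0.0, 0.0, 0.0],
                       [0.5, 0.0, 0.0]])

pixels = project_model(points_obj, T_mocap_obj, T_cam_mocap, K)
```

With these example values, the object origin lands on the principal point and the second vertex is shifted horizontally. The sketch does not handle points behind the camera or occlusion, both of which a practical annotation tool would likely need to address.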