Methods for object detection and segmentation often require abundant
instance-level annotations for training, which are time-consuming and expensive
to collect. To address this, the task of zero-shot object detection (or
segmentation) aims at learning effective methods for identifying and localizing
object instances for the categories that have no supervision available.
Constructing architectures for these tasks requires choosing from a myriad of
design options, ranging from the form of the class encoding used to transfer
information from seen to unseen categories, to the nature of the function being
optimized for learning. In this work, we extensively study these design
choices, and carefully construct a simple yet extremely effective zero-shot
recognition method. Through extensive experiments on the MSCOCO dataset on
object detection and segmentation, we highlight that our proposed method
outperforms existing, considerably more complex, architectures. Our findings
and method, which we propose as a competitive future baseline, point towards
the need to revisit some of the recent design trends in zero-shot detection /
segmentation.Comment: 17 Pages, 7 Figure