Existing object detection models are mainly trained on large-scale labeled
datasets. However, annotating data for novel aerial object classes is expensive
since it is time-consuming and may require expert knowledge. Thus, it is
desirable to study label-efficient object detection methods on aerial images.
In this work, we propose a zero-shot method for aerial object detection named
visual Description Regularization, or DescReg. Concretely, we identify the weak
semantic-visual correlation of aerial objects and aim to address this
challenge with prior descriptions of their visual appearance. Instead of
directly encoding the descriptions into the class embedding space, which suffers
from the representation gap problem, we propose to infuse the prior inter-class
visual similarity conveyed in the descriptions into the embedding learning. The
infusion process is accomplished with a newly designed similarity-aware triplet
loss which incorporates structured regularization on the representation space.
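The abstract does not spell out the loss itself; as a rough illustration, below is a minimal PyTorch sketch of one plausible form such a similarity-aware triplet loss could take, assuming a description-derived inter-class similarity matrix (sim_matrix) and per-negative margins that shrink for class pairs the descriptions mark as visually similar. The function name, arguments, and margin schedule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def similarity_aware_triplet_loss(embeddings, labels, sim_matrix, base_margin=0.5):
    """Hypothetical sketch of a similarity-aware triplet loss, not the
    authors' implementation.

    embeddings : (N, D) L2-normalized region/class embeddings
    labels     : (N,) integer class ids
    sim_matrix : (C, C) prior inter-class similarity in [0, 1], derived
                 from visual descriptions; higher means more alike
    """
    total, num_triplets = embeddings.new_zeros(()), 0
    for a in range(labels.numel()):
        pos_mask = labels == labels[a]
        neg_mask = ~pos_mask
        if pos_mask.sum() < 2 or neg_mask.sum() == 0:
            continue  # need at least one other positive and one negative
        anchor = embeddings[a]
        # hardest positive: same-class sample farthest from the anchor
        pos_dist = (anchor - embeddings[pos_mask]).norm(dim=1).max()
        # structured margins: classes the descriptions deem visually similar
        # get a smaller margin, so they may embed closer to the anchor class
        neg_dist = (anchor - embeddings[neg_mask]).norm(dim=1)
        margins = base_margin * (1.0 - sim_matrix[labels[a], labels[neg_mask]])
        total = total + F.relu(pos_dist - neg_dist + margins).sum()
        num_triplets += int(neg_mask.sum())
    return total / max(num_triplets, 1)
```

Under this form, the structured regularization enters through the per-negative margins: dissimilar classes are repelled by the full base margin, while classes described as visually alike face a weaker repulsion.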
We conduct extensive experiments with three challenging aerial object detection
datasets, including DIOR, xView, and DOTA. The results demonstrate that DescReg
significantly outperforms state-of-the-art zero-shot detection (ZSD) methods with
complex projection designs and generative frameworks; e.g., DescReg outperforms the
best reported ZSD method on DIOR by 4.5 mAP on unseen classes and 8.1 in harmonic mean (HM). We
further show the generalizability of DescReg by integrating it into generative
ZSD methods and by varying the detection architecture.