Weakly Supervised Object Detection (WSOD) enables the training of object
detection models using only image-level annotations. State-of-the-art WSOD
detectors commonly rely on multi-instance learning (MIL) as their backbone and
assume that the bounding box proposals of an image are independent of each
other. However, because such approaches use only the highest-scoring proposal
and discard potentially useful information from the other proposals, their
independent MIL backbone often limits models to the salient parts of an object
or causes them to detect only one object per class. To solve
the above problems, we propose a novel backbone for WSOD based on our tailored
Vision Transformer named Weakly Supervised Transformer Detection Network
(WSTDN). Our work is the first to demonstrate that self-attention
modules that consider inter-instance relationships are effective backbones for
WSOD. We also introduce a novel bounding box mining (BBM) method integrated
with a memory transfer refinement (MTR) procedure that exploits instance
dependencies to facilitate instance refinement. Experimental results on
the PASCAL VOC2007 and VOC2012 benchmarks demonstrate the effectiveness of our
proposed WSTDN and modified instance refinement modules.
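
The contrast drawn above, between scoring proposals independently and letting proposals attend to one another, can be illustrated with a toy sketch. This is not the WSTDN architecture: the function names, the three made-up proposals, and the projection-free single-head attention are all assumptions chosen only to show the general idea.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_proposal(scores):
    """Independent selection: keep only the index of the highest-scoring
    proposal, ignoring every other proposal."""
    return max(range(len(scores)), key=lambda i: scores[i])


def self_attention(features):
    """Single-head scaled dot-product self-attention over proposal features.
    Simplification (assumption): no learned query/key/value projections."""
    d = len(features[0])
    attended = []
    for q in features:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        weights = softmax(logits)
        # Each output is a weighted mix of all proposals' features.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)]
        )
    return attended


# Hypothetical toy data: three proposals with 2-D features and class scores.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = [0.8, 0.7, 0.1]

best = top_proposal(scores)          # only one proposal survives
attended = self_attention(features)  # every proposal sees the others
```

In the independent setting, the second proposal is discarded even though its score is close to the best one. With self-attention, each proposal's representation becomes a mixture of all proposals, which is the kind of inter-instance dependency the abstract refers to.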