Object detection on visible (RGB) and infrared (IR) images, as an emerging
solution to facilitate robust detection for around-the-clock applications, has
received extensive attention in recent years. With the help of IR images,
object detectors have become more reliable and robust in practical applications
by exploiting combined RGB-IR information. However, existing methods still suffer
from modality miscalibration and fusion imprecision problems. Since transformers
have a powerful capability to model the pairwise correlations between
different features, in this paper, we propose a novel Calibrated and
Complementary Transformer called C2Former to address these two
problems simultaneously. In C2Former, we design an Inter-modality
Cross-Attention (ICA) module to obtain the calibrated and complementary
features by learning the cross-attention relationship between the RGB and IR
modalities. To reduce the computational cost caused by computing the global
attention in ICA, an Adaptive Feature Sampling (AFS) module is introduced to
decrease the dimension of feature maps. Because C2Former operates
in the feature domain, it can be embedded into existing RGB-IR object detectors
via the backbone network. Thus, one single-stage and one two-stage object
detector, both incorporating our C2Former, are constructed to
evaluate its effectiveness and versatility. With extensive experiments on the
DroneVehicle and KAIST RGB-IR datasets, we verify that our method can fully
utilize the RGB-IR complementary information and achieve robust detection
results. The code is available at
https://github.com/yuanmaoxun/Calibrated-and-Complementary-Transformer-for-RGB-Infrared-Object-Detection.git
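
The cross-attention idea behind ICA, with a reduced set of sampled features as
in AFS, can be illustrated with a minimal NumPy sketch. This is not the
released C2Former code: the function names (avg_pool_tokens, cross_attention),
the random projection matrices, the fixed average pooling used as a stand-in
for adaptive sampling, and the residual addition used for fusion are all
assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool_tokens(feat, stride):
    # Hypothetical stand-in for adaptive sampling: fixed average pooling.
    # feat: (H, W, C) -> ((H // stride) * (W // stride), C)
    h, w, c = feat.shape
    hs, ws = h // stride, w // stride
    feat = feat[:hs * stride, :ws * stride]
    pooled = feat.reshape(hs, stride, ws, stride, c).mean(axis=(1, 3))
    return pooled.reshape(hs * ws, c)

def cross_attention(query_feat, context_feat, w_q, w_k, w_v, stride=2):
    # Queries come from one modality, keys and values from the other.
    h, w, c = query_feat.shape
    q = query_feat.reshape(h * w, c) @ w_q          # (H*W, d)
    ctx = avg_pool_tokens(context_feat, stride)     # (N, C), N < H*W
    k = ctx @ w_k                                   # (N, d)
    v = ctx @ w_v                                   # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (H*W, N)
    out = attn @ v                                  # (H*W, d)
    return out.reshape(h, w, -1), attn

# Toy feature maps standing in for backbone outputs of each modality.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
rgb = rng.standard_normal((H, W, C))
ir = rng.standard_normal((H, W, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

# RGB positions attend to pooled IR features; fusion by residual addition
# is an assumption of this sketch, not a claim about the paper's design.
ir_for_rgb, attn = cross_attention(rgb, ir, w_q, w_k, w_v, stride=2)
fused = rgb + ir_for_rgb
```

With a stride of 2, each of the 64 RGB positions attends to 16 pooled IR
tokens rather than 64, which is the kind of cost reduction that sampling
before global attention is meant to provide.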