4,953 research outputs found
Unsupervised Domain Adaptation for Multispectral Pedestrian Detection
Multimodal information (e.g., visible and thermal) can generate robust
pedestrian detections to facilitate around-the-clock computer vision
applications, such as autonomous driving and video surveillance. However, it
still remains a crucial challenge to train a reliable detector working well in
different multispectral pedestrian datasets without manual annotations. In this
paper, we propose a novel unsupervised domain adaptation framework for
multispectral pedestrian detection, by iteratively generating pseudo
annotations and updating the parameters of our designed multispectral
pedestrian detector on target domain. Pseudo annotations are generated using
the detector trained on source domain, and then updated by fixing the
parameters of detector and minimizing the cross entropy loss without
back-propagation. Training labels are generated using the pseudo annotations by
considering the characteristics of similarity and complementarity between
well-aligned visible and infrared image pairs. The parameters of detector are
updated using the generated labels by minimizing our defined multi-detection
loss function with back-propagation. The optimal parameters of detector can be
obtained after iteratively updating the pseudo annotations and parameters.
Experimental results show that our proposed unsupervised multimodal domain
adaptation method achieves significantly higher detection performance than the
approach without domain adaptation, and is competitive with the supervised
multispectral pedestrian detectors
Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation
Multimodal learning seeks to utilize data from multiple sources to improve
the overall performance of downstream tasks. It is desirable for redundancies
in the data to make multimodal systems robust to missing or corrupted
observations in some correlated modalities. However, we observe that the
performance of several existing multimodal networks significantly deteriorates
if one or multiple modalities are absent at test time. To enable robustness to
missing modalities, we propose simple and parameter-efficient adaptation
procedures for pretrained multimodal networks. In particular, we exploit
low-rank adaptation and modulation of intermediate features to compensate for
the missing modalities. We demonstrate that such adaptation can partially
bridge performance drop due to missing modalities and outperform independent,
dedicated networks trained for the available modality combinations in some
cases. The proposed adaptation requires extremely small number of parameters
(e.g., fewer than 0.7% of the total parameters in most experiments). We conduct
a series of experiments to highlight the robustness of our proposed method
using diverse datasets for RGB-thermal and RGB-Depth semantic segmentation,
multimodal material segmentation, and multimodal sentiment analysis tasks. Our
proposed method demonstrates versatility across various tasks and datasets, and
outperforms existing methods for robust multimodal learning with missing
modalities.Comment: 18 pages, 3 figures, 11 table
Understanding Dark Scenes by Contrasting Multi-Modal Observations
Understanding dark scenes based on multi-modal image data is challenging, as
both the visible and auxiliary modalities provide limited semantic information
for the task. Previous methods focus on fusing the two modalities but neglect
the correlations among semantic classes when minimizing losses to align pixels
with labels, resulting in inaccurate class predictions. To address these
issues, we introduce a supervised multi-modal contrastive learning approach to
increase the semantic discriminability of the learned multi-modal feature
spaces by jointly performing cross-modal and intra-modal contrast under the
supervision of the class correlations. The cross-modal contrast encourages
same-class embeddings from across the two modalities to be closer and pushes
different-class ones apart. The intra-modal contrast forces same-class or
different-class embeddings within each modality to be together or apart. We
validate our approach on a variety of tasks that cover diverse light conditions
and image modalities. Experiments show that our approach can effectively
enhance dark scene understanding based on multi-modal images with limited
semantics by shaping semantic-discriminative feature spaces. Comparisons with
previous methods demonstrate our state-of-the-art performance. Code and
pretrained models are available at https://github.com/palmdong/SMMCL
- …