In this paper, we present a weakly-supervised RGB-D salient object detection
model via scribble supervision. Specifically, treating the problem as a
multimodal learning task, we focus on effective multimodal representation
learning via inter-modal mutual information regularization. In particular, following the principle of
disentangled representation learning, we introduce a mutual information upper
bound with a mutual information minimization regularizer to encourage the
disentangled representation of each modality for salient object detection.
Based on our multimodal representation learning framework, we introduce an
asymmetric feature extractor for our multimodal data, which proves more
effective than the conventional symmetric backbone setting. We further introduce
a multimodal variational auto-encoder as a stochastic prediction refinement
technique, which takes pseudo labels from the first training stage as
supervision and generates refined predictions. Experimental results on benchmark
RGB-D salient object detection datasets verify both the effectiveness of our
explicit multimodal disentangled representation learning method and that of the
stochastic prediction refinement strategy, with performance comparable to
state-of-the-art fully supervised models. Our code and data are
available at: https://github.com/baneitixiaomai/MIRV.

Comment: IEEE Transactions on Circuits and Systems for Video Technology 202
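The idea of an inter-modal mutual information minimization regularizer can be illustrated with a toy Gaussian proxy. This is a simplified sketch of ours, not the paper's actual estimator or upper bound: under a jointly Gaussian assumption, the per-dimension MI between two modality embeddings is -0.5 * log(1 - rho^2), so penalizing this quantity drives the cross-modal correlation (and hence the Gaussian MI estimate) toward zero, encouraging disentangled representations.

```python
import numpy as np

def gaussian_mi_proxy(z_rgb, z_depth, eps=1e-8):
    """Toy Gaussian proxy for inter-modal mutual information.

    z_rgb, z_depth: arrays of shape (batch, dim), one embedding per modality.
    Returns a scalar regularization term: the sum over dimensions of
    -0.5 * log(1 - rho^2), where rho is the per-dimension Pearson
    correlation between the two embeddings. Minimizing it pushes the
    modalities toward statistical independence (under a Gaussian model).
    """
    # Standardize each dimension within the batch.
    zr = (z_rgb - z_rgb.mean(0)) / (z_rgb.std(0) + eps)
    zd = (z_depth - z_depth.mean(0)) / (z_depth.std(0) + eps)
    # Per-dimension correlation between the matched pairs.
    rho = (zr * zd).mean(0)
    # Gaussian MI per dimension, summed; eps keeps the log finite at |rho|=1.
    return float(-0.5 * np.log(1.0 - rho**2 + eps).sum())

rng = np.random.default_rng(0)
z_rgb = rng.normal(size=(64, 16))
z_indep = rng.normal(size=(64, 16))           # independent depth embedding
z_corr = z_rgb + 0.1 * rng.normal(size=(64, 16))  # strongly correlated one

mi_indep = gaussian_mi_proxy(z_rgb, z_indep)  # near zero
mi_corr = gaussian_mi_proxy(z_rgb, z_corr)    # large
```

In practice the paper's regularizer operates on learned deep features and a variational MI upper bound rather than this closed-form Gaussian proxy, but the optimization pressure is the same: shrink the measurable dependence between the RGB and depth representations.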