Dynamic Knowledge Distillation with A Single Stream Structure for RGB-D Salient Object Detection
RGB-D salient object detection (SOD) demonstrates its superiority in detecting
salient objects in complex environments, thanks to the additional depth
information in the data. Inevitably, an independent stream is introduced to
extract features from depth images, leading to extra computation and
parameters. This methodology, which sacrifices model size to improve detection
accuracy, may impede the practical application of SOD. To tackle this dilemma,
we propose a dynamic distillation method along with a lightweight framework,
which significantly reduces the parameters. This method considers the factors
of both teacher and student performance within the training stage and
dynamically assigns the distillation weight instead of applying a fixed weight
on the student model. Extensive experiments are conducted on five public
datasets to demonstrate that our method can achieve competitive performance
compared to 10 prior methods through a 78.2 MB lightweight structure.
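
The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea of a distillation weight that depends on both teacher and student performance. The function names and the ratio-based rule are assumptions made for illustration, not the authors' method.

```python
def dynamic_distill_weight(teacher_loss: float, student_loss: float,
                           eps: float = 1e-8) -> float:
    """Hypothetical rule (not from the paper): lean on the teacher more
    when its task loss is low relative to the student's."""
    return student_loss / (student_loss + teacher_loss + eps)


def combined_loss(student_task_loss: float, teacher_task_loss: float,
                  distill_loss: float) -> float:
    """Blend the student's own task loss with the distillation loss,
    using a weight recomputed at every training step."""
    alpha = dynamic_distill_weight(teacher_task_loss, student_task_loss)
    return (1.0 - alpha) * student_task_loss + alpha * distill_loss
```

With a fixed weight, `alpha` would be a constant hyperparameter. Here it shifts toward the distillation term whenever the teacher clearly outperforms the student, and toward the task term as the student catches up.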
Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection
Most existing RGB-D salient object detection (SOD) methods follow the
CNN-based paradigm, which is unable to model long-range dependencies across
space and modalities due to the natural locality of CNNs. Here we propose the
Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to
tackle this problem. Unlike previous multi-modal transformers that directly
connect all patches from two modalities, we explore the cross-modal
complementarity hierarchically to respect the modality gap and spatial
discrepancy in unaligned regions. Specifically, we propose to use intra-modal
self-attention to explore complementary global contexts, and measure
spatial-aligned inter-modal attention locally to capture cross-modal
correlations. In addition, we present a Feature Pyramid module for Transformer
(FPT) to boost informative cross-scale integration as well as a
consistency-complementarity module to disentangle the multi-modal integration
path and improve the fusion adaptivity. Comprehensive experiments on a large
variety of public datasets verify the efficacy of our designs and the
consistent improvement over state-of-the-art models.
Comment: 10 pages, 10 figures
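
To illustrate what "spatial-aligned inter-modal attention" measured locally might look like, here is a minimal sketch in which each RGB token attends only to the depth tokens within a small window around its own position. The 1-D token layout, the `radius` parameter, and the absence of learned projections are simplifying assumptions, not details taken from the paper.

```python
import math


def local_cross_attention(rgb, depth, radius=1):
    """For each position i, the RGB token (query) attends to the depth
    tokens (keys and values) at positions i-radius .. i+radius only."""
    n, dim = len(rgb), len(rgb[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(n):
        window = range(max(0, i - radius), min(n, i + radius + 1))
        # Scaled dot-product scores against depth tokens in the window.
        scores = [scale * sum(q * k for q, k in zip(rgb[i], depth[j]))
                  for j in window]
        # Numerically stable softmax over the window.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the depth tokens in the window.
        out.append([sum(w * depth[j][d] for w, j in zip(weights, window))
                    for d in range(dim)])
    return out
```

Restricting attention to a window keeps the cross-modal comparison between roughly aligned locations, whereas connecting all patches from both modalities would let every RGB token attend to every depth token.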
Bifurcated backbone strategy for RGB-D salient object detection
Multi-level feature fusion is a fundamental topic in computer vision. It has
been exploited to detect, segment and classify objects at various scales. When
multi-level features meet multi-modal cues, the optimal feature aggregation and
multi-modal learning strategy become a hot potato. In this paper, we leverage
the inherent multi-modal and multi-level nature of RGB-D salient object
detection to devise a novel cascaded refinement network. In particular, first,
we propose to regroup the multi-level features into teacher and student
features using a bifurcated backbone strategy (BBS). Second, we introduce a
depth-enhanced module (DEM) to excavate informative depth cues from the channel
and spatial views. Then, RGB and depth modalities are fused in a complementary
way. Our architecture, named Bifurcated Backbone Strategy Network (BBS-Net), is
simple, efficient, and backbone-independent. Extensive experiments show that
BBS-Net significantly outperforms eighteen SOTA models on eight challenging
datasets under five evaluation measures, demonstrating the superiority of our
approach (an improvement in S-measure over the top-ranked model,
DMRA-iccv2019). In addition, we provide a comprehensive analysis of the
generalization ability of different RGB-D datasets and provide a powerful
training set for future research.
Comment: A preliminary version of this work has been accepted in ECCV 202
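
As a rough illustration of enhancing depth features "from the channel and spatial views", the sketch below gates a depth feature map first per channel and then per spatial location. The max-pooling and sigmoid gates without learned layers are assumptions chosen to keep the example self-contained, and the function name `depth_enhance` is hypothetical. This is not the DEM as specified by the authors.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def depth_enhance(feat):
    """feat: list of C channels, each an H x W nested list of floats."""
    # Channel view: one gate per channel, from that channel's global maximum.
    ch_gates = [sigmoid(max(max(row) for row in ch)) for ch in feat]
    gated = [[[g * v for v in row] for row in ch]
             for g, ch in zip(ch_gates, feat)]
    # Spatial view: one gate per location, from the maximum over channels.
    h, w = len(feat[0]), len(feat[0][0])
    sp_gates = [[sigmoid(max(ch[y][x] for ch in gated)) for x in range(w)]
                for y in range(h)]
    return [[[sp_gates[y][x] * ch[y][x] for x in range(w)] for y in range(h)]
            for ch in gated]
```

The output has the same shape as the input, so in a fusion pipeline it could be combined element-wise with an RGB feature map of matching size.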