Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection
By integrating complementary information from RGB images and depth maps, the
ability of salient object detection (SOD) to handle complex and challenging
scenes can be improved. In recent years, the important role of Convolutional
Neural Networks (CNNs) in feature extraction and cross-modality interaction
has been fully explored, but CNNs remain insufficient for modeling global
long-range dependencies within and across modalities. To this end, we
introduce a CNN-assisted Transformer architecture and propose a novel RGB-D
SOD network with Point-aware Interaction and CNN-induced Refinement
(PICR-Net). On the one hand, considering the prior correlation between the RGB
and depth modalities, an attention-triggered cross-modality point-aware
interaction (CmPI) module is designed to explore the feature interaction of
the two modalities under positional constraints. On the other hand, to
alleviate the blocking effect and detail loss that the Transformer naturally
introduces, we design a CNN-induced refinement (CNNR) unit for content refinement and
supplementation. Extensive experiments on five RGB-D SOD datasets show that the
proposed network achieves competitive results in both quantitative and
qualitative comparisons.
Comment: Accepted by ACM MM 2023
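The abstract leaves the CmPI module's internals unspecified. As a rough illustration of what a position-constrained cross-modality interaction can look like, the PyTorch sketch below restricts attention to the RGB/depth token pair at each spatial location; every name in it (PointwiseCrossModalInteraction and friends) is hypothetical rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PointwiseCrossModalInteraction(nn.Module):
    """Attend between the RGB and depth features at the same spatial point."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.scale = channels ** -0.5

    def forward(self, rgb, depth):
        # rgb, depth: (B, C, H, W) feature maps from the two encoders
        b, c, h, w = rgb.shape
        # Pair the two modality tokens at each spatial position: (B*H*W, 2, C)
        tokens = torch.stack([rgb, depth], dim=-1)              # (B, C, H, W, 2)
        tokens = tokens.permute(0, 2, 3, 4, 1).reshape(-1, 2, c)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        # Attention is restricted to the 2-token pair at each location,
        # which is one way to realize a positional constraint.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).reshape(b, h, w, 2, c).permute(0, 4, 1, 2, 3)
        return out[..., 0] + rgb, out[..., 1] + depth           # residual outputs
```

Restricting attention to the two co-located tokens keeps the cost linear in the number of pixels while still letting each modality reweight the other.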
RGB-D Salient Object Detection: A Survey
Salient object detection (SOD), which simulates the human visual perception
system to locate the most attractive object(s) in a scene, has been widely
applied to various computer vision tasks. Now, with the advent of depth
sensors, depth maps with rich spatial information that can help boost SOD
performance can easily be captured. Although various RGB-D
based SOD models with promising performance have been proposed over the past
several years, an in-depth understanding of these models and challenges in this
topic remains lacking. In this paper, we provide a comprehensive survey of
RGB-D based SOD models from various perspectives, and review related benchmark
datasets in detail. Further, considering that the light field can also provide
depth maps, we review SOD models and popular benchmark datasets from this
domain as well. Moreover, to investigate the SOD ability of existing models, we
carry out a comprehensive evaluation, as well as attribute-based evaluation of
several representative RGB-D based SOD models. Finally, we discuss several
challenges and open directions of RGB-D based SOD for future research. All
collected models, benchmark datasets, source code links, datasets constructed
for attribute-based evaluation, and codes for evaluation will be made publicly
available at https://github.com/taozh2017/RGBDSODsurvey.
Comment: 24 pages, 12 figures. Accepted by Computational Visual Media
CIR-Net: Cross-modality Interaction and Refinement for RGB-D Salient Object Detection
Focusing on the issue of how to effectively capture and utilize
cross-modality information in RGB-D salient object detection (SOD) task, we
present a convolutional neural network (CNN) model, named CIR-Net, based on the
novel cross-modality interaction and refinement. For the cross-modality
interaction, 1) a progressive attention guided integration unit is proposed to
sufficiently integrate RGB-D feature representations in the encoder stage, and
2) a convergence aggregation structure is proposed, which routes the RGB and
depth decoding features into the corresponding RGB-D decoding streams via an
importance gated fusion unit in the decoder stage. For the cross-modality
refinement, we insert a refinement middleware structure between the encoder and
the decoder, in which the RGB, depth, and RGB-D encoder features are further
refined by successively using a self-modality attention refinement unit and a
cross-modality weighting refinement unit. Finally, with the gradually refined
features, we predict the saliency map in the decoder stage. Extensive
experiments on six popular RGB-D SOD benchmarks demonstrate that our network
outperforms the state-of-the-art saliency detectors both qualitatively and
quantitatively.
Comment: Accepted by IEEE Transactions on Image Processing 2022, 16 pages, 11
figures
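The "importance gated fusion unit" is a natural candidate for a small sketch. Below is a minimal, hypothetical PyTorch version in which a learned per-pixel softmax gate decides how much of the RGB and depth decoding features flows into the shared RGB-D stream; the layer choices are assumptions, not CIR-Net's actual unit.

```python
import torch
import torch.nn as nn

class ImportanceGatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-pixel importance map for each modality from both inputs.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1),
            nn.Softmax(dim=1),  # the two modality weights sum to 1 per pixel
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat, depth_feat, rgbd_feat):
        w = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))  # (B, 2, H, W)
        mixed = w[:, 0:1] * rgb_feat + w[:, 1:2] * depth_feat
        # Inject the gated mixture into the RGB-D decoding stream.
        return rgbd_feat + self.fuse(mixed)
```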
VST++: Efficient and Stronger Visual Saliency Transformer
While previous CNN-based models have exhibited promising results for salient
object detection (SOD), their ability to explore global long-range dependencies
is restricted. Our previous work, the Visual Saliency Transformer (VST),
addressed this constraint from a transformer-based sequence-to-sequence
perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task
transformer decoder that concurrently predicts saliency and boundary outcomes
in a pure transformer architecture. Moreover, we introduced a novel token
upsampling method called reverse T2T for predicting a high-resolution saliency
map effortlessly within transformer-based structures. Building upon the VST
model, we further propose an efficient and stronger VST version in this work,
i.e., VST++. To mitigate the computational cost of the VST model, we propose a
Select-Integrate Attention (SIA) module, which partitions the foreground into
fine-grained segments and aggregates background information into a single
coarse-grained token. To incorporate 3D depth information with low cost, we
design a novel depth position encoding method tailored for depth maps.
Furthermore, we introduce a token-supervised prediction loss to provide
straightforward guidance for the task-related tokens. We evaluate our VST++
model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD
benchmark datasets. Experimental results show that our model outperforms
existing methods while achieving a 25% reduction in computational costs without
significant performance compromise. The demonstrated strong generalization
ability, enhanced performance, and heightened efficiency of our VST++ model
highlight its potential.
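To make the Select-Integrate Attention idea concrete, the sketch below keeps fine-grained tokens flagged as foreground and collapses all background tokens into one coarse mean token before attending; the mask source and all names are assumptions, and the paper's module will differ in detail.

```python
import torch

def select_integrate_attention(q, k, v, fg_mask):
    # q, k, v: (N, C) token features; fg_mask: (N,) boolean foreground flags
    fg_k, fg_v = k[fg_mask], v[fg_mask]               # fine-grained fg tokens
    bg_k = k[~fg_mask].mean(dim=0, keepdim=True)      # single coarse bg token
    bg_v = v[~fg_mask].mean(dim=0, keepdim=True)
    keys = torch.cat([fg_k, bg_k], dim=0)             # (M + 1, C) with M << N
    vals = torch.cat([fg_v, bg_v], dim=0)
    attn = torch.softmax(q @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return attn @ vals                                # (N, C)

tokens = torch.randn(196, 64)
mask = torch.rand(196) > 0.7                          # pretend ~30% is foreground
print(select_integrate_attention(tokens, tokens, tokens, mask).shape)
```

Shrinking the key/value set from N tokens to M + 1 is where the computational saving comes from.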
Does Thermal Really Always Matter for RGB-T Salient Object Detection?
In recent years, RGB-T salient object detection (SOD) has attracted
continuous attention, as introducing thermal images makes it possible to
identify salient objects in environments such as low light. However, most of
the existing RGB-T SOD models focus on how to perform cross-modality feature
fusion, ignoring whether the thermal image really always matters in the SOD
task. Starting from the definition and nature of this task, this paper
rethinks the connotation of the thermal modality, and proposes a network named
TNet to solve the
RGB-T SOD task. In this paper, we introduce a global illumination estimation
module to predict the global illuminance score of the image, so as to regulate
the role played by the two modalities. In addition, considering the role of
the thermal modality, we set up different cross-modality interaction
mechanisms in the encoding phase and the decoding phase. On the one hand, we
introduce a semantic constraint provider to enrich the semantics of thermal
images in the encoding phase, which makes the thermal modality more suitable
for the SOD task. On
the other hand, we introduce a two-stage localization and complementation
module in the decoding phase to transfer the object localization and internal
integrity cues in thermal features to the RGB modality. Extensive experiments
on three datasets show that the proposed TNet achieves competitive performance
compared with 20 state-of-the-art methods.
Comment: Accepted by IEEE Trans. Multimedia 2022, 13 pages, 9 figures
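The global illumination estimation module lends itself to a small sketch: below, a hypothetical head regresses a single illuminance score from the RGB image and uses it to weight the two modalities, which is one plausible reading of "regulate the role played by the two modalities". All names and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IlluminationGate(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, 1),
            nn.Sigmoid(),  # score in (0, 1); ~1 means a well-lit scene
        )

    def forward(self, rgb_image, rgb_feat, thermal_feat):
        s = self.head(rgb_image).view(-1, 1, 1, 1)  # global illuminance score
        # Well-lit scenes lean on RGB features; dark scenes lean on thermal.
        return s * rgb_feat + (1.0 - s) * thermal_feat
```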
Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation
Semantic segmentation plays an important role in widespread applications such
as autonomous driving and robotic sensing. Traditional methods mostly use RGB
images, which are heavily affected by lighting conditions, e.g., darkness. Recent
studies show thermal images are robust to the night scenario as a compensating
modality for segmentation. However, existing works either simply fuse
RGB-Thermal (RGB-T) images or adopt the encoder with the same structure for
both the RGB stream and the thermal stream, which neglects the modality
difference in segmentation under varying lighting conditions. Therefore, this
work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic
segmentation. Specifically, we employ an asymmetric encoder to learn the
compensating features of the RGB and the thermal images. To effectively fuse
the dual-modality features, we generate the pseudo-labels by saliency detection
to supervise the feature learning, and develop the Residual Spatial Fusion
(RSF) module with structural re-parameterization to learn more promising
features by spatially fusing the cross-modality features. RSF employs
hierarchical feature fusion to aggregate multi-level features, and applies
spatial weights with a residual connection to adaptively control the
multi-spectral feature fusion through a confidence gate. Extensive experiments
were carried out on two benchmarks, i.e., the MFNet and PST900 databases. The
results show that our method achieves state-of-the-art segmentation
performance, striking a good balance between accuracy and speed.
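Reading the RSF description literally suggests per-pixel spatial weights, a residual connection, and a scalar confidence gate; the hypothetical PyTorch sketch below combines those three ingredients under that assumption, without claiming to reproduce RSFNet's actual module.

```python
import torch
import torch.nn as nn

class ResidualSpatialFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),  # per-pixel weight assigned to the RGB branch
        )
        self.confidence = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 1, kernel_size=1),
            nn.Sigmoid(),  # scalar confidence gate per sample
        )

    def forward(self, rgb_feat, thermal_feat):
        pair = torch.cat([rgb_feat, thermal_feat], dim=1)
        w = self.spatial(pair)                             # (B, 1, H, W)
        fused = w * rgb_feat + (1.0 - w) * thermal_feat    # spatial fusion
        g = self.confidence(pair)                          # (B, 1, 1, 1)
        return rgb_feat + g * fused                        # gated residual
```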