Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks
Recently, RGB-Thermal based perception has shown significant advances.
Thermal information provides useful clues when visual cameras suffer from poor
lighting conditions, such as low light and fog. However, how to effectively
fuse RGB images and thermal data remains an open challenge. Previous works
involve naive fusion strategies such as merging them at the input,
concatenating multi-modality features inside models, or applying attention to
each data modality. These fusion strategies are straightforward yet
insufficient. In this paper, we propose a novel fusion method named Explicit
Attention-Enhanced Fusion (EAEF) that fully takes advantage of each type of
data. Specifically, we consider three cases: i) both RGB and thermal data
generate discriminative features, ii) only one modality does, and iii) neither
does. EAEF uses one branch to enhance feature extraction for cases i) and iii),
and another branch to remedy insufficient representations for case ii). The
outputs of the two branches are fused to form complementary features. As a
result, the proposed fusion method outperforms the state-of-the-art by 1.6\% in
mIoU on semantic segmentation, 3.1\% in MAE on salient object detection, 2.3\%
in mAP on object detection, and 8.1\% in MAE on crowd counting. The code is
available at https://github.com/FreeformRobotics/EAEFNet
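To make the case-based design above concrete, below is a minimal, illustrative PyTorch sketch of a dual-branch RGB-thermal fusion block. It is not the released EAEFNet code: the channel-attention layers, gating arithmetic, and module names are assumptions used only to show how an enhancement branch and a compensation branch could be combined into complementary features.

```python
# Illustrative sketch only (not the official EAEFNet implementation): a dual-branch
# RGB-thermal fusion block in which one branch enhances features when both (or neither)
# modality is discriminative and the other compensates when only one modality is.
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Per-modality channel attention (hypothetical design choice).
        self.att_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.att_thm = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.merge = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_rgb: torch.Tensor, f_thm: torch.Tensor) -> torch.Tensor:
        a_rgb = self.att_rgb(self.pool(f_rgb))  # confidence that RGB features are discriminative
        a_thm = self.att_thm(self.pool(f_thm))  # confidence that thermal features are discriminative
        # Enhancement branch: both modalities informative -> mutual excitation.
        enhanced = f_rgb * a_thm + f_thm * a_rgb
        # Compensation branch: one modality weak -> the other fills in the gap.
        compensated = f_rgb * (1 - a_thm) + f_thm * (1 - a_rgb)
        return self.merge(torch.cat([enhanced, compensated], dim=1))

# Example usage with hypothetical feature maps:
# fused = DualBranchFusion(64)(rgb_features, thermal_features)
```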
CHITNet: A Complementary to Harmonious Information Transfer Network for Infrared and Visible Image Fusion
Current infrared and visible image fusion (IVIF) methods go to great lengths
to excavate complementary features and design complex fusion strategies, which
is extremely challenging. To this end, we rethink the IVIF outside the box,
proposing a complementary to harmonious information transfer network (CHITNet).
It reasonably transfers complementary information into harmonious information,
which integrates both the shared and complementary features of the two
modalities.
Specifically, to skillfully sidestep aggregating complementary information in
IVIF, we design a mutual information transfer (MIT) module to mutually
represent features from the two modalities, roughly transferring complementary
information into harmonious information. Then, a harmonious information acquisition
supervised by source image (HIASSI) module is devised to further ensure the
complementary to harmonious information transfer after MIT. Meanwhile, we also
propose a structure information preservation (SIP) module to guarantee that the
edge structure information of the source images can be transferred to the
fusion results. Moreover, a mutual promotion training paradigm (MPTP) with
interaction loss is adopted to facilitate better collaboration among MIT,
HIASSI and SIP. In this way, the proposed method is able to generate fused
images with higher qualities. Extensive experimental results demonstrate the
superiority of our CHITNet over state-of-the-art algorithms in terms of visual
quality and quantitative evaluations.
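As a rough illustration of the mutual-representation idea behind the MIT module described above, the following PyTorch sketch cross-attends each modality's token features over the other's. It is an assumed, simplified stand-in rather than the authors' implementation; the cross-attention arrangement and residual additions are hypothetical choices.

```python
# Minimal sketch (not the CHITNet code) of a mutual information transfer step:
# each modality's features are re-represented from the other modality via
# cross-attention, nudging complementary content toward a shared ("harmonious") space.
import torch
import torch.nn as nn

class MutualTransfer(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.ir_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_from_ir = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ir_tokens: torch.Tensor, vis_tokens: torch.Tensor):
        # Infrared tokens query the visible features, and vice versa (hypothetical arrangement).
        ir_h, _ = self.ir_from_vis(ir_tokens, vis_tokens, vis_tokens)
        vis_h, _ = self.vis_from_ir(vis_tokens, ir_tokens, ir_tokens)
        return ir_tokens + ir_h, vis_tokens + vis_h
```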
ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer
Deep learning (DL) has advanced the field of dense prediction, while
gradually dissolving the inherent barriers between different tasks. However,
most existing works focus on designing architectures and constructing visual
cues only for the specific task, which ignores the potential uniformity
introduced by the DL paradigm. In this paper, we attempt to construct a novel
\underline{ComP}lementary \underline{tr}ansformer, \textbf{ComPtr}, for diverse
bi-source dense prediction tasks. Specifically, unlike existing methods that
over-specialize in a single task or a subset of tasks, ComPtr starts from the
more general concept of bi-source dense prediction. Based on the basic
dependence on information complementarity, we propose consistency enhancement
and difference awareness components with which ComPtr can excavate and collect
important visual semantic cues from different image sources for diverse tasks,
respectively. ComPtr treats different inputs equally and builds an efficient
dense interaction model in the form of sequence-to-sequence on top of the
transformer. This task-generic design provides a smooth foundation for
constructing the unified model that can simultaneously deal with various
bi-source information. In extensive experiments across several representative
vision tasks, i.e. remote sensing change detection, RGB-T crowd counting,
RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed
method consistently obtains favorable performance. The code will be available
at \url{https://github.com/lartpang/ComPtr}
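The sketch below shows, under stated assumptions, one way the consistency-enhancement and difference-awareness ideas could be realized on token features. It is not the released ComPtr code; the element-wise product and absolute-difference operators are placeholders for whatever interaction the paper actually uses.

```python
# Hypothetical sketch of the two complementary paths named in the abstract:
# a consistency path that amplifies cues shared by both sources and a
# difference path that highlights source-specific cues.
import torch
import torch.nn as nn

class ConsistencyDifference(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.consistency = nn.Linear(dim, dim)
        self.difference = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        shared = self.consistency(a * b)               # cues agreed on by both sources
        distinct = self.difference(torch.abs(a - b))   # cues present in only one source
        return self.out(torch.cat([shared, distinct], dim=-1))
```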
Does Thermal Really Always Matter for RGB-T Salient Object Detection?
In recent years, RGB-T salient object detection (SOD) has attracted
continuous attention, since introducing thermal images makes it possible to
identify salient objects in environments such as low light. However, most
existing RGB-T SOD models focus on how to perform cross-modality feature
fusion, ignoring whether the thermal image really always matters in the SOD
task.
Starting from the definition and nature of this task, this paper rethinks the
connotation of thermal modality, and proposes a network named TNet to solve the
RGB-T SOD task. In this paper, we introduce a global illumination estimation
module to predict the global illuminance score of the image, so as to regulate
the role played by the two modalities. In addition, considering the role of
thermal modality, we set up different cross-modality interaction mechanisms in
the encoding phase and the decoding phase. On the one hand, we introduce a
semantic constraint provider to enrich the semantics of thermal images in the
encoding phase, which makes thermal modality more suitable for the SOD task. On
the other hand, we introduce a two-stage localization and complementation
module in the decoding phase to transfer object localization cue and internal
integrity cue in thermal features to the RGB modality. Extensive experiments on
three datasets show that the proposed TNet achieves competitive performance
compared with 20 state-of-the-art methods.
Comment: Accepted by IEEE Trans. Multimedia 2022, 13 pages, 9 figures
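A minimal sketch of the illumination-gating idea follows, assuming a small scoring head over the RGB image whose output weights the two modalities; the layer sizes and gating formula are hypothetical and are not taken from the published TNet.

```python
# Illustrative sketch (not the published TNet): a global illuminance score predicted
# from the RGB image regulates how much each modality contributes to the fused features.
import torch
import torch.nn as nn

class IlluminationGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.score_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(3, 16, 1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, rgb_image, f_rgb, f_thm):
        s = self.score_head(rgb_image)        # close to 1 for well-lit scenes, near 0 for dark ones
        fused = s * f_rgb + (1 - s) * f_thm   # thermal dominates when lighting is poor
        return self.fuse(fused)
```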
VST++: Efficient and Stronger Visual Saliency Transformer
While previous CNN-based models have exhibited promising results for salient
object detection (SOD), their ability to explore global long-range dependencies
is restricted. Our previous work, the Visual Saliency Transformer (VST),
addressed this constraint from a transformer-based sequence-to-sequence
perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task
transformer decoder that concurrently predicts saliency and boundary outcomes
in a pure transformer architecture. Moreover, we introduced a novel token
upsampling method called reverse T2T for predicting a high-resolution saliency
map effortlessly within transformer-based structures. Building upon the VST
model, we further propose an efficient and stronger VST version in this work,
i.e. VST++. To mitigate the computational costs of the VST model, we propose a
Select-Integrate Attention (SIA) module, which partitions the foreground into
fine-grained segments and aggregates background information into a single
coarse-grained token. To incorporate 3D depth information at low cost, we
design a novel depth position encoding method tailored for depth maps.
Furthermore, we introduce a token-supervised prediction loss to provide
straightforward guidance for the task-related tokens. We evaluate our VST++
model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD
benchmark datasets. Experimental results show that our model outperforms
existing methods while achieving a 25% reduction in computational costs without
significant performance compromise. The demonstrated strong generalization
ability, enhanced performance, and heightened efficiency of our VST++ model
highlight its potential.
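To illustrate the Select-Integrate Attention idea described above, here is a hedged PyTorch sketch in which foreground tokens are kept individually while background tokens are averaged into a single coarse token before attention; the token-selection helper and block structure are assumptions, not the official VST++ code.

```python
# Sketch under stated assumptions (not the official VST++): foreground tokens stay
# fine-grained while all background tokens collapse into one coarse token, shrinking
# the key/value set and hence the attention cost.
import torch
import torch.nn as nn

def select_integrate_tokens(tokens: torch.Tensor, fg_mask: torch.Tensor) -> torch.Tensor:
    """tokens: [N, D]; fg_mask: [N] boolean foreground prediction for one sample."""
    fg = tokens[fg_mask]                       # fine-grained foreground tokens
    bg = tokens[~fg_mask]
    bg_token = bg.mean(dim=0, keepdim=True) if bg.numel() else tokens.mean(dim=0, keepdim=True)
    return torch.cat([fg, bg_token], dim=0)    # reduced key/value set

class SIABlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, fg_mask: torch.Tensor) -> torch.Tensor:
        kv = select_integrate_tokens(tokens, fg_mask).unsqueeze(0)
        q = tokens.unsqueeze(0)
        out, _ = self.attn(q, kv, kv)          # all queries attend to the reduced token set
        return tokens + out.squeeze(0)
```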
RGB-D Salient Object Detection: A Survey
Salient object detection (SOD), which simulates the human visual perception
system to locate the most attractive object(s) in a scene, has been widely
applied to various computer vision tasks. Now, with the advent of depth
sensors, depth maps with rich spatial information that can be beneficial in
boosting SOD performance can easily be captured. Although various RGB-D
based SOD models with promising performance have been proposed over the past
several years, an in-depth understanding of these models and challenges in this
topic remains lacking. In this paper, we provide a comprehensive survey of
RGB-D based SOD models from various perspectives, and review related benchmark
datasets in detail. Further, considering that the light field can also provide
depth maps, we review SOD models and popular benchmark datasets from this
domain as well. Moreover, to investigate the SOD ability of existing models, we
carry out a comprehensive evaluation, as well as attribute-based evaluation of
several representative RGB-D based SOD models. Finally, we discuss several
challenges and open directions of RGB-D based SOD for future research. All
collected models, benchmark datasets, source code links, datasets constructed
for attribute-based evaluation, and codes for evaluation will be made publicly
available at https://github.com/taozh2017/RGBDSODsurvey
Comment: 24 pages, 12 figures. Has been accepted by Computational Visual Media
RGB-D And Thermal Sensor Fusion: A Systematic Literature Review
In the last decade, the computer vision field has seen significant progress
in multimodal data fusion and learning, where multiple sensors, including
depth, infrared, and visual, are used to capture the environment across diverse
spectral ranges. Despite these advancements, there has been no systematic and
comprehensive evaluation of fusing RGB-D and thermal modalities to date. While
autonomous driving using LiDAR, radar, RGB, and other sensors has garnered
substantial research interest, along with the fusion of RGB and depth
modalities, the integration of thermal cameras and, specifically, the fusion of
RGB-D and thermal data, has received comparatively less attention. This might
be partly due to the limited number of publicly available datasets for such
applications. This paper provides a comprehensive review of both
state-of-the-art and traditional methods used in fusing RGB-D and thermal
camera data for various applications, such as site inspection, human tracking,
fault detection, and others. The reviewed literature has been categorised into
technical areas, such as 3D reconstruction, segmentation, object detection,
available datasets, and other related topics. Following a brief introduction
and an overview of the methodology, the study delves into calibration and
registration techniques, then examines thermal visualisation and 3D
reconstruction, before discussing the application of classic feature-based
techniques as well as modern deep learning approaches. The paper concludes with
a discourse on current limitations and potential future research directions. It
is hoped that this survey will serve as a valuable reference for researchers
looking to familiarise themselves with the latest advancements and contribute
to the RGB-DT research field.
Comment: 33 pages, 20 figures