CIR-Net: Cross-modality Interaction and Refinement for RGB-D Salient Object Detection
Focusing on how to effectively capture and utilize
cross-modality information in the RGB-D salient object detection (SOD) task, we
present a convolutional neural network (CNN) model, named CIR-Net, based on
novel cross-modality interaction and refinement. For the cross-modality
interaction, 1) a progressive attention guided integration unit is proposed to
sufficiently integrate RGB-D feature representations in the encoder stage, and
2) a convergence aggregation structure is proposed, which flows the RGB and
depth decoding features into the corresponding RGB-D decoding streams via an
importance gated fusion unit in the decoder stage. For the cross-modality
refinement, we insert a refinement middleware structure between the encoder and
the decoder, in which the RGB, depth, and RGB-D encoder features are further
refined by successively using a self-modality attention refinement unit and a
cross-modality weighting refinement unit. Finally, with the gradually refined
features, we predict the saliency map in the decoder stage. Extensive
experiments on six popular RGB-D SOD benchmarks demonstrate that our network
outperforms the state-of-the-art saliency detectors both qualitatively and
quantitatively.
Comment: Accepted by IEEE Transactions on Image Processing 2022, 16 pages, 11 figures
ASF-Net: Robust Video Deraining via Temporal Alignment and Online Adaptive Learning
In recent times, learning-based methods for video deraining have demonstrated
commendable results. However, there are two critical challenges that these
methods are yet to address: exploiting temporal correlations among adjacent
frames and ensuring adaptability to unknown real-world scenarios. To overcome
these challenges, we explore video deraining from paradigm design
to learning strategy construction. Specifically, we propose a new computational
paradigm, Alignment-Shift-Fusion Network (ASF-Net), which incorporates a
temporal shift module. This module is novel to this field and provides deeper
exploration of temporal information by facilitating the exchange of
channel-level information within the feature space. To fully exploit the
model's representation capability, we further construct a LArge-scale RAiny
video dataset (LARA), which also supports the development of this community. On
the basis of the newly constructed dataset, we explore the parameter learning
process by developing an innovative re-degraded learning strategy. This
strategy bridges the gap between synthetic and real-world scenes, resulting in
stronger scene adaptability. Our proposed approach exhibits superior
performance on three benchmarks and compelling visual quality in real-world
scenarios, underscoring its efficacy. The code is available at
https://github.com/vis-opt-group/ASF-Net
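The channel-level exchange between adjacent frames that a temporal shift performs can be illustrated with a minimal sketch. This is the generic channel-shift idea only, not ASF-Net's actual module: the function name `temporal_shift`, the `fold_div` parameter, the zero padding at sequence boundaries, and the use of one scalar per channel are all assumptions made for illustration.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of the channels by one step along the time axis.

    frames   -- list of T frames, each a list of C per-channel values
                (a scalar per channel keeps the sketch small)
    fold_div -- hypothetical parameter: C // fold_div channels are taken
                from the previous frame and as many from the next frame
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Start from zeros so positions with no neighbour stay zero-padded.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1      # first group: read from the previous frame
            elif c < 2 * fold:
                src = t + 1      # second group: read from the next frame
            else:
                src = t          # remaining channels stay in place
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Three frames with four channels each; one channel moves in each direction.
example = temporal_shift([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
```

After the shift, every frame holds some channels from its neighbours, so a later per-frame layer sees temporal context without any extra parameters.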
CoCoNet: Coupled Contrastive Learning Network with Multi-level Feature Ensemble for Multi-modality Image Fusion
Infrared and visible image fusion aims to provide an informative image by
combining complementary information from different sensors. Existing
learning-based fusion approaches attempt to construct various loss functions to
preserve complementary features from both modalities, while neglecting to
discover the inter-relationship between the two modalities, leading to
redundant or even invalid information in the fusion results. To alleviate these
issues, we propose a coupled contrastive learning network, dubbed CoCoNet, to
realize infrared and visible image fusion in an end-to-end manner. Concretely,
to simultaneously retain typical features from both modalities and remove
unwanted information emerging in the fused result, we develop a coupled
contrastive constraint in our loss function. In a fused image, the foreground
target/background detail part is pulled close to the infrared/visible source
and pushed far away from the visible/infrared source in the representation
space. We further exploit image characteristics to provide data-sensitive
weights, which allow our loss function to build a more reliable relationship
with source images. Furthermore, to learn rich hierarchical feature
representation and comprehensively transfer features in the fusion process, a
multi-level attention module is established. In addition, we apply the
proposed CoCoNet to medical image fusion of different types, e.g., magnetic
resonance image and positron emission tomography image, magnetic resonance
image and single photon emission computed tomography image. Extensive
experiments demonstrate that our method achieves state-of-the-art (SOTA)
performance under both subjective and objective evaluation, especially in
preserving prominent targets and recovering vital textural details.
Comment: 25 pages, 16 figures
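The pull/push behaviour of the coupled contrastive constraint can be sketched as a pair of distance ratios. This is a simplified illustration under stated assumptions, not the loss defined in the paper: the ratio form, the mean L1 distance, the `eps` constant, and all function names are hypothetical, features are plain lists of floats, and the data-sensitive weights mentioned in the abstract are omitted.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def contrastive_ratio(anchor, positive, negative, eps=1e-7):
    """Small when the anchor is near the positive and far from the negative."""
    return l1_distance(anchor, positive) / (l1_distance(anchor, negative) + eps)


def coupled_contrastive_loss(fused_fg, fused_bg, ir_fg, ir_bg, vis_fg, vis_bg):
    """Two coupled terms, one per image region (assumed form).

    Foreground target: pulled toward infrared, pushed away from visible.
    Background detail: pulled toward visible, pushed away from infrared.
    """
    target_term = contrastive_ratio(fused_fg, positive=ir_fg, negative=vis_fg)
    detail_term = contrastive_ratio(fused_bg, positive=vis_bg, negative=ir_bg)
    return target_term + detail_term
```

A fused image whose foreground matches the infrared source and whose background matches the visible source drives both terms to zero, while the opposite assignment makes the loss large.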
Deep visible and thermal image fusion for enhanced pedestrian visibility
Reliable vision in challenging illumination conditions is one of the crucial requirements of future autonomous automotive systems. In the last decade, thermal cameras have become more easily accessible to a larger number of researchers. This has resulted in numerous studies that confirmed the benefits of thermal cameras in limited visibility conditions. In this paper, we propose a learning-based method for visible and thermal image fusion that focuses on generating fused images with high visual similarity to regular truecolor (red-green-blue or RGB) images, while introducing new informative details in pedestrian regions. The goal is to create natural, intuitive images that would be more informative than a regular RGB camera to a human driver in challenging visibility conditions. The main novelty of this paper is the idea to rely on two types of objective functions for optimization: a similarity metric between the RGB input and the fused output to achieve natural image appearance; and an auxiliary pedestrian detection error to help define relevant features of the human appearance and blend them into the output. We train a convolutional neural network using image samples from variable conditions (day and night) so that the network learns the appearance of humans in the different modalities and creates more robust results applicable in realistic situations. Our experiments show that the visibility of pedestrians is noticeably improved, especially in dark regions and at night. Compared to existing methods, we can better learn context and define fusion rules that focus on the pedestrian appearance, while that is not guaranteed with methods that focus on low-level image quality metrics.
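The combination of the two objective types can be sketched as a weighted sum. The abstract does not name the similarity metric, the detection error, or their weighting, so everything concrete here is a stand-in chosen for illustration: mean squared error for similarity, binary cross-entropy on a single pedestrian score for detection, and the hypothetical `aux_weight` parameter.

```python
import math


def similarity_loss(rgb, fused):
    """Stand-in similarity term: mean squared error between pixel lists."""
    return sum((r - f) ** 2 for r, f in zip(rgb, fused)) / len(rgb)


def detection_loss(predicted_prob, is_pedestrian, eps=1e-7):
    """Stand-in detection term: binary cross-entropy on one detector score."""
    p = min(max(predicted_prob, eps), 1.0 - eps)  # keep log() finite
    return -math.log(p) if is_pedestrian else -math.log(1.0 - p)


def fusion_objective(rgb, fused, predicted_prob, is_pedestrian, aux_weight=0.5):
    """Natural-appearance term plus a weighted auxiliary detection term."""
    return (similarity_loss(rgb, fused)
            + aux_weight * detection_loss(predicted_prob, is_pedestrian))
```

With this form, a fused image identical to the RGB input still incurs a penalty when the detector is unsure about a pedestrian, which is what pushes extra pedestrian detail into the output.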