365 research outputs found
Dynamic Knowledge Distillation with A Single Stream Structure for RGB-D Salient Object Detection
RGB-D salient object detection (SOD) demonstrates its superiority in detecting
objects in complex environments due to the additional depth information introduced in
the data. Inevitably, an independent stream is introduced to extract features
from depth images, leading to extra computation and parameters. This
approach, which sacrifices model size to improve detection accuracy,
may impede the practical application of SOD. To tackle this dilemma,
we propose a dynamic distillation method along with a lightweight framework,
which significantly reduces the parameters. This method considers the factors
of both teacher and student performance within the training stage and
dynamically assigns the distillation weight instead of applying a fixed weight
on the student model. Extensive experiments are conducted on five public
datasets to demonstrate that our method can achieve competitive performance
compared to 10 prior methods through a 78.2 MB lightweight structure.
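
The abstract above does not give the weighting rule, so the following is only a minimal sketch of what a dynamically assigned distillation weight could look like. The ratio-based rule and the names `dynamic_distill_weight` and `total_loss` are assumptions made for illustration, not the authors' formulation.

```python
def dynamic_distill_weight(teacher_loss: float, student_loss: float,
                           eps: float = 1e-8) -> float:
    """Hypothetical rule (not from the paper): the weight grows when the
    teacher's task loss is low relative to the student's task loss."""
    return student_loss / (student_loss + teacher_loss + eps)


def total_loss(student_task_loss: float, teacher_task_loss: float,
               distill_loss: float) -> float:
    """Blend the student's task loss and the distillation loss with a
    weight recomputed at every training step instead of a fixed constant."""
    w = dynamic_distill_weight(teacher_task_loss, student_task_loss)
    return (1.0 - w) * student_task_loss + w * distill_loss


if __name__ == "__main__":
    # Teacher clearly better than student: lean on the distillation term.
    print(dynamic_distill_weight(teacher_loss=0.1, student_loss=0.9))
    # Teacher and student comparable: the two terms are weighted about equally.
    print(total_loss(student_task_loss=0.5, teacher_task_loss=0.5,
                     distill_loss=1.0))
```

With a fixed weight, a poorly performing teacher would still pull the student toward its predictions; a rule of this kind reduces that pull whenever the teacher's own loss is high.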
Test-Time Adaptation for Nighttime Color-Thermal Semantic Segmentation
The ability to understand scenes in adverse visual conditions, e.g.,
nighttime, has sparked active research for RGB-Thermal (RGB-T) semantic
segmentation. However, it is essentially hampered by two critical problems: 1)
the day-night gap of RGB images is larger than that of thermal images, and 2)
the class-wise performance of RGB images at night is not consistently higher or
lower than that of thermal images. We propose the first test-time adaptation
(TTA) framework, dubbed Night-TTA, to address the problems for nighttime RGB-T
semantic segmentation without access to the source (daytime) data during
adaptation. Our method has three key technical components. Firstly, as one
modality (e.g., RGB) suffers from a larger domain gap than that of the other
(e.g., thermal), Imaging Heterogeneity Refinement (IHR) employs an interaction
branch on the basis of RGB and thermal branches to prevent cross-modal
discrepancy and performance degradation. Then, Class Aware Refinement (CAR) is
introduced to obtain reliable ensemble logits based on pixel-level distribution
aggregation of the three branches. In addition, we also design a specific
learning scheme for our TTA framework, which enables the ensemble logits and
three student logits to collaboratively learn to improve the quality of
predictions during the testing phase of Night-TTA. Extensive experiments
show that our method achieves state-of-the-art (SoTA) performance with a 13.07%
boost in mIoU.
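
For reference, mIoU (mean intersection over union) averages the per-class ratio of correctly labelled pixels to the union of predicted and ground-truth pixels for that class. The sketch below computes it on flat label lists; the function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration, not the evaluation code used in the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    # Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
    print(mean_iou(pred, target, num_classes=2))
```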
Learning Scene Flow With Skeleton Guidance For 3D Action Recognition
Among the existing modalities for 3D action recognition, 3D flow has been
poorly examined, although it conveys rich motion cues for human
actions. Presumably, its susceptibility to noise renders it intractable, thus
challenging the learning process within deep models. This work demonstrates the
use of 3D flow sequences in a deep spatiotemporal model and further proposes an
incremental two-level spatial attention mechanism, guided by the skeleton domain,
for emphasizing motion features close to the body joint areas and according to
their informativeness. Towards this end, an extended deep skeleton model is
also introduced to learn the most discriminant action motion dynamics, so as to
estimate an informativeness score for each joint. Subsequently, a late fusion
scheme is adopted between the two models for learning the high level
cross-modal correlations. Experimental results on NTU RGB+D, currently the
largest and most challenging dataset, demonstrate the effectiveness of the
proposed approach, achieving state-of-the-art results.
Comment: 18 pages, 3 figures, 3 tables, conferenc
LiCamGait: Gait Recognition in the Wild by Using LiDAR and Camera Multi-modal Visual Sensors
LiDAR can capture accurate depth information in large-scale scenarios regardless
of lighting conditions, and the captured point cloud contains
gait-related 3D geometric properties and dynamic motion characteristics. We
make the first attempt to leverage LiDAR to remedy the limitations of
view-dependent and light-sensitive cameras for more robust and accurate gait
recognition. In this paper, we propose a LiDAR-camera-based gait recognition
method with an effective multi-modal feature fusion strategy, which fully
exploits the advantages of both point clouds and images. In particular, we propose
a new in-the-wild gait dataset, LiCamGait, involving multi-modal visual data
and diverse 2D/3D representations. Our method achieves state-of-the-art
performance on the new dataset. Code and dataset will be released when this
paper is published.
Visible-Infrared Person Re-Identification Using Privileged Intermediate Information
Visible-infrared person re-identification (ReID) aims to recognize the same
person of interest across a network of RGB and IR cameras. Some deep learning
(DL) models have directly incorporated both modalities to discriminate persons
in a joint representation space. However, this cross-modal ReID problem remains
challenging due to the large domain shift in data distributions between RGB and
IR modalities. This paper introduces a novel approach for creating an
intermediate virtual domain that acts as a bridge between the two main domains
(i.e., RGB and IR modalities) during training. This intermediate domain is
considered as privileged information (PI) that is unavailable at test time, and
allows formulating this cross-modal matching task as a problem in learning
under privileged information (LUPI). We devised a new method to generate images
between visible and infrared domains that provide additional information to
train a deep ReID model through an intermediate domain adaptation. In
particular, by employing color-free and multi-step triplet loss objectives
during training, our method provides common feature representation spaces that
are robust to large visible-infrared domain shifts. Experimental results on
challenging visible-infrared ReID datasets indicate that our proposed approach
consistently improves matching accuracy, without any computational overhead at
test time. The code is available at:
https://github.com/alehdaghi/Cross-Modal-Re-ID-via-LUPI
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.