Class Relevance Learning For Out-of-distribution Detection
Image classification plays a pivotal role across diverse applications, yet
challenges persist when models are deployed in real-world scenarios. Notably,
these models falter on unfamiliar classes that were not seen during classifier
training; detecting such inputs, a task commonly known as out-of-distribution
(OOD) detection, is a formidable hurdle for safe and effective real-world
deployment. While existing techniques, such as max logits, leverage the logits
for OOD identification, they often disregard the intricate interclass
relationships that underlie effective detection. This paper presents an
innovative class relevance learning method tailored for OOD detection. Our
method establishes a comprehensive class relevance learning framework that
strategically harnesses interclass relationships within the OOD pipeline and
significantly augments OOD detection capabilities. Extensive experiments on
diverse image classification datasets, encompassing both Near-OOD and Far-OOD
benchmarks, demonstrate the superiority of our method over state-of-the-art
alternatives for OOD detection.
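As a point of reference, the max-logit baseline mentioned above can be written
in a few lines: each input is scored by its largest logit and flagged as OOD
when the score falls below a threshold. The sketch below is a minimal PyTorch
illustration of that baseline only (the classifier head, feature dimensions,
and threshold are placeholders), not the class relevance learning method
proposed in the paper.

```python
# Minimal sketch of the max-logit OOD baseline: score each sample by its
# largest logit and flag low-scoring samples as out-of-distribution.
# The classifier head, feature dimensions, and threshold are placeholders.
import torch
import torch.nn as nn

classifier = nn.Linear(512, 100)      # stand-in for a trained classifier head
features = torch.randn(8, 512)        # stand-in for backbone features of 8 images

logits = classifier(features)         # (batch, num_classes)
scores = logits.max(dim=1).values     # max-logit confidence per sample
threshold = 5.0                       # illustrative threshold, tuned on held-out data
is_ood = scores < threshold           # True = flagged as out-of-distribution
print(is_ood)
```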
Dense Depth Distillation with Out-of-Distribution Simulated Images
We study data-free knowledge distillation (KD) for monocular depth estimation
(MDE), which learns a lightweight model for real-world depth perception tasks
by distilling a trained teacher model without access to training data in the
target domain. Owing to the essential difference between image classification
and dense regression, previous data-free KD methods are not applicable to MDE.
To strengthen the applicability of KD in real-world tasks, in this paper we
propose to apply it with out-of-distribution simulated images. The major
challenges to be resolved are i) the lack of prior information about the scene
configurations of real-world training data and ii) the domain shift between
simulated and real-world images. To cope with these difficulties, we propose a
framework tailored for depth distillation. The framework generates new training
samples that embrace a multitude of possible object arrangements in the target
domain and uses a transformation network to efficiently adapt them to the
feature statistics preserved in the teacher model. Through extensive
experiments on various depth estimation models and two different datasets, we
show that our method outperforms the baseline KD by a large margin and even
achieves slightly better performance with as few as 1/6 of the training images,
demonstrating a clear superiority.
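For readers unfamiliar with distillation for dense prediction, the sketch below
shows a generic pixel-wise teacher-student loss in PyTorch, where the teacher's
depth predictions on (possibly simulated) images act as soft labels for a
lightweight student. The models, shapes, and L1 objective are assumptions for
illustration; the paper's tailored framework additionally generates new samples
and adapts them to the teacher's feature statistics.

```python
# A generic pixel-wise distillation loss for dense depth regression:
# the trained teacher's predictions supervise the lightweight student.
# Both models and the input images are stand-ins for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # stand-in for a trained depth teacher
student = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # lightweight student to be trained

images = torch.randn(4, 3, 64, 64)                    # stand-in for simulated training images

with torch.no_grad():
    teacher_depth = teacher(images)                   # teacher predictions act as soft labels
student_depth = student(images)

loss = F.l1_loss(student_depth, teacher_depth)        # pixel-wise distillation loss
loss.backward()
```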
Lifelong-MonoDepth: Lifelong Learning for Multi-Domain Monocular Metric Depth Estimation
With the rapid advancements in autonomous driving and robot navigation, there
is a growing demand for lifelong learning models capable of estimating metric
(absolute) depth. Lifelong learning approaches potentially offer significant
cost savings in terms of model training, data storage, and collection. However,
the quality of RGB images and depth maps is sensor-dependent, and depth maps in
the real world exhibit domain-specific characteristics, leading to variations
in depth ranges. These challenges limit existing methods to lifelong learning
scenarios with small domain gaps and relative depth map estimation. To
facilitate lifelong metric depth learning, we identify three crucial technical
challenges that require attention: i) developing a model capable of addressing
the depth scale variation through scale-aware depth learning, ii) devising an
effective learning strategy to handle significant domain gaps, and iii)
creating an automated solution for domain-aware depth inference in practical
applications. Based on the aforementioned considerations, in this paper, we
present i) a lightweight multi-head framework that effectively tackles the
depth scale imbalance, ii) an uncertainty-aware lifelong learning solution that
adeptly handles significant domain gaps, and iii) an online domain-specific
predictor selection method for real-time inference. Through extensive numerical
studies, we show that the proposed method can achieve good efficiency,
stability, and plasticity, leading the benchmarks by 8% to 15%.
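To make the multi-head idea concrete, the sketch below shows a shared encoder
with one depth-plus-uncertainty head per domain, and an inference path that
picks the head reporting the lowest mean uncertainty when the domain is
unknown. All layer choices, shapes, and the selection heuristic are assumptions
for illustration rather than the paper's exact design.

```python
# Illustrative multi-head metric depth predictor: a shared encoder, one head
# per domain, and online head selection by the lowest predicted uncertainty.
import torch
import torch.nn as nn

class MultiHeadDepth(nn.Module):
    def __init__(self, num_domains=2):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 3, padding=1)   # shared feature extractor
        # each head predicts a depth map and a log-variance (uncertainty) map
        self.heads = nn.ModuleList(
            [nn.Conv2d(16, 2, 3, padding=1) for _ in range(num_domains)]
        )

    def forward(self, x, domain=None):
        feats = torch.relu(self.encoder(x))
        outs = [head(feats) for head in self.heads]
        if domain is None:
            # online selection: pick the head with the lowest mean uncertainty
            domain = min(range(len(outs)), key=lambda i: outs[i][:, 1].mean().item())
        depth, log_var = outs[domain][:, :1], outs[domain][:, 1:]
        return depth, log_var, domain

model = MultiHeadDepth()
depth, log_var, picked = model(torch.randn(1, 3, 64, 64))
print(depth.shape, picked)
```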
Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training
The width of a neural network matters since increasing the width will
necessarily increase the model capacity. However, the performance of a network
does not improve linearly with the width and soon gets saturated. In this case,
we argue that increasing the number of networks (ensemble) can achieve better
accuracy-efficiency trade-offs than purely increasing the width. To prove this,
we divide one large network into several small ones with respect to its
parameters and regularization components. Each of these small networks has a fraction of
the original one's parameters. We then train these small networks together and
make them see various views of the same data to increase their diversity.
During this co-training process, networks can also learn from each other. As a
result, small networks can achieve better ensemble performance than the large
one with few or no extra parameters or FLOPs. The small networks can also
achieve faster inference than the large one by running concurrently on
different devices. We validate our argument with 8 different neural
architectures on common benchmarks through extensive experiments. The code is
available at \url{https://github.com/mzhaoshuai/Divide-and-Co-training}.
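The co-training step described above boils down to a joint loss in which each
small network is trained on the labels for its own view of the data and also
matches the other's softened predictions. The sketch below uses two linear
stand-ins, noisy views, and an illustrative temperature and loss weighting; it
mirrors the general idea rather than the repository's exact training code.

```python
# Co-training sketch: two small networks see different views of the same batch,
# learn from the labels, and learn from each other's softened predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F

net_a = nn.Linear(32, 10)               # stand-ins for two small networks that
net_b = nn.Linear(32, 10)               # together replace one wide network

x = torch.randn(16, 32)
labels = torch.randint(0, 10, (16,))
view_a = x + 0.1 * torch.randn_like(x)  # two different views of the same data
view_b = x + 0.1 * torch.randn_like(x)

logits_a, logits_b = net_a(view_a), net_b(view_b)
T = 2.0                                 # softening temperature (illustrative)

ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
kl_ab = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                 F.softmax(logits_b.detach() / T, dim=1), reduction="batchmean")
kl_ba = F.kl_div(F.log_softmax(logits_b / T, dim=1),
                 F.softmax(logits_a.detach() / T, dim=1), reduction="batchmean")
loss = ce + (T * T) * (kl_ab + kl_ba)   # supervised + mutual-learning terms
loss.backward()
```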
Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks
Recently, RGB-Thermal based perception has shown significant advances.
Thermal information provides useful clues when visual cameras suffer from poor
lighting conditions, such as low light and fog. However, how to effectively
fuse RGB images and thermal data remains an open challenge. Previous works
involve naive fusion strategies such as merging them at the input,
concatenating multi-modality features inside models, or applying attention to
each data modality. These fusion strategies are straightforward yet
insufficient. In this paper, we propose a novel fusion method named Explicit
Attention-Enhanced Fusion (EAEF) that fully takes advantage of each type of
data. Specifically, we consider the cases in which i) both RGB and thermal
data, ii) only one of the two modalities, and iii) neither modality generates
discriminative features. EAEF uses one branch to enhance feature extraction for
i) and iii) and the other branch to remedy insufficient representations for
ii). The outputs of the two branches are fused to form complementary features.
As a result, the proposed fusion method outperforms the state of the art by
1.6\% in mIoU on semantic segmentation, 3.1\% in MAE on salient object
detection, 2.3\% in mAP on object detection, and 8.1\% in MAE on crowd
counting. The code is available at https://github.com/FreeformRobotics/EAEFNet.
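As a rough illustration of attention-based RGB-thermal fusion, the sketch below
uses a simple channel-attention gate that decides, per channel, how much to
trust each modality before combining the features. The module, layer sizes, and
gating scheme are assumptions for illustration and do not reproduce the exact
EAEF branches; the released code at the link above contains the actual
implementation.

```python
# Simplified channel-attention fusion of RGB and thermal feature maps:
# a learned per-channel gate weighs the two modalities before summation.
import torch
import torch.nn as nn

class SimpleRGBTFusion(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, thermal_feat):
        # per-channel weights from the pooled concatenation of both modalities
        w = self.gate(self.pool(torch.cat([rgb_feat, thermal_feat], dim=1)))
        # complementary weighting: lean on whichever modality is more informative
        return w * rgb_feat + (1 - w) * thermal_feat

fusion = SimpleRGBTFusion()
rgb = torch.randn(2, 16, 32, 32)
thermal = torch.randn(2, 16, 32, 32)
print(fusion(rgb, thermal).shape)
```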