2,231 research outputs found
Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples
In the field of intelligent multimedia analysis, ultra-fine-grained visual
categorization (Ultra-FGVC) plays a vital role in distinguishing intricate
subcategories within broader categories. However, this task is inherently
challenging due to the complex granularity of category subdivisions and the
limited availability of data for each category. To address these challenges,
this work proposes CSDNet, a pioneering framework that effectively explores
contrastive learning and self-distillation to learn discriminative
representations specifically designed for Ultra-FGVC tasks. CSDNet comprises
three main modules: Subcategory-Specific Discrepancy Parsing (SSDP), Dynamic
Discrepancy Learning (DDL), and Subcategory-Specific Discrepancy Transfer
(SSDT), which collectively enhance the generalization of deep models across
instance, feature, and logit prediction levels. To increase the diversity of
training samples, the SSDP module introduces augmented samples from different
viewpoints to spotlight subcategory-specific discrepancies. Simultaneously, the
proposed DDL module stores historical intermediate features by a dynamic memory
queue, which optimizes the feature learning space through iterative contrastive
learning. Furthermore, the SSDT module is developed by a novel
self-distillation paradigm at the logit prediction level of raw and augmented
samples, which effectively distills more subcategory-specific discrepancies
knowledge from the inherent structure of limited training data without
requiring additional annotations. Experimental results demonstrate that CSDNet
outperforms current state-of-the-art Ultra-FGVC methods, emphasizing its
powerful efficacy and adaptability in addressing Ultra-FGVC tasks.Comment: The first two authors contributed equally to this wor
Deep Learning-Based Human Pose Estimation: A Survey
Human pose estimation aims to locate the human body parts and build human
body representation (e.g., body skeleton) from input data such as images and
videos. It has drawn increasing attention during the past decade and has been
utilized in a wide range of applications including human-computer interaction,
motion analysis, augmented reality, and virtual reality. Although the recently
developed deep learning-based solutions have achieved high performance in human
pose estimation, there still remain challenges due to insufficient training
data, depth ambiguities, and occlusion. The goal of this survey paper is to
provide a comprehensive review of recent deep learning-based solutions for both
2D and 3D pose estimation via a systematic analysis and comparison of these
solutions based on their input data and inference procedures. More than 240
research papers since 2014 are covered in this survey. Furthermore, 2D and 3D
human pose estimation datasets and evaluation metrics are included.
Quantitative performance comparisons of the reviewed methods on popular
datasets are summarized and discussed. Finally, the challenges involved,
applications, and future research directions are concluded. We also provide a
regularly updated project page: \url{https://github.com/zczcwh/DL-HPE
- …