Class-Incremental Grouping Network for Continual Audio-Visual Learning
Continual learning is a challenging problem in which models need to be
trained on non-stationary data across sequential tasks for class-incremental
learning. While previous methods have focused on using either regularization or
rehearsal-based frameworks to alleviate catastrophic forgetting in image
classification, they are limited to a single modality and cannot learn compact
class-aware cross-modal representations for continual audio-visual learning. To
address this gap, we propose a novel class-incremental grouping network (CIGN)
that can learn category-wise semantic features to achieve continual
audio-visual learning. Our CIGN leverages learnable audio-visual class tokens
and audio-visual grouping to continually aggregate class-aware features.
Additionally, it utilizes class tokens distillation and continual grouping to
prevent forgetting parameters learned from previous tasks, thereby improving
the model's ability to capture discriminative audio-visual categories. We
conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and
VGG-Sound Sources benchmarks. Our experimental results demonstrate that the
CIGN achieves state-of-the-art audio-visual class-incremental learning
performance. Code is available at https://github.com/stoneMo/CIGN.
Comment: ICCV 2023. arXiv admin note: text overlap with arXiv:2303.1705
LAVSS: Location-Guided Audio-Visual Spatial Audio Separation
Existing machine learning research has achieved promising results in monaural
audio-visual separation (MAVS). However, most MAVS methods purely consider what
the sound source is, not where it is located. This can be a problem in VR/AR
scenarios, where listeners need to be able to distinguish between similar audio
sources located in different directions. To address this limitation, we have
generalized MAVS to spatial audio separation and proposed LAVSS: a
location-guided audio-visual spatial audio separator. LAVSS is inspired by the
correlation between spatial audio and visual location. We introduce the phase
difference carried by binaural audio as spatial cues, and we utilize positional
representations of sounding objects as additional modality guidance. We also
leverage multi-level cross-modal attention to perform visual-positional
collaboration with audio features. In addition, we adopt a pre-trained monaural
separator to transfer knowledge from rich mono sounds to boost spatial audio
separation. This exploits the correlation between monaural and binaural
channels. Experiments on the FAIR-Play dataset demonstrate the superiority of
the proposed LAVSS over existing benchmarks of audio-visual separation. Our
project page: https://yyx666660.github.io/LAVSS/.
Comment: Accepted by WACV202
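The LAVSS abstract names the phase difference between binaural channels as its spatial cue but does not say how it is computed. As a minimal sketch, assuming the common definition of interaural phase difference (the phase of the left spectrum relative to the right, per frequency bin), it could look like the following. The function names, the naive DFT, and the toy signal are illustrative assumptions, not the authors' implementation.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame (bins 0..n/2)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def interaural_phase_difference(left, right):
    """Phase of the left channel relative to the right, per bin, in (-pi, pi]."""
    return [cmath.phase(l * r.conjugate()) for l, r in zip(dft(left), dft(right))]


# Hypothetical example: a tone in bin 4 that reaches the right ear
# a quarter cycle later than the left ear.
n, k = 64, 4
left = [math.cos(2 * math.pi * k * t / n) for t in range(n)]
right = [math.cos(2 * math.pi * k * t / n - math.pi / 2) for t in range(n)]
ipd = interaural_phase_difference(left, right)
```

In this toy case the value at bin 4 is close to pi/2, reflecting the quarter-cycle lag; bins that carry no energy hold numerically meaningless phases and would normally be masked out.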
Efficiently Leveraging Linguistic Priors for Scene Text Spotting
Incorporating linguistic knowledge can improve scene text recognition, but it
is questionable whether the same holds for scene text spotting, which typically
involves text detection and recognition. This paper proposes a method that
leverages linguistic knowledge from a large text corpus to replace the
traditional one-hot encoding used in auto-regressive scene text spotting and
recognition models. This allows the model to capture the relationship between
characters in the same word. Additionally, we introduce a technique to generate
text distributions that align well with scene text datasets, removing the need
for in-domain fine-tuning. As a result, the newly created text distributions
are more informative than pure one-hot encoding, leading to improved spotting
and recognition performance. Our method is simple and efficient, and it can
easily be integrated into existing auto-regressive-based approaches.
Experimental results show that our method not only improves recognition
accuracy but also enables more accurate localization of words. It significantly
improves both state-of-the-art scene text spotting and recognition pipelines,
achieving state-of-the-art results on several benchmarks.
Comment: 10 page
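The abstract above describes replacing one-hot character targets with distributions derived from a text corpus, without giving the construction. The sketch below illustrates the general idea under loud assumptions: a character bigram prior with add-one smoothing, blended with the one-hot target by a hypothetical `mix` weight. None of these choices are taken from the paper.

```python
from collections import Counter, defaultdict


def bigram_distributions(corpus_words, alphabet):
    """Estimate P(next char | previous char) from a word list (add-one smoothing)."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        for prev, nxt in zip(word, word[1:]):
            counts[prev][nxt] += 1
    dists = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        dists[prev] = {c: (counts[prev][c] + 1) / total for c in alphabet}
    return dists


def soft_target(prev_char, true_char, dists, alphabet, mix=0.2):
    """Blend the one-hot target with the corpus prior for this context."""
    prior = dists[prev_char]
    return {
        c: (1 - mix) * (1.0 if c == true_char else 0.0) + mix * prior[c]
        for c in alphabet
    }


alphabet = "abcdefghijklmnopqrstuvwxyz"
corpus = ["the", "then", "this", "that", "thin"]  # toy stand-in for a large corpus
dists = bigram_distributions(corpus, alphabet)
target = soft_target("t", "h", dists, alphabet)
```

The resulting target still peaks at the true character but spreads a little mass over characters the corpus considers plausible in that context, which is the sense in which such a target carries more information than a pure one-hot vector.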