Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach for speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features extracted from the last layer of a
speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance on variable-duration
utterances. To further increase robustness to utterances of arbitrary duration,
this paper improves MSA by using a feature pyramid module. The module enhances
the speaker-discriminative information of features from multiple layers via a
top-down pathway and lateral connections. We extract speaker embeddings from
the enhanced features, which contain rich speaker information at different
time scales. Experiments on the VoxCeleb dataset show that the proposed module
improves on previous MSA methods while using fewer parameters. It also achieves
better performance than state-of-the-art approaches for both short and long
utterances.

Comment: Accepted to Interspeech 202
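The top-down pathway with lateral connections described above follows the general feature-pyramid pattern. Below is a minimal sketch of how such a module could enhance multi-scale features before pooling, assuming PyTorch, 1-D (time-axis) feature maps from three extractor stages, and illustrative channel sizes and module names; it is not the paper's exact architecture.

```python
# Sketch of a feature-pyramid-style enhancement for multi-scale aggregation.
# Assumes three feature maps (shallow -> deep) of shape (B, C_i, T_i).
# Channel sizes and the final statistics pooling are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeaturePyramidMSA(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions project each stage to a common width.
        self.laterals = nn.ModuleList(
            nn.Conv1d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # Smoothing convolutions applied after top-down fusion.
        self.smooths = nn.ModuleList(
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: list of tensors [(B, C_i, T_i)], ordered shallow -> deep.
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]

        # Top-down pathway: upsample the deeper map along time and add it
        # to the lateral projection of the shallower map.
        for i in range(len(laterals) - 2, -1, -1):
            upsampled = F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-1], mode="nearest"
            )
            laterals[i] = laterals[i] + upsampled

        enhanced = [smooth(x) for smooth, x in zip(self.smooths, laterals)]

        # Aggregate: statistics-pool each enhanced map and concatenate
        # (one simple choice of multi-scale aggregation).
        pooled = [
            torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1) for x in enhanced
        ]
        return torch.cat(pooled, dim=-1)  # (B, 2 * out_channels * num_stages)
```

For example, with inputs of shapes (B, 256, 400), (B, 512, 200), and (B, 1024, 100), this sketch returns a (B, 1536) utterance-level vector that a subsequent embedding layer could map to the final speaker embedding.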
Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation
Transformer-based speech self-supervised learning (SSL) models, such as
HuBERT, show surprising performance in various speech processing tasks.
However, the huge number of parameters in speech SSL models necessitates
compression to a more compact model for wider use in academia and small
companies. In this study, we propose reusing attention maps across Transformer
layers, which removes the key and query parameters while retaining the number
of layers. Furthermore, we propose a novel masking distillation strategy to
improve the student model's speech representation quality. We extend the
distillation loss to utilize both masked and unmasked speech frames so as to
fully leverage the teacher model's high-quality representations. Our universal
compression strategy yields a student model that achieves a phoneme error rate
(PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB benchmark.

Comment: Interspeech 2023. Code URL: https://github.com/sungnyun/ARMHuBER
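As a rough illustration of the attention-map-reuse idea, the sketch below pairs a standard layer, which computes and returns its attention map, with a "reuse" layer that drops its query and key projections and simply applies a map handed down from the earlier layer. All names, dimensions, and the pairing scheme are assumptions for illustration, not the authors' implementation; see the linked repository for the actual code.

```python
# Sketch of attention map reuse across Transformer layers (PyTorch).
# The reuse layer keeps only value and output projections, so its W_Q and W_K
# parameters disappear while the number of layers stays the same.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionComputingLayer(nn.Module):
    """Standard self-attention that also returns its attention map."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out), attn  # attn can be reused by a later layer


class ReuseAttentionLayer(nn.Module):
    """Attention layer without W_Q/W_K; reuses a precomputed attention map."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.v_proj = nn.Linear(dim, dim)    # only value ...
        self.out_proj = nn.Linear(dim, dim)  # ... and output projections remain

    def forward(self, x, attn_map):
        # x:        (B, T, dim) hidden states of this layer
        # attn_map: (B, num_heads, T, T) weights computed by an earlier layer
        B, T, _ = x.shape
        v = self.v_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        out = torch.matmul(attn_map, v)              # (B, H, T, head_dim)
        out = out.transpose(1, 2).reshape(B, T, -1)  # (B, T, dim)
        return self.out_proj(out)
```

In a compressed student built this way, layers could be grouped so that one layer in each group computes the attention map and the others reuse it, trading a small amount of flexibility for a direct reduction in parameters.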