Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification
Most state-of-the-art self-supervised speaker verification systems rely on a
contrastive-based objective function to learn speaker representations from
unlabeled speech data. We explore different ways to improve the performance of
these methods by: (1) revisiting how positive and negative pairs are sampled
through a "symmetric" formulation of the contrastive loss; (2) introducing
margins similar to AM-Softmax and AAM-Softmax that have been widely adopted in
the supervised setting. We demonstrate the effectiveness of the symmetric
contrastive loss which provides more supervision for the self-supervised task.
Moreover, we show that Additive Margin and Additive Angular Margin reduce
the overall number of false negatives and false positives by improving
speaker separability. Finally, by combining both techniques and training a
larger model, we achieve 7.50% EER and 0.5804 minDCF on the VoxCeleb1 test set,
which outperforms other contrastive self-supervised methods on speaker
verification.
Comment: Accepted at INTERSPEECH 2023, 20th-24th August 2023, Dublin, Ireland
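The two ideas in this abstract, a symmetric contrastive loss and an additive margin on the positive pair, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `margin` and `temperature` parameters and their default values are assumptions, and "symmetric" is read here simply as using each of the two augmented views in turn as the anchor. The margin is subtracted from the positive-pair cosine similarity, in the style of AM-Softmax.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def directional_loss(anchors, others, margin, temperature):
    """Contrastive loss with anchors[i] matched to others[i].

    The additive margin is applied to the positive pair only, so the
    positive must beat every negative by at least `margin`.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = []
        for j in range(n):
            s = cosine(anchors[i], others[j])
            if i == j:
                s -= margin  # additive margin on the positive pair
            logits.append(s / temperature)
        # Numerically stable log-sum-exp over positive and negatives.
        max_logit = max(logits)
        log_sum = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_sum - logits[i]
    return total / n


def symmetric_am_contrastive_loss(view_a, view_b, margin=0.1, temperature=0.1):
    """Average the loss over both anchor directions (a to b and b to a)."""
    return 0.5 * (
        directional_loss(view_a, view_b, margin, temperature)
        + directional_loss(view_b, view_a, margin, temperature)
    )
```

With `margin=0.0` this reduces to a plain two-direction contrastive loss; a positive margin makes the same batch strictly harder, which is the mechanism the abstract credits with improving speaker separability.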
Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach for speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase robustness to utterances of arbitrary
duration, this paper improves MSA by using a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves previous MSA methods with a smaller number of
parameters. It also achieves better performance than state-of-the-art
approaches for both short and long utterances.
Comment: Accepted to Interspeech 202
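The top-down pathway with lateral connections mentioned in this abstract can be sketched in a few lines. This is a simplified illustration, not the paper's module: it assumes each layer's output is a `[channels][time]` nested list, uses a plain channel-mixing matrix as the lateral (1x1) projection, and upsamples along time by nearest neighbour. All function names and the weight layout are hypothetical.

```python
def lateral_projection(features, weights):
    """1x1 projection: features [C_in][T], weights [C_out][C_in] -> [C_out][T]."""
    n_frames = len(features[0])
    return [
        [sum(w * features[c][t] for c, w in enumerate(row)) for t in range(n_frames)]
        for row in weights
    ]


def upsample_nearest(features, target_len):
    """Stretch each channel along time to target_len by nearest neighbour."""
    src_len = len(features[0])
    return [
        [channel[min(t * src_len // target_len, src_len - 1)] for t in range(target_len)]
        for channel in features
    ]


def top_down_pathway(levels, lateral_weights):
    """Enhance multi-scale features, deepest level first.

    `levels` is ordered from the shallowest layer (longest time axis)
    to the deepest layer (shortest time axis).
    """
    enhanced = [None] * len(levels)
    # The deepest level has nothing above it, so it is only projected.
    enhanced[-1] = lateral_projection(levels[-1], lateral_weights[-1])
    for k in range(len(levels) - 2, -1, -1):
        lateral = lateral_projection(levels[k], lateral_weights[k])
        upsampled = upsample_nearest(enhanced[k + 1], len(lateral[0]))
        # Lateral connection: add the upsampled deeper features.
        enhanced[k] = [
            [a + b for a, b in zip(lat_ch, up_ch)]
            for lat_ch, up_ch in zip(lateral, upsampled)
        ]
    return enhanced
```

In this sketch every enhanced level ends up with the same number of channels, so each one can then be pooled over time and the results combined into a single embedding, which is the multi-scale aggregation step the abstract builds on.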