Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity-sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining tailored to this task, which is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 2018.
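As a concrete illustration of this kind of training signal, below is a minimal PyTorch sketch of a cross-modal contrastive loss in which a tunable hard-negative fraction stands in for the curriculum schedule. The loss form, margin, and function names are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(face_emb, voice_emb, margin=0.6, hard_frac=0.5):
    """face_emb, voice_emb: (N, D) L2-normalised embeddings of N talking-face
    clips, aligned by row so that row i of each modality shares an identity.
    Negatives are all mismatched face/voice pairs; hard_frac controls how many
    of the hardest negatives enter the loss (a crude stand-in for a curriculum
    that gradually raises the mining difficulty)."""
    sim = face_emb @ voice_emb.t()                      # (N, N) cosine similarities
    pos = sim.diag()                                    # matched face/voice pairs
    neg = sim[~torch.eye(len(sim), dtype=torch.bool)]   # all mismatched pairs
    k = max(1, int(hard_frac * neg.numel()))
    hard_neg, _ = neg.topk(k)                           # keep the hardest negatives
    # Hinge-style objective: pull positives up, push hard negatives below margin.
    return (1 - pos).mean() + F.relu(hard_neg - margin).mean()
```

In a curriculum, hard_frac would start small and grow over training so that early epochs see mostly easy negatives.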
Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation
This paper explores the use of ASR-pretrained Conformers for speaker
verification, leveraging their strengths in modeling speech signals. We
introduce three strategies: (1) Transfer learning to initialize the speaker
embedding network, improving generalization and reducing overfitting. (2)
Knowledge distillation to train a more flexible speaker verification model,
incorporating frame-level ASR loss as an auxiliary task. (3) A lightweight
speaker adaptor for efficient feature conversion without altering the original
ASR Conformer, allowing parallel ASR and speaker verification. Experiments on
VoxCeleb show significant improvements: transfer learning yields a 0.48% EER,
knowledge distillation results in a 0.43% EER, and the speaker adaptor approach, which adds just 4.92M parameters to a 130.94M-parameter model, achieves a 0.57% EER. Overall, our methods effectively transfer ASR capabilities to speaker verification tasks.
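A minimal sketch of how strategy (2) might combine the two objectives, assuming a cross-entropy speaker loss plus a temperature-scaled KL term on frame-level ASR posteriors; the weighting, temperature, and names are illustrative, not the paper's exact recipe.

```python
import torch.nn.functional as F

def distillation_loss(spk_logits, spk_labels, student_asr_logits,
                      teacher_asr_logits, alpha=0.3, temperature=2.0):
    """spk_logits: (B, n_speakers); ASR logits: (B, T, n_tokens).
    The frozen ASR-pretrained Conformer provides the teacher posteriors."""
    # Primary task: speaker classification (stand-in for AAM-softmax and similar).
    spk_loss = F.cross_entropy(spk_logits, spk_labels)
    # Auxiliary task: match the teacher's frame-level token posteriors.
    t = temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_asr_logits / t, dim=-1),
        F.softmax(teacher_asr_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return spk_loss + alpha * kd_loss
```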
Efficient Black-Box Speaker Verification Model Adaptation with Reprogramming and Backend Learning
The development of deep neural networks (DNNs) has significantly enhanced the
performance of speaker verification (SV) systems in recent years. However, a
critical issue that persists when applying DNN-based SV systems in practical
applications is domain mismatch. To mitigate the performance degradation caused
by the mismatch, domain adaptation becomes necessary. This paper introduces an
approach to adapt DNN-based SV models by manipulating the learnable model
inputs, inspired by the concept of adversarial reprogramming. The pre-trained
SV model remains fixed and functions solely in the forward process, resembling
a black-box model. A lightweight network is utilized to estimate the gradients
for the learnable parameters at the input, which bypasses the gradient
backpropagation through the black-box model. The reprogrammed output is
processed by a two-layer backend learning module as the final adapted speaker
embedding. Our design keeps the number of parameters involved in the gradient calculation small, and with only a few additional parameters the proposed method achieves both memory and parameter efficiency. The experiments are conducted in language-mismatch scenarios. At a much lower computational cost, the proposed method achieves performance close or superior to that of the fully fine-tuned models in our experiments, demonstrating its effectiveness.
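One plausible way to wire this up is the surrogate-gradient pattern below: the black-box model supplies the forward value while a lightweight network carries the backward path, and a two-layer backend refines the result. This is my illustrative reading of the structure described above; the paper's actual gradient-estimation scheme may differ.

```python
import torch
import torch.nn as nn

class ReprogrammedSV(nn.Module):
    def __init__(self, blackbox, feat_dim, emb_dim):
        super().__init__()
        self.blackbox = blackbox                 # frozen SV model, forward only
        self.delta = nn.Parameter(torch.zeros(1, feat_dim))  # learnable input shift
        self.g = nn.Linear(feat_dim, emb_dim)    # lightweight gradient estimator
        self.backend = nn.Sequential(            # two-layer backend module
            nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))

    def forward(self, x):                        # x: (B, feat_dim) input features
        z = x + self.delta                       # reprogrammed input
        with torch.no_grad():
            emb = self.blackbox(z)               # black-box forward pass only
        # Forward value comes from the black box; gradients flow through g.
        emb = emb + self.g(z) - self.g(z).detach()
        return self.backend(emb)                 # final adapted speaker embedding
```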
Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification
Even though deep speaker models have demonstrated impressive accuracy in
speaker verification tasks, this often comes at the expense of increased model
size and computation time, presenting challenges for deployment in
resource-constrained environments. Our research addresses this limitation through the development of small-footprint deep speaker embedding extraction using knowledge distillation. While previous work in this domain has
concentrated on speaker embedding extraction at the utterance level, our
approach involves amalgamating embeddings from different levels of the x-vector
model (teacher network) to train a compact student network. The results
highlight the significance of frame-level information, with the student models
exhibiting a remarkable size reduction of 85%-91% compared to their teacher
counterparts, depending on the size of the teacher embeddings. Notably, by
concatenating teacher embeddings, we achieve student networks that maintain
comparable performance to the teacher while enjoying a substantial 75%
reduction in model size. These findings and insights extend to other x-vector
variants, underscoring the broad applicability of our approach.
Comment: Submitted to Data & Knowledge Engineering in Dec. 2023. Copyright may be transferred without notice.
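The core distillation target can be sketched as below: concatenate teacher embeddings drawn from several levels of the x-vector model and train the compact student to match them under a cosine objective. The projection layer and loss choice are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multilevel_distill_loss(student_emb, teacher_embs, proj):
    """student_emb: (B, d_s); teacher_embs: list of (B, d_i) embeddings taken
    from different levels (frame- and utterance-level) of the x-vector teacher.
    proj maps the student embedding to the concatenated teacher dimension."""
    target = torch.cat(teacher_embs, dim=-1).detach()   # teacher stays frozen
    pred = proj(student_emb)
    # Maximise cosine similarity between student and stacked teacher views.
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()

# Hypothetical wiring: a 512-d student matched against 1500-d frame-level and
# 512-d utterance-level teacher embeddings (dimensions are illustrative).
proj = nn.Linear(512, 1500 + 512)
```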
Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification
With excellent generalization ability, self-supervised speech models have
shown impressive performance on various downstream speech tasks in the
pre-training and fine-tuning paradigm. However, as pre-trained models grow in size, fine-tuning becomes practically infeasible due to heavy computation and storage overhead, as well as the risk of overfitting. Adapters
are lightweight modules inserted into pre-trained models to facilitate
parameter-efficient adaptation. In this paper, we propose an effective adapter
framework designed for adapting self-supervised speech models to the speaker
verification task. With a parallel adapter design, our proposed framework
inserts two types of adapters into the pre-trained model, allowing the
adaptation of latent features within intermediate Transformer layers and output
embeddings from all Transformer layers. We conduct comprehensive experiments to
validate the efficiency and effectiveness of the proposed framework.
Experimental results on the VoxCeleb1 dataset demonstrate that the proposed
adapters surpass fine-tuning and other parameter-efficient transfer learning
methods, achieving superior performance while updating only 5% of the
parameters.
Comment: Accepted to ICASSP 2024.
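For reference, a minimal parallel bottleneck adapter in PyTorch, sketching the general pattern of running a small trainable module beside a frozen Transformer block and summing the outputs. Bottleneck size, activation, and initialisation follow generic adapter conventions, not the paper's exact design.

```python
import torch.nn as nn

class ParallelAdapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # up-projection
        nn.init.zeros_(self.up.weight)           # start as a no-op so the
        nn.init.zeros_(self.up.bias)             # frozen model is preserved

    def forward(self, x, frozen_block):
        # Run in parallel with the frozen block and add the outputs.
        return frozen_block(x) + self.up(self.act(self.down(x)))
```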
Learnable MFCCs for Speaker Verification
We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend
architecture for deep neural network (DNN) based automatic speaker
verification. Our architecture retains the simplicity and interpretability of
MFCC-based features while allowing the model to be adapted to data flexibly. In
practice, we formulate data-driven versions of the four linear transforms of a
standard MFCC extractor -- windowing, discrete Fourier transform (DFT), mel
filterbank, and discrete cosine transform (DCT). The reported results reach up to 6.7% (VoxCeleb1) and 9.7% (SITW) relative improvement in terms of equal error rate (EER) over static MFCCs, without additional tuning effort.
Comment: Accepted to ISCAS 2021.
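A rough PyTorch/torchaudio sketch of the idea: initialise the window, mel filterbank, and DCT with their standard values and expose them as trainable parameters (the DFT stage is kept fixed here via torch.stft for brevity, although the paper also formulates a data-driven DFT). Hyperparameters and the exact parameterisation are illustrative.

```python
import torch
import torch.nn as nn
import torchaudio

class LearnableMFCC(nn.Module):
    def __init__(self, n_fft=512, n_mels=40, n_mfcc=20, sr=16000):
        super().__init__()
        self.n_fft = n_fft
        # Standard initialisations, made trainable.
        self.window = nn.Parameter(torch.hamming_window(n_fft))
        fb = torchaudio.functional.melscale_fbanks(
            n_fft // 2 + 1, 0.0, sr / 2, n_mels, sr)        # (n_freqs, n_mels)
        self.mel_fb = nn.Parameter(fb)
        dct = torchaudio.functional.create_dct(n_mfcc, n_mels, norm="ortho")
        self.dct = nn.Parameter(dct)                         # (n_mels, n_mfcc)

    def forward(self, wav):                                  # wav: (B, T)
        spec = torch.stft(wav, self.n_fft, window=self.window,
                          return_complex=True).abs() ** 2    # power spectrum
        mel = spec.transpose(1, 2) @ self.mel_fb             # (B, frames, n_mels)
        return torch.log(mel + 1e-6) @ self.dct              # (B, frames, n_mfcc)
```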
Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations
In this study, we propose the global context guided channel and
time-frequency transformations to model the long-range, non-local
time-frequency dependencies and channel variances in speaker representations.
We use the global context information to enhance important channels and
recalibrate salient time-frequency locations by computing the similarity
between the global context and local features. The proposed modules, together
with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset,
a large-scale speaker verification corpus collected in the wild. This
lightweight block can be easily incorporated into a CNN model with little
additional computational costs and effectively improves the speaker
verification performance compared to the baseline ResNet-LDE model and the
Squeeze-and-Excitation block by a large margin. Detailed ablation studies are also
performed to analyze various factors that may impact the performance of the
proposed modules. We find that by employing the proposed L2-tf-GTFC
transformation block, the Equal Error Rate decreases from 4.56% to 3.07%, a
relative 32.68% reduction, alongside a relative 27.28% improvement in the
DCF score. The results indicate that our proposed global context guided
transformation modules can efficiently improve the learned speaker
representations by achieving time-frequency and channel-wise feature
recalibration.
Comment: Accepted to Interspeech 2020.
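As a simplified, SE-style view of what channel recalibration with a global context looks like (not the exact GTFC formulation), consider:

```python
import torch
import torch.nn as nn

class GlobalContextChannelGate(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (B, C, T, F) feature map
        context = x.mean(dim=(2, 3))       # global context per channel
        gate = self.fc(context)[:, :, None, None]
        return x * gate                    # channel-wise recalibration
```

The proposed modules additionally recalibrate salient time-frequency locations by comparing local features against the global context, which the simple gate above omits.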
Speaker verification using attentive multi-scale convolutional recurrent network
In this paper, we propose a speaker verification method by an Attentive
Multi-scale Convolutional Recurrent Network (AMCRN). The proposed AMCRN can
acquire both local spatial information and global sequential information from
the input speech recordings. In the proposed method, the log-Mel spectrum is
extracted from each speech recording and then fed to the proposed AMCRN for
learning speaker embedding. Afterwards, the learned speaker embedding is fed to
the back-end classifier (such as a cosine similarity metric) for scoring in the
testing stage. The proposed method is compared with state-of-the-art methods
for speaker verification. Experimental data are three public datasets that are
selected from two large-scale speech corpora (VoxCeleb1 and VoxCeleb2).
Experimental results show that our method exceeds baseline methods in terms of
equal error rate and minimal detection cost function, and has advantages over
most of the baseline methods in terms of computational complexity and memory requirements. In addition, our method generalizes well across truncated speech
segments with different durations, and the speaker embedding learned by the
proposed AMCRN has stronger generalization ability across two back-end
classifiers.
Comment: 21 pages, 6 figures, 8 tables. Accepted for publication in Applied Soft Computing.
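The cosine-similarity back-end mentioned above amounts to a one-line scoring rule; a minimal sketch follows (the threshold value is illustrative):

```python
import torch.nn.functional as F

def verify(enroll_emb, test_emb, threshold=0.5):
    """Accept the trial if the cosine similarity between the enrolment and
    test speaker embeddings exceeds the threshold."""
    score = F.cosine_similarity(enroll_emb, test_emb, dim=-1)
    return score, score > threshold
```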