Centroid-based deep metric learning for speaker recognition
Speaker embedding models that utilize neural networks to map utterances to a
space where distances reflect similarity between speakers have driven recent
progress in the speaker recognition task. However, there is still a significant
performance gap between recognizing speakers in the training set and unseen
speakers. The latter case corresponds to the few-shot learning task, where a
trained model is evaluated on unseen classes. Here, we optimize a speaker
embedding model with prototypical network loss (PNL), a state-of-the-art
approach for the few-shot image classification task. The resulting embedding
model outperforms the state-of-the-art triplet-loss-based models in both
speaker verification and identification tasks, for both seen and unseen
speakers.
Comment: ICASSP 2019 (44th International Conference on Acoustics, Speech, and Signal Processing).
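As a rough illustration of the approach, the sketch below computes a prototypical network loss over one episode of speaker embeddings: each speaker's support embeddings are averaged into a centroid (prototype), and each query embedding is classified by its distance to the prototypes. The embedding dimension, episode sizes, and use of squared Euclidean distance are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(embeddings: torch.Tensor, n_support: int) -> torch.Tensor:
    """Prototypical network loss over one episode.

    embeddings: (n_speakers, n_support + n_query, dim), grouped by speaker.
    """
    n_spk, _, dim = embeddings.shape
    support, query = embeddings[:, :n_support], embeddings[:, n_support:]
    prototypes = support.mean(dim=1)                 # (n_spk, dim) centroids
    queries = query.reshape(-1, dim)                 # (n_spk * n_query, dim)
    dists = torch.cdist(queries, prototypes) ** 2    # squared Euclidean
    # Each query should be closest to its own speaker's prototype.
    labels = torch.arange(n_spk).repeat_interleave(query.shape[1])
    return F.cross_entropy(-dists, labels)

# Toy episode: 5 speakers, 3 support + 2 query utterances, 256-dim embeddings.
loss = prototypical_loss(torch.randn(5, 5, 256), n_support=3)
```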
Supervised attention for speaker recognition
The recently proposed self-attentive pooling (SAP) has shown good performance
in several speaker recognition systems. In SAP systems, the context vector is
trained end-to-end together with the feature extractor, where the role of
context vector is to select the most discriminative frames for speaker
recognition. However, the SAP underperforms compared to the temporal average
pooling (TAP) baseline in some settings, which implies that the attention is
not learnt effectively in end-to-end training. To tackle this problem, we
introduce strategies for training the attention mechanism in a supervised
manner, using classified samples to learn the context vector. With our
proposed methods, the context vector can be guided to select the most informative
frames. We show that our method outperforms existing approaches in various
experimental settings including short utterance speaker recognition, and
achieves competitive performance over the existing baselines on the VoxCeleb
datasets.
Comment: SLT 2021.
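For reference, the sketch below shows the SAP mechanism the abstract builds on: a learnable context vector scores frame-level features, and the attention-weighted mean of the frames forms the utterance embedding. The projection layer, tanh nonlinearity, and dimensions are assumptions; the paper's supervised training strategies for the context vector are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """SAP: frame-level features are scored against a learnable context
    vector, and the utterance embedding is their attention-weighted mean."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)            # frame transform
        self.context = nn.Parameter(torch.randn(feat_dim))   # context vector

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) from the feature extractor.
        h = torch.tanh(self.proj(frames))            # (B, T, D)
        scores = h @ self.context                    # (B, T) frame relevance
        weights = F.softmax(scores, dim=1)           # attention over frames
        return (weights.unsqueeze(-1) * frames).sum(dim=1)   # (B, D)

pool = SelfAttentivePooling(feat_dim=512)
utt_emb = pool(torch.randn(8, 100, 512))  # 8 utterances, 100 frames each
```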
Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning
Domain generalization remains a critical problem for speaker recognition,
even with the state-of-the-art architectures based on deep neural nets. For
example, a model trained on read speech may largely fail when applied to
singing or movie scenarios. In this paper, we propose a domain-invariant
projection to improve the generalizability of speaker vectors. This projection
is a simple neural net and is trained following the Model-Agnostic
Meta-Learning (MAML) principle, for which the objective is that the projection,
once updated with speech data from one domain, can classify speakers in another domain. We
tested the proposed method on CNCeleb, a new dataset consisting of
single-speaker multi-condition (SSMC) data. The results demonstrated that the
MAML-based domain-invariant projection can produce more generalizable speaker
vectors, and effectively improve the performance in unseen domains.
Comment: Submitted to INTERSPEECH 2020.
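A minimal sketch of the MAML principle as described, assuming a first-order approximation for brevity: the projection takes a gradient step on speech data from one domain, and the meta-loss requires the adapted projection to classify speakers in another domain. The projection architecture, shared classifier head, and learning rate are hypothetical.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def fomaml_step(proj: nn.Module, loss_fn, domain_a, domain_b, inner_lr=0.01):
    """One first-order MAML step for a speaker-vector projection.

    domain_a / domain_b: (speaker_vectors, speaker_labels) tuples drawn
    from two different domains (e.g. read speech vs. singing).
    """
    adapted = copy.deepcopy(proj)
    # Inner loop: one gradient step on domain A.
    inner = loss_fn(adapted(domain_a[0]), domain_a[1])
    grads = torch.autograd.grad(inner, adapted.parameters())
    with torch.no_grad():
        for p, g in zip(adapted.parameters(), grads):
            p -= inner_lr * g
    # Outer loss: the adapted projection must classify speakers in domain B.
    outer = loss_fn(adapted(domain_b[0]), domain_b[1])
    outer.backward()
    # First-order approximation: reuse the adapted gradients for the
    # original weights, which an optimizer on `proj` would then apply.
    for p, q in zip(proj.parameters(), adapted.parameters()):
        p.grad = q.grad.clone() if p.grad is None else p.grad + q.grad
    return outer.item()

proj = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
head = nn.Linear(512, 100)  # hypothetical shared head over training speakers
loss_fn = lambda z, y: F.cross_entropy(head(z), y)
dom_a = (torch.randn(32, 512), torch.randint(0, 100, (32,)))
dom_b = (torch.randn(32, 512), torch.randint(0, 100, (32,)))
fomaml_step(proj, loss_fn, dom_a, dom_b)
```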
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
Learning good representations without supervision is still an open issue in
machine learning, and is particularly challenging for speech signals, which are
often characterized by long sequences with a complex hierarchical structure.
Some recent works, however, have shown that it is possible to derive useful
speech representations by employing a self-supervised encoder-discriminator
approach. This paper proposes an improved self-supervised method, where a
single neural encoder is followed by multiple workers that jointly solve
different self-supervised tasks. The needed consensus across different tasks
naturally imposes meaningful constraints on the encoder, helping it to
discover general representations and to minimize the risk of learning
superficial ones. Experiments show that the proposed approach can learn
transferable, robust, and problem-agnostic features that carry relevant
information from the speech signal, such as speaker identity, phonemes, and
even higher-level features such as emotional cues. In addition, a number of
design choices make the encoder easily exportable, facilitating its direct
usage or adaptation to different problems.
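The encoder-plus-workers pattern the abstract describes could look roughly like the following, where a single shared encoder feeds several small heads that each solve a different self-supervised task and their losses are summed. The convolutional encoder and the worker target dimensions (e.g. waveform-, MFCC-, and prosody-like targets) are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class EncoderWithWorkers(nn.Module):
    """One shared encoder feeding several self-supervised 'workers'.
    Each worker regresses a different target; the summed per-worker
    losses jointly constrain the shared representation."""
    def __init__(self, feat_dim=512, target_dims=(1, 20, 4)):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for a real conv encoder
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.workers = nn.ModuleList(
            [nn.Conv1d(feat_dim, d, kernel_size=1) for d in target_dims]
        )

    def forward(self, wav: torch.Tensor):
        # wav: (batch, 1, samples) raw waveform.
        z = self.encoder(wav)                    # (B, feat_dim, frames)
        return z, [w(z) for w in self.workers]   # shared code + predictions

model = EncoderWithWorkers()
z, preds = model(torch.randn(4, 1, 16000))
# Training would sum one regression/classification loss per worker
# against that worker's own target.
```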
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
In practical settings, a speaker recognition system needs to identify a
speaker given a short utterance, while the enrollment utterance may be
relatively long. However, existing speaker recognition models perform poorly
with such short utterances. To solve this problem, we introduce a meta-learning
framework for imbalanced length pairs. Specifically, we use a Prototypical
Network and train it with a support set of long utterances and a query set of
short utterances of varying lengths. Further, since optimizing only for the
classes in the given episode may be insufficient for learning discriminative
embeddings for unseen classes, we additionally require the model to classify
both the support and the query set against the entire set of classes in the
training set. By combining these two learning schemes, our model outperforms
existing state-of-the-art speaker verification models learned with a standard
supervised learning framework on short utterances (1-2 seconds) on the VoxCeleb
datasets. We also validate our proposed model for unseen speaker
identification, on which it also achieves significant performance gains over
the existing approaches. The code is available at
https://github.com/seongmin-kye/meta-SR.
Comment: Accepted to Interspeech 2020.
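The imbalanced-length episode construction might be sketched as follows: support utterances are kept long, while query utterances are randomly cropped to 1-2 seconds. Segment lengths, sampling rate, and the assumption of at least two sufficiently long utterances per speaker are illustrative; an episode built this way would then feed a prototypical-style loss plus the global classification over all training speakers that the abstract mentions.

```python
import random
import torch

def make_imbalanced_episode(waveforms, sr=16000, support_sec=6.0,
                            query_range=(1.0, 2.0)):
    """Build one episode with long support and short query utterances.

    waveforms: dict speaker_id -> list of 1-D waveform tensors, each with
               at least two utterances of >= support_sec (an assumption).
    Returns (support, query) lists of (speaker_id, segment) pairs.
    """
    support, query = [], []
    for spk, utts in waveforms.items():
        random.shuffle(utts)
        s, q = utts[0], utts[1]
        support.append((spk, s[: int(support_sec * sr)]))   # long support
        q_len = int(random.uniform(*query_range) * sr)      # 1-2 s crop
        start = random.randint(0, max(0, q.numel() - q_len))
        query.append((spk, q[start:start + q_len]))
    return support, query

waves = {i: [torch.randn(16000 * 8) for _ in range(2)] for i in range(5)}
support, query = make_imbalanced_episode(waves)
```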
Cross attentive pooling for speaker verification
The goal of this paper is text-independent speaker verification where
utterances come from 'in the wild' videos and may contain irrelevant signal.
While speaker verification is naturally a pair-wise problem, existing methods
to produce the speaker embeddings are instance-wise. In this paper, we propose
Cross Attentive Pooling (CAP) that utilizes the context information across the
reference-query pair to generate utterance-level embeddings that contain the
most discriminative information for the pair-wise matching problem. Experiments
are performed on the VoxCeleb dataset in which our method outperforms
comparable pooling strategies.
Comment: SLT 2021. Code available at https://github.com/seongmin-kye/CA
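A toy version of the pair-wise idea, assuming simple dot-product affinities rather than the paper's exact scoring function: each utterance's frame weights are derived from its affinity to the other utterance in the reference-query pair, so both embeddings are conditioned on their comparison partner.

```python
import torch
import torch.nn.functional as F

def cross_attentive_pool(ref: torch.Tensor, qry: torch.Tensor):
    """Pool each utterance's frames using the *other* utterance as context.

    ref, qry: (time, dim) frame-level features of a reference-query pair.
    Returns a pair of (dim,) utterance embeddings conditioned on each other.
    """
    # Dot-product affinities between every ref frame and every qry frame.
    affinity = ref @ qry.T                                 # (T_ref, T_qry)
    # Each utterance's frame weights come from its relevance to the other.
    w_ref = F.softmax(affinity.max(dim=1).values, dim=0)   # (T_ref,)
    w_qry = F.softmax(affinity.max(dim=0).values, dim=0)   # (T_qry,)
    return w_ref @ ref, w_qry @ qry                        # (dim,), (dim,)

e_ref, e_qry = cross_attentive_pool(torch.randn(90, 256), torch.randn(120, 256))
score = F.cosine_similarity(e_ref, e_qry, dim=0)  # pair-wise similarity
```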
Siamese Capsule Network for End-to-End Speaker Recognition In The Wild
We propose an end-to-end deep model for speaker verification in the wild. Our
model uses a thin ResNet for extracting speaker embeddings from utterances and a
Siamese capsule network with dynamic routing as the back-end to calculate a
similarity score between the embeddings. We conduct a series of experiments
comparing our model to state-of-the-art solutions, showing that our model
outperforms all the other models while using a substantially smaller amount of training
data. We also perform additional experiments to study the impact of different
speaker embeddings on the Siamese capsule network. We show that the best
performance is achieved by using embeddings obtained directly from the feature
aggregation module of the front-end and passing them to higher capsules using
dynamic routing.
Comment: Submitted to ICASSP 2021.
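The abstract does not specify the capsule configuration, but a generic routing-by-agreement step between lower and higher capsules, which a Siamese back-end could apply to each embedding before comparison, might look like this sketch; all shapes and the number of routing iterations are assumptions.

```python
import torch
import torch.nn.functional as F

def squash(v, dim=-1, eps=1e-8):
    """Capsule nonlinearity: keeps direction, maps norm into [0, 1)."""
    n2 = (v ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1 + n2)) * v / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat: torch.Tensor, iters: int = 3) -> torch.Tensor:
    """Routing-by-agreement between lower and higher capsules.

    u_hat: (n_lower, n_higher, dim) prediction vectors from lower capsules.
    Returns higher-capsule outputs of shape (n_higher, dim).
    """
    b = torch.zeros(u_hat.shape[:2])              # routing logits
    for _ in range(iters):
        c = F.softmax(b, dim=1)                   # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)  # weighted sum per capsule
        v = squash(s)                             # (n_higher, dim)
        b = b + (u_hat * v).sum(dim=-1)           # agreement update
    return v

# Two embeddings routed this way could then be compared by a Siamese
# score, e.g. cosine similarity of the resulting capsule outputs.
caps = dynamic_routing(torch.randn(16, 8, 32))
```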
A Deep Neural Network for Short-Segment Speaker Recognition
Today's interactive devices such as smartphone assistants and smart speakers
often deal with short-duration speech segments. As a result, speaker
recognition systems integrated into such devices will be much better served
by models capable of performing the recognition task with short-duration
utterances. In this paper, a new deep neural network, UtterIdNet, capable of
performing speaker recognition with short speech segments is proposed. Our
proposed model utilizes a novel architecture that makes it suitable for
short-segment speaker recognition through increased, efficient use of the
information in short speech segments. UtterIdNet has been trained and tested on
the VoxCeleb datasets, the latest benchmarks in speaker recognition.
Evaluations for different segment durations show consistent and stable
performance for short segments, with significant improvement over the previous
models for segments of 2 seconds, 1 second, and especially sub-second durations
(250 ms and 500 ms).
Comment: Accepted at Interspeech 2019.
Meta-learning for robust child-adult classification from speech
Computational modeling of naturalistic conversations in clinical applications
has seen growing interest in the past decade. An important use-case involves
child-adult interactions within the autism diagnosis and intervention domain.
In this paper, we address a specific sub-problem of speaker diarization, namely
child-adult speaker classification in such dyadic conversations with specified
roles. Training a speaker classification system robust to speaker and channel
conditions is challenging due to inherent variability in the speech of
children and their adult interlocutors. In this work, we propose the use of
meta-learning, in particular prototypical networks, which optimize a metric
space across multiple tasks. By modeling every child-adult pair in the training
set as a separate task during meta-training, we learn a representation with
improved generalizability compared to conventional supervised learning. We
demonstrate improvements over state-of-the-art speaker embeddings (x-vectors)
under two evaluation settings: weakly supervised classification (up to 14.53%
relative improvement in F1-scores) and clustering (up to 9.66% relative
improvement in cluster purity). Our results show that protonets can potentially
extract robust speaker embeddings for child-adult classification from speech.
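The per-pair task construction could be sketched as below, treating each child-adult session as a 2-way episode; the resulting support/query tensors can be concatenated and fed to a prototypical loss such as the one sketched earlier. Session structure, segment counts, and embedding dimensions are hypothetical.

```python
import random
import torch

def sample_child_adult_episode(sessions, n_support=5, n_query=5):
    """Treat one child-adult session as a 2-way task for meta-training.

    sessions: list of dicts {"child": [emb, ...], "adult": [emb, ...]}
    Returns support/query tensors of shape (2, n_support|n_query, dim).
    """
    sess = random.choice(sessions)
    support, query = [], []
    for role in ("child", "adult"):              # the two classes per task
        segs = random.sample(sess[role], n_support + n_query)
        support.append(torch.stack(segs[:n_support]))
        query.append(torch.stack(segs[n_support:]))
    return torch.stack(support), torch.stack(query)

sessions = [{"child": [torch.randn(128) for _ in range(12)],
             "adult": [torch.randn(128) for _ in range(12)]}]
s, q = sample_child_adult_episode(sessions)
# torch.cat([s, q], dim=1) then feeds a 2-way prototypical loss.
```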
Speaker diarization with session-level speaker embedding refinement using graph neural networks
Deep speaker embedding models have been commonly used as a building block for
speaker diarization systems; however, the speaker embedding model is usually
trained according to a global loss defined on the training data, which could be
sub-optimal for distinguishing speakers locally in a specific meeting session.
In this work we present the first use of graph neural networks (GNNs) for the
speaker diarization problem, utilizing a GNN to refine speaker embeddings
locally using the structural information between speech segments inside each
session. The speaker embeddings extracted by a pre-trained model are remapped
into a new embedding space, in which the different speakers within a single
session are better separated. The model is trained for linkage prediction in a
supervised manner by minimizing the difference between the affinity matrix
constructed by the refined embeddings and the ground-truth adjacency matrix.
Spectral clustering is then applied on top of the refined embeddings. We show
that the clustering performance of the refined speaker embeddings outperforms
the original embeddings significantly on both simulated and real meeting data,
and our system achieves the state-of-the-art result on the NIST SRE 2000
CALLHOME database.
Comment: ICASSP 2020 (45th International Conference on Acoustics, Speech, and Signal Processing).
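A rough sketch of the linkage-prediction training described, assuming a simple normalized-adjacency message-passing layer rather than the paper's exact GNN: segment embeddings are remapped within a session, and the loss matches the refined affinity matrix to the ground-truth same-speaker adjacency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingRefiner(nn.Module):
    """Refine per-segment speaker embeddings with two simple
    message-passing layers over the session's segment graph."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gnn1 = nn.Linear(dim, dim)
        self.gnn2 = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n_segments, dim); adj: (n, n) graph built e.g. from the
        # initial embedding similarities within the session.
        a = adj + torch.eye(adj.shape[0])        # add self-loops
        a = a / a.sum(dim=1, keepdim=True)       # row-normalise
        h = F.relu(self.gnn1(a @ x))
        h = self.gnn2(a @ h)
        return F.normalize(h, dim=1)             # refined embeddings

def linkage_loss(refined, gt_adj):
    """Match the refined affinity matrix to ground-truth speaker links."""
    return F.mse_loss(refined @ refined.T, gt_adj)

# Toy session: 20 segments from 2 alternating speakers.
x = torch.randn(20, 128)
spk = torch.arange(20) % 2
gt_adj = (spk.unsqueeze(0) == spk.unsqueeze(1)).float()
xn = F.normalize(x, dim=1)
init_adj = torch.relu(xn @ xn.T)                 # initial segment graph
model = EmbeddingRefiner()
loss = linkage_loss(model(x, init_adj), gt_adj)
# Spectral clustering is then applied on top of the refined affinities.
```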