Centroid-based deep metric learning for speaker recognition
Speaker embedding models that utilize neural networks to map utterances to a
space where distances reflect similarity between speakers have driven recent
progress in the speaker recognition task. However, there is still a significant
performance gap between recognizing speakers in the training set and unseen
speakers. The latter case corresponds to the few-shot learning task, where a
trained model is evaluated on unseen classes. Here, we optimize a speaker
embedding model with prototypical network loss (PNL), a state-of-the-art
approach for the few-shot image classification task. The resulting embedding
model outperforms the state-of-the-art triplet-loss-based models in both
speaker verification and identification tasks, for both seen and unseen
speakers.
Comment: ICASSP 2019 (44th International Conference on Acoustics, Speech, and Signal Processing).
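As a rough illustration of the approach, the sketch below computes a prototypical network loss over one episode of speaker embeddings: each speaker's support embeddings are averaged into a centroid (prototype), and each query embedding is classified by its distance to the prototypes. The embedding dimension, episode sizes, and use of squared Euclidean distance are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(embeddings: torch.Tensor, n_support: int) -> torch.Tensor:
    """Prototypical network loss over one episode.

    embeddings: (n_speakers, n_support + n_query, dim), grouped by speaker.
    """
    n_spk, _, dim = embeddings.shape
    support, query = embeddings[:, :n_support], embeddings[:, n_support:]
    prototypes = support.mean(dim=1)                 # (n_spk, dim) centroids
    queries = query.reshape(-1, dim)                 # (n_spk * n_query, dim)
    dists = torch.cdist(queries, prototypes) ** 2    # squared Euclidean
    # Each query should be closest to its own speaker's prototype.
    labels = torch.arange(n_spk).repeat_interleave(query.shape[1])
    return F.cross_entropy(-dists, labels)

# Toy episode: 5 speakers, 3 support + 2 query utterances, 256-dim embeddings.
loss = prototypical_loss(torch.randn(5, 5, 256), n_support=3)
```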
Supervised attention for speaker recognition
The recently proposed self-attentive pooling (SAP) has shown good performance
in several speaker recognition systems. In SAP systems, the context vector is
trained end-to-end together with the feature extractor, where the role of
context vector is to select the most discriminative frames for speaker
recognition. However, the SAP underperforms compared to the temporal average
pooling (TAP) baseline in some settings, which implies that the attention is
not learnt effectively in end-to-end training. To tackle this problem, we
introduce strategies for training the attention mechanism in a supervised
manner, using classified samples to learn the context vector. With our
proposed methods, the context vector can be guided to select the most informative
frames. We show that our method outperforms existing approaches in various
experimental settings including short utterance speaker recognition, and
achieves competitive performance over the existing baselines on the VoxCeleb
datasets.
Comment: SLT 2021.
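For reference, the sketch below shows the SAP mechanism the abstract builds on: a learnable context vector scores frame-level features, and the attention-weighted mean of the frames forms the utterance embedding. The projection layer, tanh nonlinearity, and dimensions are assumptions; the paper's supervised training strategies for the context vector are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """SAP: frame-level features are scored against a learnable context
    vector, and the utterance embedding is their attention-weighted mean."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)            # frame transform
        self.context = nn.Parameter(torch.randn(feat_dim))   # context vector

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) from the feature extractor.
        h = torch.tanh(self.proj(frames))            # (B, T, D)
        scores = h @ self.context                    # (B, T) frame relevance
        weights = F.softmax(scores, dim=1)           # attention over frames
        return (weights.unsqueeze(-1) * frames).sum(dim=1)   # (B, D)

pool = SelfAttentivePooling(feat_dim=512)
utt_emb = pool(torch.randn(8, 100, 512))  # 8 utterances, 100 frames each
```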
Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning
Domain generalization remains a critical problem for speaker recognition,
even with the state-of-the-art architectures based on deep neural nets. For
example, a model trained on read speech may largely fail when applied to
singing or movie scenarios. In this paper, we propose a domain-invariant
projection to improve the generalizability of speaker vectors. This projection
is a simple neural net and is trained following the Model-Agnostic
Meta-Learning (MAML) principle, for which the objective is that the projection,
once updated with speech data from one domain, can classify speakers in another domain. We
tested the proposed method on CNCeleb, a new dataset consisting of
single-speaker multi-condition (SSMC) data. The results demonstrated that the
MAML-based domain-invariant projection can produce more generalizable speaker
vectors, and effectively improve the performance in unseen domains.
Comment: Submitted to INTERSPEECH 2020.
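A minimal sketch of the MAML principle as described, assuming a first-order approximation for brevity: the projection takes a gradient step on speech data from one domain, and the meta-loss requires the adapted projection to classify speakers in another domain. The projection architecture, shared classifier head, and learning rate are hypothetical.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def fomaml_step(proj: nn.Module, loss_fn, domain_a, domain_b, inner_lr=0.01):
    """One first-order MAML step for a speaker-vector projection.

    domain_a / domain_b: (speaker_vectors, speaker_labels) tuples drawn
    from two different domains (e.g. read speech vs. singing).
    """
    adapted = copy.deepcopy(proj)
    # Inner loop: one gradient step on domain A.
    inner = loss_fn(adapted(domain_a[0]), domain_a[1])
    grads = torch.autograd.grad(inner, adapted.parameters())
    with torch.no_grad():
        for p, g in zip(adapted.parameters(), grads):
            p -= inner_lr * g
    # Outer loss: the adapted projection must classify speakers in domain B.
    outer = loss_fn(adapted(domain_b[0]), domain_b[1])
    outer.backward()
    # First-order approximation: reuse the adapted gradients for the
    # original weights, which an optimizer on `proj` would then apply.
    for p, q in zip(proj.parameters(), adapted.parameters()):
        p.grad = q.grad.clone() if p.grad is None else p.grad + q.grad
    return outer.item()

proj = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
head = nn.Linear(512, 100)  # hypothetical shared head over training speakers
loss_fn = lambda z, y: F.cross_entropy(head(z), y)
dom_a = (torch.randn(32, 512), torch.randint(0, 100, (32,)))
dom_b = (torch.randn(32, 512), torch.randint(0, 100, (32,)))
fomaml_step(proj, loss_fn, dom_a, dom_b)
```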
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
Learning good representations without supervision is still an open issue in
machine learning, and is particularly challenging for speech signals, which are
often characterized by long sequences with a complex hierarchical structure.
Some recent works, however, have shown that it is possible to derive useful
speech representations by employing a self-supervised encoder-discriminator
approach. This paper proposes an improved self-supervised method, where a
single neural encoder is followed by multiple workers that jointly solve
different self-supervised tasks. The needed consensus across different tasks
naturally imposes meaningful constraints on the encoder, helping it to
discover general representations and to minimize the risk of learning
superficial ones. Experiments show that the proposed approach can learn
transferable, robust, and problem-agnostic features that carry relevant
information from the speech signal, such as speaker identity, phonemes, and
even higher-level features such as emotional cues. In addition, a number of
design choices make the encoder easily exportable, facilitating its direct
usage or adaptation to different problems.
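The encoder-plus-workers pattern the abstract describes could look roughly like the following, where a single shared encoder feeds several small heads that each solve a different self-supervised task and their losses are summed. The convolutional encoder and the worker target dimensions (e.g. waveform-, MFCC-, and prosody-like targets) are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class EncoderWithWorkers(nn.Module):
    """One shared encoder feeding several self-supervised 'workers'.
    Each worker regresses a different target; the summed per-worker
    losses jointly constrain the shared representation."""
    def __init__(self, feat_dim=512, target_dims=(1, 20, 4)):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for a real conv encoder
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.workers = nn.ModuleList(
            [nn.Conv1d(feat_dim, d, kernel_size=1) for d in target_dims]
        )

    def forward(self, wav: torch.Tensor):
        # wav: (batch, 1, samples) raw waveform.
        z = self.encoder(wav)                    # (B, feat_dim, frames)
        return z, [w(z) for w in self.workers]   # shared code + predictions

model = EncoderWithWorkers()
z, preds = model(torch.randn(4, 1, 16000))
# Training would sum one regression/classification loss per worker
# against that worker's own target.
```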
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
In practical settings, a speaker recognition system needs to identify a
speaker given a short utterance, while the enrollment utterance may be
relatively long. However, existing speaker recognition models perform poorly
with such short utterances. To solve this problem, we introduce a meta-learning
framework for imbalanced length pairs. Specifically, we use a Prototypical
Network and train it with a support set of long utterances and a query set of
short utterances of varying lengths. Further, since optimizing only for the
classes in the given episode may be insufficient for learning discriminative
embeddings for unseen classes, we additionally require the model to classify
both the support and the query set against the entire set of classes in the
training set. By combining these two learning schemes, our model outperforms
existing state-of-the-art speaker verification models learned with a standard
supervised learning framework on short utterances (1-2 seconds) on the VoxCeleb
datasets. We also validate our proposed model for unseen speaker
identification, on which it also achieves significant performance gains over
the existing approaches. The code is available at
https://github.com/seongmin-kye/meta-SR.
Comment: Accepted to Interspeech 2020.
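The imbalanced-length episode construction might be sketched as follows: support utterances are kept long, while query utterances are randomly cropped to 1-2 seconds. Segment lengths, sampling rate, and the assumption of at least two sufficiently long utterances per speaker are illustrative; an episode built this way would then feed a prototypical-style loss plus the global classification over all training speakers that the abstract mentions.

```python
import random
import torch

def make_imbalanced_episode(waveforms, sr=16000, support_sec=6.0,
                            query_range=(1.0, 2.0)):
    """Build one episode with long support and short query utterances.

    waveforms: dict speaker_id -> list of 1-D waveform tensors, each with
               at least two utterances of >= support_sec (an assumption).
    Returns (support, query) lists of (speaker_id, segment) pairs.
    """
    support, query = [], []
    for spk, utts in waveforms.items():
        random.shuffle(utts)
        s, q = utts[0], utts[1]
        support.append((spk, s[: int(support_sec * sr)]))   # long support
        q_len = int(random.uniform(*query_range) * sr)      # 1-2 s crop
        start = random.randint(0, max(0, q.numel() - q_len))
        query.append((spk, q[start:start + q_len]))
    return support, query

waves = {i: [torch.randn(16000 * 8) for _ in range(2)] for i in range(5)}
support, query = make_imbalanced_episode(waves)
```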
Cross attentive pooling for speaker verification
The goal of this paper is text-independent speaker verification where
utterances come from 'in the wild' videos and may contain irrelevant signal.
While speaker verification is naturally a pair-wise problem, existing methods
to produce the speaker embeddings are instance-wise. In this paper, we propose
Cross Attentive Pooling (CAP) that utilizes the context information across the
reference-query pair to generate utterance-level embeddings that contain the
most discriminative information for the pair-wise matching problem. Experiments
are performed on the VoxCeleb dataset in which our method outperforms
comparable pooling strategies.
Comment: SLT 2021. Code available at https://github.com/seongmin-kye/CA
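A toy version of the pair-wise idea, assuming simple dot-product affinities rather than the paper's exact scoring function: each utterance's frame weights are derived from its affinity to the other utterance in the reference-query pair, so both embeddings are conditioned on their comparison partner.

```python
import torch
import torch.nn.functional as F

def cross_attentive_pool(ref: torch.Tensor, qry: torch.Tensor):
    """Pool each utterance's frames using the *other* utterance as context.

    ref, qry: (time, dim) frame-level features of a reference-query pair.
    Returns a pair of (dim,) utterance embeddings conditioned on each other.
    """
    # Dot-product affinities between every ref frame and every qry frame.
    affinity = ref @ qry.T                                 # (T_ref, T_qry)
    # Each utterance's frame weights come from its relevance to the other.
    w_ref = F.softmax(affinity.max(dim=1).values, dim=0)   # (T_ref,)
    w_qry = F.softmax(affinity.max(dim=0).values, dim=0)   # (T_qry,)
    return w_ref @ ref, w_qry @ qry                        # (dim,), (dim,)

e_ref, e_qry = cross_attentive_pool(torch.randn(90, 256), torch.randn(120, 256))
score = F.cosine_similarity(e_ref, e_qry, dim=0)  # pair-wise similarity
```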
Siamese Capsule Network for End-to-End Speaker Recognition In The Wild
We propose an end-to-end deep model for speaker verification in the wild. Our
model uses a thin ResNet for extracting speaker embeddings from utterances and a
Siamese capsule network with dynamic routing as the back-end to calculate a
similarity score between the embeddings. We conduct a series of experiments
comparing our model to state-of-the-art solutions, showing that our model
outperforms all the other models while using a substantially smaller amount of training
data. We also perform additional experiments to study the impact of different
speaker embeddings on the Siamese capsule network. We show that the best
performance is achieved by using embeddings obtained directly from the feature
aggregation module of the front-end and passing them to higher capsules using
dynamic routing.
Comment: Submitted to ICASSP 2021.
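The abstract does not specify the capsule configuration, but a generic routing-by-agreement step between lower and higher capsules, which a Siamese back-end could apply to each embedding before comparison, might look like this sketch; all shapes and the number of routing iterations are assumptions.

```python
import torch
import torch.nn.functional as F

def squash(v, dim=-1, eps=1e-8):
    """Capsule nonlinearity: keeps direction, maps norm into [0, 1)."""
    n2 = (v ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1 + n2)) * v / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat: torch.Tensor, iters: int = 3) -> torch.Tensor:
    """Routing-by-agreement between lower and higher capsules.

    u_hat: (n_lower, n_higher, dim) prediction vectors from lower capsules.
    Returns higher-capsule outputs of shape (n_higher, dim).
    """
    b = torch.zeros(u_hat.shape[:2])              # routing logits
    for _ in range(iters):
        c = F.softmax(b, dim=1)                   # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)  # weighted sum per capsule
        v = squash(s)                             # (n_higher, dim)
        b = b + (u_hat * v).sum(dim=-1)           # agreement update
    return v

# Two embeddings routed this way could then be compared by a Siamese
# score, e.g. cosine similarity of the resulting capsule outputs.
caps = dynamic_routing(torch.randn(16, 8, 32))
```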
A Deep Neural Network for Short-Segment Speaker Recognition
Today's interactive devices such as smartphone assistants and smart speakers
often deal with short-duration speech segments. As a result, speaker
recognition systems integrated into such devices will be much better served
by models capable of performing the recognition task with short-duration
utterances. In this paper, a new deep neural network, UtterIdNet, capable of
performing speaker recognition with short speech segments is proposed. Our
proposed model utilizes a novel architecture that makes it suitable for
short-segment speaker recognition through increased, efficient use of the
information in short speech segments. UtterIdNet has been trained and tested on
the VoxCeleb datasets, the latest benchmarks in speaker recognition.
Evaluations for different segment durations show consistent and stable
performance for short segments, with significant improvement over the previous
models for segments of 2 seconds, 1 second, and especially sub-second durations
(250 ms and 500 ms).
Comment: Accepted at Interspeech 2019.
Meta-learning for robust child-adult classification from speech
Computational modeling of naturalistic conversations in clinical applications
has seen growing interest in the past decade. An important use-case involves
child-adult interactions within the autism diagnosis and intervention domain.
In this paper, we address a specific sub-problem of speaker diarization, namely
child-adult speaker classification in such dyadic conversations with specified
roles. Training a speaker classification system robust to speaker and channel
conditions is challenging due to inherent variability in the speech of
children and their adult interlocutors. In this work, we propose the use of
meta-learning, in particular prototypical networks, which optimize a metric
space across multiple tasks. By modeling every child-adult pair in the training
set as a separate task during meta-training, we learn a representation with
improved generalizability compared to conventional supervised learning. We
demonstrate improvements over state-of-the-art speaker embeddings (x-vectors)
under two evaluation settings: weakly supervised classification (up to 14.53%
relative improvement in F1-scores) and clustering (up to 9.66% relative
improvement in cluster purity). Our results show that protonets can potentially
extract robust speaker embeddings for child-adult classification from speech.
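The per-pair task construction could be sketched as below, treating each child-adult session as a 2-way episode; the resulting support/query tensors can be concatenated and fed to a prototypical loss such as the one sketched earlier. Session structure, segment counts, and embedding dimensions are hypothetical.

```python
import random
import torch

def sample_child_adult_episode(sessions, n_support=5, n_query=5):
    """Treat one child-adult session as a 2-way task for meta-training.

    sessions: list of dicts {"child": [emb, ...], "adult": [emb, ...]}
    Returns support/query tensors of shape (2, n_support|n_query, dim).
    """
    sess = random.choice(sessions)
    support, query = [], []
    for role in ("child", "adult"):              # the two classes per task
        segs = random.sample(sess[role], n_support + n_query)
        support.append(torch.stack(segs[:n_support]))
        query.append(torch.stack(segs[n_support:]))
    return torch.stack(support), torch.stack(query)

sessions = [{"child": [torch.randn(128) for _ in range(12)],
             "adult": [torch.randn(128) for _ in range(12)]}]
s, q = sample_child_adult_episode(sessions)
# torch.cat([s, q], dim=1) then feeds a 2-way prototypical loss.
```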
Speaker diarization with session-level speaker embedding refinement using graph neural networks
Deep speaker embedding models have been commonly used as a building block for
speaker diarization systems; however, the speaker embedding model is usually
trained according to a global loss defined on the training data, which could be
sub-optimal for distinguishing speakers locally in a specific meeting session.
In this work we present the first use of graph neural networks (GNNs) for the
speaker diarization problem, utilizing a GNN to refine speaker embeddings
locally using the structural information between speech segments inside each
session. The speaker embeddings extracted by a pre-trained model are remapped
into a new embedding space, in which the different speakers within a single
session are better separated. The model is trained for linkage prediction in a
supervised manner by minimizing the difference between the affinity matrix
constructed by the refined embeddings and the ground-truth adjacency matrix.
Spectral clustering is then applied on top of the refined embeddings. We show
that the clustering performance of the refined speaker embeddings outperforms
the original embeddings significantly on both simulated and real meeting data,
and our system achieves the state-of-the-art result on the NIST SRE 2000
CALLHOME database.
Comment: ICASSP 2020 (45th International Conference on Acoustics, Speech, and Signal Processing).
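A rough sketch of the linkage-prediction training described, assuming a simple normalized-adjacency message-passing layer rather than the paper's exact GNN: segment embeddings are remapped within a session, and the loss matches the refined affinity matrix to the ground-truth same-speaker adjacency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingRefiner(nn.Module):
    """Refine per-segment speaker embeddings with two simple
    message-passing layers over the session's segment graph."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gnn1 = nn.Linear(dim, dim)
        self.gnn2 = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n_segments, dim); adj: (n, n) graph built e.g. from the
        # initial embedding similarities within the session.
        a = adj + torch.eye(adj.shape[0])        # add self-loops
        a = a / a.sum(dim=1, keepdim=True)       # row-normalise
        h = F.relu(self.gnn1(a @ x))
        h = self.gnn2(a @ h)
        return F.normalize(h, dim=1)             # refined embeddings

def linkage_loss(refined, gt_adj):
    """Match the refined affinity matrix to ground-truth speaker links."""
    return F.mse_loss(refined @ refined.T, gt_adj)

# Toy session: 20 segments from 2 alternating speakers.
x = torch.randn(20, 128)
spk = torch.arange(20) % 2
gt_adj = (spk.unsqueeze(0) == spk.unsqueeze(1)).float()
xn = F.normalize(x, dim=1)
init_adj = torch.relu(xn @ xn.T)                 # initial segment graph
model = EmbeddingRefiner()
loss = linkage_loss(model(x, init_adj), gt_adj)
# Spectral clustering is then applied on top of the refined affinities.
```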