205 research outputs found
Deep learning methods in speaker recognition: a review
This paper summarizes the deep learning practices applied in the field of
speaker recognition, covering both verification and identification. Speaker
recognition has long been a widely studied topic in speech technology. Many
research works have been carried out, yet relatively little progress was
achieved in the past 5-6 years. However, as deep learning techniques advance
across most machine learning fields, the former state-of-the-art methods are
being replaced by them in speaker recognition too. Deep learning (DL) now
appears to be the state-of-the-art solution for both speaker verification and
identification. Standard x-vectors, in addition to i-vectors, are used as the
baseline in most of the novel works. The increasing amount of gathered data
opens up the territory to DL, where it is the most effective.
Contrastive Speaker Embedding With Sequential Disentanglement
Contrastive speaker embedding assumes that the contrast between the positive
and negative pairs of speech segments is attributed to speaker identity only.
However, this assumption is incorrect because speech signals contain not only
speaker identity but also linguistic content. In this paper, we propose a
contrastive learning framework with sequential disentanglement to remove
linguistic content by incorporating a disentangled sequential variational
autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE aims to
disentangle speaker factors from content factors in an embedding space so that
only the speaker factors are used for constructing a contrastive loss
objective. Because the content factors have been removed from contrastive
learning, the resulting speaker embeddings are content-invariant.
Experimental results on VoxCeleb1-test show that the proposed method
consistently outperforms SimCLR. This suggests that applying sequential
disentanglement is beneficial to learning speaker-discriminative embeddings.
Comment: Submitted to ICASSP 202
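To make the idea concrete, here is a minimal sketch (not the authors' code) of a SimCLR-style NT-Xent contrastive loss computed only on the disentangled speaker factors of each embedding, discarding the content factors. The function names and the fixed split point `speaker_dim` are illustrative assumptions.

```python
# Sketch: NT-Xent contrastive loss over speaker factors only.
# Assumes each embedding stores speaker factors in its first `speaker_dim`
# coordinates and content factors in the rest (a DSVAE-style split).
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ntxent_on_speaker_factors(embeddings, speaker_dim, temperature=0.1):
    """embeddings: list of 2N vectors; embeddings[2i] and embeddings[2i+1]
    form a positive pair (two segments of the same speaker). Only the
    speaker factors enter the loss; content factors are discarded."""
    z = [e[:speaker_dim] for e in embeddings]  # keep speaker factors only
    n = len(z)
    loss = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive sample
        pos = cosine(z[i], z[j]) / temperature
        # denominator runs over all other samples, including the positive
        denom = sum(math.exp(cosine(z[i], z[k]) / temperature)
                    for k in range(n) if k != i)
        loss += -math.log(math.exp(pos) / denom)
    return loss / n
```

When the speaker factors of each positive pair agree, the loss is near zero; mismatched pairs drive it up, which is the signal that trains content-invariant speaker embeddings.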
Deep learning-based speaker embedding extraction for speaker recognition robust to non-speaker factors
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Kim, Nam Soo.
Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), since the deep learning-based embedding systems are trained in a fully supervised manner, they cannot exploit unlabeled datasets during training.
In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts a total variability embedding together with a representation of the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the utilization of unlabeled datasets. Furthermore, in order to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both the VAE- and ALI-based embedding techniques have shown great performance in short-duration speaker verification, outperforming the conventional i-vector framework.
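The Kullback-Leibler regularizer the thesis identifies as a source of information loss has a simple closed form for the usual diagonal-Gaussian VAE posterior. A minimal sketch (illustrative, not the thesis code):

```python
# Sketch: the two core VAE ingredients referenced above.
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.
    This is the regularization term that can over-compress the embedding."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def reparameterize(mu, logvar, rng=random):
    """z = mu + sigma * eps with eps ~ N(0, I): the reparameterization trick
    that lets gradients flow through the sampling step."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]
```

The KL term vanishes exactly when the posterior equals the standard-normal prior, so a strongly weighted KL pushes the latent toward carrying no information; this is the tension the ALI-based alternative is designed to avoid.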
Additionally, we present a fully supervised training method for disentangling the non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and trains them to retain maximum information on their main task while ensuring maximum uncertainty on their sub-task. Since the proposed method does not require any heuristic training strategy, as the conventional disentanglement techniques do (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset, and demonstrated its capability in efficiently disentangling the recording device and emotional information from the speaker embedding.
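The joint training objective described above can be sketched as a main-task cross-entropy plus a term pushing the sub-task posterior toward the uniform distribution (maximum uncertainty), with no gradient-reversal trick. This is an illustrative reconstruction; the function names and the weight `alpha` are assumptions, not the thesis's exact formulation.

```python
# Sketch: main-task confidence + sub-task uncertainty, per embedding branch.
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def joint_factor_loss(main_logits, main_label, sub_logits, alpha=1.0):
    """Cross-entropy on the branch's main task, plus KL(p_sub || uniform),
    which is zero exactly when the sub-task prediction is maximally
    uncertain -- i.e., the embedding carries no sub-task information."""
    p_main = softmax(main_logits)
    ce = -math.log(p_main[main_label])
    p_sub = softmax(sub_logits)
    k = len(p_sub)
    kl_uniform = sum(p * math.log(p * k) for p in p_sub)
    return ce + alpha * kl_uniform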
๊ฐ ๋ค์ํ ๋ฅ๋ฌ๋ ๊ธฐ๋ฐ ์ฑ๋ฌธ ์ถ์ถ ๊ธฐ๋ฒ๋ค์ด ์ ์๋์ด ์์ผ๋ฉฐ, ํ์ ์ธ์์์ ๋์ ์ฑ๋ฅ์ ๋ณด์๋ค. ํ์ง๋ง ๊ณ ์ ์ ์ธ ์ฑ๋ฌธ ์ถ์ถ ๊ธฐ๋ฒ์์์ ๋ง์ฐฌ๊ฐ์ง๋ก, ๋ฅ๋ฌ๋ ๊ธฐ๋ฐ ์ฑ๋ฌธ ์ถ์ถ ๊ธฐ๋ฒ๋ค์ ์๋ก ๋ค๋ฅธ ํ๊ฒฝ (e.g., ๋
น์ ๊ธฐ๊ธฐ, ๊ฐ์ )์์ ๋
น์๋ ์์ฑ๋ค์ ๋ถ์ํ๋ ๊ณผ์ ์์ ์ฑ๋ฅ ์ ํ๋ฅผ ๊ฒช๋๋ค. ๋ํ ๊ธฐ์กด์ ๊ฐ์ฐ์์ ํผํฉ ๋ชจ๋ธ (Gaussian mixture model, GMM) ๊ธฐ๋ฐ์ ๊ธฐ๋ฒ๋ค (e.g., GMM ์ํผ๋ฒกํฐ, i-๋ฒกํฐ)์ ๋ฌ๋ฆฌ ๋ฅ๋ฌ๋ ๊ธฐ๋ฐ ์ฑ๋ฌธ ์ถ์ถ ๊ธฐ๋ฒ๋ค์ ๊ต์ฌ ํ์ต์ ํตํ์ฌ ์ต์ ํ๋๊ธฐ์ ๋ผ๋ฒจ์ด ์๋ ๋ฐ์ดํฐ๋ฅผ ํ์ฉํ ์ ์๋ค๋ ํ๊ณ๊ฐ ์๋ค.
๋ณธ ๋
ผ๋ฌธ์์๋ variational autoencoder (VAE) ๊ธฐ๋ฐ์ ์ฑ๋ฌธ ์ถ์ถ ๊ธฐ๋ฒ์ ์ ์ํ๋ฉฐ, ํด๋น ๊ธฐ๋ฒ์์๋ ์์ฑ ๋ถํฌ ํจํด์ ์์ฝํ๋ ๋ฒกํฐ์ ์์ฑ ๋ด์ ๋ถํ์ค์ฑ์ ํํํ๋ ๋ฒกํฐ๋ฅผ ์ถ์ถํ๋ค. ๊ธฐ์กด์ ๋ฅ๋ฌ๋ ๊ธฐ๋ฐ ์ฑ๋ฌธ ์ถ์ถ ๊ธฐ๋ฒ (e.g., d-๋ฒกํฐ, x-๋ฒกํฐ)์๋ ๋ฌ๋ฆฌ, ์ ์ํ๋ ๊ธฐ๋ฒ์ ๋น๊ต์ฌ ํ์ต์ ํตํ์ฌ ์ต์ ํ ๋๊ธฐ์ ๋ผ๋ฒจ์ด ์๋ ๋ฐ์ดํฐ๋ฅผ ํ์ฉํ ์ ์๋ค. ๋ ๋์๊ฐ VAE์ KL-divergence ์ ์ฝ ํจ์๋ก ์ธํ ์ ๋ณด ์์ค์ ๋ฐฉ์งํ๊ธฐ ์ํ์ฌ adversarially learned inference (ALI) ๊ธฐ๋ฐ์ ์ฑ๋ฌธ ์ถ์ถ ๊ธฐ๋ฒ์ ์ถ๊ฐ์ ์ผ๋ก ์ ์ํ๋ค. ์ ์ํ VAE ๋ฐ ALI ๊ธฐ๋ฐ์ ์ฑ๋ฌธ ์ถ์ถ ๊ธฐ๋ฒ์ ์งง์ ์์ฑ์์์ ํ์ ์ธ์ฆ ์คํ์์ ๋์ ์ฑ๋ฅ์ ๋ณด์์ผ๋ฉฐ, ๊ธฐ์กด์ i-๋ฒกํฐ ๊ธฐ๋ฐ์ ๊ธฐ๋ฒ๋ณด๋ค ์ข์ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์๋ค.
๋ํ ๋ณธ ๋
ผ๋ฌธ์์๋ ์ฑ๋ฌธ ๋ฒกํฐ๋ก๋ถํฐ ๋น ํ์ ์์ (e.g., ๋
น์ ๊ธฐ๊ธฐ, ๊ฐ์ )์ ๋ํ ์ ๋ณด๋ฅผ ์ ๊ฑฐํ๋ ํ์ต๋ฒ์ ์ ์ํ๋ค. ์ ์ํ๋ ๊ธฐ๋ฒ์ ํ์ ๋ฒกํฐ์ ๋นํ์ ๋ฒกํฐ๋ฅผ ๋์์ ์ถ์ถํ๋ฉฐ, ๊ฐ ๋ฒกํฐ๋ ์์ ์ ์ฃผ ๋ชฉ์ ์ ๋ํ ์ ๋ณด๋ฅผ ์ต๋ํ ๋ง์ด ์ ์งํ๋, ๋ถ ๋ชฉ์ ์ ๋ํ ์ ๋ณด๋ฅผ ์ต์ํํ๋๋ก ํ์ต๋๋ค. ๊ธฐ์กด์ ๋น ํ์ ์์ ์ ๋ณด ์ ๊ฑฐ ๊ธฐ๋ฒ๋ค (e.g., adversarial learning, gradient reversal)์ ๋นํ์ฌ ์ ์ํ๋ ๊ธฐ๋ฒ์ ํด๋ฆฌ์คํฑํ ํ์ต ์ ๋ต์ ์ํ์ง ์๊ธฐ์, ๋ณด๋ค ์์ ์ ์ธ ํ์ต์ด ๊ฐ๋ฅํ๋ค. ์ ์ํ๋ ๊ธฐ๋ฒ์ RSR2015 Part3 ๋ฐ์ดํฐ์
์์ ๊ธฐ์กด ๊ธฐ๋ฒ๋ค์ ๋นํ์ฌ ๋์ ์ฑ๋ฅ์ ๋ณด์์ผ๋ฉฐ, ์ฑ๋ฌธ ๋ฒกํฐ ๋ด์ ๋
น์ ๊ธฐ๊ธฐ ๋ฐ ๊ฐ์ ์ ๋ณด๋ฅผ ์ต์ ํ๋๋ฐ ํจ๊ณผ์ ์ด์๋ค.1. Introduction 1
2. Conventional embedding techniques for speaker recognition 7
2.1. i-vector framework 7
2.2. Deep learning-based speaker embedding 10
2.2.1. Deep embedding network 10
2.2.2. Conventional disentanglement methods 13
3. Unsupervised learning of total variability embedding for speaker verification with random digit strings 17
3.1. Introduction 17
3.2. Variational autoencoder 20
3.3. Variational inference model for non-linear total variability embedding 22
3.3.1. Maximum likelihood training 23
3.3.2. Non-linear feature extraction and speaker verification 25
3.4. Experiments 26
3.4.1. Databases 26
3.4.2. Experimental setup 27
3.4.3. Effect of the duration on the latent variable 28
3.4.4. Experiments with VAEs 30
3.4.5. Feature-level fusion of i-vector and latent variable 33
3.4.6. Score-level fusion of i-vector and latent variable 36
3.5. Summary 39
4. Adversarially learned total variability embedding for speaker recognition with random digit strings 41
4.1. Introduction 41
4.2. Adversarially learned inference 43
4.3. Adversarially learned feature extraction 45
4.3.1. Maximum likelihood criterion 47
4.3.2. Adversarially learned inference for non-linear i-vector extraction 49
4.3.3. Relationship to the VAE-based feature extractor 50
4.4. Experiments 51
4.4.1. Databases 51
4.4.2. Experimental setup 53
4.4.3. Effect of the duration on the latent variable 54
4.4.4. Speaker verification and identification with different utterance-level features 56
4.5. Summary 62
5. Disentangled speaker and nuisance attribute embedding for robust speaker verification 63
5.1. Introduction 63
5.2. Joint factor embedding 67
5.2.1. Joint factor embedding network architecture 67
5.2.2. Training for joint factor embedding 69
5.3. Experiments 71
5.3.1. Channel disentanglement experiments 71
5.3.2. Emotion disentanglement 82
5.3.3. Noise disentanglement 86
5.4. Summary 87
6. Conclusion 93
Bibliography 95
Abstract (Korean) 105Docto
DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification
Data augmentation is vital to the generalization ability and robustness of
deep neural networks (DNNs) models. Existing augmentation methods for speaker
verification manipulate the raw signal, which are time-consuming and the
augmented samples lack diversity. In this paper, we present a novel
difficulty-aware semantic augmentation (DASA) approach for speaker
verification, which can generate diversified training samples in speaker
embedding space with negligible extra computing cost. Firstly, we augment
training samples by perturbing speaker embeddings along semantic directions,
which are obtained from speaker-wise covariance matrices. Secondly, accurate
covariance matrices are estimated from robust speaker embeddings during
training, so we introduce difficultyaware additive margin softmax
(DAAM-Softmax) to obtain optimal speaker embeddings. Finally, we assume the
number of augmented samples goes to infinity and derive a closed-form upper
bound of the expected loss with DASA, which achieves compatibility and
efficiency. Extensive experiments demonstrate the proposed approach can achieve
a remarkable performance improvement. The best result achieves a 14.6% relative
reduction in EER metric on CN-Celeb evaluation set.Comment: Accepted by ICASSP 202
Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion
Traditional studies on voice conversion (VC) have made progress with parallel
training data and known speakers. Good voice conversion quality is obtained by
exploring better alignment modules or expressive mapping functions. In this
study, we investigate zero-shot VC from a novel perspective of self-supervised
disentangled speech representation learning. Specifically, we achieve the
disentanglement by balancing the information flow between global speaker
representation and time-varying content representation in a sequential
variational autoencoder (VAE). A zero-shot voice conversion is performed by
feeding an arbitrary speaker embedding and content embeddings to the VAE
decoder. Besides that, an on-the-fly data augmentation training strategy is
applied to make the learned representation noise invariant. On TIMIT and VCTK
datasets, we achieve state-of-the-art performance on both objective evaluation,
i.e., speaker verification (SV) on speaker embedding and content embedding, and
subjective evaluation, i.e., voice naturalness and similarity, and remains to
be robust even with noisy source/target utterances.Comment: Accepted to 2022 ICASS
Zero-shot Singing Technique Conversion
In this paper we propose modifications to the neural network framework, AutoVC for the task of singing technique conversion. This includes utilising a pretrained singing technique encoder which extracts technique information, upon which a decoder is conditioned during training. By swapping out a source singerโs technique information for that of the targetโs during conversion, the input spectrogram is reconstructed with the targetโs technique. We document the beneficial effects of omitting the latent loss, the importance of sequential training, and our process for fine-tuning the bottleneck. We also conducted a listening study where participants rate the specificity of technique-converted voices as well as their naturalness. From this we are able to conclude how effective the technique conversions are and how different conditions affect them, while assessing the modelโs ability to reconstruct its input data
- โฆ