205 research outputs found

    Deep learning methods in speaker recognition: a review

    This paper summarizes applied deep learning practices in speaker recognition, covering both verification and identification. Speaker recognition has been a widely studied topic in speech technology. Much research has been carried out, yet little progress has been achieved in the past 5-6 years. However, as deep learning (DL) techniques advance in most machine learning fields, they are replacing the former state-of-the-art methods in speaker recognition too. DL appears to have become the current state-of-the-art solution for both speaker verification and identification. Standard x-vectors, in addition to i-vectors, are used as baselines in most of the novel works. The increasing amount of gathered data opens up the territory to DL methods, where they are the most effective.

    Contrastive Speaker Embedding With Sequential Disentanglement

    Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement to remove linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE aims to disentangle speaker factors from content factors in an embedding space so that only the speaker factors are used for constructing a contrastive loss objective. Because content factors have been removed from the contrastive learning, the resulting speaker embeddings will be content-invariant. Experimental results on VoxCeleb1-test show that the proposed method consistently outperforms SimCLR. This suggests that applying sequential disentanglement is beneficial to learning speaker-discriminative embeddings. Comment: Submitted to ICASSP 202
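As a rough illustration of the SimCLR-style objective this abstract builds on, the sketch below computes the standard NT-Xent contrastive loss over two views per utterance. It is not the authors' implementation: the function names, the temperature value, and the toy vectors are assumptions, and the DSVAE that would produce the speaker factors is not modeled. The inputs simply stand in for whatever speaker-factor embeddings the encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss as used in SimCLR (hypothetical helper, not from the paper).

    view_a[i] and view_b[i] are embeddings of two segments cut from the same
    utterance i (a positive pair); every other embedding in the batch acts
    as a negative.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of z[i]
        logits = [
            cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i
        ]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Cross-entropy of picking the positive among all other embeddings.
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)


# Toy batch: two utterances, two segments each (values are made up).
segments_a = [[1.0, 0.0], [0.0, 1.0]]
segments_b = [[0.9, 0.1], [0.1, 0.9]]
loss = nt_xent_loss(segments_a, segments_b)
```

When the paired segments really do share a speaker, as in the toy batch, the loss is lower than when the pairs are mismatched, which is the behavior the contrastive objective rewards.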

    Deep Learning-Based Speaker Embedding Extraction for Speaker Recognition Robust to Non-Speaker Factors

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Kim Nam Soo.
    Over recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), the deep learning-based embedding systems are trained in a fully supervised manner, so they cannot use unlabeled datasets during training. In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding and a representation for the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the utilization of unlabeled datasets. Furthermore, in order to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both VAE- and ALI-based embedding techniques have shown great performance in short-duration speaker verification, outperforming the conventional i-vector framework. Additionally, we present a fully supervised training method for disentangling the non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and trains them to have maximum information on their main task while ensuring maximum uncertainty on their sub-task. Since the proposed method does not require any heuristic training strategy as in the conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset, and demonstrated its capability in efficiently disentangling the recording device and emotional information from the speaker embedding.
    Contents:
    1. Introduction
    2. Conventional embedding techniques for speaker recognition
    2.1. i-vector framework
    2.2. Deep learning-based speaker embedding
    2.2.1. Deep embedding network
    2.2.2. Conventional disentanglement methods
    3. Unsupervised learning of total variability embedding for speaker verification with random digit strings
    3.1. Introduction
    3.2. Variational autoencoder
    3.3. Variational inference model for non-linear total variability embedding
    3.3.1. Maximum likelihood training
    3.3.2. Non-linear feature extraction and speaker verification
    3.4. Experiments
    3.4.1. Databases
    3.4.2. Experimental setup
    3.4.3. Effect of the duration on the latent variable
    3.4.4. Experiments with VAEs
    3.4.5. Feature-level fusion of i-vector and latent variable
    3.4.6. Score-level fusion of i-vector and latent variable
    3.5. Summary
    4. Adversarially learned total variability embedding for speaker recognition with random digit strings
    4.1. Introduction
    4.2. Adversarially learned inference
    4.3. Adversarially learned feature extraction
    4.3.1. Maximum likelihood criterion
    4.3.2. Adversarially learned inference for non-linear i-vector extraction
    4.3.3. Relationship to the VAE-based feature extractor
    4.4. Experiments
    4.4.1. Databases
    4.4.2. Experimental setup
    4.4.3. Effect of the duration on the latent variable
    4.4.4. Speaker verification and identification with different utterance-level features
    4.5. Summary
    5. Disentangled speaker and nuisance attribute embedding for robust speaker verification
    5.1. Introduction
    5.2. Joint factor embedding
    5.2.1. Joint factor embedding network architecture
    5.2.2. Training for joint factor embedding
    5.3. Experiments
    5.3.1. Channel disentanglement experiments
    5.3.2. Emotion disentanglement
    5.3.3. Noise disentanglement
    5.4. Summary
    6. Conclusion
    Bibliography
    Abstract (Korean)

    DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification

    Data augmentation is vital to the generalization ability and robustness of deep neural network (DNN) models. Existing augmentation methods for speaker verification manipulate the raw signal, which is time-consuming, and the augmented samples lack diversity. In this paper, we present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification, which can generate diversified training samples in speaker embedding space with negligible extra computing cost. Firstly, we augment training samples by perturbing speaker embeddings along semantic directions, which are obtained from speaker-wise covariance matrices. Secondly, accurate covariance matrices are estimated from robust speaker embeddings during training, so we introduce difficulty-aware additive margin softmax (DAAM-Softmax) to obtain optimal speaker embeddings. Finally, we assume the number of augmented samples goes to infinity and derive a closed-form upper bound of the expected loss with DASA, which achieves compatibility and efficiency. Extensive experiments demonstrate that the proposed approach can achieve a remarkable performance improvement. The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set. Comment: Accepted by ICASSP 202
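The first step in this abstract, perturbing an embedding along directions taken from a speaker-wise covariance matrix, can be sketched with explicit sampling as below. This is only an illustration under stated assumptions: the function name, the `strength` parameter, and the toy data are hypothetical, and the paper itself avoids explicit sampling by deriving a closed-form bound on the expected loss. The sketch draws a perturbation whose covariance equals `strength` times the speaker's (biased) sample covariance by mixing the centered embeddings with standard normal coefficients.

```python
import math
import random


def augment_embedding(embedding, speaker_embs, strength=0.5, rng=random):
    """Perturb `embedding` with noise drawn from N(0, strength * Sigma).

    Sigma is the biased sample covariance of `speaker_embs`, the embeddings
    of one speaker. A weighted sum of the centered embeddings with i.i.d.
    standard normal weights, scaled by sqrt(strength / n), has exactly that
    covariance, so no matrix factorization is needed.
    (Hypothetical helper for illustration, not the authors' code.)
    """
    n = len(speaker_embs)
    dim = len(embedding)
    mean = [sum(e[d] for e in speaker_embs) / n for d in range(dim)]
    centered = [[e[d] - mean[d] for d in range(dim)] for e in speaker_embs]
    coeffs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(strength / n)
    return [
        embedding[d] + scale * sum(c * row[d] for c, row in zip(coeffs, centered))
        for d in range(dim)
    ]


# Toy speaker whose embeddings vary only along the first axis (made-up values).
speaker_embs = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
augmented = augment_embedding([3.0, 2.0], speaker_embs, rng=random.Random(0))
```

Because the toy speaker shows no variation along the second axis, the augmented embedding moves only along the first one, which is the sense in which the perturbation follows that speaker's own directions of variation.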

    Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

    Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between global speaker representation and time-varying content representation in a sequential variational autoencoder (VAE). Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to the VAE decoder. In addition, an on-the-fly data augmentation training strategy is applied to make the learned representation noise-invariant. On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on speaker embedding and content embedding, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances. Comment: Accepted to 2022 ICASS

    Zero-shot Singing Technique Conversion

    In this paper we propose modifications to the AutoVC neural network framework for the task of singing technique conversion. This includes utilising a pretrained singing technique encoder which extracts technique information, upon which a decoder is conditioned during training. By swapping out a source singer's technique information for that of the target's during conversion, the input spectrogram is reconstructed with the target's technique. We document the beneficial effects of omitting the latent loss, the importance of sequential training, and our process for fine-tuning the bottleneck. We also conducted a listening study where participants rated the specificity of technique-converted voices as well as their naturalness. From this we are able to conclude how effective the technique conversions are and how different conditions affect them, while assessing the model's ability to reconstruct its input data.