1,201 research outputs found

    ๋น„ํ™”์ž ์š”์†Œ์— ๊ฐ•์ธํ•œ ํ™”์ž ์ธ์‹์„ ์œ„ํ•œ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์„ฑ๋ฌธ ์ถ”์ถœ

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021. 2. ๊น€๋‚จ์ˆ˜.Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), since the deep learning-based embedding systems are trained in a fully supervised manner, it is impossible for them to utilize unlabeled dataset when training. In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding and a representation for the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the utilization of unlabeled datasets. Furthermore, in order to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both VAE- and ALI-based embedding techniques have shown great performance in terms of short duration speaker verification, outperforming the conventional i-vector framework. Additionally, we present a fully supervised training method for disentangling the non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and train them to have maximum information on their main-task while ensuring maximum uncertainty on their sub-task. Since the proposed method does not require any heuristic training strategy as in the conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme have shown state-of-the-art performance in RSR2015 Part 3 dataset, and demonstrated its capability in efficiently disentangling the recording device and emotional information from the speaker embedding.์ตœ๊ทผ ๋ช‡๋…„๊ฐ„ ๋‹ค์–‘ํ•œ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•๋“ค์ด ์ œ์•ˆ๋˜์–ด ์™”์œผ๋ฉฐ, ํ™”์ž ์ธ์‹์—์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค. ํ•˜์ง€๋งŒ ๊ณ ์ „์ ์ธ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•์—์„œ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•๋“ค์€ ์„œ๋กœ ๋‹ค๋ฅธ ํ™˜๊ฒฝ (e.g., ๋…น์Œ ๊ธฐ๊ธฐ, ๊ฐ์ •)์—์„œ ๋…น์Œ๋œ ์Œ์„ฑ๋“ค์„ ๋ถ„์„ํ•˜๋Š” ๊ณผ์ •์—์„œ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๊ฒช๋Š”๋‹ค. ๋˜ํ•œ ๊ธฐ์กด์˜ ๊ฐ€์šฐ์‹œ์•ˆ ํ˜ผํ•ฉ ๋ชจ๋ธ (Gaussian mixture model, GMM) ๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•๋“ค (e.g., GMM ์Šˆํผ๋ฒกํ„ฐ, i-๋ฒกํ„ฐ)์™€ ๋‹ฌ๋ฆฌ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•๋“ค์€ ๊ต์‚ฌ ํ•™์Šต์„ ํ†ตํ•˜์—ฌ ์ตœ์ ํ™”๋˜๊ธฐ์— ๋ผ๋ฒจ์ด ์—†๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์—†๋‹ค๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” variational autoencoder (VAE) ๊ธฐ๋ฐ˜์˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜๋ฉฐ, ํ•ด๋‹น ๊ธฐ๋ฒ•์—์„œ๋Š” ์Œ์„ฑ ๋ถ„ํฌ ํŒจํ„ด์„ ์š”์•ฝํ•˜๋Š” ๋ฒกํ„ฐ์™€ ์Œ์„ฑ ๋‚ด์˜ ๋ถˆํ™•์‹ค์„ฑ์„ ํ‘œํ˜„ํ•˜๋Š” ๋ฒกํ„ฐ๋ฅผ ์ถ”์ถœํ•œ๋‹ค. ๊ธฐ์กด์˜ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ• (e.g., d-๋ฒกํ„ฐ, x-๋ฒกํ„ฐ)์™€๋Š” ๋‹ฌ๋ฆฌ, ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ ๋น„๊ต์‚ฌ ํ•™์Šต์„ ํ†ตํ•˜์—ฌ ์ตœ์ ํ™” ๋˜๊ธฐ์— ๋ผ๋ฒจ์ด ์—†๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ๋” ๋‚˜์•„๊ฐ€ VAE์˜ KL-divergence ์ œ์•ฝ ํ•จ์ˆ˜๋กœ ์ธํ•œ ์ •๋ณด ์†์‹ค์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ adversarially learned inference (ALI) ๊ธฐ๋ฐ˜์˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•์„ ์ถ”๊ฐ€์ ์œผ๋กœ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆํ•œ VAE ๋ฐ ALI ๊ธฐ๋ฐ˜์˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•์€ ์งง์€ ์Œ์„ฑ์—์„œ์˜ ํ™”์ž ์ธ์ฆ ์‹คํ—˜์—์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ๊ธฐ์กด์˜ i-๋ฒกํ„ฐ ๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•๋ณด๋‹ค ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค. ๋˜ํ•œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์„ฑ๋ฌธ ๋ฒกํ„ฐ๋กœ๋ถ€ํ„ฐ ๋น„ ํ™”์ž ์š”์†Œ (e.g., ๋…น์Œ ๊ธฐ๊ธฐ, ๊ฐ์ •)์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ฑฐํ•˜๋Š” ํ•™์Šต๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ ํ™”์ž ๋ฒกํ„ฐ์™€ ๋น„ํ™”์ž ๋ฒกํ„ฐ๋ฅผ ๋™์‹œ์— ์ถ”์ถœํ•˜๋ฉฐ, ๊ฐ ๋ฒกํ„ฐ๋Š” ์ž์‹ ์˜ ์ฃผ ๋ชฉ์ ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ตœ๋Œ€ํ•œ ๋งŽ์ด ์œ ์ง€ํ•˜๋˜, ๋ถ€ ๋ชฉ์ ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ตœ์†Œํ™”ํ•˜๋„๋ก ํ•™์Šต๋œ๋‹ค. ๊ธฐ์กด์˜ ๋น„ ํ™”์ž ์š”์†Œ ์ •๋ณด ์ œ๊ฑฐ ๊ธฐ๋ฒ•๋“ค (e.g., adversarial learning, gradient reversal)์— ๋น„ํ•˜์—ฌ ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ ํœด๋ฆฌ์Šคํ‹ฑํ•œ ํ•™์Šต ์ „๋žต์„ ์š”ํ•˜์ง€ ์•Š๊ธฐ์—, ๋ณด๋‹ค ์•ˆ์ •์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ RSR2015 Part3 ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ธฐ์กด ๊ธฐ๋ฒ•๋“ค์— ๋น„ํ•˜์—ฌ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ์„ฑ๋ฌธ ๋ฒกํ„ฐ ๋‚ด์˜ ๋…น์Œ ๊ธฐ๊ธฐ ๋ฐ ๊ฐ์ • ์ •๋ณด๋ฅผ ์–ต์ œํ•˜๋Š”๋ฐ ํšจ๊ณผ์ ์ด์—ˆ๋‹ค.1. Introduction 1 2. Conventional embedding techniques for speaker recognition 7 2.1. i-vector framework 7 2.2. Deep learning-based speaker embedding 10 2.2.1. Deep embedding network 10 2.2.2. Conventional disentanglement methods 13 3. Unsupervised learning of total variability embedding for speaker verification with random digit strings 17 3.1. Introduction 17 3.2. Variational autoencoder 20 3.3. Variational inference model for non-linear total variability embedding 22 3.3.1. Maximum likelihood training 23 3.3.2. Non-linear feature extraction and speaker verification 25 3.4. Experiments 26 3.4.1. Databases 26 3.4.2. Experimental setup 27 3.4.3. Effect of the duration on the latent variable 28 3.4.4. Experiments with VAEs 30 3.4.5. Feature-level fusion of i-vector and latent variable 33 3.4.6. Score-level fusion of i-vector and latent variable 36 3.5. Summary 39 4. Adversarially learned total variability embedding for speaker recognition with random digit strings 41 4.1. Introduction 41 4.2. Adversarially learned inference 43 4.3. Adversarially learned feature extraction 45 4.3.1. Maximum likelihood criterion 47 4.3.2. Adversarially learned inference for non-linear i-vector extraction 49 4.3.3. Relationship to the VAE-based feature extractor 50 4.4. Experiments 51 4.4.1. Databases 51 4.4.2. Experimental setup 53 4.4.3. Effect of the duration on the latent variable 54 4.4.4. Speaker verification and identification with different utterance-level features 56 4.5. Summary 62 5. Disentangled speaker and nuisance attribute embedding for robust speaker verification 63 5.1. Introduction 63 5.2. Joint factor embedding 67 5.2.1. Joint factor embedding network architecture 67 5.2.2. Training for joint factor embedding 69 5.3. Experiments 71 5.3.1. Channel disentanglement experiments 71 5.3.2. Emotion disentanglement 82 5.3.3. Noise disentanglement 86 5.4. Summary 87 6. Conclusion 93 Bibliography 95 Abstract (Korean) 105Docto

    Attentive Adversarial Learning for Domain-Invariant Training

    Full text link
    Adversarial domain-invariant training (ADIT) proves to be effective in suppressing the effects of domain variability in acoustic modeling and has led to improved performance in automatic speech recognition (ASR). In ADIT, an auxiliary domain classifier takes in equally-weighted deep features from a deep neural network (DNN) acoustic model and is trained to improve their domain-invariance by optimizing an adversarial loss function. In this work, we propose an attentive ADIT (AADIT) in which we advance the domain classifier with an attention mechanism to automatically weight the input deep features according to their importance in domain classification. With this attentive re-weighting, AADIT can focus on the domain normalization of phonetic components that are more susceptible to domain variability and generates deep features with improved domain-invariance and senone-discriminativity over ADIT. Most importantly, the attention block serves only as an external component to the DNN acoustic model and is not involved in ASR, so AADIT can be used to improve the acoustic modeling with any DNN architectures. More generally, the same methodology can improve any adversarial learning system with an auxiliary discriminator. Evaluated on CHiME-3 dataset, the AADIT achieves 13.6% and 9.3% relative WER improvements, respectively, over a multi-conditional model and a strong ADIT baseline.Comment: 5 pages, 1 figure, ICASSP 201

    Support vector regression in NIST SRE 2008 multichannel core task

    Full text link
    Actas de las V Jornadas en Tecnologรญa del Habla (JTH 2008)This paper explores two alternatives for speaker verification using Generalized Linear Discriminant Sequence (GLDS) kernel: classical Support Vector Classification (SVC), and Support Vector Regression (SVR), recently proposed by the authors as a more robust approach for telephone speech. In this work we address a more challenging environment, the NIST SRE 2008 multichannel core task, where strong mismatch is introduced by the use of different microphones and recordings from interviews. Channel compensation based in Nuisance Attribute Projection (NAP) has also been investigated in order to analyze its impact for both approaches. Experiments show that, although both techniques show a significant improvement over SVC-GLDS when NAP is used, SVR is also robust to channel mismatch even when channel compensation is not used. This avoids the need of a considerable set of training data adapted to the operational scenario, whose availability is not frequent in general. Results show a similar performance for SVR-GLDS without NAP and SVC-GLDS with NAP. Moreover, SVR-GLDS results are promising, since other configurations and methods for channel compensation can further improve performance.This work has been supported by the Spanish Ministry of Education under project TEC2006-13170-C02-01

    Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification

    Full text link
    In the field of speaker verification, session or channel variability poses a significant challenge. While many contemporary methods aim to disentangle session information from speaker embeddings, we introduce a novel approach using an additional embedding to represent the session information. This is achieved by training an auxiliary network appended to the speaker embedding extractor which remains fixed in this training process. This results in two similarity scores: one for the speakers information and one for the session information. The latter score acts as a compensator for the former that might be skewed due to session variations. Our extensive experiments demonstrate that session information can be effectively compensated without retraining of the embedding extractor
    • โ€ฆ
    corecore