1,201 research outputs found
Deep learning-based voiceprint extraction for speaker recognition robust to non-speaker factors
Thesis (Ph.D.) -- Graduate School of Seoul National University: Department of Electrical and Computer Engineering, College of Engineering, February 2021. Advisor: Nam Soo Kim.
Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), the deep learning-based embedding systems are trained in a fully supervised manner, so they cannot utilize unlabeled datasets during training.
In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding together with a representation of the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the utilization of unlabeled datasets. Furthermore, in order to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both the VAE- and ALI-based embedding techniques have shown strong performance on short-duration speaker verification, outperforming the conventional i-vector framework.
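To make the framework concrete, here is a minimal PyTorch sketch of an unsupervised VAE embedding extractor. It is an illustration under assumptions, not the thesis's exact architecture: `VAEEmbedder`, its layer sizes, and the input features are placeholders. The encoder mean plays the role of the total variability embedding, the variance models the uncertainty, and the KL term is the regularizer whose potential information loss motivates the ALI variant.

```python
# Minimal sketch of an unsupervised VAE embedding extractor (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEEmbedder(nn.Module):
    def __init__(self, feat_dim=40, latent_dim=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)       # total variability embedding
        self.logvar = nn.Linear(512, latent_dim)   # uncertainty representation
        self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, feat_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    rec = F.mse_loss(recon, x, reduction="sum")  # reconstruction term
    # KL divergence to the standard normal prior: the regularization term
    # whose information loss the ALI-based variant is designed to avoid.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl  # fully unsupervised: no speaker labels required
```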
Additionally, we present a fully supervised training method for disentangling the non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and trains them to retain maximum information on their main task while ensuring maximum uncertainty on their sub-task. Since the proposed method does not require the heuristic training strategies of conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset and has demonstrated its capability to efficiently disentangle the recording device and emotional information from the speaker embedding.
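A hedged sketch of this objective, assuming classifier heads on each embedding (all function and variable names illustrative): each embedding minimizes cross-entropy on its main task while pushing its sub-task posterior toward the uniform distribution, so no adversarial min-max game or gradient reversal is needed.

```python
# Sketch of the joint speaker/nuisance embedding objective (names illustrative).
import torch
import torch.nn.functional as F

def uniform_ce(logits):
    """Cross-entropy against the uniform distribution (maximum uncertainty)."""
    logp = F.log_softmax(logits, dim=-1)
    return -logp.mean(dim=-1).mean()

def joint_factor_loss(spk_logits_main, spk_logits_sub,
                      nui_logits_main, nui_logits_sub,
                      spk_labels, nui_labels):
    # Main tasks: the speaker embedding predicts the speaker; the nuisance
    # embedding predicts the nuisance class (e.g., channel or emotion).
    main = F.cross_entropy(spk_logits_main, spk_labels) \
         + F.cross_entropy(nui_logits_main, nui_labels)
    # Sub-tasks: each embedding is trained to be maximally uncertain about
    # the other factor, disentangling the two representations.
    sub = uniform_ce(spk_logits_sub) + uniform_ce(nui_logits_sub)
    return main + sub
```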
น์ ๊ธฐ๊ธฐ ๋ฐ ๊ฐ์ ์ ๋ณด๋ฅผ ์ต์ ํ๋๋ฐ ํจ๊ณผ์ ์ด์๋ค.1. Introduction 1
2. Conventional embedding techniques for speaker recognition 7
2.1. i-vector framework 7
2.2. Deep learning-based speaker embedding 10
2.2.1. Deep embedding network 10
2.2.2. Conventional disentanglement methods 13
3. Unsupervised learning of total variability embedding for speaker verification with random digit strings 17
3.1. Introduction 17
3.2. Variational autoencoder 20
3.3. Variational inference model for non-linear total variability embedding 22
3.3.1. Maximum likelihood training 23
3.3.2. Non-linear feature extraction and speaker verification 25
3.4. Experiments 26
3.4.1. Databases 26
3.4.2. Experimental setup 27
3.4.3. Effect of the duration on the latent variable 28
3.4.4. Experiments with VAEs 30
3.4.5. Feature-level fusion of i-vector and latent variable 33
3.4.6. Score-level fusion of i-vector and latent variable 36
3.5. Summary 39
4. Adversarially learned total variability embedding for speaker recognition with random digit strings 41
4.1. Introduction 41
4.2. Adversarially learned inference 43
4.3. Adversarially learned feature extraction 45
4.3.1. Maximum likelihood criterion 47
4.3.2. Adversarially learned inference for non-linear i-vector extraction 49
4.3.3. Relationship to the VAE-based feature extractor 50
4.4. Experiments 51
4.4.1. Databases 51
4.4.2. Experimental setup 53
4.4.3. Effect of the duration on the latent variable 54
4.4.4. Speaker verification and identification with different utterance-level features 56
4.5. Summary 62
5. Disentangled speaker and nuisance attribute embedding for robust speaker verification 63
5.1. Introduction 63
5.2. Joint factor embedding 67
5.2.1. Joint factor embedding network architecture 67
5.2.2. Training for joint factor embedding 69
5.3. Experiments 71
5.3.1. Channel disentanglement experiments 71
5.3.2. Emotion disentanglement 82
5.3.3. Noise disentanglement 86
5.4. Summary 87
6. Conclusion 93
Bibliography 95
Abstract (Korean) 105Docto
Attentive Adversarial Learning for Domain-Invariant Training
Adversarial domain-invariant training (ADIT) proves to be effective in
suppressing the effects of domain variability in acoustic modeling and has led
to improved performance in automatic speech recognition (ASR). In ADIT, an
auxiliary domain classifier takes in equally-weighted deep features from a deep
neural network (DNN) acoustic model and is trained to improve their
domain-invariance by optimizing an adversarial loss function. In this work, we
propose an attentive ADIT (AADIT) in which we advance the domain classifier
with an attention mechanism to automatically weight the input deep features
according to their importance in domain classification. With this attentive
re-weighting, AADIT can focus on the domain normalization of phonetic
components that are more susceptible to domain variability and generates deep
features with improved domain-invariance and senone-discriminativity over ADIT.
Most importantly, the attention block serves only as an external component to
the DNN acoustic model and is not involved in ASR, so AADIT can be used to
improve acoustic modeling with any DNN architecture. More generally, the
same methodology can improve any adversarial learning system with an auxiliary
discriminator. Evaluated on the CHiME-3 dataset, AADIT achieves 13.6% and 9.3%
relative WER improvements over a multi-conditional model and a strong ADIT
baseline, respectively.
Comment: 5 pages, 1 figure, ICASSP 201
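As a rough illustration of the mechanism described above, the following PyTorch sketch combines gradient reversal with frame-level attention inside the domain classifier. Module names, layer sizes, and the pooling choice are assumptions, not the paper's exact design.

```python
# Sketch of an attention-equipped adversarial domain classifier (assumed design).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class AttentiveDomainClassifier(nn.Module):
    """Domain classifier that re-weights frame-level deep features by attention."""
    def __init__(self, feat_dim, n_domains):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # frame-level attention scores
        self.cls = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_domains))

    def forward(self, feats, lam=1.0):  # feats: (batch, frames, feat_dim)
        h = GradReverse.apply(feats, lam)        # reversed gradients flow into the AM
        w = torch.softmax(self.score(h), dim=1)  # emphasize domain-sensitive frames
        return self.cls((w * h).sum(dim=1))      # classify attention-pooled features

# Training loss (sketch): senone cross-entropy + domain cross-entropy; the
# reversal layer turns the latter into an adversarial signal for the acoustic model.
```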
Support vector regression in NIST SRE 2008 multichannel core task
Actas de las V Jornadas en Tecnología del Habla (JTH 2008)
This paper explores two alternatives for speaker verification
using Generalized Linear Discriminant Sequence (GLDS)
kernel: classical Support Vector Classification (SVC), and
Support Vector Regression (SVR), recently proposed by the
authors as a more robust approach for telephone speech. In
this work we address a more challenging environment, the
NIST SRE 2008 multichannel core task, where strong
mismatch is introduced by the use of different microphones
and recordings from interviews. Channel compensation based
on Nuisance Attribute Projection (NAP) has also been
investigated in order to analyze its impact on both
approaches. Experiments show that, although both techniques
improve significantly over SVC-GLDS when NAP is used, SVR
remains robust to channel mismatch even when channel
compensation is not applied. This avoids the need for a
large set of training data matched to the operational
scenario, which is rarely available in practice. Results
show similar performance for SVR-GLDS without NAP and
SVC-GLDS with NAP. Moreover, SVR-GLDS results are
promising, since other configurations and methods for channel
compensation can further improve performance.
This work has been supported by the Spanish Ministry of Education under project TEC2006-13170-C02-01.
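A rough sketch of the GLDS-plus-SVR pipeline described above, using scikit-learn: each utterance's frame features are expanded with monomials up to degree 2 and averaged, and a linear SVR is trained per target speaker. The toy data, the expansion degree, and hyperparameters are assumptions, and the background normalization and NAP projection steps are omitted.

```python
# Sketch of GLDS-style utterance vectors scored with linear SVR (toy data).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVR

poly = PolynomialFeatures(degree=2, include_bias=True)

def glds_vector(frames):
    """Average monomial expansion of an utterance's frame features."""
    return poly.fit_transform(frames).mean(axis=0)

# Toy stand-ins for real acoustic features: 20-dim frames per utterance.
rng = np.random.default_rng(0)
train_utts = [rng.normal(size=(300, 20)) for _ in range(10)]
labels = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1], dtype=float)  # +1 target, -1 impostor

X = np.stack([glds_vector(u) for u in train_utts])
svr = LinearSVR(C=1.0, epsilon=0.1, max_iter=10000).fit(X, labels)

test_utt = rng.normal(size=(250, 20))
score = svr.predict(glds_vector(test_utt)[None, :])[0]  # verification score
```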
Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
In the field of speaker verification, session or channel variability poses a
significant challenge. While many contemporary methods aim to disentangle
session information from speaker embeddings, we introduce a novel approach
using an additional embedding to represent the session information. This is
achieved by training an auxiliary network appended to the speaker embedding
extractor which remains fixed in this training process. This results in two
similarity scores: one for the speaker information and one for the session
information. The latter score acts as a compensator for the former that might
be skewed due to session variations. Our extensive experiments demonstrate that
session information can be effectively compensated without retraining of the
embedding extractor.
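One plausible reading of the two-score compensation, as a hedged sketch: cosine scoring and the fixed weight `alpha` are assumptions, not the paper's exact fusion rule.

```python
# Sketch: compensate the speaker score with a weighted session score.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compensated_score(spk_enroll, spk_test, sess_enroll, sess_test, alpha=0.5):
    s_spk = cosine(spk_enroll, spk_test)     # speaker-information similarity
    s_sess = cosine(sess_enroll, sess_test)  # session-information similarity
    # A high session similarity can inflate the speaker score; subtracting a
    # weighted session score compensates for that skew.
    return s_spk - alpha * s_sess
```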
- …