A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
Modern automatic speaker verification relies largely on deep neural networks
(DNNs) trained on mel-frequency cepstral coefficient (MFCC) features. While
there are alternative feature extraction methods based on phase, prosody and
long-term temporal operations, they have not been extensively studied with
DNN-based methods. We aim to fill this gap by providing an extensive re-assessment
of 14 feature extractors on VoxCeleb and SITW datasets. Our findings reveal
that features equipped with techniques such as spectral centroids, group delay
function, and integrated noise suppression provide promising alternatives to
MFCCs for deep speaker embedding extraction. Experimental results demonstrate
up to 16.3% (VoxCeleb) and 25.1% (SITW) relative decrease in equal error rate
(EER) over the baseline.
Comment: Accepted to Interspeech 202
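As a rough illustration of the kinds of alternatives compared above, the following Python sketch (not the paper's actual pipeline; the example clip and parameters are illustrative assumptions) extracts baseline MFCCs alongside two of the cues mentioned, spectral centroids and a simple group-delay representation, using numpy and librosa:

import numpy as np
import librosa

def extract_features(y, sr, n_fft=512, hop=160):
    # Baseline: 20-dimensional MFCCs per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop)
    # Magnitude-domain cue: per-frame spectral centroid.
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    # Phase-domain cue: a crude group-delay estimate, the negative derivative
    # of the unwrapped STFT phase along the frequency axis.
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    phase = np.unwrap(np.angle(stft), axis=0)
    group_delay = -np.diff(phase, axis=0)
    return mfcc, centroid, group_delay

y, sr = librosa.load(librosa.example("libri1"), sr=16000)  # any 16 kHz speech clip
mfcc, centroid, gd = extract_features(y, sr)
print(mfcc.shape, centroid.shape, gd.shape)  # (20, T), (1, T), (n_fft//2, T)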
Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
A number of studies have extracted bottleneck (BN) features from deep neural
networks (DNNs) trained to discriminate speakers, pass-phrases and triphone
states in order to improve the performance of text-dependent speaker
verification (TD-SV). However, only moderate success has been achieved. A recent
study [1] presented a time contrastive learning (TCL) concept to explore the
non-stationarity of brain signals for classification of brain states. Speech
signals have a similar non-stationarity property, and TCL has the further
advantage of requiring no labeled data. We therefore present a TCL-based
BN feature extraction method. The method uniformly partitions each speech
utterance in a training dataset into a predefined number of multi-frame
segments. Each segment in an utterance corresponds to one class, and class
labels are shared across utterances. DNNs are then trained to discriminate all
speech frames among the classes to exploit the temporal structure of speech. In
addition, we propose a segment-based unsupervised clustering algorithm to
re-assign class labels to the segments. TD-SV experiments were conducted on the
RedDots challenge database. The TCL-DNNs were trained using speech data of
fixed pass-phrases that were excluded from the TD-SV evaluation set, so the
learned features can be considered phrase-independent. We compare the
performance of the proposed TCL bottleneck (BN) feature with those of
short-time cepstral features and BN features extracted from DNNs discriminating
speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels
and boundaries are generated by three different automatic speech recognition
(ASR) systems. Experimental results show that the proposed TCL-BN outperforms
cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with those of ASR-derived BN features. Moreover, ...
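The labelling scheme described in this abstract is simple enough to state in a few lines. The sketch below (a paraphrase, not the authors' code; the 10-class setting and feature dimension are illustrative assumptions) generates the shared segment labels for a batch of utterances; the frame-classification DNN itself is not shown:

import numpy as np

def tcl_labels(num_frames, n_classes=10):
    # Uniform partition: frame i falls in segment floor(i * n_classes / num_frames),
    # so segment lengths differ by at most one frame and labels run 0..n_classes-1.
    return (np.arange(num_frames) * n_classes) // num_frames

# Example: three utterances of different lengths share the same 10 classes.
utterances = [np.random.randn(T, 40) for T in (300, 250, 412)]  # frames x features
labels = [tcl_labels(x.shape[0]) for x in utterances]
print(labels[0][:35])  # the first ~30 frames of utterance 0 belong to class 0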
Deep Learning-Based Voiceprint Extraction for Speaker Recognition Robust to Non-Speaker Factors
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Information Engineering, February 2021. Advisor: Nam Soo Kim.
Over recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples recorded under differing conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), the deep learning-based embedding systems are trained in a fully supervised manner, so they cannot utilize unlabeled datasets during training.
In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding together with a representation of the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the use of unlabeled datasets. Furthermore, in order to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both the VAE- and ALI-based embedding techniques have shown strong performance in short-duration speaker verification, outperforming the conventional i-vector framework.
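A minimal PyTorch sketch of the idea in the preceding paragraph, not the thesis's exact model: an encoder yields a Gaussian posterior whose mean acts as the embedding and whose variance carries the uncertainty, trained unsupervised with the standard ELBO, whose KL term is the one discussed above. All layer sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEEmbedder(nn.Module):
    def __init__(self, feat_dim=600, latent_dim=200, hidden=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)       # the embedding
        self.logvar = nn.Linear(hidden, latent_dim)   # the uncertainty
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feat_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    recon_term = F.mse_loss(recon, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians: the
    # regularization term mentioned in the abstract.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl

x = torch.randn(32, 600)                  # a batch of utterance-level features
recon, mu, logvar = VAEEmbedder()(x)
loss = elbo_loss(x, recon, mu, logvar)    # note: no speaker labels needed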
Additionally, we present a fully supervised training method for disentangling non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings and trains each to retain maximum information on its main task while ensuring maximum uncertainty on its sub-task (see the sketch below). Since the proposed method does not require any heuristic training strategy, as the conventional disentanglement techniques (e.g., adversarial learning, gradient reversal) do, optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset and demonstrated its capability to efficiently disentangle recording device and emotional information from the speaker embedding.
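To make the stated objective concrete, here is a minimal sketch of one way it could be instantiated (an assumption, not the thesis's exact network, classifiers, or loss weighting): each embedding minimizes cross-entropy on its main task while its sub-task posterior is pushed toward the uniform distribution, with no gradient reversal involved.

import torch
import torch.nn as nn
import torch.nn.functional as F

def uniformity_loss(logits):
    # Cross-entropy against the uniform distribution: minimized when the
    # classifier is maximally uncertain about the sub-task.
    return -F.log_softmax(logits, dim=1).mean()

class JointFactorEmbedder(nn.Module):
    def __init__(self, feat_dim=512, emb_dim=256, n_spk=100, n_nui=4):
        super().__init__()
        self.spk_emb = nn.Linear(feat_dim, emb_dim)  # speaker embedding branch
        self.nui_emb = nn.Linear(feat_dim, emb_dim)  # nuisance embedding branch
        # For brevity each classifier serves as the main-task head of one
        # branch and the sub-task head of the other; the thesis may use
        # separate sub-task classifiers.
        self.spk_cls = nn.Linear(emb_dim, n_spk)
        self.nui_cls = nn.Linear(emb_dim, n_nui)

    def forward(self, x, spk_labels, nui_labels):
        s, n = self.spk_emb(x), self.nui_emb(x)
        # Main tasks: each embedding must predict its own attribute ...
        main = (F.cross_entropy(self.spk_cls(s), spk_labels)
                + F.cross_entropy(self.nui_cls(n), nui_labels))
        # ... while staying maximally uncertain about the other attribute.
        sub = uniformity_loss(self.nui_cls(s)) + uniformity_loss(self.spk_cls(n))
        return main + sub

x = torch.randn(8, 512)
spk = torch.randint(0, 100, (8,))
nui = torch.randint(0, 4, (8,))
loss = JointFactorEmbedder()(x, spk, nui)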
1. Introduction
2. Conventional embedding techniques for speaker recognition
2.1. i-vector framework
2.2. Deep learning-based speaker embedding
2.2.1. Deep embedding network
2.2.2. Conventional disentanglement methods
3. Unsupervised learning of total variability embedding for speaker verification with random digit strings
3.1. Introduction
3.2. Variational autoencoder
3.3. Variational inference model for non-linear total variability embedding
3.3.1. Maximum likelihood training
3.3.2. Non-linear feature extraction and speaker verification
3.4. Experiments
3.4.1. Databases
3.4.2. Experimental setup
3.4.3. Effect of the duration on the latent variable
3.4.4. Experiments with VAEs
3.4.5. Feature-level fusion of i-vector and latent variable
3.4.6. Score-level fusion of i-vector and latent variable
3.5. Summary
4. Adversarially learned total variability embedding for speaker recognition with random digit strings
4.1. Introduction
4.2. Adversarially learned inference
4.3. Adversarially learned feature extraction
4.3.1. Maximum likelihood criterion
4.3.2. Adversarially learned inference for non-linear i-vector extraction
4.3.3. Relationship to the VAE-based feature extractor
4.4. Experiments
4.4.1. Databases
4.4.2. Experimental setup
4.4.3. Effect of the duration on the latent variable
4.4.4. Speaker verification and identification with different utterance-level features
4.5. Summary
5. Disentangled speaker and nuisance attribute embedding for robust speaker verification
5.1. Introduction
5.2. Joint factor embedding
5.2.1. Joint factor embedding network architecture
5.2.2. Training for joint factor embedding
5.3. Experiments
5.3.1. Channel disentanglement experiments
5.3.2. Emotion disentanglement
5.3.3. Noise disentanglement
5.4. Summary
6. Conclusion
Bibliography
Abstract (Korean)
Speaker recognition by means of restricted Boltzmann machine adaptation
Restricted Boltzmann Machines (RBMs) have shown success in speaker recognition. In this paper, RBMs are investigated in a framework comprising universal model training and model adaptation. Taking advantage of the RBM unsupervised learning algorithm, a global model is trained on all available background data. This general speaker-independent model, referred to as the URBM, is further adapted to the data of a specific speaker to build a speaker-dependent model. To show its effectiveness, we have applied this framework to two different tasks. It has been used to discriminatively model target and impostor spectral features for classification. It has also been utilized to produce a vector-based representation for speakers. This vector-based representation, similar to the i-vector, can be further used for speaker recognition with either cosine scoring or Probabilistic Linear Discriminant Analysis (PLDA). The evaluation is performed on the core test condition of the NIST SRE 2006 database.
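For the cosine scoring option mentioned above, a minimal sketch (the vector dimension and decision threshold are illustrative assumptions, not values from the paper):

import numpy as np

def cosine_score(enroll_vec, test_vec):
    # Verification score: cosine similarity between the enrolled speaker's
    # vector and the test utterance's vector.
    return float(np.dot(enroll_vec, test_vec)
                 / (np.linalg.norm(enroll_vec) * np.linalg.norm(test_vec)))

enroll = np.random.randn(400)   # e.g. a URBM-derived speaker vector
test = np.random.randn(400)
accept = cosine_score(enroll, test) > 0.5   # threshold tuned on dev data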