590 research outputs found
Deep learning-based speaker embedding extraction for speaker recognition robust to non-speaker factors
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Nam Soo Kim.
In recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), the deep learning-based embedding systems are trained in a fully supervised manner, so they cannot make use of unlabeled datasets during training.
In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding and a representation of the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the use of unlabeled datasets. Furthermore, to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both the VAE- and ALI-based embedding techniques performed well in short-duration speaker verification, outperforming the conventional i-vector framework.
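To make the Kullback-Leibler regularization term mentioned above concrete, here is a minimal sketch of its usual closed form. It assumes a diagonal-Gaussian posterior and a standard-normal prior, which is the common VAE setup; the abstract does not state the thesis's exact parameterization, and the function name `kl_to_standard_normal` is hypothetical.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) for one latent vector.

    Assumption: diagonal-Gaussian posterior and standard-normal prior,
    the usual VAE choice (not confirmed by the abstract).
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mean, log_var)
    )


# A posterior equal to the prior costs nothing; shifting the mean is penalized.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

Because this term pulls every posterior toward the same prior, it can discard information about the input, which is the loss the ALI-based variant is meant to avoid.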
Additionally, we present a fully supervised training method for disentangling non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and trains them to carry maximum information on their main task while ensuring maximum uncertainty on their sub-task. Since the proposed method does not require a heuristic training strategy as in conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset and demonstrated its capability to efficiently disentangle recording device and emotional information from the speaker embedding.
1. Introduction
2. Conventional embedding techniques for speaker recognition
2.1. i-vector framework
2.2. Deep learning-based speaker embedding
2.2.1. Deep embedding network
2.2.2. Conventional disentanglement methods
3. Unsupervised learning of total variability embedding for speaker verification with random digit strings
3.1. Introduction
3.2. Variational autoencoder
3.3. Variational inference model for non-linear total variability embedding
3.3.1. Maximum likelihood training
3.3.2. Non-linear feature extraction and speaker verification
3.4. Experiments
3.4.1. Databases
3.4.2. Experimental setup
3.4.3. Effect of the duration on the latent variable
3.4.4. Experiments with VAEs
3.4.5. Feature-level fusion of i-vector and latent variable
3.4.6. Score-level fusion of i-vector and latent variable
3.5. Summary
4. Adversarially learned total variability embedding for speaker recognition with random digit strings
4.1. Introduction
4.2. Adversarially learned inference
4.3. Adversarially learned feature extraction
4.3.1. Maximum likelihood criterion
4.3.2. Adversarially learned inference for non-linear i-vector extraction
4.3.3. Relationship to the VAE-based feature extractor
4.4. Experiments
4.4.1. Databases
4.4.2. Experimental setup
4.4.3. Effect of the duration on the latent variable
4.4.4. Speaker verification and identification with different utterance-level features
4.5. Summary
5. Disentangled speaker and nuisance attribute embedding for robust speaker verification
5.1. Introduction
5.2. Joint factor embedding
5.2.1. Joint factor embedding network architecture
5.2.2. Training for joint factor embedding
5.3. Experiments
5.3.1. Channel disentanglement experiments
5.3.2. Emotion disentanglement
5.3.3. Noise disentanglement
5.4. Summary
6. Conclusion
Bibliography
Abstract (Korean)
Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
A number of studies have extracted bottleneck (BN) features
from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases
and triphone states to improve the performance of text-dependent speaker
verification (TD-SV). However, only moderate success has been achieved. A recent
study [1] presented a time contrastive learning (TCL) concept to explore the
non-stationarity of brain signals for classification of brain states. Speech
signals have a similar non-stationarity property, and TCL further has the
advantage of having no need for labeled data. We therefore present a TCL based
BN feature extraction method. The method uniformly partitions each speech
utterance in a training dataset into a predefined number of multi-frame
segments. Each segment in an utterance corresponds to one class, and class
labels are shared across utterances. DNNs are then trained to discriminate all
speech frames among the classes to exploit the temporal structure of speech. In
addition, we propose a segment-based unsupervised clustering algorithm to
re-assign class labels to the segments. TD-SV experiments were conducted on the
RedDots challenge database. The TCL-DNNs were trained using speech data of
fixed pass-phrases that were excluded from the TD-SV evaluation set, so the
learned features can be considered phrase-independent. We compare the
performance of the proposed TCL bottleneck (BN) feature with those of
short-time cepstral features and BN features extracted from DNNs discriminating
speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels
and boundaries are generated by three different automatic speech recognition
(ASR) systems. Experimental results show that the proposed TCL-BN outperforms
cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with those of ASR derived BN features. Moreover,....
Comment: Copyright (c) 2019 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works.
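The TCL labelling step described in the abstract above (uniformly partition each utterance into a predefined number of multi-frame segments, with segment index as the class label shared across utterances) can be sketched as follows. This is an illustration only: the function name `tcl_frame_labels` is hypothetical, and the way leftover frames are distributed when the frame count is not divisible by the segment count is an assumption, since the abstract does not specify it.

```python
def tcl_frame_labels(num_frames, num_segments):
    """Return one class label per frame for a single utterance.

    Frames are split into `num_segments` contiguous, near-equal segments;
    frame i gets the index of the segment it falls in. Using the same
    label set for every utterance makes the classes shared across them.
    """
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    if num_frames < num_segments:
        raise ValueError("utterance is shorter than the number of segments")
    # Integer arithmetic keeps the mapping monotone and within range.
    return [i * num_segments // num_frames for i in range(num_frames)]


# Two utterances of different length share the same label set {0, 1, 2, 3}.
print(tcl_frame_labels(8, 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
print(tcl_frame_labels(10, 4))  # [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
```

A frame-level classifier trained on these labels needs no speaker or phonetic annotation, which is the property the abstract highlights.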
Phonetic aware techniques for Speaker Verification
The goal of this thesis is to improve current state-of-the-art techniques in speaker verification
(SV), typically based on "identity-vectors" (i-vectors) and deep neural networks (DNNs), by exploiting diverse (phonetic) information extracted using various techniques such as automatic
speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by a "universal background model". The speaker-specific subspace
depends on the speaker's voice characteristics, but also on the verbalised text of a speaker. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis
technique to obtain low dimensional speaker-specific representation. Furthermore, DNN output is also employed in a conventional i-vector framework to model phonetic information
embedded in the speech signal. This thesis proposes various techniques to exploit phonetic knowledge of speech to further enrich speaker characteristics.
More specifically, the techniques proposed in this thesis are applied to various SV tasks,
namely, text-independent and text-dependent SV. For the text-independent SV task, several ASR
systems are developed and applied to compute phonetic posterior probabilities, subsequently
exploited to enhance the speaker-specific information included in i-vectors. These approaches
are then extended to the text-dependent SV task, exploiting temporal information in a principled
way, i.e., by using dynamic time warping applied to speaker informative vectors.
Finally, as opposed to training a DNN with phonetic information, a DNN is trained in an end-to-end
fashion to directly discriminate speakers. The baseline end-to-end SV approach consists of
mapping a variable length speech segment to a fixed dimensional speaker vector by estimating
the mean of hidden representations in the DNN structure. We improve upon this technique by
computing a distance function between two utterances which takes into account common
phonetic units. The whole network is optimized by employing a triplet-loss objective function.
The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010
and RSR2015. Significant improvements are observed over the baseline systems on both the
text-dependent and text-independent SV tasks by applying phonetic knowledge.
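The dynamic time warping step mentioned above can be illustrated with a minimal sketch of the textbook algorithm applied to two sequences of vectors. The Euclidean local distance, the unconstrained step pattern, and the function names are assumptions made for illustration; the abstract does not say which variant the thesis uses.

```python
import math


def euclidean(a, b):
    """Local distance between two equal-length vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Accumulated cost of the best monotone alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i items with the first j.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat item of seq_b
                                 cost[i][j - 1],      # repeat item of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# The same content spoken at a different pace aligns with zero cost.
enrol = [[0.0], [1.0], [2.0]]
test = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(enrol, test))  # 0.0
```

Because the alignment is monotone in time, the score is sensitive to the order of the spoken content, which is what makes it suitable for the text-dependent task.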
Identity verification using voice and its use in a privacy preserving system
Since security has been a growing concern in recent years, the field of biometrics has gained popularity and become an active research area. Besides new identity authentication and recognition methods, protection against theft of biometric data and potential privacy loss are current directions in biometric systems research. Biometric traits used for verification can be grouped into two categories: physical and behavioral traits. Physical traits such as fingerprints and iris patterns are characteristics that do not undergo major changes over time. On the other hand, behavioral traits such as voice, signature, and gait are more variable; they are therefore more suitable for lower security applications. Behavioral traits such as voice and signature also have the advantage of being able to generate numerous different biometric templates of the same modality (e.g. different pass-phrases or signatures), in order to provide cancelability of the biometric template and to prevent crossmatching of different databases. In this thesis, we present three new biometric verification systems based mainly on the voice modality. First, we propose a text-dependent (TD) system where acoustic features are extracted from individual frames of the utterances, after they are aligned via phonetic HMMs. Data from 163 speakers from the TIDIGITS database are employed for this work and the best equal error rate (EER) is reported as 0.49% for 6-digit user passwords. Second, a text-independent (TI) speaker verification method is implemented, inspired by the feature extraction method used for our text-dependent system. Our proposed TI system depends on creating speaker-specific phoneme codebooks. Once phoneme codebooks are created at the enrollment stage, using HMM alignment and segmentation to extract discriminative user information, test utterances are verified by calculating the total dissimilarity/distance to the claimed codebook. For benchmarking, a GMM-based TI system is implemented as a baseline.
The results of the proposed TD system (0.22% EER for 7-digit passwords) are superior to those of the GMM-based system (0.31% EER for 7-digit sequences), whereas the proposed TI system yields worse results (5.79% EER for 7-digit sequences) using the data of 163 people from the TIDIGITS database. Finally, we introduce a new implementation of the multi-biometric template framework of Yanikoglu and Kholmatov [12], using fingerprint and voice modalities. In this framework, two biometric data are fused at the template level to create a multi-biometric template, in order to increase template security and privacy. The current work also aims to provide cancelability by exploiting the behavioral aspect of the voice modality.
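Since the results above are reported as equal error rates, a minimal sketch of how an EER can be estimated from trial scores may help. It assumes that a higher score means "more likely the claimed speaker" and sweeps the threshold over the observed scores only; the function name and the toy scores are hypothetical, and this is not the evaluation code of the thesis.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER: the operating point where FAR and FRR are closest.

    FRR = fraction of genuine trials rejected (score below threshold).
    FAR = fraction of impostor trials accepted (score at or above threshold).
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Average the two rates in case they do not cross exactly.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one genuine trial scores below one impostor trial.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))  # 0.25
```

With only a handful of trials the estimate is coarse; the rates quoted in the abstract would come from many more trials than this toy example.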