590 research outputs found

    ๋น„ํ™”์ž ์š”์†Œ์— ๊ฐ•์ธํ•œ ํ™”์ž ์ธ์‹์„ ์œ„ํ•œ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์„ฑ๋ฌธ ์ถ”์ถœ

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021. 2. ๊น€๋‚จ์ˆ˜.Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), since the deep learning-based embedding systems are trained in a fully supervised manner, it is impossible for them to utilize unlabeled dataset when training. In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding and a representation for the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the utilization of unlabeled datasets. Furthermore, in order to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both VAE- and ALI-based embedding techniques have shown great performance in terms of short duration speaker verification, outperforming the conventional i-vector framework. Additionally, we present a fully supervised training method for disentangling the non-speaker nuisance information from the speaker embedding. 
The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and trains them to retain maximum information on their main task while ensuring maximum uncertainty on their sub-task. Since the proposed method does not require any heuristic training strategy, as the conventional disentanglement techniques do (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset, and demonstrated its capability to efficiently disentangle the recording-device and emotional information from the speaker embedding.
๋” ๋‚˜์•„๊ฐ€ VAE์˜ KL-divergence ์ œ์•ฝ ํ•จ์ˆ˜๋กœ ์ธํ•œ ์ •๋ณด ์†์‹ค์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ adversarially learned inference (ALI) ๊ธฐ๋ฐ˜์˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•์„ ์ถ”๊ฐ€์ ์œผ๋กœ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆํ•œ VAE ๋ฐ ALI ๊ธฐ๋ฐ˜์˜ ์„ฑ๋ฌธ ์ถ”์ถœ ๊ธฐ๋ฒ•์€ ์งง์€ ์Œ์„ฑ์—์„œ์˜ ํ™”์ž ์ธ์ฆ ์‹คํ—˜์—์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ๊ธฐ์กด์˜ i-๋ฒกํ„ฐ ๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•๋ณด๋‹ค ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค. ๋˜ํ•œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์„ฑ๋ฌธ ๋ฒกํ„ฐ๋กœ๋ถ€ํ„ฐ ๋น„ ํ™”์ž ์š”์†Œ (e.g., ๋…น์Œ ๊ธฐ๊ธฐ, ๊ฐ์ •)์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ฑฐํ•˜๋Š” ํ•™์Šต๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ ํ™”์ž ๋ฒกํ„ฐ์™€ ๋น„ํ™”์ž ๋ฒกํ„ฐ๋ฅผ ๋™์‹œ์— ์ถ”์ถœํ•˜๋ฉฐ, ๊ฐ ๋ฒกํ„ฐ๋Š” ์ž์‹ ์˜ ์ฃผ ๋ชฉ์ ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ตœ๋Œ€ํ•œ ๋งŽ์ด ์œ ์ง€ํ•˜๋˜, ๋ถ€ ๋ชฉ์ ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ตœ์†Œํ™”ํ•˜๋„๋ก ํ•™์Šต๋œ๋‹ค. ๊ธฐ์กด์˜ ๋น„ ํ™”์ž ์š”์†Œ ์ •๋ณด ์ œ๊ฑฐ ๊ธฐ๋ฒ•๋“ค (e.g., adversarial learning, gradient reversal)์— ๋น„ํ•˜์—ฌ ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ ํœด๋ฆฌ์Šคํ‹ฑํ•œ ํ•™์Šต ์ „๋žต์„ ์š”ํ•˜์ง€ ์•Š๊ธฐ์—, ๋ณด๋‹ค ์•ˆ์ •์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ RSR2015 Part3 ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ธฐ์กด ๊ธฐ๋ฒ•๋“ค์— ๋น„ํ•˜์—ฌ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ์„ฑ๋ฌธ ๋ฒกํ„ฐ ๋‚ด์˜ ๋…น์Œ ๊ธฐ๊ธฐ ๋ฐ ๊ฐ์ • ์ •๋ณด๋ฅผ ์–ต์ œํ•˜๋Š”๋ฐ ํšจ๊ณผ์ ์ด์—ˆ๋‹ค.1. Introduction 1 2. Conventional embedding techniques for speaker recognition 7 2.1. i-vector framework 7 2.2. Deep learning-based speaker embedding 10 2.2.1. Deep embedding network 10 2.2.2. Conventional disentanglement methods 13 3. Unsupervised learning of total variability embedding for speaker verification with random digit strings 17 3.1. Introduction 17 3.2. Variational autoencoder 20 3.3. Variational inference model for non-linear total variability embedding 22 3.3.1. Maximum likelihood training 23 3.3.2. Non-linear feature extraction and speaker verification 25 3.4. Experiments 26 3.4.1. Databases 26 3.4.2. Experimental setup 27 3.4.3. Effect of the duration on the latent variable 28 3.4.4. Experiments with VAEs 30 3.4.5. 
Feature-level fusion of i-vector and latent variable 33 3.4.6. Score-level fusion of i-vector and latent variable 36 3.5. Summary 39 4. Adversarially learned total variability embedding for speaker recognition with random digit strings 41 4.1. Introduction 41 4.2. Adversarially learned inference 43 4.3. Adversarially learned feature extraction 45 4.3.1. Maximum likelihood criterion 47 4.3.2. Adversarially learned inference for non-linear i-vector extraction 49 4.3.3. Relationship to the VAE-based feature extractor 50 4.4. Experiments 51 4.4.1. Databases 51 4.4.2. Experimental setup 53 4.4.3. Effect of the duration on the latent variable 54 4.4.4. Speaker verification and identification with different utterance-level features 56 4.5. Summary 62 5. Disentangled speaker and nuisance attribute embedding for robust speaker verification 63 5.1. Introduction 63 5.2. Joint factor embedding 67 5.2.1. Joint factor embedding network architecture 67 5.2.2. Training for joint factor embedding 69 5.3. Experiments 71 5.3.1. Channel disentanglement experiments 71 5.3.2. Emotion disentanglement 82 5.3.3. Noise disentanglement 86 5.4. Summary 87 6. Conclusion 93 Bibliography 95 Abstract (Korean) 105Docto
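The joint factor embedding objective sketched in the abstract (maximum information on the main task, maximum uncertainty on the sub-task, with no gradient reversal) can be caricatured with a toy loss; the classifier heads and logits below are invented, not taken from the thesis:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(probs, label):
    # Main-task loss: keep maximum information about the true class.
    return -np.log(probs[label] + 1e-12)

def uniformity_loss(probs):
    # Sub-task loss: push predictions toward the uniform distribution
    # (maximum uncertainty) instead of reversing gradients.
    return np.sum((probs - 1.0 / len(probs)) ** 2)

# Toy logits from hypothetical classifier heads applied to the
# speaker embedding: confident on speaker ID, uninformative on device.
spk_logits = np.array([4.0, 0.1, 0.2])   # speaker head (main task)
dev_logits = np.array([0.0, 0.0, 0.0])   # device head (sub-task)

main_loss = cross_entropy(softmax(spk_logits), label=0)
sub_loss = uniformity_loss(softmax(dev_logits))
total = main_loss + sub_loss   # symmetric terms exist for the nuisance embedding
```

Because both terms are ordinary differentiable losses, no adversarial alternation or gradient-reversal layer is needed, which is the stability argument made above.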

    Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

    There are a number of studies on the extraction of bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV). However, only moderate success has been achieved. A recent study [1] presented a time-contrastive learning (TCL) concept to explore the non-stationarity of brain signals for the classification of brain states. Speech signals have a similar non-stationarity property, and TCL has the further advantage of requiring no labeled data. We therefore present a TCL-based BN feature extraction method. The method uniformly partitions each speech utterance in a training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among the classes to exploit the temporal structure of speech. In addition, we propose a segment-based unsupervised clustering algorithm to re-assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL bottleneck (BN) feature with those of short-time cepstral features and BN features extracted from DNNs discriminating speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and its performance is on par with those of the ASR-derived BN features. Moreover, …
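The uniform partitioning step described in the abstract can be sketched as follows; the helper name and its parameters are illustrative, not from the paper:

```python
import numpy as np

def tcl_labels(num_frames, num_classes):
    """Uniformly partition an utterance into num_classes multi-frame
    segments: frames in segment k receive class label k. Labels are
    shared across utterances, so the classes encode temporal position
    rather than speaker or phrase identity."""
    seg_len = num_frames // num_classes
    labels = np.repeat(np.arange(num_classes), seg_len)
    # Assign any leftover frames to the final class.
    return np.pad(labels, (0, num_frames - len(labels)),
                  constant_values=num_classes - 1)

labels = tcl_labels(num_frames=103, num_classes=10)
```

A DNN trained to classify frames into these position-derived classes needs no transcriptions or speaker labels, which is the unsupervised appeal of TCL noted above.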

    Phonetic-aware techniques for speaker verification

    The goal of this thesis is to improve current state-of-the-art techniques in speaker verification (SV), typically based on "identity-vectors" (i-vectors) and deep neural networks (DNNs), by exploiting diverse (phonetic) information extracted using various techniques such as automatic speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by a "universal background model". The speaker-specific subspace depends on the speaker's voice characteristics, but also on the verbalised text of a speaker. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis technique to obtain a low-dimensional speaker-specific representation. Furthermore, the DNN output is also employed in a conventional i-vector framework to model phonetic information embedded in the speech signal. This thesis proposes various techniques to exploit phonetic knowledge of speech to further enrich speaker characteristics. More specifically, the techniques proposed in this thesis are applied to various SV tasks, namely, text-independent and text-dependent SV. For the text-independent SV task, several ASR systems are developed and applied to compute phonetic posterior probabilities, subsequently exploited to enhance the speaker-specific information included in i-vectors. These approaches are then extended to the text-dependent SV task, exploiting temporal information in a principled way, i.e., by using dynamic time warping applied on speaker-informative vectors. Finally, as opposed to training the DNN with phonetic information, the DNN is trained in an end-to-end fashion to directly discriminate speakers. The baseline end-to-end SV approach consists of mapping a variable-length speech segment to a fixed-dimensional speaker vector by estimating the mean of the hidden representations in the DNN structure.
We improve upon this technique by computing a distance function between two utterances that takes common phonetic units into account. The whole network is optimized by employing a triplet-loss objective function. The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010 and RSR2015. Significant improvements are observed over the baseline systems on both the text-dependent and text-independent SV tasks by applying phonetic knowledge.
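A minimal sketch of the triplet-loss objective on fixed-dimensional speaker vectors mentioned above; the margin value is an assumption, not from the thesis:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on speaker vectors: pull same-speaker pairs
    together and push different-speaker pairs at least `margin` apart
    in squared Euclidean distance."""
    d_ap = np.sum((anchor - positive) ** 2)   # same-speaker distance
    d_an = np.sum((anchor - negative) ** 2)   # different-speaker distance
    return max(0.0, d_ap - d_an + margin)
```

In training, anchors and positives come from the same speaker and negatives from a different one; the network weights are updated to drive this loss to zero.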

    Identity verification using voice and its use in a privacy-preserving system

    Since security has been a growing concern in recent years, the field of biometrics has gained popularity and become an active research area. Besides new identity authentication and recognition methods, protection against theft of biometric data and potential privacy loss are current directions in biometric systems research. Biometric traits used for verification can be grouped into two categories: physical and behavioral traits. Physical traits such as fingerprints and iris patterns are characteristics that do not undergo major changes over time. On the other hand, behavioral traits such as voice, signature, and gait are more variable; they are therefore more suitable for lower-security applications. Behavioral traits such as voice and signature also have the advantage of being able to generate numerous different biometric templates of the same modality (e.g., different pass-phrases or signatures), in order to provide cancelability of the biometric template and to prevent cross-matching of different databases. In this thesis, we present three new biometric verification systems based mainly on the voice modality. First, we propose a text-dependent (TD) system where acoustic features are extracted from individual frames of the utterances, after they are aligned via phonetic HMMs. Data from 163 speakers from the TIDIGITS database are employed for this work and the best equal error rate (EER) is reported as 0.49% for 6-digit user passwords. Second, a text-independent (TI) speaker verification method is implemented, inspired by the feature extraction method utilized for our text-dependent system. Our proposed TI system depends on creating speaker-specific phoneme codebooks. Once phoneme codebooks are created at the enrollment stage, using HMM alignment and segmentation to extract discriminative user information, test utterances are verified by calculating the total dissimilarity/distance to the claimed codebook.
For benchmarking, a GMM-based TI system is implemented as a baseline. The results of the proposed TD system (0.22% EER for 7-digit passwords) are superior to those of the GMM-based system (0.31% EER for 7-digit sequences), whereas the proposed TI system yields worse results (5.79% EER for 7-digit sequences) using the data of 163 people from the TIDIGITS database. Finally, we introduce a new implementation of the multi-biometric template framework of Yanikoglu and Kholmatov [12], using fingerprint and voice modalities. In this framework, two biometric data are fused at the template level to create a multi-biometric template, in order to increase template security and privacy. The current work also aims to provide cancelability by exploiting the behavioral aspect of the voice modality.
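The EER figures quoted above come from sweeping a decision threshold until the false-rejection and false-acceptance rates meet; a naive reference implementation on toy scores (not the thesis data) might look like:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep a threshold over all observed scores and return the point
    where the false-rejection rate (FRR) and false-acceptance rate
    (FAR) are closest, averaging the two."""
    best_frr, best_far, best_gap = 1.0, 1.0, float("inf")
    for t in np.sort(np.concatenate([target_scores, impostor_scores])):
        frr = np.mean(target_scores < t)     # targets wrongly rejected
        far = np.mean(impostor_scores >= t)  # impostors wrongly accepted
        if abs(frr - far) < best_gap:
            best_frr, best_far, best_gap = frr, far, abs(frr - far)
    return 0.5 * (best_frr + best_far)

targets = np.array([2.0, 1.8, 1.5, 1.2, 0.9])     # genuine-trial scores
impostors = np.array([0.1, 0.3, 0.5, 0.8, 1.0])   # impostor-trial scores
eer = equal_error_rate(targets, impostors)         # 0.2 on these toy scores
```

Production toolkits interpolate between thresholds, but this discrete sweep conveys the metric behind numbers like "0.49% EER".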
    • โ€ฆ
    corecore