317 research outputs found

    Large Margin GMM for discriminative speaker verification

    Gaussian mixture models (GMM), trained using the generative criterion of maximum likelihood estimation, have been the most popular approach in speaker recognition during the last decades. This approach is also widely used in many other classification tasks and applications. Generative learning is not, however, the optimal way to address classification problems. In this paper we first present a new algorithm for discriminative learning of diagonal GMM under a large margin criterion. This algorithm has the major advantage of being highly efficient, which allows fast discriminative GMM training using large scale databases. We then evaluate its performance on a full NIST speaker verification task using NIST-SRE'2006 data. In particular, we use the popular Symmetrical Factor Analysis (SFA) for session variability compensation. The results show that our system outperforms the state-of-the-art approaches of GMM-SFA and the SVM-based one, GSL-NAP. Relative reductions of the Equal Error Rate of about 9.33% and 14.88% are respectively achieved over these systems.

    Discriminative speaker recognition using Large Margin GMM

    Most state-of-the-art speaker recognition systems are based on discriminative learning approaches. On the other hand, generative Gaussian mixture models (GMM) have been widely used in speaker recognition during the last decades. In an earlier work, we proposed an algorithm for discriminative training of GMM with diagonal covariances under a large margin criterion. In this paper, we propose an improvement of this algorithm which has the major advantage of being computationally highly efficient, and thus well suited to handling large scale databases. We also develop a new strategy to detect and handle the outliers that occur in the training data. To evaluate the performance of our new algorithm, we carry out full NIST speaker identification and verification tasks using NIST-SRE'2006 data, in a Symmetrical Factor Analysis compensation scheme. The results show that our system significantly outperforms the traditional discriminative Support Vector Machines (SVM) based system of SVM-GMM supervectors in the two speaker recognition tasks.

    Online adaptive learning of continuous-density hidden Markov models based on multiple-stream prior evolution and posterior pooling

    We introduce a new adaptive Bayesian learning framework, called multiple-stream prior evolution and posterior pooling, for online adaptation of the continuous density hidden Markov model (CDHMM) parameters. Among the three architectures we proposed for this framework, we study in detail a specific two-stream system where linear transformations are applied to the mean vectors of the CDHMMs to control the evolution of their prior distribution. This new stream of prior distribution can be combined with another stream of prior distribution evolved without any constraints applied. In a series of speaker adaptation experiments on the task of continuous Mandarin speech recognition, we show that the new adaptation algorithm achieves fast-adaptation performance similar to that of incremental maximum likelihood linear regression (MLLR) when only a small amount of adaptation data is available, while maintaining the good asymptotic convergence property of our previously proposed quasi-Bayes adaptation algorithms.

    Deep learning-based speaker embedding extraction for speaker recognition robust to non-speaker factors

    Doctoral thesis -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Kim Nam Soo.

    Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), since the deep learning-based embedding systems are trained in a fully supervised manner, it is impossible for them to utilize unlabeled datasets when training. In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding and a representation for the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the utilization of unlabeled datasets. Furthermore, in order to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both VAE- and ALI-based embedding techniques have shown great performance in terms of short duration speaker verification, outperforming the conventional i-vector framework. Additionally, we present a fully supervised training method for disentangling the non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and trains them to have maximum information on their main task while ensuring maximum uncertainty on their sub-task. Since the proposed method does not require any heuristic training strategy as in the conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset, and demonstrated its capability in efficiently disentangling the recording device and emotional information from the speaker embedding.

    Contents:
    1. Introduction
    2. Conventional embedding techniques for speaker recognition: 2.1. i-vector framework; 2.2. Deep learning-based speaker embedding; 2.2.1. Deep embedding network; 2.2.2. Conventional disentanglement methods
    3. Unsupervised learning of total variability embedding for speaker verification with random digit strings: 3.1. Introduction; 3.2. Variational autoencoder; 3.3. Variational inference model for non-linear total variability embedding; 3.3.1. Maximum likelihood training; 3.3.2. Non-linear feature extraction and speaker verification; 3.4. Experiments; 3.4.1. Databases; 3.4.2. Experimental setup; 3.4.3. Effect of the duration on the latent variable; 3.4.4. Experiments with VAEs; 3.4.5. Feature-level fusion of i-vector and latent variable; 3.4.6. Score-level fusion of i-vector and latent variable; 3.5. Summary
    4. Adversarially learned total variability embedding for speaker recognition with random digit strings: 4.1. Introduction; 4.2. Adversarially learned inference; 4.3. Adversarially learned feature extraction; 4.3.1. Maximum likelihood criterion; 4.3.2. Adversarially learned inference for non-linear i-vector extraction; 4.3.3. Relationship to the VAE-based feature extractor; 4.4. Experiments; 4.4.1. Databases; 4.4.2. Experimental setup; 4.4.3. Effect of the duration on the latent variable; 4.4.4. Speaker verification and identification with different utterance-level features; 4.5. Summary
    5. Disentangled speaker and nuisance attribute embedding for robust speaker verification: 5.1. Introduction; 5.2. Joint factor embedding; 5.2.1. Joint factor embedding network architecture; 5.2.2. Training for joint factor embedding; 5.3. Experiments; 5.3.1. Channel disentanglement experiments; 5.3.2. Emotion disentanglement; 5.3.3. Noise disentanglement; 5.4. Summary
    6. Conclusion
    Bibliography
    Abstract (Korean)

    Acoustic Approaches to Gender and Accent Identification

    There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, different accents of a language exhibit more fine-grained differences between classes than languages. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state-of-the-art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification, and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of this thesis is concerned with the application of the i-Vector technique to accent identification, which is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector based accent classification that improve the standard approaches usually applied for speaker or language identification, which are insufficient. We demonstrate that very good accent identification performance is possible with acoustic methods by considering different i-Vector projections, frontend parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers we can obtain from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to 90% identification rate. This performance is even better than previously reported acoustic-phonotactic based systems on the same corpus, and is very close to performance obtained via transcription based accent identification. Finally, we demonstrate that the utilization of our techniques for speech recognition purposes leads to considerably lower word error rates. Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition

    Speaker segmentation and clustering

    This survey focuses on two challenging speech processing topics: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the advantages and disadvantages of each algorithm are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved.