
    Human and Machine Speaker Recognition Based on Short Trivial Events

    Trivial events such as coughs, laughs and sniffs are ubiquitous in human-to-human conversation. Compared to regular speech, these events are usually short and unclear, so they are generally regarded as not speaker discriminative and are largely ignored by present speaker recognition research. However, trivial events are highly valuable in particular circumstances such as forensic examination: they are less subject to intentional change, so they can be used to identify the genuine speaker in disguised speech. In this paper, we collect a trivial event speech database that involves 75 speakers and 6 types of events, and report preliminary speaker recognition results on this database from both human listeners and machines. In particular, the deep feature learning technique recently proposed by our group is used to analyze and recognize the trivial events, and it achieves acceptable equal error rates (EERs) despite the extremely short durations (0.2-0.5 seconds) of these events. Among the event types compared, 'hmm' appears to be the most speaker discriminative. Comment: ICASSP 201
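The equal error rate mentioned in the abstract above is the operating point where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER could be estimated from verification scores, assuming higher scores mean "same speaker"; the function name and the threshold sweep are illustrative choices, not the evaluation code used in the paper.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; assumes higher score = more
    likely the same speaker.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False rejection: a genuine (target) trial scored below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance: an impostor (non-target) trial scored at or above it.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

With finitely many trials the two error curves rarely cross exactly, so this sketch averages the two rates at the closest threshold rather than interpolating.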

    Secure Automatic Speaker Verification Systems

    A growing number of voice-enabled devices and applications treat automatic speaker verification (ASV) as a fundamental component. However, wide adoption of ASV in critical domains, e.g., financial services and health care, is not possible unless we overcome the security breaches caused by voice cloning and replayed audio, collectively known as spoofing attacks. Audio spoofing attacks on ASV systems strictly limit the usability of voice-enabled applications, and the counterfeiter also remains untraceable. To overcome these vulnerabilities, a secure ASV (SASV) system is presented in this dissertation. The proposed SASV system is based on novel sign modified acoustic local ternary pattern (sm-ALTP) features and an asymmetric bagging-based classifier ensemble. The proposed audio representation clusters the high- and low-frequency components in audio frames by normally distributing the frequency components against a convex function. Neighborhood statistics are then applied to capture user-specific vocal tract information. This information is used by the classifier ensemble, which relies on a weighted normalized voting rule to detect various spoofing attacks. In contrast to existing ASV systems, the proposed SASV system detects not only conventional spoofing attacks (i.e., voice cloning and replays) but also new attacks that the research community has not yet explored and that are likely to matter in the future. In this regard, the dissertation introduces the concept of cloned replays, in which the replayed audio contains both microphone characteristics and voice cloning artifacts. This models the scenario in which voice cloning is applied in real time. The voice cloning artifacts suppress the microphone characteristics and thus defeat replay detection modules; similarly, the added microphone characteristics deceive voice cloning detection. Furthermore, the proposed scheme can provide a possible clue about the counterfeiter through a voice cloning algorithm detection module, which is also a novel concept proposed in this dissertation. This module determines which voice cloning algorithm was used to generate the fake audio. Overall, the proposed SASV system simultaneously verifies bona fide speakers and detects the voice cloning attack, the cloning algorithm used to synthesize the cloned audio (in the defined settings), and voice replay attacks on the ASVspoof 2019 dataset. In addition, the proposed method detects voice replay and cloned voice replay attacks on the VSDC dataset. Rigorous experimentation against state-of-the-art approaches also confirms the robustness of the proposed research.

    Deep Learning-Based Speaker Embedding Extraction for Speaker Recognition Robust to Non-Speaker Factors

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Nam Soo Kim (김남수). In recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most classical embedding techniques, deep learning-based methods are known to suffer severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), deep learning-based embedding systems are trained in a fully supervised manner, so they cannot use unlabeled datasets during training. In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding and a representation of the uncertainty within the input speech distribution. Unlike conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the use of unlabeled datasets. Furthermore, to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both the VAE- and ALI-based embedding techniques have shown strong performance in short-duration speaker verification, outperforming the conventional i-vector framework. Additionally, we present a fully supervised training method for disentangling non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and trains them to carry maximum information on their main task while ensuring maximum uncertainty on their sub-task. 
Since the proposed method does not require any heuristic training strategy, unlike conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset, and has demonstrated its capability to efficiently disentangle recording device and emotional information from the speaker embedding.

    Contents:
    1. Introduction
    2. Conventional embedding techniques for speaker recognition
        2.1. i-vector framework
        2.2. Deep learning-based speaker embedding
            2.2.1. Deep embedding network
            2.2.2. Conventional disentanglement methods
    3. Unsupervised learning of total variability embedding for speaker verification with random digit strings
        3.1. Introduction
        3.2. Variational autoencoder
        3.3. Variational inference model for non-linear total variability embedding
            3.3.1. Maximum likelihood training
            3.3.2. Non-linear feature extraction and speaker verification
        3.4. Experiments
            3.4.1. Databases
            3.4.2. Experimental setup
            3.4.3. Effect of the duration on the latent variable
            3.4.4. Experiments with VAEs
            3.4.5. Feature-level fusion of i-vector and latent variable
            3.4.6. Score-level fusion of i-vector and latent variable
        3.5. Summary
    4. Adversarially learned total variability embedding for speaker recognition with random digit strings
        4.1. Introduction
        4.2. Adversarially learned inference
        4.3. Adversarially learned feature extraction
            4.3.1. Maximum likelihood criterion
            4.3.2. Adversarially learned inference for non-linear i-vector extraction
            4.3.3. Relationship to the VAE-based feature extractor
        4.4. Experiments
            4.4.1. Databases
            4.4.2. Experimental setup
            4.4.3. Effect of the duration on the latent variable
            4.4.4. Speaker verification and identification with different utterance-level features
        4.5. Summary
    5. Disentangled speaker and nuisance attribute embedding for robust speaker verification
        5.1. Introduction
        5.2. Joint factor embedding
            5.2.1. Joint factor embedding network architecture
            5.2.2. Training for joint factor embedding
        5.3. Experiments
            5.3.1. Channel disentanglement experiments
            5.3.2. Emotion disentanglement
            5.3.3. Noise disentanglement
        5.4. Summary
    6. Conclusion
    Bibliography
    Abstract (Korean)
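
For readers unfamiliar with the Kullback-Leibler regularization term that the thesis abstract above refers to, the standard VAE training objective (the evidence lower bound) can be written as follows. This is the textbook form, given here only as a reference point; the symbols (x for the input speech features, z for the latent variable, θ and φ for decoder and encoder parameters) are assumed notation, and the exact formulation used in the thesis may differ.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

The first term rewards accurate reconstruction of the input, while the second pulls the approximate posterior toward the prior; it is this second term that the abstract identifies as a potential source of information loss and that motivates the ALI-based alternative.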

    Sketching for Large-Scale Learning of Mixture Models

    Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass over the training set, and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over 10^8 training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
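
The single-pass, mergeable sketch described in the abstract above can be illustrated with a small example. The sketch below assumes the generalized moments are random Fourier moments (samples of the empirical characteristic function); the class name, the function `make_frequencies`, and the plain Gaussian frequency distribution are hypothetical placeholders, not the sketching heuristic proposed in the paper, and the parameter recovery step is not shown.

```python
import numpy as np

def make_frequencies(dim, num_freqs, scale=1.0, seed=0):
    """Draw random frequency vectors (assumed Gaussian, for illustration only)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=scale, size=(num_freqs, dim))

class Sketch:
    """Running average of exp(-i * w^T x) over all samples seen so far."""

    def __init__(self, freqs):
        self.freqs = freqs
        self.total = np.zeros(freqs.shape[0], dtype=complex)
        self.count = 0

    def update(self, batch):
        # One pass over a batch; the raw samples need not be stored afterwards.
        batch = np.atleast_2d(batch)
        self.total += np.exp(-1j * batch @ self.freqs.T).sum(axis=0)
        self.count += batch.shape[0]

    def merge(self, other):
        # Sketches built on separate shards or streams combine by addition.
        self.total += other.total
        self.count += other.count

    def value(self):
        return self.total / self.count
```

Because the sketch is a plain average, its size depends only on the number of frequencies and not on the number of training samples, which is what makes it suitable for streams and distributed datasets.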

    One-Class Classification: Taxonomy of Study and Review of Techniques

    One-class classification (OCC) algorithms aim to build classification models when the negative class is absent, poorly sampled or not well defined. This situation constrains the learning of efficient classifiers, because the class boundary must be defined using knowledge of the positive class alone. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC through a taxonomy of study for OCC problems, based on the availability of training data, the algorithms used and the application domains. We then delve into each category of the proposed taxonomy and present a comprehensive literature review of OCC algorithms, techniques and methodologies, with a focus on their significance, limitations and applications. We conclude by discussing some open research problems in the field of OCC and presenting our vision for future research. Comment: 24 pages + 11 pages of references, 8 figure