13,970 research outputs found

    Text-independent bilingual speaker verification system.

    Ma Bin. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 96-102). Abstracts in English and Chinese.

    Abstract
    Acknowledgement
    Chapter 1  Introduction
        1.1  Biometrics
        1.2  Speaker Verification
        1.3  Overview of Speaker Verification Systems
        1.4  Text Dependency
            1.4.1  Text-Dependent Speaker Verification
            1.4.2  GMM-based Speaker Verification
        1.5  Language Dependency
        1.6  Normalization Techniques
        1.7  Objectives of the Thesis
        1.8  Thesis Organization
    Chapter 2  Background
        2.1  Background Information
            2.1.1  Speech Signal Acquisition
            2.1.2  Speech Processing
            2.1.3  Engineering Model of Speech Signal
            2.1.4  Speaker Information in the Speech Signal
            2.1.5  Feature Parameters
                2.1.5.1  Mel-Frequency Cepstral Coefficients
                2.1.5.2  Linear Predictive Coding Derived Cepstral Coefficients
                2.1.5.3  Energy Measures
                2.1.5.4  Derivatives of Cepstral Coefficients
            2.1.6  Evaluating Speaker Verification Systems
        2.2  Common Techniques
            2.2.1  Template Model Matching Methods
            2.2.2  Statistical Model Methods
                2.2.2.1  HMM Modeling Technique
                2.2.2.2  GMM Modeling Techniques
                2.2.2.3  Gaussian Mixture Model
                2.2.2.4  The Advantages of GMM
            2.2.3  Likelihood Scoring
            2.2.4  General Approach to Decision Making
            2.2.5  Cohort Normalization
                2.2.5.1  Probability Score Normalization
                2.2.5.2  Cohort Selection
        2.3  Chapter Summary
    Chapter 3  Experimental Corpora
        3.1  The YOHO Corpus
            3.1.1  Design of the YOHO Corpus
            3.1.2  Data Collection Process of the YOHO Corpus
            3.1.3  Experimentation with the YOHO Corpus
        3.2  CUHK Bilingual Speaker Verification Corpus
            3.2.1  Design of the CUBS Corpus
            3.2.2  Data Collection Process for the CUBS Corpus
        3.3  Chapter Summary
    Chapter 4  Text-Dependent Speaker Verification
        4.1  Front-End Processing on the YOHO Corpus
        4.2  Cohort Normalization Setup
        4.3  HMM-based Speaker Verification Experiments
            4.3.1  Subword HMM Models
            4.3.2  Experimental Results
                4.3.2.1  Comparison of Feature Representations
                4.3.2.2  Effect of Cohort Normalization
        4.4  Experiments on GMM-based Speaker Verification
            4.4.1  Experimental Setup
            4.4.2  The Number of Gaussian Mixture Components
            4.4.3  The Effect of Cohort Normalization
            4.4.4  Comparison of HMM and GMM
        4.5  Comparison with Previous Systems
        4.6  Chapter Summary
    Chapter 5  Language- and Text-Independent Speaker Verification
        5.1  Front-End Processing of the CUBS
        5.2  Language- and Text-Independent Speaker Modeling
        5.3  Cohort Normalization
        5.4  Experimental Results and Analysis
            5.4.1  Number of Gaussian Mixture Components
            5.4.2  The Cohort Normalization Effect
            5.4.3  Language Dependency
            5.4.4  Language Independency
        5.5  Chapter Summary
    Chapter 6  Conclusions and Future Work
        6.1  Summary
            6.1.1  Feature Comparison
            6.1.2  HMM Modeling
            6.1.3  GMM Modeling
            6.1.4  Cohort Normalization
            6.1.5  Language Dependency
        6.2  Future Work
            6.2.1  Feature Parameters
            6.2.2  Model Quality
                6.2.2.1  Variance Flooring
                6.2.2.2  Silence Detection
            6.2.3  Conversational Speaker Verification
    Bibliography

    A Generative Model for Score Normalization in Speaker Recognition

    We propose a theoretical framework for thinking about score normalization, which confirms that normalization is not needed under (admittedly fragile) ideal conditions. If, however, these conditions are not met, e.g. under data-set shift between training and runtime, our theory reveals dependencies between scores that could be exploited by strategies such as score normalization. Indeed, it has been demonstrated experimentally, again and again, that various ad hoc score normalization recipes do work. We present a first attempt at using probability theory to design a generative score-space normalization model, which gives improvements similar to ZT-norm on the text-dependent RSR 2015 database.
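The ZT-norm mentioned above combines two classical recipes built on the same primitive: shifting and scaling a raw trial score by statistics gathered from impostor (cohort) scores. As a minimal numpy sketch of that primitive (Z-norm), with all scores hypothetical:

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Z-normalization: center and scale a raw verification score using
    impostor-cohort score statistics gathered for the claimed speaker model."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

# Hypothetical scores: a genuine trial sitting well above the impostor cohort
# yields a large positive normalized score.
impostors = np.array([-2.1, -1.8, -2.5, -2.0, -1.9])
print(z_norm(0.5, impostors))
```

T-norm applies the same formula with cohort statistics computed at test time against other speaker models; ZT-norm chains the two.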

    Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification

    In this paper, a novel cross-device text-independent speaker verification architecture is proposed. The majority of state-of-the-art deep architectures used for speaker verification tasks consider Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependency of adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved to be highly reliable in speaker verification models, they represent only some aspects of the short-term acoustic traits of the speaker's voice. The human voice, however, carries several linguistic levels, such as acoustics, lexicon, prosody, and phonetics, that can be utilized in speaker verification models. To compensate for these inherent shortcomings of spectro-temporal features, we enhance the proposed Siamese convolutional neural network architecture with a multilayer perceptron network that incorporates prosodic, jitter, and shimmer features. The proposed end-to-end verification architecture performs feature extraction and verification simultaneously, and it shows significant improvement over classical signal processing approaches and deep algorithms for forensic cross-device speaker verification. Comment: Accepted at the 9th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2018).
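The combination described above, one branch for spectro-temporal features and an MLP branch for prosodic/jitter/shimmer features, can be sketched in numpy. This is not the paper's implementation: the linear map over a pooled spectrogram stands in for the CNN, all weight shapes are invented for illustration, and the Siamese property is simply that both utterances pass through the same `embed` with shared weights:

```python
import numpy as np

def embed(spec, pros, Wc, Wp1, Wp2):
    """One branch of a toy Siamese embedding.

    spec: (T, F) Mel spectrogram; pros: (P,) prosodic/jitter/shimmer vector.
    A linear map over the time-pooled spectrogram stands in for the CNN,
    and a one-hidden-layer MLP handles the prosodic vector; the two parts
    are concatenated into a single utterance embedding.
    """
    cnn_part = Wc @ spec.mean(axis=0)        # stand-in for CNN features
    h = np.maximum(Wp1 @ pros, 0.0)          # ReLU hidden layer of the MLP
    mlp_part = Wp2 @ h
    return np.concatenate([cnn_part, mlp_part])

def cosine(a, b):
    """Similarity score between the two branches' embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
Wc = rng.standard_normal((16, 40))           # hypothetical shared weights
Wp1 = rng.standard_normal((8, 6))
Wp2 = rng.standard_normal((8, 8))

enroll = embed(rng.standard_normal((50, 40)), rng.standard_normal(6), Wc, Wp1, Wp2)
test = embed(rng.standard_normal((50, 40)), rng.standard_normal(6), Wc, Wp1, Wp2)
print(cosine(enroll, test))                  # accept/reject after thresholding
```

In the real system the weights are trained end-to-end on same-speaker/different-speaker pairs, so feature extraction and verification are learned jointly.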

    Deep Speaker Feature Learning for Text-independent Speaker Verification

    Recently, deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when they are applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrate that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the EER can be as low as 7.68%. This effectively confirms that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames. Comment: deep neural networks, speaker verification, speaker feature learning.
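The claim that a speaker feature can be extracted from just dozens of frames amounts to computing each frame's feature from a short local context window rather than the whole utterance. A minimal numpy stand-in for that idea (the window size, weight shapes, and two-layer structure are all invented for illustration, not the CT-DNN's actual architecture):

```python
import numpy as np

def frame_features(mfcc, W1, W2, context=10):
    """Per-frame speaker features from local context only.

    mfcc: (T, D) frame-level acoustic features. Each output feature sees
    `context` frames on each side (2*context + 1 frames total), so a
    0.3-second window suffices if frames are ~10 ms apart.
    """
    T, D = mfcc.shape
    feats = []
    for t in range(context, T - context):
        window = mfcc[t - context:t + context + 1].ravel()  # local context
        h = np.maximum(W1 @ window, 0.0)                    # ReLU layer
        feats.append(W2 @ h)                                # feature vector
    return np.stack(feats)                                  # (T - 2*context, d_out)
```

A back-end (e.g. cosine scoring between averaged features) would then compare two utterances.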

    Attentive Statistics Pooling for Deep Speaker Embedding

    This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method uses an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations, so it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and VoxCeleb data sets shows that it reduces equal error rates (EERs) relative to the conventional method by 7.5% and 8.1%, respectively. Comment: Proc. Interspeech 2018, pp. 2252-2256. arXiv admin note: text overlap with arXiv:1809.0931.
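The pooling step described above is compact enough to sketch directly: a small attention network scores each frame, the scores are softmax-normalized into weights, and the weighted mean and weighted standard deviation are concatenated. The one-hidden-layer attention parameters `w`, `b`, `v` below are an assumed minimal form, not the paper's exact configuration:

```python
import numpy as np

def attentive_stats_pooling(frames, w, b, v):
    """Attentive statistics pooling over frame-level features.

    frames: (T, D) frame-level features; w: (D, H), b: (H,), v: (H,)
    parameterize a one-hidden-layer attention network producing one
    scalar score per frame. Returns concatenated weighted mean and
    weighted standard deviation, shape (2*D,).
    """
    e = np.tanh(frames @ w + b) @ v                 # (T,) attention scores
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                            # softmax over frames
    mean = (alpha[:, None] * frames).sum(axis=0)    # weighted mean
    var = (alpha[:, None] * frames ** 2).sum(axis=0) - mean ** 2
    std = np.sqrt(np.maximum(var, 1e-9))            # weighted standard deviation
    return np.concatenate([mean, std])
```

With `v = 0` every frame gets equal weight and the mean half reduces to conventional average pooling, which makes the role of the attention scores easy to see.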