
    Discriminative features for GMM and i-vector based speaker diarization

    Speaker diarization has received considerable research attention over the last decade. Among the different domains of speaker diarization, the meeting domain is the most challenging: it usually contains spontaneous speech and is, for example, susceptible to reverberation. The appropriate selection of speech features is one of the factors that affect the performance of speaker diarization systems. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used short-term speech features in speaker diarization. Other factors that affect the performance of speaker diarization systems are the techniques employed to perform both speaker segmentation and speaker clustering. In this thesis, we have proposed the use of jitter and shimmer long-term voice-quality features for both Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. The voice-quality features are used together with state-of-the-art short-term cepstral and long-term speech features. The long-term features consist of prosody and Glottal-to-Noise Excitation ratio (GNE) descriptors. Firstly, the voice-quality, prosodic and GNE features are stacked in the same feature vector. Then, they are fused with the cepstral coefficients at the score-likelihood level in both the proposed GMM and i-vector based speaker diarization systems. For the proposed GMM based speaker diarization system, independent HMMs are estimated from the short-term and long-term speech feature sets. In speaker segmentation, the fusion of the short-term descriptors with the long-term ones is carried out by linearly weighting the log-likelihood scores of Viterbi decoding. In speaker clustering, the fusion of the short-term cepstral features with the long-term ones is carried out by linearly fusing the Bayesian Information Criterion (BIC) scores corresponding to these feature sets.
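The linear BIC score fusion just described can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name and the fusion weight `alpha` are hypothetical, and in practice the weight would be tuned on development data.

```python
def fused_bic(bic_cepstral, bic_longterm, alpha=0.8):
    """Linearly fuse the BIC scores computed separately on the
    short-term cepstral stream and the long-term stream
    (voice-quality + prosody + GNE).

    alpha is a hypothetical fusion weight in [0, 1].
    """
    return alpha * bic_cepstral + (1.0 - alpha) * bic_longterm

# Two candidate clusters are merged when the fused score suggests they
# belong to the same speaker (the sign convention depends on the exact
# BIC formulation used).
score = fused_bic(12.5, 4.0)
```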
    For the proposed i-vector based speaker diarization system, speaker segmentation is carried out in exactly the same way as in the GMM based system described above. However, the speaker clustering technique is based on the recently introduced factor analysis paradigm. Two sets of i-vectors are extracted from the speaker segmentation hypothesis. Whilst the first i-vector is extracted from the short-term cepstral features, the second one is extracted from the voice-quality, prosody and GNE descriptors. Then, the cosine-distance and Probabilistic Linear Discriminant Analysis (PLDA) scores of the i-vectors are linearly weighted to obtain a fused similarity score. Finally, the fused score is used as the speaker clustering distance. We have also proposed the use of delta dynamic features for speaker clustering. The motivation for using deltas in clustering is that they capture the transitional characteristics of the speech signal, which contain speaker-specific information that is not captured by the static cepstral coefficients. The delta features are used together with the short-term static cepstral coefficients and the long-term speech features (i.e., voice-quality, prosody and GNE) in both the GMM and i-vector based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The experimental results show that the use of voice-quality, prosody, GNE and delta dynamic features improves the performance of both GMM and i-vector based speaker diarization systems.
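The fused i-vector clustering distance can be sketched as follows. This is a simplified illustration under stated assumptions: `beta` is a hypothetical fusion weight, and the PLDA score is treated as an opaque input rather than computed here.

```python
import math

def cosine_score(u, v):
    """Cosine similarity between two i-vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fused_similarity(score_cepstral, score_longterm, beta=0.7):
    """Linearly weight the per-stream similarity scores (cosine or
    PLDA) to obtain the fused clustering score; beta is a hypothetical
    weight that would be tuned on development data."""
    return beta * score_cepstral + (1.0 - beta) * score_longterm
```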

    Identity verification using voice and its use in a privacy preserving system

    Since security has been a growing concern in recent years, the field of biometrics has gained popularity and become an active research area. Besides new identity authentication and recognition methods, protection against theft of biometric data and potential privacy loss are current directions in biometric systems research. Biometric traits used for verification can be grouped into two categories: physical and behavioral. Physical traits such as fingerprints and iris patterns are characteristics that do not undergo major changes over time. On the other hand, behavioral traits such as voice, signature, and gait are more variable; they are therefore more suitable for lower-security applications. Behavioral traits such as voice and signature also have the advantage of being able to generate numerous different biometric templates of the same modality (e.g. different pass-phrases or signatures), in order to provide cancelability of the biometric template and to prevent cross-matching of different databases. In this thesis, we present three new biometric verification systems based mainly on the voice modality. First, we propose a text-dependent (TD) system where acoustic features are extracted from individual frames of the utterances, after they are aligned via phonetic HMMs. Data from 163 speakers from the TIDIGITS database are employed for this work, and the best equal error rate (EER) is reported as 0.49% for 6-digit user passwords. Second, a text-independent (TI) speaker verification method is implemented, inspired by the feature extraction method utilized for our text-dependent system. Our proposed TI system depends on creating speaker-specific phoneme codebooks. Once phoneme codebooks are created at the enrollment stage, using HMM alignment and segmentation to extract discriminative user information, test utterances are verified by calculating the total dissimilarity/distance to the claimed codebook.
    For benchmarking, a GMM-based TI system is implemented as a baseline. The results of the proposed TD system (0.22% EER for 7-digit passwords) are superior to those of the GMM-based system (0.31% EER for 7-digit sequences), whereas the proposed TI system yields worse results (5.79% EER for 7-digit sequences) using the data of 163 people from the TIDIGITS database. Finally, we introduce a new implementation of the multi-biometric template framework of Yanikoglu and Kholmatov [12], using the fingerprint and voice modalities. In this framework, two biometric data are fused at the template level to create a multi-biometric template, in order to increase template security and privacy. The current work also aims to provide cancelability by exploiting the behavioral aspect of the voice modality.
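The equal error rate figures quoted above are the operating point where the false-accept and false-reject rates coincide. A rough sketch of how an EER can be estimated from genuine and impostor score sets (a simple threshold sweep, not the exact evaluation code used in the thesis):

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER: sweep thresholds over all observed scores and
    return the midpoint of FAR and FRR where their gap is smallest.

    genuine  -- similarity scores of true-claimant trials
    impostor -- similarity scores of impostor trials
    Higher scores are assumed to indicate acceptance.
    """
    best = None
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]
```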

    Automatic speaker recognition: modelling, feature extraction and effects of clinical environment

    Speaker recognition is the task of establishing the identity of an individual based on his/her voice. It has significant potential as a convenient biometric method for telephony applications and does not require sophisticated or dedicated hardware. The speaker recognition task is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech. The features are used to generate statistical models of different speakers. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Current state-of-the-art speaker recognition systems use the Gaussian mixture model (GMM) technique in combination with the Expectation Maximization (EM) algorithm to build the speaker models. The most frequently used features are the Mel Frequency Cepstral Coefficients (MFCC). This thesis investigated areas of possible improvement in the field of speaker recognition. The identified drawbacks of current speaker recognition systems included slow convergence rates of the modelling techniques and the features' sensitivity to changes due to speaker aging, use of alcohol and drugs, and changing health and mental conditions. The thesis proposed a new method of deriving the Gaussian mixture model (GMM) parameters, called the EM-ITVQ algorithm. The EM-ITVQ showed significantly improved equal error rates and higher convergence rates compared to the classical GMM based on the expectation maximization (EM) method. It was demonstrated that features based on the nonlinear model of speech production (TEO-based features) provided better performance compared to the conventional MFCC features. For the first time, the effect of clinical depression on speaker verification rates was tested. It was demonstrated that speaker verification results deteriorate if the speakers are clinically depressed.
    The deterioration process was demonstrated using conventional (MFCC) features. The thesis also showed that when replacing the MFCC features with features based on the nonlinear model of speech production (TEO-based features), the detrimental effect of clinical depression on speaker verification rates can be reduced.
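For reference, one iteration of the classical EM algorithm for a one-dimensional GMM (the baseline against which EM-ITVQ is compared) might look like the sketch below. It is illustrative only and does not implement the proposed EM-ITVQ variant.

```python
import math

def gaussian(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture model."""
    K = len(means)
    # E-step: posterior responsibility of each component for each point
    resp = []
    for x in data:
        p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(K)]
        z = sum(p)
        resp.append([pk / z for pk in p])
    # M-step: re-estimate mixture parameters from the responsibilities
    n = [sum(r[k] for r in resp) for k in range(K)]
    weights = [n[k] / len(data) for k in range(K)]
    means = [sum(r[k] * x for r, x in zip(resp, data)) / n[k] for k in range(K)]
    variances = [sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / n[k]
                 for k in range(K)]
    return weights, means, variances
```

Repeating `em_step` until the parameters stop moving is what "convergence rate" refers to above: the fewer iterations needed, the faster the speaker models can be built.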

    Speaker verification using a novel set of dynamic features

    Dynamic cepstral features such as delta and delta-delta cepstra have been shown to play an essential role in capturing the transitional characteristics of the speech signal.
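Delta features are conventionally computed with a regression over a window of neighbouring frames. A minimal sketch for a single cepstral-coefficient track (the window size N and the edge-padding strategy are illustrative choices, not taken from this work):

```python
def delta(frames, N=2):
    """First-order dynamic (delta) coefficients via the standard
    regression formula d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2),
    replicating the edge frames as padding."""
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        num = 0.0
        for n in range(1, N + 1):
            right = frames[min(t + n, T - 1)]  # clamp at the last frame
            left = frames[max(t - n, 0)]       # clamp at the first frame
            num += n * (right - left)
        out.append(num / denom)
    return out
```

Applying the same routine to the delta track yields delta-delta (acceleration) coefficients.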