    Comparison GMM and SVM Classifier for Automatic Speaker Verification

    The objective of this thesis is to develop automatic text-independent speaker verification systems using unconstrained telephone conversational speech. We began by performing a Gaussian Mixture Model (GMM) likelihood-ratio verification task in a speaker-independent system, as described by MIT Lincoln Lab. We next introduced a speaker-dependent verification system based on speaker-dependent thresholds. We then implemented the same system using a Support Vector Machine (SVM). For the SVM, we used polynomial kernels and radial basis function kernels and compared their performance. For training and testing the systems, we used low-level spectral features. Finally, we provided a performance assessment of these systems using the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation 2008 telephone corpora.
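
    The likelihood-ratio test named in the abstract above can be illustrated with a minimal sketch. It assumes diagonal-covariance GMMs whose parameters have already been trained, and it scores a test utterance by the average per-frame log-likelihood difference between a speaker model and a background model. The function names, the tuple layout `(weights, means, variances)`, and the default threshold are assumptions made for illustration, not details taken from the thesis.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D) feature vectors; weights: (M,);
    means, variances: (M, D). All parameters are assumed pre-trained.
    """
    diff = frames[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)   # (T, M)
    weighted = log_comp + np.log(weights)                               # (T, M)
    # Log-sum-exp over mixture components, stabilised by the row maximum.
    peak = weighted.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.sum(np.exp(weighted - peak), axis=1))

def llr_score(frames, speaker_gmm, background_gmm):
    """Average log-likelihood ratio of speaker model vs. background model."""
    speaker_ll = gmm_log_likelihood(frames, *speaker_gmm)
    background_ll = gmm_log_likelihood(frames, *background_gmm)
    return float(np.mean(speaker_ll - background_ll))

def verify(frames, speaker_gmm, background_gmm, threshold=0.0):
    """Accept the identity claim when the score exceeds the threshold.

    Passing a different threshold per claimed speaker gives a
    speaker-dependent decision rule (hypothetical usage).
    """
    return llr_score(frames, speaker_gmm, background_gmm) > threshold
```

    In a speaker-independent setup a single threshold would be shared by all claimed identities, while a speaker-dependent setup would pass a separately tuned `threshold` for each enrolled speaker.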

    Speech Recognition in noisy environment using Deep Learning Neural Network

    Recent research in the field of automatic speaker recognition has shown that methods based on deep learning neural networks provide better performance than other statistical classifiers. On the other hand, these methods usually require the adjustment of a significant number of parameters. The goal of this thesis is to show that selecting appropriate parameter values can significantly improve the speaker recognition performance of methods based on deep learning neural networks. The reported study introduces an approach to automatic speaker recognition based on deep neural networks and the stochastic gradient descent algorithm. It focuses in particular on three parameters of the stochastic gradient descent algorithm: the learning rate, the hidden layer dropout rate, and the input layer dropout rate. Additional attention was devoted to the research question of speaker recognition under noisy conditions. Thus, two experiments were conducted in the scope of this thesis. The first experiment was intended to demonstrate that optimizing the observed parameters of the stochastic gradient descent algorithm can improve speaker recognition performance in the absence of noise. This experiment was conducted in two phases. In the first phase, the recognition rate was observed while the hidden layer dropout rate and the learning rate were varied and the input layer dropout rate was held constant. In the second phase, the recognition rate was observed while the input layer dropout rate and the learning rate were varied and the hidden layer dropout rate was held constant. The second experiment was intended to show that optimizing the observed parameters of the stochastic gradient descent algorithm can improve speaker recognition performance even under noisy conditions. To this end, different noise levels were artificially applied to the original speech signal.
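
    The abstract does not say how the noise was applied. One common way to apply noise artificially at a controlled level is to scale a noise recording so that the mixture reaches a chosen signal-to-noise ratio (SNR). The sketch below illustrates that idea under the assumption of additive noise and non-silent inputs. The function names and the SNR values in the usage note are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB.

    Assumes additive noise and that neither signal is all zeros.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise so it covers the whole speech signal.
    repeats = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, repeats)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(speech_power / scaled_noise_power)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise

def measured_snr_db(clean, noisy):
    """SNR of a noisy signal given its clean reference, in dB."""
    clean = np.asarray(clean, dtype=float)
    residual = np.asarray(noisy, dtype=float) - clean
    return float(10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2)))
```

    A noisy test set could then be produced by calling `mix_at_snr` on each utterance for a few SNR values, for example 20, 10, and 0 dB, and repeating the recognition experiment at each level.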

    Approximate Bayesian inference for robust speech processing

    Speech processing applications such as speech enhancement and speaker identification rely on the estimation of relevant parameters from the speech signal. These parameters must often be estimated from noisy observations, since speech signals are rarely obtained in ‘clean’ acoustic environments in the real world. As a result, the parameter estimation algorithms we employ must be robust to environmental factors such as additive noise and reverberation. In this work we derive and evaluate approximate Bayesian algorithms for the following speech processing tasks: 1) speech enhancement, 2) speaker identification, 3) speaker verification, and 4) voice activity detection. Building on previous work in the field of statistical model-based speech enhancement, we derive speech enhancement algorithms that rely on speaker-dependent priors over linear prediction parameters. These speaker-dependent priors allow us to handle speech enhancement and speaker identification in a joint framework. Furthermore, we show how these priors allow voice activity detection to be performed in a robust manner. We also develop algorithms in the log spectral domain with applications in robust speaker verification. The use of speaker-dependent priors in the log spectral domain is shown to improve equal error rates in noisy environments and to compensate for mismatch between training and testing conditions. Ph.D., Electrical Engineering -- Drexel University, 201
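
    The equal error rate (EER) mentioned in the abstract above is the operating point at which the false-acceptance and false-rejection rates of a verification system coincide. The following sketch shows one simple way to estimate it from two lists of trial scores. The function name and the threshold sweep over observed scores are illustrative assumptions, not the evaluation procedure used in the thesis.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER and its threshold from verification scores.

    A trial is accepted when its score is >= the threshold. Every
    observed score is tried as a candidate threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    # False-acceptance rate: impostor trials that are accepted.
    far = np.array([np.mean(impostor >= t) for t in thresholds])
    # False-rejection rate: genuine trials that are rejected.
    frr = np.array([np.mean(genuine < t) for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    best = int(np.argmin(np.abs(far - frr)))
    eer = (far[best] + frr[best]) / 2.0
    return float(eer), float(thresholds[best])
```

    With finite score sets the two error curves rarely cross exactly, so this sketch reports the average of the two rates at the closest threshold. Interpolating between thresholds is a common refinement.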

    A fala sussurrada : alguns aspetos na fonética forense (Whispered speech: some aspects in forensic phonetics)

    One of the goals of forensic phonetics is speaker identification. This investigation sought the methods and parameters needed to recognize the author of a recording produced in a whisper. A further goal was to obtain data to confirm or refute the effectiveness of a perception-based method of investigation. Two recordings from each of six speakers (native speakers of European Portuguese) are analyzed: one produced with normal speech and one with whispered speech. Once the recordings were obtained, the work was divided into two independent but complementary tasks. In the first task (Whisper in the spectrogram), the recordings were analyzed by means of spectrograms. Analyzing whisper is not considered easy because whispering eliminates some fundamental patterns of speech. For this reason, whispered speech is often the most effective technique for concealing one's identity. The goal was therefore to find patterns that allow the author of the recordings to be recognized despite the disguise. In the second task (Perceiving whisper), which is perceptual in nature, the whispered recordings were played to people who know the speaker but do not know that this person is the author of the recording. The goal was to obtain data to confirm or refute the effectiveness of this method of investigation (human perception). The data from the two tasks were then compared. The findings are as follows. Regarding the first task (Whisper in the spectrogram), the results indicate that some patterns are robust to whispering: the values of the first formant (F1) and second formant (F2) of the vowels, the sum of the burst duration and voice onset time (V.O.T.) of the plosive consonants, the nasal formant, and the formant of the fricative consonants. The data indicate that a global view must be maintained, rather than focusing on only one of these parameters.
Regarding the second task (Perceiving whisper), it was found that, although a purely perceptual method can provide valuable indications at an initial stage, it is not reliable, and scientific methods are therefore preferable.

    Caractérisation des cris des nourrissons en vue du diagnostic précoce de différentes pathologies (Characterization of infant cries for the early diagnosis of different pathologies)

    The use of cry signals in diagnosis is based on theories proposed by various researchers in the field. The main objective of their work was the spectrographic analysis and modeling of cry signals. They demonstrated that the acoustic characteristics of newborns' cries are linked to particular medical conditions. This thesis aims to help improve the accuracy of pathological cry recognition by combining several acoustic parameters derived from spectrographic analysis with parameters that characterize the vocal folds and the vocal tract. Acoustic features representing the vocal tract have been widely used for cry classification, whereas vocal fold features for automatic cry recognition, along with efficient techniques for extracting them, have not been exploited. To meet this objective, we first carried out a qualitative characterization of the cries of healthy and sick newborns, using features defined in the literature that describe the behavior of the vocal folds and the vocal tract during crying. This step allowed us to identify the most important features for differentiating the pathological cries studied. To extract the selected features, we implemented efficient measurement methods that overcome the overestimation and underestimation of the features. The quantification approach proposed and used in this work facilitates the automatic analysis of cries and allows these features to be used effectively in the diagnostic system. We also carried out experimental tests to validate all the approaches introduced in this thesis.
The results are satisfactory and show an improvement in the recognition of cries by pathology. The work carried out is presented in this thesis in the form of three articles published in different journals. Two further articles, published in peer-reviewed conference proceedings, are presented in the appendices.