265 research outputs found

    Deep learning backend for single and multisession i-vector speaker recognition

    The lack of labeled background data creates a large performance gap between the cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring baselines for i-vectors in speaker recognition. Although some unsupervised clustering techniques can estimate the labels, they cannot predict the true labels accurately, and they assume that the background data contain several samples from each speaker, which may not hold in practice. In this paper, the authors use Deep Learning (DL) to fill this performance gap given unlabeled background data. To this end, they propose an impostor selection algorithm and a universal model adaptation process in a hybrid system based on deep belief networks and deep neural networks that discriminatively models each target speaker. To gain more insight into the behavior of DL techniques, experiments are carried out on both single-session and multisession speaker enrollment tasks. Experiments on the National Institute of Standards and Technology (NIST) 2014 i-vector challenge show that the proposed DL-based system fills 46% of this performance gap in terms of the minimum of the decision cost function. Furthermore, the score combination of the proposed DL-based system and PLDA with estimated labels covers 79% of this gap. Peer reviewed. Postprint (published version)
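
The two quantities this abstract relies on, cosine scoring of i-vectors and the share of the cosine-to-PLDA gap that a system closes, can be illustrated with a minimal sketch. The function names, the toy vectors, and the minDCF values below are hypothetical and are not taken from the paper; the gap formula is an assumed reading of "percentage of the gap filled".

```python
import math

def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_target * norm_test)

def gap_filled(min_dcf_cosine, min_dcf_plda, min_dcf_system):
    """Fraction of the cosine-to-PLDA minDCF gap closed by a system.

    Assumed definition: 0.0 means no better than cosine scoring,
    1.0 means as good as PLDA trained with true labels.
    """
    return (min_dcf_cosine - min_dcf_system) / (min_dcf_cosine - min_dcf_plda)

# Toy 3-dimensional "i-vectors" (real i-vectors have hundreds of dimensions).
same_direction = cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

# Hypothetical minDCF values, chosen only to make the arithmetic easy.
fraction = gap_filled(min_dcf_cosine=0.40, min_dcf_plda=0.30, min_dcf_system=0.35)
```

Under these made-up values the system sits halfway between the two baselines, so `fraction` is about 0.5. Cosine scoring needs no speaker labels, which is why it serves as the unlabeled baseline.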

    Hidden Markov models and neural networks for speech recognition

    The Hidden Markov Model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption that observations are conditionally independent given the state. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and ..
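
The two assumptions named in this abstract, a first-order state process and state-conditional independence of observations, are what make the standard forward recursion possible. The sketch below is a generic discrete-HMM forward pass, not code from the thesis; the model parameters are toy values chosen for illustration.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(observations) under a discrete HMM via the forward recursion.

    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s (first order)
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n_states = len(init)
    # Initialization: start in state s and emit the first symbol.
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        # Each step depends only on the previous alpha (first-order process),
        # and the emission depends only on the current state.
        alpha = [
            emit[s][obs] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)

# Toy two-state model over two observation symbols (0 and 1).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward_likelihood(init, trans, emit, [0, 1, 0])
```

Because each update uses only the previous state distribution, dependencies that span more than one step cannot be represented directly, which is the limitation the abstract points to.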

    Deep learning for i-vector speaker and language recognition

    Over the last few years, i-vectors have been the state-of-the-art technique in speaker and language recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need speaker and/or phonetic labels for the background data, which are not easily accessible in practice. On the other hand, the lack of speaker-labeled background data creates a large performance gap in speaker recognition between the two well-known i-vector scoring techniques, cosine and Probabilistic Linear Discriminant Analysis (PLDA). Filling this gap without speaker labels, which are expensive in practice, has recently been a challenge. Although some unsupervised clustering techniques have been proposed to estimate the speaker labels, they cannot estimate them accurately. This thesis tries to solve the problems above by using DL technology in different ways, without any need for speaker or phonetic labels. In order to fill the performance gap between cosine and PLDA scoring given unlabeled background data, we have proposed an impostor selection algorithm and a universal model adaptation process in a hybrid system based on Deep Belief Networks (DBNs) and Deep Neural Networks (DNNs) to discriminatively model each target speaker. To gain more insight into the behavior of DL techniques, experiments have been carried out on both single-session and multi-session speaker enrollment tasks. Experiments on the National Institute of Standards and Technology (NIST) 2014 i-vector challenge show that 46% of this performance gap, in terms of minDCF, is filled by the proposed DL-based system. Furthermore, the score combination of the proposed DL-based system and PLDA with estimated labels covers 79% of this gap. 
In the second line of the research, we have developed an efficient alternative vector representation of speech by keeping the computational cost as low as possible and avoiding phonetic labels, which are not always accessible. The proposed vectors will be based on both Gaussian Mixture Models (GMMs) and Restricted Boltzmann Machines (RBMs) and will be referred to as GMM-RBM vectors. The role of RBM is to learn the total speaker and session variability among background GMM supervectors. This RBM, which will be referred to as Universal RBM (URBM), will then be used to transform unseen supervectors to the proposed low dimensional vectors. The use of different activation functions for training the URBM and different transformation functions for extracting the proposed vectors are investigated. At the end, a variant of Rectified Linear Unit (ReLU) which is referred to as Variable ReLU (VReLU) is proposed. Experiments on the core test condition 5 of the NIST Speaker Recognition Evaluation (SRE) 2010 show that comparable results with conventional i-vectors are achieved with a clearly lower computational load in the vector extraction process. Finally, for the Language Identification (LID) application, we have proposed a DNN architecture to model effectively the i-vector space of four languages, English, Spanish, German, and Finnish, in the car environment. Both raw i-vectors and session variability compensated i-vectors are evaluated as input vectors to DNN. The performance of the proposed DNN architecture is compared with both conventional GMM-UBM and i-vector/Linear Discriminant Analysis (LDA) systems considering the effect of duration of signals. 
It is shown that signals with a duration between 2 and 3 s meet the accuracy and speed requirements of this application, in which the proposed DNN architecture outperforms the GMM-UBM and i-vector/LDA systems by 37% and 28%, respectively.

    PHONOTACTIC AND ACOUSTIC LANGUAGE RECOGNITION

    This thesis deals with phonotactic and acoustic techniques for automatic language recognition (LRE). The first part of the thesis deals with phonotactic language recognition based on co-occurrences of phone sequences in speech. A thorough study of phone recognition as a tokenization technique for LRE is presented, with a focus on the amount of training data for the phone recognizer and on the combination of phone recognizers trained on several languages (Parallel Phone Recognition followed by Language Model, PPRLM). 
The thesis also deals with the novel technique of anti-models in PPRLM and investigates the use of phone lattices instead of strings. The work on the phonotactic approach concludes with a comparison of classical n-gram modeling techniques and binary decision trees. Acoustic LRE is addressed too, with the main focus on discriminative techniques for training target language acoustic models and on initial (but successful) experiments with removing channel dependencies. We have also investigated the fusion of the phonotactic and acoustic approaches. All experiments were performed on standard data from the NIST 2003, 2005 and 2007 evaluations, so the results are directly comparable to those of other laboratories in the LRE community. With the above-mentioned techniques, the fused systems defined the state of the art in the LRE field and reached excellent results in NIST evaluations.
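
The phonotactic idea described above, scoring a recognized phone string against per-language n-gram models, can be sketched with smoothed bigrams. This is a simplified, single-tokenizer illustration; PPRLM as described in the thesis combines several phone recognizers, and the phone inventory, training strings, language names, and add-alpha smoothing here are all hypothetical choices.

```python
import math
from collections import Counter

def train_bigram(phone_strings, vocab, alpha=1.0):
    """Return add-alpha smoothed bigram log-probabilities over a phone vocabulary."""
    pair_counts = Counter()
    context_counts = Counter()
    for phones in phone_strings:
        for prev, cur in zip(phones, phones[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    size = len(vocab)
    return {
        (p, c): math.log((pair_counts[(p, c)] + alpha) / (context_counts[p] + alpha * size))
        for p in vocab
        for c in vocab
    }

def score(model, phones):
    """Log-likelihood of a phone sequence under one language's bigram model."""
    return sum(model[(p, c)] for p, c in zip(phones, phones[1:]))

def identify(models, phones):
    """Pick the language whose model gives the highest log-likelihood."""
    return max(models, key=lambda lang: score(models[lang], phones))

# Toy phone inventory and made-up "training transcripts" for two languages.
vocab = ["a", "k", "s", "t"]
models = {
    "lang_A": train_bigram([list("kata"), list("taka"), list("kaka")], vocab),
    "lang_B": train_bigram([list("stas"), list("asts"), list("ssta")], vocab),
}
best = identify(models, list("taka"))
```

Replacing the single best phone string with a lattice, as the thesis investigates, would mean accumulating expected n-gram counts over lattice paths instead of counts from one string.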

    Neural PLDA Modeling for End-to-End Speaker Verification

    While deep learning models have made significant advances in supervised classification problems, the application of these models to out-of-set verification tasks like speaker recognition has been limited to deriving feature embeddings. State-of-the-art x-vector PLDA based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) to compute the verification score. Recently, we proposed a neural network approach for backend modeling in speaker verification called neural PLDA (NPLDA), in which the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end (E2E) fashion. The proposed end-to-end model is optimized directly from the acoustic features with a verification cost function, and during testing the model directly outputs the likelihood ratio score. With various experiments using the NIST speaker recognition evaluation (SRE) 2018 and 2019 datasets, we show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system. Comment: Accepted in Interspeech 2020. GitHub implementation repos: https://github.com/iiscleap/E2E-NPLDA and https://github.com/iiscleap/NeuralPld
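
For Gaussian PLDA, the log-likelihood ratio between an enrollment embedding and a test embedding can be written, up to a constant, as a quadratic function of the pair. The sketch below shows that general form as an illustration of what "posing the likelihood ratio as a similarity function" can look like. It is an assumption about the shape of the score, not the code from the linked repositories, and the matrices `p` and `q` hold hypothetical values rather than trained parameters.

```python
def quad_form(x, m, y):
    """Compute x^T M y for list-based vectors and a list-of-lists matrix."""
    return sum(x[i] * m[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def pairwise_score(enroll, test, p, q):
    """Quadratic verification score: e^T Q e + t^T Q t + 2 e^T P t.

    In a learned backend, P and Q would be trainable parameters updated
    by gradient descent on a verification cost; here they are fixed.
    """
    return (
        quad_form(enroll, q, enroll)
        + quad_form(test, q, test)
        + 2.0 * quad_form(enroll, p, test)
    )

# Hypothetical 2-dimensional parameters (real embeddings are much larger).
p = [[1.0, 0.0], [0.0, 1.0]]
q = [[-0.5, 0.0], [0.0, -0.5]]

matched = pairwise_score([1.0, 0.0], [1.0, 0.0], p, q)
mismatched = pairwise_score([1.0, 0.0], [-1.0, 0.0], p, q)
```

With these toy parameters, embeddings pointing the same way score higher than embeddings pointing in opposite directions, which is the behavior a verification backend needs.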

    Connectionist probability estimators in HMM speech recognition

    The authors are concerned with integrating connectionist networks into a hidden Markov model (HMM) speech recognition system. This is achieved through a statistical interpretation of connectionist networks as probability estimators. They review the basis of HMM speech recognition and point out the possible benefits of incorporating connectionist networks. Issues necessary to the construction of a connectionist HMM recognition system are discussed, including the choice of connectionist probability estimator. They describe the performance of such a system using a multilayer perceptron probability estimator evaluated on the speaker-independent DARPA Resource Management database. In conclusion, they show that a connectionist component improves a state-of-the-art HMM system.
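
A common way to use a network as a probability estimator inside an HMM, in the hybrid literature generally, is to divide its per-state posteriors by the state priors. By Bayes' rule this gives likelihoods scaled by a factor that is the same for every state, so they can stand in for HMM emission probabilities during decoding. The abstract does not spell out this step, so the sketch below is an assumption about the approach, and the numbers are made up.

```python
def scaled_likelihoods(posteriors, priors):
    """Convert state posteriors P(q|x) into scaled likelihoods p(x|q) / p(x).

    Bayes' rule gives p(x|q) = P(q|x) * p(x) / P(q); the factor p(x) is
    the same for every state, so it does not change which path wins.
    """
    return [post / prior for post, prior in zip(posteriors, priors)]

# Hypothetical network outputs for one frame over three HMM states,
# and state priors estimated from training-set state frequencies.
posteriors = [0.7, 0.2, 0.1]
priors = [0.5, 0.25, 0.25]
scaled = scaled_likelihoods(posteriors, priors)
```

Dividing by the priors keeps frequent states from being favored twice, once by the network output and once by the HMM transition structure.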

    Patrol team language identification system for DARPA RATS P1 evaluation

    This paper describes the language identification (LID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We show that techniques originally developed for LID on telephone speech (e.g., for the NIST language recognition evaluations) remain effective on the noisy RATS data, provided that careful consideration is applied when designing the training and development sets. In addition, we show significant improvements from the use of Wiener filtering, neural-network-based and language-dependent i-vector modeling, and fusion.