
    Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition

    In this work, we continue our research on the i-vector extractor for speaker verification (SV) and optimize its architecture for fast and effective discriminative training. We were motivated by the computational and memory requirements caused by the large number of parameters of the original generative i-vector model. Our aim is to preserve the power of the original generative model and, at the same time, focus the model on the extraction of speaker-related information. We show that it is possible to represent a standard generative i-vector extractor by a model with significantly fewer parameters and obtain similar performance on SV tasks. We can further refine this compact model by discriminative training and obtain i-vectors that lead to better performance on various SV benchmarks representing different acoustic domains.
    Comment: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note: substantial text overlap with arXiv:1810.1318
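    The abstract does not spell out the factorization itself, so the following is only a minimal sketch, assuming each per-component block of the total-variability matrix T is expressed through a shared low-rank basis so that the whole extractor can be trained discriminatively with far fewer parameters; the class name, dimensions, and identity-covariance simplification are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of a factorized, discriminatively trainable i-vector
    # extractor; the exact factorization used by the authors may differ.
    import torch
    import torch.nn as nn

    class FactorizedIVectorExtractor(nn.Module):
        def __init__(self, n_components=2048, feat_dim=60, ivec_dim=400, rank=100):
            super().__init__()
            # Instead of a full T of shape (n_components * feat_dim, ivec_dim),
            # each component block T_c = A_c @ B shares the common basis B.
            self.A = nn.Parameter(torch.randn(n_components, feat_dim, rank) * 0.01)
            self.B = nn.Parameter(torch.randn(rank, ivec_dim) * 0.01)

        def forward(self, f, n):
            # f: centered first-order statistics, shape (batch, n_components, feat_dim)
            # n: zero-order statistics, shape (batch, n_components)
            T = torch.einsum('cfr,rd->cfd', self.A, self.B)           # (C, F, D)
            TtT = torch.einsum('cfd,cfe->cde', T, T)                   # (C, D, D)
            # Posterior precision and projection of the standard i-vector model,
            # with component covariances taken as identity for brevity.
            L = torch.eye(T.shape[-1]) + torch.einsum('bc,cde->bde', n, TtT)
            Tf = torch.einsum('cfd,bcf->bd', T, f)                     # (B, D)
            return torch.linalg.solve(L, Tf.unsqueeze(-1)).squeeze(-1)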

    Supervector extraction for encoding speaker and phrase information with neural networks for text-dependent speaker verification

    In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance from a global average pooling over the temporal dimension. Our system replaces this reduction mechanism with a phonetic phrase alignment model that keeps the temporal structure of each phrase, since the phonetic information is relevant to the verification task. Moreover, we can apply a convolutional neural network as the front-end and, because the alignment process is differentiable, train the network to produce a supervector for each utterance that is discriminative with respect to both the speaker and the phrase. This choice has the advantage that the supervector encodes the phrase and speaker information jointly, providing good performance in text-dependent speaker verification tasks. The verification step is performed with a basic similarity metric. The new model using alignment to produce supervectors was evaluated on the RSR2015-Part I database, providing competitive results compared to similar-size networks that use global average pooling to extract embeddings. Furthermore, we also evaluated this proposal on RSR2015-Part II, where, to our knowledge, this system achieves the best published results.
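    As a rough illustration of how an alignment-based pooling can replace global average pooling while staying differentiable, the sketch below computes soft frame-to-state posteriors and concatenates the per-state means of the front-end features into a supervector; the linear scorer, the number of states, and the dimensions are assumptions rather than the paper's architecture.

    # Sketch of differentiable alignment pooling that yields an utterance supervector.
    import torch
    import torch.nn as nn

    class AlignmentSupervectorPooling(nn.Module):
        def __init__(self, feat_dim=256, n_states=20):
            super().__init__()
            # Frame-level state posteriors act as a soft alignment of the phrase.
            self.state_scorer = nn.Linear(feat_dim, n_states)

        def forward(self, frames):
            # frames: (batch, time, feat_dim) output of the CNN front-end.
            align = torch.softmax(self.state_scorer(frames), dim=-1)   # (B, T, S)
            num = torch.einsum('bts,btf->bsf', align, frames)          # (B, S, F)
            den = align.sum(dim=1).clamp_min(1e-6).unsqueeze(-1)       # (B, S, 1)
            state_means = num / den
            # Concatenating per-state means preserves the temporal (phrase)
            # structure as well as the speaker information in one supervector.
            return state_means.flatten(start_dim=1)                    # (B, S*F)

    Verification between an enrollment and a test supervector could then use a basic similarity metric such as cosine similarity, in line with the abstract.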

    Restricted Boltzmann machines for vector representation of speech in speaker recognition

    Over the last few years, i-vectors have been the state-of-the-art technique in speaker recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need phonetically labeled background data. The aim of this work is to develop an efficient alternative vector representation of speech by keeping the computational cost as low as possible and avoiding phonetic labels, which are not always accessible. The proposed vectors are based on both Gaussian Mixture Models (GMM) and Restricted Boltzmann Machines (RBM) and are referred to as GMM–RBM vectors. The role of the RBM is to learn the total speaker and session variability among background GMM supervectors. This RBM, referred to as the Universal RBM (URBM), is then used to transform unseen supervectors into the proposed low-dimensional vectors. The use of different activation functions for training the URBM and different transformation functions for extracting the proposed vectors is investigated. Finally, a variant of Rectified Linear Units (ReLU), referred to as variable ReLU (VReLU), is proposed. Experiments on the core test condition 5 of NIST SRE 2010 show that results comparable to conventional i-vectors are achieved with a clearly lower computational load in the vector extraction process.
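    As a minimal numpy sketch of the extraction step, a background-GMM supervector could be mapped to a GMM–RBM vector through the hidden layer of the trained URBM; the variable-ReLU form used here (a standard ReLU with a per-unit threshold) and all dimensions are assumptions, since the paper defines its own VReLU variant.

    # Sketch of turning a GMM supervector into a low-dimensional GMM-RBM vector.
    import numpy as np

    def vrelu(x, threshold):
        # Assumed form of a 'variable ReLU': zero below a per-unit threshold.
        return np.where(x > threshold, x, 0.0)

    def gmm_rbm_vector(supervector, W, b_hidden, threshold):
        # supervector: concatenated adapted GMM means, shape (C*F,)
        # W: URBM visible-to-hidden weights, shape (C*F, D)
        # b_hidden: URBM hidden biases, shape (D,)
        # The URBM has learned the total speaker/session variability among
        # background supervectors; its hidden activation is the proposed vector.
        return vrelu(supervector @ W + b_hidden, threshold)

    # Toy usage (real systems would use e.g. 2048 components x 60-dim features).
    rng = np.random.default_rng(0)
    C, F, D = 64, 20, 100
    s = rng.standard_normal(C * F)
    W = rng.standard_normal((C * F, D)) * 0.01
    vec = gmm_rbm_vector(s, W, np.zeros(D), threshold=0.0)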

    Feature extraction with multilayer neural networks for language recognition

    This Bachelor's Thesis studies the use of bottleneck features, extracted from a deep neural network trained for speech recognition, to replace the traditional MFCC acoustic features as input to an automatic UBM/i-vector language recognition system. To this end, a language recognition system following the classical UBM/i-vector approach, based on MFCC acoustic features, is first implemented and serves as the reference system. Four deep neural networks are then trained to extract bottleneck features that capture information at four different levels of abstraction, and finally four language recognition systems using these features are implemented. The neural networks are trained on the Switchboard database, while the UBM/i-vector language recognition system is trained on the audio provided by the National Institute of Standards and Technology (NIST) for the 2015 Language Recognition Evaluation Plan (LRE15). The main tools are Kaldi for implementing the language recognition systems, Theano for training the neural networks and extracting the bottleneck features, and Matlab for generating the models, obtaining the scores, and evaluating the systems. To assess the improvement that these bottleneck features can bring to language recognition, the results of the four systems that use them are compared with those of the MFCC-based reference system. System performance is measured by the language-prediction accuracy on the evaluation segments across the planned tests, and the systems with different input features are also compared in terms of the Equal Error Rate (EER) obtained by each of them.
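    As an illustration of the bottleneck idea described above, the sketch below shows a small frame-classification DNN whose narrow hidden layer would supply the features that replace MFCCs in the UBM/i-vector system; the layer sizes, senone count, and class name are assumptions, not the configuration used in the thesis (which trains its networks in Theano).

    # Sketch of a bottleneck DNN trained on ASR targets; the bottleneck layer
    # output is later used as the input features of the language recognizer.
    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        def __init__(self, in_dim=440, hidden=1024, bottleneck=80, n_senones=3000):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, bottleneck),        # narrow bottleneck layer
            )
            self.classifier = nn.Sequential(
                nn.ReLU(),
                nn.Linear(bottleneck, n_senones),     # speech-recognition targets
            )

        def forward(self, x):
            return self.classifier(self.encoder(x))

        def bottleneck_features(self, x):
            # After training on Switchboard, only this projection is kept and
            # applied frame by frame (with context splicing) to produce the
            # features fed to the UBM/i-vector language recognition system.
            with torch.no_grad():
                return self.encoder(x)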