573 research outputs found

    Behavior of a Bayesian adaptation method for incremental enrollment in speaker verification

    Get PDF
    Classical adaptation approaches are generally used for speaker or environment adaptation of speech recognition systems. In this paper, we use such techniques for the incremental training of client models in a speaker verification system. The initial model is trained on a very limited amount of data and then progressively updated with access data, using a segmental-EM procedure. In supervised mode (i.e. when access utterances are certified), the incremental approach yields equivalent performance to the batch one. We also investigate on the impact of various scenarios of impostor attacks during the incremental enrollment phase. All results are obtained with the Picassoft platform - the state-of-the-art speaker verification system developed in the PICASSO project

    ROBUST INCREMENTAL ADAPTATION OF GMM IN SPEAKER RECOGNITION

    Get PDF
    ABSTRACT: Speaker model in speaker recognition system is to be trained from a large data set uttered in multiple sessions. The large data set requires larger amount of memory and computation, and practically it's hard to make users utter a large amount of data in several sessions. Recently proposed incremental adaptation methods cover the problems. However, the data set uttered from multiple sessions is vulnerable to outliers from irregular utterance variation and presence of noise, which results in inaccurate speaker model. In this paper, we propose a robust incremental adaptation method to minimize the influence of outliers on speaker model using Gaussian Mixture Model. The robust adaptation is obtained from an incremental version of M-estimation. Speaker model is initially trained from small amount of data and it is adapted recursively according to new data available. Experimental results from the data set gathered over seven months show that the proposed method is robust against outliers

    On adaptive decision rules and decision parameter adaptation for automatic speech recognition

    Get PDF
    Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevailing training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker and environment specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine prior knowledge in an existing collection of general models with a new set of condition-specific adaptation data. In this paper, the mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described. Maximum a posteriori point estimation is then developed for hidden Markov models and a number of useful parameters densities commonly used in automatic speech recognition and natural language processing.published_or_final_versio

    Advances in Subspace-based Solutions for Diarization in the Broadcast Domain

    Get PDF
    La motivación de esta tesis es la necesidad de soluciones robustas al problema de diarización. Estas técnicas de diarización deben proporcionar valor añadido a la creciente cantidad disponible de datos multimedia mediante la precisa discriminación de los locutores presentes en la señal de audio. Desafortunadamente, hasta tiempos recientes este tipo de tecnologías solamente era viable en condiciones restringidas, quedando por tanto lejos de una solución general. Las razones detrás de las limitadas prestaciones de los sistemas de diarización son múltiples. La primera causa a tener en cuenta es la alta complejidad de la producción de la voz humana, en particular acerca de los procesos fisiológicos necesarios para incluir las características discriminativas de locutor en la señal de voz. Esta complejidad hace del proceso inverso, la estimación de dichas características a partir del audio, una tarea ineficiente por medio de las técnicas actuales del estado del arte. Consecuentemente, en su lugar deberán tenerse en cuenta aproximaciones. Los esfuerzos en la tarea de modelado han proporcionado modelos cada vez más elaborados, aunque no buscando la explicación última de naturaleza fisiológica de la señal de voz. En su lugar estos modelos aprenden relaciones entre la señales acústicas a partir de un gran conjunto de datos de entrenamiento. El desarrollo de modelos aproximados genera a su vez una segunda razón, la variabilidad de dominio. Debido al uso de relaciones aprendidas a partir de un conjunto de entrenamiento concreto, cualquier cambio de dominio que modifique las condiciones acústicas con respecto a los datos de entrenamiento condiciona las relaciones asumidas, pudiendo causar fallos consistentes en los sistemas.Nuestra contribución a las tecnologías de diarización se ha centrado en el entorno de radiodifusión. Este dominio es actualmente un entorno todavía complejo para los sistemas de diarización donde ninguna simplificación de la tarea puede ser tenida en cuenta. Por tanto, se deberá desarrollar un modelado eficiente del audio para extraer la información de locutor y como inferir el etiquetado correspondiente. Además, la presencia de múltiples condiciones acústicas debido a la existencia de diferentes programas y/o géneros en el domino requiere el desarrollo de técnicas capaces de adaptar el conocimiento adquirido en un determinado escenario donde la información está disponible a aquellos entornos donde dicha información es limitada o sencillamente no disponible.Para este propósito el trabajo desarrollado a lo largo de la tesis se ha centrado en tres subtareas: caracterización de locutor, agrupamiento y adaptación de modelos. La primera subtarea busca el modelado de un fragmento de audio para obtener representaciones precisas de los locutores involucrados, poniendo de manifiesto sus propiedades discriminativas. En este área se ha llevado a cabo un estudio acerca de las actuales estrategias de modelado, especialmente atendiendo a las limitaciones de las representaciones extraídas y poniendo de manifiesto el tipo de errores que pueden generar. Además, se han propuesto alternativas basadas en redes neuronales haciendo uso del conocimiento adquirido. La segunda tarea es el agrupamiento, encargado de desarrollar estrategias que busquen el etiquetado óptimo de los locutores. La investigación desarrollada durante esta tesis ha propuesto nuevas estrategias para estimar el mejor reparto de locutores basadas en técnicas de subespacios, especialmente PLDA. Finalmente, la tarea de adaptación de modelos busca transferir el conocimiento obtenido de un conjunto de entrenamiento a dominios alternativos donde no hay datos para extraerlo. Para este propósito los esfuerzos se han centrado en la extracción no supervisada de información de locutor del propio audio a diarizar, sinedo posteriormente usada en la adaptación de los modelos involucrados.<br /

    Analysis of Large-Scale SVM Training Algorithms for Language and Speaker Recognition

    Get PDF
    This paper compares a set of large scale support vector machine (SVM) training algorithms for language and speaker recognition tasks.We analyze five approaches for training phonetic and acoustic SVM models for language recognition. We compare the performance of these approaches as a function of the training time required by each of them to reach convergence, and we discuss their scalability towards large corpora. Two of these algorithms can be used in speaker recognition to train a SVM that classifies pairs of utterances as either belonging to the same speaker or to two different speakers. Our results show that the accuracy of these algorithms is asymptotically equivalent, but they have different behavior with respect to the time required to converge. Some of these algorithms not only scale linearly with the training set size, but are also able to give their best results after just a few iterations. State-of-the-art performance has been obtained in the female subset of the NIST 2010 Speaker Recognition Evaluation extended core test using a single SVM syste

    Multiple classifiers in biometrics. Part 2: Trends and challenges

    Full text link
    The present paper is Part 2 in this series of two papers. In Part 1 we provided an introduction to Multiple Classifier Systems (MCS) with a focus into the fundamentals: basic nomenclature, key elements, architecture, main methods, and prevalent theory and framework. Part 1 then overviewed the application of MCS to the particular field of multimodal biometric person authentication in the last 25 years, as a prototypical area in which MCS has resulted in important achievements. Here in Part 2 we present in more technical detail recent trends and developments in MCS coming from multimodal biometrics that incorporate context information in an adaptive way. These new MCS architectures exploit input quality measures and pattern-specific particularities that move apart from general population statistics, resulting in robust multimodal biometric systems. Similarly as in Part 1, methods here are described in a general way so they can be applied to other information fusion problems as well. Finally, we also discuss here open challenges in biometrics in which MCS can play a key roleThis work was funded by projects CogniMetrics (TEC2015-70627-R) from MINECO/FEDER and RiskTrakc (JUST-2015-JCOO-AG-1). Part of this work was conducted during a research visit of J.F. to Prof. Ludmila Kuncheva at Bangor University (UK) with STSM funding from COST CA16101 (MULTI-FORESEE

    Face Detection in a Mixed-Subject Document

    Get PDF
    Before you recognize anyone, it is essential to identify various characteristics variations from one person to another. among of this characteristics, we have those relating to the face. Nowadays the detection of skin regions in an image has become an important research topic for the location of a face in the image. In this research study, unlike previous research studies  related  to  this  topic  which  have  focused  on  images  inputs  data  faces,  we  are  more interested to the fields face detection in mixed-subject documents (text + images). The face detection system developed is based on the hybrid method to distinguish two categories of objects from the mixed document. The first category is all that is text or images containing figures having no skin color, and the second category is any figure with the same color as the skin. In the second phase the detection system is based on Template Matching method to distinguish among the figures of the second category only those that contain faces to detect them. To validate this study, the system developed is tested on the various documents which including text and image
    corecore