Análisis de compensación de variabilidad en reconocimiento de locutor aplicado a duraciones cortas

Author: Zazo Candil Rubén
Publication date: 01/07/2014

This project studies, implements and evaluates automatic speaker recognition systems operating on short-duration utterances. To that end, several state-of-the-art speaker recognition techniques, and their adaptation to short utterances, were used and compared. The starting point was a survey of the techniques that have shaped the state of the art, with emphasis on those that achieved notable improvements in the speaker recognition evaluations promoted by the National Institute of Standards and Technology (NIST) over the last decade. With the theory in place, the next step defines the task on which the techniques are evaluated. Historically, the main task in NIST evaluations consists of training the speaker model on one conversation of approximately 150 seconds and verifying the user against an utterance of the same duration. The task addressed in this project uses much shorter utterances, approximately 10 seconds, drawn from NIST speaker recognition evaluations. The experimental work was carried out in two phases. The first phase compares and analyses the differences between two state-of-the-art techniques based on Factor Analysis (FA), namely Total Variability (TV) and Probabilistic Linear Discriminant Analysis (PLDA), mainly evaluating their performance in an experimental setup that follows the protocol of the NIST evaluations. The second phase tunes the parameters of these techniques to assess their impact on short durations and to improve system performance when data is scarce. The system is evaluated with two measures: the error rate and the cost function typically used in that evaluation, which is detailed in later chapters. Finally, the conclusions drawn throughout this work and the lines of future work are presented. Part of the work carried out in this final-year project (Proyecto Final de Carrera) was published at the international conference IberSpeech 2012 [1]: Javier Gonzalez-Dominguez, Ruben Zazo, and Joaquin Gonzalez-Rodriguez. “On the use of total variability and probabilistic linear discriminant analysis for speaker verification on short utterances”.

This project focuses on automatic speaker verification (SV) systems dealing with short-duration utterances (approximately 10 s). Despite the enormous advances in the field, the broad use of SV in real scenarios remains a challenge, mostly due to two factors. The first is session variability, that is, the set of differences among utterances belonging to the same speaker. The second is the degradation of system performance when dealing with short-duration utterances. As a starting point, an exhaustive study of state-of-the-art speaker verification techniques was conducted, with special focus on those methods that achieved outstanding results and opened the door to better SV systems. In that sense, we put particular emphasis on the recent methods based on Factor Analysis (FA), namely Total Variability (TV) and Probabilistic Linear Discriminant Analysis (PLDA). These methods have become the state of the art in the field due to their ability to mitigate the session variability problem. To assess the behaviour of these systems, we use the data and follow the protocol defined by the US National Institute of Standards and Technology (NIST) in its Speaker Recognition Evaluation (SRE) series. In particular, we follow the SRE2010 protocol, adapted to the short-duration problem: instead of the 150 s utterances defined in the core task of SRE2010, we experiment with 10 s utterances in both training and testing. The experiments are divided into two phases. In the first phase we study, compare and evaluate TV and PLDA as effective methods for SV. The second phase is devoted to adapting those methods to short-duration scenarios; here we analyse the effect and importance of the multiple system parameters when only limited data is available for both training and testing. Conclusions and future lines of work are then presented.
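The two evaluation measures mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the error rate is the equal error rate (EER) and that the cost function has the usual detection-cost form, C_miss · P_target · P_miss + C_fa · (1 − P_target) · P_fa; the abstract names neither explicitly. The function names and the default cost parameters are illustrative placeholders, not values taken from the project.

```python
from typing import Sequence, Tuple


def error_rates(targets: Sequence[float], nontargets: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (miss rate, false-alarm rate) when a trial is accepted if score >= threshold."""
    p_miss = sum(s < threshold for s in targets) / len(targets)
    p_fa = sum(s >= threshold for s in nontargets) / len(nontargets)
    return p_miss, p_fa


def equal_error_rate(targets: Sequence[float], nontargets: Sequence[float]) -> float:
    """Approximate the EER by sweeping every observed score as a threshold
    and averaging the two error rates where they are closest."""
    best_gap, best_eer = None, None
    for thr in sorted(set(targets) | set(nontargets)):
        p_miss, p_fa = error_rates(targets, nontargets, thr)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2
    return best_eer


def detection_cost(p_miss: float, p_fa: float, c_miss: float = 1.0,
                   c_fa: float = 1.0, p_target: float = 0.01) -> float:
    """Detection cost; the default parameters are hypothetical placeholders."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    tgt = [2.0, 1.5, 0.2, 1.0]
    non = [-1.0, 0.5, -0.3, 0.1]
    print("EER:", equal_error_rate(tgt, non))
    print("DCF at threshold 0.5:", detection_cost(*error_rates(tgt, non, 0.5)))
```

Sweeping only the observed scores keeps the sketch short; evaluation toolkits usually interpolate between operating points instead.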

Biblos-e Archivo

Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

Authors: Glass James, Sarkar Achintya kr., Shon Suwon, Tan Zheng-Hua, Tang Hao
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 11/05/2019

There are a number of studies on extracting bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states in order to improve the performance of text-dependent speaker verification (TD-SV). However, only moderate success has been achieved. A recent study [1] presented a time-contrastive learning (TCL) concept to explore the non-stationarity of brain signals for classification of brain states. Speech signals have a similar non-stationarity property, and TCL has the further advantage of needing no labeled data. We therefore present a TCL-based BN feature extraction method. The method uniformly partitions each speech utterance in a training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among the classes to exploit the temporal structure of speech. In addition, we propose a segment-based unsupervised clustering algorithm to re-assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL bottleneck (BN) feature with that of short-time cepstral features and of BN features extracted from DNNs discriminating speakers, pass-phrases, speaker+pass-phrase, and monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and that its performance is on par with that of ASR-derived BN features. Moreover,....

Comment: Copyright (c) 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
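The uniform partitioning step described in the abstract can be sketched as below. This is only an illustration under stated assumptions: the function name `tcl_labels`, the integer-division rule for placing segment boundaries, and the toy frame counts are hypothetical, and the paper's exact segmentation and its clustering-based label re-assignment are not reproduced here.

```python
from typing import List


def tcl_labels(num_frames: int, num_segments: int) -> List[int]:
    """Assign each frame of one utterance a class index in [0, num_segments).

    The utterance is cut into `num_segments` contiguous, roughly equal
    multi-frame segments; frame t receives the index of the segment it
    falls into. If num_frames < num_segments, some classes stay empty.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    return [t * num_segments // num_frames for t in range(num_frames)]


if __name__ == "__main__":
    # Two toy utterances of different lengths share the same label set,
    # so segment k of every utterance maps to class k.
    for n in (10, 6):
        print(n, tcl_labels(n, 3))
```

Because the labels depend only on a frame's relative position within its utterance, they can be produced without any transcription or speaker annotation, which is the property the abstract highlights.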

arXiv.org e-Print Archive

VBN