4 research outputs found
Speedup of kernel eigenvoice speaker adaptation by embedded kernel PCA
Recently, we proposed an improvement to the eigenvoice (EV) speaker adaptation called kernel eigenvoice (KEV) speaker adaptation. In KEV adaptation, eigenvoices are computed using kernel PCA, and a new speaker’s adapted model is implicitly computed in the kernel-induced feature space. Due to many online kernel evaluations, both adaptation and subsequent recognition of KEV adaptation are slower than EV adaptation. In this paper, we eliminate all online kernel computations by finding an approximate pre-image of the implicit adapted model found by KEV adaptation. Furthermore, the two steps of finding the implicit adapted model and its approximate pre-image are integrated by embedding the kernel PCA procedure in our new embedded kernel eigenvoice (eKEV) speaker adaptation method. When tested in an TIDIGITS task with less than 10s of adaptation speech, eKEV adaptation obtained a speedup of 6–14 times in adaptation and 136 times in recognition over KEV adaptation with 12–13 % relative improvement in recognition accuracy. 1
Deep Neural Network Architectures for Large-scale, Robust and Small-Footprint Speaker and Language Recognition
Tesis doctoral inédita leída en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Fecha de lectura : 27-04-2017Artificial neural networks are powerful learners of the information embedded in speech signals.
They can provide compact, multi-level, nonlinear representations of temporal sequences
and holistic optimization algorithms capable of surpassing former leading paradigms. Artificial
neural networks are, therefore, a promising technology that can be used to enhance our
ability to recognize speakers and languages–an ability increasingly in demand in the context
of new, voice-enabled interfaces used today by millions of users. The aim of this thesis is to
advance the state-of-the-art of language and speaker recognition through the formulation,
implementation and empirical analysis of novel approaches for large-scale and portable
speech interfaces. Its major contributions are: (1) novel, compact network architectures
for language and speaker recognition, including a variety of network topologies based on
fully-connected, recurrent, convolutional, and locally connected layers; (2) a bottleneck combination
strategy for classical and neural network approaches for long speech sequences; (3)
the architectural design of the first, public, multilingual, large vocabulary continuous speech
recognition system; and (4) a novel, end-to-end optimization algorithm for text-dependent
speaker recognition that is applicable to a range of verification tasks. Experimental results
have demonstrated that artificial neural networks can substantially reduce the number of
model parameters and surpass the performance of previous approaches to language and
speaker recognition, particularly in the cases of long short-term memory recurrent networks
(used to model the input speech signal), end-to-end optimization algorithms (used to predict
languages or speakers), short testing utterances, and large training data collections.Las redes neuronales artificiales son sistemas de aprendizaje capaces de extraer la información
embebida en las señales de voz. Son capaces de modelar de forma eficiente secuencias
temporales complejas, con información no lineal y distribuida en distintos niveles semanticos,
mediante el uso de algoritmos de optimización integral con la capacidad potencial de mejorar
los sistemas aprendizaje automático existentes. Las redes neuronales artificiales son, pues,
una tecnología prometedora para mejorar el reconocimiento automático de locutores e
idiomas; siendo el reconocimiento de de locutores e idiomas, tareas con cada vez más
demanda en los nuevos sistemas de control por voz, que ya utilizan millones de personas. Esta
tesis tiene como objetivo la mejora del estado del arte de las tecnologías de reconocimiento
de locutor y de idioma mediante la formulación, implementación y análisis empírico de
nuevos enfoques basados en redes neuronales, aplicables a dispositivos portátiles y a su uso
en gran escala. Las principales contribuciones de esta tesis incluyen la propuesta original de:
(1) arquitecturas eficientes que hacen uso de capas neuronales densas, localmente densas,
recurrentes y convolucionales; (2) una nueva estrategia de combinación de enfoques clásicos
y enfoques basados en el uso de las denominadas redes de cuello de botella; (3) el diseño del
primer sistema público de reconocimiento de voz, de vocabulario abierto y continuo, que es
además multilingüe; y (4) la propuesta de un nuevo algoritmo de optimización integral para
tareas de reconocimiento de locutor, aplicable también a otras tareas de verificación. Los
resultados experimentales extraídos de esta tesis han demostrado que las redes neuronales
artificiales son capaces de reducir el número de parámetros usados por los algoritmos de
reconocimiento tradicionales, así como de mejorar el rendimiento de dichos sistemas de
forma substancial. Dicha mejora relativa puede acentuarse a través del modelado de voz
mediante redes recurrentes de memoria a largo plazo, el uso de algoritmos de optimización
integral, el uso de locuciones de evaluation de corta duración y mediante la optimización del
sistema con grandes cantidades de datos de entrenamiento
IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech
IberSPEECH2020 is a two-day event, bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentation of projects, laboratories activities, recent PhD thesis, discussion panels, a round table, and awards to the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions that will be presented distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, Czech Republic, Ucrania, Slovenia). Furthermore, it is confirmed to publish an extension of selected papers as a special issue of the Journal of Applied Sciences, “IberSPEECH 2020: Speech and Language Technologies for Iberian Languages”, published by MDPI with fully open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the following activities: the ALBAYZIN evaluation challenge session.Red Española de Tecnologías del Habla. Universidad de Valladoli