On-device mobile speech recognition
Despite many years of research, speech recognition remains an active area of artificial intelligence. Currently, the most common commercial application of this technology on mobile devices uses a wireless client-server approach to meet the computational and memory demands of the speech recognition process. Unfortunately, such an approach is unlikely to remain viable when fully applied across the approximately 7.22 billion mobile phones currently in circulation. In this thesis we present an on-device speech recognition system, which has the potential to eliminate the wireless client-server bottleneck entirely. For the voice activity detection (VAD) part of this work, the thesis presents two novel algorithms for detecting speech activity within an audio signal. The first algorithm is based on the Log Linear Predictive Cepstral Coefficients Residual Signal (LLPCCRS). The LLPCCRS feature vectors are classified into voice and non-voice segments using a modified K-means clustering algorithm. This VAD algorithm is shown to outperform a conventional energy-based frame analysis approach. The second algorithm is based on the Linear Predictive Cepstral Coefficients. It uses the frames within the speech signal with the minimum and maximum standard deviation as candidates for a linear cross-correlation against the remaining frames of the audio signal. The cross-correlated frames are then classified with the same modified K-means clustering algorithm, producing one cluster of speech frames and another of non-speech frames. This novel application of the linear cross-correlation technique to linear predictive cepstral coefficient feature vectors provides a fast computation method for use on the mobile platform, as shown by the results presented in this thesis.
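The clustering step described above can be sketched as follows. This is a minimal illustration using plain per-frame log-energy features and standard two-cluster K-means; the thesis's LLPCCRS features and its modified K-means variant are not reproduced here, and all function names and parameters are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def kmeans_vad(frames, iters=20):
    """Two-cluster K-means over per-frame log-energy.

    Returns a boolean mask: True for frames in the higher-energy
    (speech) cluster. A stand-in for the thesis's modified K-means
    over LLPCCRS feature vectors.
    """
    feats = np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]
    # Initialise the two centroids at the min and max feature values.
    cents = np.array([feats.min(), feats.max()])[:, None]
    for _ in range(iters):
        labels = np.argmin(np.abs(feats - cents.T), axis=1)
        for k in range(2):
            if np.any(labels == k):
                cents[k] = feats[labels == k].mean()
    speech_cluster = np.argmax(cents[:, 0])
    return labels == speech_cluster

# Synthetic example: silence, a louder "speech" burst, then silence again.
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(4000),
                      1.00 * rng.standard_normal(4000),
                      0.01 * rng.standard_normal(4000)])
mask = kmeans_vad(frame_signal(sig))
```

On this synthetic signal, frames inside the burst fall into the speech cluster and the surrounding frames into the non-speech cluster.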
The speech recognition part of this thesis presents two novel neural network approaches to mobile speech recognition. First, a recurrent neural network architecture is developed to accommodate the output of the VAD stage; specifically, an Echo State Network (ESN) is used for phoneme-level recognition. The drawbacks and advantages of this method are explained further within the thesis. Second, a dynamic Multi-Layer Perceptron (MLP) approach is developed. This addresses the drawbacks of the ESN and handles speech-signal length variability dynamically within its architecture. The novel dynamic MLP uses both Linear Predictive Cepstral Coefficients (LPC) and Mel Frequency Cepstral Coefficients (MFCC) as input features. A speaker-dependent approach is presented using the Center for Spoken Language Understanding (CSLU) database. The results show a very distinct behaviour from conventional speech recognition approaches, with the LPC features performing very close to the MFCC features. For further confirmation, a speaker-independent system is then implemented on the dynamic MLP using the standard TIMIT dataset; in this mode of operation, the MFCC features outperform the LPC features. Finally, all the results, with emphasis on computation time, are compared for both novel neural network approaches against a conventional hidden Markov model on the CSLU and TIMIT standard datasets.
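An Echo State Network of the kind mentioned above keeps a fixed random recurrent reservoir and trains only a linear readout, which keeps training cheap. A minimal sketch, assuming generic frame features and one-hot phoneme targets (the dimensions, ridge value, and data here are illustrative assumptions, not the thesis's configuration):

```python
import numpy as np

rng = np.random.default_rng(1)

class EchoStateNetwork:
    """Minimal ESN: fixed random reservoir, ridge-regression readout."""

    def __init__(self, n_in, n_res=100, spectral_radius=0.9):
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        # Rescale so the reservoir has the desired spectral radius.
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.W = W
        self.W_out = None

    def _states(self, X):
        """Run the reservoir over a (T, n_in) feature sequence."""
        h = np.zeros(self.W.shape[0])
        states = []
        for x in X:
            h = np.tanh(self.W_in @ x + self.W @ h)
            states.append(h.copy())
        return np.array(states)

    def fit(self, X, Y, ridge=1e-6):
        """Train only the linear readout from states to targets."""
        S = self._states(X)
        A = S.T @ S + ridge * np.eye(S.shape[1])
        self.W_out = np.linalg.solve(A, S.T @ Y).T

    def predict(self, X):
        return self._states(X) @ self.W_out.T

# Illustrative data: 50 frames of 13 features, 10 one-hot classes.
T, n_in, n_classes = 50, 13, 10
X = rng.standard_normal((T, n_in))
Y = np.eye(n_classes)[rng.integers(0, n_classes, size=T)]
esn = EchoStateNetwork(n_in)
esn.fit(X, Y)
scores = esn.predict(X)
```

Because only the readout is trained, fitting reduces to one linear solve, which is part of the appeal of the ESN on constrained hardware.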
Speech recognition through hybrid techniques using Markovian models and new types of neural networks
The speech recognition module within a spoken dialogue system has become a key factor over time. New approaches and techniques have shown how training processes and architecture definitions can evolve to obtain higher recognition rates. In this sense, the present research investigates new schemes to improve word error rates (WER). The work builds on the deep neural network and hidden Markov model (DNN-HMM) architecture, which relies heavily on the behavior of the Gaussian mixture model and hidden Markov model (GMM-HMM) approach. First, both approaches are compared experimentally. The research uses a corpus of personalized voices in Spanish from the north-central part of Mexico, based on a connected-words phone-dialing task through the recognition of digit strings and personal name lists. The specified recognition task is defined as speaker-independent, text-dependent and mid-vocabulary. In the first experimental case study, a relative improvement of 30% was obtained using the neural-network acoustic model (WER of 1.49%), compared to the classic Gaussian-mixture acoustic model (2.12%). In the second case study, a relative improvement of 20.71% was achieved with the connectionist (neural network) approach (WER of 3.33%) with regard to the Gaussian mixture model (4.20%). The presented recognition task shows that current approaches based on connectionist models, originating in artificial intelligence, surpass the traditional Gaussian-mixture approaches in most speech recognition tasks. With the purpose of further improving recent speech recognition models, the second part of the thesis proposes new cost functions for training a neural network, called non-uniform mapped criteria. These functions yield higher recognition rates than the conventional cross-entropy function when training a deep neural network with the back-propagation algorithm and gradient-descent optimization. The obtained results (relative improvements of 12.3% and 10.7% with the two proposed approaches, with respect to the conventional cross-entropy model) show lower word error rates, suggesting that the proposed cost functions deserve consideration as interesting alternatives for this type of task. Nevertheless, these and other cost-function mechanisms must still be tested on different voice corpora under several conditions with and without environmental noise, in addition to considering radical variations in the speakers' speech
sources.
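The abstract does not define the non-uniform mapped criteria. One plausible sketch of the general idea — replacing the uniform per-frame weighting of conventional cross-entropy with a non-uniform, confidence-dependent map — is below; the mapping `(1 - pt) ** alpha` and the parameter `alpha` are assumptions for illustration, not the thesis's actual functions.

```python
import numpy as np

def cross_entropy(p, targets):
    """Conventional cross-entropy over softmax outputs p: (N, C)."""
    pt = p[np.arange(len(targets)), targets]
    return -np.mean(np.log(pt + 1e-12))

def mapped_cross_entropy(p, targets, alpha=2.0):
    """Hypothetical non-uniformly mapped criterion: scale each frame's
    cross-entropy by a non-uniform map of the model's confidence in the
    target class, so poorly classified frames dominate the gradient.
    Illustrative only; the thesis's exact mapping is not reproduced here."""
    pt = p[np.arange(len(targets)), targets]
    weights = (1.0 - pt) ** alpha      # non-uniform emphasis map
    return np.mean(-weights * np.log(pt + 1e-12))

# Tiny example: three frames, four classes, target class 0 throughout.
p = np.array([[0.9, 0.05, 0.03, 0.02],   # confident and correct
              [0.4, 0.30, 0.20, 0.10],   # uncertain
              [0.1, 0.60, 0.20, 0.10]])  # confidently wrong
targets = np.array([0, 0, 0])
ce = cross_entropy(p, targets)
mce = mapped_cross_entropy(p, targets)
```

Under this mapping, confident frames are down-weighted, so the criterion concentrates the training signal on the hard frames rather than spreading it uniformly.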