1,474 research outputs found

    Long short-term memory networks for noise robust speech recognition

    Get PDF

    Deep Spoken Keyword Spotting:An Overview

    Get PDF
    Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS

    Advances in deep learning methods for speech recognition and understanding

    Full text link
    Ce travail expose plusieurs études dans les domaines de la reconnaissance de la parole et compréhension du langage parlé. La compréhension sémantique du langage parlé est un sous-domaine important de l'intelligence artificielle. Le traitement de la parole intéresse depuis longtemps les chercheurs, puisque la parole est une des charactéristiques qui definit l'être humain. Avec le développement du réseau neuronal artificiel, le domaine a connu une évolution rapide à la fois en terme de précision et de perception humaine. Une autre étape importante a été franchie avec le développement d'approches bout en bout. De telles approches permettent une coadaptation de toutes les parties du modèle, ce qui augmente ainsi les performances, et ce qui simplifie la procédure d'entrainement. Les modèles de bout en bout sont devenus réalisables avec la quantité croissante de données disponibles, de ressources informatiques et, surtout, avec de nombreux développements architecturaux innovateurs. Néanmoins, les approches traditionnelles (qui ne sont pas bout en bout) sont toujours pertinentes pour le traitement de la parole en raison des données difficiles dans les environnements bruyants, de la parole avec un accent et de la grande variété de dialectes. Dans le premier travail, nous explorons la reconnaissance de la parole hybride dans des environnements bruyants. Nous proposons de traiter la reconnaissance de la parole, qui fonctionne dans un nouvel environnement composé de différents bruits inconnus, comme une tâche d'adaptation de domaine. Pour cela, nous utilisons la nouvelle technique à l'époque de l'adaptation du domaine antagoniste. En résumé, ces travaux antérieurs proposaient de former des caractéristiques de manière à ce qu'elles soient distinctives pour la tâche principale, mais non-distinctive pour la tâche secondaire. Cette tâche secondaire est conçue pour être la tâche de reconnaissance de domaine. Ainsi, les fonctionnalités entraînées sont invariantes vis-à-vis du domaine considéré. Dans notre travail, nous adoptons cette technique et la modifions pour la tâche de reconnaissance de la parole dans un environnement bruyant. Dans le second travail, nous développons une méthode générale pour la régularisation des réseaux génératif récurrents. Il est connu que les réseaux récurrents ont souvent des difficultés à rester sur le même chemin, lors de la production de sorties longues. Bien qu'il soit possible d'utiliser des réseaux bidirectionnels pour une meilleure traitement de séquences pour l'apprentissage des charactéristiques, qui n'est pas applicable au cas génératif. Nous avons développé un moyen d'améliorer la cohérence de la production de longues séquences avec des réseaux récurrents. Nous proposons un moyen de construire un modèle similaire à un réseau bidirectionnel. L'idée centrale est d'utiliser une perte L2 entre les réseaux récurrents génératifs vers l'avant et vers l'arrière. Nous fournissons une évaluation expérimentale sur une multitude de tâches et d'ensembles de données, y compris la reconnaissance vocale, le sous-titrage d'images et la modélisation du langage. Dans le troisième article, nous étudions la possibilité de développer un identificateur d'intention de bout en bout pour la compréhension du langage parlé. La compréhension sémantique du langage parlé est une étape importante vers le développement d'une intelligence artificielle de type humain. Nous avons vu que les approches de bout en bout montrent des performances élevées sur les tâches, y compris la traduction automatique et la reconnaissance de la parole. Nous nous inspirons des travaux antérieurs pour développer un système de bout en bout pour la reconnaissance de l'intention.This work presents several studies in the areas of speech recognition and understanding. The semantic speech understanding is an important sub-domain of the broader field of artificial intelligence. Speech processing has had interest from the researchers for long time because language is one of the defining characteristics of a human being. With the development of neural networks, the domain has seen rapid progress both in terms of accuracy and human perception. Another important milestone was achieved with the development of end-to-end approaches. Such approaches allow co-adaptation of all the parts of the model thus increasing the performance, as well as simplifying the training procedure. End-to-end models became feasible with the increasing amount of available data, computational resources, and most importantly with many novel architectural developments. Nevertheless, traditional, non end-to-end, approaches are still relevant for speech processing due to challenging data in noisy environments, accented speech, and high variety of dialects. In the first work, we explore the hybrid speech recognition in noisy environments. We propose to treat the recognition in the unseen noise condition as the domain adaptation task. For this, we use the novel at the time technique of the adversarial domain adaptation. In the nutshell, this prior work proposed to train features in such a way that they are discriminative for the primary task, but non-discriminative for the secondary task. This secondary task is constructed to be the domain recognition task. Thus, the features trained are invariant towards the domain at hand. In our work, we adopt this technique and modify it for the task of noisy speech recognition. In the second work, we develop a general method for regularizing the generative recurrent networks. It is known that the recurrent networks frequently have difficulties staying on same track when generating long outputs. While it is possible to use bi-directional networks for better sequence aggregation for feature learning, it is not applicable for the generative case. We developed a way improve the consistency of generating long sequences with recurrent networks. We propose a way to construct a model similar to bi-directional network. The key insight is to use a soft L2 loss between the forward and the backward generative recurrent networks. We provide experimental evaluation on a multitude of tasks and datasets, including speech recognition, image captioning, and language modeling. In the third paper, we investigate the possibility of developing an end-to-end intent recognizer for spoken language understanding. The semantic spoken language understanding is an important step towards developing a human-like artificial intelligence. We have seen that the end-to-end approaches show high performance on the tasks including machine translation and speech recognition. We draw the inspiration from the prior works to develop an end-to-end system for intent recognition

    Spoken command recognition for robotics

    Get PDF
    In this thesis, I investigate spoken command recognition technology for robotics. While high robustness is expected, the distant and noisy conditions in which the system has to operate make the task very challenging. Unlike commercial systems which all rely on a "wake-up" word to initiate the interaction, the pipeline proposed here directly detect and recognizes commands from the continuous audio stream. In order to keep the task manageable despite low-resource conditions, I propose to focus on a limited set of commands, thus trading off flexibility of the system against robustness. Domain and speaker adaptation strategies based on a multi-task regularization paradigm are first explored. More precisely, two different methods are proposed which rely on a tied loss function which penalizes the distance between the output of several networks. The first method considers each speaker or domain as a task. A canonical task-independent network is jointly trained with task-dependent models, allowing both types of networks to improve by learning from one another. While an improvement of 3.2% on the frame error rate (FER) of the task-independent network is obtained, this only partially carried over to the phone error rate (PER), with 1.5% of improvement. Similarly, a second method explored the parallel training of the canonical network with a privileged model having access to i-vectors. This method proved less effective with only 1.2% of improvement on the FER. In order to make the developed technology more accessible, I also investigated the use of a sequence-to-sequence (S2S) architecture for command classification. The use of an attention-based encoder-decoder model reduced the classification error by 40% relative to a strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing the relevance of S2S architectures in such context. In order to improve the flexibility of the trained system, I also explored strategies for few-shot learning, which allow to extend the set of commands with minimum requirements in terms of data. Retraining a model on the combination of original and new commands, I managed to achieve 40.5% of accuracy on the new commands with only 10 examples for each of them. This scores goes up to 81.5% of accuracy with a larger set of 100 examples per new command. An alternative strategy, based on model adaptation achieved even better scores, with 68.8% and 88.4% of accuracy with 10 and 100 examples respectively, while being faster to train. This high performance is obtained at the expense of the original categories though, on which the accuracy deteriorated. Those results are very promising as the methods allow to easily extend an existing S2S model with minimal resources. Finally, a full spoken command recognition system (named iCubrec) has been developed for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to propose a fully hand-free experience. By segmenting only regions that are likely to contain commands, the VAD module also allows to reduce greatly the computational cost of the pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM command recognition system for transcription. The VoCub dataset has been specifically gathered to train a DNN-based acoustic model for our task. Through multi-condition training with the CHiME4 dataset, an accuracy of 94.5% is reached on VoCub test set. A filler model, complemented by a rejection mechanism based on a confidence score, is finally added to the system to reject non-command speech in a live demonstration of the system
    • …
    corecore