
    Neural approaches to spoken content embedding

    Comparing spoken segments is a central operation in speech processing. Traditional approaches have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they are limited in performance and efficiency. As an alternative, acoustic word embeddings -- fixed-dimensional vector representations of variable-length spoken word segments -- have begun to be considered for such tasks as well. However, the current space of such discriminative embedding models, training approaches, and their application to real-world downstream tasks is limited. We start by considering "single-view" training losses, where the goal is to learn an acoustic word embedding model that separates same-word and different-word spoken segment pairs. Then, we consider "multi-view" contrastive losses, in which acoustic word embeddings are learned jointly with embeddings of character sequences to produce acoustically grounded embeddings of written words, or acoustically grounded word embeddings. In this thesis, we contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs). We improve model training in terms of both efficiency and performance. We take these developments beyond English to several low-resource languages and show that multilingual training improves performance when labeled data is limited. We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition. Finally, we show how our embedding approaches compare with and complement more recent self-supervised speech models. Comment: PhD thesis
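    To make the multi-view idea concrete, here is a minimal sketch (in PyTorch, with illustrative layer sizes and a hypothetical margin value; this is not the thesis code) of jointly embedding spoken segments and character sequences and training them with a contrastive triplet loss:

```python
# Minimal sketch of multi-view training for acoustic word embeddings (AWEs)
# and acoustically grounded word embeddings (AGWEs). Sizes are illustrative.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Embeds a variable-length speech segment into a fixed-dim vector."""
    def __init__(self, feat_dim=40, hidden=256, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        _, h = self.rnn(x)                      # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)
        return nn.functional.normalize(self.proj(h), dim=-1)

class CharEncoder(nn.Module):
    """Embeds the written form (character sequence) of a word."""
    def __init__(self, n_chars=30, hidden=256, emb_dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_chars, 64)
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, c):                       # c: (batch, chars)
        _, h = self.rnn(self.emb(c))
        return nn.functional.normalize(self.proj(h[-1]), dim=-1)

def multiview_triplet_loss(awe, agwe_pos, agwe_neg, margin=0.4):
    """Pull an AWE toward its own word's text embedding; push it away
    from a different word's text embedding."""
    pos = (awe * agwe_pos).sum(-1)              # cosine sim. of unit vectors
    neg = (awe * agwe_neg).sum(-1)
    return torch.clamp(margin + neg - pos, min=0).mean()

# Toy usage with random data.
enc_a, enc_c = AcousticEncoder(), CharEncoder()
awe = enc_a(torch.randn(4, 100, 40))            # 4 spoken segments
pos = enc_c(torch.randint(0, 30, (4, 8)))       # their own written words
neg = enc_c(torch.randint(0, 30, (4, 8)))       # different words
loss = multiview_triplet_loss(awe, pos, neg)
```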

    Articulatory Information for Robust Speech Recognition

    Current Automatic Speech Recognition (ASR) systems fall well short of human speech recognition performance because they lack robustness to speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as 'beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and their variation occurs due to a combination of factors including speaking style and speaking rate; a phenomenon commonly known as 'coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task. Since no natural speech database currently contains articulatory gesture annotation, an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV-estimator and the latter to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs estimated from the acoustic speech signal. In this setup, the articulatory gestures were modeled as hidden random variables, eliminating the need for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only help to account for coarticulatory variations but also significantly improve the noise robustness of ASR systems.
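    As an illustration of the TV-estimation step, here is a minimal sketch of a regressor from windows of acoustic features to tract variables. The eight-TV output, layer sizes, and context width are assumptions for illustration, not the dissertation's configuration:

```python
# Minimal sketch of a tract-variable (TV) estimator: a regressor from a
# window of acoustic features (e.g., MFCCs) to vocal-tract constriction
# trajectories. All sizes below are illustrative.
import torch
import torch.nn as nn

N_TVS = 8        # e.g., lip aperture/protrusion, tongue tip/body, velum, glottis
CONTEXT = 9      # frames of acoustic context per input window

class TVEstimator(nn.Module):
    def __init__(self, n_mfcc=13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CONTEXT * n_mfcc, 512), nn.Tanh(),
            nn.Linear(512, 512), nn.Tanh(),
            nn.Linear(512, N_TVS),               # continuous TV trajectories
        )

    def forward(self, mfcc_window):              # (batch, CONTEXT * n_mfcc)
        return self.net(mfcc_window)

# Training targets would come from an articulatory corpus (e.g., X-ray
# microbeam), fitted here with a plain regression loss on dummy data.
model = TVEstimator()
x = torch.randn(32, CONTEXT * 13)                # dummy MFCC windows
y = torch.randn(32, N_TVS)                       # dummy TV targets
loss = nn.MSELoss()(model(x), y)
loss.backward()
```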

    Ten years after ImageNet: a 360° perspective on artificial intelligence

    It is 10 years since neural networks made their spectacular comeback. Prompted by this anniversary, we take a holistic perspective on artificial intelligence (AI). Supervised learning for cognitive tasks is effectively solved, provided we have enough high-quality labelled data. However, deep neural network models are not easily interpretable, and the debate between black-box and white-box modelling has thus come to the fore. The rise of attention networks, self-supervised learning, generative modelling, and graph neural networks has widened the application space of AI. Deep learning has also propelled the return of reinforcement learning as a core building block of autonomous decision-making systems. The harms made possible by new AI technologies have raised socio-technical issues such as transparency, fairness, and accountability. The dominance of AI by Big Tech, which controls talent, computing resources and, most importantly, data, may lead to an extreme AI divide. Despite the recent dramatic and unexpected success of AI-driven conversational agents, progress in much-heralded flagship projects such as self-driving vehicles remains elusive. Care must be taken to moderate the rhetoric surrounding the field and to align engineering progress with scientific principles.

    Learning and time: on using memory and curricula for language understanding

    The goal of this thesis is to present some of the small steps taken on the path towards solving natural language understanding and learning long-term dependencies, with the aim of developing artificial intelligence algorithms that can reason with language. This thesis is written as a thesis by articles and contains five articles. Each article proposes a new model or algorithm and demonstrates its efficiency on problems that involve long-term dependencies or require natural language understanding. Although some of the models are tested on a particular task (such as neural machine translation), the proposed methods are generally applicable to other domains and tasks (and have been used in the literature). In the introduction of the thesis, we present some of the fundamental concepts behind training sequence models with neural networks. We first provide a brief introduction to neural networks and then dive into the details of some of the approaches and algorithms used throughout this thesis.
    In our first article, we propose a novel method to utilize the abundant monolingual data available for training neural machine translation models.
We have accomplished this goal by first training a long short-term memory (LSTM) language model on a large monolingual corpus and then fusing the outputs or the hidden states of the LSTM language model with the decoder of the neural machine translation model. Our neural machine translation model is trained end-to-end with an attention mechanism. We have shown that our proposed approaches can significantly improve the performance of neural machine translation models on low-resource translation tasks, and that our approach improves the data efficiency of end-to-end neural machine translation systems. We report improvements on Turkish-English (Tr-En), German-English (De-En), Chinese-English (Zh-En) and Czech-English (Cz-En) translation tasks.
    In our second paper, we propose an approach to address the problem of rare words in natural language processing tasks. Our approach augments the attention-based encoder-decoder architecture by replacing the final softmax layer with our proposed pointer-softmax layer, which creates pointers into the source sentence as the decoder translates. With the pointer-softmax, our model learns to switch, in a probabilistic manner, between copying a word from the source and predicting a word from a shortlist vocabulary. Our proposed approach is end-to-end trainable with a single maximum likelihood objective of the NMT model, and we report significant improvements on machine translation and summarization tasks.
    In our "Plan, Attend, Generate: Planning for Sequence-to-Sequence Models" paper, we propose two new approaches to learning alignments in sequence-to-sequence models. When the input and source context are very long, learning the alignments can be difficult; in particular, when the decoder is a large network, it can learn to ignore the alignments and attend mostly to the last token of the input sequence. We propose a new approach inspired by a hierarchical reinforcement learning algorithm and extend our model with an explicit planning mechanism that plans and computes the alignments for the next k tokens in the decoder. Our model also learns a commitment plan to decide when to recompute the alignment matrix. Our proposed approach can learn high-level temporal abstractions, and we show that it qualitatively learns better alignments. We also achieve significant improvements over our baseline despite using smaller models and less training time.
    In "Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes," we propose a new approach for augmenting neural networks with an explicit memory mechanism. As opposed to conventional RNNs, the memory is not only represented in the activations of the network but also in an external memory that can be accessed via a neural network controller. Our model, the D-NTM, uses a more straightforward memory addressing mechanism than the NTM, achieved by using key-value pairs for each memory cell. We find that models augmented with an external memory mechanism learn tasks that involve long-term dependencies more efficiently and generalize better. We achieve improvements on many tasks including, but not limited to, episodic question answering on bAbI, reasoning with entailment, permuted MNIST, and synthetic tasks.
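    To illustrate the key-value addressing idea, here is a minimal sketch of soft and hard content-based reads over a key-value memory. The dimensions are illustrative, and the controller and write operations are omitted:

```python
# Minimal sketch of key-value memory addressing in the spirit of the D-NTM:
# each cell is split into a key part (used for content-based addressing)
# and a value part (what is actually read). Sizes are illustrative.
import torch
import torch.nn.functional as F

N_CELLS, KEY_DIM, VAL_DIM = 128, 16, 32
memory_keys = torch.randn(N_CELLS, KEY_DIM)     # learnable address part
memory_vals = torch.randn(N_CELLS, VAL_DIM)     # writable content part

def soft_read(query):                           # query: (KEY_DIM,) from controller
    scores = memory_keys @ query                # content-based match on keys only
    weights = F.softmax(scores, dim=0)          # soft addressing over cells
    return weights @ memory_vals                # (VAL_DIM,) blended read vector

def hard_read(query):
    # Hard addressing: pick a single cell (argmax here; sampled in training).
    idx = torch.argmax(memory_keys @ query)
    return memory_vals[idx]

r = soft_read(torch.randn(KEY_DIM))
```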
In our "Noisy Activation Functions" paper, we propose a novel activation function that makes the activations stochastic by injecting a particular form of noise to them. Our motivation in this paper is to address the optimization problem of commonly used saturating activation functions that are used with the recurrent neural networks. Our approach enables us to use piece-wise linear activation functions on the gated recurrent neural network models. We show improvements in a wide range of tasks without doing any extensive hyperparameter search by a drop-in replacement. We also show that annealing the noise of the activation function can have a profound continuation-like effect on the optimization of the network

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model/neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature. Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures
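    For reference, the statistic underlying such a meta-analysis can be computed as below; the function name is ours, for illustration:

```python
# Relative (word) error rate reduction of an adapted system over its baseline.
def relative_error_reduction(wer_baseline, wer_adapted):
    """RERR = (baseline - adapted) / baseline; e.g., 10% -> 8% gives 20%."""
    return (wer_baseline - wer_adapted) / wer_baseline

assert abs(relative_error_reduction(10.0, 8.0) - 0.20) < 1e-9
```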

    A Survey of Quantum-Cognitively Inspired Sentiment Analysis Models

    Quantum theory, originally proposed as a physical theory to describe the motions of microscopic particles, has been applied to various non-physics domains involving human cognition and decision-making that are inherently uncertain and exhibit certain non-classical, quantum-like characteristics. Sentiment analysis is a typical example of such domains. In the last few years, by leveraging the modeling power of quantum probability (a non-classical probability stemming from quantum mechanics) and deep neural networks, a range of novel quantum-cognitively inspired models for sentiment analysis have emerged and performed well. This survey presents a timely overview of the latest developments in this fascinating cross-disciplinary area. We first provide a background on quantum probability and quantum cognition at a theoretical level, analyzing their advantages over classical theories in modeling the cognitive aspects of sentiment analysis. Then, recent quantum-cognitively inspired models are introduced and discussed in detail, focusing on how they approach the key challenges of the sentiment analysis task. Finally, we discuss the limitations of current research and highlight future research directions.
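    To make the quantum-probability background concrete, here is a toy sketch of the basic ingredients such models share: representing a state as a density matrix and reading out sentiment as a projective measurement via the Born rule. The dimensions and states are toy values, not any published model:

```python
# Toy quantum-probability ingredients: density matrices and the Born rule.
import numpy as np

def density_matrix(states, weights):
    """Mixture of pure states |s><s|, weighted by probabilities."""
    rho = sum(w * np.outer(s, s.conj()) for s, w in zip(states, weights))
    return rho / np.trace(rho)

# Two word states in a 2-d "sentiment space", mixed into one sentence state.
good = np.array([1.0, 0.0])
bad = np.array([0.0, 1.0])
rho = density_matrix([good, bad], [0.7, 0.3])

# Projector onto the "positive" subspace; the Born rule gives P(positive).
P_pos = np.outer(good, good)
p_positive = np.real(np.trace(P_pos @ rho))     # -> 0.7
```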