Why Learning of Large-Scale Neural Networks Behaves Like Convex Optimization
In this paper, we present some theoretical work to explain why simple
gradient descent methods are so successful in solving non-convex optimization
problems in learning large-scale neural networks (NNs). After introducing a
mathematical tool called canonical space, we have proved that the objective
functions in learning NNs are convex in the canonical model space. We further
elucidate that the gradients between the original NN model space and the
canonical space are related by a pointwise linear transformation, which is
represented by the so-called disparity matrix. Furthermore, we have proved that
gradient descent methods are guaranteed to converge to a global minimum of zero loss
provided that the disparity matrices maintain full rank. If this full-rank
condition holds, the learning of NNs behaves in the same way as ordinary convex
optimization. Finally, we have shown that the chance of encountering singular disparity
matrices is extremely slim in large NNs. In particular, when over-parameterized
NNs are randomly initialized, gradient descent algorithms converge to a
global minimum of zero loss in probability.
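The convergence claim can be illustrated numerically. The sketch below is not the paper's canonical-space construction; it is a hedged analogue in which the Jacobian of the network outputs with respect to the parameters plays the role of the full-rank disparity matrix, and plain gradient descent on a small over-parameterized network drives the training loss to (numerically) zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy over-parameterized regression: 3 training points in R^2,
# one hidden layer of 40 tanh units, so parameters far outnumber data.
X = rng.normal(size=(3, 2))
y = rng.normal(size=3)
W = rng.normal(size=(40, 2))            # hidden weights
v = rng.normal(size=40) / np.sqrt(40)   # output weights

def outputs(W, v):
    return np.tanh(X @ W.T) @ v         # network outputs on the 3 points

def jacobian(W, v):
    # Jacobian of the outputs w.r.t. all parameters; its full row rank
    # plays the role of the paper's full-rank disparity matrix condition.
    H = np.tanh(X @ W.T)                                    # (3, 40)
    J_W = (v * (1 - H**2))[:, :, None] * X[:, None, :]      # (3, 40, 2)
    return np.concatenate([J_W.reshape(3, -1), H], axis=1)  # (3, 120)

jac_rank = np.linalg.matrix_rank(jacobian(W, v))

lr = 0.01
for _ in range(10000):                  # plain gradient descent
    H = np.tanh(X @ W.T)
    r = H @ v - y                       # residuals on the training set
    gv = H.T @ r
    gW = (v * ((1 - H**2) * r[:, None])).T @ X
    W -= lr * gW
    v -= lr * gv

final_loss = 0.5 * np.sum((outputs(W, v) - y) ** 2)
```

With a random initialization the Jacobian has full row rank, and the squared loss decays to a numerical zero, mirroring the behaviour the paper predicts for the full-rank regime.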
Joint Training Methods for Tandem and Hybrid Speech Recognition Systems using Deep Neural Networks
Hidden Markov models (HMMs) have been the mainstream acoustic modelling approach for state-of-the-art automatic speech recognition (ASR) systems over the
past few decades. Recently, due to the rapid development of deep learning technologies, deep neural networks (DNNs) have become an essential part of nearly all kinds of ASR approaches. Among HMM-based ASR approaches, DNNs are most commonly used to extract features (tandem system configuration) or to directly produce HMM output probabilities (hybrid system configuration).
Although DNN tandem and hybrid systems have been shown to have superior
performance to traditional ASR systems without any DNN models, there are still
issues with such systems. First, some of the DNN settings, such as the choice of
the context-dependent (CD) output target set and the hidden activation functions, are
usually determined independently from the DNN training process. Second, different
ASR modules are separately optimised based on different criteria following a greedy
build strategy. For instance, for tandem systems, the features are often extracted by a
DNN trained to classify individual speech frames while acoustic models are built upon
such features according to a sequence level criterion. These issues mean that the best performance is not theoretically guaranteed.
This thesis focuses on alleviating both issues using joint training methods. In DNN
acoustic model joint training, the decision tree HMM state tying approach is extended
to cluster DNN-HMM states. Based on this method, an alternative CD-DNN training
procedure without relying on any additional system is proposed, which can produce
DNN acoustic models comparable in word error rate (WER) with those trained by the
conventional procedure. Meanwhile, the most common hidden activation functions,
the sigmoid and rectified linear unit (ReLU), are parameterised to enable automatic
learning of function forms. Experiments using conversational telephone speech (CTS)
Mandarin data yield average relative character error rate (CER) reductions of 3.4% and 2.2% with the sigmoid and ReLU parameterisations, respectively. Such parameterised functions can also be applied to speaker adaptation tasks.
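Parameterised activations of this kind can be sketched as follows. The exact functional forms used in the thesis are not reproduced here, so the scale/shift/slope parameters below (eta, gamma, theta, alpha, beta) are illustrative assumptions, one common way to make sigmoid and ReLU shapes learnable:

```python
import numpy as np

def p_sigmoid(x, eta=1.0, gamma=1.0, theta=0.0):
    # Parameterised sigmoid: eta scales the output, gamma scales the
    # input, theta shifts it; (1, 1, 0) recovers the standard sigmoid.
    return eta / (1.0 + np.exp(-(gamma * x + theta)))

def p_relu(x, alpha=1.0, beta=0.0):
    # Parameterised ReLU: alpha is the slope for positive inputs,
    # beta for negative inputs; (1, 0) recovers the standard ReLU,
    # and a small positive beta recovers a leaky ReLU.
    return np.where(x > 0, alpha * x, beta * x)
```

Because the parameters enter the forward pass smoothly, their gradients are available from back-propagation and they can be updated jointly with the weights during training.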
At the ASR system level, DNN acoustic model and corresponding speaker dependent (SD) input feature transforms are jointly learned through minimum phone error
(MPE) training as an example of hybrid system joint training, which outperforms the
conventional hybrid system speaker adaptive training (SAT) method. MPE based speaker independent (SI) tandem system joint training is also studied. Experiments on
multi-genre broadcast (MGB) English data show that this method gives a reduction
in tandem system WER of 11.8% (relative), and the resulting tandem systems are
comparable to MPE hybrid systems in both WER and the number of parameters. In
addition, all approaches in this thesis have been implemented using the hidden Markov
model toolkit (HTK) and the related source code has been or will be made publicly
available with recent or future HTK releases, to increase the reproducibility of the work
presented in this thesis.
Funding: Cambridge International Scholarship (Cambridge Overseas Trust); EPSRC Natural
Speech Technology Project; DARPA BOLT Program; IARPA Babel Program.
Advances in deep learning methods for speech recognition and understanding
This work presents several studies in the areas of speech recognition and
understanding.
Semantic speech understanding is an important sub-domain of the
broader field of artificial intelligence.
Speech processing has long interested researchers,
because language is one of the defining characteristics of a human being.
With the development of neural networks, the domain has seen rapid progress
both in terms of accuracy and human perception.
Another important milestone was achieved with the development of
end-to-end approaches.
Such approaches allow co-adaptation of all the parts of the model
thus increasing the performance, as well as simplifying the training
procedure.
End-to-end models became feasible with the increasing amount of available
data, computational resources, and most importantly with many novel
architectural developments.
Nevertheless, traditional, non end-to-end, approaches are still relevant
for speech processing due to challenging data in noisy environments,
accented speech, and high variety of dialects.
In the first work, we explore hybrid speech recognition in noisy
environments.
We propose to treat recognition under unseen noise conditions
as a domain adaptation task.
For this, we use the then-novel technique of adversarial
domain adaptation.
In a nutshell, this prior work proposed to train features in such
a way that they are discriminative for the primary task,
but non-discriminative for the secondary task.
This secondary task is constructed to be the domain recognition task.
Thus, the trained features are invariant to the domain at hand.
In our work, we adopt this technique and modify it for the task of
noisy speech recognition.
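The adversarial domain adaptation recipe described above centres on a gradient reversal layer. The minimal sketch below (the class name and interface are illustrative, not the thesis's implementation) shows the core mechanics:

```python
class GradientReversal:
    """Gradient reversal layer (GRL), the core trick of adversarial
    domain adaptation: identity in the forward pass, gradient scaled
    by -lambda in the backward pass. Placed between the feature
    extractor and the domain classifier, it makes the features
    *fool* the domain classifier while still serving the primary task."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between primary and adversarial losses

    def forward(self, x):
        # Forward pass: pass features through unchanged.
        return x

    def backward(self, grad_output):
        # Backward pass: negate (and scale) the domain-classifier
        # gradient before it reaches the feature extractor.
        return -self.lam * grad_output
```

In a full system, the primary (speech recognition) loss back-propagates normally, while the domain loss reaches the shared features only through this sign-flipped path, pushing them toward noise-condition invariance.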
In the second work, we develop a general method for regularizing
the generative recurrent networks.
It is known that recurrent networks frequently have difficulty
staying on the same track when generating long outputs.
While it is possible to use bi-directional networks for better
sequence aggregation in feature learning, this is not applicable
to the generative case.
We developed a way to improve the consistency of generating long sequences
with recurrent networks.
We propose a way to construct a model similar to a bi-directional network.
The key insight is to use a soft L2 loss between the forward and
the backward generative recurrent networks.
We provide experimental evaluation on a multitude of tasks and datasets,
including speech recognition, image captioning, and language modeling.
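The core idea, an L2 penalty tying the forward and backward generative networks together, can be sketched as below. The alignment convention is a simplifying assumption here; the exact pairing of states (and any learned mapping between the two state spaces) belongs to the original formulation:

```python
import numpy as np

def twin_l2_loss(h_forward, h_backward):
    # h_forward[t] is the forward generative RNN's hidden state at
    # step t; h_backward[t] is the backward RNN's state while reading
    # the same sequence in reverse. Reversing the backward states
    # aligns both arrays so index t refers to the same position.
    h_backward_aligned = h_backward[::-1]
    diff = h_forward - h_backward_aligned
    # Squared L2 distance summed over all steps and state dimensions;
    # added to the usual likelihood loss as a regularizer.
    return np.sum(diff ** 2)
```

Minimizing this term encourages the forward network's state at each step to agree with a network that has already "seen the future", which is what stabilizes long generated sequences.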
In the third paper, we investigate the possibility of developing
an end-to-end intent recognizer for spoken language understanding.
Semantic spoken language understanding is an important
step towards developing a human-like artificial intelligence.
We have seen that end-to-end approaches show high
performance on tasks including machine translation and speech recognition.
We draw inspiration from the prior works to develop
an end-to-end system for intent recognition.
DeepEar: Robust smartphone audio sensing in unconstrained acoustic environments using deep learning
Microphones are remarkably powerful sensors of human behavior and context. However, audio sensing is highly susceptible to wild fluctuations in accuracy when used in the diverse acoustic environments (such as bedrooms, vehicles, or cafes) that users encounter on a daily basis. Towards addressing this challenge, we turn to the field of deep learning, an area of machine learning that has radically changed related audio modeling domains like speech recognition. In this paper, we present DeepEar – the first mobile audio sensing framework built from coupled Deep Neural Networks (DNNs) that simultaneously perform common audio sensing tasks. We train DeepEar with a large-scale dataset including unlabeled data from 168 place visits. The resulting learned model, involving 2.3M parameters, enables DeepEar to significantly increase inference robustness to background noise beyond conventional approaches present in mobile devices. Finally, we show DeepEar is feasible for smartphones by building a cloud-free DSP-based prototype that runs continuously, using only 6% of the smartphone’s battery daily. This is the author accepted manuscript. The final version is available from ACM via http://dx.doi.org/10.1145/2750858.280426
Representation learning for unsupervised speech processing
Automatic speech recognition for our most widely used languages has recently seen
substantial improvements, driven by improved training procedures for deep artificial
neural networks, cost-effective availability of computational power at large scale, and,
crucially, availability of large quantities of labelled training data. This success cannot
be transferred to low- and zero-resource languages where the requisite transcriptions are
unavailable.
Unsupervised speech processing promises better methods for dealing with under-resourced
languages. Here we investigate unsupervised neural network based models
for learning frame- and sequence-level representations with the goal of improving
zero-resource speech processing. Good representations eliminate differences in accent,
gender, channel characteristics, and other factors to model subword or whole-term units
for within- and across-speaker speech unit discrimination.
We present two contributions focussing on unsupervised learning of frame-level
representations: (1) an improved version of the correspondence autoencoder applied
to the INTERSPEECH 2015 Zero Resource Challenge, and (2) a proposed model for
learning representations that explicitly optimize speech unit discrimination.
We also present two contributions focussing on the efficiency and scalability of unsupervised
speech processing: (1) a proposed model and pilot experiments for learning a
linear-time approximation of the quadratic-time dynamic time warping algorithm, and
(2) a series of model proposals for learning fixed-size representations of variable-length
speech segments, enabling efficient vector space similarity measures.
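For reference, the quadratic-time dynamic time warping (DTW) algorithm that the proposed model approximates can be written as the standard dynamic programme below. This is the baseline being approximated, not the proposed model, and it uses scalar (1-D) frames for simplicity:

```python
def dtw_distance(a, b):
    # Standard O(n*m) DTW between two sequences of 1-D frames.
    # cost[i][j] holds the minimal accumulated distance aligning
    # a[:i] with b[:j]; real speech systems use vector frames and a
    # vector distance, which changes only the local-distance line.
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

The quadratic table fill is exactly the expense a learned linear-time approximation would avoid when comparing large numbers of speech segments.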
Machine learning-based fault detection and diagnosis in electric motors
Fault diagnosis is critical to any maintenance industry, as early fault detection can prevent
catastrophic failures as well as a waste of time and money. In view of these objectives,
vibration analysis in the frequency domain is a mature technique. Although well
established, traditional methods involve a high cost in time and personnel to identify
failures, which has driven the growth of machine learning methods in recent years.
Machine learning (ML) methods can be divided into two large groups: supervised and
unsupervised, with the main difference between them being whether the dataset is
labeled or not. This study
presents a total of four different methods for fault detection and diagnosis. The frequency
analysis of the vibration signal was the first approach employed. This analysis was chosen
to validate the subsequent results of the ML methods. The Gaussian mixture model (GMM)
was employed for the unsupervised technique. A GMM is a probabilistic model in which
all data points are assumed to be generated by a finite number of Gaussian distributions
with unknown parameters. For supervised learning, a convolutional neural network
(CNN) was used. CNNs are feedforward networks that were inspired by biological pattern
recognition processes. All methods were tested through a series of experiments with real
electric motors. Results showed that all methods can detect and classify the motors in
several induced operation conditions: healthy, unbalanced, mechanical looseness,
misalignment, bent shaft, broken bar, and bearing fault conditions. Although all
approaches are able to identify the fault, each technique has benefits and limitations that
make it better suited to certain types of applications; therefore, a comparison is also made
between the methods.
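As a rough illustration of the unsupervised technique (not the thesis's actual pipeline or vibration features), a two-component 1-D Gaussian mixture can be fitted by expectation-maximisation, with the recovered components separating "healthy" from "faulty" operating conditions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a vibration feature: one cluster of
# "healthy" readings and one of "faulty" readings.
x = np.concatenate([rng.normal(0.0, 0.5, 200),   # healthy cluster
                    rng.normal(4.0, 0.5, 200)])  # faulty cluster

# EM for a two-component 1-D GMM with unknown means, variances, weights.
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])
for _ in range(100):
    # E-step: responsibility of each component for each point.
    p = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = p / p.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

means = np.sort(mu)  # recovered cluster centres, ascending
```

Because no labels are used, the mixture discovers the two operating conditions on its own; in practice a new reading would be assigned to whichever component gives it the higher responsibility.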