Why Learning of Large-Scale Neural Networks Behaves Like Convex Optimization
In this paper, we present some theoretical work to explain why simple
gradient descent methods are so successful in solving non-convex optimization
problems in learning large-scale neural networks (NNs). After introducing a
mathematical tool called canonical space, we have proved that the objective
functions in learning NNs are convex in the canonical model space. We further
elucidate that the gradients between the original NN model space and the
canonical space are related by a pointwise linear transformation, which is
represented by the so-called disparity matrix. Furthermore, we have proved that
gradient descent methods are guaranteed to converge to a global minimum of zero loss
provided that the disparity matrices maintain full rank. If this full-rank
condition holds, the learning of NNs behaves in the same way as ordinary convex
optimization. Finally, we have shown that the chance of encountering singular disparity
matrices is extremely slim in large NNs. In particular, when over-parameterized
NNs are randomly initialized, gradient descent algorithms converge to a
global minimum of zero loss in probability.
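The convergence claim can be illustrated numerically. The sketch below is not the paper's canonical-space construction; it is a hedged analogue in which the Jacobian of the network outputs with respect to the parameters plays the role of the full-rank disparity matrix, and plain gradient descent on a small over-parameterized network drives the training loss to (numerically) zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy over-parameterized regression: 3 training points in R^2,
# one hidden layer of 40 tanh units, so parameters far outnumber data.
X = rng.normal(size=(3, 2))
y = rng.normal(size=3)
W = rng.normal(size=(40, 2))            # hidden weights
v = rng.normal(size=40) / np.sqrt(40)   # output weights

def outputs(W, v):
    return np.tanh(X @ W.T) @ v         # network outputs on the 3 points

def jacobian(W, v):
    # Jacobian of the outputs w.r.t. all parameters; its full row rank
    # plays the role of the paper's full-rank disparity matrix condition.
    H = np.tanh(X @ W.T)                                    # (3, 40)
    J_W = (v * (1 - H**2))[:, :, None] * X[:, None, :]      # (3, 40, 2)
    return np.concatenate([J_W.reshape(3, -1), H], axis=1)  # (3, 120)

jac_rank = np.linalg.matrix_rank(jacobian(W, v))

lr = 0.01
for _ in range(10000):                  # plain gradient descent
    H = np.tanh(X @ W.T)
    r = H @ v - y                       # residuals on the training set
    gv = H.T @ r
    gW = (v * ((1 - H**2) * r[:, None])).T @ X
    W -= lr * gW
    v -= lr * gv

final_loss = 0.5 * np.sum((outputs(W, v) - y) ** 2)
```

With a random initialization the Jacobian has full row rank, and the squared loss decays to a numerical zero, mirroring the behaviour the paper predicts for the full-rank regime.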
Joint Training Methods for Tandem and Hybrid Speech Recognition Systems using Deep Neural Networks
Hidden Markov models (HMMs) have been the mainstream acoustic modelling approach for state-of-the-art automatic speech recognition (ASR) systems over the
past few decades. Recently, due to the rapid development of deep learning technologies, deep neural networks (DNNs) have become an essential part of nearly all kinds of ASR approaches. Among HMM-based ASR approaches, DNNs are most commonly used to extract features (tandem system configuration) or to directly produce HMM output probabilities (hybrid system configuration).
Although DNN tandem and hybrid systems have been shown to have superior
performance to traditional ASR systems without any DNN models, there are still
issues with such systems. First, some of the DNN settings, such as the choice of
the context-dependent (CD) output target set and the hidden activation functions, are
usually determined independently from the DNN training process. Second, different
ASR modules are separately optimised based on different criteria following a greedy
build strategy. For instance, for tandem systems, the features are often extracted by a
DNN trained to classify individual speech frames while acoustic models are built upon
such features according to a sequence level criterion. These issues mean that the best performance is not theoretically guaranteed.
This thesis focuses on alleviating both issues using joint training methods. In DNN
acoustic model joint training, the decision tree HMM state tying approach is extended
to cluster DNN-HMM states. Based on this method, an alternative CD-DNN training
procedure without relying on any additional system is proposed, which can produce
DNN acoustic models comparable in word error rate (WER) with those trained by the
conventional procedure. Meanwhile, the most common hidden activation functions,
the sigmoid and rectified linear unit (ReLU), are parameterised to enable automatic
learning of function forms. Experiments using conversational telephone speech (CTS)
Mandarin data yield average relative character error rate (CER) reductions of 3.4% and 2.2% with the sigmoid and ReLU parameterisations, respectively. Such parameterised functions can also be applied to speaker adaptation tasks.
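Parameterised activations of this kind can be sketched as follows. The exact functional forms used in the thesis are not reproduced here, so the scale/shift/slope parameters below (eta, gamma, theta, alpha, beta) are illustrative assumptions, one common way to make sigmoid and ReLU shapes learnable:

```python
import numpy as np

def p_sigmoid(x, eta=1.0, gamma=1.0, theta=0.0):
    # Parameterised sigmoid: eta scales the output, gamma scales the
    # input, theta shifts it; (1, 1, 0) recovers the standard sigmoid.
    return eta / (1.0 + np.exp(-(gamma * x + theta)))

def p_relu(x, alpha=1.0, beta=0.0):
    # Parameterised ReLU: alpha is the slope for positive inputs,
    # beta for negative inputs; (1, 0) recovers the standard ReLU,
    # and a small positive beta recovers a leaky ReLU.
    return np.where(x > 0, alpha * x, beta * x)
```

Because the parameters enter the forward pass smoothly, their gradients are available from back-propagation and they can be updated jointly with the weights during training.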
At the ASR system level, DNN acoustic model and corresponding speaker dependent (SD) input feature transforms are jointly learned through minimum phone error
(MPE) training as an example of hybrid system joint training, which outperforms the
conventional hybrid system speaker adaptive training (SAT) method. MPE based speaker independent (SI) tandem system joint training is also studied. Experiments on
multi-genre broadcast (MGB) English data show that this method gives a reduction
in tandem system WER of 11.8% (relative), and the resulting tandem systems are
comparable to MPE hybrid systems in both WER and the number of parameters. In
addition, all approaches in this thesis have been implemented using the hidden Markov
model toolkit (HTK) and the related source code has been or will be made publicly
available with recent or future HTK releases, to increase the reproducibility of the work
presented in this thesis.
Funding: Cambridge International Scholarship (Cambridge Overseas Trust); EPSRC Natural
Speech Technology Project; DARPA BOLT Program; IARPA Babel Program.
Advances in deep learning methods for speech recognition and understanding
This work presents several studies in the areas of speech recognition and
understanding.
Semantic speech understanding is an important sub-domain of the
broader field of artificial intelligence.
Speech processing has long interested researchers,
because language is one of the defining characteristics of a human being.
With the development of neural networks, the domain has seen rapid progress
both in terms of accuracy and human perception.
Another important milestone was achieved with the development of
end-to-end approaches.
Such approaches allow co-adaptation of all the parts of the model
thus increasing the performance, as well as simplifying the training
procedure.
End-to-end models became feasible with the increasing amount of available
data, computational resources, and most importantly with many novel
architectural developments.
Nevertheless, traditional, non end-to-end, approaches are still relevant
for speech processing due to challenging data in noisy environments,
accented speech, and high variety of dialects.
In the first work, we explore hybrid speech recognition in noisy
environments.
We propose to treat recognition under unseen noise conditions
as a domain adaptation task.
For this, we use the then-novel technique of adversarial
domain adaptation.
In a nutshell, this prior work proposed to train features in such
a way that they are discriminative for the primary task,
but non-discriminative for the secondary task.
This secondary task is constructed to be the domain recognition task.
Thus, the trained features are invariant to the domain at hand.
In our work, we adopt this technique and modify it for the task of
noisy speech recognition.
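The adversarial domain adaptation recipe described above centres on a gradient reversal layer. The minimal sketch below (the class name and interface are illustrative, not the thesis's implementation) shows the core mechanics:

```python
class GradientReversal:
    """Gradient reversal layer (GRL), the core trick of adversarial
    domain adaptation: identity in the forward pass, gradient scaled
    by -lambda in the backward pass. Placed between the feature
    extractor and the domain classifier, it makes the features
    *fool* the domain classifier while still serving the primary task."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between primary and adversarial losses

    def forward(self, x):
        # Forward pass: pass features through unchanged.
        return x

    def backward(self, grad_output):
        # Backward pass: negate (and scale) the domain-classifier
        # gradient before it reaches the feature extractor.
        return -self.lam * grad_output
```

In a full system, the primary (speech recognition) loss back-propagates normally, while the domain loss reaches the shared features only through this sign-flipped path, pushing them toward noise-condition invariance.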
In the second work, we develop a general method for regularizing
the generative recurrent networks.
It is known that recurrent networks frequently have difficulty
staying on the same track when generating long outputs.
While it is possible to use bi-directional networks for better
sequence aggregation in feature learning, this is not applicable
to the generative case.
We developed a way to improve the consistency of generating long sequences
with recurrent networks.
We propose a way to construct a model similar to a bi-directional network.
The key insight is to use a soft L2 loss between the forward and
the backward generative recurrent networks.
We provide experimental evaluation on a multitude of tasks and datasets,
including speech recognition, image captioning, and language modeling.
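The core idea, an L2 penalty tying the forward and backward generative networks together, can be sketched as below. The alignment convention is a simplifying assumption here; the exact pairing of states (and any learned mapping between the two state spaces) belongs to the original formulation:

```python
import numpy as np

def twin_l2_loss(h_forward, h_backward):
    # h_forward[t] is the forward generative RNN's hidden state at
    # step t; h_backward[t] is the backward RNN's state while reading
    # the same sequence in reverse. Reversing the backward states
    # aligns both arrays so index t refers to the same position.
    h_backward_aligned = h_backward[::-1]
    diff = h_forward - h_backward_aligned
    # Squared L2 distance summed over all steps and state dimensions;
    # added to the usual likelihood loss as a regularizer.
    return np.sum(diff ** 2)
```

Minimizing this term encourages the forward network's state at each step to agree with a network that has already "seen the future", which is what stabilizes long generated sequences.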
In the third paper, we investigate the possibility of developing
an end-to-end intent recognizer for spoken language understanding.
Semantic spoken language understanding is an important
step towards developing a human-like artificial intelligence.
We have seen that end-to-end approaches show high
performance on tasks including machine translation and speech recognition.
We draw inspiration from the prior works to develop
an end-to-end system for intent recognition.
DeepEar: Robust smartphone audio sensing in unconstrained acoustic environments using deep learning
Microphones are remarkably powerful sensors of human behavior and context. However, audio sensing is highly susceptible to wild fluctuations in accuracy when used in the diverse acoustic environments (such as bedrooms, vehicles, or cafes) that users encounter on a daily basis. Towards addressing this challenge, we turn to the field of deep learning, an area of machine learning that has radically changed related audio modeling domains like speech recognition. In this paper, we present DeepEar – the first mobile audio sensing framework built from coupled Deep Neural Networks (DNNs) that simultaneously perform common audio sensing tasks. We train DeepEar with a large-scale dataset including unlabeled data from 168 place visits. The resulting learned model, involving 2.3M parameters, enables DeepEar to significantly increase inference robustness to background noise beyond conventional approaches present in mobile devices. Finally, we show DeepEar is feasible for smartphones by building a cloud-free DSP-based prototype that runs continuously, using only 6% of the smartphone’s battery daily. This is the author accepted manuscript. The final version is available from ACM via http://dx.doi.org/10.1145/2750858.280426
Representation learning for unsupervised speech processing
Automatic speech recognition for our most widely used languages has recently seen
substantial improvements, driven by improved training procedures for deep artificial
neural networks, cost-effective availability of computational power at large scale, and,
crucially, availability of large quantities of labelled training data. This success cannot
be transferred to low- and zero-resource languages where the requisite transcriptions are
unavailable.
Unsupervised speech processing promises better methods for dealing with under-resourced
languages. Here we investigate unsupervised neural network based models
for learning frame- and sequence-level representations with the goal of improving
zero-resource speech processing. Good representations eliminate differences in accent,
gender, channel characteristics, and other factors to model subword or whole-term units
for within- and across-speaker speech unit discrimination.
We present two contributions focussing on unsupervised learning of frame-level
representations: (1) an improved version of the correspondence autoencoder applied
to the INTERSPEECH 2015 Zero Resource Challenge, and (2) a proposed model for
learning representations that explicitly optimize speech unit discrimination.
We also present two contributions focussing on the efficiency and scalability of unsupervised
speech processing: (1) a proposed model and pilot experiments for learning a
linear-time approximation of the quadratic-time dynamic time warping algorithm, and
(2) a series of model proposals for learning fixed-size representations of variable-length
speech segments, enabling efficient vector space similarity measures.
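For reference, the quadratic-time dynamic time warping (DTW) algorithm that the proposed model approximates can be written as the standard dynamic programme below. This is the baseline being approximated, not the proposed model, and it uses scalar (1-D) frames for simplicity:

```python
def dtw_distance(a, b):
    # Standard O(n*m) DTW between two sequences of 1-D frames.
    # cost[i][j] holds the minimal accumulated distance aligning
    # a[:i] with b[:j]; real speech systems use vector frames and a
    # vector distance, which changes only the local-distance line.
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

The quadratic table fill is exactly the expense a learned linear-time approximation would avoid when comparing large numbers of speech segments.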
Machine learning-based fault detection and diagnosis in electric motors
Fault diagnosis is critical to any maintenance industry, as early fault detection can prevent
catastrophic failures as well as a waste of time and money. In view of these objectives,
vibration analysis in the frequency domain is a mature technique. Although well
established, traditional methods involve a high cost in time and personnel to identify
failures, which has driven the growth of machine learning methods in recent years.
Machine learning (ML) methods can be divided into two large groups: supervised and
unsupervised, with the main difference between them being whether the dataset is
labeled or not. This study
presents a total of four different methods for fault detection and diagnosis. The frequency
analysis of the vibration signal was the first approach employed. This analysis was chosen
to validate the subsequent results of the ML methods. The Gaussian mixture model (GMM)
was employed for the unsupervised technique. A GMM is a probabilistic model in which
all data points are assumed to be generated by a finite number of Gaussian distributions
with unknown parameters. For supervised learning, a convolutional neural network
(CNN) was used. CNNs are feedforward networks that were inspired by biological pattern
recognition processes. All methods were tested through a series of experiments with real
electric motors. Results showed that all methods can detect and classify the motors in
several induced operation conditions: healthy, unbalanced, mechanical looseness,
misalignment, bent shaft, broken bar, and bearing fault conditions. Although all
approaches are able to identify the fault, each technique has benefits and limitations that
make it better suited to certain types of applications; therefore, a comparison is also made
between the methods.
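As a rough illustration of the unsupervised technique (not the thesis's actual pipeline or vibration features), a two-component 1-D Gaussian mixture can be fitted by expectation-maximisation, with the recovered components separating "healthy" from "faulty" operating conditions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a vibration feature: one cluster of
# "healthy" readings and one of "faulty" readings.
x = np.concatenate([rng.normal(0.0, 0.5, 200),   # healthy cluster
                    rng.normal(4.0, 0.5, 200)])  # faulty cluster

# EM for a two-component 1-D GMM with unknown means, variances, weights.
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])
for _ in range(100):
    # E-step: responsibility of each component for each point.
    p = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = p / p.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

means = np.sort(mu)  # recovered cluster centres, ascending
```

Because no labels are used, the mixture discovers the two operating conditions on its own; in practice a new reading would be assigned to whichever component gives it the higher responsibility.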