Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation
This work presents a broad study on the adaptation of neural network acoustic
models by means of learning hidden unit contributions (LHUC) -- a method that
linearly re-combines hidden units in a speaker- or environment-dependent manner
using small amounts of unsupervised adaptation data. We also extend LHUC to a
speaker adaptive training (SAT) framework that leads to a more adaptable DNN
acoustic model, working both in a speaker-dependent and a speaker-independent
manner, without the requirements to maintain auxiliary speaker-dependent
feature extractors or to introduce significant speaker-dependent changes to the
DNN structure. Through a series of experiments on four different speech
recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4)
comprising 270 test speakers, we show that LHUC in both its test-only and SAT
variants results in consistent word error rate reductions ranging from 5% to
23% relative depending on the task and the degree of mismatch between training
and test data. In addition, we have investigated the effect of the amount of
adaptation data per speaker, the quality of unsupervised adaptation targets,
the complementarity to other adaptation techniques, one-shot adaptation, and an
extension to adapting DNNs trained in a sequence-discriminative manner.
Comment: 14 pages, 9 Tables, 11 Figures. In IEEE/ACM Transactions on Audio,
Speech and Language Processing, Vol. 24, Num. 8, 201
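The LHUC re-combination described above can be sketched in a few lines. This is a minimal NumPy illustration: the 2*sigmoid amplitude constraint follows the method as published, but the layer shapes and function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_layer(h, theta):
    """Re-scale each hidden unit by a speaker-dependent amplitude.

    A trained layer's activations h are multiplied element-wise by
    r = 2*sigmoid(theta), so each amplitude lies in (0, 2). Only the
    small vector theta is learned per speaker; the speaker-independent
    weights that produced h stay fixed.
    """
    r = 2.0 * sigmoid(np.asarray(theta, dtype=float))
    return r * h
```

With theta at zero every amplitude is 1 and the layer behaves exactly like the unadapted speaker-independent model, which is why LHUC needs only small amounts of adaptation data: it starts from, and can fall back to, the canonical model.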
Learning representations for speech recognition using artificial neural networks
Learning representations is a central challenge in machine learning. For speech
recognition, we are interested in learning robust representations that are stable
across different acoustic environments, recording equipment, and irrelevant inter-
and intra-speaker variabilities. This thesis is concerned with representation
learning for acoustic model adaptation to speakers and environments, construction
of acoustic models in low-resource settings, and learning representations from
multiple acoustic channels. The investigations are primarily focused on the hybrid
approach to acoustic modelling based on hidden Markov models and artificial
neural networks (ANN).
The first contribution concerns acoustic model adaptation. This comprises
two new adaptation transforms operating in the ANN parameter space. Both operate
at the level of activation functions and treat a trained ANN acoustic model as
a canonical set of fixed-basis functions, from which one can later derive variants
tailored to the specific distribution present in adaptation data. The first technique,
termed Learning Hidden Unit Contributions (LHUC), depends on learning
distribution-dependent linear combination coefficients for hidden units. This
technique is then extended to altering groups of hidden units with parametric and
differentiable pooling operators. We found that the proposed adaptation techniques
possess many desirable properties: they are relatively low-dimensional, do not overfit,
and can work in both a supervised and an unsupervised manner. For LHUC we
also present extensions to speaker adaptive training and environment factorisation.
On average, depending on the characteristics of the test set, 5-25% relative
word error rate (WERR) reductions are obtained in an unsupervised two-pass
adaptation setting.
The second contribution concerns building acoustic models in low-resource
data scenarios. In particular, we are concerned with insufficient amounts of
transcribed acoustic material for estimating acoustic models in the target language
– while assuming that resources like lexicons or texts to estimate language models
are available. First, we propose an ANN with a structured output layer
which models both context–dependent and context–independent speech units,
with the context-independent predictions used at runtime to aid the prediction
of context-dependent states. We also propose to perform multi-task adaptation
with a structured output layer. We obtain consistent WERR reductions up to
6.4% in low-resource speaker-independent acoustic modelling. Adapting those
models in a multi-task manner with LHUC yields an additional 13.6% WERR,
compared to 12.7% for non-multi-task LHUC. We then demonstrate that
one can build better acoustic models with unsupervised multi– and cross– lingual
initialisation and find that pre-training is largely language-independent. Up to
14.4% WERR reductions are observed, depending on the amount of the available
transcribed acoustic data in the target language.
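A structured output layer of this kind can be sketched as two coupled softmax heads. The exact coupling below, feeding the context-independent posteriors into the context-dependent head alongside the shared hidden layer, is an illustrative assumption, as are all shapes and names.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def structured_output(h, w_ci, w_cd):
    """Two coupled output heads on a shared hidden layer h: the
    context-independent (CI) posteriors are computed first, then
    concatenated with h to inform the context-dependent (CD) head,
    so the easier CI prediction aids the harder CD one."""
    ci = softmax(h @ w_ci)
    cd = softmax(np.concatenate([h, ci]) @ w_cd)
    return ci, cd
```

In a multi-task setting both heads carry a training loss; at adaptation time the same structure lets LHUC-style transforms be updated against either task.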
The third contribution concerns building acoustic models from multi-channel
acoustic data. For this purpose we investigate various ways of integrating and
learning multi-channel representations. In particular, we investigate channel concatenation
and the applicability of convolutional layers for this purpose. We
propose a multi-channel convolutional layer with cross-channel pooling, which
can be seen as a data-driven non-parametric auditory attention mechanism. We
find that for unconstrained microphone arrays, our approach is able to match the
performance of comparable models trained on beamform-enhanced signals.
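A cross-channel pooling layer of the kind described can be sketched as follows; max is assumed as the pooling operator, and the shapes are illustrative.

```python
import numpy as np

def cross_channel_max_pool(activations):
    """activations: array of shape (channels, units) holding each
    microphone channel's convolutional feature maps. Max-pooling
    across the channel axis keeps, per unit, the strongest channel
    response: a simple data-driven, non-parametric attention over
    microphones that needs no array geometry."""
    return np.max(activations, axis=0)
```

Because the pooling is non-parametric, no channel ordering or calibration is required, which is what makes it attractive for unconstrained microphone arrays.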
A Study on Deep Learning: Training, Models and Applications
In the past few years, deep learning has become a very important research field that has attracted a lot of research interest. Owing to the development of computational hardware such as high-performance GPUs, training deep models, such as fully-connected deep neural networks (DNNs) and convolutional neural networks (CNNs), from scratch has become practical, and using well-trained deep models to deal with real-world large-scale problems has also become possible. This dissertation mainly focuses on three important problems in deep learning, i.e., training algorithms, computational models, and applications, and provides several methods to improve the performance of different deep learning methods.
The first method is a DNN training algorithm called Annealed Gradient Descent (AGD). This dissertation presents a theoretical analysis of the convergence properties and learning speed of AGD to show its benefits. Experimental results have shown that AGD yields performance comparable to SGD while significantly expediting the training of DNNs on big data sets.
Secondly, this dissertation proposes to apply a novel model, namely Hybrid Orthogonal Projection and Estimation (HOPE), to CNNs. HOPE can be viewed as a hybrid model combining feature extraction with mixture models. Experimental results have shown that HOPE layers can significantly improve the performance of CNNs on image classification tasks.
The third proposed method applies CNNs to image saliency detection. In this approach, a gradient descent method is used to iteratively modify the input images based on pixel-wise gradients so as to reduce a pre-defined cost function. Moreover, SLIC superpixels and low-level saliency features are applied to smooth and refine the saliency maps. Experimental results have shown that the proposed method can generate high-quality saliency maps.
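The iterative input modification can be sketched generically. Here `grad_fn`, the step size, and the step count are illustrative assumptions; in the dissertation the cost and its pixel-wise gradients come from the trained CNN.

```python
import numpy as np

def refine_image(image, grad_fn, steps=10, lr=0.1):
    """Iteratively move the input image along its pixel-wise
    gradients to reduce a pre-defined cost. grad_fn(x) returns
    d(cost)/d(pixel) for the current image x."""
    x = np.array(image, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_fn(x)
    return x
```

The smoothing with SLIC superpixels and low-level saliency features would then be applied to the resulting map as a post-processing stage.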
The last method is also for image saliency detection. However, this method is based on the Generative Adversarial Network (GAN). Unlike a standard GAN, the proposed method uses fully supervised learning to learn both the G-Network and the D-Network; it is therefore called the Supervised Adversarial Network (SAN). Moreover, SAN introduces a different G-Network and conv-comparison layers to further improve saliency performance. Experimental results show that the SAN model can generate state-of-the-art saliency maps for complicated images.
Advances in deep learning methods for speech recognition and understanding
This work presents several studies in the areas of speech recognition and
understanding.
Semantic spoken language understanding is an important sub-domain of the
broader field of artificial intelligence.
Speech processing has interested researchers for a long time,
because language is one of the defining characteristics of a human being.
With the development of neural networks, the domain has seen rapid progress
both in terms of accuracy and human perception.
Another important milestone was achieved with the development of
end-to-end approaches.
Such approaches allow co-adaptation of all the parts of the model
thus increasing the performance, as well as simplifying the training
procedure.
End-to-end models became feasible with the increasing amount of available
data, computational resources, and most importantly with many novel
architectural developments.
Nevertheless, traditional, non end-to-end, approaches are still relevant
for speech processing due to challenging data in noisy environments,
accented speech, and high variety of dialects.
In the first work, we explore hybrid speech recognition in noisy
environments.
We propose to treat recognition in an unseen noise condition
as a domain adaptation task.
For this, we use the then-novel technique of adversarial
domain adaptation.
In a nutshell, this prior work proposed to train features in such
a way that they are discriminative for the primary task
but non-discriminative for the secondary task.
This secondary task is constructed to be the domain recognition task.
Thus, the features trained are invariant towards the domain at hand.
In our work, we adopt this technique and modify it for the task of
noisy speech recognition.
In the second work, we develop a general method for regularizing
the generative recurrent networks.
It is known that recurrent networks frequently have difficulty
staying on the same track when generating long outputs.
While it is possible to use bi-directional networks for better
sequence aggregation in feature learning, this is not applicable
to the generative case.
We developed a way to improve the consistency of generating long sequences
with recurrent networks.
We propose a way to construct a model similar to a bi-directional network.
The key insight is to use a soft L2 loss between the forward and
the backward generative recurrent networks.
We provide experimental evaluation on a multitude of tasks and datasets,
including speech recognition, image captioning, and language modeling.
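The forward/backward L2 regularizer can be sketched as follows. Aligning the hidden states by plain time reversal is an assumption; the actual matching used in the work may differ.

```python
import numpy as np

def forward_backward_l2(h_fwd, h_bwd):
    """Soft L2 regularizer between the hidden states of a forward
    generative RNN (h_fwd, shape (time, dim)) and a backward one
    run over the reversed sequence (h_bwd). Penalizing the squared
    distance encourages the forward states to anticipate what the
    backward pass already knows about the sequence's future."""
    diff = h_fwd - h_bwd[::-1]
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Only the forward network is kept at generation time, so the backward network acts purely as a training-time teacher.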
In the third paper, we investigate the possibility of developing
an end-to-end intent recognizer for spoken language understanding.
Semantic spoken language understanding is an important
step towards developing a human-like artificial intelligence.
We have seen that the end-to-end approaches show high
performance on the tasks including machine translation and speech recognition.
We draw inspiration from the prior works to develop
an end-to-end system for intent recognition.
Regularization and Compression of Deep Neural Networks
Deep neural networks (DNNs) are the state-of-the-art machine learning models, outperforming traditional machine learning methods in a number of domains from vision and speech to natural language understanding and autonomous control. With large amounts of data becoming available, the task performance of DNNs in these domains predictably scales with the size of the DNNs. However, in data-scarce scenarios, large DNNs overfit to the training dataset, resulting in inferior performance. Additionally, in scenarios where enormous amounts of data are available, large DNNs incur large inference latencies and memory costs. Thus, while imperative for achieving state-of-the-art performance, large DNNs require large amounts of data for training and large computational resources during inference.
These two problems could be mitigated by sparsely training large DNNs. Imposing sparsity constraints during training limits the capacity of the model to overfit to the training set while still being able to obtain good generalization. Sparse DNNs have most of their weights close to zero after training. Therefore, most of the weights could be removed resulting in smaller inference costs. To effectively train sparse DNNs, this thesis proposes two new sparse stochastic regularization techniques called Bridgeout and Sparseout. Furthermore, Bridgeout is used to prune convolutional neural networks for low-cost inference.
Bridgeout randomly perturbs the weights of a parametric model such as a DNN. It is theoretically shown that Bridgeout constrains the weights of linear models to a sparse subspace. Empirically, Bridgeout has been shown to perform better, on image classification tasks, than state-of-the-art DNNs when the data is limited.
Sparseout is an activations counter-part of Bridgeout, operating on the outputs of the neurons instead of the weights of the neurons. Theoretically, Sparseout has been shown to be a general case of the commonly used Dropout regularization method. Empirical evidence suggests that Sparseout is capable of controlling the level of activations sparsity in neural networks. This flexibility allows Sparseout to perform better than Dropout on image classification and language modelling tasks. Furthermore, using Sparseout, it is found that activation sparsity is beneficial to recurrent neural networks for language modeling but densification of activations favors convolutional neural networks for image classification.
To address the problem of high computational cost during inference, this thesis evaluates Bridgeout for pruning convolutional neural networks (CNNs). It is shown that recent CNN architectures such as VGG, ResNet and Wide-ResNet trained with Bridgeout are more robust to one-shot filter pruning compared to non-sparse stochastic regularization.
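Since Sparseout is presented as a generalization of Dropout operating on activations, its Dropout special case can be sketched as follows; the additional Sparseout term that controls the level of activation sparsity is defined in the thesis and not reproduced here.

```python
import numpy as np

def dropout(activations, p, rng):
    """Inverted dropout on a layer's outputs: zero each activation
    with probability p and rescale the survivors by 1/(1-p) so the
    expected output is unchanged. Sparseout generalizes this kind of
    stochastic perturbation to steer activation sparsity."""
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)
```

Bridgeout applies the analogous stochastic perturbation to the weights rather than the activations, which is what yields the sparse-subspace constraint used later for pruning.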
Hydrocarbon quantification using neural networks and deep learning based hyperspectral unmixing
Hydrocarbon (HC) spills are a global issue which can seriously impact human life and the environment; therefore, early identification and timely remedial measures are important. Thus, current research efforts aim at remotely quantifying incipient quantities of HC mixed with soils. The increased spectral and spatial resolution of hyperspectral sensors has opened ground-breaking perspectives in many industries, including remote inspection of large areas and the environment. The use of subpixel detection algorithms, and in particular the use of mixture models, has been identified as a future advance that needs to be incorporated in remote sensing. However, there are some challenging tasks, since the spectral signatures of the targets of interest may not be immediately available. Moreover, real-time processing and analysis is required to support fast decision-making. Progressing in this direction, this thesis pioneers and researches novel methodologies for HC quantification capable of exceeding the limitations of existing systems in terms of reduced cost and processing time with improved accuracy. The goal of this research is therefore to develop, implement and test different methods for improving HC detection and quantification using spectral unmixing and machine learning. An efficient hybrid switch method employing neural networks and hyperspectral unmixing is proposed and investigated. This robust method switches between state-of-the-art linear and nonlinear hyperspectral unmixing models. This procedure is well suited for the quantification of small quantities of substances within a pixel with high accuracy, as the most appropriate model is employed. Central to the proposed approach is a novel method for extracting parameters to characterise the non-linearity of the data. These parameters are fed into a feedforward neural network which decides, in a pixel-by-pixel fashion, which model is more suitable.
The quantification process is fully automated by applying further classification techniques to the acquired hyperspectral images. A deep learning neural network model is designed for the quantification of HC quantities mixed with soils. A three-term backpropagation algorithm with dropout is proposed to avoid overfitting and reduce the computational complexity of the model.
The above methods have been evaluated using classical repository datasets from the literature and a laboratory controlled dataset. For that, an experimental procedure has been designed to produce a labelled dataset. The data was obtained by mixing and homogenizing different soil types with HC substances, respectively and measuring the reflectance with a hyperspectral sensor.
Findings from the research study reveal that the two proposed models have high performance, are suitable for the detection and quantification of HC mixed with soils, and surpass existing methods. Improvements in sensitivity, accuracy, and computational time are achieved. Thus, the proposed approaches can be used to detect HC spills at an early stage in order to mitigate significant pollution from the spill areas.
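The linear branch of the switch can be sketched with the standard linear mixing model; the nonlinear models and the switching network itself are not reproduced, and the shapes and names below are illustrative.

```python
import numpy as np

def linear_mixture(endmembers, abundances):
    """Linear mixing model: a pixel spectrum is a convex combination
    of endmember spectra. endmembers has shape (n_endmembers, bands);
    abundances are non-negative and sum to one. The hybrid switch
    falls back to a nonlinear model when its per-pixel non-linearity
    features flag this model as inadequate."""
    a = np.asarray(abundances, dtype=float)
    assert np.all(a >= 0) and np.isclose(a.sum(), 1.0)
    return np.asarray(endmembers).T @ a
```

Estimating the abundance vector from an observed pixel (the unmixing step) inverts this model subject to the non-negativity and sum-to-one constraints.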
Manifold Learning Approaches to Compressing Latent Spaces of Unsupervised Feature Hierarchies
Field robots encounter dynamic unstructured environments containing a vast array of unique objects. In order to make sense of the world in which they are placed, they collect large quantities of unlabelled data with a variety of sensors. Producing robust and reliable applications depends entirely on the ability of the robot to understand the unlabelled data it obtains. Deep Learning techniques have had a high level of success in learning powerful unsupervised representations for a variety of discriminative and generative models. Applying these techniques to problems encountered in field robotics remains a challenging endeavour. Modern Deep Learning methods are typically trained with a substantial labelled dataset, while datasets produced in a field robotics context contain limited labelled training data. The primary motivation for this thesis stems from the problem of applying large scale Deep Learning models to field robotics datasets that are label poor. While the lack of labelled ground truth data drives the desire for unsupervised methods, the need for improving the model scaling is driven by two factors: performance and computational requirements. When utilising unsupervised layer outputs as representations for classification, the classification performance increases with layer size. Scaling up models with multiple large layers of features is problematic, as the size of each subsequent hidden layer scales with the size of the previous layer. This quadratic scaling, and the associated time required to train such networks, has prevented adoption of large Deep Learning models beyond cluster computing. The contributions in this thesis are developed from the observation that parameters or filter elements learnt in Deep Learning systems are typically highly structured, and contain related elements. Firstly, the structure of unsupervised filters is utilised to construct a mapping from the high dimensional filter space to a low dimensional manifold.
This creates a significantly smaller representation for subsequent feature learning. This mapping, and its effect on the resulting encodings, highlights the need for the ability to learn highly overcomplete sets of convolutional features. Driven by this need, the unsupervised pretraining of Deep Convolutional Networks is developed to include a number of modern training and regularisation methods. These pretrained models are then used to provide initialisations for supervised convolutional models trained on low quantities of labelled data. By utilising pretraining, a significant increase in classification performance on a number of publicly available datasets is achieved. In order to apply these techniques to outdoor 3D Laser Illuminated Detection And Ranging data, we develop a set of resampling techniques to provide uniform input to Deep Learning models. The features learnt in these systems outperform the high effort hand engineered features developed specifically for 3D data. The representation of a given signal is then reinterpreted as a combination of modes that exist on the learnt low dimensional filter manifold. From this, we develop an encoding technique that allows the high dimensional layer output to be represented as a combination of low dimensional components. This allows the growth of subsequent layers to only be dependent on the intrinsic dimensionality of the filter manifold and not the number of elements contained in the previous layer. Finally, the resulting unsupervised convolutional model, the encoding frameworks and the embedding methodology are used to produce a new unsupervised learning strategy that is able to encode images in terms of overcomplete filter spaces, without producing an explosion in the size of the intermediate parameter spaces.
This model produces classification results on par with state-of-the-art models, yet requires significantly fewer computational resources and is suitable for use in the constrained computation environment of a field robot.
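As a rough linear stand-in for the filter-manifold mapping: the thesis learns a manifold, whereas the PCA projection below is only an illustrative proxy showing how a filter bank can be reduced to k intrinsic dimensions.

```python
import numpy as np

def embed_filters(filters, k):
    """Project a bank of high-dimensional filters (rows of `filters`)
    onto a k-dimensional linear subspace via PCA. Subsequent layers can
    then depend on k, the intrinsic dimensionality, rather than on the
    full filter dimension, avoiding the quadratic layer-size growth."""
    mu = filters.mean(axis=0)
    centered = filters - mu
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T
```

A learnt nonlinear manifold plays the same role but can capture curved filter families that a single linear subspace cannot.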
Continual deep learning via progressive learning
Machine learning is one of several approaches to artificial intelligence. It allows us to build machines that can learn from experience as opposed to being explicitly programmed. Current machine learning formulations are mostly designed for learning and performing a particular task from a tabula rasa using data available for that task. For machine learning to converge to artificial intelligence, in addition to other desiderata, it must be in a state of continual learning, i.e., have the ability to be in a continuous learning process, such that when a new task is presented, the system can leverage prior knowledge from prior tasks, in learning and performing this new task, and augment the prior knowledge with the newly acquired knowledge without having a significant adverse effect on the prior knowledge. Continual learning is key to advancing machine learning and artificial intelligence. Deep learning is a powerful general-purpose approach to machine learning that is able to solve numerous and various tasks with minimal modification. Deep learning extends machine learning, and specially neural networks, to learn multiple levels of distributed representations together with the required mapping function into a single composite function. The emergence of deep learning and neural networks as a generic approach to machine learning, coupled with their ability to learn versatile hierarchical representations, has paved the way for continual learning. The main aim of this thesis is the study and development of a structured approach to continual learning, leveraging the success of deep learning and neural networks. This thesis studies the application of deep learning to a number of supervised learning tasks, and in particular, classification tasks in machine perception, e.g., image recognition, automatic speech recognition, and speech emotion recognition. 
The relation between the systems developed for these tasks is investigated to illuminate the layer-wise relevance of features in deep networks trained for these tasks via transfer learning, and these independent systems are unified into continual learning systems. The main contribution of this thesis is the construction and formulation of a deep learning framework, denoted progressive learning, that allows a holistic and systematic approach to continual learning. Progressive learning comprises a number of procedures that address the continual learning desiderata. It is shown that, when tasks are related, progressive learning leads to faster learning that converges to better generalization performance using smaller amounts of data and a smaller number of dedicated parameters, for the tasks studied in this thesis, by accumulating and leveraging knowledge learned across tasks in a continuous manner. It is envisioned that progressive learning is a step towards a fully general continual learning framework.
Connectionist multivariate density-estimation and its application to speech synthesis
Autoregressive models factorize a multivariate joint probability distribution into a
product of one-dimensional conditional distributions. The variables are assigned
an ordering, and the conditional distribution of each variable modelled using all
variables preceding it in that ordering as predictors.
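The factorization described above can be written directly in code. The one-dimensional conditional is supplied by the model (in NADE/RNADE, a neural network); here it is an arbitrary callable, which is an illustrative assumption.

```python
import math

def autoregressive_log_density(x, log_conditional):
    """log p(x) = sum_d log p(x_d | x_<d): each dimension is modelled
    by a one-dimensional conditional given all preceding dimensions
    in the chosen ordering. log_conditional(x_d, context) returns the
    log-density of x_d given the preceding values."""
    return sum(log_conditional(x[d], x[:d]) for d in range(len(x)))
```

Because each factor is a normalized one-dimensional density, the product is normalized by construction, which is what gives autoregressive models their tractable likelihoods and sampling.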
Calculating normalized probabilities and sampling has polynomial computational
complexity under autoregressive models. Moreover, binary autoregressive
models based on neural networks obtain statistical performance similar to that of
some intractable models, like restricted Boltzmann machines, on several datasets.
The use of autoregressive probability density estimators based on neural
networks to model real-valued data, while proposed before, has never been properly
investigated and reported. In this thesis we extend the formulation of neural
autoregressive distribution estimators (NADE) to real-valued data; a model we call
the real-valued neural autoregressive density estimator (RNADE). Its statistical
performance on several datasets, including visual and auditory data, is reported
and compared to that of other models. RNADE obtained higher test likelihoods
than other tractable models, while retaining all the attractive computational
properties of autoregressive models.
However, autoregressive models are limited by the ordering of the variables
inherent to their formulation. Marginalization and imputation tasks can only be
solved analytically if the missing variables are at the end of the ordering. We
present a new training technique that obtains a set of parameters that can be
used for any ordering of the variables. By choosing a model with a convenient
ordering of the dimensions at test time, it is possible to solve any marginalization
and imputation tasks analytically.
The same training procedure also makes it practical to train NADEs and
RNADEs with several hidden layers. The resulting deep and tractable models
display higher test likelihoods than the equivalent one-hidden-layer models for all
the datasets tested.
Ensembles of NADEs or RNADEs can be created inexpensively by combining
models that share their parameters but differ in the ordering of the variables. These
ensembles of autoregressive models obtain state-of-the-art statistical performances
for several datasets.
Finally, we demonstrate the application of RNADE to speech synthesis, and
confirm that capturing the phone-conditional dependencies of acoustic features
improves the quality of synthetic speech. Our model generates synthetic speech
that was judged by naive listeners as being of higher quality than that generated
by mixture density networks, which are considered a state-of-the-art synthesis
technique.