112 research outputs found
Advances in deep learning methods for speech recognition and understanding
This work presents several studies in the areas of speech recognition and
understanding.
Semantic speech understanding is an important sub-domain of the
broader field of artificial intelligence.
Speech processing has long interested researchers
because language is one of the defining characteristics of a human being.
With the development of neural networks, the domain has seen rapid progress
both in terms of accuracy and human perception.
Another important milestone was achieved with the development of
end-to-end approaches.
Such approaches allow co-adaptation of all the parts of the model,
thus increasing performance and simplifying the training
procedure.
End-to-end models became feasible with the increasing amount of available
data, computational resources, and most importantly with many novel
architectural developments.
Nevertheless, traditional, non-end-to-end approaches are still relevant
for speech processing due to challenging data in noisy environments,
accented speech, and the high variety of dialects.
In the first work, we explore hybrid speech recognition in noisy
environments.
We propose to treat recognition under unseen noise conditions
as a domain adaptation task.
For this, we use adversarial domain adaptation, a technique that was
novel at the time.
In a nutshell, this prior work proposed to train features in such
a way that they are discriminative for the primary task
but non-discriminative for the secondary task.
This secondary task is constructed to be the domain recognition task.
Thus, the trained features are invariant to the domain at hand.
In our work, we adopt this technique and modify it for the task of
noisy speech recognition.
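
The adversarial idea described above can be illustrated with a small NumPy sketch. The abstract does not say how the adversarial objective is implemented, so the gradient-reversal formulation used here is an assumption, as are the linear heads, the weight `lam`, and every function and variable name.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def xent(logits, label):
    """Softmax cross-entropy for a single example."""
    return -np.log(softmax(logits)[label])

def xent_grad_wrt_features(head, features, label):
    """Gradient of the cross-entropy w.r.t. the shared features,
    for a linear head with logits = head @ features."""
    p = softmax(head @ features)
    p[label] -= 1.0
    return head.T @ p

def adversarial_feature_grad(grad_task, grad_domain, lam=1.0):
    """Assumed gradient-reversal rule: follow the task gradient as usual,
    but flip the sign of the domain gradient so that a descent step on the
    features makes the domain harder to recognise."""
    return grad_task - lam * grad_domain

rng = np.random.default_rng(0)
features = rng.normal(size=8)            # output of a hypothetical shared encoder
task_head = rng.normal(size=(5, 8))      # primary task: 5 hypothetical classes
domain_head = rng.normal(size=(2, 8))    # secondary task: clean vs. noisy domain

g_task = xent_grad_wrt_features(task_head, features, label=3)
g_domain = xent_grad_wrt_features(domain_head, features, label=1)
g = adversarial_feature_grad(g_task, g_domain, lam=0.5)
```

In this sketch the domain head itself would still be trained with the ordinary, unreversed gradient; only the signal reaching the shared features is flipped.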
In the second work, we develop a general method for regularizing
generative recurrent networks.
It is known that recurrent networks frequently have difficulty
staying on the same track when generating long outputs.
While it is possible to use bi-directional networks for better
sequence aggregation in feature learning, this is not applicable
to the generative case.
We developed a way to improve the consistency of long sequences generated
with recurrent networks.
We propose a way to construct a model similar to a bi-directional network.
The key insight is to use a soft L2 loss between the forward and
the backward generative recurrent networks.
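
A rough sketch of such a penalty follows. The abstract does not define what makes the loss "soft" or how the two networks are parameterised, so this example assumes plain tanh recurrent networks and an unweighted mean squared distance between time-aligned hidden states; all names are hypothetical.

```python
import numpy as np

def rnn_states(inputs, w_in, w_rec):
    """Hidden states of a simple tanh RNN, one row per time step."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(w_in @ x_t + w_rec @ h)
        states.append(h)
    return np.stack(states)

def forward_backward_l2_penalty(h_fwd, h_bwd):
    """Mean over time of the squared L2 distance between aligned states."""
    return float(np.mean(np.sum((h_fwd - h_bwd) ** 2, axis=1)))

rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 4))                       # 6 time steps, 4 features
fwd_params = (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)))
bwd_params = (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)))

h_fwd = rnn_states(seq, *fwd_params)
# The backward network reads the sequence in reverse; its states are flipped
# back so that row t of both arrays refers to the same time step.
h_bwd = rnn_states(seq[::-1], *bwd_params)[::-1]
penalty = forward_backward_l2_penalty(h_fwd, h_bwd)
```

During training, a term like `penalty` would be added, with some weight, to the generation loss of the forward network; only the forward network is needed when generating.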
We provide experimental evaluation on a multitude of tasks and datasets,
including speech recognition, image captioning, and language modeling.
In the third paper, we investigate the possibility of developing
an end-to-end intent recognizer for spoken language understanding.
Semantic spoken language understanding is an important
step towards developing human-like artificial intelligence.
We have seen that end-to-end approaches show high
performance on tasks including machine translation and speech recognition.
We draw inspiration from prior work to develop
an end-to-end system for intent recognition.
Regularization and Compression of Deep Neural Networks
Deep neural networks (DNNs) are state-of-the-art machine learning models that outperform traditional machine learning methods in a number of domains, from vision and speech to natural language understanding and autonomous control. With large amounts of data becoming available, the task performance of DNNs in these domains predictably scales with the size of the DNNs. However, in data-scarce scenarios, large DNNs overfit to the training dataset, resulting in inferior performance. Additionally, in scenarios where enormous amounts of data are available, large DNNs incur large inference latencies and memory costs. Thus, while imperative for achieving state-of-the-art performance, large DNNs require large amounts of data for training and large computational resources during inference.
These two problems could be mitigated by sparsely training large DNNs. Imposing sparsity constraints during training limits the capacity of the model to overfit to the training set while still being able to obtain good generalization. Sparse DNNs have most of their weights close to zero after training. Therefore, most of the weights could be removed resulting in smaller inference costs. To effectively train sparse DNNs, this thesis proposes two new sparse stochastic regularization techniques called Bridgeout and Sparseout. Furthermore, Bridgeout is used to prune convolutional neural networks for low-cost inference.
Bridgeout randomly perturbs the weights of a parametric model such as a DNN. It is theoretically shown that Bridgeout constrains the weights of linear models to a sparse subspace. Empirically, Bridgeout has been shown to perform better on image classification tasks than state-of-the-art DNNs when data is limited.
Sparseout is the activation counterpart of Bridgeout, operating on the outputs of the neurons instead of their weights. Theoretically, Sparseout has been shown to be a generalization of the commonly used Dropout regularization method. Empirical evidence suggests that Sparseout is capable of controlling the level of activation sparsity in neural networks. This flexibility allows Sparseout to perform better than Dropout on image classification and language modeling tasks. Furthermore, using Sparseout, it is found that activation sparsity is beneficial to recurrent neural networks for language modeling but densification of activations favors convolutional neural networks for image classification.
To address the problem of high computational cost during inference, this thesis evaluates Bridgeout for pruning convolutional neural networks (CNN). It is shown that recent CNN architectures such as VGG, ResNet and Wide-ResNet trained with Bridgeout are more robust to one-shot filter pruning compared to non-sparse stochastic regularization
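
One-shot filter pruning, as mentioned in the abstract above, can be sketched in a few lines. The abstract does not state the pruning criterion, so ranking filters by the L1 norm of their weights is an assumption here, as are the function name and the `keep_fraction` parameter.

```python
import numpy as np

def prune_filters_one_shot(conv_weight, keep_fraction):
    """Keep the filters with the largest L1 norm in a single pass.

    conv_weight has shape (out_channels, in_channels, k, k).
    Returns the pruned weight tensor and the indices of the kept filters.
    """
    n_filters = conv_weight.shape[0]
    n_keep = max(1, int(round(n_filters * keep_fraction)))
    l1 = np.abs(conv_weight).reshape(n_filters, -1).sum(axis=1)
    keep = np.sort(np.argsort(l1)[-n_keep:])   # largest norms, original order
    return conv_weight[keep], keep

rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 3, 3, 3))
weight[[1, 4, 6]] *= 1e-3                      # three near-zero (sparse) filters
pruned, kept = prune_filters_one_shot(weight, keep_fraction=0.5)
```

In a real network, removing a filter also removes the matching input channel of the following layer. A model whose filters are already close to zero, as sparse training encourages, should lose less accuracy from this step.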
An efficient semi-sigmoidal non-linear activation function approach for deep neural networks
A non-linear activation function is one of the key contributing factors to the success
of Deep Learning (DL). Since the revival of DL in 2012, the Rectified Linear
Unit (ReLU) has been regarded as the de facto standard for many DL models by the
community. Despite its popularity, however, ReLU has several shortcomings that
could result in inefficient learning of the DL models. These shortcomings are: 1) the
inherent negative cancellation property of ReLU removes all negative inputs
and causes massive information loss in the network; 2) the derivative of ReLU
can cause the dead neuron problem in the network; 3) the
mean activation generated by ReLU is highly positive and leads to a bias shift effect in
the network layers; 4) the inherent multilinear structure of ReLU restricts the nonlinear
capability of the networks; 5) the predefined nature of ReLU limits the flexibility
of the networks. To address these shortcomings, this study proposes a new family of
activation functions based on the Semi-sigmoidal (Sig) approach. Based on this
approach, three variants of activation functions are introduced, namely, Shifted
Semi-sigmoidal (SSig), Adaptive Shifted Semi-sigmoidal (ASSig), and Bi-directional
Adaptive Shifted Semi-sigmoidal (BiASSig). The proposed activation functions were
tested against the ReLU (baseline) and state-of-the-art methods using eight Deep
Neural Networks (DNNs) on seven benchmark image datasets. Further, Adaptive
Moment Estimation (ADAM) and Stochastic Gradient Descent (SGD) were selected
as optimizers to train the DNNs. The baseline comparison score and mean rank were
used to consolidate and analyse the experimental results effectively. The experimental
results in terms of the overall baseline comparison score show that SSig, ASSig, and
BiASSig obtained scores of 79, 87, and 86 out of 112, respectively, outperforming
ReLU in more than 70% of the cases. In terms of overall
mean rank (OMR), ReLU ranked tenth (10th), whereas SSig, ASSig, and BiASSig
ranked fifth (5th), first (1st), and second (2nd), respectively, outperforming
ReLU and the other compared methods.
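
The first three ReLU shortcomings listed in this abstract can be checked numerically. The sketch below uses only the standard definition of ReLU; the semi-sigmoidal functions themselves are not defined in the abstract and are not reproduced. The zero-mean Gaussian input and all variable names are assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
pre_activations = rng.normal(loc=0.0, scale=1.0, size=10_000)
out = relu(pre_activations)
negative = pre_activations < 0

# 1) Negative cancellation: every negative input is mapped to zero.
all_negatives_zeroed = bool(np.all(out[negative] == 0.0))

# 2) Dead neurons: negative inputs receive no gradient, so a unit whose
#    inputs stay negative stops updating.
no_gradient_for_negatives = bool(
    np.all(relu_grad(pre_activations[negative]) == 0.0)
)

# 3) Bias shift: the mean output is pushed above the mean input.
mean_shift = out.mean() - pre_activations.mean()
```

For a standard normal input the mean ReLU output is about 0.4 in expectation, even though the input mean is zero.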