Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature. Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures.
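The overview groups adaptation methods into embedding-based, model-parameter, and data-augmentation families. The simplest member of the embedding family conditions the acoustic model by appending a fixed per-speaker embedding (such as an i-vector or x-vector) to every input frame. A minimal numpy sketch of that idea, with hypothetical dimensions and function names:

```python
import numpy as np

def append_speaker_embedding(frames, embedding):
    """Concatenate a fixed per-speaker embedding to every acoustic frame.

    frames:    (T, D) matrix of acoustic features for one utterance
    embedding: (E,) speaker embedding, e.g. an i-vector or x-vector
    returns:   (T, D + E) speaker-conditioned input for the acoustic model
    """
    T = frames.shape[0]
    tiled = np.tile(embedding, (T, 1))   # repeat the embedding for each frame
    return np.concatenate([frames, tiled], axis=1)

frames = np.random.randn(100, 40)        # 100 frames of 40-dim features
ivector = np.random.randn(10)            # hypothetical 10-dim speaker embedding
adapted = append_speaker_embedding(frames, ivector)
```

Because the embedding is constant across an utterance, the network can learn a speaker-dependent bias in its first layer without any per-speaker parameter updates.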
Discriminative and adaptive training for robust speech recognition and understanding
Robust automatic speech recognition (ASR) and understanding (ASU) under various conditions remains a challenging problem even with the advances of deep learning. To achieve robust ASU, two discriminative training objectives are proposed for keyword spotting and topic classification: (1) To accurately recognize the semantically important keywords, non-uniform error cost minimum classification error training of deep neural network (DNN) and bi-directional long short-term memory (BLSTM) acoustic models is proposed to minimize the recognition errors of only the keywords. (2) To compensate for the mismatched objectives of speech recognition and understanding, minimum semantic error cost training of the BLSTM acoustic model is proposed to generate semantically accurate lattices for topic classification.
Further, to expand the application of the ASU system to various conditions, four adaptive training approaches are proposed to improve the robustness of the ASR under different conditions: (1) To suppress the effect of inter-speaker variability on the speaker-independent DNN acoustic model, speaker-invariant training is proposed to learn a deep representation in the DNN that is both senone-discriminative and speaker-invariant through adversarial multi-task training. (2) To achieve condition-robust unsupervised adaptation with parallel data, adversarial teacher-student learning is proposed to suppress multiple factors of condition variability during knowledge transfer from a well-trained source-domain LSTM acoustic model to the target domain. (3) To further improve adversarial learning for unsupervised adaptation with non-parallel data, domain separation networks are used to enhance the domain-invariance of the senone-discriminative deep representation by explicitly modeling the private component that is unique to each domain.
(4) To achieve robust far-field ASR, an LSTM adaptive beamforming network is proposed to estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions. Ph.D. thesis.
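The speaker-invariant training described in (1) trains a shared feature extractor with two heads: the senone classifier's gradient is followed, while the speaker classifier's gradient is reversed. Loosely, one update on the shared parameters can be sketched as below; the function name, learning rate, and reversal weight are illustrative, not the thesis's actual configuration:

```python
import numpy as np

def adversarial_update(theta_feat, grad_senone, grad_speaker, lr=0.1, lam=0.5):
    """One speaker-invariant training step on the shared feature extractor.

    The senone gradient is descended while the speaker gradient is
    reversed (ascended), so the learned representation stays
    senone-discriminative but sheds speaker information.
    """
    return theta_feat - lr * (grad_senone - lam * grad_speaker)

theta = np.array([1.0, -0.5])
updated = adversarial_update(theta,
                             grad_senone=np.array([0.4, 0.2]),
                             grad_speaker=np.array([0.2, -0.2]))
```

In a full implementation this sign flip is usually realised with a gradient reversal layer inserted between the shared encoder and the speaker classifier, so a single backward pass produces both gradient terms.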
Learning to adapt: meta-learning approaches for speaker adaptation
The performance of automatic speech recognition systems degrades rapidly when there
is a mismatch between training and testing conditions. One way to compensate for this
mismatch is to adapt an acoustic model to test conditions, for example by performing
speaker adaptation. In this thesis we focus on the discriminative model-based speaker
adaptation approach. The success of this approach relies on having a robust speaker
adaptation procedure – we need to specify which parameters should be adapted and
how they should be adapted. Unfortunately, tuning the speaker adaptation procedure
requires considerable manual effort.
In this thesis we propose to formulate speaker adaptation as a meta-learning task. In
meta-learning, learning occurs on two levels: a learner learns a task specific model and
a meta-learner learns how to train these task specific models. In our case, the learner is
a speaker-dependent model and the meta-learner learns to adapt a speaker-independent model into the speaker-dependent model. With this formulation, we can automatically learn robust speaker adaptation procedures using gradient descent. In the experiments, we demonstrate that the meta-learning approach learns competitive adaptation
schedules compared to adaptation procedures with handcrafted hyperparameters.
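The two-level learning the thesis describes can be sketched on a toy scalar model: an inner loop (the learner) takes one gradient step per "speaker", and an outer loop (the meta-learner) updates the shared initialisation so that one adaptation step works well for every speaker. This is a first-order approximation of the meta-gradient, with made-up data standing in for speakers:

```python
import numpy as np

def inner_adapt(theta, speaker_data, lr):
    """Learner: one gradient step adapting the shared model to one speaker."""
    x, y = speaker_data
    return theta - lr * 2 * x * (theta * x - y)    # d/dtheta of (theta*x - y)^2

def meta_step(theta, speakers, inner_lr=0.1, outer_lr=0.1):
    """Meta-learner: move theta so a single adaptation step succeeds for
    every speaker (first-order approximation of the meta-gradient)."""
    meta_grad = 0.0
    for x, y in speakers:
        adapted = inner_adapt(theta, (x, y), inner_lr)
        meta_grad += 2 * x * (adapted * x - y)     # post-adaptation loss gradient
    return theta - outer_lr * meta_grad / len(speakers)

# Two toy "speakers", both consistent with the target parameter theta = 2
speakers = [(1.0, 2.0), (2.0, 4.0)]
theta = 0.0
for _ in range(100):
    theta = meta_step(theta, speakers)
```

The outer loop never optimises the per-speaker loss directly; it optimises the loss *after* adaptation, which is exactly what makes the learned initialisation suited to test-time speaker adaptation.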
Subsequently, we show that speaker adaptive training can be formulated as a meta-learning task as well. In contrast to the traditional approach, which maintains and optimises a copy of speaker-dependent parameters for each speaker during training, we
embed the gradient-based adaptation directly into the training of the acoustic model.
We hypothesise that this formulation should steer the training of the acoustic model
into finding parameters better suited for test-time speaker adaptation. We experimentally compare our approach with test-only adaptation of a standard baseline model and
with SAT-LHUC, which represents a traditional speaker adaptive training method. We
show that the meta-learning speaker-adaptive training approach achieves comparable
results with SAT-LHUC. However, neither the meta-learning approach nor SAT-LHUC
outperforms the baseline approach after adaptation.
Consequently, we run a series of experimental ablations to determine why SAT-LHUC does not yield any improvements compared to the baseline approach. In these experiments we explore multiple factors, such as the neural network architecture, normalisation technique, activation function, and optimiser. We find that
SAT-LHUC interferes with batch normalisation, and that it benefits from an increased
hidden layer width and an increased model size. However, the baseline model benefits from increased capacity too; therefore, to obtain the best model it is still favourable to train a speaker-independent model with batch normalisation. As such, an
effective way of training state-of-the-art SAT-LHUC models remains an open question.
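LHUC, the adaptation method compared against above, re-parameterises each hidden layer with a small per-speaker amplitude vector: every hidden unit's output is scaled by a(r) = 2·sigmoid(r), and only r is updated for a new speaker. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def lhuc_layer(hidden, r):
    """Learning Hidden Unit Contributions: rescale each hidden unit by a
    speaker-dependent amplitude a(r) = 2*sigmoid(r), bounded in (0, 2).
    The small vector r is the only parameter adapted per speaker."""
    amplitude = 2.0 / (1.0 + np.exp(-r))
    return amplitude * hidden

hidden = np.array([0.5, -1.0, 2.0])
unadapted = lhuc_layer(hidden, np.zeros(3))   # r = 0 gives amplitude 1 (identity)
```

Because r has only as many entries as the layer has units, adaptation touches a tiny fraction of the model's parameters, which is what makes it usable with minutes of speaker data.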
Finally, we show that the performance of unsupervised speaker adaptation can be
further improved by using discriminative adaptation with lattices obtained from a first-pass decoding as supervision, instead of the traditionally used one-best path transcriptions. We find that this proposed approach enables many more parameters to be adapted without observed overfitting, and is successful even when the initial transcription has a WER in excess of 50%.
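Loosely, the advantage of lattice supervision is that the adaptation target becomes a posterior-weighted mixture of competing hypotheses rather than a single, possibly wrong, one-best string. The toy sketch below illustrates only that weighting idea; real systems operate on word lattices with discriminative sequence criteria, and the flat per-word representation here is entirely hypothetical:

```python
def expected_supervision(hypotheses):
    """Combine first-pass hypotheses weighted by their posteriors, instead
    of committing to the single one-best transcription.

    hypotheses: list of (word_counts, posterior) pairs, where word_counts
    is a hypothetical per-word count dict standing in for a supervision
    signal derived from one lattice path.
    """
    combined = {}
    for counts, posterior in hypotheses:
        for word, c in counts.items():
            combined[word] = combined.get(word, 0.0) + posterior * c
    return combined

hyps = [({"hello": 1, "word": 1}, 0.6),    # one-best path, contains an error
        ({"hello": 1, "world": 1}, 0.4)]   # competing path with the right word
soft_targets = expected_supervision(hyps)
```

A wrong one-best word ("word") no longer dominates the supervision; the correct competitor retains weight proportional to its posterior, which is why adaptation tolerates very high first-pass WER.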
Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems
Speaker adaptation techniques provide a powerful solution to customise
automatic speech recognition (ASR) systems for individual users. Practical
application of unsupervised model-based speaker adaptation techniques to data
intensive end-to-end ASR systems is hindered by the scarcity of speaker-level
data and performance sensitivity to transcription errors. To address these
issues, a set of compact and data efficient speaker-dependent (SD) parameter
representations are used to facilitate both speaker adaptive training and
test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR
systems. The sensitivity to supervision quality is reduced using a confidence
score-based selection of the less erroneous subset of speaker-level adaptation
data. Two lightweight confidence score estimation modules are proposed to
produce more reliable confidence scores. The data sparsity issue, which is
exacerbated by data selection, is addressed by modelling the SD parameter
uncertainty using Bayesian learning. Experiments on the benchmark 300-hour
Switchboard and the 233-hour AMI datasets suggest that the proposed confidence
score-based adaptation schemes consistently outperformed the baseline
speaker-independent (SI) Conformer model and conventional non-Bayesian, point
estimate-based adaptation using no speaker data selection. Similar consistent
performance improvements were retained after external Transformer and LSTM
language model rescoring. In particular, on the 300-hour Switchboard corpus,
statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute
(9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer
on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER
reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also
obtained on the AMI development and evaluation sets. Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing.
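The confidence-based selection this abstract describes amounts to filtering the first-pass hypotheses before they are used as adaptation supervision. A minimal sketch of that filtering step, with a hypothetical threshold and data layout (the Bayesian treatment of the SD parameters is a separate component not shown here):

```python
def select_adaptation_data(hypotheses, threshold=0.8):
    """Keep only first-pass hypotheses whose confidence score clears the
    threshold, so unsupervised adaptation is supervised by the less
    erroneous subset of speaker-level data."""
    return [text for text, conf in hypotheses if conf >= threshold]

hyps = [("hello world", 0.95),
        ("uh hmm", 0.40),        # low-confidence, likely mis-recognised
        ("good morning", 0.85)]
selected = select_adaptation_data(hyps)
```

Selection shrinks an already scarce pool of speaker data, which is exactly the sparsity the abstract's Bayesian modelling of the speaker-dependent parameters is meant to counteract.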
Intermodal Interaction in Deep Neural Networks for Audio-Visual Event Classification and Localization
The automatic understanding of the surrounding world has a wide range of applications, such as surveillance and security, human-computer interaction, robotics, health care, etc. More precisely, this understanding can be expressed through different tasks such as event classification and localization in space. Living beings exploit as much of the available information as possible to understand what surrounds them. Inspired by their behaviour, artificial neural networks should likewise jointly use several modalities, for example vision and hearing.
First, classification and localization models based on audio-visual information must be evaluated objectively. We therefore recorded a new dataset to complement those currently available. As no audio-visual classification-and-localization model exists, only the audio part of the dataset is evaluated, with a model from the literature.
Second, we focus on the core of the thesis: how to jointly use visual and audio information to solve a specific task, event recognition. The brain does not perform a "simple" fusion but comprises multiple interactions between the two modalities; there is a strong coupling between the processing of visual and audio information. Neural networks offer the possibility of creating interactions between the modalities in addition to fusion. In this thesis, we explore several strategies for fusing the visual and audio modalities and for creating interactions between them. These techniques achieved the best performance compared with state-of-the-art architectures at the time of publication. They show the usefulness of audio-visual fusion, but above all the importance of the interactions between modalities.
To conclude the thesis, we propose a reference network for audio-visual event classification and localization. This network was tested on the new dataset. Previous classification models are modified to take spatial localization into account in addition to classification.
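The distinction the thesis draws between plain fusion and cross-modal interaction can be illustrated with a toy embedding-level example: concatenation keeps the modalities independent, whereas adding an element-wise product term lets one modality gate the other. The function and mode names below are illustrative, not the thesis's architectures:

```python
import numpy as np

def fuse(audio, visual, mode="concat"):
    """Fuse audio and visual embeddings of equal dimension.

    "concat"   : plain fusion; the modalities remain independent
    "interact" : adds an element-wise product term so each modality can
                 gate the other, a minimal form of cross-modal interaction
    """
    if mode == "concat":
        return np.concatenate([audio, visual])
    if mode == "interact":
        return np.concatenate([audio, visual, audio * visual])
    raise ValueError(f"unknown mode: {mode}")

a = np.array([0.2, 0.8, 0.1, 0.5])   # toy audio embedding
v = np.array([1.0, 0.0, 0.3, 0.4])   # toy visual embedding
fused = fuse(a, v, mode="interact")
```

In the product term, a unit that is silent in one modality (e.g. v[1] = 0) zeroes the corresponding joint feature, which is the kind of coupling a downstream classifier cannot recover from concatenation alone.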