Semi-tied Units for Efficient Gating in LSTM and Highway Networks
Gating is a key technique used for integrating information from multiple
sources by long short-term memory (LSTM) models and has recently also been
applied to other models such as the highway network. Although gating is
powerful, it is rather expensive in terms of both computation and storage as
each gating unit uses a separate full weight matrix. This issue can be severe
since several gates can be used together in e.g. an LSTM cell. This paper
proposes a semi-tied unit (STU) approach to solve this efficiency issue, which
uses one shared weight matrix to replace those in all the units in the same
layer. The approach is termed "semi-tied" since extra parameters are used to
separately scale each of the shared output values. These extra scaling factors
are associated with the network activation functions and result in the use of
parameterised sigmoid, hyperbolic tangent, and rectified linear unit functions.
Speech recognition experiments using British English multi-genre broadcast data
showed that using STUs can reduce the calculation and storage cost by a factor
of three for highway networks and four for LSTMs, while giving similar word
error rates to the original models.
Comment: To appear in Proc. INTERSPEECH 2018, September 2-6, 2018, Hyderabad,
India
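The semi-tied idea in this abstract can be sketched roughly as follows: one shared linear projection is computed once, and each gate then applies its own per-unit scaling factors inside a parameterised sigmoid. The function names, the toy dimensions, and the exact form of the parameterised sigmoid (a scale applied to the sigmoid input) are assumptions made for illustration and are not taken from the paper.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def scaled_sigmoid(z, scale):
    """Sigmoid with a learnable input scale (assumed parameterisation)."""
    return 1.0 / (1.0 + math.exp(-scale * z))

def semi_tied_gates(W_shared, x, gate_scales):
    """Compute all gates from ONE shared projection.

    gate_scales maps a gate name to a per-unit list of scaling factors.
    """
    z = matvec(W_shared, x)  # shared projection, computed once
    return {name: [scaled_sigmoid(zi, s) for zi, s in zip(z, scales)]
            for name, scales in gate_scales.items()}

def weight_count(n_out, n_in, n_units, tied):
    """Count weight and scale parameters for n_units gating units."""
    if tied:
        # one shared matrix plus one scale vector per unit
        return n_out * n_in + n_units * n_out
    # one full matrix per unit
    return n_units * n_out * n_in

# Toy values, chosen only for illustration.
W = [[0.5, -0.2], [0.1, 0.3]]
x = [1.0, 2.0]
gates = semi_tied_gates(W, x, {"input": [1.0, 1.0], "forget": [2.0, 0.5]})
```

Under these assumptions, four units sharing one 512 x 512 matrix need 264,192 parameters instead of 1,048,576, a reduction of close to a factor of four, which is in line with the LSTM figure quoted in the abstract.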
Speaker adaptation and adaptive training for jointly optimised tandem systems
Speaker independent (SI) Tandem systems trained by joint optimisation
of bottleneck (BN) deep neural networks (DNNs) and
Gaussian mixture models (GMMs) have been found to produce
similar word error rates (WERs) to Hybrid DNN systems. A
key advantage of using GMMs is that existing speaker adaptation
methods, such as maximum likelihood linear regression
(MLLR), can be used to account for diverse speaker
variations and improve system robustness. This paper investigates
speaker adaptation and adaptive training (SAT) schemes
for jointly optimised Tandem systems. Adaptation techniques
investigated include constrained MLLR (CMLLR) transforms
based on BN features for SAT as well as MLLR and parameterised
sigmoid functions for unsupervised test-time adaptation.
Experiments using English multi-genre broadcast (MGB3) data
show that CMLLR SAT yields a 4% relative WER reduction
over jointly trained Tandem and Hybrid SI systems, and further
reductions in WER are obtained by system combination.
Learning to Adapt: a Meta-learning Approach for Speaker Adaptation
The performance of automatic speech recognition systems can be improved by
adapting an acoustic model to compensate for the mismatch between training and
testing conditions, for example by adapting to unseen speakers. The success of
speaker adaptation methods relies on selecting weights that are suitable for
adaptation and using good adaptation schedules to update these weights in order
not to overfit to the adaptation data. In this paper we investigate a
principled way of adapting all the weights of the acoustic model using
meta-learning. We show that the meta-learner can learn to perform supervised
and unsupervised speaker adaptation and that it outperforms a strong baseline
adapting LHUC parameters when adapting a DNN AM with 1.5M parameters. We also
report initial experiments on adapting TDNN AMs, where the meta-learner
achieves comparable performance with LHUC.
Comment: Interspeech 201
Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation
This work presents a broad study on the adaptation of neural network acoustic
models by means of learning hidden unit contributions (LHUC) -- a method that
linearly re-combines hidden units in a speaker- or environment-dependent manner
using small amounts of unsupervised adaptation data. We also extend LHUC to a
speaker adaptive training (SAT) framework that leads to a more adaptable DNN
acoustic model, working both in a speaker-dependent and a speaker-independent
manner, without the requirements to maintain auxiliary speaker-dependent
feature extractors or to introduce significant speaker-dependent changes to the
DNN structure. Through a series of experiments on four different speech
recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4)
comprising 270 test speakers, we show that LHUC in both its test-only and SAT
variants results in consistent word error rate reductions ranging from 5% to
23% relative depending on the task and the degree of mismatch between training
and test data. In addition, we have investigated the effect of the amount of
adaptation data per speaker, the quality of unsupervised adaptation targets,
the complementarity to other adaptation techniques, one-shot adaptation, and an
extension to adapting DNNs trained in a sequence discriminative manner.
Comment: 14 pages, 9 Tables, 11 Figures in IEEE/ACM Transactions on Audio,
Speech and Language Processing, Vol. 24, Num. 8, 201
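A minimal sketch of the re-combination described in this abstract: each hidden unit output is multiplied by a speaker-dependent amplitude, and only those amplitudes are updated during adaptation. The re-parameterisation of the amplitude as 2 * sigmoid(r), which keeps it in the range (0, 2), is an assumption not stated in the abstract, and the function names and values are hypothetical.

```python
import math

def lhuc_amplitude(r):
    """Map an unconstrained parameter r to an amplitude in (0, 2).

    The 2 * sigmoid(r) form is an assumed re-parameterisation.
    """
    return 2.0 / (1.0 + math.exp(-r))

def apply_lhuc(hidden, speaker_params):
    """Rescale each hidden unit output by its speaker-dependent amplitude."""
    return [lhuc_amplitude(r) * h for h, r in zip(hidden, speaker_params)]

# Toy hidden activations for one layer and one frame.
hidden = [0.2, 0.9, 0.5]

# r = 0 gives amplitude 1, so the speaker-independent model is unchanged.
unadapted = apply_lhuc(hidden, [0.0, 0.0, 0.0])

# Non-zero parameters amplify or attenuate individual units.
adapted = apply_lhuc(hidden, [1.0, -1.0, 0.0])
```

Initialising every parameter to zero leaves the speaker-independent model untouched, so adaptation starts from the unadapted behaviour and the number of speaker-dependent parameters equals the number of hidden units.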
How does the brain extract acoustic patterns? A behavioural and neural study
In complex auditory scenes the brain exploits statistical regularities to group sound elements into streams. Previous studies using tones that transition from being randomly drawn to regularly repeating have highlighted a network of brain regions involved in this process of regularity detection, including auditory cortex (AC) and hippocampus (HPC; Barascud et al., 2016). In this thesis, I seek to understand how the neurons within AC and HPC detect and maintain a representation of deterministic acoustic regularity.
I trained ferrets (n = 6) on a GO/NO-GO task to detect the transition from a random sequence of tones to a repeating pattern of tones, with increasing pattern lengths (3, 5 and 7). All animals performed significantly above chance, with longer reaction times and declining performance as the pattern length increased. During performance of the behavioural task, or passive listening, I recorded from primary and secondary fields of AC with multi-electrode arrays (behaving: n = 3), or AC and HPC using Neuropixels probes (behaving: n = 1; passive: n = 1).
In the local field potential, I identified no differences in the evoked response between presentations of random and regular sequences. Instead, I observed significant increases in oscillatory power at the rate of the repeating pattern, and decreases at the tone presentation rate, during regularity. Neurons in AC, across the population, showed higher firing with more repetitions of the pattern and for shorter pattern lengths. Single units within AC showed higher precision in their firing when responding to their best frequency during regularity. Neurons in AC and HPC both entrained to the pattern rate during presentation of the regular sequence when compared to the random sequence. Lastly, development of an optogenetic approach to inactivate AC in the ferret paves the way for future work to probe the causal involvement of these brain regions.
Discriminative and adaptive training for robust speech recognition and understanding
Robust automatic speech recognition (ASR) and understanding (ASU) under various conditions remains a challenging problem even with the advances in deep learning. To achieve robust ASU, two discriminative training objectives are proposed for keyword spotting and topic classification: (1) To accurately recognize the semantically important keywords, non-uniform error cost minimum classification error training of deep neural network (DNN) and bi-directional long short-term memory (BLSTM) acoustic models is proposed to minimize the recognition errors of only the keywords. (2) To compensate for the mismatched objectives of speech recognition and understanding, minimum semantic error cost training of the BLSTM acoustic model is proposed to generate semantically accurate lattices for topic classification. Further, to expand the application of the ASU system to various conditions, four adaptive training approaches are proposed to improve the robustness of the ASR under different conditions: (1) To suppress the effect of inter-speaker variability on the speaker-independent DNN acoustic model, speaker-invariant training is proposed to learn a deep representation in the DNN that is both senone-discriminative and speaker-invariant through adversarial multi-task training. (2) To achieve condition-robust unsupervised adaptation with parallel data, adversarial teacher-student learning is proposed to suppress multiple factors of condition variability in the procedure of knowledge transfer from a well-trained source domain LSTM acoustic model to the target domain. (3) To further improve the adversarial learning for unsupervised adaptation with non-parallel data, domain separation networks are used to enhance the domain-invariance of the senone-discriminative deep representation by explicitly modeling the private component that is unique to each domain. (4) To achieve robust far-field ASR, an LSTM adaptive beamforming network is proposed to estimate the real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions.
Ph.D.
Learning representations for speech recognition using artificial neural networks
Learning representations is a central challenge in machine learning. For speech
recognition, we are interested in learning robust representations that are stable
across different acoustic environments, recording equipment and irrelevant inter-
and intra-speaker variabilities. This thesis is concerned with representation
learning for acoustic model adaptation to speakers and environments, construction
of acoustic models in low-resource settings, and learning representations from
multiple acoustic channels. The investigations are primarily focused on the hybrid
approach to acoustic modelling based on hidden Markov models and artificial
neural networks (ANN).
The first contribution concerns acoustic model adaptation. This comprises
two new adaptation transforms operating in ANN parameter space. Both operate
at the level of activation functions and treat a trained ANN acoustic model as
a canonical set of fixed-basis functions, from which one can later derive variants
tailored to the specific distribution present in adaptation data. The first technique,
termed Learning Hidden Unit Contributions (LHUC), depends on learning
distribution-dependent linear combination coefficients for hidden units. This
technique is then extended to altering groups of hidden units with parametric and
differentiable pooling operators. We found the proposed adaptation techniques
possess many desirable properties: they are relatively low-dimensional, do not overfit
and can work in both a supervised and an unsupervised manner. For LHUC we
also present extensions to speaker adaptive training and environment factorisation.
On average, depending on the characteristics of the test set, 5-25% relative
word error rate (WERR) reductions are obtained in an unsupervised two-pass
adaptation setting.
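For readers unfamiliar with the metric, the relative reductions quoted here compare the adapted word error rate against the baseline as a fraction of the baseline. The helper below is a minimal sketch of that arithmetic; the function name and the example figures are made up for illustration and are not results from the thesis.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer

# Hypothetical figures: a drop from 20.0% to 17.0% WER is 3 points absolute
# and 15% relative.
example = relative_wer_reduction(20.0, 17.0)
```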
The second contribution concerns building acoustic models in low-resource
data scenarios. In particular, we are concerned with insufficient amounts of
transcribed acoustic material for estimating acoustic models in the target language
– thus assuming resources like lexicons or texts to estimate language models
are available. First, we propose an ANN with a structured output layer
which models both context-dependent and context-independent speech units,
with the context-independent predictions used at runtime to aid the prediction
of context-dependent states. We also propose to perform multi-task adaptation
with a structured output layer. We obtain consistent WERR reductions up to
6.4% in low-resource speaker-independent acoustic modelling. Adapting those
models in a multi-task manner with LHUC decreases WERRs by an additional
13.6%, compared to 12.7% for non multi-task LHUC. We then demonstrate that
one can build better acoustic models with unsupervised multi- and cross-lingual
initialisation and find that pre-training is largely language-independent. Up to
14.4% WERR reductions are observed, depending on the amount of the available
transcribed acoustic data in the target language.
The third contribution concerns building acoustic models from multi-channel
acoustic data. For this purpose we investigate various ways of integrating and
learning multi-channel representations. In particular, we investigate channel concatenation
and the applicability of convolutional layers for this purpose. We
propose a multi-channel convolutional layer with cross-channel pooling, which
can be seen as a data-driven non-parametric auditory attention mechanism. We
find that for unconstrained microphone arrays, our approach is able to match the
performance of the comparable models trained on beamform-enhanced signals
- …
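One way to picture the cross-channel pooling mentioned in the third contribution is sketched below: every microphone channel is filtered with the same kernel, and the maximum response across channels is kept at each position. Sharing the kernel across channels, the one-dimensional toy signals, and all names are assumptions made for illustration; the abstract does not specify these details.

```python
def conv1d_valid(signal, kernel):
    """1-D 'valid' cross-correlation of a signal with a kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def cross_channel_max_pool(channels, kernel):
    """Filter each channel with the same kernel, then keep the
    per-position maximum across channels."""
    feature_maps = [conv1d_valid(ch, kernel) for ch in channels]
    return [max(values) for values in zip(*feature_maps)]

# Two toy "microphone" channels (hypothetical values).
channels = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
]
pooled = cross_channel_max_pool(channels, [1.0, 1.0])
```

Because the maximum is taken position by position, the layer can follow whichever channel responds most strongly to the filter at a given point, which is one reading of the attention-like behaviour described above.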