Towards age-independent acoustic modeling
In automatic speech recognition applications, due to significant differences in voice characteristics, adults and children are usually treated as two population groups, for which different acoustic models are trained. In this paper, age-independent acoustic modeling is investigated in the context of large vocabulary speech recognition. Exploiting a small amount (9 hours) of children's speech and a more significant amount (57 hours) of adult speech, age-independent acoustic models are trained using several methods for speaker adaptive acoustic modeling. Recognition results achieved using these models are compared with those achieved using age-dependent acoustic models for children and adults, respectively. Recognition experiments are performed on four Italian speech corpora, two consisting of children's speech and two of adult speech, using 64k word and 11k word trigram language models. Methods for speaker adaptive acoustic modeling prove to be effective for training age-independent acoustic models ensuring recognition results at least as good as those achieved with age-dependent acoustic models for adults and children.
Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation
This work presents a broad study on the adaptation of neural network acoustic
models by means of learning hidden unit contributions (LHUC) -- a method that
linearly re-combines hidden units in a speaker- or environment-dependent manner
using small amounts of unsupervised adaptation data. We also extend LHUC to a
speaker adaptive training (SAT) framework that leads to a more adaptable DNN
acoustic model, working both in a speaker-dependent and a speaker-independent
manner, without the requirements to maintain auxiliary speaker-dependent
feature extractors or to introduce significant speaker-dependent changes to the
DNN structure. Through a series of experiments on four different speech
recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4)
comprising 270 test speakers, we show that LHUC in both its test-only and SAT
variants results in consistent word error rate reductions ranging from 5% to
23% relative depending on the task and the degree of mismatch between training
and test data. In addition, we have investigated the effect of the amount of
adaptation data per speaker, the quality of unsupervised adaptation targets,
the complementarity to other adaptation techniques, one-shot adaptation, and an
extension to adapting DNNs trained in a sequence discriminative manner.
Comment: 14 pages, 9 Tables, 11 Figures in IEEE/ACM Transactions on Audio,
Speech and Language Processing, Vol. 24, Num. 8, 201
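The linear re-combination of hidden units described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the commonly used LHUC parameterisation in which each hidden unit output is scaled by an amplitude 2 * sigmoid(r) in the range (0, 2), where r is a per-speaker parameter. The function names `lhuc_amplitude` and `apply_lhuc` are hypothetical.

```python
import math


def lhuc_amplitude(r):
    """Map an unconstrained speaker parameter r to an amplitude in (0, 2)."""
    return 2.0 / (1.0 + math.exp(-r))


def apply_lhuc(hidden, speaker_params):
    """Rescale each hidden unit output by its speaker-dependent amplitude."""
    if len(hidden) != len(speaker_params):
        raise ValueError("one LHUC parameter is needed per hidden unit")
    return [lhuc_amplitude(r) * h for h, r in zip(hidden, speaker_params)]


# Hypothetical hidden layer outputs for one frame.
hidden = [0.5, -1.0, 2.0]

# With all parameters at zero the amplitude is 1, so the layer is unchanged.
unadapted = apply_lhuc(hidden, [0.0, 0.0, 0.0])

# Non-zero parameters damp or boost individual units for a given speaker.
adapted = apply_lhuc(hidden, [-2.0, 0.0, 2.0])
```

Initialising every parameter to zero leaves the speaker-independent model untouched, so adaptation under this assumption starts from the unadapted network and only moves away from it as the per-speaker parameters are updated.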
Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems
Speaker adaptation techniques provide a powerful solution to customise
automatic speech recognition (ASR) systems for individual users. Practical
application of unsupervised model-based speaker adaptation techniques to data
intensive end-to-end ASR systems is hindered by the scarcity of speaker-level
data and performance sensitivity to transcription errors. To address these
issues, a set of compact and data efficient speaker-dependent (SD) parameter
representations are used to facilitate both speaker adaptive training and
test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR
systems. The sensitivity to supervision quality is reduced using a confidence
score-based selection of the less erroneous subset of speaker-level adaptation
data. Two lightweight confidence score estimation modules are proposed to
produce more reliable confidence scores. The data sparsity issue, which is
exacerbated by data selection, is addressed by modelling the SD parameter
uncertainty using Bayesian learning. Experiments on the benchmark 300-hour
Switchboard and the 233-hour AMI datasets suggest that the proposed confidence
score-based adaptation schemes consistently outperformed the baseline
speaker-independent (SI) Conformer model and conventional non-Bayesian, point
estimate-based adaptation using no speaker data selection. Similar consistent
performance improvements were retained after external Transformer and LSTM
language model rescoring. In particular, on the 300-hour Switchboard corpus,
statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute
(9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer
on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER
reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also
obtained on the AMI development and evaluation sets.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
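The confidence score-based data selection mentioned in the abstract above can be illustrated with a small sketch. It only shows the filtering idea. The threshold of 0.8, the `(hypothesis, confidence)` tuple layout, and the function name `select_confident` are assumptions made for illustration, and the paper's confidence estimation modules and Bayesian learning are not modelled here.

```python
def select_confident(utterances, threshold=0.8):
    """Keep (hypothesis, confidence) pairs whose confidence meets the threshold."""
    return [(text, conf) for text, conf in utterances if conf >= threshold]


# Hypothetical first-pass hypotheses for one speaker, with confidence scores.
hyps = [
    ("hello world", 0.95),
    ("uh the cat", 0.40),
    ("good morning", 0.85),
]

# Only the more reliable hypotheses are kept as adaptation supervision.
selected = select_confident(hyps)
```

Raising the threshold gives cleaner supervision but leaves less data per speaker, which is the data sparsity trade-off the abstract refers to.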
Speaker normalisation for large vocabulary multiparty conversational speech recognition
One of the main problems faced by automatic speech recognition is the variability of
the testing conditions. This is due both to the acoustic conditions (different transmission
channels, recording devices, noises etc.) and to the variability of speech
across different speakers (i.e. due to different accents, coarticulation of phonemes
and different vocal tract characteristics). Vocal tract length normalisation (VTLN)
aims at normalising the acoustic signal, making it independent from the vocal tract
length. This is done by a speaker specific warping of the frequency axis parameterised
through a warping factor. In this thesis the application of VTLN to multiparty
conversational speech was investigated focusing on the meeting domain. This
is a challenging task showing a great variability of the speech acoustics both across
different speakers and across time for a given speaker. VTL, the distance between
the lips and the glottis, varies over time. We observed that the warping factors estimated
using Maximum Likelihood seem to be context dependent: appearing to be
influenced by the current conversational partner and being correlated with the behaviour
of formant positions and the pitch. This is because VTL also influences the
frequency of vibration of the vocal cords and thus the pitch. In this thesis we also
investigated pitch-adaptive acoustic features with the goal of further improving the
speaker normalisation provided by VTLN.
We explored the use of acoustic features obtained using a pitch-adaptive analysis
in combination with conventional features such as Mel frequency cepstral coefficients.
These spectral representations were combined both at the acoustic feature
level using heteroscedastic linear discriminant analysis (HLDA), and at the system
level using ROVER. We evaluated this approach on a challenging large vocabulary
speech recognition task: multiparty meeting transcription. We found that VTLN
benefits the most from pitch-adaptive features. Our experiments also suggested that
combining conventional and pitch-adaptive acoustic features using HLDA results in
a consistent, significant decrease in the word error rate across all the tasks. Combining
at the system level using ROVER resulted in a further significant improvement.
Further experiments compared the use of pitch adaptive spectral representation with
the adoption of a smoothed spectrogram for the extraction of cepstral coefficients.
It was found that pitch adaptive spectral analysis, providing a representation which
is less affected by pitch artefacts (especially for high pitched speakers), delivers features with an improved speaker independence. Furthermore, this has also been shown to
be advantageous when HLDA is applied. The combination of a pitch adaptive spectral
representation and VTLN based speaker normalisation in the context of LVCSR
for multiparty conversational speech led to more speaker independent acoustic models
improving the overall recognition performance.
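The speaker-specific warping of the frequency axis described in the abstract above can be sketched as follows. The abstract does not state which warping function the thesis uses, so this assumes one common choice: a piecewise-linear warp that scales frequencies by the warping factor below a knee and then interpolates so that the Nyquist frequency maps to itself. The function name, the 8 kHz Nyquist frequency, and the knee at 80% of Nyquist are all hypothetical.

```python
def warp_frequency(freq_hz, alpha, nyquist_hz=8000.0, knee_ratio=0.8):
    """Piecewise-linear frequency warp with warping factor alpha."""
    if alpha <= 0.0:
        raise ValueError("warping factor must be positive")
    if not 0.0 <= freq_hz <= nyquist_hz:
        raise ValueError("frequency must lie between 0 and Nyquist")
    knee_hz = knee_ratio * nyquist_hz
    if alpha * knee_hz >= nyquist_hz:
        raise ValueError("warping factor too large for this knee")
    if freq_hz <= knee_hz:
        # Below the knee the axis is simply stretched or compressed.
        return alpha * freq_hz
    # Above the knee, interpolate linearly so Nyquist maps to Nyquist.
    slope = (nyquist_hz - alpha * knee_hz) / (nyquist_hz - knee_hz)
    return alpha * knee_hz + slope * (freq_hz - knee_hz)


# Hypothetical warping factor for one speaker.
warped_low = warp_frequency(1000.0, 1.1)
warped_top = warp_frequency(8000.0, 1.1)
```

In a maximum-likelihood setup such as the one the abstract mentions, the warping factor would typically be chosen per speaker from a small grid of candidate values. That search is not shown here.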
Learning to adapt: meta-learning approaches for speaker adaptation
The performance of automatic speech recognition systems degrades rapidly when there
is a mismatch between training and testing conditions. One way to compensate for this
mismatch is to adapt an acoustic model to test conditions, for example by performing
speaker adaptation. In this thesis we focus on the discriminative model-based speaker
adaptation approach. The success of this approach relies on having a robust speaker
adaptation procedure – we need to specify which parameters should be adapted and
how they should be adapted. Unfortunately, tuning the speaker adaptation procedure
requires considerable manual effort.
In this thesis we propose to formulate speaker adaptation as a meta-learning task. In
meta-learning, learning occurs on two levels: a learner learns a task specific model and
a meta-learner learns how to train these task specific models. In our case, the learner is
a speaker-dependent model and the meta-learner learns to adapt a speaker-independent
model into the speaker-dependent model. By using this formulation, we can automatically learn robust speaker adaptation procedures using gradient descent. In the experiments, we demonstrate that the meta-learning approach learns competitive adaptation
schedules compared to adaptation procedures with handcrafted hyperparameters.
Subsequently, we show that speaker adaptive training can be formulated as a meta-learning task as well. In contrast to the traditional approach, which maintains and optimises a copy of speaker dependent parameters for each speaker during training, we
embed the gradient based adaptation directly into the training of the acoustic model.
We hypothesise that this formulation should steer the training of the acoustic model
into finding parameters better suited for test-time speaker adaptation. We experimentally compare our approach with test-only adaptation of a standard baseline model and
with SAT-LHUC, which represents a traditional speaker adaptive training method. We
show that the meta-learning speaker-adaptive training approach achieves comparable
results with SAT-LHUC. However, neither the meta-learning approach nor SAT-LHUC
outperforms the baseline approach after adaptation.
Consequently, we run a series of experimental ablations to determine why SAT-LHUC does not yield any improvements compared to the baseline approach. In these
experiments we explored multiple factors such as using various neural network architectures, normalisation techniques, activation functions or optimisers. We find that
SAT-LHUC interferes with batch normalisation, and that it benefits from an increased
hidden layer width and an increased model size. However, the baseline model benefits from increased capacity too, therefore in order to obtain the best model it is still
favourable to train a speaker independent model with batch normalisation. As such, an
effective way of training state-of-the-art SAT-LHUC models remains an open question.
Finally, we show that the performance of unsupervised speaker adaptation can be
further improved by using discriminative adaptation with lattices as supervision obtained from a first pass decoding, instead of traditionally used one-best path transcriptions. We find that this proposed approach enables many more parameters to
be adapted without overfitting being observed, and is successful even when the initial
transcription has a WER in excess of 50%.
Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Categorization
Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.
National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624)
Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Identification
Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. Such a transformation enables speech to be understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.
National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624)
Text-Independent Speaker Verification Using 3D Convolutional Neural Networks
In this paper, a novel method using 3D Convolutional Neural Network (3D-CNN)
architecture has been proposed for speaker verification in the text-independent
setting. One of the main challenges is the creation of the speaker models. Most
of the previously-reported approaches create speaker models based on averaging
the extracted features from utterances of the speaker, which is known as the
d-vector system. In our paper, we propose an adaptive feature learning by
utilizing the 3D-CNNs for direct speaker model creation in which, for both
development and enrollment phases, an identical number of spoken utterances per
speaker is fed to the network for representing the speakers' utterances and
creation of the speaker model. This leads to simultaneously capturing the
speaker-related information and building a more robust system to cope with
within-speaker variation. We demonstrate that the proposed method significantly
outperforms the traditional d-vector verification system. Moreover, the
proposed system can also be an alternative to the traditional d-vector system
which is a one-shot speaker modeling system by utilizing 3D-CNNs.
Comment: Accepted to be published in IEEE International Conference on
Multimedia and Expo (ICME) 201
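The averaging-based speaker model that the abstract above calls the d-vector system can be sketched as follows. This illustrates the baseline only, not the proposed 3D-CNN method. The two-dimensional feature vectors are toy values, the helper names are hypothetical, and cosine similarity is an assumed choice of verification score that the abstract does not specify.

```python
import math


def average_vectors(vectors):
    """Build a speaker model by averaging per-utterance feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical features extracted from two enrollment utterances.
enrollment = [[1.0, 0.0], [0.0, 1.0]]
speaker_model = average_vectors(enrollment)

# Score a hypothetical test utterance against the enrolled speaker model.
score = cosine_similarity(speaker_model, [1.0, 1.0])
```

A verification decision would then compare `score` with a threshold tuned on held-out data.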
A Generative Model for Score Normalization in Speaker Recognition
We propose a theoretical framework for thinking about score normalization,
which confirms that normalization is not needed under (admittedly fragile)
ideal conditions. If, however, these conditions are not met, e.g. under
data-set shift between training and runtime, our theory reveals dependencies
between scores that could be exploited by strategies such as score
normalization. Indeed, it has been demonstrated over and over experimentally,
that various ad-hoc score normalization recipes do work. We present a first
attempt at using probability theory to design a generative score-space
normalization model which gives similar improvements to ZT-norm on the
text-dependent RSR 2015 database.
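For context, the ad-hoc score normalization recipes that the abstract above refers to share one basic operation: a raw score is shifted and scaled by the statistics of a set of cohort scores. The sketch below shows that operation only and is not the generative model proposed in the paper. In Z-norm the cohort scores come from scoring the enrolled model against impostor utterances, in T-norm they come from scoring the test utterance against cohort models, and ZT-norm chains the two. The function name and the toy numbers are hypothetical.

```python
import statistics


def normalize_score(raw_score, cohort_scores):
    """Shift and scale a raw score by the cohort mean and standard deviation."""
    mean = statistics.mean(cohort_scores)
    std = statistics.pstdev(cohort_scores)
    if std == 0.0:
        raise ValueError("cohort scores must not all be identical")
    return (raw_score - mean) / std


# Hypothetical raw trial score and impostor cohort scores.
normalized = normalize_score(3.0, [0.0, 2.0])
```

After normalization, scores from different speaker models or test utterances are on a more comparable scale, so a single decision threshold can be shared across trials.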