18 research outputs found
Vocal Tract Length Normalization for Statistical Parametric Speech Synthesis
Vocal tract length normalization (VTLN) has been successfully used in automatic speech recognition for improved performance. The same technique can be implemented in statistical parametric speech synthesis for rapid speaker adaptation during synthesis. This paper presents an efficient implementation of VTLN using expectation maximization and addresses the key challenges faced in implementing VTLN for synthesis. Jacobian normalization, high-dimensional features and truncation of the transformation matrix are among the challenges presented, together with appropriate solutions. Detailed evaluations are performed to determine the most suitable technique for using VTLN in speech synthesis. Evaluating VTLN in the framework of speech synthesis is not an easy task, since the technique does not work equally well for all speakers. Speakers have been selected based on different objective and subjective criteria to demonstrate the differences between systems. The best method for implementing VTLN is confirmed to be the use of lower-order features for estimating warping factors.
Speaker normalisation for large vocabulary multiparty conversational speech recognition
One of the main problems faced by automatic speech recognition is the variability of
the testing conditions. This is due both to the acoustic conditions (different transmission
channels, recording devices, noises etc.) and to the variability of speech
across different speakers (i.e. due to different accents, coarticulation of phonemes
and different vocal tract characteristics). Vocal tract length normalisation (VTLN)
aims at normalising the acoustic signal, making it independent of the vocal tract
length. This is done by a speaker-specific warping of the frequency axis, parameterised
through a warping factor. In this thesis the application of VTLN to multiparty
conversational speech was investigated focusing on the meeting domain. This
is a challenging task showing a great variability of the speech acoustics both across
different speakers and across time for a given speaker. VTL, the distance between
the lips and the glottis, varies over time. We observed that the warping factors estimated
using Maximum Likelihood seem to be context dependent: they appear to be
influenced by the current conversational partner and are correlated with the behaviour
of formant positions and pitch. This is because VTL also influences the
frequency of vibration of the vocal cords and thus the pitch. In this thesis we also
investigated pitch-adaptive acoustic features with the goal of further improving the
speaker normalisation provided by VTLN.
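The frequency-axis warping described above is commonly realised as a piecewise-linear map controlled by a single warping factor. The following sketch is only an illustration of that idea; the knee position, cut-off fraction and function names are assumptions, not taken from the thesis.

```python
import numpy as np

def vtln_warp(freqs_hz, alpha, f_max=8000.0, f_cut=0.875):
    """Piecewise-linear VTLN frequency warping (illustrative).

    Frequencies below a knee point are scaled by the warping factor
    `alpha`; above the knee the mapping is linear so that f_max maps
    onto itself. alpha > 1 compresses the axis (shorter vocal tract),
    alpha < 1 stretches it.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    knee = f_cut * f_max * min(1.0, 1.0 / alpha)
    return np.where(
        freqs_hz <= knee,
        alpha * freqs_hz,
        alpha * knee
        + (f_max - alpha * knee) / (f_max - knee) * (freqs_hz - knee),
    )
```

Because the two linear pieces meet at the knee and the upper piece is anchored at `f_max`, the warp is continuous and preserves the Nyquist endpoint regardless of the warping factor.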
We explored the use of acoustic features obtained using a pitch-adaptive analysis
in combination with conventional features such as Mel frequency cepstral coefficients.
These spectral representations were combined both at the acoustic feature
level using heteroscedastic linear discriminant analysis (HLDA), and at the system
level using ROVER. We evaluated this approach on a challenging large vocabulary
speech recognition task: multiparty meeting transcription. We found that VTLN
benefits the most from pitch-adaptive features. Our experiments also suggested that
combining conventional and pitch-adaptive acoustic features using HLDA results in
a consistent, significant decrease in the word error rate across all the tasks. Combining
at the system level using ROVER resulted in a further significant improvement.
Further experiments compared the use of pitch adaptive spectral representation with
the adoption of a smoothed spectrogram for the extraction of cepstral coefficients.
It was found that pitch-adaptive spectral analysis, providing a representation which
is less affected by pitch artefacts (especially for high-pitched speakers), delivers features with improved speaker independence. Furthermore, this was also shown to
be advantageous when HLDA is applied. The combination of a pitch-adaptive spectral
representation and VTLN-based speaker normalisation in the context of LVCSR
for multiparty conversational speech led to more speaker-independent acoustic models,
improving overall recognition performance.
Contributions to speech processing and ambient sound analysis
We are constantly surrounded by sounds that we continuously exploit to adapt our actions to the situations we face. Some sounds, such as speech, have a particular structure from which we can infer information, explicit or not. This is one reason why speech is perhaps the most intuitive way for humans to communicate. Within the last decade there has been significant progress in speech and audio processing, and in particular in machine learning applied to speech and audio processing. Thanks to this progress, speech has become a central element in many human-to-human distant-communication tools as well as in human-machine communication systems. These solutions work well on clean speech or under controlled conditions; however, in scenarios involving acoustic perturbations such as noise or reverberation, performance tends to degrade severely. In this thesis we focus on processing speech and its environment from an audio perspective. The algorithms proposed here rely on a variety of solutions, from signal-processing-based approaches to data-driven solutions based on supervised matrix factorization or deep neural networks. We propose solutions to problems ranging from speech recognition to speech enhancement and ambient sound analysis. The aim is to offer a panorama of the different aspects that could improve a speech processing algorithm working in real environments. We start by describing automatic speech recognition as a potential end application and progressively unravel its limitations and the proposed solutions, ending with the more general problem of ambient sound analysis.
A Bandpass Transform for Speaker Normalization
One of the major challenges for Automatic Speech Recognition is to handle speech variability. Inter-speaker variability is partly due to differences in speakers' anatomy and especially in their Vocal Tract geometry. Dissimilarities in Vocal Tract Length (VTL) are a known source of speech variation. Vocal Tract Length Normalization is a popular Speaker Normalization technique that can be implemented as a transformation of a spectrum's frequency axis. In this document we introduce a new spectral transformation for Speaker Normalization. We use the Bilinear Transformation to introduce a new frequency warping resulting from the mapping of a prototype Band-Pass (BP) filter onto a general BP filter. This new transformation, called the Bandpass Transformation (BPT), offers two degrees of freedom, enabling complex warpings of the frequency axis that differ from previous work with the Bilinear Transform. We then define a procedure for using BPT for Speaker Normalization, based on the Nelder-Mead algorithm for estimating the BPT parameters. We present a detailed study of the performance of our new approach on two test sets with gender-dependent and gender-independent systems. Our results demonstrate clear improvements compared to standard methods used in VTL Normalization. A score compensation procedure is also presented, which further improves our results by refining the BPT parameter estimation.
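The two-parameter BPT itself is not specified in the abstract, but the general procedure (a bilinear-transform-based frequency warp whose parameters are fitted with Nelder-Mead) can be sketched with the classic one-parameter all-pass warp. Everything below is illustrative: the squared-error objective on formant-like targets stands in for whatever criterion the paper actually optimises.

```python
import numpy as np
from scipy.optimize import minimize

def bilinear_warp(omega, alpha):
    """Classic one-parameter bilinear (first-order all-pass) frequency
    warp; omega is normalised angular frequency in [0, pi]. The paper's
    BPT has two degrees of freedom; this is only the standard special
    case, shown for illustration."""
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega))
    )

def estimate_alpha(source_freqs, target_freqs):
    """Fit the warp parameter with Nelder-Mead so that warped source
    frequencies match the targets (illustrative squared-error objective)."""
    src = np.asarray(source_freqs)
    tgt = np.asarray(target_freqs)
    obj = lambda a: np.sum((bilinear_warp(src, a[0]) - tgt) ** 2)
    res = minimize(obj, x0=[0.0], method="Nelder-Mead")
    return res.x[0]
```

Nelder-Mead is derivative-free, which is why it suits warping functions like these whose gradients with respect to the warp parameters are awkward to derive.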
Automatic speech recognition: from study to practice
Today, automatic speech recognition (ASR) is widely used for different purposes such as robotics, multimedia, and medical and industrial applications. Although much research has been performed in this field over the past decades, there is still a lot of room for improvement. In order to start working in this area, complete knowledge of ASR systems, as well as of their weak points and problems, is indispensable. Besides that, practical experience consolidates theoretical understanding in a reliable way. With this in mind, in this master's thesis we first review the principal structure of standard HMM-based ASR systems from a technical point of view. This includes feature extraction, acoustic modeling, language modeling and decoding. Then, the most significant challenges in ASR systems are discussed. These challenges concern the characteristics of different internal components, or external factors, that affect ASR system performance. Furthermore, we have implemented a Spanish-language recognizer using the HTK toolkit. Finally, two open research lines, based on the study of different sources in the field of ASR, are suggested for future work.
Automatic Speech Recognition for ageing voices
With ageing, human voices undergo several changes which are typically characterised
by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking
rate. The focus of this thesis is to understand the impact of ageing on Automatic
Speech Recognition (ASR) performance and improve the ASR accuracies for older
voices.
Baseline results on three corpora indicate that the word error rates (WER) for older
adults are significantly higher than those of younger adults, and that the decrease in
accuracy is greater for male speakers than for female speakers.
Acoustic parameters such as jitter and shimmer, which measure glottal source disfluencies,
were found to be significantly higher for older adults. However, the hypothesis
that these changes explain the differences in WER between the two age groups proved incorrect:
experiments with the artificial introduction of glottal source disfluencies into speech
from younger adults did not show a significant impact on WERs. Changes in fundamental
frequency, observed quite often in older voices, have only a marginal impact on ASR
accuracies.
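Jitter and shimmer, as mentioned above, quantify cycle-to-cycle variation of the glottal source. A minimal sketch using the standard "local" definitions follows; whether the thesis uses these exact variants (rather than e.g. RAP or APQ) is an assumption.

```python
import numpy as np

def jitter_local(periods):
    """Local jitter: mean absolute difference between consecutive
    glottal periods, relative to the mean period (often reported in %)."""
    p = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def shimmer_local(amplitudes):
    """Local shimmer: the analogous measure computed on the peak
    amplitudes of consecutive glottal cycles."""
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)
```

A perfectly periodic voice gives zero for both measures; rising values reflect the increased hoarseness and breathiness associated with ageing voices.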
Analysis of phoneme errors between younger and older speakers shows that certain
phonemes, especially low vowels, are more affected by ageing. These changes,
however, vary across speakers. Another factor that is strongly associated
with ageing voices is a decrease in the rate of speech. Experiments to analyse
the impact of slower speaking rate on ASR accuracies indicate that the insertion errors
increase while decoding slower speech with models trained on relatively faster speech.
We then propose a way to characterise speakers in acoustic space based on speaker
adaptation transforms and observe that speakers (especially males) can be segregated
with reasonable accuracies based on age. Inspired by this, we look at supervised hierarchical
acoustic models based on gender and age. Significant improvements in word
accuracies are achieved over the baseline results with such models. The idea is then extended
to construct unsupervised hierarchical models which also outperform the baseline
models by a good margin.
Finally, we hypothesize that ASR accuracy can be improved by augmenting
the adaptation data with speech from the acoustically closest speakers. A strategy to select
the augmentation speakers is proposed. Experimental results on two corpora indicate
that the hypothesis holds true only when the amount of available adaptation data is limited
to a few seconds. The efficacy of such a speaker selection strategy is analysed for both
younger and older adults.
Adaptation of speech recognition systems to selected real-world deployment conditions
This habilitation thesis deals with adaptation of automatic speech
recognition (ASR) systems to selected real-world deployment conditions.
It is presented in the form of a collection of twelve articles
dealing with this task; I am the main author or a co-author of these
articles. They were published during my work on several consecutive
research projects. I participated in these projects both as a member
of the research team and as an investigator or co-investigator.
These articles can be divided into three main groups according to
their topics. They have in common the effort to adapt a particular
ASR system to a specific factor or deployment condition that affects
its function or accuracy.
The first group of articles is focused on an unsupervised speaker
adaptation task, where the ASR system adapts its parameters to
the specific voice characteristics of one particular speaker. The second
part deals with a) methods allowing the system to identify
non-speech events on the input, and b) the related task of recognition
of speech with non-speech events, particularly music, in the
background. Finally, the third part is devoted to the methods
that allow the transcription of an audio signal containing multilingual
utterances. It includes a) approaches for adapting the existing
recognition system to a new language and b) methods for identification
of the language from the audio signal.
The two mentioned identification tasks are in particular investigated
under the demanding and less explored frame-wise scenario,
which is the only one suitable for processing of on-line data streams.
A System for Simultaneous Translation of Lectures and Speeches
This thesis realizes the first existing automatic system for simultaneous speech-to-speech translation. The focus of this system is the automatic translation of (technically oriented) lectures and speeches from English to Spanish, but the different aspects described in this thesis will also be helpful for developing simultaneous translation systems for other domains or languages.
Robust learning of acoustic representations from diverse speech data
Automatic speech recognition is increasingly applied to new domains. A key challenge is
to robustly learn, update and maintain representations to cope with transient acoustic
conditions. A typical example is broadcast media, for which speakers and environments
may change rapidly, and available supervision may be poor. The concern of this
thesis is to build and investigate methods for acoustic modelling that are robust to the
characteristics and transient conditions as embodied by such media.
The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio
with approximate labels, but training methods can be sensitive to label errors, and their
use is therefore not trivial. State-of-the-art semi-supervised training makes effective
use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid
overfitting to poor supervision, but does not make use of the transcriptions. Existing
approaches that do aim to make use of the transcriptions typically employ an algorithm
to filter or combine the transcriptions with the recognition output from a seed model,
but the final result does not encode uncertainty. We propose a method to combine the
lattice output from a biased recognition pass with the transcripts, crucially preserving
uncertainty in the lattice where appropriate. This substantially reduces the word error
rate on a broadcast task.
The second contribution is a method to factorise representations for speakers and
environments so that they may be combined in novel combinations. In realistic scenarios,
the speaker or environment transform at test time might be unknown, or there may be
insufficient data to learn a joint transform. We show that in such cases, factorised, or
independent, representations are required to avoid deteriorating performance. Using
i-vectors, we factorise speaker or environment information using multi-condition training
with neural networks. Specifically, we extract bottleneck features from networks trained
to classify either speakers or environments. The resulting factorised representations
prove beneficial when one factor is missing at test time, or when all factors are seen,
but not in the desired combination.
The third contribution is an investigation of model adaptation in a longitudinal
setting. In this scenario, we repeatedly adapt a model to new data, with the constraint
that previous data becomes unavailable. We first demonstrate the effect of such a
constraint, and show that using a cyclical learning rate may help. We then observe
that these successive models lend themselves well to ensembling. Finally, we show
that the impact of this constraint in an active learning setting may be detrimental to
performance, and suggest combining active learning with semi-supervised training to
avoid biasing the model.
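A triangular schedule is one common form of cyclical learning rate; the sketch below is purely illustrative and not necessarily the schedule used in the thesis (the base rate, peak rate and cycle length are assumed values).

```python
def cyclical_lr(step, base_lr=1e-4, max_lr=1e-3, cycle_len=1000):
    """Triangular cyclical learning rate: rises linearly from base_lr
    to max_lr over the first half of each cycle, then falls back."""
    pos = (step % cycle_len) / cycle_len   # position within the cycle, [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)       # triangular wave: 0 -> 1 -> 0
    return base_lr + (max_lr - base_lr) * tri
```

The periodic return to a high learning rate is what helps in the longitudinal setting: each new adaptation round starts with large steps that can move the model away from the previous data's optimum.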
The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature
extractor, known as SincNet. In contrast to traditional techniques that warp the
filterbank frequencies in standard feature extraction, adapting SincNet parameters is
more flexible and more readily optimised, whilst maintaining interpretability. On a task
adapting from adult to child speech, we show that this layer is well suited for adaptation
and is very effective with respect to the small number of adapted parameters.
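SincNet's filters, referred to in the fourth contribution above, are band-pass FIR kernels parameterised only by their two cutoff frequencies, which is what makes adapting them so parameter-efficient. A minimal sketch of one such filter follows; the tap count, Hamming window and sampling rate are illustrative choices, not necessarily those used in the thesis.

```python
import numpy as np

def sinc_filter(f1, f2, n_taps=101, sr=16000):
    """Band-pass FIR kernel parameterised by its two cutoffs (Hz),
    in the SincNet style: the difference of two low-pass sinc
    kernels, smoothed with a Hamming window. Adapting the layer
    means adapting only f1 and f2 for each filter."""
    t = np.arange(-(n_taps // 2), n_taps // 2 + 1) / sr
    def lowpass(fc):
        # Ideal low-pass impulse response with cutoff fc, sampled at sr.
        return 2.0 * fc * np.sinc(2.0 * fc * t)
    return (lowpass(f2) - lowpass(f1)) * np.hamming(n_taps)
```

Because each filter contributes just two scalar parameters, adapting the whole layer from adult to child speech touches far fewer parameters than adapting a conventional convolutional front end, while the learned cutoffs remain directly interpretable.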
Learning to adapt: meta-learning approaches for speaker adaptation
The performance of automatic speech recognition systems degrades rapidly when there
is a mismatch between training and testing conditions. One way to compensate for this
mismatch is to adapt an acoustic model to test conditions, for example by performing
speaker adaptation. In this thesis we focus on the discriminative model-based speaker
adaptation approach. The success of this approach relies on having a robust speaker
adaptation procedure â we need to specify which parameters should be adapted and
how they should be adapted. Unfortunately, tuning the speaker adaptation procedure
requires considerable manual effort.
In this thesis we propose to formulate speaker adaptation as a meta-learning task. In
meta-learning, learning occurs on two levels: a learner learns a task-specific model and
a meta-learner learns how to train these task-specific models. In our case, the learner is
a speaker-dependent model and the meta-learner learns to adapt a speaker-independent
model into the speaker-dependent model. Using this formulation, we can automatically learn robust speaker adaptation procedures using gradient descent. In the experiments, we demonstrate that the meta-learning approach learns competitive adaptation
schedules compared to adaptation procedures with handcrafted hyperparameters.
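The two-level formulation above can be sketched with a toy first-order gradient-based meta-learner, where a scalar linear model stands in for the acoustic model and each "speaker" is a different true weight. Everything here (the model, data, learning rates and first-order approximation) is an illustrative assumption, not the thesis's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, x, y):
    """Squared-error loss gradient for a scalar linear model y ~ w*x."""
    err = w * x - y
    return np.mean(2.0 * err * x)

# Toy "speakers": each has its own true weight centred around 2.0.
speakers = [2.0 + 0.5 * rng.standard_normal() for _ in range(32)]

w_meta, inner_lr, meta_lr = 0.0, 0.1, 0.05
for epoch in range(200):
    for w_true in speakers:
        x = rng.standard_normal(16)
        y = w_true * x
        # Inner loop: the learner adapts to one speaker by gradient descent.
        w_adapted = w_meta - inner_lr * loss_grad(w_meta, x, y)
        # Outer loop (first-order approximation): the meta-learner updates
        # the shared initialisation using the gradient after adaptation.
        w_meta -= meta_lr * loss_grad(w_adapted, x, y)
```

After meta-training, `w_meta` sits near the centre of the speaker population, so a single inner-loop gradient step suffices to specialise it to any one speaker; that is exactly the property the meta-learning speaker adaptation approach seeks for the acoustic model.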
Subsequently, we show that speaker adaptive training can be formulated as a meta-learning task as well. In contrast to the traditional approach, which maintains and optimises a copy of speaker-dependent parameters for each speaker during training, we
embed the gradient based adaptation directly into the training of the acoustic model.
We hypothesise that this formulation should steer the training of the acoustic model
into finding parameters better suited for test-time speaker adaptation. We experimentally compare our approach with test-only adaptation of a standard baseline model and
with SAT-LHUC, which represents a traditional speaker adaptive training method. We
show that the meta-learning speaker-adaptive training approach achieves comparable
results with SAT-LHUC. However, neither the meta-learning approach nor SAT-LHUC
outperforms the baseline approach after adaptation.
Consequently, we run a series of experimental ablations to determine why SAT-LHUC does not yield any improvements compared to the baseline approach. In these
experiments we explored multiple factors such as using various neural network architectures, normalisation techniques, activation functions or optimisers. We find that
SAT-LHUC interferes with batch normalisation, and that it benefits from an increased
hidden layer width and an increased model size. However, the baseline model benefits from increased capacity too; therefore, in order to obtain the best model it is still
favourable to train a speaker-independent model with batch normalisation. As such, an
effective way of training state-of-the-art SAT-LHUC models remains an open question.
Finally, we show that the performance of unsupervised speaker adaptation can be
further improved by using discriminative adaptation with lattices as supervision obtained from a first-pass decoding, instead of the traditionally used one-best path transcriptions. We find that this proposed approach enables many more parameters to
be adapted without overfitting being observed, and is successful even when the initial
transcription has a WER in excess of 50%.