21 research outputs found
Combining vocal tract length normalization with hierarchial linear transformations
Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech with naturalness preferable to that of MLLR-based adaptation techniques, being much closer in quality to that generated by the original av-erage voice model. However with only a single parameter, VTLN captures very few speaker specific characteristics when compared to linear transform based adaptation techniques. This paper pro-poses that the merits of VTLN can be combined with those of linear transform based adaptation in a hierarchial Bayesian frame-work, where VTLN is used as the prior information. A novel tech-nique for propagating the gender information from the VTLN prior through constrained structural maximum a posteriori linear regres-sion (CSMAPLR) adaptation is presented. Experiments show that the resulting transformation has improved speech quality with better naturalness, intelligibility and improved speaker similarity. Index Terms â Statistical parametric speech synthesis, hidden Markov models, speaker adaptation, vocal tract length normaliza-tion, constrained structural maximum a posteriori linear regression 1
A new method for speaker adaptation using bilinear model
ABSTRACT In this paper, a novel method for speaker adaptation using bilinear model is proposed. Bilinear model can express both characteristics of speakers (style) and phonemes across speakers (content) independently in a training database. The mapping from each speaker and phoneme space to observation space is carried out using bilinear mapping matrix which is independent of speaker and phoneme space. We apply the bilinear model to speaker adaption. Using adaptation data from a new speaker, speaker-adapted model is built by estimating the style(speaker)-specific matrix. Experimental results showed that the proposed method outperformed eigenvoice and MLLR. In vocabulary-independent isolated word recognition for speaker adaptation, bilinear model reduced word error rate by about 38% and about 10% compared to eigenvoice and MLLR respectively using 50 words for adaptation
Combining Vocal Tract Length Normalization with Linear Transformations in a Bayesian Framework
Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech with naturalness preferable to that of MLLR- based adaptation techniques, being much closer in quality to that generated by the original average voice model. By contrast, with just a single parameter, VTLN captures very few speaker specific characteristics when compared to the available linear transform based adaptation techniques. This paper proposes that the merits of VTLN can be combined with those of linear transform based adaptation technique in a Bayesian framework, where VTLN is used as the prior information. A novel technique of propa- gating the gender information from the VTLN prior through constrained structural maximum a posteriori linear regression (CSMAPLR) adaptation is presented. Experiments show that the resulting transformation has improved speech quality with better naturalness, intelligibility and improved speaker similarity
Adaptation of childrenâs speech with limited data based on formant-like peak alignment,â
Abstract Automatic recognition of children's speech using acoustic models trained by adults results in poor performance due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications, the amount of adaptation data available may be less than what is needed by common speaker adaptation techniques to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and corresponding transformations in the back-end. Three common feature extraction schemes are investigated and their transformation linearity in the back-end are discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children's speech. The peaks are estimated by Gaussian mixtures using the Expectation-Maximization (EM) algorith
VTLN Adaptation for Statistical Speech Synthesis
The advent of statistical speech synthesis has enabled the unification of the basic techniques used in speech synthesis and recognition. Adaptation techniques that have been successfully used in recognition systems can now be applied to synthesis systems to improve the quality of the synthesized speech. The application of vocal tract length normalization (VTLN) for synthesis is explored in this paper. VTLN based adaptation requires estimation of a single warping factor, which can be accurately estimated from very little adaptation data and gives additive improvements over CMLLR adaptation. The challenge of estimating accurate warping factors using higher order features is solved by initializing warping factor estimation with the values calculated from lower order features
Study of Jacobian Normalization for VTLN
The divergence of the theory and practice of vocal tract length normalization (VTLN) is addressed, with particular emphasis on the role of the Jacobian determinant. VTLN is placed in a Bayesian setting, which brings in the concept of a prior on the warping factor. The form of the prior, together with acoustic scaling and numerical conditioning are then discussed and evaluated. It is concluded that the Jacobian determinant is important in VTLN, especially for the high dimensional features used in HMM based speech synthesis, and difficulties normally associated with the Jacobian determinant can be attributed to prior and scaling
Bias Adaptation for Vocal Tract Length Normalization
Vocal tract length normalisation (VTLN) is a well known rapid adaptation technique. VTLN as a linear transformation in the cepstral domain results in the scaling and translation factors. The warping factor represents the spectral scaling parameter. While, the translation factor represented by bias term captures more speaker characteristics especially in a rapid adaptation framework without having the risk of over-fitting. This paper presents a complete and comprehensible derivation of the bias transformation for VTLN and implements it in a unified framework for statistical parametric speech synthesis and recognition. The recognition experiments show that bias term improves the rapid adaptation performance and gives additional performance over the cepstral mean normalisation factor. It was observed from the synthesis results that VTLN bias term did not have much effect in combination with model adaptation techniques that already have a bias transformation incorporated
Vocal Tract Length Normalization for Statistical Parametric Speech Synthesis
Vocal tract length normalization (VTLN) has been successfully used in automatic speech recognition for improved performance. The same technique can be implemented in statistical parametric speech synthesis for rapid speaker adaptation during synthesis. This paper presents an efficient implementation of VTLN using expectation maximization and addresses the key challenges faced in implementing VTLN for synthesis. Jacobian normalization, high dimensionality features and truncation of the transformation matrix are a few challenges presented with the appropriate solutions. Detailed evaluations are performed to estimate the most suitable technique for using VTLN in speech synthesis. Evaluating VTLN in the framework of speech synthesis is also not an easy task since the technique does not work equally well for all speakers. Speakers have been selected based on different objective and subjective criteria to demonstrate the difference between systems. The best method for implementing VTLN is confirmed to be use of the lower order features for estimating warping factors
Automatic Speech Recognition for ageing voices
With ageing, human voices undergo several changes which are typically characterised
by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking
rate. The focus of this thesis is to understand the impact of ageing on Automatic
Speech Recognition (ASR) performance and improve the ASR accuracies for older
voices.
Baseline results on three corpora indicate that the word error rates (WER) for older
adults are significantly higher than those of younger adults and the decrease in accuracies
is higher for males speakers as compared to females.
Acoustic parameters such as jitter and shimmer that measure glottal source disfluencies
were found to be significantly higher for older adults. However, the hypothesis
that these changes explain the differences in WER for the two age groups is proven incorrect.
Experiments with artificial introduction of glottal source disfluencies in speech
from younger adults do not display a significant impact on WERs. Changes in fundamental
frequency observed quite often in older voices has a marginal impact on ASR
accuracies.
Analysis of phoneme errors between younger and older speakers shows a pattern
of certain phonemes especially lower vowels getting more affected with ageing. These
changes however are seen to vary across speakers. Another factor that is strongly associated
with ageing voices is a decrease in the rate of speech. Experiments to analyse
the impact of slower speaking rate on ASR accuracies indicate that the insertion errors
increase while decoding slower speech with models trained on relatively faster speech.
We then propose a way to characterise speakers in acoustic space based on speaker
adaptation transforms and observe that speakers (especially males) can be segregated
with reasonable accuracies based on age. Inspired by this, we look at supervised hierarchical
acoustic models based on gender and age. Significant improvements in word
accuracies are achieved over the baseline results with such models. The idea is then extended
to construct unsupervised hierarchical models which also outperform the baseline
models by a good margin.
Finally, we hypothesize that the ASR accuracies can be improved by augmenting
the adaptation data with speech from acoustically closest speakers. A strategy to select
the augmentation speakers is proposed. Experimental results on two corpora indicate
that the hypothesis holds true only when the amount of available adaptation is limited
to a few seconds. The efficacy of such a speaker selection strategy is analysed for both
younger and older adults