A major factor which causes a deterioration in speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper sets out a new approach to using an acoustic glottal source model in HMM-based synthesisers instead of the traditional pulse signal. The goal is to improve speech quality and to better model and transform voice characteristics. We have found the new method decreases buzziness and also improves prosodic modelling. A perceptual evaluation has supported this finding by showing a 55.6% preference for the new system, as against the baseline. This improvement, while not being as significant as we had initially expected, does encourage us to work on developing the proposed speech synthesiser further.

Cabral, J.

Renals, Steve

Richmond, K.

Yamagishi, J.


HMM-based speech synthesiser using the LF-model of the glottal source



Citation for published version: Cabral, J, Renals, S, Yamagishi, J & Richmond, K 2011, 'HMM-based speech synthesiser using the LF-model of the glottal source', in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 4704-4707. DOI: 10.1109/ICASSP.2011.5947405
Document version: Author final version (often known as postprint)

HMM-BASED SPEECH SYNTHESISER USING THE LF-MODEL OF THE GLOTTAL SOURCE

João P. Cabral (1,2), Steve Renals (2), Junichi Yamagishi (2) and Korin Richmond (2)
(1) School of Computer Science and Informatics, University College Dublin, Ireland
(2) The Centre for Speech Technology Research, University of Edinburgh, UK
joao.cabral@ucd.ie, s.renals@ed.ac.uk, jyamagis@inf.ed.ac.uk, korin@cstr.ed.ac.uk

ABSTRACT

A major factor which causes a deterioration in speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper sets out a new approach to using an acoustic glottal source model in HMM-based synthesisers instead of the traditional pulse signal. The goal is to improve speech quality and to better model and transform voice characteristics. We have found the new method decreases buzziness and also improves prosodic modelling. A perceptual evaluation has supported this finding by showing a 55.6% preference for the new system, as against the baseline. This improvement, while not being as significant as we had initially expected, does encourage us to work on developing the proposed speech synthesiser further.

Index Terms: HMM-based Speech Synthesis, LF-Model, Glottal Source Modelling

1. INTRODUCTION

HMM-based speech synthesisers typically generate speech by shaping a spectrally flat excitation with the spectral envelope of speech, e.g. [1]. A simple excitation model consists of using white noise for unvoiced speech and an impulse train for voiced speech. However, this model makes the synthetic speech sound buzzy and only allows control of the pitch (through the F0 parameter). A popular method to reduce the buzziness is to mix the impulse train with a noise component using a multi-band mixed excitation model, e.g. [2]. Recently, other excitation models have been used in statistical speech synthesis which try to better approximate the voiced excitation to the residual calculated using the inverse filtering technique, e.g. [3, 4]. These models can represent more details of the source than the noise.
However, they do not model relevant characteristics of the glottal source.

Speech can also be generated by passing a glottal source model through a filter representing the vocal tract system. However, the methods to estimate the glottal source and the vocal tract are typically less robust than those to estimate the spectral envelope. Nevertheless, this type of speech model has been successfully used in HMM-based synthesis. For example, the synthesiser in [5] models the glottal source and the vocal tract filter using LPC parameters. During synthesis, the excitation is obtained by transforming a real glottal pulse using the glottal parameters generated by the synthesiser. However, this approach does not allow control over glottal parameters related to voice quality and does not model the correlation between F0 and the glottal parameters.

[Footnote: First author is currently supported by the Science Foundation Ireland (Grant 07/CE/I1142). This paper is based on his PhD work supported by Marie Curie Early Stage Training Site EdSST (MEST-CT-2005-020568).]

In previous work [6], we used an acoustic glottal source model, the Liljencrants-Fant (LF) model [7], in the synthesis part of an HMM-based speech synthesiser. In this system, a selected LF-model signal was passed through a post-filter to obtain a spectrally flat excitation and then speech was generated by shaping the excitation with the spectral envelope.

In this work, we propose another HMM-based speech synthesiser, which generates speech by passing the LF-model signal through the vocal tract filter. The LF-model parameters are trained in the system, which allows the natural variations of the glottal parameters with F0 to be modelled. In addition, the LF-model parameters can be used to control relevant properties of the glottal pulse shape that are correlated with voice quality, such as breathiness. The vocal tract filter is estimated using the Glottal Spectral Separation (GSS) method [8].
In this paper, we also propose an extension to the GSS synthesis method which consists of mixing the LF-model with a noise component in order to improve speech naturalness further.

2. LILJENCRANTS-FANT MODEL

The Liljencrants-Fant (LF) model [7] is an acoustic model of the glottal source derivative. It can be represented by the following equation:

$$
e_{LF}(t) =
\begin{cases}
E_0\, e^{\alpha t} \sin(\omega_g t), & t_o \le t \le t_e \\
-\dfrac{E_e}{\epsilon T_a} \left[ e^{-\epsilon (t - t_e)} - e^{-\epsilon (t_c - t_e)} \right], & t_e < t \le t_c \\
0, & t_c < t \le T_0
\end{cases}
\tag{1}
$$

where $\omega_g = \pi / t_p$. The LF-model is defined by six shape parameters: $t_c$, $t_p$, $t_e$, $T_a$, $T_0$, and $E_e$. The remaining parameters ($E_0$, $\epsilon$ and $\alpha$) can be calculated using the energy and continuity constraints, which are given by $\int_0^{T_0} e_{LF}(t)\,dt = 0$ and $e_{LF}(t_e) = e_{LF}(t_e^+) = -E_e$, respectively.

The LF-model is often represented by the first two branches of (1) for simplification. In this case, the instant of complete closure, $t_c$, is set to the period $T_0$ in the second branch. In this work, this simplified LF-model was used.

3. BASELINE HMM-BASED SPEECH SYNTHESISER

The baseline statistical speech synthesiser used in this work employed the MATLAB version of the STRAIGHT vocoder (STRAIGHTV40). This system is an implementation of the Nitech-HTS 2005 speech synthesiser [1].

3.1. Analysis

The STRAIGHT analysis method was used to calculate the FFT parameters of the spectral envelope of the short-time speech signal (40 ms long) and aperiodicity parameters (FFT coefficients) measured in the speech spectrum. These parameters were transformed to more suitable features for statistical modelling. The spectral envelope was converted to mel-cepstral coefficients, whereas the aperiodicity measurements were averaged over five frequency bands: 0-1, 1-2, 2-4, 4-6, and 6-8 kHz. Meanwhile, F0 was estimated using the RAPT algorithm [9].

3.2. Acoustic Modelling

The statistical model was a five-state left-to-right hidden semi-Markov model (HSMM). Both state output density function and state duration were modelled using a single Gaussian distribution.
Each observation feature vector consisted of five streams: mel-cepstrum, aperiodicity, log F0, Δ of log F0 and Δ² of log F0. The spectrum and aperiodicity parameters were modelled by continuous HMMs, while the last three streams were modelled by multi-space probability distribution HMMs (MSD-HMMs) because F0 is not defined in unvoiced regions. The spectrum and aperiodicity streams included the static and dynamic features.

The context-dependent models were also clustered using different decision trees for the spectrum, F0 and duration parameters, since the influence of the contextual factors varies for each of these.

3.3. Synthesis

The STRAIGHT vocoder was used to synthesise speech by convolving a spectrally flat excitation with the spectral envelope of speech (obtained from the mel-cepstrum). For voiced speech, the aperiodicity parameters were used to derive $W_p(\omega)$ and $W_a(\omega)$, which are the weighting functions for the spectra of the impulse train (phase manipulated) and white noise respectively. The resulting weighted signals were then added together to obtain the mixed excitation.

4. HMM-BASED SPEECH SYNTHESISER USING LF-MODEL: HTS-LF

The baseline HMM-based speech synthesiser was modified in order to incorporate the LF-model. This system using glottal source modelling is called HTS-LF. The main differences between the two systems are the multi-stream structure of the speech parameter vector and the analysis-synthesis methods.

4.1. Analysis

In the HTS-LF system, the aperiodicity parameters were computed using the STRAIGHT method, whereas the LF-model parameters and the vocal tract spectrum were estimated as in the Glottal Spectral Separation method [8].

4.1.1. LF-model Parameters

The LF-model parameters were estimated from the Linear Prediction (LP) residual, as described in [8]. The residual was computed using the inverse filtering technique with pre-emphasis (α = 0.97).
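The inverse-filtering step can be illustrated with a minimal sketch. It assumes the autocorrelation method of linear prediction solved with the Levinson-Durbin recursion, a prediction order of 4, a synthetic toy frame, and that the inverse filter is applied to the pre-emphasised frame. None of these choices is stated in this paper (the actual procedure is the one described in [8]), and all function names are hypothetical.

```python
def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a frame (treated as zero outside it)."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return inverse-filter coefficients a (a[0] = 1) and the error energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lp_residual(frame, order=4, alpha=0.97):
    """Residual e[n] = sum_k a[k] * y[n-k], with y the pre-emphasised frame."""
    y = pre_emphasis(frame, alpha)
    a, _ = levinson_durbin(autocorrelation(y, order), order)
    return [sum(a[k] * y[n - k] for k in range(min(order, n) + 1))
            for n in range(len(y))]


# Toy voiced-like frame (an assumption for illustration only): a damped
# resonance excited by one impulse every 80 samples.
signal = []
for n in range(400):
    pulse = 1.0 if n % 80 == 0 else 0.0
    prev1 = signal[n - 1] if n >= 1 else 0.0
    prev2 = signal[n - 2] if n >= 2 else 0.0
    signal.append(1.3 * prev1 - 0.8 * prev2 + pulse)

residual = lp_residual(signal)
```

On such a frame the residual carries less energy than the pre-emphasised input, because the predictor removes most of the resonance and leaves the excitation pulses. On real speech the order, windowing and frame length would need to follow [8].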
Then, the LF-model parameters were calculated for each pitch cycle of the residual, which was delimited by contiguous glottal epochs. The estimation method consisted of fitting the LF-model waveform to the residual using a non-linear optimisation algorithm. The initial estimates of the iterative method were obtained by performing amplitude-based measurements on the residual.

The trajectories of the LF-parameters calculated for an utterance are shown in Figure 1 (a). A strong correlation between the glottal parameters and $T_0$ can be observed (direct proportion), with the exception of the parameter $T_a$. Short segments can also be found which show a different pattern of variation with $T_0$ that is not linear. These may be explained by prosody effects such as accented words and syllable stress.

4.1.2. Vocal Tract Spectrum

The speech signal was segmented into 40 ms long frames at 5 ms intervals. In voiced speech regions, the set of LF-model parameter values associated with each frame $s^j(t)$ was obtained by finding the closest epoch $i$ to the center of $s^j(t)$. These parameters were used to generate one period of the LF-model signal, $e^i_{LF}(t)$. Next, the speech spectrum $S^j(\omega)$ was divided by the amplitude spectrum of the LF-model signal, $|E^i_{LF}(\omega)|$, in order to remove the glottal source model effects. That is, $V^j(\omega) = S^j(\omega) / |E^i_{LF}(\omega)|$. Finally, the STRAIGHT vocoder was used to calculate the spectral envelope of the signal $V^j(\omega)$. For unvoiced speech, the spectral parameters were estimated by computing the spectral envelope of $S^j(\omega)$ using STRAIGHT.

The vocal tract spectrum obtained using the GSS method is expected to be sufficiently smooth, assuming that the LF-model parameter trajectories are smooth enough and that STRAIGHT computes a smooth spectrum. This is considered to be an important characteristic to obtain accurate modelling of the spectrum in the HMM-based speech synthesiser.

4.2. Acoustic Modelling

The statistical modelling part of the HTS-LF system is similar to the baseline system.
However, the F0 parameter vector of the baseline was replaced by the LF-model parameter vector in HTS-LF. The dimension of the LF-model, Δ and Δ² streams was set to 5. These streams were modelled by MSD-HMMs using a Gaussian distribution with diagonal covariance matrix for the voiced space. The clustering decision trees for the LF-model parameter streams were built using the same question set and minimum description length criterion as used for clustering the F0 streams in the baseline system. We assumed the contextual factors most relevant to the LF-parameters were similar to those for the F0 factors because these parameters are strongly correlated.

Figure 1 (b) shows that the parameter generation algorithm produces smoother trajectories than those obtained during speech analysis, mainly due to modelling by the HMMs. One advantage of this smoothing effect is attenuation of parameter discontinuities due to estimation errors in analysis.

4.3. Synthesis

In the GSS method proposed in previous work [8], voiced speech was generated by passing two cycles of the LF-model signal through the vocal tract filter. In this work, the multi-band mixed excitation of STRAIGHT was adapted in order to mix the noise component of the excitation with the LF-model signal. The advantage is to better model the noise component of the speech signal and improve speech naturalness.

Figure 2 shows the flowchart of the synthesis method used by the HTS-LF system. The LF-model signal has a decaying spectrum as it models the spectral tilt of the glottal source. In contrast, the noise spectrum is approximately flat. For this reason, these two signals cannot be mixed using the aperiodicity parameters as in STRAIGHT.
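To make the LF-model signal mentioned above concrete, the following sketch generates one period of the simplified two-branch pulse of Eq. (1), with the closure instant set to the period. The solution strategy is an assumption made for illustration: a fixed-point iteration for the return-phase constant from the continuity constraint, and bisection on the growth factor alpha from the zero-area constraint. The parameter values (an 8 ms period and so on) are also illustrative and not taken from the paper, and the function names are hypothetical.

```python
import math


def lf_parameters(T0, tp, te, Ta, Ee):
    """Derive E0, alpha and epsilon for the simplified LF-model (tc = T0)."""
    if not (0.0 < tp < te < 2.0 * tp and te + Ta < T0 and Ee > 0.0):
        raise ValueError("expected 0 < tp < te < 2*tp, te + Ta < T0 and Ee > 0")
    wg = math.pi / tp
    d = T0 - te  # duration of the return phase

    # Continuity at te: eps * Ta = 1 - exp(-eps * d), by fixed-point iteration
    # (assumes Ta is well below d, so that the iteration converges quickly).
    eps = 1.0 / Ta
    for _ in range(200):
        eps = (1.0 - math.exp(-eps * d)) / Ta

    # Closed-form area under the return phase (second branch of Eq. 1).
    return_area = -(Ee / (eps * Ta)) * (
        (1.0 - math.exp(-eps * d)) / eps - d * math.exp(-eps * d))

    s, c = math.sin(wg * te), math.cos(wg * te)

    def open_area(alpha):
        # Area of E0*exp(alpha*t)*sin(wg*t) on [0, te], with E0 chosen so
        # that the open-phase branch equals -Ee at te.
        num = alpha * s - wg * c + wg * math.exp(-alpha * te)
        return -Ee * num / ((alpha * alpha + wg * wg) * s)

    # Energy constraint: the total area over one period is zero.
    lo, hi = -50.0 / te, 50.0 / te
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if open_area(mid) + return_area > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    E0 = -Ee / (math.exp(alpha * te) * s)
    return E0, alpha, eps


def make_lf_pulse(T0, tp, te, Ta, Ee):
    """Return a function e_lf(t) evaluating the pulse for 0 <= t <= T0."""
    E0, alpha, eps = lf_parameters(T0, tp, te, Ta, Ee)
    wg = math.pi / tp

    def e_lf(t):
        if t <= te:  # open phase: growing sinusoid
            return E0 * math.exp(alpha * t) * math.sin(wg * t)
        # return phase: exponential recovery that reaches zero at T0
        return -(Ee / (eps * Ta)) * (
            math.exp(-eps * (t - te)) - math.exp(-eps * (T0 - te)))

    return e_lf


def sample_lf_pulse(e_lf, T0, fs):
    """Sample one period of the pulse at sampling rate fs (Hz)."""
    return [e_lf(n / fs) for n in range(int(round(T0 * fs)))]


# Illustrative values in seconds (assumptions, not from the paper).
T0, tp, te, Ta, Ee = 0.008, 0.0036, 0.0046, 0.0003, 1.0
e_lf = make_lf_pulse(T0, tp, te, Ta, Ee)
pulse = sample_lf_pulse(e_lf, T0, 16000)
```

The sharp negative peak at the excitation instant followed by a short return phase is what gives this pulse its decaying spectrum, in contrast to white noise; this mismatch is the mixing problem taken up in the text that follows.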
In order to overcome this problem, the white noise signal is shaped with the spectral envelope of the LF-model signal before the weighting operation. This shaping is performed in the frequency domain by multiplying the amplitude spectrum of one period of the LF-model signal, $|E_p(\omega)|$, by the amplitude spectrum of the noise signal, $N(\omega)$. The resulting noise signal has the same duration as the periodic LF-model signal, $E(\omega)$, and it is scaled in amplitude by the factor $K_n$ for the two signals to have the same power. The spectrum of the excitation can be represented by:

$$
X(\omega) = E(\omega)\, W_p(\omega) + K_n\, N(\omega)\, |E_p(\omega)|\, W_a(\omega) \tag{2}
$$

[Fig. 1. Example of trajectories of the LF-model parameters ($T_0$, $t_e$, $t_p$, $T_a$), plotted in ms against time in s: (a) parameters estimated in the analysis; (b) parameters generated by the HTS-LF system.]

[Fig. 2. Block diagram of the speech waveform generation technique used by the HTS-LF system.]

Speech is generated by passing the mixed excitation through the vocal tract filter. Finally, the speech frames are concatenated using overlap-and-add with asymmetric windows centered at the instants of maximum excitation. $G(\omega)$ contains phase information from the LF-model signal which is expected to reduce the buzziness effect of the impulse train.

5. PERCEPTUAL EVALUATION

A forced-choice A-B test was conducted to evaluate the speech quality of the HTS-LF system when compared to the standard HMM-based synthesiser.

5.1. Stimuli

The US English BDL voice (male) was built from the CMU ARCTIC speech database [10] for the two systems.
The size of the BDL speech corpus is approximately one hour.

The stimuli consisted of 36 pairs of utterances: 18 utterances synthesised with the two systems, randomly chosen and repeated twice with the order of the samples switched.

5.2. Experiment

The evaluation was conducted via the web. Subjects were asked to listen to the pairs of stimuli and for each pair they had to select the version (A or B) that sounded best. They were able to listen to the files in any order, and as many times as they liked. We also instructed them to make a random choice if they could not decide on the version they preferred.

Students and staff from the University of Edinburgh were asked to perform the evaluation. Fourteen listeners participated in the test, of which six were native speakers of English.

6. RESULTS

The results of the perceptual experiment are shown in Table 1. They are statistically significant with p ≤ 0.01. On average, the HTS-LF system obtained a higher rate of preference. Nevertheless, the results were expected to be even better, as the improvement in the quality of resynthesised natural speech (without modelling) when using the LF-model compared to the impulse train was substantial in a previous evaluation [8].

                          Baseline        HTS-LF
Mean preference (%)       44.4            55.6
95% conf. interval (%)    [40.1, 48.9]    [51.1, 59.9]

Table 1. Mean scores and 95% confidence intervals obtained by the two HTS synthesisers in the A-B forced-choice test.

From our subjective analysis of the synthetic speech, the “metallic” quality produced by the standard HTS was clearly reduced using the HTS-LF system for some utterances. However, some samples synthesised with the LF-model contained some distortion which might be more perceptually significant than the buzziness characteristic of the standard system. In our opinion, errors in the extraction of the glottal parameters by the HTS-LF system are a possible cause of degradation in speech quality. Also, rapid spectral variations due to the mismatch between the spectral envelope and the vocal tract spectrum at voicing transitions may not have been modelled by the HMMs correctly.

Finally, we also found that prosodic characteristics, e.g. accent position in words, are often better modelled using the HTS-LF system. Examples of the synthesised speech are available at http://homepages.inf.ed.ac.uk/jscabral/hts-lf-model.html.

7. CONCLUSIONS

In this work, the LF-model was incorporated into a standard HMM-based speech synthesiser by using the GSS method for analysis-synthesis and adapting the acoustic modelling part to train the glottal parameters.

The proposed HTS-LF system obtained higher preference than an HMM-based speech synthesiser which uses the STRAIGHT vocoder. A great advantage of the HTS-LF system is that it provides control over glottal parameters for voice quality transformations.

There is good scope for further development of the HTS-LF system. We have been improving the method to estimate the LF-model parameters and studying in detail the causes of speech distortion in this synthesiser.

8. REFERENCES

[1] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, “Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005,” IEICE Trans. Inform. and Systems, vol. E90-D, pp. 325-333, January 2007.
[2] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Mixed excitation for HMM-based speech synthesis,” in Proc. of EUROSPEECH, Aalborg, September 2001.
[3] R. Maia, T. Toda, H. Zen, Y. Nankaku, and K. Tokuda, “A trainable excitation model for HMM-based speech synthesis,” in Proc. of INTERSPEECH, Antwerp, August 2007.
[4] T. Drugman, G. Wilfart, and T. Dutoit, “A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis,” in Proc. of INTERSPEECH, Brighton, September 2009.
[5] T. Raitio, A. Suni, H. Pulakka, M. Vainio, and P. Alku, “HMM-based Finnish text-to-speech system utilizing glottal inverse filtering,” in Proc. of INTERSPEECH, Brisbane, 2008.
[6] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi, “An HMM-based speech synthesiser using Glottal-Post Filtering,” in Proc. of the 7th SSW, Japan, September 2010.
[7] G. Fant, J. Liljencrants, and Q. Lin, “A four-parameter model of glottal flow,” STL-QPSR, KTH, Stockholm, 1985.
[8] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi, “Glottal spectral separation for parametric speech synthesis,” in Proc. of INTERSPEECH, Brisbane, 2008.
[9] D. Talkin, “A robust algorithm for pitch tracking (RAPT),” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. Elsevier Science, 1995, pp. 495-518.
[10] J. Kominek and A. Black, “The CMU Arctic speech databases,” in Proc. of 5th SSW, Pittsburgh, June 2004.


     Edinburgh Research Explorer                                      HMM-based speech synthesiser using the LF-model of the glottalsourceCitation for published version:Cabral, J, Renals, S, Yamagishi, J & Richmond, K 2011, HMM-based speech synthesiser using the LF-model of the glottal source. in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE InternationalConference on. pp. 4704-4707, ICASSP 2011 - 2011 IEEE International Conference on Acoustics, Speechand Signal Processing (ICASSP), United Kingdom, 22/05/11. DOI: 10.1109/ICASSP.2011.5947405Digital Object Identifier (DOI):10.1109/ICASSP.2011.5947405Link:Link to publication record in Edinburgh Research ExplorerDocument Version:Peer reviewed versionPublished In:Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference onGeneral rightsCopyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s)and / or other copyright owners and it is a condition of accessing these publications that users recognise andabide by the legal requirements associated with these rights.Take down policyThe University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorercontent complies with UK legislation. If you believe that the public display of this file breaches copyright pleasecontact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately andinvestigate your claim.Download date: 05. Apr. 2019HMM-BASED SPEECH SYNTHESISER USING THE LF-MODELOF THE GLOTTAL SOURCEJoa˜o P. 
Cabral1,2 , Steve Renals2, Junichi Yamagishi2 and Korin Richmond21School of Computer Science and Informatics, University College Dublin, Ireland2The Centre for Speech Technology Research, University of Edinburgh,UKjoao.cabral@ucd.ie, s.renals@ed.ac.uk, jyamagis@inf.ed.ac.uk, korin@cstr.ed.ac.ukABSTRACTA major factor which causes a deterioration in speechquality in HMM-based speech synthesis is the use of a simpledelta pulse signal to generate the excitation of voiced speech.This paper sets out a new approach to using an acoustic glottalsource model in HMM-based synthesisers instead of the tradi-tional pulse signal. The goal is to improve speech quality andto better model and transform voice characteristics. We havefound the new method decreases buzziness and also improvesprosodic modelling. A perceptual evaluation has supportedthis finding by showing a 55.6% preference for the new sys-tem, as against the baseline. This improvement, while notbeing as significant as we had initially expected, does encour-age us to work on developing the proposed speech synthesiserfurther.Index Terms— HMM-based Speech Synthesis, LF-Model, Glottal Source Modelling1. INTRODUCTIONHMM-based speech synthesisers typically generate speechby shaping a spectrally flat excitation with the spectral enve-lope of speech, e.g. [1]. A simple excitation model consistsof using white noise for unvoiced speech and an impulsetrain for voiced speech. However, this model makes the syn-thetic speech sound buzzy and just allows to control the pitch(through the F0 parameter). A popular method to reduce thebuzziness is to mix the impulse train with a noise componentusing a multi-band mixed excitation model, e.g. [2]. Re-cently, other excitation models have been used in statisticalspeech synthesis which try to better approximate the voicedexcitation to the residual calculated using the inverse filteringtechnique, e.g. [3, 4]. These models can represent more de-tails of the source than the noise. 
However, they do not modelrelevant characteristics of the glottal source.Speech can also be generated by passing a glottal sourcemodel through a filter representing the vocal tract system.First author is currently supported by the Science Foundation Ireland(Grant 07/CE/I1142). This paper is based on his PhD work supported byMarie Curie Early Stage Training Site EdSST (MEST-CT-2005-020568)However, the methods to estimate the glottal source and thevocal tract are typically less robust than those to estimate thespectral envelope. Nevertheless, this type of speech modelhas been successfully used in HMM-based synthesis. For ex-ample, the synthesiser in [5] models the glottal source andthe vocal tract filter using LPC parameters. During synthe-sis, the excitation is obtained by transforming a real glottalpulse using the glottal parameters generated by the synthe-siser. However, this approach does not allow control overglottal parameters related to voice quality and does not modelthe correlation between F0 and the glottal parameters.In previous work [6], we used an acoustic glottal sourcemodel, the Liljencrants-Fant (LF) model [7], in the synthesispart of an HMM-based speech synthesiser. In this system, aselected LF-model signal was passed through a post-filter toobtain a spectrally flat excitation and then speech was gener-ated by shaping the excitation with the spectral envelope.In this work, we propose another HMM-based speechsynthesiser, which generates speech by passing the LF-modelsignal through the vocal tract filter. The LF-model parametersare trained in the system, which allows the natural variationsof the glottal parameters with F0 to be modelled. In addition,the LF-model parameters can be used to control relevant prop-erties of the glottal pulse shape that are correlated with voicequality, such as breathiness. The vocal tract filter is estimatedusing the Glottal Spectral Separation (GSS) method [8]. 
Inthis paper, we also propose an extension to the GSS synthesismethod which consists of mixing the LF-model with a noisecomponent in order to improve speech naturalness further.2. LILJENCRANTS-FANT MODELThe Liljencrants-Fant (LF) model [7] is an acoustic modelof the glottal source derivative. It can be represented by thefollowing equation:eLF (t) = (1)E0eαt sin(wgt), to ≤ t ≤ te− EeTa [e−(t−te) − e−(tc−te)], te < t ≤ tc0, tc < t ≤ T0where wg = pi/tp. The LF-model is defined by six shapeparameters: tc, tp, te, Ta, T0, and Ee. The remaining param-eters (E0,  and α) can be calculated using the energy andcontinuity constraints, which are given by∫ T00eLF (t)dt = 0and eLF (te) = eLF (t+e ) = −Ee, respectively.The LF-model is often represented by the first twobranches of (1) for simplification. In this case, the instantof complete closure, tc, is set to the period T0 in the secondbranch. In this work, this simplified LF-model was used.3. BASELINE HMM-BASED SPEECH SYNTHESISERThe baseline statistical speech synthesiser used in this workemployed the MATLAB version of the STRAIGHT vocoder(STRAIGHTV40). This system is an implementation of theNitech-HTS 2005 speech synthesiser [1].3.1. AnalysisThe STRAIGHT analysis method was used to calculate theFFT parameters of the spectral envelope of the short-timespeech signal (40 ms long) and aperiodicity parameters (FFTcoefficients) measured in the speech spectrum. These param-eters were transformed to more suitable features for statisti-cal modelling. The spectral envelope was converted to mel-cepstral coefficients, whereas the aperiodicity measurementswere averaged over five frequency bands: 0-1, 1-2, 2-4, 4-6,and 6-8 kHz. Meanwhile, F0 was estimated using the RAPTalgorithm [9].3.2. Acoustic ModellingThe statistical model was a five-state left-to-right hidden-semiMarkov model (HSMM). Both state output density functionand state duration were modelled using a single Gaussian dis-tribution. 
Each observation feature vector consisted of fivestreams: mel-cepstrum, aperiodicity, logF0, ∆ of logF0 and∆2 of logF0. The spectrum and aperiodicity parameters weremodelled by continuous HMMs, while the last three streamswere modelled by multi-space probability distribution HMMs(MSD-HMMs) because F0 is not defined in unvoiced regions.The spectrum and aperiodicity streams included the static anddynamic features.The context-dependent models were also clustered usingdifferent decision trees for the spectrum, F0 and duration pa-rameters, since the influence of the contextual factors variesfor each of these.3.3. SynthesisThe STRAIGHT vocoder was used to synthesise speech byconvolving a spectrally flat excitation with the spectral en-velope of speech (obtained from the mel-cepstrum). Forvoiced speech, the aperiodicity parameters were used to de-rive Wp(w) and Wa(w), which are the weighting functionsfor the spectra of the impulse train (phase manipulated) andwhite noise respectively. The resulting weighted signals werethen added together to obtain the mixed excitation.4. HMM-BASED SPEECH SYNTHESISERUSING LF-MODEL: HTS-LFThe baseline HMM-based speech synthesiser was modified inorder to incorporate the LF-model. This system using glottalsource modelling is called HTS-LF. The main differences be-tween the two systems are the multi-stream structure of thespeech parameter vector and the analysis-synthesis methods.4.1. AnalysisIn the HTS-LF system, the aperiodicity parameters were com-puted using the STRAIGHT method, whereas the LF-modelparameters and the vocal tract spectrum were estimated as inthe Glottal Spectral Separation method [8].4.1.1. LF-model ParametersThe LF-model parameters were estimated from the LinearPrediction (LP) residual, as described in [8]. The residualwas computed using the inverse filtering technique with pre-emphasis (α = 0.97). 
Then, the LF-model parameters werecalculated for each pitch cycle of the residual, which was de-limited by contiguous glottal epochs. The estimation methodconsisted of fitting the LF-model waveform to the residualusing a non-linear optimisation algorithm. The initial esti-mates of the iterative method were obtained by performingamplitude-based measurements on the residual.The trajectories of the LF-parameters calculated for an ut-terance are shown in Figure 1 (a). A strong correlation be-tween the glottal parameters and T0 can be observed (directproportion), with the exception of the parameter Ta. Shortsegments can also be found which show a different pattern ofvariation with T0 that is not linear. These may be explainedby prosody effects such as accented words and syllable stress4.1.2. Vocal Tract SpectrumThe speech signal was segmented at 5 ms frame rate into 40ms long frames. In voiced speech regions, the set of LF-model parameters values associated with each frame sj(t)was obtained by finding the closest epoch i to the center ofsj(t). These parameters were used to generate one periodof the LF-model signal, eiLF (t). Next, the speech spectrumSj(w) was divided by the amplitude spectrum of the LF-model signal,∣∣EiLF (w)∣∣, in order to remove the glottal sourcemodel effects. That is, V j(w) = Sj(w)/∣∣EiLF (w)∣∣. Finally,the STRAIGHT vocoder was used to calculate the spectral en-velope of the signal V j(w). For unvoiced speech, the spectralparameters were estimated by computing the spectral enve-lope of Sj(w) using STRAIGHT.The vocal tract spectrum obtained using the GSS methodis expected to be sufficiently smooth, assuming that the LF-model parameter trajectories are smooth enough and thatSTRAIGHT computes a smooth spectrum. This is consideredto be an important characteristic to obtain accurate modellingof the spectrum in the HMM-based speech synthesiser.4.2. Acoustic ModellingThe statistical modelling part of the HTS-LF system is simi-lar to the baseline system. 
However, the F0 parameter vector of the baseline was replaced by the LF-model parameter vector in HTS-LF. The dimension of the LF-model, Δ and Δ² streams was set to 5. These streams were modelled by MSD-HMMs using a Gaussian distribution with diagonal covariance matrix for the voiced space. The clustering decision trees for the LF-model parameter streams were built using the same question set and minimum description length criterion as used for clustering the F0 streams in the baseline system. We assumed that the contextual factors most relevant to the LF-parameters were similar to those relevant to F0, because these parameters are strongly correlated.

Figure 1 (b) shows that the parameter generation algorithm produces smoother trajectories than those obtained during speech analysis, mainly due to modelling by the HMMs. One advantage of this smoothing effect is attenuation of parameter discontinuities due to estimation errors in analysis.

4.3. Synthesis

In the GSS method proposed in previous work [8], voiced speech was generated by passing two cycles of the LF-model signal through the vocal tract filter. In this work, the multi-band mixed excitation of STRAIGHT was adapted in order to mix the noise component of the excitation with the LF-model signal. The advantage is to better model the noise component of the speech signal and improve speech naturalness.

Figure 2 shows the flowchart of the synthesis method used by the HTS-LF system. The LF-model signal has a decaying spectrum as it models the spectral tilt of the glottal source. In contrast, the noise spectrum is approximately flat. For this reason, these two signals cannot be mixed using the aperiodicity parameters as in STRAIGHT.
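The mismatch just described, and the noise shaping that the following paragraph introduces to resolve it (Eq. 2), can be illustrated with a small numerical sketch. Everything specific in it is an assumption made for illustration: `toy_pulse` is a hypothetical stand-in with a falling spectrum and is not the LF-model, the weighting curves are invented rather than derived from STRAIGHT aperiodicity parameters, and a single period is used so that $E(w)$ and $E_p(w)$ coincide.

```python
import numpy as np


def toy_pulse(n):
    """Hypothetical stand-in for one LF-model period: a decaying pulse whose
    spectrum falls with frequency. It is NOT the LF-model."""
    return np.exp(-np.arange(n) / 6.0)


def mix_excitation(e_p, w_p, w_a, rng):
    """Return x(t) and Kn for X(w) = E(w) Wp(w) + Kn N(w) |Ep(w)| Wa(w)."""
    n = len(e_p)
    E = np.fft.rfft(e_p)                     # here E(w) = Ep(w): a single period
    N = np.fft.rfft(rng.standard_normal(n))  # white noise of the same duration
    shaped = N * np.abs(E)                   # impose the pulse's spectral tilt
    shaped_t = np.fft.irfft(shaped, n=n)
    # Kn makes the shaped noise carry the same power as the periodic signal.
    k_n = np.sqrt(np.mean(e_p ** 2) / np.mean(shaped_t ** 2))
    X = E * w_p + k_n * shaped * w_a
    return np.fft.irfft(X, n=n), k_n


n = 160                            # 10 ms at 16 kHz (assumed values)
bins = n // 2 + 1
w_a = np.linspace(0.1, 0.9, bins)  # assumed: more noise towards high frequencies
w_p = np.sqrt(1.0 - w_a ** 2)      # assumed: Wp^2 + Wa^2 = 1 in every bin
x, k_n = mix_excitation(toy_pulse(n), w_p, w_a, np.random.default_rng(0))
```

Because the noise is shaped by the pulse's amplitude spectrum before weighting, both components share the same spectral tilt, so the weights only control the periodic-to-noise balance in each band.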
In order to overcome this problem, the white noise signal is shaped with the spectral envelope of the LF-model signal before the weighting operation. This shaping is performed in the frequency domain by multiplying the amplitude spectrum of one period of the LF-model signal, $|E_p(w)|$, by the amplitude spectrum of the noise signal, $N(w)$. The resulting noise signal has the same duration as the periodic LF-model signal, $E(w)$, and it is scaled in amplitude by the factor $K_n$ for the two signals to have the same power. The spectrum of the excitation can be represented by:

$$X(w) = E(w)\,W_p(w) + K_n\,N(w)\,|E_p(w)|\,W_a(w) \qquad (2)$$

[Fig. 1. Example of trajectories of the LF-model parameters. Axes: Time (s), Time (ms); curves: T0, te, tp, Ta. (a) Parameters estimated in the analysis. (b) Parameters generated by the HTS-LF system.]

[Fig. 2. Block diagram of the speech waveform generation technique used by the HTS-LF system.]

Speech is generated by passing the mixed excitation through the vocal tract filter. Finally, the speech frames are concatenated using overlap-and-add with asymmetric windows centered at the instants of maximum excitation. G(w) contains phase information from the LF-model signal which is expected to reduce the buzziness effect of the impulse train.

5. PERCEPTUAL EVALUATION

A forced-choice A-B test was conducted to evaluate the speech quality of the HTS-LF system when compared to the standard HMM-based synthesiser.

5.1. Stimuli

The US English BDL voice (male) was built from the CMU ARCTIC speech database [10] for the two systems.
The size of the BDL speech corpus is approximately one hour.

The stimuli consisted of 36 pairs of utterances: 18 randomly chosen utterances, each synthesised with the two systems and presented twice with the order of the samples switched.

5.2. Experiment

The evaluation was conducted via the web. Subjects were asked to listen to the pairs of stimuli and, for each pair, to select the version (A or B) that sounded better. They were able to listen to the files in any order, and as many times as they liked. We also instructed them to make a random choice if they could not decide on the version they preferred.

Students and staff from the University of Edinburgh were asked to perform the evaluation. Fourteen listeners participated in the test, of whom six were native speakers of English.

6. RESULTS

The results of the perceptual experiment are shown in Table 1. They are statistically significant with p ≤ 0.01. On average, the HTS-LF system obtained a higher rate of preference. Nevertheless, the results were expected to be even better, as the improvement in the quality of resynthesised natural speech (without modelling) when using the LF-model compared to the impulse train was significantly high in a previous evaluation [8].

From our subjective analysis of the synthetic speech, the "metallic" quality produced by the standard HTS was clearly reduced using the HTS-LF system for some utterances. However, some samples synthesised with the LF-model contained some distortion which might be more perceptually significant than the buzziness characteristic of the standard system. In our opinion, errors in the extraction of the glottal parameters by the HTS-LF system are a possible cause of degradation in speech quality. Also, rapid spectral variations due to the mismatch between the spectral envelope and the vocal tract spectrum at voicing transitions may not have been modelled by the HMMs correctly.

Finally, we found that prosodic characteristics, e.g. accent position in words, are often better modelled using the HTS-LF system. Examples of the synthesised speech are available at http://homepages.inf.ed.ac.uk/jscabral/hts-lf-model.html.

                          Baseline        HTS-LF
Mean preference (%)       44.4            55.6
95% Conf. Interv. (%)     [40.1, 48.9]    [51.1, 59.9]

Table 1. Mean scores and 95% confidence intervals obtained by the two HTS synthesisers in the A-B forced-choice test.

7. CONCLUSIONS

In this work, the LF-model was incorporated into a standard HMM-based speech synthesiser by using the GSS method for analysis-synthesis and adapting the acoustic modelling part to train the glottal parameters.

The proposed HTS-LF system obtained higher preference than an HMM-based speech synthesiser which uses the STRAIGHT vocoder. A great advantage of the HTS-LF system is that it provides control over glottal parameters for voice quality transformations.

There is good scope for further development of the HTS-LF system. We have been improving the method to estimate the LF-model parameters and studying in detail the causes of speech distortion in this synthesiser.

8. REFERENCES

[1] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005," IEICE Trans. Inform. and Systems, vol. E90-D, pp. 325–333, January 2007.
[2] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Mixed excitation for HMM-based speech synthesis," in Proc. of EUROSPEECH, Aalborg, September 2001.
[3] R. Maia, T. Toda, H. Zen, Y. Nankaku, and K. Tokuda, "A trainable excitation model for HMM-based speech synthesis," in Proc. of INTERSPEECH, Antwerp, August 2007.
[4] T. Drugman, G. Wilfart, and T. Dutoit, "A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis," in Proc. of INTERSPEECH, Brighton, September 2009.
[5] T. Raitio, A. Suni, H. Pulakka, M. Vainio, and P. Alku, "HMM-based Finnish text-to-speech system utilizing glottal inverse filtering," in Proc. of INTERSPEECH, Brisbane, 2008.
[6] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi, "An HMM-based speech synthesiser using Glottal-Post Filtering," in Proc. of the 7th SSW, Japan, September 2010.
[7] G. Fant, J. Liljencrants, and Q. Lin, "A four-parameter model of glottal flow," STL-QPSR, KTH, Stockholm, 1985.
[8] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi, "Glottal spectral separation for parametric speech synthesis," in Proc. of INTERSPEECH, Brisbane, 2008.
[9] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. Elsevier Science, 1995, pp. 495–518.
[10] J. Kominek and A. Black, "The CMU Arctic speech databases," in Proc. of 5th SSW, Pittsburgh, June 2004.

http://www.cstr.inf.ed.ac.uk/downloads/publications/2011/05947405.pdf
