Learning HMM State Sequences from Phonemes for Speech Synthesis 

Procedia Computer Science 96 (2016) 1589–1596
20th International Conference on Knowledge Based and Intelligent Information and Engineering Systems, KES2016, 5-7 September 2016, York, United Kingdom
Available online at www.sciencedirect.com
doi: 10.1016/j.procs.2016.08.206
1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of KES International.
Giorgio Biagetti, Paolo Crippa*, Laura Falaschetti, Simone Orcioni, Claudio Turchetti
DII – Department of Information Engineering, Università Politecnica delle Marche, via Brecce Bianche, 12, I-60131 Ancona, Italy
Abstract
This paper presents a technique for learning hidden Markov model (HMM) state sequences from phonemes which, combined with the modified discrete cosine transform (MDCT), is useful for speech synthesis. Mel-cepstral spectral parameters, adopted in conventional methods as features for HMM acoustic modeling, do not allow direct reconstruction of the speech waveform. In contrast to these approaches, we use an analysis/synthesis technique based on the MDCT that guarantees perfect reconstruction of the signal from the frame feature vectors and allows for a 50% overlap between frames without increasing the data rate. Experimental results show that the spectrograms obtained with the proposed technique closely match the original spectrograms, and the quality of the synthesized speech is evaluated using the well-known Itakura-Saito measure.
Keywords: Learning, HMM, Speech synthesis, EM estimation, MDCT, MFCC
1. Introduction
Hidden Markov model (HMM) statistical parametric speech synthesis has proven to be a particularly flexible and robust framework for generating synthetic speech with various speaking styles and emotional expressions [1,2]. Thanks to its ability to represent not only the phoneme sequences but also various contexts of the linguistic specification, HMM-based speech synthesis has recently been a major topic in speech research [3,4,5,6,7].

In conventional techniques based on the source-filter model assumption, phonetic and prosodic information is assumed to be conveyed primarily by the spectral envelope, the fundamental frequency (F0), and the duration of individual phones [8]. Although these efforts have produced good performance, the approach still has limitations. In particular, modeling F0 is difficult because F0 is discontinuous across voiced and unvoiced speech regions [9]. Moreover, the spectral envelope defines a non-invertible transform, so the speech signal cannot be perfectly reconstructed from the feature sequence [10,11].

This paper proposes a novel HMM statistical parametric speech synthesis approach based on learning HMM state sequences from phonemes and on the modified discrete cosine transform (MDCT), which guarantees perfect reconstruction of the speech signal from the feature sequence and overcomes the main shortcomings of the Mel-cepstral analysis/synthesis technique.

* Corresponding author. Tel.: +39-071-220-4541; fax: +39-071-220-4464. E-mail address: p.crippa@univpm.it
2. Speech vector sequence generation
2.1. MDCT feature vector
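Let us represent the sampled signal $S$ as a sequence of $T + 1$ blocks of $D$ samples:

$$S = [s_1^{\mathsf{T}}, s_2^{\mathsf{T}}, \ldots, s_{T+1}^{\mathsf{T}}]^{\mathsf{T}} \in \mathbb{R}^{(T+1)D \times 1}, \tag{1}$$

where

$$s_t \in \mathbb{R}^{D \times 1} \tag{2}$$

is the single block of length $D$.

In signal sampling with overlap, a sequence of frames

$$X = [x_1^{\mathsf{T}}, x_2^{\mathsf{T}}, \ldots, x_T^{\mathsf{T}}]^{\mathsf{T}} \in \mathbb{R}^{T(2D) \times 1}, \tag{3}$$

is obtained, where

$$x_t = \begin{pmatrix} x_t^L \\ x_t^R \end{pmatrix} = \begin{pmatrix} s_t \\ s_{t+1} \end{pmatrix} \in \mathbb{R}^{2D \times 1}, \quad t = 1, \ldots, T \tag{4}$$

is the single frame corresponding to a window of length $2D$.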
The sequences $S$ and $X$ and the overlap regions are depicted in Fig. 1. Consecutive frames $x_t$ and $x_{t+1}$ overlap by $D$ samples, so the following condition holds:

$$x_t^R = x_{t+1}^L. \tag{5}$$
Fig. 1. The sequences S, X, and the overlap regions between different blocks.
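The block and frame layout of Eqs. (1)-(5) can be illustrated with a minimal NumPy sketch. The helper names `split_blocks` and `make_frames` and the toy signal are illustrative assumptions, not part of the paper.

```python
import numpy as np

def split_blocks(signal, D):
    """Cut a 1-D signal into consecutive blocks s_1..s_{T+1} of D samples each."""
    n_blocks = len(signal) // D              # T + 1 (trailing samples are dropped)
    return signal[:n_blocks * D].reshape(n_blocks, D)

def make_frames(blocks):
    """Stack neighbouring blocks into frames x_t = (s_t, s_{t+1}) of 2D samples."""
    return np.hstack([blocks[:-1], blocks[1:]])   # shape (T, 2D)

D = 4
signal = np.arange(20, dtype=float)          # toy signal: 5 blocks, so T = 4
blocks = split_blocks(signal, D)
frames = make_frames(blocks)

# Eq. (5): the right half of frame t equals the left half of frame t+1.
assert np.array_equal(frames[:-1, D:], frames[1:, :D])
print(frames.shape)                          # (4, 8)
```

Each sample of an interior block appears in exactly two frames, which is the 50% overlap the text refers to.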
The model usually adopted for speech parametrization is the source-filter model, which leads to the extraction of parameters (features) such as linear predictive coding (LPC) coefficients, Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) coefficients, etc. Among these, MFCCs have proven the most successful owing to their robustness to the environment and their flexibility [12]. MFCC feature extraction corresponds to a transform $F$ such that

$$\hat{o}_t = F x_t \tag{6}$$

where the vector $\hat{o}_t$ is the so-called feature vector, belonging to an appropriate subspace.

The main problem in speech synthesis is that, given the vector $\hat{o}_t$ obtained from the transcription, the signal frame $x_t$ cannot be derived uniquely from (6) because the transform $F$ is not invertible. To address this problem we use an analysis/synthesis technique based on the MDCT, which ensures perfect reconstruction of the signal from the feature vectors and allows for a 50% overlap between blocks without increasing the data rate.
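Why a spectral-envelope feature cannot be inverted can be seen on a toy example. The sketch below is not an MFCC implementation: it assumes a deliberately simplified stand-in for $F$ (the magnitude spectrum of a frame, which is the first step of MFCC extraction) and shows two different frames that map to the same features.

```python
import numpy as np

def magnitude_features(frame):
    """Toy stand-in for F: keeps only the magnitude spectrum and discards the phase."""
    return np.abs(np.fft.rfft(frame))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
x_flipped = -x                               # a different frame ...

same_features = np.allclose(magnitude_features(x), magnitude_features(x_flipped))
different_frames = not np.allclose(x, x_flipped)
print(same_features, different_frames)       # ... with identical features
```

Since distinct frames share one feature vector, no inverse map can recover the frame from the features alone.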
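Denoting by $A = (A_1 \; A_2) \in \mathbb{R}^{D \times 2D}$ the matrix that represents the MDCT [13], and by $o_t$ the MDCT feature vector, we obtain

$$o_t = A x_t = A \begin{pmatrix} s_t \\ s_{t+1} \end{pmatrix} = (A_1 \; A_2) \begin{pmatrix} s_t \\ s_{t+1} \end{pmatrix} = A_1 s_t + A_2 s_{t+1} \tag{7}$$

where $A_1, A_2 \in \mathbb{R}^{D \times D}$. In matrix form we have

$$O = W S \tag{8}$$

with

$$W = \begin{pmatrix} A_1 & A_2 & 0 & \cdots & 0 \\ 0 & A_1 & A_2 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & A_1 & A_2 \end{pmatrix} \in \mathbb{R}^{TD \times (T+1)D} \tag{9}$$

and

$$O = [o_1^{\mathsf{T}}, o_2^{\mathsf{T}}, \ldots, o_T^{\mathsf{T}}]^{\mathsf{T}} \in \mathbb{R}^{TD \times 1} \tag{10}$$

is the MDCT feature vector corresponding to the signal $S$.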
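The structure of $W$ in Eq. (9) can be checked numerically. The sketch below assumes a sine-windowed MDCT for $A$; the paper does not state which window or normalization it uses, so `mdct_matrix` is a hypothetical choice made only for illustration.

```python
import numpy as np

def mdct_matrix(D):
    """Return A = (A1 A2) of shape (D, 2D): a sine-windowed MDCT (assumed form)."""
    n = np.arange(2 * D)
    k = np.arange(D)
    window = np.sin(np.pi * (n + 0.5) / (2 * D))
    basis = np.cos(np.pi / D * np.outer(k + 0.5, n + 0.5 + D / 2))
    return basis * window                    # window applied to each input sample

def build_W(A, T):
    """Place A on a block band, shifting by D columns per block row, as in Eq. (9)."""
    D = A.shape[0]
    W = np.zeros((T * D, (T + 1) * D))
    for t in range(T):
        W[t * D:(t + 1) * D, t * D:(t + 2) * D] = A
    return W

D, T = 4, 3
A = mdct_matrix(D)
A1, A2 = A[:, :D], A[:, D:]
rng = np.random.default_rng(0)
S = rng.standard_normal((T + 1) * D)         # toy signal of T + 1 blocks
O = build_W(A, T) @ S                        # Eq. (8)

# Same result frame by frame, Eq. (7): o_t = A1 s_t + A2 s_{t+1}
blocks = S.reshape(T + 1, D)
O_framewise = np.concatenate([A1 @ blocks[t] + A2 @ blocks[t + 1] for t in range(T)])
assert np.allclose(O, O_framewise)
```

Note that $O$ has $TD$ entries for $(T+1)D$ input samples, so the 50% overlap does not increase the number of coefficients per unit time.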
2.2. Learning HMM state sequences and maximum likelihood estimation
The speech synthesis algorithm we propose determines the sequence $X$ of the synthetic signal, given the sequence $O$ of features corresponding to the transcription (or sequence of phonemes) $H$ to be synthesized.

In HMM modeling we first need to derive the state sequence that generates the sequence $O$. To this end let

$$P(O, Q / \lambda) = \pi_{\theta_0} \prod_{t=1}^{T} a_{\theta_{t-1}\theta_t} \, b_{\theta_t}(o_t) \tag{11}$$

be the joint pdf of $O$ and $Q$, given the model $\lambda$, where

$$Q = \{\theta_1, \theta_2, \ldots, \theta_T\} = \{(q_1, i_1), (q_2, i_2), \ldots, (q_T, i_T)\}, \tag{12}$$

$\theta_t = (q_t, i_t)$ being the substate associated with the Gaussian mixture component $i_t$ of the state $q_t$ at time instant $t$, that is

$$b_{\theta_t}(o_t) = (2\pi)^{-D/2} \, |U_{\theta_t}|^{-1/2} \exp\left\{ -\frac{1}{2} (o_t - \mu_{\theta_t})^{\mathsf{T}} U_{\theta_t}^{-1} (o_t - \mu_{\theta_t}) \right\} \tag{13}$$

with $\mu_{\theta_t} \in \mathbb{R}^{D \times 1}$ and $U_{\theta_t} \in \mathbb{R}^{D \times D}$. Here $\pi_{\theta_0}$ is the initial-state probability and $a_{\theta_{t-1}\theta_t}$ is the state-transition probability.
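As a rough illustration of Eqs. (11) and (13), the joint log-probability of a feature sequence and a substate sequence can be evaluated as follows. The function names and the two-substate toy model are assumptions introduced here; substates are indexed by plain integers for simplicity.

```python
import numpy as np

def log_gaussian(o, mean, cov):
    """log b_theta(o) for a full-covariance Gaussian, Eq. (13)."""
    D = len(o)
    diff = o - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def log_joint(O, Q, theta0, log_pi, log_a, means, covs):
    """log P(O, Q / lambda) following Eq. (11); Q and theta0 are substate indices."""
    total = log_pi[theta0]
    prev = theta0
    for o_t, theta_t in zip(O, Q):
        total += log_a[prev, theta_t] + log_gaussian(o_t, means[theta_t], covs[theta_t])
        prev = theta_t
    return total

# Hypothetical toy model: two substates, D = 2, unit covariances.
log_pi = np.log(np.array([0.6, 0.4]))
log_a = np.log(np.array([[0.7, 0.3],
                         [0.2, 0.8]]))
means = np.array([[0.0, 0.0], [1.0, 1.0]])
covs = np.array([np.eye(2), np.eye(2)])

O = np.array([[0.0, 0.0], [1.0, 1.0]])       # two feature vectors
value = log_joint(O, [0, 1], 0, log_pi, log_a, means, covs)
```

Working in the log domain avoids the numerical underflow that the product in Eq. (11) would cause for long sequences.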
Since $H = \{h_1, h_2, \ldots\}$ is a sequence of phonemes, we restrict the mathematical formulation to a single phoneme $h$. Given the phoneme $h$, the sequences $O$ and $Q$ are chosen so that the joint pdf

$$P(O, Q / \lambda) = P(O / Q, \lambda) \, P(Q / \lambda), \tag{14}$$

which represents the likelihood of the set $\chi = \{O, Q\}$, is maximum. The sequence $Q$ is obtained during the learning phase as the one that maximizes $P(Q / \lambda)$. At the end of training, a set $\{Q_1, Q_2, \ldots\}$ of substate sequences corresponds to a given $h$; we therefore choose $Q$ as

$$Q = Q_{best} = \arg\max_{Q_i} P(Q_i / \lambda). \tag{15}$$
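The selection rule of Eq. (15) can be sketched as below. It assumes that $P(Q/\lambda)$ factorizes into the initial-state and transition probabilities of Eq. (11); the candidate sequences and model values are hypothetical.

```python
import numpy as np

def log_prob_sequence(Q, theta0, log_pi, log_a):
    """log P(Q / lambda) = log pi_{theta0} + sum_t log a_{theta_{t-1} theta_t}."""
    path = [theta0] + list(Q)
    return log_pi[theta0] + sum(log_a[i, j] for i, j in zip(path[:-1], path[1:]))

log_pi = np.log(np.array([0.6, 0.4]))
log_a = np.log(np.array([[0.7, 0.3],
                         [0.2, 0.8]]))

# Hypothetical substate sequences collected for one phoneme during training.
candidates = [[0, 0, 1], [0, 1, 1], [1, 1, 1]]
scores = [log_prob_sequence(Q, 0, log_pi, log_a) for Q in candidates]
Q_best = candidates[int(np.argmax(scores))]
print(Q_best)                                # [1, 1, 1]
```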
Having derived $Q$, the sequence $O$ is given by the maximum of the log-likelihood $\log P(O / Q, \lambda)$, which can be written as

$$L(O) = \log P(O / Q, \lambda) = \sum_{t=1}^{T} \log b_{\theta_t}(o_t). \tag{16}$$
After some manipulations we have

$$L(O) = -\frac{1}{2} O^{\mathsf{T}} U^{-1} O + O^{\mathsf{T}} U^{-1} M + k \tag{17}$$

where

$$U^{-1} = \mathrm{diag}\left[ U^{-1}_{q_1,i_1}, U^{-1}_{q_2,i_2}, \ldots, U^{-1}_{q_T,i_T} \right] \in \mathbb{R}^{TD \times TD}, \quad M = \left[ \mu^{\mathsf{T}}_{q_1,i_1}, \mu^{\mathsf{T}}_{q_2,i_2}, \ldots, \mu^{\mathsf{T}}_{q_T,i_T} \right]^{\mathsf{T}} \in \mathbb{R}^{TD \times 1} \tag{18}$$

and

$$k = k' + k'', \tag{19}$$

with

$$k' = \sum_{t=1}^{T} \log\left( (2\pi)^{-D/2} \, |U_{\theta_t}|^{-1/2} \right), \quad k'' = -\frac{1}{2} \sum_{t=1}^{T} \mu^{\mathsf{T}}_{q_t,i_t} U^{-1}_{q_t,i_t} \mu_{q_t,i_t}. \tag{20}$$

The sequence $O$ can be derived as the one that maximizes (17).

Fig. 2. Block diagram of the MDCT-based speech synthesis system. Blocks of the learning stage: speech database, text analysis, MDCT feature extraction, learning HMMs, sequences of substates {Q1, Q2, ...}. Blocks of the synthesis stage: phoneme h, best sequence Q (Qbest), best sequence O (O*), overlap-and-add, synthesized speech X.
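The manipulations leading from Eq. (16) to Eq. (17) can be spelled out as follows. This is a sketch that assumes each covariance matrix $U_{\theta_t}$ is symmetric and positive definite, and that collects every term that does not depend on $O$ into the constant $k$.

```latex
\begin{aligned}
L(O) &= \sum_{t=1}^{T} \log b_{\theta_t}(o_t) \\
     &= \sum_{t=1}^{T} \Big[ \log\big( (2\pi)^{-D/2} |U_{\theta_t}|^{-1/2} \big)
        - \tfrac{1}{2} (o_t - \mu_{\theta_t})^{\mathsf{T}} U_{\theta_t}^{-1} (o_t - \mu_{\theta_t}) \Big] \\
     &= -\tfrac{1}{2} \sum_{t=1}^{T} o_t^{\mathsf{T}} U_{\theta_t}^{-1} o_t
        + \sum_{t=1}^{T} o_t^{\mathsf{T}} U_{\theta_t}^{-1} \mu_{\theta_t}
        + \underbrace{\sum_{t=1}^{T} \Big[ \log\big( (2\pi)^{-D/2} |U_{\theta_t}|^{-1/2} \big)
        - \tfrac{1}{2} \mu_{\theta_t}^{\mathsf{T}} U_{\theta_t}^{-1} \mu_{\theta_t} \Big]}_{\text{independent of } O} \\
     &= -\tfrac{1}{2} O^{\mathsf{T}} U^{-1} O + O^{\mathsf{T}} U^{-1} M + k, \\
\nabla_O L(O) &= -U^{-1} O + U^{-1} M = 0 \;\Longrightarrow\; O = M .
\end{aligned}
```

Under these assumptions the unconstrained maximizer of (17) is the stacked mean vector $M$, and $L(O)$ decreases with the Mahalanobis distance of $O$ from $M$.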
Once the optimum sequence $Q_{best}$ of substates for a given phoneme $h$ has been obtained, there corresponds to it a set $\{O_1, O_2, \ldots\}$ of feature sequences and a set of likelihood values $\{L(O_1), L(O_2), \ldots\}$. In order to maximize the joint pdf (14), the sequence $O^* \in \{O_1, O_2, \ldots\}$ such that $L(O^*) = \max\{L(O_1), L(O_2), \ldots\}$ is chosen.
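A minimal sketch of this selection step evaluates the quadratic form of Eq. (17) for each candidate and keeps the best one. The candidate sequences, means, and covariances below are random toy values, and the constant $k$ is dropped because it does not affect the ranking.

```python
import numpy as np

def block_diag(mats):
    """Assemble square D x D blocks on the diagonal of one large matrix."""
    D = mats[0].shape[0]
    out = np.zeros((len(mats) * D, len(mats) * D))
    for t, m in enumerate(mats):
        out[t * D:(t + 1) * D, t * D:(t + 1) * D] = m
    return out

def log_likelihood(O, U_inv, M, k=0.0):
    """Quadratic form of Eq. (17); k is constant in O and can be omitted for ranking."""
    return -0.5 * O @ U_inv @ O + O @ U_inv @ M + k

T, D = 3, 2
rng = np.random.default_rng(1)
means = rng.standard_normal((T, D))                       # toy means along Q_best
covs = [np.diag(rng.uniform(0.5, 2.0, D)) for _ in range(T)]
U_inv = block_diag([np.linalg.inv(c) for c in covs])
M = means.reshape(-1)

# Hypothetical feature sequences O_1, O_2, ... associated with Q_best.
candidates = [M + 0.5 * rng.standard_normal(T * D) for _ in range(4)]
scores = [log_likelihood(O, U_inv, M) for O in candidates]
best_index = int(np.argmax(scores))
O_star = candidates[best_index]
```

No candidate can score higher than $M$ itself, which gives a simple sanity check on the implementation.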
Finally, once the optimum sequence of feature vectors O∗ is obtained, the sequence X of synthesized signal frames
is derived by the overlap-and-add synthesis process.
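The overlap-and-add step can be sketched as follows. The example assumes a sine-windowed MDCT with synthesis matrix $B = (2/D) A^{\mathsf{T}}$, a common choice that satisfies the Princen-Bradley condition; the paper does not specify its window or scaling, so `mdct_pair` and `overlap_add` are illustrative only.

```python
import numpy as np

def mdct_pair(D):
    """Analysis matrix A (D x 2D) and synthesis matrix B (2D x D), sine window assumed."""
    n = np.arange(2 * D)
    k = np.arange(D)
    window = np.sin(np.pi * (n + 0.5) / (2 * D))
    basis = np.cos(np.pi / D * np.outer(k + 0.5, n + 0.5 + D / 2))
    A = basis * window
    B = (2.0 / D) * A.T
    return A, B

def overlap_add(O, B):
    """Invert each feature vector and sum the overlapping halves; O has shape (T, D)."""
    T, D = O.shape
    out = np.zeros((T + 1) * D)
    for t in range(T):
        out[t * D:(t + 2) * D] += B @ O[t]
    return out

D, T = 8, 5
A, B = mdct_pair(D)
rng = np.random.default_rng(0)
S = rng.standard_normal((T + 1) * D)                      # toy signal
frames = np.stack([S[t * D:(t + 2) * D] for t in range(T)])
O = frames @ A.T                                          # o_t = A x_t for every frame
S_hat = overlap_add(O, B)

# Blocks covered by two frames are recovered exactly; the first and last
# blocks are covered by a single frame and are not.
assert np.allclose(S_hat[D:T * D], S[D:T * D])
```

The time-domain aliasing introduced by each single frame cancels only when neighbouring frames are summed, which is why the reconstruction holds for interior blocks.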
An overview of the speech synthesis algorithm is shown in Fig. 2. The block diagram shows the two fundamental steps of the proposed approach: the learning stage, which is performed off-line, and the synthesis stage, which is performed on-line. The first step extracts the MDCT features from the input database (audio and text sources) and derives, through HMM modeling, the substates for all input sequences of phonemes. The second step, given an input text and on the basis of the classified sequences of substates, determines the best sequence of states $Q_{best}$ and the corresponding best sequence of features $O^*$ for every input phoneme. Finally, the overlap-and-add synthesis process returns the synthesized speech for the input text.
Fig. 3. Spectrograms of the Italian vowels |a|, |e|, |i|, |o|, |u| for the: (a) original signal, (b) signal synthesized by our technique, and (c) signal synthesized by the diphone technique.
3. Experimental results
3.1. Acoustic model training
The first stage of the experiments carried out to validate the proposed synthesis approach was the training of the HMM acoustic model.

The training material was a 22-hour audio recording of a female speaker extracted from an Italian audiobook. The feature vector was derived by applying the MDCT to the 20 ms signal frame $x_t$ (i.e. $2D$ samples), 50% overlapped with the successive frame. With a sampling rate of 8 kHz, each frame contains $2D = 160$ samples, so the block length $D$ (which equals the overlap length) is 80 samples.
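The frame arithmetic above can be verified in a few lines. The variable names are illustrative; the numbers are those given in the text.

```python
fs_hz = 8000                                  # sampling rate
frame_ms = 20                                 # frame duration, i.e. 2D samples

frame_len = fs_hz * frame_ms // 1000          # 2D = 160 samples
D = frame_len // 2                            # block (hop) length = 80 samples
frames_per_second = fs_hz // D                # 100 frames per second
coeffs_per_second = frames_per_second * D     # D MDCT coefficients per frame

print(frame_len, D, coeffs_per_second)        # 160 80 8000
```

The number of coefficients per second equals the sampling rate, which is the sense in which the 50% overlap does not increase the data rate.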
Training was conducted with the Baum-Welch algorithm, which performs an EM estimation of the audio modeling parameters.

To determine the most probable state sequences, we used the same training material and the Baum-Welch algorithm for the audio/text alignment at the HMM state level. In this way, once the most probable state sequence for a given transcription is derived, the matrices in (18) can be computed.
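How the matrices in (18) might be assembled from an alignment can be sketched as below. This is a simplification, not the Baum-Welch procedure used in the paper: it assumes a hard frame-to-substate alignment is already available and estimates each substate's mean and covariance from the frames assigned to it. All names and data are hypothetical.

```python
import numpy as np

def substate_statistics(features, labels, n_substates):
    """Sample mean and covariance of the feature vectors aligned to each substate."""
    means, covs = [], []
    for s in range(n_substates):
        rows = features[labels == s]
        means.append(rows.mean(axis=0))
        covs.append(np.cov(rows, rowvar=False))
    return np.array(means), np.array(covs)

def stack_along_path(Q, means, covs):
    """Build M and U^{-1} of Eq. (18) for a substate sequence Q."""
    D = means.shape[1]
    T = len(Q)
    M = np.concatenate([means[q] for q in Q])
    U_inv = np.zeros((T * D, T * D))
    for t, q in enumerate(Q):
        U_inv[t * D:(t + 1) * D, t * D:(t + 1) * D] = np.linalg.inv(covs[q])
    return M, U_inv

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 3))      # toy feature vectors, D = 3
labels = np.arange(200) % 2                   # toy alignment to two substates
means, covs = substate_statistics(features, labels, 2)

Q = [0, 0, 1, 1]                              # hypothetical most probable sequence
M, U_inv = stack_along_path(Q, means, covs)
```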
3.2. Vowel synthesis
To validate the proposed speech synthesis technique, the above scheme was used to synthesize the five Italian vowels, once the best substate sequences were given.

For comparison, the same phonemes were synthesized using the "eSpeak" software [14] and the MBROLA (it-4) female voice recordings extracted from the ITC-irst database [15]. MBROLA is a diphone-based algorithm [16] for speech synthesis; a diphone consists of the second half of one phone plus the first half of the following one. The MBROLA project web page provides diphone databases for a large number of spoken languages. "eSpeak" is a compact open-source software speech synthesizer that can be used as a front-end to MBROLA diphone voices.

Figure 3 reports the spectrograms of the five Italian vowels |a|, |e|, |i|, |o|, |u|, computed with a 20 ms, 50% overlapped window. For each vowel, the first spectrogram shows the original audio signal, while the second and third spectrograms show the signal synthesized with our approach and with the diphone technique, respectively. The spectrograms obtained with the proposed technique closely match the original spectrograms, whereas the diphone technique gives spectrograms that differ noticeably from those expected.
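A spectrogram with a 20 ms, 50% overlapped window can be computed along the following lines. The Hann window and the pure-tone test signal are assumptions made for this sketch; the paper does not state which analysis window was used for its figures.

```python
import numpy as np

def spectrogram(signal, fs_hz, frame_ms=20, overlap=0.5):
    """Power spectrogram with the given frame duration and fractional overlap."""
    frame_len = int(fs_hz * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    window = np.hanning(frame_len)            # assumed analysis window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power.T                            # (frequency bins, time frames)

fs_hz = 8000
t = np.arange(fs_hz) / fs_hz                  # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)           # 1 kHz test tone
spec = spectrogram(tone, fs_hz)

peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * fs_hz / 160              # bin spacing is fs / frame length = 50 Hz
print(spec.shape, peak_hz)                    # (81, 99) 1000.0
```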
3.3. Word synthesis
To further validate the proposed technique several Italian words have been synthesized.
Figure 4 reports the spectrograms of the three Italian word topo (| t o p o |), casa (| k a z a |), Alice (| a l i Ùe |),
as achieved by a 20 ms, 50% overlapped window. The ﬁrst spectrogram in each ﬁgure depicts the behavior of the
original audio signal, while the second spectrogram is related to the signal synthesized with this approach. As you
can see, the spectrograms achieved with the suggested technique behave very closely to the original spectrograms.
In addition, Table 1 reports the Itakura-Saito measure (ISM)17,18 for the same three words topo (| t o p o |),
casa (| k a z a |), Alice (| a l i tʃ e |), and for additional words, including voce (| v o tʃ e |) and troppo (| t r o p p o |).
The measure is given both for the synthesized words and for a population of observations extracted from the original
database adopted for training, in each case with respect to the most likely (target) realization of the word. The ISM
values of the synthesized words fall inside the ranges obtained for the population of original words, which supports
the quality of the synthesized speech.
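As an illustration of the measure, the sketch below computes the common discrete spectral form of the Itakura-Saito divergence between two power spectra, d = mean(P/Q − ln(P/Q) − 1). This is an assumption: the text does not specify which variant or normalization of the ISM was used, and the function name `itakura_saito` and the `floor` guard are hypothetical.

```python
import math


def itakura_saito(reference, estimate, floor=1e-12):
    """Mean Itakura-Saito divergence between two power spectra of equal length.

    d = mean(P/Q - ln(P/Q) - 1), with P the reference and Q the estimate.
    `floor` is an assumed guard against zero-valued bins, not part of the definition.
    """
    if len(reference) != len(estimate):
        raise ValueError("spectra must have the same number of bins")
    total = 0.0
    for p, q in zip(reference, estimate):
        ratio = max(p, floor) / max(q, floor)
        total += ratio - math.log(ratio) - 1.0
    return total / len(reference)
```

The divergence is zero for identical spectra, non-negative otherwise, and not symmetric in its arguments, so the choice of which spectrum is the target matters when comparing values.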
4. Conclusion
This paper has derived a new HMM-based framework for speech synthesis. The framework combines an MDCT
representation, which guarantees a perfect reconstruction of the signal from the feature vectors, with a technique for
learning HMM state sequences from phonemes. The rigorous mathematical apparatus on which the technique is
founded has been reported, together with experimental results showing the validity of the approach.
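The perfect-reconstruction property of the MDCT with 50% overlapped frames can be checked numerically with a minimal sketch. It assumes a sine window for both analysis and synthesis (which satisfies the Princen-Bradley condition) and a 2/N scaling in the inverse transform. The paper's own window and normalization may differ, and the function names are hypothetical.

```python
import math


def sine_window(n_coeffs):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_coeffs)) for n in range(2 * n_coeffs)]


def mdct(frame, window):
    """Map a windowed frame of 2N samples to N MDCT coefficients."""
    n_coeffs = len(frame) // 2
    n0 = 0.5 + n_coeffs / 2.0
    return [sum(window[n] * frame[n] * math.cos(math.pi / n_coeffs * (n + n0) * (k + 0.5))
                for n in range(2 * n_coeffs))
            for k in range(n_coeffs)]


def imdct(coeffs, window):
    """Map N coefficients back to 2N windowed, time-aliased samples."""
    n_coeffs = len(coeffs)
    n0 = 0.5 + n_coeffs / 2.0
    return [window[n] * (2.0 / n_coeffs)
            * sum(coeffs[k] * math.cos(math.pi / n_coeffs * (n + n0) * (k + 0.5))
                  for k in range(n_coeffs))
            for n in range(2 * n_coeffs)]


def analysis_synthesis(signal, n_coeffs):
    """MDCT analysis followed by IMDCT overlap-add with hop size N (50% overlap)."""
    window = sine_window(n_coeffs)
    tail = (-len(signal)) % n_coeffs
    # Zero-pad by N on each side so every original sample is covered by two frames.
    padded = [0.0] * n_coeffs + list(signal) + [0.0] * (tail + n_coeffs)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_coeffs + 1, n_coeffs):
        block = imdct(mdct(padded[start:start + 2 * n_coeffs], window), window)
        for n, value in enumerate(block):
            out[start + n] += value  # aliasing terms cancel between adjacent frames
    return out[n_coeffs:n_coeffs + len(signal)]
```

Each frame of 2N samples yields only N coefficients while the frame advances by N samples, so the overlap does not increase the number of values stored per unit of time. The aliasing introduced in each block is cancelled when adjacent blocks are added.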
Fig. 4. Spectrograms of the Italian words topo (| t o p o |), casa (| k a z a |), Alice (| a l i tʃ e |) for: (a) the original signal, (b) the signal
synthesized by our technique.
Table 1. Itakura-Saito measure for a population of observations and the synthesized Italian words.

| Word | Original words (min) | Original words (max) | Synthesized word |
|---|---|---|---|
| t o p o | 4.2371 | 18.1802 | 7.1187 |
| k a z a | 4.5560 | 31.8218 | 14.7179 |
| a l i tʃ e | 8.8605 | 28.5589 | 11.1542 |
| v o tʃ e | 10.2970 | 27.6711 | 20.8455 |
| t r o p p o | 1.2301 | 23.8528 | 5.2783 |
| t E r r a | 7.5787 | 21.3814 | 16.5928 |
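The claim that each synthesized word falls inside the range of the original population can be checked directly from the values in Table 1. The short sketch below does only that arithmetic. The dictionary layout and the helper name `relative_position` are illustrative choices, and the keys are the phonemic transcriptions as listed in the table.

```python
# (min, max) of the ISM over the original words, then the synthesized-word ISM,
# copied from Table 1.
TABLE_1 = {
    "t o p o": (4.2371, 18.1802, 7.1187),
    "k a z a": (4.5560, 31.8218, 14.7179),
    "a l i tʃ e": (8.8605, 28.5589, 11.1542),
    "v o tʃ e": (10.2970, 27.6711, 20.8455),
    "t r o p p o": (1.2301, 23.8528, 5.2783),
    "t E r r a": (7.5787, 21.3814, 16.5928),
}


def relative_position(low, high, value):
    """Position of `value` inside [low, high]: 0 at the minimum, 1 at the maximum."""
    return (value - low) / (high - low)


if __name__ == "__main__":
    for word, (low, high, synthesized) in TABLE_1.items():
        inside = low <= synthesized <= high
        print(f"{word:12s} inside range: {inside}  "
              f"position: {relative_position(low, high, synthesized):.2f}")
```

A position close to 0 means the synthesized word is about as close to the target realization as the best original observation, while a position close to 1 places it near the worst one.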
References
1. Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., Oura, K. Speech synthesis based on hidden Markov models. Proceedings of the IEEE 2013;101(5):1234–1252.
2. Donovan, R.E., Woodland, P.C. A hidden Markov-model-based trainable speech synthesizer. Computer Speech & Language 1999;13(3):223–241.
3. Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T. Speaker interpolation in HMM-based speech synthesis system. In: EUROSPEECH. 1997.
4. Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kitamura, T. Speech parameter generation algorithms for HMM-based speech synthesis. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP’00); vol. 3. 2000, p. 1315–1318.
5. Toda, T., Tokuda, K. Speech parameter generation algorithm considering global variance for HMM-based speech synthesis. In: 9th European Conf. Speech Communication and Technology. 2005, p. 2801–2804.
6. Yoshimura, T. Simultaneous Modeling of Phonetic and Prosodic Parameters, and Characteristic Conversion for HMM-Based Text-to-Speech Systems. Ph.D. thesis; Nagoya Institute of Technology; 2002.
7. Yamagishi, J., Nose, T., Zen, H., Ling, Z.H., Toda, T., Tokuda, K., et al. Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Trans Audio, Speech, and Language Processing 2009;17(6):1208–1230.
8. Raitio, T., Suni, A., Yamagishi, J., Pulakka, H., Nurminen, J., Vainio, M., et al. HMM-based speech synthesis utilizing glottal inverse filtering. IEEE Trans Audio, Speech, and Language Processing 2011;19(1):153–165.
9. Yu, K., Young, S. Continuous F0 modeling for HMM based statistical parametric speech synthesis. IEEE Trans Audio, Speech, and Language Processing 2011;19(5):1071–1079.
10. Ling, Z.H., Deng, L., Yu, D. Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Trans Audio, Speech, and Language Processing 2013;21(10):2129–2139.
11. Cabral, J.P., Richmond, K., Yamagishi, J., Renals, S. Glottal spectral separation for speech synthesis. IEEE Journal of Selected Topics in Signal Processing 2014;8(2):195–208.
12. Dobrowolski, A.P., Majda, E. Cepstral analysis in the speakers recognition systems. In: Proc. Signal Processing Algorithms, Architectures, Arrangements, and Applications Conference (SPA). 2011, p. 1–6.
13. Bosi, M., Goldberg, R.E. Introduction to digital audio coding and standards. Springer; 2003.
14. eSpeak text to speech. 2007. http://espeak.sourgeforge.net.
15. The MBROLA project. 2006. http://tcts.fpms.ac.be/synthesis/mbrola.
16. Moulines, E., Charpentier, F. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 1990;9(5–6):453–467.
17. Itakura, F., Saito, S. Analysis synthesis telephony based on the maximum likelihood method. In: Proceedings of the 6th International Congress on Acoustics; vol. 17. 1968, p. C17–C20.
18. Chen, G., Koh, S.N., Soon, I.Y. Enhanced Itakura measure incorporating masking properties of human auditory system. Signal Processing 2003;83(7):1445–1456.

