King, Simon

Oura, Keiichiro

Tokuda, Keiichi

Wester, Mirjam

Yamagishi, Junichi

English

Edinburgh Research Archive

UNSUPERVISED CROSS-LINGUAL SPEAKER ADAPTATION FOR HMM-BASED SPEECH SYNTHESIS

Keiichiro Oura, Keiichi Tokuda
Department of Computer Science and Engineering
Nagoya Institute of Technology, Japan
uratec@sp.nitech.ac.jp

Junichi Yamagishi, Simon King, Mirjam Wester
The Centre for Speech Technology Research
University of Edinburgh, UK
jyamagis@inf.ed.ac.uk

ABSTRACT

In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user’s spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user’s voice. We integrate two techniques, unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer and cross-lingual speaker adaptation for HMM-based TTS, into a single architecture. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Listening tests show very promising results, demonstrating that adapted voices sound similar to the target speaker and that differences between supervised and unsupervised cross-lingual speaker adaptation are small.

Index Terms: HMM-based speech synthesis, unsupervised cross-lingual speaker adaptation

1. INTRODUCTION

The goal of Speech-to-Speech Translation (S2ST) research is to “enable real-time, interpersonal communication via natural spoken language for people who do not share a common language” [1] and many large-scale projects (Verbmobil, Babylon, TC/LC-STAR, EU-Trans, ATR, etc.) have focused on this topic.
In our EU FP7 project EMIME [2], we are developing a mobile device that performs personalized S2ST, such that a user’s spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user’s voice.

In contrast to previous ‘pipeline’ S2ST systems that combined isolated automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) systems, or systems that coupled ASR with MT [3, 4], EMIME places the main emphasis on coupling ASR with TTS, specifically to enable cross-lingual speaker adaptation for HMM-based ASR and TTS [5, 6]. The principal modeling framework of speaker-adaptive HMM-based speech synthesis [6] is conceptually similar to conventional ASR systems (although without discriminative training) and it is therefore possible to share Gaussians, decision trees or linear transforms between the two [7].

In the EMIME project, we have conducted extensive experiments exploring the possibilities for combining ASR and TTS models. We have also developed unsupervised adaptation techniques for HMM-based TTS using either a phoneme recognizer [8] or a word-based large-vocabulary continuous speech recognizer (LVCSR) [9], and cross-lingual adaptation techniques for HMM-based TTS [10].

In this paper, we integrate these developments into a single architecture which achieves unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. We demonstrate an initial S2ST system built for four languages: American English, Mandarin, Japanese, and Finnish. Although all language pairs and directions are possible in our framework, only the English-to-Japanese adaptation was evaluated in the perceptual experiments presented here; these experiments focus on measuring the similarity between the output Japanese synthetic speech and the speech of the original English speaker.
The following sections give an overview of the system built, the unsupervised cross-lingual speaker adaptation method and the TTS evaluation results.

2. OVERVIEW OF THE S2ST SYSTEM USING HMM-BASED ASR AND TTS

All acoustic models, for both ASR and TTS, are trained on large conventional speech databases, comprising speech from hundreds of speakers, which were originally intended for ASR: WSJ0/1 (for English), Speecon Mandarin, JNAS (Japanese), and Speecon Finnish databases. Details of the front-end text processing used to derive phonetic-prosodic labels from the word transcriptions can be found in [11].

For each language, state-tied context-dependent speaker-independent HMMs (or multi-space distribution hidden semi-Markov models, MSD-HSMMs) are trained using speaker-adaptive training (SAT) [12]. For the state tying, minimum description length (MDL) automatic decision tree clustering is used [5]. The acoustic features for ASR are either the same as those for TTS or more typical ASR features such as MFCCs or PLPs. TTS acoustic features comprise the spectral and excitation features required for the STRAIGHT mel-cepstral vocoder with mixed excitation [6].

For unsupervised cross-lingual speaker adaptation and decoding, a multi-pass framework is used. In the first pass, initial transcriptions are obtained from speaker-independent (SI) HMMs, and then CSMAPLR adaptation [13] is applied to SAT-HMMs (ASR) using these obtained transcriptions. In the second pass, using these adapted models, the transcriptions are refined. In the final pass, CSMAPLR transforms are estimated for SAT-HSMMs (TTS) with the refined transcriptions. These transforms can then be applied to the SAT-HSMMs for the output language, by employing a state-level mapping that has been constructed based on the Kullback-Leibler divergence (KLD) between pairs of states from the input and output TTS HMMs [10].
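As an illustration of the KLD-based state-level mapping just mentioned (and detailed in Section 3), the following is a minimal sketch rather than the implementation used in the paper. It assumes single-Gaussian states with diagonal covariance, which is a simplification of the general covariance form; the function names, the toy state parameters, and the transform labels are all hypothetical.

```python
import math


def kld_diag(mean_p, var_p, mean_q, var_q):
    """D_KL(p || q) for two Gaussians with diagonal covariance (assumed here)."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        # Per-dimension terms: log-determinant ratio, -1 (the -D/2 term),
        # trace term, and the Mahalanobis distance between the means.
        total += math.log(vq / vp) - 1.0 + vp / vq + (mq - mp) ** 2 / vq
    return 0.5 * total


def symmetric_kld(state_a, state_b):
    """Symmetrized KLD: D_KL(a || b) + D_KL(b || a)."""
    (mean_a, var_a), (mean_b, var_b) = state_a, state_b
    return (kld_diag(mean_a, var_a, mean_b, var_b)
            + kld_diag(mean_b, var_b, mean_a, var_a))


def learn_state_mapping(input_states, output_states):
    """For each output-language state j, find the input-language state index
    with the minimum symmetrized KLD."""
    mapping = []
    for out_state in output_states:
        best_i = min(range(len(input_states)),
                     key=lambda i: symmetric_kld(out_state, input_states[i]))
        mapping.append(best_i)
    return mapping


# Toy example: each state is (mean vector, variance vector); values are made up.
input_states = [([0.0, 0.0], [1.0, 1.0]),
                ([5.0, 5.0], [1.0, 2.0])]
output_states = [([4.5, 5.2], [1.0, 1.5]),
                 ([0.3, -0.2], [0.8, 1.1]),
                 ([5.5, 4.0], [2.0, 2.0])]

mapping = learn_state_mapping(input_states, output_states)

# Hypothetical labels standing in for transforms estimated on the input language.
input_transforms = ["W_0", "W_1"]
transforms_for_output = [input_transforms[i] for i in mapping]
print(mapping, transforms_for_output)
```

In this toy setting, each output-language state simply reuses the transform of its nearest input-language state. A real system would search over all tied states of both models and would apply actual linear transforms rather than labels.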
The ASR language models used for English, Mandarin and Japanese each contain about 20k bi-grams; the language model for Finnish is a word 10-gram plus a morph bi-gram [14]. For MT we simply used Google’s AJAX language API (http://code.google.com/intl/ja/apis/ajaxlanguage/). In future work, this will be replaced by our own MT system based on one being developed for the AGILE project (http://svr-www.eng.cam.ac.uk/research/projects/AGILE/). In the TTS module, acoustic features are generated from the adapted HSMMs in the output language [6] and an MLSA filter is used to generate the speech waveform.

3. UNSUPERVISED CROSS-LINGUAL ADAPTATION BASED ON A STATE-LEVEL MAPPING LEARNED USING MINIMUM KLD

A cross-lingual adaptation method based on a state-level mapping, learned using the KLD between pairs of states, was proposed by Wu et al. [10] and is summarized here. We call this approach “state-level transform mapping”.

3.1. Learning the mapping between states

For each state $j \in [1, J]$ in the output language HMM $\lambda_{\text{output}}$, we search for the state $\hat{i}$ in the input language HMM $\lambda_{\text{input}}$ with the minimum symmetrized KLD to state $j$ in $\lambda_{\text{output}}$:

$$\hat{i} = \operatorname*{argmin}_{1 \le i \le I} D_{\mathrm{KL}}(j, i), \tag{1}$$

where $\lambda_{\text{output}}$ has $J$ states and $D_{\mathrm{KL}}(j, i)$ represents the KLD between state $i$ in $\lambda_{\text{input}}$ and state $j$ in $\lambda_{\text{output}}$ (Fig. 1). $D_{\mathrm{KL}}(j, i)$ is calculated as [15]:

$$D_{\mathrm{KL}}(j, i) \approx D_{\mathrm{KL}}(j \,\|\, i) + D_{\mathrm{KL}}(i \,\|\, j), \tag{2}$$

$$D_{\mathrm{KL}}(i \,\|\, j) = \frac{1}{2} \ln\!\left(\frac{|\Sigma_j|}{|\Sigma_i|}\right) - \frac{D}{2} + \frac{1}{2} \operatorname{tr}\!\left(\Sigma_j^{-1} \Sigma_i\right) + \frac{1}{2} (\mu_j - \mu_i)^{\top} \Sigma_j^{-1} (\mu_j - \mu_i), \tag{3}$$

where $\mu_i$ and $\Sigma_i$ represent the mean vector and covariance matrix of the Gaussian pdf associated with state $i$, and $D$ is the dimensionality of the feature vector.

Fig. 1. The state mapping is learned by searching for pairs of states that have minimum KLD between input and output language HMMs. Linear transforms estimated with respect to the input language HMMs are applied to the output language HMMs, using the mapping to determine which transform to apply to which state in the output language HMMs.

3.2. Estimating the transforms for the input language HMM

Next, we estimate a set of state-dependent linear transforms $\hat{\Lambda}$ for the input language HMM $\lambda_{\text{input}}$ in the usual way:

$$\hat{\Lambda} = (\hat{W}_1, \cdots, \hat{W}_I) = \operatorname*{argmax}_{\Lambda} P(O \mid \lambda_{\text{input}}, \Lambda)\, P(\Lambda), \tag{4}$$

where $W_i$ represents a linear transform for state $i$, $I$ is the number of states in $\lambda_{\text{input}}$, and $O$ represents the adaptation data. $P(\Lambda)$ represents the prior distribution of the linear transforms, which is a uniform distribution for MLLR and CMLLR and a matrix variate normal distribution for SMAPLR and CSMAPLR [13]. Note that the linear transforms will usually be tied (shared) between groups of states known as regression classes, to avoid over-fitting and to enable adaptation of all states, including those with no adaptation data.

3.3. Applying the transforms to the output language HMM

Finally, these transforms are mapped to the output language HMM. The Gaussian pdf in state $j$ of $\lambda_{\text{output}}$ is transformed using the linear transform for state $\hat{i}$, which is transform $\hat{W}_{\hat{i}}$. By transforming all Gaussian pdfs in $\lambda_{\text{output}}$ in this way, cross-lingual speaker adaptation is achieved.

3.4. Unsupervised cross-lingual adaptation

We can extend this method to unsupervised adaptation simply by automatically transcribing the input data using ASR-HMMs. For supervised adaptation, $\lambda_{\text{input}}$ and $\lambda_{\text{output}}$ are both TTS-HMMs (for the input and output languages, respectively). For unsupervised adaptation of HMM-based speech synthesis, $\lambda_{\text{input}}$ may be either a TTS-HMM, or an ASR-HMM that utilizes the same acoustic features as TTS. No other constraints need to be placed on the ASR-HMM. In particular, it does not need to use prosodic-context-dependent quinphones (which would be necessary for TTS models).

4. EXPERIMENTS

4.1. Experimental conditions

We performed experiments on unsupervised English-to-Japanese speaker adaptation for HMM-based speech synthesis.
An English speaker-independent model for ASR and an average voice model for TTS were trained on the pre-defined training set “SI-84” comprising 7.2k sentences uttered by 84 speakers included in the “short term” subset of the WSJ0 database (15 hours of speech). A Japanese average voice model for TTS was trained on 10k sentences uttered by 86 speakers from the JNAS database (19 hours of speech). One male and one female American English speaker, not included in the training set, were chosen from the “long term” subset of the WSJ0 database as target speakers. The adaptation data comprised 5, 50, or 2000 sentences selected arbitrarily from the 2.3k sentences available for each of the target speakers.

Speech signals were sampled at a rate of 16 kHz and windowed by a 25 ms Hamming window with a 10 ms shift for ASR and by an F0-adaptive Gaussian window with a 5 ms shift for TTS. ASR feature vectors consisted of 39 dimensions: 13 PLP features and their dynamic and acceleration coefficients. TTS feature vectors comprised 138 dimensions: 39-dimension STRAIGHT mel-cepstral coefficients (plus the zeroth coefficient), log F0, 5 band-filtered aperiodicity measures, and their dynamic and acceleration coefficients, i.e. (40 + 1 + 5) × 3 = 138. We used 3-state left-to-right triphone HMMs for ASR and 5-state left-to-right context-dependent multi-stream MSD-HSMMs for TTS. Each state had 16 Gaussian mixture components for ASR and a single Gaussian for TTS. For speaker adaptation, the linear transforms $W_i$ had a tri-block diagonal structure, corresponding to the static, dynamic, and acceleration coefficients. Since automatically transcribed labels for unsupervised adaptation contain errors, we adjusted a hyperparameter ($\tau_b$ in [13]) of CSMAPLR to a higher-than-usual value of 10000 in order to place more importance on the prior (which is a global transform that is less sensitive to transcription errors).

4.2. Listening tests

Synthetic stimuli were generated from 7 models: the average voice model and supervised or unsupervised adapted models each with 5, 50, or 2000 sentences of adaptation data. Ten native Japanese listeners participated in the listening test. Each listener was presented with 12 pairs of synthetic Japanese speech samples in random order: the first sample in each pair was a reference original utterance from the database and the second was a synthetic speech utterance generated from one of the 7 models. For each pair, listeners were asked to give an opinion score for the second sample relative to the first (DMOS), expressing how similar the speaker identity was. Since there were no Japanese speech data available for the target English speakers, the reference utterances were English. The text for the 12 sentences in the listening test comprised 6 written Japanese news sentences randomly chosen from the Mainichi corpus and 6 spoken English news sentences from the English adaptation data that had been recognized using ASR then translated into Japanese text using MT.

Figure 2 shows the average DMOS and their 95% confidence intervals. First of all, we can see that the adapted voices are judged to sound more similar to the target speaker than the average voice. Next, we can see that the differences between supervised and unsupervised adaptation are very small. This is a very pleasing result. However, the effect of the amount of adaptation data is also small, contrary to our expectations. This requires further investigation in future work.

Figure 3 shows the average scores using Japanese news texts from the corpus and English news texts recognized by ASR and translated by MT. It appears that the speaker similarity scores are affected by the text of the sentences. Interestingly, the gap becomes larger as the number of adaptation sentences increases; this also deserves further investigation in future work.

[Figure 2: DMOS (axis range 1.5 to 3.5) versus number of adaptation sentences (0, 5, 50, 2000) for no adaptation, supervised adaptation, and unsupervised adaptation, with 95% confidence intervals.]

Fig. 2. Experimental results: comparison of supervised and unsupervised speaker adaptation. “0 sentences” means the unadapted average voice model for the output language.

[Figure 3: DMOS (axis range 1.5 to 3.5) versus number of adaptation sentences (0, 5, 50, 2000) for news texts and translated texts, with 95% confidence intervals.]

Fig. 3. Experimental results: comparison of Japanese news texts chosen from the corpus and English news texts which were recognized by ASR then translated into Japanese by MT. “0 sentences” means the unadapted average voice model for the output language.

5. CONCLUSIONS

In this paper, we described the integration of several techniques we have developed for model adaptation into a single architecture which achieves unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. The listening tests show very promising results: it has been demonstrated that the adapted voices sound more similar to the target speaker than the average voice and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. It appears that the speaker similarity scores are affected by the text of the sentences, which needs further investigation.

Although all language pairs and directions are possible in our system, only English-to-Japanese adaptation has been evaluated in the perceptual experiments presented here. Evaluation of other language pairs and directions is ongoing. Other future work includes unsupervised cross-lingual speaker adaptation using linear transforms estimated directly with ASR-HMMs, which must then use the same acoustic features as the TTS-HSMMs.

6. ACKNOWLEDGEMENTS

The authors thank Ms. Kaori Yutani and Ms. Xiang-Lin Peng of the Nagoya Institute of Technology for their help with the experiments reported in this paper.

The research leading to these results was partly funded from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 213845 (the EMIME project), and the Strategic Information and Communications R&D Promotion Programme (SCOPE), Ministry of Internal Affairs and Communication, Japan. SK holds an EPSRC Advanced Research Fellowship.

7. REFERENCES

[1] F. H. Liu, L. Gu, Y. Gao, and M. Picheny, “Use of statistical N-gram models in natural language generation for machine translation,” Proc. ICASSP 2003, pp. 636–639, 2003.

[2] Effective Multilingual Interaction in Mobile Environments (The FP7 EMIME Project), http://www.emime.org

[3] Y. Gao, “Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation,” Proc. EUROSPEECH 2003, pp. 365–368, 2003.

[4] H. Ney, “Speech translation: coupling of recognition and translation,” Proc. ICASSP-99, pp. 517–520, 1999.

[5] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,” Proc. Eurospeech, pp. 2347–2350, 1999.

[6] J. Yamagishi, T. Nose, H. Zen, L. Zhen-Hua, T. Toda, K. Tokuda, S. King, and S. Renals, “A robust speaker-adaptive HMM-based text-to-speech synthesis,” IEEE Trans. Speech, Audio & Language Process., 17(6), pp. 1208–1230, 2009.

[7] J. Dines, J. Yamagishi, and S. King, “Measuring the gap between HMM-based ASR and TTS,” Proc. Interspeech 2009, pp. 1391–1394, 2009.

[8] S. King, K. Tokuda, H. Zen, and J. Yamagishi, “Unsupervised adaptation for HMM-based speech synthesis,” Proc. Interspeech 2008, pp. 1869–1872, 2008.

[9] M. Gibson, “Two-pass decision tree construction for unsupervised adaptation of HMM-based synthesis models,” Proc. Interspeech 2009, pp. 1791–1794, 2009.

[10] Y. J. Wu and K. Tokuda, “State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis,” Proc. Interspeech 2009, pp. 528–531, 2009.

[11] J. Yamagishi et al., “Thousands of voices for HMM-based speech synthesis,” Proc. Interspeech 2009, pp. 420–423, 2009.

[12] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech & Language, 12(2), pp. 75–98, 1998.

[13] J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai, “Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm,” IEEE Trans. Speech, Audio & Language Process., 17(1), pp. 66–83, 2009.

[14] T. Hirsimäki, J. Pylkkonen, and M. Kurimo, “Importance of high-order N-gram models in morph-based speech recognition,” IEEE Trans. Speech, Audio & Language Process., 17(4), pp. 724–732, 2009.

[15] Y. Qian, H. Lang, and F. K. Soong, “A cross-language state sharing and mapping approach to bilingual (Mandarin – English) TTS,” IEEE Trans. Speech, Audio & Language Process., 17(6), pp. 1231–1239, 2009.

Unsupervised Cross-lingual Speaker Adaptation for HMM-based Speech Synthesis

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.185.1769
