Croatian speech synthesis based on unit selection and stochastic models

Abstract

Speech is a natural mode of communication for humans. Speech technologies such as speech synthesis, automatic speech recognition and automatic dialogue management enable spoken communication with machines and with various devices such as smartphones and televisions. In many situations a speech interface can be more suitable for operating such devices than a keyboard and screen, for example while driving, when the user's hands and eyes must remain free. For the use of these devices to be as natural and as undemanding as possible, ever better performance is expected of speech technologies, and their development is therefore becoming increasingly important. This work focuses on the development of a Croatian speech synthesis system that automatically converts arbitrary text into speech. The system was built using unit selection and statistical parametric synthesis, and a hybrid architecture that combines the two methods is proposed. Speech produced by statistical parametric synthesis sounds intelligible and usually has uniform quality, but greater naturalness can be achieved with unit selection. However, in unit selection synthesis even a small number of units that join poorly with the others in the chain can considerably impair the perceived quality. The proposed hybrid method therefore uses stochastic models of F0 to reject sequences containing units whose probability under the model is too low. A subjective evaluation of the quality, intelligibility, naturalness and occurrence of speech irregularities of the developed synthesis systems was carried out. For texts within the domain of the training corpus, clustering-based unit selection synthesis was rated best, while for out-of-domain texts the hybrid system was rated best. For automatic objective evaluation of the intelligibility of synthetic speech, a measure based on the results of automatic speech recognition is proposed, and it was used to optimize the parameters of the hybrid system.
Speech that was recognized correctly by the automatic recognizer was also rated higher by listeners, which supports the use of the proposed objective measure for optimizing speech synthesis systems.

Speech is the most natural mode of communication for people. Speech technologies such as speech synthesis, automatic speech recognition and spoken dialogue management enable spoken communication with machines and devices such as smartphones and entertainment devices. In many situations, for example when driving, a spoken-language interface can be more appropriate and practical than a keyboard and screen. To make spoken-language interfaces as natural and convenient as possible, increasingly better performance is expected from speech technologies, so their continued development is becoming more important. This work focuses on the development of a speech synthesis system for the Croatian language. A hybrid architecture based on unit selection and statistical parametric synthesis is proposed for the system. Speech generated by a statistical parametric synthesizer sounds intelligible and usually has consistent quality, while speech generated by unit selection can sound more natural. However, in unit selection speech synthesis, even a small number of units that do not join well with other units in a chain can significantly degrade the perceived quality of synthesis. Therefore, a hybrid synthesis method is proposed in which stochastic models of fundamental frequency are used to discard candidate unit chains that contain units with a low probability according to the model. The thesis is composed of 8 chapters. The first chapter presents the motivation and goals of the work. Chapter 2 gives an overview of the state of the art and previous research. The principles of unit selection and statistical parametric synthesis are given, as well as those of formant synthesis, which can be considered a predecessor of statistical parametric synthesis.
Hybrid approaches that combine ideas from unit selection and statistical parametric synthesis are described next. The chapter concludes with an overview of work specific to speech synthesis in Croatian. Chapter 3 presents in more detail the unit selection and statistical parametric speech synthesis methods used in the developed system. Chapter 4 presents the speech corpus used in the development of the speech synthesis system. A procedure for selecting a phonetically rich subset from a larger set of text is described. The procedure is applied to a large collection of Croatian text, described in this chapter, and the selected subset is small enough to be practical for recording while still allowing synthesis of an arbitrary utterance. Finally, the construction of a speech unit database from the speech recordings and their transcriptions is described. Chapter 5 describes a procedure for objective evaluation of algorithms for fundamental frequency (F0) estimation. F0 is an important parameter for modelling speech, so its accurate estimation from natural speech is important when constructing a speech synthesis system. In the proposed objective evaluation procedure, algorithms are tested on synthetic speech, and the F0 values estimated by the tested algorithm are compared with the known reference values used for synthesis. Six F0 estimation algorithms are tested and compared on male and female synthetic speech. The architecture of the developed speech synthesis system is presented in Chapter 6. The system is composed of two basic subsystems, the linguistic analysis subsystem and the speech synthesis subsystem. Three variants of the speech synthesis subsystem, differing in the method of speech synthesis, are developed. In the first, the unit selection method is used, where synthetic speech is generated by concatenating units of natural speech.
In the second, the statistical parametric method is used, where speech is generated using stochastic models of speech. In the third, the hybrid approach, a unit selection method is proposed in which candidate unit chains are scored according to statistical models of speech. In the conventional unit selection approach, a target pronunciation is set first, and then the chain of units from the database that best fits the target according to a cost function is selected. However, even if no chain of units in the database fits the target pronunciation, a different pronunciation that can be realised with the available units may still sound natural. In the hybrid system, this is achieved with a two-step unit selection procedure: in the first step, candidate chains are selected primarily on the cost of joining consecutive units, and in the second step the final selection is made using statistical models of fundamental frequency. Chapter 7 presents the results of the evaluation of the synthetic speech. A formal subjective evaluation of speech quality, intelligibility, naturalness and the occurrence of irregularities in speech was conducted. For texts within the domain of the training corpus, a variant of the system based on unit selection was rated best, while for out-of-domain texts the hybrid system achieved the best score. Subjective evaluation can be inconvenient to perform at various stages of system development, since it can take a long time, be expensive, and give results that vary between runs. Objective evaluation procedures that can be carried out quickly and with consistent results between runs are favoured in that case. In this work, a measure based on automatic speech recognition (ASR) is proposed for automatic objective evaluation of speech intelligibility and is applied to the problem of optimizing the parameters of the hybrid system.
A subjective evaluation confirmed a correspondence between the results of automatic recognition and human perception, and an improvement in synthetic speech quality after optimization. The thesis concludes with Chapter 8, where the contributions of the thesis and possible future work are presented.
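
The two-step selection described for the hybrid system can be illustrated with a minimal sketch. It is not the thesis implementation: the names `Unit`, `join_cost`, `unit_loglik` and `select_chain` are hypothetical, the join cost is reduced to the F0 mismatch at the concatenation point, and a single Gaussian stands in for the stochastic F0 models, whose actual form the abstract does not specify.

```python
import math
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical speech unit described only by its edge F0 values (Hz)."""
    name: str
    f0_start: float
    f0_end: float


def join_cost(left, right):
    # Assumed join cost: absolute F0 mismatch at the concatenation point.
    return abs(left.f0_end - right.f0_start)


def chain_join_cost(chain):
    # Sum of join costs over all pairs of consecutive units.
    return sum(join_cost(a, b) for a, b in zip(chain, chain[1:]))


def unit_loglik(unit, mean, std):
    # Log-density of the unit's mid F0 under one Gaussian
    # (a stand-in for the stochastic F0 model).
    f0 = 0.5 * (unit.f0_start + unit.f0_end)
    z = (f0 - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def select_chain(candidates, n_best, mean, std, min_loglik):
    # Step 1: shortlist the n_best chains with the lowest total join cost.
    shortlist = sorted(candidates, key=chain_join_cost)[:n_best]
    # Step 2: discard chains that contain any unit the F0 model finds
    # too improbable, then keep the most likely remaining chain.
    accepted = [
        chain for chain in shortlist
        if all(unit_loglik(u, mean, std) >= min_loglik for u in chain)
    ]
    if not accepted:
        # Fallback (an assumption): use the chain with the best join cost.
        return shortlist[0]
    return max(
        accepted,
        key=lambda chain: sum(unit_loglik(u, mean, std) for u in chain),
    )
```

In this sketch, a chain that joins perfectly but lies far from the modelled F0 range is rejected in the second step, so a chain with a slightly worse join cost but plausible F0 is chosen instead. The threshold `min_loglik` and the shortlist size `n_best` are the kind of parameters that an objective measure could be used to tune.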
