This paper proposes a new method that determines segmental duration for text-to-speech conversion based on the movement of articulatory organs which compose an articulatory model. The articulatory model comprises four time-variable articulatory parameters representing the conditions of articulatory organs whose physical restriction seems to significantly influence the segmental duration. The parameters are controlled according to an input sequence of phonetic symbols, following which segmental duration is determined based on the variation of the articulatory parameters. The proposed method is evaluated through an experiment using a Japanese speech database that consists of 150 phonetically balanced sentences. The results indicate that the mean square error of predicted segmental duration is approximately 15[ms] for the closed set and 15-17[ms] for the open set. The error is within 20[ms], the level of acceptability for distortion of segmental duration without loss of naturalness, and hence the method is proved to effectively predict segmental duration.

Matsuura, Hiroshi

Nitta, Tsuneo

Shiga, Yoshinori


Segmental duration control based on an articulatory model


SEGMENTAL DURATION CONTROL BASED ON AN ARTICULATORY MODEL

Yoshinori Shiga, Hiroshi Matsuura and Tsuneo Nitta
Multimedia Engineering Laboratory, TOSHIBA Corporation
70 Yanagi-cho, Saiwai-ku, Kawasaki, Japan

ABSTRACT

This paper proposes a new method that determines segmental duration for text-to-speech conversion based on the movement of articulatory organs which compose an articulatory model. The articulatory model comprises four time-variable articulatory parameters representing the conditions of articulatory organs whose physical restriction seems to significantly influence the segmental duration. The parameters are controlled according to an input sequence of phonetic symbols, following which segmental duration is determined based on the variation of the articulatory parameters.

The proposed method is evaluated through an experiment using a Japanese speech database that consists of 150 phonetically balanced sentences. The results indicate that the mean square error of predicted segmental duration is approximately 15[ms] for the closed set and 15-17[ms] for the open set. The error is within 20[ms], the level of acceptability for distortion of segmental duration without loss of naturalness, and hence the method is proved to effectively predict segmental duration.

1. INTRODUCTION

Duration control is one of the most important factors that decide the naturalness of speech produced by text-to-speech (TTS) systems.
Unnatural and artificial duration makes speech so monotonous that people quickly become tired of listening to it, and often causes misperception.

In actual speech production, as illustrated in Figure 1, plans are first made on local speaking rates according to prosodic information, such as word stress, syntactic structure and semantic focus; that is, the target timing structure is determined. Although the voice is then adjusted to the target structure, the articulatory organs cannot always maintain the voice at the target because of physical restrictions on their movement. We focus on this phenomenon and believe that it is the main factor affecting the basic timing structure of each language and that it also prevents speech from being monotonous. Therefore, in order to synthesize temporally natural-sounding speech, segmental duration should be controlled with consideration of the restricted movement of the organs.

In this paper, we propose a novel method for determining segmental duration based on the movement of articulatory organs which compose an articulatory model, in order to take into account the movement restriction of articulatory organs. The method has already been applied to duration control in our Japanese TTS system that runs on PCs, and helps to make synthetic speech much more natural.

2. DURATION CONTROL BASED ON AN ARTICULATORY MODEL

2.1 Articulatory Model

Many kinds of articulatory model have been proposed [1-3] after the earliest one proposed by Coker and Fujimura [4]. Most of these models are, however, designed to approximate the transmission characteristics of the vocal tract in order to clarify the relation between the movement of vocal organs and acoustic characteristics of speech.
Consequently, they are complicated by having many parameters corresponding to every part of the vocal tract.

On the other hand, the articulatory model in our method for determining segmental duration is fairly simple, because it is sufficient for the model to employ parameters that represent the conditions of articulatory organs whose physical restriction significantly influences the segmental duration. Figure 2 shows the articulatory model we adopt in the proposed method. The model comprises only four time-variable articulatory parameters, i.e., the opening area of the lips (L), and the positions of the lower jaw (J), the front tongue (FT) and the back tongue (BT), whose physical restriction is considered to significantly influence the duration.

Figure 1: Duration control in actual speech production. (Diagram labels: prosodic information, comprising word stress, syntactic structure and semantic focus; planning for local speaking rates; target timing structure; articulation; restriction of organ’s movement; phonetic information; feedback; speech.)

2.2 Model Control

Since the method simulates the movements of the articulatory organs, speech sounds should be classified according to their manners and places of articulation. The classification is done using a knowledge of articulatory phonetics. We will discuss the classification of Japanese speech in section 3.1.

There are 13 coefficients assigned to each classified sound (simply called “phone” below) in total. They include three coefficients of each articulatory parameter (therefore 12 values in total), Ainh, Amax and Amin, whose values represent the inherent articulation and the upper and lower articulatory limits of the phone, respectively. The coefficient Ainh indicates the organ’s “typical” position and Amax and Amin the range of the position acceptable as the articulation of the phone.
The remaining coefficient is the minimum duration Dmin of the phone; this duration must elapse after every parameter has come into the articulation of the phone before a command for articulation of the next phone is given.

Based on these coefficients, the articulatory parameters are controlled in the model according to an input sequence of phonetic symbols. We approximate the parameter change of the articulatory organ represented by k (= L, J, FT, BT) with the following function M(k,t):

$$M(k,t) = A_{\mathrm{inh}}(k, ph_0) + \sum_{i=0}^{N-1} R_i(k,t)$$

$$R_i(k,t) = \begin{cases} 0 & (t < t_i) \\ \{A_{\mathrm{inh}}(k, ph_{i+1}) - A_{\mathrm{inh}}(k, ph_i)\}\, S(t - t_i) & (t \ge t_i) \end{cases}$$

where ph_i and N indicate the type of the i-th phone and the number of phones to be synthesized, respectively. Here, t_i is the time at which all the articulatory parameters start shifting the articulation from the i-th phone to the next phone. We suppose that a “command” for the phone ph_{i+1} is given at the time t_i. S(t) is obtained by the following step-response function of a critically damped second-order system:

$$S(t) = 1 - (1 + \alpha_k t)\, e^{-\alpha_k t}$$

where α_k is a time constant inherent in the articulatory organ k. An organ that moves quicker takes a higher value of α.

Figure 3 shows the parameter change representing the spring-up and spring-down movements of the tongue, which are realized with two commands. Figure 4 shows an example of articulatory parameter variation that the method generates from the input “arayuru genjitsu (all the facts)”.

2.3 Duration Determination

Segmental duration is determined based on the four contours of the time-variable articulatory parameters, which are controlled according to the input phonetic string to be synthesized. The method first sets the boundaries of segments. Although there are generally different definitions of segment boundary, the definition must be at least the same as that in the post-processing. Our TTS system employs a concatenative method using diphone units for post-processing, which are produced on the basis of phonetic labels in the speech database we built. The phonetic boundaries are hence set corresponding to the definition of the labeling method of the database, and segmental duration is then obtained as the time difference between adjacent boundaries.

Figure 2: The articulatory model in the proposed method.

Figure 3: Commands and the change of articulatory parameter. (Time axis t, with command times t_i and t_{i+1} marked.)

3. APPLICATION TO JAPANESE SPEECH SYNTHESIS

We have discussed the proposed method independent of language so far. In this section, the method is applied to Japanese speech.

3.1 Classifying the Sound

As explained in section 2.2, speech sounds must be classified by their manners and places of articulation. In order to meet this requirement, we referred to the classification of the International Phonetic Alphabet (IPA). With this reference, we classified Japanese sounds into 51 phones, which consist of 12 vowels including devoiced or nasalized vowels, and 39 consonants including palatalized sounds, which are listed in Table 1. The coefficients for the sound [i] of Japanese are given in Table 2 as an example.

3.2 Controlling the Articulatory Model Based on Mora-timed Rhythm

The timing structure of Japanese speech is characterized by mora-timed rhythm, i.e., rhythm on mora isochrony. This timing structure is achieved with the articulatory model by adjusting each command in a vowel for the following phone so that the commands are issued at equal time intervals.
However, if not every articulatory parameter has reached the acceptable range for that vowel, the command waits until this state is reached, because it indicates that the articulatory combination is too difficult to be produced within the given interval of time. A command in a consonant for the next phone is issued after the elapse of the minimum duration Dmin of the consonant from where every articulatory parameter comes into its acceptable articulation range, as explained in section 2.2.

Figure 4: Articulatory parameter variation that the method generates from the input “arayuru genjitsu (all the facts)”.

Table 1: Sounds in Japanese speech.

Vowels:      a  i    e  o
Consonants:  k  kj  s     t  t j  ts  t   t n  nj    h    j  m  mj j   j   w  p  pj  j  z  dz  d   d   d  dj b  bj    j

Table 2: Example of the coefficients (Japanese [i]).

Parameter k    Amin(k, i)    Ainh(k, i)    Amax(k, i)
J              0.19          0.30          0.50
FT             0.10          0.17          0.25
BT             0.59          0.64          0.83
L              0.32          0.35          0.49

Dmin(i) = 0.00

4. EVALUATION

We objectively evaluate the proposed method by examining errors between actual duration and the duration predicted by the method.

4.1 Speech Data

The data used in the experiment are 150 phonetically-balanced sentences in Japanese. Labels representing the phone types that we classified in section 3.1 are manually assigned.
Of the 150 sentences, 100 are used as a closed set and the remaining 50 as an open set.

4.2 Experimental Procedure

The coefficients of each phone are first assigned values estimated roughly from the articulatory shape with a knowledge of articulatory phonetics, and then optimized by the analysis-by-synthesis (A-b-S) method with the closed set of 100 sentences.

For the interval of time until a command in a vowel for the next phone is issued, we use the value extracted from the database for each “accent phrase”, assuming that the interval is constant within the phrase.

Segmental duration is predicted for all the sentences using the proposed method, and compared to the measured duration for the same 100 sentences used for the optimization, and for the remaining 50 sentences as an open test.

4.3 Results

The experimental results are shown in Table 3. The mean square error was approximately 15[ms] for the closed set and 15-17[ms] for the open set.

5. DISCUSSION

In order to evaluate the method from the experimental results, the acceptability for distortion of segmental duration should be taken into consideration.

Hashimoto [5] reported that approximately 20[ms] is the limit of distortion of segmental duration within which speech remains natural. All the mean square errors in Table 3 are within the limit, and hence the experimental results show that the proposed method precisely estimates segmental duration.

6. CONCLUSION

We have discussed a new method that determines segmental duration by simulating articulatory motion with an articulatory model. After a theoretical examination, an experiment was conducted, and the results confirmed the effectiveness of the proposed method.

Since the method determines duration based on the articulatory model representing the movement of articulatory organs, duration is influenced by physical restriction of the model in a way fairly similar to the process of actual speech production.
The proposed method is therefore expected to give a natural rhythm to synthetic speech at different speaking rates, but not to be good at tongue twisters.

ACKNOWLEDGMENT

We are very grateful to the members of the speech synthesis group in the TOSHIBA R&D Center for willingly providing valuable speech databases to us.

REFERENCES

1. Mermelstein, P. “Articulatory model for the study of speech production,” J. Acoust. Soc. Am. 53(4): 1070-1082, 1973.
2. Shirai, K. and Honda, M. “Estimation of articulatory motion from speech waves and its application for automatic recognition,” in Spoken Language Generation and Understanding (ed. J. C. Simon), Reidel, Dordrecht, Holland, pp.87-99, 1980.
3. Maeda, S. “Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model,” in Speech production and speech modeling (NATO Advanced Study Institute Series), W. J. Hardcastle and A. Marchal (eds). Kluwer Academic Publishers, Boston, pp.131-149, 1990.
4. Coker, C. H. and Fujimura, O. “A model for specification of vocal tract area function,” J. Acoust. Soc. Am. 40: 1271(A), 1966.
5. Hashimoto, S. and Saito, S. “Prosodic rules for speech synthesis,” 7th International Congress on Acoustics, pp.129-132, 1971.

Table 3: Experimental results (mean square error between estimated duration and actual duration).

                          Closed (100 sentences)     Open (50 sentences)
                          Consonant    Vowel         Consonant    Vowel
Duration average [ms]     60.5         79.8          61.7         77.7
Mean square error [ms]    14.9         15.4          15.4         17.3
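The parameter control described in section 2.2 can be illustrated with a short Python sketch. It is not the authors' implementation: it only evaluates the critically damped step response S(t) and the contour M(k,t) for a single articulatory parameter. The function names, the time constant of 40 per second and the target value 0.80 for the preceding phone are hypothetical, and the indexing convention (an initial phone ph_0 followed by commands at t_0, t_1, ...) is an assumption. Only the value 0.30 is taken from Table 2 (Ainh(J, i)).

```python
import math


def step_response(t, alpha):
    """S(t) = 1 - (1 + alpha*t) * exp(-alpha*t); taken as 0 before the command."""
    if t < 0.0:
        return 0.0
    return 1.0 - (1.0 + alpha * t) * math.exp(-alpha * t)


def parameter_contour(t, targets, command_times, alpha):
    """Evaluate M(k, t) for one articulatory parameter k.

    targets[i] plays the role of A_inh(k, ph_i); command_times[i] is t_i,
    the time at which the command for ph_{i+1} is given (assumed indexing).
    """
    if len(targets) != len(command_times) + 1:
        raise ValueError("need exactly one more target than command times")
    value = targets[0]
    for i, t_i in enumerate(command_times):
        if t >= t_i:  # R_i(k, t) is zero before its command time
            value += (targets[i + 1] - targets[i]) * step_response(t - t_i, alpha)
    return value


# Hypothetical jaw contour: some open phone (0.80, assumed) followed by [i]
# (0.30, from Table 2), with the command issued at t = 0.10 s.
jaw_targets = [0.80, 0.30]
jaw_commands = [0.10]
ALPHA_JAW = 40.0  # hypothetical time constant in 1/s

for time_s in (0.00, 0.10, 0.15, 0.30):
    print(time_s, round(parameter_contour(time_s, jaw_targets, jaw_commands, ALPHA_JAW), 3))
```

Because each command contributes a smooth, saturating step, a larger alpha makes the parameter reach the acceptable range of the next phone sooner, which is how the organ-specific time constant shapes the predicted duration.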

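The comparison made in sections 4 and 5 can be sketched in the same way. The snippet below computes an error between predicted and measured durations and checks it against the 20[ms] limit attributed to Hashimoto [5]. It assumes that the error reported in milliseconds is the square root of the mean squared difference; the paper does not spell out the formula. The duration values and the function name are hypothetical and are not taken from the speech database.

```python
import math

ACCEPTABLE_LIMIT_MS = 20.0  # limit of acceptable duration distortion cited from [5]


def rms_error_ms(predicted_ms, measured_ms):
    """Root of the mean squared difference between two duration lists (in ms)."""
    if len(predicted_ms) != len(measured_ms) or not predicted_ms:
        raise ValueError("duration lists must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted_ms, measured_ms)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical segment durations in milliseconds, for illustration only.
predicted = [62.0, 80.0, 55.0, 90.0]
measured = [60.0, 95.0, 50.0, 70.0]

error = rms_error_ms(predicted, measured)
within_limit = error <= ACCEPTABLE_LIMIT_MS
print(f"error = {error:.1f} ms, within limit: {within_limit}")
```

In the experiment this comparison would be made separately for consonants and vowels, and separately for the closed and open sets, as in Table 3.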

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.62.6930
