This paper describes a new method for synthesizing
speech by concatenating sub-word units from a
database of labelled speech. A large unit inventory is
created by automatically clustering units of the same
phone class based on their phonetic and prosodic context.
The appropriate cluster is then selected for a target
unit offering a small set of candidate units. An optimal
path is found through the candidate units based on
their distance from the cluster center and an acoustically
based join cost. Details of the method and justification
are presented. The results of experiments using
two different databases are given, optimising various
parameters within the system. Also a comparison
with other existing selection based synthesis techniques
is given showing the advantages this method has over
existing ones. The method is implemented within a full
text-to-speech system offering efficient natural sounding
speech synthesis.

Black, Alan W

Taylor, Paul A



Automatically clustering similar units for unit selection in speech synthesis

Edinburgh Research Archive

AUTOMATICALLY CLUSTERING SIMILAR UNITS FOR UNIT SELECTION IN SPEECH SYNTHESIS

Alan W Black and Paul Taylor
Centre for Speech Technology Research, University of Edinburgh,
80, South Bridge, Edinburgh, U.K. EH1 1HN
http://www.cstr.ed.ac.uk
email: awb@cstr.ed.ac.uk, Paul.Taylor@ed.ac.uk

ABSTRACT

This paper describes a new method for synthesizing speech by concatenating sub-word units from a database of labelled speech. A large unit inventory is created by automatically clustering units of the same phone class based on their phonetic and prosodic context. The appropriate cluster is then selected for a target unit offering a small set of candidate units. An optimal path is found through the candidate units based on their distance from the cluster center and an acoustically based join cost. Details of the method and justification are presented. The results of experiments using two different databases are given, optimising various parameters within the system. Also a comparison with other existing selection based synthesis techniques is given showing the advantages this method has over existing ones. The method is implemented within a full text-to-speech system offering efficient natural sounding speech synthesis.

1. BACKGROUND

Speech synthesis by concatenation of sub-word units (e.g. diphones) has become basic technology. It produces reliable clear speech and is the basis for a number of commercial systems. However with simple diphones, although the speech is clear, it does not have the naturalness of real speech. In an attempt to improve naturalness, a variety of techniques have been recently reported which expand the inventory of units used in concatenation from the basic diphone schema (e.g. [7] [5] [6]). This has been done in a number of directions: changing the size of the units, the classification of the units themselves, and the number of occurrences of each unit.

A convenient term for these approaches is selection based synthesis.
In general, there is a large database of speech with a variable number of units from a particular class. The goal of these algorithms is to select the best sequence of units from all the possibilities in the database, and concatenate them to produce the final speech.

The higher level (linguistic) components of the system produce a target specification, which is a sequence of target units, each of which is associated with a set of features. In the algorithm described here the database units are phones, but they can be diphones or other sized units. In the work of Sagisaka et al. [9], units are of variable length, giving rise to the term non-uniform unit synthesis. In that sense our units are uniform. The features include both phonetic and prosodic context, for instance the duration of the unit, or its position in a syllable. The selection algorithm has two jobs: (1) to find units in the database which best match this target specification and (2) to find units which join together smoothly.

2. CLUSTERING ALGORITHM

Our basic approach is to cluster units within a unit type (i.e. a particular phone) based on questions concerning prosodic and phonetic context. Specifically, these questions relate to information that can be produced by the linguistic component, e.g. is the unit phrase-final, or is the unit in a stressed syllable. Thus for each phone in the database a decision tree is constructed whose leaves are a list of database units that are best identified by the questions which lead to that leaf.

At synthesis time for each target in the target specification the appropriate decision tree is used to find the best cluster of candidate units. A search is then made to find the best path through the candidate units that takes into account the distance of a candidate unit from its cluster center and the cost of joining two adjacent units.

2.1. Clustering units

To cluster the units, we first define an acoustic measure to measure the distance between two units of the same phone type.
Expanding on [7], we use an acoustic vector which comprises Mel frequency cepstrum coefficients, F0, power, and delta cepstrum, delta F0 and delta power. The acoustic distance between two units is simply the average distance for the vectors of all the frames in the units plus X% of the frames in the previous units, which helps ensure that close units will have similar preceding contexts. More formally, we use a weighted Mahalanobis distance metric to define the acoustic distance Adist(U, V) between two units U and V of the same phoneme class as

$$
\mathrm{Adist}(U,V) =
\begin{cases}
\mathrm{Adist}(V,U) & \text{if } |V| > |U| \\[4pt]
\dfrac{W_D\,|U|}{|V|} \displaystyle\sum_{i=1}^{|U|} \sum_{j=1}^{n} \dfrac{W_j\,\bigl|F_{ij}(U) - F_{(i|V|/|U|)j}(V)\bigr|}{SD_j \cdot n \cdot |U|} & \text{otherwise}
\end{cases}
$$

where |U| is the number of frames in U, F_{xy}(U) is parameter y of frame x of unit U, SD_j is the standard deviation of parameter j, W_j is the weight for parameter j, and n is the number of parameters in the acoustic vector. This measure gives the mean weighted distance between units, with the shorter unit linearly interpolated to the longer unit. W_D is the duration penalty weighting the difference between the two units’ lengths.

This acoustic measure is used to define the impurity of a cluster of units as the mean acoustic distance between all members. The object is to split clusters based on questions to produce a better classification of the units. A CART method [2] is used to build a decision tree whose questions best minimise the impurity of the sub-clusters at that point in the tree. A standard greedy algorithm is used for building the tree. This technique may not be globally optimal but a full global search would be prohibitively computationally expensive. A minimum cluster size is specified (typically between 10 and 20).

Although the available questions are the same for each phone type, the tree building algorithm will select only the questions that are significant in partitioning that particular type. The features used for CART questions include only those features that are available for target phones during synthesis.
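The acoustic distance and cluster impurity described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a unit is assumed to be a list of frames, each frame a list of n parameter values; the names adist and impurity, the nearest-frame index mapping used in place of true interpolation, and the omission of the frames from the preceding unit are all simplifications introduced here.

```python
from itertools import combinations


def adist(u, v, weights, sds, w_dur=1.0):
    """Mean weighted frame distance between two units (hypothetical helper).

    u, v    : lists of frames; each frame is a list of n parameter values
    weights : per-parameter weights (assumed given)
    sds     : per-parameter standard deviations (assumed given)
    w_dur   : duration penalty weight
    """
    if len(v) > len(u):
        u, v = v, u  # make u the longer unit so the measure is symmetric
    n = len(weights)
    total = 0.0
    for i, frame_u in enumerate(u):
        # Assumption: map frame i of the longer unit onto the nearest
        # earlier frame of the shorter unit instead of interpolating.
        frame_v = v[i * len(v) // len(u)]
        for j in range(n):
            total += weights[j] * abs(frame_u[j] - frame_v[j]) / sds[j]
    mean = total / (n * len(u))
    # The duration penalty grows with the ratio of the two lengths.
    return w_dur * len(u) / len(v) * mean


def impurity(cluster, weights, sds, w_dur=1.0):
    """Mean pairwise acoustic distance between all members of a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(adist(a, b, weights, sds, w_dur) for a, b in pairs) / len(pairs)
```

A tree builder in the CART style would evaluate each candidate question by the impurity of the two sub-clusters it produces and keep the question that lowers it most, subject to the minimum cluster size.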
In our experiments the question features were: previous and following phonetic context (both phonetic identity and phonetic features), prosodic context (pitch and duration including that of previous and next units), stress, position in syllable, and position in phrase. Additional features were originally included, such as delta F0 between a phone and its preceding phone, but they did not appear as significant and were removed. Different features are significant for different phones; for example, we see that lexical stress is only used in the phones schwa, i, a and n, while a feature representing pitch is only rarely used in unvoiced consonants.

The CART building algorithm implicitly deals with sparseness of units in that it will only split a cluster if there are sufficient examples and significant difference to warrant it.

2.2. Joining units

To join consecutive candidate units from clusters selected by the decision trees, we use an optimal coupling [4] technique to measure the concatenation costs between two units. This technique offers two results: the cost of a join and a position for the join. Allowing the join point to move is particularly important when our units are phones: initial unit boundaries are on phone-phone boundaries, which are probably the least stable part of the signal. Optimal coupling allows us to select more stable positions towards the center of the phone. In our implementation, if the previous phone in the database is of the same type as the selected phone we use a search region that extends 60% into the previous phone; otherwise the search region is defined to be the phone boundaries of the current phone.

Our actual measure of join cost is a frame based Euclidean distance. The frame information includes F0, Mel frequency cepstrum coefficients, and power and their delta counterparts.
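The two results of optimal coupling, a join cost and a join position, can be illustrated with the rough sketch below. It assumes the search regions have already been extracted as lists of frames and simply tries every pair of frames; the function names, the per-parameter weights, and the exhaustive search are assumptions made for illustration and are not taken from [4] or from the implementation described here.

```python
import math


def frame_distance(a, b, weights):
    """Weighted Euclidean distance between two frames (lists of parameters)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def optimal_coupling(left_region, right_region, weights):
    """Return (cost, i, j) for the cheapest join between the two regions.

    left_region  : candidate frames near the end of the left unit
    right_region : candidate frames near the start of the right unit
    i, j         : indices of the chosen join frames in each region
    """
    best = (float("inf"), -1, -1)
    for i, frame_a in enumerate(left_region):
        for j, frame_b in enumerate(right_region):
            d = frame_distance(frame_a, frame_b, weights)
            if d < best[0]:
                best = (d, i, j)
    return best
```

Raising the weight on one parameter (for example the one holding F0) makes mismatches in that parameter more expensive, which is how a measure of this kind can be tuned to discourage pitch discontinuities at joins.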
Although the join cost uses the same parameters as the acoustic measure used in clustering, it is now necessary to weight the F0 parameter to deter discontinuities in local F0, which can be particularly distracting in synthesized examples. Except for the delta features this measure is similar to that used in [7].

2.3. Selecting units

At synthesis time we have a stream of target segments that we wish to synthesize. For each target we use the CART for that unit type, and ask the questions to find the appropriate cluster which provides a set of candidate units. The function Tdist(U) is defined as the distance of a unit U to its cluster center, and the function Jdist(U_i, U_{i-1}) as the join cost of the optimal coupling point between a candidate unit U_i and the previous candidate unit U_{i-1} it is to be joined to. We then use a Viterbi search to find the optimal path through the candidate units that minimizes the following expression:

$$
\sum_{i=1}^{N} \mathrm{Tdist}(U_i) + W \cdot \mathrm{Jdist}(U_i, U_{i-1})
$$

W allows a weight to be set optimizing join cost over target cost. Given that clusters typically contain units that are very close, the join cost is usually the more important measure and hence is weighted accordingly.

2.4. Pruning

As the whole database may be prohibitively large to distribute as part of a synthesis voice, especially if multiple voices are required, appropriate pruning of units can be done to reduce the size of the database. This has two effects. The first is to remove spurious atypical units which may have been caused by mislabelling or poor articulation in the original recording. The second is to remove those units which are so common that there is no significant distinction between candidates. Given this clustering algorithm it is easy (and worthwhile) to achieve the first by removing the units from a cluster that are furthest from its center.
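The first kind of pruning, together with the Viterbi search of Section 2.3 that runs over the remaining candidates, can be sketched as below. This is an illustrative sketch only: the function names, the representation of units as opaque objects, and the callables tdist (distance to cluster center) and jdist (join cost between a unit and its predecessor) are assumptions introduced here, and a real system would precompute or cache these costs.

```python
def prune_cluster(units, tdist, n_drop):
    """Drop the n_drop units furthest from the cluster center (keep at least one)."""
    ordered = sorted(units, key=tdist)
    return ordered[:max(len(ordered) - n_drop, 1)]


def viterbi_select(candidates, tdist, jdist, w=1.0):
    """Find the cheapest path through the candidate units.

    candidates : one non-empty list of candidate units per target
    tdist(u)   : distance of unit u from its cluster center
    jdist(u,p) : join cost between unit u and the previous unit p
    w          : weight of join cost relative to target cost
    Returns (total_cost, path).
    """
    # best[k] holds (cost, path) of the cheapest path ending in candidate k.
    best = [(tdist(u), [u]) for u in candidates[0]]
    for cands in candidates[1:]:
        new_best = []
        for u in cands:
            # Cheapest way to reach u from any candidate of the previous target.
            cost, path = min(
                ((c + w * jdist(u, p[-1]), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + tdist(u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])
```

With w set high the search prefers smooth joins even when the chosen units sit further from their cluster centers, matching the observation that join cost is usually the more important term.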
Results of some experiments on pruning are shown below.

The second type of pruning, removing overly common units, is a little harder as it requires looking at the distribution of the distances within clusters for a unit type to find what can be determined as “close enough.” Again this involves removal of those units furthest from the cluster center, though this is best done before the final splits in the tree, and only for the most common unit types.

As with all the measures and parameters there is a trade off between synthesis resources (size of database and time to select) versus quality, but it seems that pruning 20% of units makes no significant difference (and may even improve the results) while up to 50% may be removed without seriously degrading the quality. (Similar figures were also found in the work described in [7].)

3. EXPERIMENTS

Two databases have so far been tested with this technique: a male British English RP speaker consisting of 460 TIMIT phonetically balanced sentences (about 14,000 units) and a female American news reader from the Boston University FM Radio corpus [8] (about 37,000 units).

Testing the quality of speech synthesis is difficult. Initially we tried to score a model under some set of parameters by synthesizing a set of 50 sentences. The results were scored on a scale of 1-5 (excellent to incomprehensible). However the results were not consistent except when the quality widely differed. Therefore instead of using an absolute score we used a relative one, as it was found to be much easier and more reliable to judge if an example was better, equal or worse than another than to state its quality on some absolute scale.

In these tests we generated 20 sentences for a small set of models by varying some parameter (e.g. cluster size). The 20 sentences consisted of 10 “natural target” sentences (where the segments, duration and F0 were derived directly from naturally spoken examples), and 10 examples of text to speech.
None of the sentences in the test set were in the databases used to build the cluster models. Each set of 20 was played against each other set (in random order) and a score of better, worse or equal was recorded. A sample set was said to “win” if it had more “better” examples than another. A league table was kept recording the number of “wins” for each sample set, thus giving an ordering on the sets.

In the following tests we varied cluster size, the F0 weight in the acoustic cost, and the amount to prune final clusters. These full tests were only carried out on the male 460 sentence database.

For the cluster size we fixed the other parameters at what we thought were mid-values. The following table gives the number of “wins” of that sample set over the others.

minimum cluster size     5     8    10    12    15
wins                     1     0     4     3     2

We can see that when the cluster size is too restrictive the quality decreases, but at around 10 it is at its best, and it decreases again as the cluster size gets bigger.

The importance of F0 in the acoustic measure was tested by varying its weighting relative to the other parameters in the acoustic vector.

F0 acoustic weight     0.0   1.0   2.0   3.0
wins                     1     3     2     0

This optimal value is lower than we expected, but we believe this is because our listening test did not test against an original or actual desired F0; thus no penalty was given to a “wrong” but acceptable F0 contour in a synthesized example.

The final test was to find the effect of pruning the clusters. In this case clusters of size 15 and 10 were tested, and pruning involved discarding a number of units from the clusters. In both cases discarding 1 or 2 made no perceptible difference in quality (though results actually differed in 2 units). In the size 10 cluster case, further pruning began to degrade quality. In the size 15 cluster case, quality only degraded after discarding more than 3 units. Overall the best quality was for the size 10 cluster, and pruning 2 allows the database size to be reduced without affecting quality.
The pruning was also tested on the f2b database with its much larger inventory. Best overall results with that database were found with pruning 3 and 4 from a cluster size of 20.

In these experiments no signal modification was done after selection, even though we believe that such processing (e.g. PSOLA) is necessary. We do not expect all prosodic forms to exist in the database and it is better to introduce a small amount of modification to the signal in return for fixing obvious discontinuities. However it is important for the selection algorithm to be sensitive to the prosodic variation required by the targets so that the selected units require only minimal modification. Ideally the selection scoring should take into account the cost of signal modification, and we intend to run similar tests on selections modified by signal processing.

4. DISCUSSION

This algorithm has a number of advantages over other selection based synthesis techniques. First, the cluster method based on acoustic distances avoids the problem of estimating weights in a feature based target distance measure as described in [7], but still allows unit clusters to be sensitive to general prosodic and phonetic distinctions. It also neatly finesses the problem of variability in sparseness of units. The tree building algorithm only splits a cluster when there are a significant number of units and identifiable variation to make the split worthwhile. The second advantage over [7] is that no target cost measurement need be done at synthesis time, as the tree effectively has pre-calculated the “target cost” (in this case simply the distance from the cluster center). This makes for more efficient synthesis as many distance measurements now need not be done.

Although this method removes the need to generate the target feature weights used in [7] for estimating acoustic distance, there are still many other places in the model where parameters need to be estimated, particularly the acoustic cost and the continuity cost.
Any frame based distance measure will not easily capture “discontinuity errors” perceived as bad joins between units. This probably makes it difficult to find automatic training methods to measure the quality of the synthesis produced.

Donovan and Woodland [5] use a similar clustering method, but the method described here differs in that instead of a single example being chosen from the cluster, all the members are used so that continuity costs may take part in the criteria for selection of the best units.

In [5], HMMs are used instead of a direct frame-based measure for acoustic distance. The advantage in using an HMM is that different states can be used for different parts of the unit. Our model is equivalent to a single state HMM and so may not capture transient information in the unit. We intend to investigate the use of HMMs as representations of units as this should lead to a better unit distance score.

Other selection algorithms use clustering, though not always in the way presented here. As stated, the cluster method presented here is most similar to [5]. Sagisaka et al. [9] also cluster units, but only using phonetic information; they combine units to form longer, “non-uniform” units based on the distribution found in the database. Campbell and Black [3] also use similar phonetic based clustering and further cluster the units based on prosodic features, but still resort to a weighted feature target distance for ultimate selection.

It is difficult to give realistic comparisons of the quality of this method over others. Unit selection techniques are renowned for both their extreme high quality examples and their extreme low quality ones, and minimising the bad examples is a major priority. This technique does not yet remove all low quality examples, but does try to minimise them. Most examples lie in the middle of the quality spectrum with mostly good selection but a few noticeable errors which detract from the overall acceptability of the utterance.
The best examples, however, are nearly indistinguishable from natural utterances.

This cluster method is fully implemented as a waveform synthesis component using the Festival Speech Synthesis System [1].

5. ACKNOWLEDGEMENTS

We gratefully acknowledge the support of the UK Engineering and Physical Science Research Council (EPSRC grant GR/K54229 and EPSRC grant GR/L53250).

REFERENCES

[1] A. W. Black and P. Taylor. The Festival Speech Synthesis System: system documentation. Technical Report HCRC/TR-83, Human Communication Research Centre, University of Edinburgh, Scotland, UK, January 1997. Available at http://www.cstr.ed.ac.uk/projects/festival.html.
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA., 1984.
[3] N. Campbell and A. Black. Prosody and the selection of source units for concatenative synthesis. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, editors, Progress in speech synthesis, pages 279–282. Springer Verlag, 1996.
[4] A. Conkie and S. Isard. Optimal coupling of diphones. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, editors, Progress in speech synthesis, pages 293–305. Springer Verlag, 1996.
[5] R. Donovan and P. Woodland. Improvements in an HMM-based speech synthesiser. In Eurospeech95, volume 1, pages 573–576, Madrid, Spain, 1995.
[6] X. Huang, A. Acero, H. Hon, Y. Ju, J. Liu, S. Meredith, and M. Plumpe. Recent improvements on Microsoft’s trainable text-to-speech synthesizer: Whistler. In ICASSP-97, volume II, pages 959–962, Munich, Germany, 1997.
[7] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP-96, volume 1, pages 373–376, Atlanta, Georgia, 1996.
[8] M. Ostendorf, P. Price, and S. Shattuck-Hufnagel. The Boston University Radio News Corpus. Technical Report ECS-95-001, Electrical, Computer and Systems Engineering Department, Boston University, Boston, MA, 1995.
[9] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura. ATR ν-TALK speech synthesis system. In Proceedings of ICSLP 92, volume 1, pages 483–486, 1992.

https://era.ed.ac.uk/bitstream/1842/1236/1/Black_1997_b.pdf
