
Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

By Jennifer Ann Longster

Abstract

This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the 'best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of 'buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co-occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion.
The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than being phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processing distortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes, trained using the experimental data collected during the listening tests. The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes.
The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggests that the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community.

Topics: csi
OAI identifier: oai:eprints.bournemouth.ac.uk:415
