165 research outputs found

    Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

    This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the `best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of `buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co-occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion.
The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than being phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processing distortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes, trained using the experimental data collected during the listening tests. The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes.
The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence level were smaller than those seen in the previous investigative experiments that made use of word-level stimuli. This suggests that the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community.
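
The abstract describes a distortion measure with per-phoneme weightings that could feed into unit selection, but it gives no formula. As a rough illustration only, the sketch below adds a predicted-distortion penalty to an existing unit selection cost. The weights, the use of the pitch shift in octaves, and all names (`PHONEME_WEIGHT`, `predicted_distortion`, `select_unit`) are assumptions made for this example and are not taken from the thesis.

```python
import math

# Hypothetical per-phoneme weights: a larger value means the phoneme is
# assumed to be more prone to audible distortion under pitch modification.
PHONEME_WEIGHT = {"z": 3.0, "v": 2.5, "a": 1.0}
DEFAULT_WEIGHT = 1.0


def predicted_distortion(phoneme, source_f0, target_f0):
    """Assumed measure: phoneme weight times the pitch shift in octaves."""
    shift_octaves = abs(math.log2(target_f0 / source_f0))
    return PHONEME_WEIGHT.get(phoneme, DEFAULT_WEIGHT) * shift_octaves


def select_unit(phoneme, target_f0, candidates):
    """Pick the candidate with the lowest combined cost.

    Each candidate is a dict with 'id', 'f0' (Hz) and 'base_cost', the
    cost a conventional unit selection process would already assign.
    """
    return min(
        candidates,
        key=lambda c: c["base_cost"]
        + predicted_distortion(phoneme, c["f0"], target_f0),
    )


candidates = [
    {"id": "z_017", "f0": 100.0, "base_cost": 0.2},
    {"id": "z_042", "f0": 190.0, "base_cost": 0.5},
]
best = select_unit("z", target_f0=200.0, candidates=candidates)
print(best["id"])
```

With these illustrative numbers, the candidate recorded closer to the target pitch is preferred even though its base cost is higher, which mirrors the stated aim of limiting the extent of pitch modification for adversely affected phonemes.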

    Speech Synthesis Based on Hidden Markov Models


    A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion

    During the past decades, many areas of speech processing have benefited from the vast increases in the available memory sizes and processing power. For example, speech recognizers can be trained with enormous speech databases and high-quality speech synthesizers can generate new speech sentences by concatenating speech units retrieved from a large inventory of speech data. However, even in today's world of ever-increasing memory sizes and computational resources, there are still lots of embedded application scenarios for speech processing techniques where the memory capacities and the processor speeds are very limited. Thus, there is still a clear demand for solutions that can operate with limited resources, e.g., on low-end mobile devices. This thesis introduces a new segmental parametric speech codec referred to as the VLBR codec. The novel proprietary sinusoidal speech codec designed for efficient speech storage is capable of achieving relatively good speech quality at compression ratios beyond the ones offered by the standardized speech coding solutions, i.e., at bitrates of approximately 1 kbps and below. The efficiency of the proposed coding approach is based on model simplifications, mode-based segmental processing, and the method of adaptive downsampling and quantization. The coding efficiency is also further improved using a novel flexible multi-mode matrix quantizer structure and enhanced dynamic codebook reordering. The compression is also facilitated using a new perceptual irrelevancy removal method. The VLBR codec is also applied to text-to-speech synthesis. In particular, the codec is utilized for the compression of unit selection databases and for the parametric concatenation of speech units. It is also shown that the efficiency of the database compression can be further enhanced using speaker-specific retraining of the codec. 
Moreover, the computational load is significantly decreased using a new compression-motivated scheme for very fast and memory-efficient calculation of concatenation costs, based on techniques and implementations used in the VLBR codec. Finally, the VLBR codec and the related speech synthesis techniques are complemented with voice conversion methods that allow modifying the perceived speaker identity, which in turn enables, e.g., cost-efficient creation of new text-to-speech voices. The VLBR-based voice conversion system combines compression with the popular Gaussian mixture model based conversion approach. Furthermore, a novel method is proposed for converting the prosodic aspects of speech. The performance of the VLBR-based voice conversion system is also enhanced using a new approach for mode selection and through explicit control of the degree of voicing. The solutions proposed in the thesis together form a complete system that can be utilized in different ways and configurations. The VLBR codec itself can be utilized, e.g., for efficient compression of audio books, and the speech synthesis related methods can be used for reducing the footprint and the computational load of concatenative text-to-speech synthesizers to levels required in some embedded applications. The VLBR-based voice conversion techniques can be used to complement the codec both in storage applications and in connection with speech synthesis. It is also possible to utilize only the voice conversion functionality, e.g., in games or other entertainment applications.

    Text-Independent Voice Conversion

    This thesis deals with text-independent solutions for voice conversion. It first introduces the use of vocal tract length normalization (VTLN) for voice conversion. The presented variants of VTLN allow speaker characteristics to be changed easily by means of a few trainable parameters. Furthermore, it is shown how VTLN can be expressed in the time domain, strongly reducing the computational cost while maintaining high speech quality. The second text-independent voice conversion paradigm is residual prediction. In particular, two proposed techniques, residual smoothing and the application of unit selection, result in a substantial improvement in both speech quality and voice similarity. In order to apply the well-studied linear transformation paradigm to text-independent voice conversion, two text-independent speech alignment techniques are introduced. One is based on automatic segmentation and mapping of artificial phonetic classes, and the other is a completely data-driven approach with unit selection. The latter achieves a performance very similar to the conventional text-dependent approach in terms of speech quality and similarity. It is also successfully applied to cross-language voice conversion. The investigations of this thesis are based on several corpora in three different languages, i.e., English, Spanish, and German. Results are also presented from the multilingual voice conversion evaluation in the framework of the international speech-to-speech translation project TC-Star.
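
The abstract does not say which warping functions its VTLN variants use. As a hedged illustration of how a single parameter can reshape the frequency axis, the sketch below uses the bilinear (all-pass) warping function, one commonly used choice for VTLN; picking it here is an assumption, and the names `bilinear_warp` and `warp_spectrum` are hypothetical. It works on a magnitude spectrum in the frequency domain and does not attempt the time-domain formulation mentioned in the abstract.

```python
import math


def bilinear_warp(omega, alpha):
    """Warp a normalized frequency omega (radians, 0..pi).

    alpha in (-1, 1) is the single warping parameter; alpha = 0 leaves
    the frequency axis unchanged.
    """
    return omega + 2.0 * math.atan2(
        alpha * math.sin(omega), 1.0 - alpha * math.cos(omega)
    )


def warp_spectrum(magnitudes, alpha):
    """Resample a magnitude spectrum along the warped frequency axis.

    magnitudes[k] is the value at omega = pi * k / (n - 1), with n >= 2.
    Output bin k takes the linearly interpolated input value found at
    the warped position of omega_k (an assumed convention).
    """
    n = len(magnitudes)
    warped = []
    for k in range(n):
        omega = math.pi * k / (n - 1)
        pos = bilinear_warp(omega, alpha) / math.pi * (n - 1)
        pos = min(max(pos, 0.0), n - 1.0)  # guard against rounding
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * magnitudes[lo] + frac * magnitudes[hi])
    return warped


spectrum = [0.0, 1.0, 4.0, 9.0, 16.0]
shifted = warp_spectrum(spectrum, alpha=0.1)
```

In a trainable setting, `alpha` would be the parameter estimated per speaker pair; here it is simply fixed by hand.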

    Computer speech synthesis: a systematic method to extract synthesis parameters for formant synthesizers.

    by Yu Wai Leung.
    Thesis (M.Phil.)--Chinese University of Hong Kong, 1993.
    Includes bibliographical references (leaves 94-96).
    Abstract
    Introduction
    Chapter 1. Human speech and its production model
        1.1 The human vocal system
        1.2 Speech production mechanism
        1.3 Acoustic properties of human speech
        1.4 Modeling the speech production process
        1.5 Speech as the spoken form of a language
    Chapter 2. Speech analysis techniques
        2.1 Short time speech analysis and speech segmentation
        2.2 Pre-emphasis
        2.3 Linear predictive analysis
        2.4 Formant tracking
        2.5 Pitch determination
    Chapter 3. Speech synthesis technology
        3.1 Overview
        3.2 Articulatory synthesis
        3.3 Concatenation synthesis
        3.4 LPC synthesis
        3.5 Formant speech synthesis
        3.6 Synthesis by rule
    Chapter 4. LSYNTH: A parallel formant synthesizer
        4.1 Overview
        4.2 Synthesizer configuration: cascade and parallel
        4.3 Structure of LSYNTH
    Chapter 5. Automatic formant parameter extraction for parallel formant synthesizers
        5.1 Introduction
        5.2 The idea of a feedback analysis system
        5.3 Overview of the feedback analysis system
        5.4 Iterative spectral matching algorithm
        5.5 Results and discussions
    Chapter 6. Generate formant trajectories in synthesis-by-rule systems
        6.1 Formant trajectories generation in synthesis-by-rule systems
        6.2 Modeling formant transitions
        6.3 Conventional formant transition calculation
        6.4 The 4-point Bezier curve model
        6.5 Modeling of formant transitions for Cantonese
    Chapter 7. Some listening test results
        7.1 Introduction
        7.2 Tone recognition test
        7.3 Cantonese final recognition test
        7.4 Problems and discussions
    Conclusion
    References
    Appendix A: The Cantonese phonetic system
    Appendix B: TPIT, A tone trajectory generator for Cantonese

    Time-domain concatenative text-to-speech synthesis.

    A concatenation framework for time-domain concatenative speech synthesis (TDCSS) is presented and evaluated. In this framework, speech segments are extracted from CV, VC, CVC and CC waveforms, and abutted. Speech rhythm is controlled via a single duration parameter, which specifies the initial portion of each stored waveform to be output. An appropriate choice of segmental durations reduces spectral discontinuity problems at points of concatenation, thus reducing reliance upon smoothing procedures. For text-to-speech considerations, a segmental timing system is described, which predicts segmental durations at the word level, using a timing database and a pattern matching look-up algorithm. The timing database contains segmented words with associated duration values, and is specific to an actual inventory of concatenative units. Segmental duration prediction accuracy improves as the timing database size increases. The problem of incomplete timing data has been addressed by using `default duration' entries in the database, which are created by re-categorising existing timing data according to articulation manner. If segmental duration data are incomplete, a default duration procedure automatically categorises the missing speech segments according to segment class. The look-up algorithm then searches the timing database for duration data corresponding to these re-categorised segments. The timing database is constructed using an iterative synthesis/adjustment technique, in which a `judge' listens to synthetic speech and adjusts segmental durations to improve naturalness. This manual technique for constructing the timing database has been evaluated. Since the timing data is linked to an expert judge's perception, an investigation examined whether the expert judge's perception of speech naturalness is representative of people in general. 
Listening experiments revealed marked similarities between an expert judge's perception of naturalness and that of the experimental subjects. It was also found that the expert judge's perception remains stable over time. A synthesis/adjustment experiment found a positive linear correlation between segmental durations chosen by an experienced expert judge and duration values chosen by subjects acting as expert judges. A listening test confirmed that between 70% and 100% intelligibility can be achieved with words synthesised using TDCSS. In a further test, a TDCSS synthesiser was compared with five well-known text-to-speech synthesisers, and was ranked fifth most natural out of six. An alternative concatenation framework (TDCSS2) was also evaluated, in which duration parameters specify both the start point and the end point of the speech to be extracted from a stored waveform and concatenated. In a similar listening experiment, TDCSS2 stimuli were compared with five well-known text-to-speech synthesisers, and were ranked fifth most natural out of six.
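
The timing look-up with a 'default duration' fallback described in this abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the segment classes, the database layout, the millisecond values, and the names `SEGMENT_CLASS`, `build_default_entries` and `lookup_durations` are all hypothetical, and an exact-match dictionary look-up stands in for the pattern matching algorithm, whose details the abstract does not give.

```python
# Hypothetical mapping from segment to articulation-manner class.
SEGMENT_CLASS = {
    "p": "stop", "t": "stop",
    "s": "fricative", "f": "fricative",
    "a": "vowel", "i": "vowel",
}

# Hypothetical timing database: segmented words mapped to one duration
# value (in ms) per segment.
TIMING_DB = {
    ("s", "i", "t"): [110, 90, 70],
}


def build_default_entries(timing_db):
    """Re-categorise existing timing data by articulation manner."""
    defaults = {}
    for segments, durations in timing_db.items():
        key = tuple(SEGMENT_CLASS[s] for s in segments)
        defaults.setdefault(key, durations)  # keep the first entry seen
    return defaults


def lookup_durations(segments, timing_db, defaults):
    """Return durations for a word, or None if no data can be found."""
    segments = tuple(segments)
    if segments in timing_db:
        return timing_db[segments]
    # Data incomplete: categorise the segments and search the defaults.
    key = tuple(SEGMENT_CLASS[s] for s in segments)
    return defaults.get(key)


DEFAULTS = build_default_entries(TIMING_DB)
print(lookup_durations(["s", "i", "t"], TIMING_DB, DEFAULTS))
print(lookup_durations(["f", "a", "p"], TIMING_DB, DEFAULTS))
```

The second call has no direct entry, so it falls back to the fricative-vowel-stop default derived from the stored word.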

    Phone-based speech synthesis using neural network with articulatory control.

    by Lo Wai Kit.
    Thesis (M.Phil.)--Chinese University of Hong Kong, 1996.
    Includes bibliographical references (leaves 151-160).
    Chapter 1 Introduction
        1.1 Applications of Speech Synthesis
            1.1.1 Human Machine Interface
            1.1.2 Speech Aids
            1.1.3 Text-To-Speech (TTS) system
            1.1.4 Speech Dialogue System
        1.2 Current Status in Speech Synthesis
            1.2.1 Concatenation Based
            1.2.2 Parametric Based
            1.2.3 Articulatory Based
            1.2.4 Application of Neural Network in Speech Synthesis
        1.3 The Proposed Neural Network Speech Synthesis
            1.3.1 Motivation
            1.3.2 Objectives
        1.4 Thesis outline
    Chapter 2 Linguistic Basics for Speech Synthesis
        2.1 Relations between Linguistic and Speech Synthesis
        2.2 Basic Phonology and Phonetics
            2.2.1 Phonology
            2.2.2 Phonetics
            2.2.3 Prosody
        2.3 Transcription Systems
            2.3.1 The Employed Transcription System
        2.4 Cantonese Phonology
            2.4.1 Some Properties of Cantonese
            2.4.2 Initial
            2.4.3 Final
            2.4.4 Lexical Tone
            2.4.5 Variations
        2.5 The Vowel Quadrilaterals
    Chapter 3 Speech Synthesis Technology
        3.1 The Human Speech Production
        3.2 Important Issues in Speech Synthesis System
            3.2.1 Controllability
            3.2.2 Naturalness
            3.2.3 Complexity
            3.2.4 Information Storage
        3.3 Units for Synthesis
        3.4 Type of Synthesizer
            3.4.1 Copy Concatenation
            3.4.2 Vocoder
            3.4.3 Articulatory Synthesis
    Chapter 4 Neural Network Speech Synthesis with Articulatory Control
        4.1 Neural Network Approximation
            4.1.1 The Approximation Problem
            4.1.2 Network Approach for Approximation
        4.2 Artificial Neural Network for Phone-based Speech Synthesis
            4.2.1 Network Approximation for Speech Signal Synthesis
            4.2.2 Feed forward Backpropagation Neural Network
            4.2.3 Radial Basis Function Network
            4.2.4 Parallel Operating Synthesizer Networks
        4.3 Template Storage and Control for the Synthesizer Network
            4.3.1 Implicit Template Storage
            4.3.2 Articulatory Control Parameters
        4.4 Summary
    Chapter 5 Prototype Implementation of the Synthesizer Network
        5.1 Implementation of the Synthesizer Network
            5.1.1 Network Architectures
            5.1.2 Spectral Templates for Training
            5.1.3 System requirement
        5.2 Subjective Listening Test
            5.2.1 Sample Selection
            5.2.2 Test Procedure
            5.2.3 Result
            5.2.4 Analysis
        5.3 Summary
    Chapter 6 Simplified Articulatory Control for the Synthesizer Network
        6.1 Coarticulatory Effect in Speech Production
            6.1.1 Acoustic Effect
            6.1.2 Prosodic Effect
        6.2 Control in various Synthesis Techniques
            6.2.1 Copy Concatenation
            6.2.2 Formant Synthesis
            6.2.3 Articulatory synthesis
        6.3 Articulatory Control Model based on Vowel Quad
            6.3.1 Modeling of Variations with the Articulatory Control Model
        6.4 Voice Correspondence
            6.4.1 For Nasal Sounds - Inter-Network Correspondence
            6.4.2 In Flat-Tongue Space - Intra-Network Correspondence
        6.5 Summary
    Chapter 7 Pause Duration Properties in Cantonese Phrases
        7.1 The Prosodic Feature - Inter-Syllable Pause
        7.2 Experiment for Measuring Inter-Syllable Pause of Cantonese Phrases
            7.2.1 Speech Material Selection
            7.2.2 Experimental Procedure
            7.2.3 Result
        7.3 Characteristics of Inter-Syllable Pause in Cantonese Phrases
            7.3.1 Pause Duration Characteristics for Initials after Pause
            7.3.2 Pause Duration Characteristic for Finals before Pause
            7.3.3 General Observations
            7.3.4 Other Observations
        7.4 Application of Pause-duration Statistics to the Synthesis System
        7.5 Summary
    Chapter 8 Conclusion and Further Work
        8.1 Conclusion
        8.2 Further Extension Work
            8.2.1 Regularization Network Optimized on ISD
            8.2.2 Incorporation of Non-Articulatory Parameters to Control Space
            8.2.3 Experiment on Other Prosodic Features
            8.2.4 Application of Voice Correspondence to Cantonese Coda Discrimination
    Chapter A Cantonese Initials and Finals
        A.1 Tables of All Cantonese Initials and Finals
    Chapter B Using Distortion Measure as Error Function in Neural Network
        B.1 Formulation of Itakura-Saito Distortion Measure for Neural Network Error Function
        B.2 Formulation of a Modified Itakura-Saito Distortion (MISD) Measure for Neural Network Error Function
    Chapter C Orthogonal Least Square Algorithm for RBFNet Training
        C.1 Orthogonal Least Squares Learning Algorithm for Radial Basis Function Network Training
    Chapter D Phrase Lists
        D.1 Two-Syllable Phrase List for the Pause Duration Experiment
            D.1.1 Two-character words (兩字詞)
        D.2 Three/Four-Syllable Phrase List for the Pause Duration Experiment
            D.2.1 Phrases (片語)