
40 research outputs found

Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

Author: Longster Jennifer Ann

This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the 'best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of 'buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co-occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion.
The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processing distortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes, trained using the experimental data collected during the listening tests. The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence-level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes.
The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence-level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggests that the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community.

Bournemouth University Research Online

Diphthong Synthesis using the Three-Dimensional Dynamic Digital Waveguide Mesh

Author: Gully Amelia J
Publication venue: University of York
Publication date: 01/09/2017

The human voice is a complex and nuanced instrument, and despite many years of research, no system is yet capable of producing natural-sounding synthetic speech. This affects intelligibility for some groups of listeners, in applications such as automated announcements and screen readers. Furthermore, those who require a computer to speak - due to surgery or a degenerative disease - are limited to unnatural-sounding voices that lack expressive control and may not match the user's gender, age or accent. It is evident that natural, personalised and controllable synthetic speech systems are required. A three-dimensional digital waveguide model of the vocal tract, based on magnetic resonance imaging data, is proposed here in order to address these issues. The model uses a heterogeneous digital waveguide mesh method to represent the vocal tract airway and surrounding tissues, facilitating dynamic movement and hence speech output. The accuracy of the method is validated by comparison with audio recordings of natural speech, and perceptual tests are performed which confirm that the proposed model sounds significantly more natural than simpler digital waveguide mesh vocal tract models. Control of such a model is also considered, and a proof-of-concept study is presented using a deep neural network to control the parameters of a two-dimensional vocal tract model, resulting in intelligible speech output and paving the way for extension of the control system to the proposed three-dimensional vocal tract model. Future improvements to the system are also discussed in detail. This project considers both the naturalness and control issues associated with synthetic speech and therefore represents a significant step towards improved synthetic speech for use across society.

White Rose E-theses Online

Low bit rate speech communication based on charge coupled device Fourier transform processors

Author: Davie Malcolm Craig
Publication venue: The University of Edinburgh
Publication date: 01/01/1980

Edinburgh Research Archive

Concatenative speech synthesis : a framework for reducing perceived distortion when using the TD-PSOLA algorithm

Author: Longster Jennifer Ann
Publication date: 01/01/2003

OpenGrey Repository

On the design of visual feedback for the rehabilitation of hearing-impaired speech

Author: Carraro Fabrizio
Publication venue: The University of Edinburgh
Publication date: 01/01/1997

Edinburgh Research Archive

Spherical near field acoustic holography with microphones on a rigid sphere: Abstract of paper

Authors: Fernandez Grande Efren; Hald Jørgen; Jacobsen Finn; Moreno Guillermo
Publication venue: Acoustical Society of America (ASA)
Publication date: 01/01/2008

Crossref

Online Research Database In Technology

A virtual auditory environment for investigating the auditory signal processing of realistic sounds

Authors: Buchholz Jörg; Favrot Sylvain Emmanuel
Publication venue: Acoustical Society of America (ASA)
Publication date: 01/01/2008

Online Research Database In Technology

The perceptual flow of phonetic feature processing

Authors: Christiansen Thomas Ulrich; Greenberg Steven
Publication venue: Acoustical Society of America (ASA)
Publication date: 01/01/2008

Crossref

Online Research Database In Technology

Amplitude modulation depth discrimination in hearing-impaired and normal-hearing listeners

Authors: Dau Torsten; Ewert Stephan D.; Verhey Jesko; Volmer Jutta
Publication venue: Acoustical Society of America (ASA)
Publication date: 01/01/2008

Crossref

Online Research Database In Technology

Speaker comfort and increase of voice level in lecture rooms

Authors: Anders C. Gade; Gaspar Payà Bellester; Jonas Brunskog; Lilian Reig Calbo
Publication venue: Acoustical Society of America (ASA)
Publication date: 01/01/2008

Crossref

Online Research Database In Technology