24 research outputs found

    Speech synthesis based on a harmonic model

    The wide range of potential commercial applications for a computer system capable of automatically converting text to speech (TTS) has stimulated decades of research. One of the currently most successful approaches to synthesising speech, concatenative TTS synthesis, combines prerecorded speech units to build full utterances. However, the prosody of the stored units is often not consistent with that of the target utterance and must be altered. Furthermore, several types of mismatch can occur at unit boundaries and must be smoothed. Thus, pitch and time-scale modification techniques as well as smoothing algorithms play a critical role in all concatenative-based systems. This thesis presents the development of a concatenative TTS system based on a harmonic model and incorporating new pitch and time-scaling as well as smoothing algorithms. Experiment has shown our system capable of both very high quality prosodic modification and synthesis. Results compare very favourably with those of existing state-of-the-art systems.

    Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

    This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the 'best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of 'buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co-occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion.
The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processing distortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes, trained using the experimental data collected during the listening tests. The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence-level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes.
The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence-level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggests that the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community.
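
The abstract above does not spell out how TD-PSOLA changes pitch, so the following is only a minimal sketch of the general technique, not the thesis's implementation. It assumes the pitch marks are already known and roughly evenly spaced, and it picks the nearest analysis mark for each synthesis position. The names `hann` and `psola_pitch_shift` are hypothetical.

```python
import math


def hann(length):
    """Symmetric Hann window; length is assumed to be at least 3."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def psola_pitch_shift(signal, marks, factor):
    """Overlap-add two-period grains at a new pitch-mark spacing.

    signal: list of samples
    marks:  analysis pitch marks (increasing sample indices, at least two)
    factor: > 1 raises the pitch, < 1 lowers it; duration is kept
    """
    if len(marks) < 2:
        raise ValueError("need at least two pitch marks")
    out = [0.0] * len(signal)
    pos = float(marks[0])
    while pos <= marks[-1]:
        # Pick the analysis mark closest to the current synthesis position.
        i = min(range(len(marks)), key=lambda j: abs(marks[j] - pos))
        # Local pitch period, measured to the neighbouring mark.
        if i + 1 < len(marks):
            period = marks[i + 1] - marks[i]
        else:
            period = marks[i] - marks[i - 1]
        window = hann(2 * period + 1)
        centre = int(round(pos))
        # Copy the windowed grain from around the analysis mark
        # to around the synthesis position.
        for k in range(-period, period + 1):
            src = marks[i] + k
            dst = centre + k
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += window[k + period] * signal[src]
        # Synthesis marks are spaced one scaled period apart.
        pos += period / factor
    return out


# Toy input: five harmonics with a 50-sample period, marks on the peaks.
PERIOD = 50
voiced = [sum(math.cos(2.0 * math.pi * h * n / PERIOD) / h for h in range(1, 6))
          for n in range(500)]
pitch_marks = list(range(50, 451, 50))
octave_up = psola_pitch_shift(voiced, pitch_marks, 2.0)
```

With `factor` set to 1.0 the grains overlap-add back to the input between the first and last marks. Larger modification factors reuse or skip grains, which is one place where audible distortion of the kind studied in the thesis can arise.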

    An Investigation of nonlinear speech synthesis and pitch modification techniques

    Speech synthesis technology plays an important role in many aspects of man-machine interaction, particularly in telephony applications. In order to be widely accepted, the synthesised speech quality should be as human-like as possible. This thesis investigates novel techniques for the speech signal generation stage in a speech synthesiser, based on concepts from nonlinear dynamical theory. It focuses on natural-sounding synthesis for voiced speech, coupled with the ability to generate the sound at the required pitch. The one-dimensional voiced speech time-domain signals are embedded into an appropriate higher dimensional space, using Takens’ method of delays. These reconstructed state space representations have approximately the same dynamical properties as the original speech generating system and are thus effective models. A new technique for marking epoch points in voiced speech that operates in the state space domain is proposed. Using the fact that one revolution of the state space representation is equal to one pitch period, pitch synchronous points can be found using a Poincaré map. Evidently the epoch pulses are pitch synchronous and therefore can be marked. The same state space representation is also used in a locally-linear speech synthesiser. This models the nonlinear dynamics of the speech signal by a series of local approximations, using the original signal as a template. The synthesised speech is natural-sounding because, rather than simply copying the original data, the technique makes use of the local dynamics to create a new, unique signal trajectory. Pitch modification within this synthesis structure is also investigated, with an attempt made to exploit the Šilnikov-type orbit of voiced speech state space reconstructions. However, this technique is found to be incompatible with the locally-linear modelling technique, leaving the pitch modification issue unresolved.
A different modelling strategy, using a radial basis function neural network to model the state space dynamics, is then considered. This produces a parametric model of the speech sound. Synthesised speech is obtained by connecting a delayed version of the network output back to the input via a global feedback loop. The network then synthesises speech in a free-running manner. Stability of the output is ensured by using regularisation theory when learning the weights. Complexity is also kept to a minimum because the network centres are fixed on a data-independent hyper-lattice, so only the linear-in-the-parameters weights need to be learnt for each vowel realisation. Pitch modification is again investigated, based around the idea of interpolating the weight vector between different realisations of the same vowel, but at differing pitch values. However, modelling the inter-pitch weight vector variations is very difficult, indicating that further study of pitch modification techniques is required before a complete nonlinear synthesiser can be implemented.
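
The method of delays and the Poincaré-section idea mentioned in this abstract can be sketched in a few lines. This is an illustrative sketch only: the embedding dimension, the delay, the choice of section plane and the function names `delay_embed` and `section_crossings` are all assumptions, and a pure sine stands in for voiced speech.

```python
import math


def delay_embed(signal, dim, tau):
    """Build delay vectors (x[n], x[n + tau], ..., x[n + (dim - 1) * tau])."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive")
    count = len(signal) - (dim - 1) * tau
    if count <= 0:
        raise ValueError("signal too short for this embedding")
    return [tuple(signal[n + k * tau] for k in range(dim)) for n in range(count)]


def section_crossings(points, axis=0, level=0.0):
    """Indices where the trajectory crosses the plane coordinate[axis] == level upwards.

    One crossing per revolution of the orbit gives one marker per period.
    """
    return [i + 1 for i in range(len(points) - 1)
            if points[i][axis] < level <= points[i + 1][axis]]


# Toy periodic signal with a 50-sample period (half-sample offset avoids exact zeros).
PERIOD = 50
samples = [math.sin(2.0 * math.pi * (n + 0.5) / PERIOD) for n in range(300)]

orbit = delay_embed(samples, dim=3, tau=12)   # dim and tau are illustrative choices
epochs = section_crossings(orbit)
spacings = [b - a for a, b in zip(epochs, epochs[1:])]
```

On this toy signal every entry of `spacings` equals the period. Real voiced speech is only approximately periodic, so the spacing would vary from cycle to cycle and the section plane would need to be chosen with more care.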

    Real-Time Polyphonic Octave Doubling for the Guitar

    This thesis studies digital signal processing solutions for enriching live guitar sound by mixing in octave-doubled versions of the chords and melodies performed on the instrument in real time. Following a review of techniques applicable to real-time polyphonic octave doubling, four candidate solutions are proposed, two of which are novel methods: ERB-PS2 and ERB-SSM2. The performance of these candidates is compared to that of three state-of-the-art effect pedals on the market. In particular, an evaluation of the added roughness and transient alterations introduced by each solution in the output sound is conducted. The ERB-PS2 method, which consists in doubling the instantaneous phases of the sub-band signals extracted with a constant-ERB-bandwidth non-decimated IIR filter bank, is found to provide the best overall performance amongst the candidates. This novel solution provides greatly reduced latency compared to the baseline pedals, with comparable, and in some cases improved, sound quality.
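
The phase-doubling step described for ERB-PS2 can be illustrated on a single band. The sketch below is not the thesis's method: it replaces the IIR filter bank with a whole-signal analytic signal computed by a naive DFT, so it is neither real-time nor polyphonic. The helper names `dft`, `analytic_signal`, `double_phase` and `dominant_bin` are hypothetical.

```python
import cmath
import math


def dft(x, sign):
    """Naive O(N^2) DFT; sign=-1 is the forward transform, sign=+1 the unscaled inverse."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def analytic_signal(x):
    """Complex signal whose real part is x and whose negative frequencies are removed."""
    n = len(x)
    half = n // 2
    spectrum = dft(x, -1)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == half):
            gain = 1.0      # keep DC and Nyquist as they are
        elif k <= half:
            gain = 2.0      # double the positive frequencies
        else:
            gain = 0.0      # discard the negative frequencies
        spectrum[k] *= gain
    return [z / n for z in dft(spectrum, +1)]


def double_phase(band):
    """Keep the instantaneous amplitude of the band and double its instantaneous phase."""
    return [abs(z) * math.cos(2.0 * cmath.phase(z)) for z in analytic_signal(band)]


def dominant_bin(x):
    """Index of the strongest DFT bin below Nyquist, excluding DC."""
    spectrum = dft(x, -1)
    return max(range(1, len(x) // 2), key=lambda k: abs(spectrum[k]))


# A tone with 8 cycles in 128 samples stands in for one narrow sub-band.
N = 128
tone = [math.cos(2.0 * math.pi * 8 * t / N) for t in range(N)]
doubled = double_phase(tone)
```

For this single tone the spectral peak moves from bin 8 to bin 16, one octave up. Applying the same step to a wide band that contains several partials would create intermodulation products, which is presumably why the thesis splits the signal into narrow sub-bands first.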

    Text-Independent Voice Conversion

    This thesis deals with text-independent solutions for voice conversion. It first introduces the use of vocal tract length normalization (VTLN) for voice conversion. The presented variants of VTLN allow for easily changing speaker characteristics by means of a few trainable parameters. Furthermore, it is shown how VTLN can be expressed in the time domain, strongly reducing the computational costs while keeping a high speech quality. The second text-independent voice conversion paradigm is residual prediction. In particular, two proposed techniques, residual smoothing and the application of unit selection, result in substantial improvements in both speech quality and voice similarity. In order to apply the well-studied linear transformation paradigm to text-independent voice conversion, two text-independent speech alignment techniques are introduced. One is based on automatic segmentation and mapping of artificial phonetic classes and the other is a completely data-driven approach with unit selection. The latter achieves a performance very similar to the conventional text-dependent approach in terms of speech quality and similarity. It is also successfully applied to cross-language voice conversion. The investigations of this thesis are based on several corpora of three different languages, i.e., English, Spanish, and German. Results are also presented from the multilingual voice conversion evaluation in the framework of the international speech-to-speech translation project TC-Star.

    How touch and hearing influence visual processing in sensory substitution, synaesthesia and cross-modal correspondences

    Sensory substitution devices (SSDs) systematically turn visual dimensions into patterns of tactile or auditory stimulation. After training, a user of these devices learns to translate these audio or tactile sensations back into a mental visual picture. Most previous SSDs translate greyscale images using intuitive cross-sensory mappings to help users learn the devices. However, more recent SSDs have started to incorporate additional colour dimensions such as saturation and hue. Chapter two examines how previous SSDs have translated the complexities of colour into hearing or touch. The chapter explores if colour is useful for SSD users, how SSD and veridical colour perception differ and how optimal cross-sensory mappings might be considered. After long-term training, some blind users of SSDs report visual sensations from tactile or auditory stimulation. A related phenomenon is that of synaesthesia, a condition where stimulation of one modality (e.g. touch) produces an automatic, consistent and vivid sensation in another modality (e.g. vision). Tactile-visual synaesthesia is an extremely rare variant that can shed light on how the tactile-visual system is altered when touch can elicit visual sensations. Chapter three reports a series of investigations on the tactile discrimination abilities and phenomenology of tactile-vision synaesthetes, alongside questionnaire data from synaesthetes unavailable for testing. Chapter four introduces a new SSD to test if the presentation of colour information in sensory substitution affects object and colour discrimination. Chapter five presents experiments on intuitive auditory-colour mappings across a wide variety of sounds. These findings are used to predict the reported colour hallucinations resulting from LSD use while listening to these sounds. Chapter six uses a new sensory substitution device designed to test the utility of these intuitive sound-colour links for visual processing.
These findings are discussed with reference to how cross-sensory links, LSD and synaesthesia can inform optimal SSD design for visual processing.

    The Role Of Lexical Contrast In The Perception Of Intonational Prominence In Japanese

    In this dissertation, I examine the effects of lexical accent on the perception of intonational prominence in Japanese. I look at how an F0 accent peak is perceived relative to another flanking F0 peak in the same utterance with respect to perceived intonational prominence. Through four experiments, I show that the lexical prosodic structure plays a significant role in the perception of intonational prominence. I first show that two distinct perceptual processes are at play in the perception of relative perceived prominence in Japanese: accentual boost normalization and downstep normalization. Accentual boost normalization normalizes the accentual boost of an accented word. In this process, the extra F0 boost assigned by a lexical accent does not count as part of the F0 peak's excursion that contributes to the perceived prominence of the F0 peak. I demonstrate that when an accented word and an unaccented word are perceived as having the same prominence, the accented word has a higher F0 peak value than the unaccented word does. Downstep normalization compensates for the production effect of downstep, a pitch range compression phenomenon after a lexical accent. Experiments show that for an F0 peak to be perceived as having equivalent prominence to a preceding F0 peak, the second peak is always lower in F0 when the first word is accented than when it is unaccented. This suggests the existence of a perceptual process that normalizes the effect of downstep. I then examine the nature of accentual boost normalization and downstep normalization and show that they refer to two distinct types of lexical accent property when they are applied. One is the phonetic F0 contour shape that is characteristic of accented words. The other is the phonological lexical accent information that is uniquely specified for accented words.
The experimental results show that the perceptual effects of the normalization processes are seen when only the phonological lexical accent information of a word is present with its F0 contour shape being ambiguous as well as when the same word is acoustically manipulated into different F0 contour shapes.