21 research outputs found

    An Investigation of nonlinear speech synthesis and pitch modification techniques

    Speech synthesis technology plays an important role in many aspects of man-machine interaction, particularly in telephony applications. In order to be widely accepted, the synthesised speech quality should be as human-like as possible. This thesis investigates novel techniques for the speech signal generation stage in a speech synthesiser, based on concepts from nonlinear dynamical theory. It focuses on natural-sounding synthesis for voiced speech, coupled with the ability to generate the sound at the required pitch. The one-dimensional voiced speech time-domain signals are embedded into an appropriate higher dimensional space, using Takens' method of delays. These reconstructed state space representations have approximately the same dynamical properties as the original speech generating system and are thus effective models. A new technique for marking epoch points in voiced speech that operates in the state space domain is proposed. Using the fact that one revolution of the state space representation is equal to one pitch period, pitch synchronous points can be found using a Poincaré map. Evidently the epoch pulses are pitch synchronous and therefore can be marked. The same state space representation is also used in a locally-linear speech synthesiser. This models the nonlinear dynamics of the speech signal by a series of local approximations, using the original signal as a template. The synthesised speech is natural-sounding because, rather than simply copying the original data, the technique makes use of the local dynamics to create a new, unique signal trajectory. Pitch modification within this synthesis structure is also investigated, with an attempt made to exploit the Šilnikov-type orbit of voiced speech state space reconstructions. However, this technique is found to be incompatible with the locally-linear modelling technique, leaving the pitch modification issue unresolved.
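The state space reconstruction described above can be sketched in a few lines. This is a minimal, illustrative implementation of Takens' method of delays; the embedding dimension and delay below are placeholder values, not those chosen in the thesis.

```python
import math

def delay_embed(signal, dim, tau):
    """Takens' method of delays: map a 1-D signal to points
    [x(n), x(n - tau), ..., x(n - (dim - 1) * tau)] in a
    dim-dimensional reconstructed state space."""
    start = (dim - 1) * tau
    return [[signal[n - k * tau] for k in range(dim)]
            for n in range(start, len(signal))]

# A periodic toy waveform standing in for voiced speech.
x = [math.sin(2 * math.pi * n / 50) for n in range(500)]
trajectory = delay_embed(x, dim=3, tau=12)
```

One revolution of the reconstructed trajectory corresponds to one pitch period, which is the property the Poincaré-map epoch marker exploits.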
A different modelling strategy, using a radial basis function neural network to model the state space dynamics, is then considered. This produces a parametric model of the speech sound. Synthesised speech is obtained by connecting a delayed version of the network output back to the input via a global feedback loop. The network then synthesises speech in a free-running manner. Stability of the output is ensured by using regularisation theory when learning the weights. Complexity is also kept to a minimum because the network centres are fixed on a data-independent hyper-lattice, so only the linear-in-the-parameters weights need to be learnt for each vowel realisation. Pitch modification is again investigated, based around the idea of interpolating the weight vector between different realisations of the same vowel, but at differing pitch values. However, modelling the inter-pitch weight vector variations is very difficult, indicating that further study of pitch modification techniques is required before a complete nonlinear synthesiser can be implemented.
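The free-running synthesis loop has a simple skeleton: predict the next sample from the current delay-line state, then feed it back. The Gaussian centres, width and weights below are placeholders rather than anything produced by the regularised training the thesis describes.

```python
import math

def rbf_predict(state, centres, weights, width=1.0):
    """RBF network output: a weighted sum of Gaussian bumps on
    fixed centres -- linear in the weights."""
    return sum(w * math.exp(-sum((s - c) ** 2 for s, c in zip(state, cen))
                            / (2 * width ** 2))
               for w, cen in zip(weights, centres))

def free_run(x0, centres, weights, steps):
    """Global feedback loop: the delayed network output is shifted
    back into the input state, so the network synthesises a
    signal in a free-running manner."""
    state = list(x0)
    output = []
    for _ in range(steps):
        y = rbf_predict(state, centres, weights)
        output.append(y)
        state = [y] + state[:-1]    # delay-line update
    return output
```

Because the model is linear in the weights, training reduces to (regularised) least squares once the centres are fixed, which is what keeps the per-vowel cost low.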

    Fractal based speech recognition and synthesis

    Transmitting a linguistic message is most often the primary purpose of speech communication, and it is the recognition of this message by machine that would be most useful. This research consists of two major parts. The first part presents a novel and promising approach for estimating the degree of recognition of speech phonemes and makes use of a new set of features based on fractals. The main methods of computing the fractal dimension of speech signals are reviewed and a new speaker-independent speech recognition system developed at De Montfort University is described in detail. Finally, a Least Squares Method as well as a novel Neural Network algorithm is employed to derive the recognition performance of the speech data. The second part of this work studies the synthesis of speech words, which is based mainly on the fractal dimension, to create natural-sounding speech. The work shows that by careful use of the fractal dimension together with the phase of the speech signal to ensure consistent intonation contours, natural-sounding speech synthesis is achievable with word-level speech. In order to extend the flexibility of this framework, we focused on the filtering and the compression of the phase to maintain and produce natural-sounding speech. A 'naturalness level' is achieved as a result of the fractal characteristic used in the synthesis process. Finally, a novel speech synthesis system based on fractals developed at De Montfort University is discussed. Throughout our research, simulation experiments were performed on continuous speech data available from the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database, which is designed to provide the speech research community with a standardised corpus for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.
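The abstract does not say which fractal-dimension estimators are reviewed, but one widely used method for 1-D signals, Higuchi's algorithm, gives the flavour. A sketch, with an illustrative `kmax`:

```python
import math

def higuchi_fd(x, kmax=8):
    """Estimate the fractal dimension of a 1-D signal with
    Higuchi's method: average curve lengths L(k) at scales k
    scale as k**(-D) for a signal of fractal dimension D."""
    n = len(x)
    logs = []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):
            num = (n - m - 1) // k          # segments in this subseries
            if num < 1:
                continue
            dist = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                       for i in range(1, num + 1))
            lengths.append(dist * (n - 1) / (num * k * k))
        logs.append((math.log(1.0 / k), math.log(sum(lengths) / len(lengths))))
    # least-squares slope of log L(k) against log(1/k) gives D
    mx = sum(p[0] for p in logs) / len(logs)
    my = sum(p[1] for p in logs) / len(logs)
    return (sum((p[0] - mx) * (p[1] - my) for p in logs) /
            sum((p[0] - mx) ** 2 for p in logs))

ramp = [0.01 * i for i in range(1000)]      # a smooth test signal
```

For a straight-line signal this estimator returns a dimension of 1, as expected; rougher, noise-like signals score closer to 2.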

    Nonlinear analysis of speech from a synthesis perspective

    With the emergence of nonlinear dynamical systems analysis over recent years it has become clear that conventional time domain and frequency domain approaches to speech synthesis may be far from optimal. Using state space reconstructions of the time domain speech signal it is, at least in theory, possible to investigate a number of invariant geometrical measures for the underlying system which give a more thorough understanding of the dynamics of the system and therefore the form that any model should take. This thesis introduces a number of nonlinear dynamical analysis tools which are then applied to a database of vowels to extract the underlying invariant geometrical properties. The results of this analysis are then applied, using ideas taken from nonlinear dynamics, to the problem of speech synthesis and a novel synthesis technique is described and demonstrated. The tools used for the analysis are time delay embedding, singular value decomposition, correlation dimension, local singular value analysis, Lyapunov spectra and short term prediction properties. Although there have been many papers written about these tools, and algorithms proposed, there are currently no generally accepted techniques, especially for the calculation of Lyapunov spectra in the presence of noise and data length limitations. This thesis introduces all of the above tools and looks in detail at Lyapunov exponents, and two major novel modifications are proposed that are demonstrated to be more robust than conventional techniques. The novel robust techniques are applied to a large database of vowel sounds, showing that the vowels tested exhibit evidence of nonlinear, low-dimensional, non-chaotic behaviour. It is particularly the evidence of non-chaotic behaviour that is of importance from a synthesis point of view and is used in the final section of the thesis which introduces a novel synthesis technique.
The synthesis technique, which is based on ideas taken from nonlinear dynamics theory, is detailed and demonstrated, showing that it is capable of producing high-quality, natural-sounding speech.
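One of the invariant measures listed, the correlation dimension, can be estimated from a reconstructed trajectory with the Grassberger-Procaccia correlation sum. This naive O(n²) sketch is not one of the noise-robust estimators developed in the thesis:

```python
import math

def correlation_sum(points, r):
    """C(r): fraction of point pairs closer than radius r.
    For small r, C(r) ~ r**D where D is the correlation dimension."""
    n = len(points)
    close = sum(1 for i in range(n) for j in range(i + 1, n)
                if math.dist(points[i], points[j]) < r)
    return 2.0 * close / (n * (n - 1))

def correlation_dimension(points, r1, r2):
    """Two-scale slope estimate of log C(r) against log r."""
    c1, c2 = correlation_sum(points, r1), correlation_sum(points, r2)
    return (math.log(c2) - math.log(c1)) / (math.log(r2) - math.log(r1))

# Points on a line segment in 3-D should give a dimension near 1.
line = [(i / 200.0, 0.0, 0.0) for i in range(200)]
```

Real estimates fit the slope over many radii and must contend with noise and short data, which is exactly where the thesis's robust modifications come in.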

    An investigation into glottal waveform based speech coding

    Coding of voiced speech by extraction of the glottal waveform has shown promise in improving the efficiency of speech coding systems. This thesis describes an investigation into the performance of such a system. The effect of reverberation on the radiation impedance at the lips is shown to be negligible under normal conditions. Also, the accuracy of the Image Method for adding artificial reverberation to anechoic speech recordings is established. A new algorithm, Pre-emphasised Maximum Likelihood Epoch Detection (PMLED), for Glottal Closure Instant detection is proposed. The algorithm is tested on natural speech and is shown to be both accurate and robust. Two techniques for glottal waveform estimation, Closed Phase Inverse Filtering (CPIF) and Iterative Adaptive Inverse Filtering (IAIF), are compared. In tandem with an LF model fitting procedure, both techniques display a high degree of accuracy. However, IAIF is found to be slightly more robust. Based on these results, a Glottal Excited Linear Predictive (GELP) coding system for voiced speech is proposed and tested. Using a differential LF parameter quantisation scheme, the system achieves speech quality similar to that of U.S. Federal Standard 1016 CELP at a lower mean bit rate while incurring no extra delay.
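Both CPIF and IAIF rest on the same idea: estimate the vocal tract as an all-pole filter and apply its inverse to the speech signal, leaving the glottal excitation. A toy sketch with made-up filter coefficients and an impulse train standing in for the true glottal waveform (no LF model fitting):

```python
def all_pole_filter(excitation, a):
    """Synthesise speech-like output s[n] = e[n] + sum_k a[k] * s[n-k]."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s

def inverse_filter(s, a):
    """Recover the excitation: e[n] = s[n] - sum_k a[k] * s[n-k]."""
    out = []
    for n in range(len(s)):
        acc = s[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        out.append(acc)
    return out

# One impulse per "pitch period" as a stand-in glottal excitation.
period = 40
exc = [1.0 if n % period == 0 else 0.0 for n in range(200)]
a = [1.2, -0.8]                 # illustrative (stable) vocal-tract coefficients
speech = all_pole_filter(exc, a)
recovered = inverse_filter(speech, a)
```

Because the inverse filter here uses the exact synthesis coefficients, the excitation is recovered almost perfectly; in practice the coefficients must themselves be estimated, which is where CPIF and IAIF differ.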

    Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning

    Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model, target approximation (TA), between text and speech that controls syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered as a dynamic process of target approximation and that the TA model is a valid F0 generation model to be used at the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the number of required linguistic features was reduced, first to syllable level only (with DNN) and then with all positional information removed (with RNN).
Fewer linguistic features as input, with a limited number of TA parameters as output, led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.
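The target approximation idea can be caricatured with a first-order system: within each syllable, F0 decays exponentially toward that syllable's pitch target, and the value reached at the boundary carries over to the next syllable. The actual TA model is third-order with linear (dynamic) targets; the rate, sampling and target values below are illustrative only.

```python
import math

def ta_f0(targets, durations, rate=40.0, fs=200, f0_init=120.0):
    """First-order sketch of target approximation: within each
    syllable, F0 relaxes exponentially from its current value
    toward the syllable's static pitch target (Hz); the value
    reached at the syllable boundary transfers to the next one."""
    f0 = f0_init
    contour = []
    for target, dur in zip(targets, durations):
        n = int(dur * fs)
        start = f0
        for i in range(n):
            t = i / fs
            f0 = target + (start - target) * math.exp(-rate * t)
            contour.append(f0)
    return contour

# Two syllables with targets at 100 Hz then 150 Hz, 200 ms each.
contour = ta_f0([100.0, 150.0], [0.2, 0.2])
```

The carry-over of the boundary state is what makes the generated contour continuous across syllables, mirroring the articulatory constraint the thesis builds on.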

    Models and analysis of vocal emissions for biomedical applications

    This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years and aims to stimulate contacts between specialists active in research and industrial developments in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies.

    Diphthong Synthesis using the Three-Dimensional Dynamic Digital Waveguide Mesh

    The human voice is a complex and nuanced instrument, and despite many years of research, no system is yet capable of producing natural-sounding synthetic speech. This affects intelligibility for some groups of listeners, in applications such as automated announcements and screen readers. Furthermore, those who require a computer to speak, due to surgery or a degenerative disease, are limited to unnatural-sounding voices that lack expressive control and may not match the user's gender, age or accent. It is evident that natural, personalised and controllable synthetic speech systems are required. A three-dimensional digital waveguide model of the vocal tract, based on magnetic resonance imaging data, is proposed here in order to address these issues. The model uses a heterogeneous digital waveguide mesh method to represent the vocal tract airway and surrounding tissues, facilitating dynamic movement and hence speech output. The accuracy of the method is validated by comparison with audio recordings of natural speech, and perceptual tests are performed which confirm that the proposed model sounds significantly more natural than simpler digital waveguide mesh vocal tract models. Control of such a model is also considered, and a proof-of-concept study is presented using a deep neural network to control the parameters of a two-dimensional vocal tract model, resulting in intelligible speech output and paving the way for extension of the control system to the proposed three-dimensional vocal tract model. Future improvements to the system are also discussed in detail. This project considers both the naturalness and control issues associated with synthetic speech and therefore represents a significant step towards improved synthetic speech for use across society.
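The core update of a rectilinear digital waveguide mesh can be written in its equivalent finite-difference form. This homogeneous 2-D sketch handles interior nodes only, omitting the boundary conditions and tissue-impedance variation that the heterogeneous 3-D model described above requires:

```python
def dwm_step(p, p_prev):
    """One time step of a 2-D rectilinear digital waveguide mesh,
    interior nodes only, via the equivalent finite-difference
    update: p_next = (sum of the 4 neighbours) / 2 - p_prev."""
    rows, cols = len(p), len(p[0])
    p_next = [row[:] for row in p]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            p_next[i][j] = (p[i - 1][j] + p[i + 1][j] +
                            p[i][j - 1] + p[i][j + 1]) / 2.0 - p_prev[i][j]
    return p_next

# Excite the centre of a small mesh and let the pressure wave spread.
p_prev = [[0.0] * 9 for _ in range(9)]
p = [[0.0] * 9 for _ in range(9)]
p[4][4] = 1.0
p_next = dwm_step(p, p_prev)
```

Iterating `dwm_step` propagates the pressure impulse outward; a vocal-tract model additionally shapes the mesh geometry and boundary impedances to match the airway.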

    Statistical models for natural sounds

    It is important to understand the rich structure of natural sounds in order to solve important tasks, like automatic speech recognition, and to understand auditory processing in the brain. This thesis takes a step in this direction by characterising the statistics of simple natural sounds. We focus on the statistics because perception often appears to depend on them, rather than on the raw waveform. For example, the perception of auditory textures, like running water, wind, fire and rain, depends on summary statistics, like the rate of falling rain droplets, rather than on the exact details of the physical source. In order to analyse the statistics of sounds accurately, it is necessary to improve a number of traditional signal processing methods, including those for amplitude demodulation, time-frequency analysis, and sub-band demodulation. These estimation tasks are ill-posed, and therefore it is natural to treat them as Bayesian inference problems. The new probabilistic versions of these methods have several advantages. For example, they perform more accurately on natural signals and are more robust to noise; they can also fill in missing sections of data and provide error bars. Furthermore, free parameters can be learned from the signal. Using these new algorithms, we demonstrate that the energy, sparsity, modulation depth and modulation time-scale in each sub-band of a signal are critical statistics, together with the dependencies between the sub-band modulators. In order to validate this claim, a model containing co-modulated coloured noise carriers is shown to be capable of generating a range of realistic-sounding auditory textures. Finally, we explore the connection between the statistics of natural sounds and perception. We demonstrate that inference in the model for auditory textures qualitatively replicates the primitive grouping rules that listeners use to understand simple acoustic scenes.
This suggests that the auditory system is optimised for the statistics of natural sounds.
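As a crude stand-in for the probabilistic amplitude demodulation developed in the thesis, a rectify-and-smooth envelope follower already exposes one of the summary statistics mentioned, modulation depth. All signal and parameter values below are illustrative:

```python
import math

def envelope(x, alpha=0.95):
    """Crude amplitude demodulation: full-wave rectify, then
    one-pole low-pass smooth (a classical baseline, not the
    Bayesian estimator the thesis develops)."""
    env, e = [], 0.0
    for v in x:
        e = alpha * e + (1 - alpha) * abs(v)
        env.append(e)
    return env

def modulation_depth(env):
    """Coefficient of variation of the envelope: one summary
    statistic of a sub-band modulator."""
    m = sum(env) / len(env)
    var = sum((v - m) ** 2 for v in env) / len(env)
    return math.sqrt(var) / m

# A slowly modulated tone versus a steady one.
n = 4000
steady = [math.sin(0.3 * t) for t in range(n)]
modulated = [(1 + 0.9 * math.sin(0.005 * t)) * math.sin(0.3 * t)
             for t in range(n)]
```

The probabilistic treatment matters precisely because this classical version has no principled way to choose `alpha`, handle noise, or fill in missing data.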

    A study on reusing resources of speech synthesis for closely-related languages

    This thesis describes research on building a text-to-speech (TTS) framework that can accommodate the lack of linguistic information of under-resourced languages by using existing resources from another language. It describes the adaptation process required when such limited resources are used. The main natural languages involved in this research are Malay and the Iban language. The thesis includes a study on grapheme-to-phoneme mapping and the substitution of phonemes. A set of substitution matrices is presented which shows the phoneme confusion in terms of perception among respondents. The experiments conducted study intelligibility as well as perception based on the context of utterances. The study on phonetic prosody is then presented and compared to the Klatt duration model, in order to find similarities with a cross-language duration model, if one exists. Then a comparative study of an Iban native speaker with an Iban polyglot TTS using Malay resources is presented, to confirm that the prosody of Malay can be used to generate Iban synthesised speech. The central hypothesis of this thesis is that by using a closely-related language resource, natural-sounding speech can be produced. The aim of this research was to show that by sticking to the indigenous language characteristics, it is possible to build a polyglot synthesised speech system even with insufficient speech resources.

    Statistical identification of articulatory roles in speech production.

    The human speech apparatus is a rich source of information and offers many cues in the speech signal due to its biomechanical constraints and physiological interdependencies. Coarticulation, a direct consequence of these speech production factors, is one of the main problems affecting the performance of speech systems. Incorporation of production knowledge could potentially benefit speech recognisers and synthesisers. Hand coded rules and scores derived from the phonological knowledge used by production oriented models of speech are simple and incomplete representations of the complex speech production process. Statistical models built from measurements of speech articulation fail to identify the cause of constraints. There is a need for building explanatory yet descriptive models of articulation for understanding and modelling the effects of coarticulation. This thesis aims at providing compact descriptive models of realistic speech articulation by identifying and capturing the essential characteristics of human articulators using measurements from electro-magnetic articulography. The constraints on articulators during speech production are identified in the form of critical, dependent and redundant roles using entirely statistical and data-driven methods. The critical role captures the maximally constrained target driven behaviour of an articulator. The dependent role models the partial constraints due to physiological interdependencies. The redundant role reflects the unconstrained behaviour of an articulator which is maximally prone to coarticulation. Statistical target models are also obtained as the by-product of the identified roles. The algorithm for identification of articulatory roles (and estimation of respective model distributions) for each phone is presented and the results are critically evaluated. 
    The identified data-driven constraints are compared with the well-known and commonly used constraints derived from the IPA (International Phonetic Alphabet). The identified critical roles were not only in agreement with the place and manner descriptions of each phone but also provided a phoneme-to-phone transformation by capturing language- and speaker-specific behaviour of articulators. The models trained from the identified constraints fitted the phone distributions better (40% improvement). The evaluation of the proposed search procedure against an exhaustive search for identification of roles demonstrated that the proposed approach performs equally well at much lower computational load. Articulation models built in the planning stage, using sparse yet efficient articulatory representations and standard trajectory generation techniques, showed some potential in modelling articulatory behaviour. Plenty of scope exists for further developing models of articulation from the proposed framework.
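A toy illustration of the role idea: if an articulator's position varies little across realisations of a phone, it is behaving as a critical (target-driven) articulator, while large variance suggests a redundant one. The thesis identifies roles with full statistical model comparison, not the fixed variance thresholds used here, and all data below are made up.

```python
def classify_roles(positions, crit_var=0.05, red_var=0.5):
    """Label each articulator's role for one phone from the
    variance of its position across realisations: low variance ->
    critical, high variance -> redundant, otherwise dependent.
    Thresholds are illustrative only."""
    roles = {}
    for name, vals in positions.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        if var < crit_var:
            roles[name] = "critical"
        elif var > red_var:
            roles[name] = "redundant"
        else:
            roles[name] = "dependent"
    return roles

# Made-up articulator positions across realisations of a /t/-like phone.
data = {
    "tongue_tip": [10.0, 10.1, 9.9, 10.0],   # tightly constrained
    "lower_lip": [4.0, 5.5, 3.1, 6.2],       # free to coarticulate
}
roles = classify_roles(data)
```

The statistical target models mentioned above fall out of the same computation: the per-articulator mean and variance of the critical articulators describe the phone's target.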