    Voice Conversion by Prosody and Vocal Tract Modification

    In this paper we propose some flexible methods that are useful in the process of voice conversion. The proposed methods modify the shape of the vocal tract system and the characteristics of the prosody according to the desired requirement. The shape of the vocal tract system is modified by shifting the major resonant frequencies (formants) of the short-term spectrum and altering their bandwidths accordingly. In the case of prosody modification, the required durational and intonational characteristics are imposed on the given speech signal. In the proposed method, the prosodic characteristics are manipulated using instants of significant excitation. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations, like the onset of a burst, in the case of nonvoiced speech. Instants of significant excitation are computed from the Linear Prediction (LP) residual of the speech signals by using the property of average group delay of minimum phase signals. The manipulations of durational characteristics and pitch contour (intonation pattern) are achieved by manipulating the LP residual with the help of the knowledge of the instants of significant excitation. The modified LP residual is used to excite the time-varying filter. The filter parameters are updated according to the desired vocal tract characteristics. The proposed methods are evaluated using listening tests.
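
The formant shifting described above can be illustrated with the standard mapping between a formant and a pole of an all-pole (LP) filter. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, the scaling parameters, and the choice to handle one resonance at a time are all hypothetical. It relies on the textbook relations r = exp(-pi * B / fs) and theta = 2 * pi * F / fs for a pole r * exp(j * theta).

```python
import cmath
import math


def pole_from_formant(freq_hz, bandwidth_hz, fs_hz):
    """Upper-half-plane pole of a second-order resonator for one formant."""
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    angle = 2.0 * math.pi * freq_hz / fs_hz
    return cmath.rect(radius, angle)


def formant_from_pole(pole, fs_hz):
    """Inverse mapping: pole -> (centre frequency, bandwidth) in Hz."""
    freq_hz = cmath.phase(pole) * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(abs(pole)) * fs_hz / math.pi
    return freq_hz, bandwidth_hz


def shift_formant(pole, fs_hz, freq_scale=1.0, bandwidth_scale=1.0):
    """Scale the frequency and bandwidth of one resonance (hypothetical helper).

    The scaled frequency must stay between 0 and fs_hz / 2.
    """
    freq_hz, bandwidth_hz = formant_from_pole(pole, fs_hz)
    return pole_from_formant(
        freq_hz * freq_scale, bandwidth_hz * bandwidth_scale, fs_hz
    )


def resonator_denominator(pole):
    """Real coefficients [1, a1, a2] of (1 - p z^-1)(1 - conj(p) z^-1)."""
    return [1.0, -2.0 * pole.real, abs(pole) ** 2]


if __name__ == "__main__":
    fs = 8000.0
    original = pole_from_formant(500.0, 80.0, fs)
    shifted = shift_formant(original, fs, freq_scale=1.2, bandwidth_scale=1.5)
    print(formant_from_pole(shifted, fs))
    print(resonator_denominator(shifted))
```

In a full system the modified poles of all resonances would be multiplied back into one denominator polynomial, which then serves as the time-varying filter excited by the modified LP residual.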

    Aspiration noise during phonation : synthesis, analysis, and pitch-scale modification

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 139-145). The current study investigates the synthesis and analysis of aspiration noise in synthesized and spoken vowels. Based on the linear source-filter model of speech production, we implement a vowel synthesizer in which the aspiration noise source is temporally modulated by the periodic source waveform. Modulations in the noise source waveform and their synchrony with the periodic source are shown to be salient for natural-sounding vowel synthesis. After developing the synthesis framework, we research past approaches to separate the two additive components of the model. A challenge for analysis based on this model is the accurate estimation of the aspiration noise component that contains energy across the frequency spectrum and temporal characteristics due to modulations in the noise source. Spectral harmonic/noise component analysis of spoken vowels shows evidence of noise modulations with peaks in the estimated noise source component synchronous with both the open phase of the periodic source and with time instants of glottal closure. Inspired by this observation of natural modulations in the aspiration noise source, we develop an alternate approach to the speech signal processing aim of accurate pitch-scale modification. The proposed strategy takes a dual processing approach, in which the periodic and noise components of the speech signal are separately analyzed, modified, and re-synthesized. The periodic component is modified using our implementation of time-domain pitch-synchronous overlap-add, and the noise component is handled by modifying characteristics of its source waveform. Since we have modeled an inherent coupling between the original periodic and aspiration noise sources, the modification algorithm is designed to preserve the synchrony between temporal modulations of the two sources. 
The reconstructed modified signal is perceived to be natural-sounding and generally reduces artifacts that are typically heard in current modification techniques. by Daryush Mehta. S.M.
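
The idea of an aspiration noise source that is temporally modulated in synchrony with the periodic source can be sketched as follows. This is an illustrative toy, not the thesis's synthesizer: the impulse-train periodic source, the raised-cosine envelope, and the names `depth`, `noise_gain`, and `seed` are all assumptions made for the example. The envelope here has a single peak in the middle of each pitch period, which is a simplification of the modulation pattern reported in the abstract.

```python
import math
import random


def pulse_train(n_samples, fs_hz, f0_hz):
    """Unit impulses once per pitch period: a crude stand-in for the periodic source."""
    period = round(fs_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def modulation_envelope(n_samples, fs_hz, f0_hz, depth=0.8):
    """Raised-cosine envelope, one cycle per pitch period, with values in [1 - depth, 1].

    The minimum falls on each pulse and the maximum halfway between pulses.
    """
    period = round(fs_hz / f0_hz)
    return [
        1.0 - depth * 0.5 * (1.0 + math.cos(2.0 * math.pi * n / period))
        for n in range(n_samples)
    ]


def modulated_noise(n_samples, fs_hz, f0_hz, depth=0.8, seed=0):
    """White Gaussian noise multiplied by the pitch-synchronous envelope."""
    rng = random.Random(seed)
    envelope = modulation_envelope(n_samples, fs_hz, f0_hz, depth)
    return [rng.gauss(0.0, 1.0) * e for e in envelope]


def synthesize_source(n_samples, fs_hz, f0_hz, noise_gain=0.05, depth=0.8, seed=0):
    """Additive source: periodic component plus modulated aspiration noise."""
    periodic = pulse_train(n_samples, fs_hz, f0_hz)
    noise = modulated_noise(n_samples, fs_hz, f0_hz, depth, seed)
    return [p + noise_gain * q for p, q in zip(periodic, noise)]


if __name__ == "__main__":
    source = synthesize_source(n_samples=400, fs_hz=8000, f0_hz=100)
    print(len(source), max(source))
```

Because both components are derived from the same pitch period, changing `f0_hz` moves the pulses and the noise envelope together, which is the kind of synchrony the modification algorithm is designed to preserve. A vowel would be obtained by passing the summed source through a vocal tract filter, which is omitted here.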

    Voice source characterization for prosodic and spectral manipulation

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on the glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing: speech synthesis, voice conversion or emotion detection, among others. Thus, we will study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds goes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase. In order to validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method gives satisfactory results in a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to this reference, achieving high quality ratings (Good-Excellent). Our full parametrized system scored lower than the other two, ranking in third place, but still higher than the acceptance threshold (Fair-Good). Next we proposed two methods for prosody modification, one for each of the residual representations explained above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. 
The second method used resampling on the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed in order to achieve quality levels similar to the reference methods. As part of this dissertation, we have studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We have included our speech production model in a reference voice conversion system, to evaluate the impact of our parametrization in this task. The results showed that the evaluators preferred our method over the original one, rating it with a higher score on the MOS scale. We recorded a small database consisting of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky and falsetto), which was later also used in our study of voice quality. Comparing the results with those reported in the literature, we found them to generally agree with previous findings. Some differences existed, but they could be attributed to the difficulties in comparing voice qualities produced by different speakers. At the same time we conducted experiments in the field of voice quality identification, with very good results. We have also evaluated the performance of an automatic emotion classifier based on GMM using glottal measures. For each emotion, we have trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. The detection accuracy for the different emotions was also high, improving on the results of previously reported works using the same database. 
Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact in the field of automatic emotion classification.
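
Inverse filtering, as defined in the abstract above, can be sketched in its simplest form with plain linear prediction: estimate an all-pole vocal tract filter 1/A(z) and apply A(z) to the signal to obtain a residual. This is a generic textbook illustration, not the dissertation's method, which uses a parametric glottal pulse model inside the source-filter decomposition. All function names are hypothetical, and the autocorrelation method with the Levinson-Durbin recursion is an assumed choice.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LP normal equations; returns A(z) coefficients [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        error *= 1.0 - k * k
    return a


def inverse_filter(signal, a):
    """Apply the FIR filter A(z) to the signal to obtain the prediction residual."""
    return [
        sum(a[j] * signal[n - j] for j in range(len(a)) if n - j >= 0)
        for n in range(len(signal))
    ]


if __name__ == "__main__":
    # Impulse response of a known all-pole filter 1 / (1 - 1.3 z^-1 + 0.6 z^-2).
    h = []
    for n in range(200):
        x = 1.0 if n == 0 else 0.0
        y1 = h[n - 1] if n >= 1 else 0.0
        y2 = h[n - 2] if n >= 2 else 0.0
        h.append(x + 1.3 * y1 - 0.6 * y2)
    coeffs = levinson_durbin(autocorrelation(h, 2), 2)
    residual = inverse_filter(h, coeffs)
    print(coeffs)
    print(residual[:3])
```

On real speech the residual obtained this way still contains the spectral tilt of the glottal source mixed with estimation errors, which is one motivation for fitting a parametric glottal model rather than relying on the raw LP residual.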

    Psychophysical and signal-processing aspects of speech representation

    On the automatic segmentation of transcribed words

    Numerical Modeling of Vocal Control and Patient-specific Surgical Planning of Type 1 Thyroplasty

    This study aims to develop knowledge about the roles of intrinsic laryngeal muscles in voice control in both healthy and disordered conditions through comprehensive computational models. The phonation simulator was built by combining a three-dimensional high-fidelity MRI-based model of the larynx, active muscle mechanics, and a fluid-structure-acoustic interaction model, which enabled the exploration of the underlying mechanisms linking individual and/or group muscle contractions under both symmetric and asymmetric activations, vocal fold posture, vocal fold vibration, and voice outcomes during voice production. The first part of this research extensively investigated the effects of cricothyroid and thyroarytenoid muscle activations on voice characteristics through a parametric study. The role of these intrinsic muscles in the adjustment of geometrical and mechanical properties of vocal fold pre-phonatory posture, glottic flow aerodynamics, and acoustics, and how all these components interact, were explored. Results were comprehensively validated, and the link between elements of phonation was described in detail. In the next step, owing to the model's ability to activate individual muscles, unilateral vocal fold paralysis was simulated, and the characteristics of disordered voice were analyzed. The voice simulator was then combined with the implant insertion model and a genetic algorithm method to build a computational framework for patient-specific surgical planning of type 1 thyroplasty. This surgery is a standard procedure for treating unilateral vocal fold paralysis; however, it is subject to challenges mainly due to the small size of the implant and the high sensitivity of the voice outcome to the implant shape and position. Therefore, although the patient's voice could be improved, the results might not be as satisfying as expected. 
Unlike actual surgery, which leaves very little room for trial and error, the ideal implant could be achieved by optimizing the implant based on the patient's desired voice using the presented computational framework. Both healthy and diseased cases and the corrected case using the optimized implant were simulated. Results revealed that the optimized implant could restore the aerodynamic and acoustic features of the disordered voice in producing a sustained vowel utterance. Furthermore, the performance of the implant in the pitch gliding test, which was simulated using temporal activation of the cricothyroid and thyroarytenoid muscles based on the first part of the study, was evaluated. In the final step, a physics-informed neural network-based algorithm was presented to reconstruct the three-dimensional cyclic vibration of the vocal folds using two-dimensional sparse experimental data and the laws of physics. Key acoustic parameters and vibratory dynamics of the vocal folds, and other parameters such as flow rate, pressure distribution, and contact force, which are difficult to measure experimentally, were successfully predicted.

    Singing voice analysis/synthesis

    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003. Includes bibliographical references (p. 109-115). The singing voice is the oldest and most variable of musical instruments. By combining music, lyrics, and expression, the voice is able to affect us in ways that no other instrument can. As listeners, we are innately drawn to the sound of the human voice, and when present it is almost always the focal point of a musical piece. But the acoustic flexibility of the voice in intimating words, shaping phrases, and conveying emotion also makes it the most difficult instrument to model computationally. Moreover, while all voices are capable of producing the common sounds necessary for language understanding and communication, each voice possesses distinctive features independent of phonemes and words. These unique acoustic qualities are the result of a combination of innate physical factors and expressive characteristics of performance, reflecting an individual's vocal identity. A great deal of prior research has focused on speech recognition and speaker identification, but relatively little work has been performed specifically on singing. There are significant differences between speech and singing in terms of both production and perception. Traditional computational models of speech have focused on the intelligibility of language, often sacrificing sound quality for model simplicity. Such models, however, are detrimental to the goal of singing, which relies on acoustic authenticity for the non-linguistic communication of expression and emotion. These differences between speech and singing dictate that a different and specialized representation is needed to capture the sound quality and musicality most valued in singing. 
This dissertation proposes an analysis/synthesis framework specifically for the singing voice that models the time-varying physical and expressive characteristics unique to an individual voice. The system operates by jointly estimating source-filter voice model parameters, representing vocal physiology, and modeling the dynamic behavior of these features over time to represent aspects of expression. This framework is demonstrated to be useful for several applications, such as singing voice coding, automatic singer identification, and voice transformation. by Youngmoo Edmund Kim. Ph.D.
