76 research outputs found

    Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion

    In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method which retains the simplicity and text-independence of frame-level conversion while yielding high-quality results. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 in Gaussian mixture model (GMM) conversion, and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both mean/variance normalization and the baseline GMM conversion.
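
As a rough, self-contained sketch of the frame-level mean/variance baseline the abstract refers to (not the proposed tri-frame/GMM method), log-F0 statistics of the source speaker can be shifted and scaled toward the target speaker; the function and variable names below are illustrative only.

```python
import numpy as np

def mv_normalize_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Frame-level mean/variance F0 conversion in the log domain.

    f0_src : source F0 track in Hz, with 0 marking unvoiced frames.
    *_mean, *_std : statistics of each speaker's voiced log-F0.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    z = (np.log(f0_src[voiced]) - src_mean) / src_std   # normalize to source stats
    f0_out[voiced] = np.exp(z * tgt_std + tgt_mean)     # rescale to target stats
    return f0_out
```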

    Speech coding


    Singing voice analysis/synthesis

    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003. Includes bibliographical references (p. 109-115). The singing voice is the oldest and most variable of musical instruments. By combining music, lyrics, and expression, the voice is able to affect us in ways that no other instrument can. As listeners, we are innately drawn to the sound of the human voice, and when present it is almost always the focal point of a musical piece. But the acoustic flexibility of the voice in intimating words, shaping phrases, and conveying emotion also makes it the most difficult instrument to model computationally. Moreover, while all voices are capable of producing the common sounds necessary for language understanding and communication, each voice possesses distinctive features independent of phonemes and words. These unique acoustic qualities are the result of a combination of innate physical factors and expressive characteristics of performance, reflecting an individual's vocal identity. A great deal of prior research has focused on speech recognition and speaker identification, but relatively little work has been performed specifically on singing. There are significant differences between speech and singing in terms of both production and perception. Traditional computational models of speech have focused on the intelligibility of language, often sacrificing sound quality for model simplicity. Such models, however, are detrimental to the goal of singing, which relies on acoustic authenticity for the non-linguistic communication of expression and emotion. These differences between speech and singing dictate that a different and specialized representation is needed to capture the sound quality and musicality most valued in singing. This dissertation proposes an analysis/synthesis framework specifically for the singing voice that models the time-varying physical and expressive characteristics unique to an individual voice. The system operates by jointly estimating source-filter voice model parameters, representing vocal physiology, and modeling the dynamic behavior of these features over time to represent aspects of expression. This framework is demonstrated to be useful for several applications, such as singing voice coding, automatic singer identification, and voice transformation. By Youngmoo Edmund Kim. Ph.D.
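
The source-filter decomposition underlying this framework can be pictured with a toy synthesis: an impulse-train excitation at the desired pitch is passed through an all-pole vocal-tract filter. This is only a generic illustration of the model class, not the dissertation's joint estimator; the sample rate, pitch, and filter coefficients below are arbitrary.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 16000, 220.0, 0.5      # sample rate (Hz), pitch (Hz), duration (s)

# Glottal source crudely approximated by an impulse train at f0.
excitation = np.zeros(int(fs * dur))
excitation[::int(fs / f0)] = 1.0

# Vocal tract approximated by a fixed all-pole filter (arbitrary stable coefficients).
a = [1.0, -1.3, 0.9, -0.2]
voiced_sound = lfilter([1.0], a, excitation)
```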

    Mapping Techniques for Voice Conversion

    Speaker identity plays an important role in human communication. In addition to the linguistic content, speech utterances contain acoustic information about the speaker's characteristics. This thesis focuses on voice conversion, a technique that aims at changing the voice of one speaker (a source speaker) into the voice of another specific speaker (a target speaker) without changing the linguistic information. The relationship between the source and target speaker characteristics is learned from training data. Voice conversion can be used in various applications and fields: text-to-speech systems, dubbing, speech-to-speech translation, games, voice restoration, voice pathology, etc.

Voice conversion poses many challenges: which features to extract from speech, how to find linguistic correspondences (alignment) between source and target features, which machine learning techniques to use for creating a mapping function between the features of the speakers, and finally, how to make the desired modifications to the speech waveform. The features can be any parameters that describe the speech and the speaker identity, e.g. spectral envelope, excitation, fundamental frequency, and phone durations. The main focus of the thesis is on the design of suitable mapping techniques between frame-level source and target features, but aspects related to parallel data alignment and prosody conversion are also addressed. The perception of the quality and the success of the identity conversion are largely subjective. Conventional statistical techniques are able to produce good similarity between the original and the converted target voices, but the quality is usually degraded. The objective of this thesis is to design conversion techniques that enable successful identity conversion while maintaining the original speech quality.

Due to the limited amount of data, statistical techniques are usually used to estimate the mapping function. The most popular technique is based on a Gaussian mixture model (GMM). However, conventional GMM-based conversion suffers from many problems that result in degraded speech quality. These problems are analyzed in this thesis, and a technique that combines GMM-based conversion with partial least squares regression is introduced to alleviate them. Additionally, approaches to solve the time-independent mapping problem associated with many algorithms are proposed. The most significant contribution of the thesis is a novel dynamic kernel partial least squares regression technique that allows creating a non-linear mapping function and improves temporal correlation. The technique is straightforward, efficient and requires very little tuning. It is shown to outperform the state-of-the-art GMM-based technique in both subjective and objective tests over a variety of speaker pairs. In addition, quality is further improved when aperiodicity and binary voicing values are predicted using the same technique.

The vast majority of existing voice conversion algorithms concern the transformation of spectral envelopes. However, prosodic features, such as fundamental frequency movements and speaking rhythm, also contain important cues of identity. It is shown in the thesis that pure prosody alone can be used, to some extent, to recognize speakers that are familiar to the listeners. Furthermore, a prosody conversion technique is proposed that transforms fundamental frequency contours and durations at the syllable level. The technique is shown to improve similarity to the target speaker's prosody and to reduce roboticness compared to a conventional frame-based conversion technique. Recently, the trend has shifted from text-dependent to text-independent use cases, meaning that no parallel data is available. The techniques proposed in the thesis currently assume parallel data, i.e. that the same texts have been spoken by both speakers. However, excluding the prosody conversion algorithm, the proposed techniques require no phonetic information and are applicable to small amounts of training data. Moreover, many text-independent approaches are based on extracting some form of alignment as a pre-processing step; thus the techniques proposed in the thesis can be applied after that alignment step.
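
To make the frame-level mapping idea concrete, the sketch below fits a plain (linear, static) partial least squares regression between time-aligned source and target feature frames with scikit-learn. It is only an illustration of the general approach; the dynamic kernel PLS technique proposed in the thesis is not reproduced here, and the feature arrays are random placeholders rather than real spectral features.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Placeholder data: rows are time-aligned frames, columns are feature dimensions
# (a real system would use e.g. mel-cepstra or line spectral frequencies).
rng = np.random.default_rng(0)
X_source = rng.normal(size=(1000, 24))
Y_target = rng.normal(size=(1000, 24))

# Static linear PLS mapping from source frames to target frames.
pls = PLSRegression(n_components=8)
pls.fit(X_source, Y_target)

converted_frames = pls.predict(X_source[:10])   # convert a few source frames
```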

    Voice Conversion


    Audio Processing and Loudness Estimation Algorithms with iOS Simulations

    The processing power and storage capacity of portable devices have improved considerably over the past decade. This has motivated the implementation of sophisticated audio and other signal processing algorithms on such mobile devices. Of particular interest in this thesis is audio/speech processing based on perceptual criteria. Specifically, estimation of parameters from human auditory models, such as auditory patterns and loudness, involves computationally intensive operations which can strain device resources. Hence, strategies for implementing computationally efficient human auditory models for loudness estimation have been studied in this thesis. Existing algorithms for reducing computations in auditory pattern and loudness estimation have been examined, and improved algorithms have been proposed to overcome limitations of these methods. In addition, real-time applications such as perceptual loudness estimation and loudness equalization using auditory models have also been implemented. A software implementation of loudness estimation on iOS devices is also reported in this thesis. In addition to the loudness estimation algorithms and software, this thesis project also created new illustrations of speech and audio processing concepts for research and education. As a result, a new suite of speech/audio DSP functions was developed and integrated into the award-winning educational iOS app 'iJDSP'. These functions are described in detail in this thesis. Several enhancements in the architecture of the application have also been introduced to provide the supporting framework for speech/audio processing. Frame-by-frame processing and visualization functionalities have been developed to facilitate speech/audio processing. In addition, facilities for easy sound recording, processing and audio rendering have been developed to provide students, practitioners and researchers with an enriched DSP simulation tool. Simulations and assessments have also been developed for use in classes and in the training of practitioners and students. Dissertation/Thesis. M.S. Electrical Engineering 201
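
As a minimal, non-perceptual stand-in for the loudness measures discussed above, the sketch below computes a frame-by-frame RMS level in dB; a real auditory-model estimate would additionally involve outer/middle-ear filtering, excitation patterns, and specific-loudness summation. Function names and parameters are illustrative, not taken from the thesis.

```python
import numpy as np

def frame_level_db(x, frame_len=1024, hop=512, eps=1e-12):
    """Per-frame RMS level in dB for a mono signal `x` (a crude loudness proxy)."""
    levels = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        levels.append(20.0 * np.log10(rms + eps))
    return np.array(levels)

# Example: a steady 1 kHz tone at 16 kHz sampling yields a flat level track.
fs = 16000
tone = 0.1 * np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
print(frame_level_db(tone)[:5])
```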

    Voice source characterization for prosodic and spectral manipulation

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing: speech synthesis, voice conversion or emotion detection, among others. Thus, we study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase.

In order to validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method gives satisfactory results over a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to the reference, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking in third place, but still higher than the acceptance threshold (Fair-Good). Next, we proposed two methods for prosody modification, one for each of the residual representations explained above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. The second method used resampling of the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed in order to achieve quality levels similar to the reference methods.

As part of this dissertation, we have studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. The results showed that the evaluators preferred our method over the original one, rating it with a higher score on the MOS scale. To study voice quality, we recorded a small database of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky and falsetto). Comparing the results with those reported in the literature, we found them to generally agree with previous findings. Some differences existed, but they could be attributed to the difficulties of comparing voice qualities produced by different speakers. At the same time, we conducted experiments in the field of voice quality identification, with very good results. We have also evaluated the performance of an automatic emotion classifier based on GMMs using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. Detection accuracy for the individual emotions was also high, improving on the results of previously reported works using the same database. Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact on the field of automatic emotion classification.
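
For orientation, a very rough sketch of the inverse-filtering idea mentioned above: an all-pole vocal-tract model is fitted to a voiced frame by linear prediction and its inverse is applied so the output approximates the glottal source. This is a generic LPC-based illustration, not the dissertation's joint source-filter estimator or its LF-model parametrization; the frame length and prediction order are arbitrary choices.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_inverse_filter(frame, order=18):
    """Approximate the glottal residual of one voiced frame by LPC inverse filtering."""
    x = np.asarray(frame, dtype=float)
    # Fit prediction coefficients x[n] ~ a1*x[n-1] + ... + ap*x[n-p] by least squares.
    X = np.array([x[i:i + order][::-1] for i in range(len(x) - order)])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    inv = np.concatenate(([1.0], -a))      # inverse (whitening) filter coefficients
    return lfilter(inv, [1.0], x)          # output approximates the glottal source

# Example on a synthetic voiced-like frame (impulse train through an all-pole filter).
excitation = np.zeros(800)
excitation[::80] = 1.0
speech_frame = lfilter([1.0], [1.0, -1.3, 0.9, -0.2], excitation)
residual = lpc_inverse_filter(speech_frame)
```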

    Proceedings of the Second International Mobile Satellite Conference (IMSC 1990)

    Presented here are the proceedings of the Second International Mobile Satellite Conference (IMSC), held June 17-20, 1990 in Ottawa, Canada. Topics covered include future mobile satellite communications concepts, aeronautical applications, modulation and coding, propagation and experimental systems, mobile terminal equipment, network architecture and control, regulatory and policy considerations, vehicle antennas, and speech compression.