
    Continuous Interaction with a Virtual Human

    Attentive Speaking and Active Listening require that a Virtual Human be capable of simultaneous perception/interpretation and production of communicative behavior. A Virtual Human should be able to signal its attitude and attention while it is listening to its interaction partner, and be able to attend to its interaction partner while it is speaking, modifying its communicative behavior on the fly based on what it perceives from its partner. This report presents the results of a four-week summer project that was part of eNTERFACE’10. The project resulted in progress on several aspects of continuous interaction, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and models for appropriate reactions to listener responses. A pilot user study was conducted with ten participants. In addition, the project yielded a number of deliverables that are released for public access.
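
    A minimal sketch of the kind of interruptible behavior scheduling described above: a running behavior can be aborted when a listener response is perceived, and a reaction is inserted at the head of the queue. All class and method names here are hypothetical, not the project's actual API.

```python
from dataclasses import dataclass


@dataclass
class Behavior:
    modality: str           # e.g. "speech", "gaze", "gesture"
    content: str
    duration: float         # seconds
    interruptible: bool = True


class BehaviorScheduler:
    """Queue-based scheduler that can interrupt a running behavior."""

    def __init__(self) -> None:
        self.queue: list[Behavior] = []
        self.current: Behavior | None = None

    def schedule(self, behavior: Behavior) -> None:
        self.queue.append(behavior)

    def on_listener_response(self, response: str) -> None:
        # A listener response was perceived while the agent is speaking:
        # abort the current behavior (if allowed) and react to it first.
        if self.current is not None and self.current.interruptible:
            self.current = None
            self.queue.insert(0, Behavior("speech", f"ack: {response}", 0.5))

    def step(self) -> None:
        # Start the next queued behavior when nothing is running.
        if self.current is None and self.queue:
            self.current = self.queue.pop(0)
            print(f"start {self.current.modality}: {self.current.content}")
```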

    Cross-Lingual Voice Conversion with Non-Parallel Data

    In this project, a phonetic posteriorgram (PPG)-based voice conversion system is implemented. The main goal is to perform and evaluate conversions of singing voice, in both cross-gender and cross-lingual scenarios. Additionally, the use of spectral-envelope-based MFCCs and a pseudo-singing dataset for ASR training is proposed to improve the performance of the system in the singing context.
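
    A hedged sketch of such a PPG-based conversion pipeline: an ASR acoustic model yields speaker-independent phonetic posteriors, a regression model maps them to target-speaker spectral features, and F0 is transformed separately. Here `asr_model`, `converter`, and `vocoder` are hypothetical placeholder components, not the system's actual modules.

```python
import numpy as np


def convert_f0(f0, src_stats, tgt_stats):
    """Log-linear F0 transformation; unvoiced frames (f0 == 0) are kept."""
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    voiced = f0 > 0
    out = f0.copy()
    out[voiced] = np.exp(
        (np.log(f0[voiced]) - src_mean) / src_std * tgt_std + tgt_mean)
    return out


def convert(asr_model, converter, vocoder, mfcc, f0, src_stats, tgt_stats):
    # 1. Speaker-independent linguistic content: frame-wise phonetic
    #    posteriors from the ASR acoustic model, shape (T, n_phones).
    ppg = asr_model.posteriors(mfcc)
    # 2. Map PPGs to target-speaker spectral features. Because PPGs
    #    abstract away speaker identity (and largely language), the
    #    mapping needs no parallel data and transfers across languages.
    target_spec = converter.predict(ppg)
    # 3. Transform F0 separately and resynthesise with a vocoder.
    return vocoder.synthesize(target_spec,
                              convert_f0(f0, src_stats, tgt_stats))
```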

    Overcoming the limitations of statistical parametric speech synthesis

    At the time of beginning this thesis, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis time, making these systems flexible and their performance stable. However, HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused for natural speech. The literature offers many hypotheses for the causes of reduced synthesis quality in HMM speech synthesis, and for the improvements consequently required; until this thesis, however, these hypothesised causes were rarely tested. This thesis makes two types of contributions to the field of speech synthesis, each appearing in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems, aiming to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I to make informed improvements to speech synthesis. The usual approach to improving synthesis systems is to attribute reduced quality to a hypothesised cause and then construct a new system that aims to remove that cause. However, this is typically done without prior testing to verify the hypothesised cause, so even if improvements in synthesis quality are observed, it remains unknown whether a real underlying issue or merely a minor one has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover the real underlying causes of reduced quality in HMM synthesis and the level to which each contributes. Using this knowledge, Part II investigates two well-motivated improvements to standard HMM synthesis. The first follows from the finding that averaging across differing linguistic contexts, a practice typically performed during decision-tree clustering in HMM synthesis, is a major contributor to reduced synthesis quality. A system is therefore investigated that removes averaging across differing linguistic contexts and instead averages only across matching linguistic contexts (called rich-context synthesis). The second follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. The hybrid synthesis paradigm is therefore investigated: such systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit-selection system. Both of the motivated improvements applied in Part II are found to yield significant gains in synthesis quality, demonstrating the benefit of the style of perceptual testing conducted in the thesis.
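
    A toy illustration of the averaging effect that motivates rich-context synthesis: decision-tree clustering ties many distinct linguistic contexts to one leaf and models them with a single shared mean, whereas a rich-context system keeps a separate model per context and only averages over exact matches. The contexts and frame values below are invented for the example.

```python
import numpy as np

# Frame observations keyed by their full linguistic context label.
frames = {
    "a/stressed/phrase-final": np.array([1.0, 0.9, 1.1]),
    "a/unstressed/phrase-mid": np.array([0.2, 0.1, 0.3]),
}

# Standard HMM synthesis: one tied mean for every context in the leaf,
# blurring the difference between the two contexts.
tied_mean = np.mean(np.concatenate(list(frames.values())))

# Rich-context synthesis: a mean per exact context; at synthesis time
# the model matching the required context is selected, preserving the
# detail that the tied mean averaged away.
rich_means = {ctx: obs.mean() for ctx, obs in frames.items()}

print(f"tied mean: {tied_mean:.2f}")
for ctx, mean in rich_means.items():
    print(f"{ctx}: {mean:.2f}")
```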

    Expressive Modulation of Neutral Visual Speech

    The need for animated graphical models of the human face is commonplace in the movie, video game, and television industries, appearing in everything from low-budget advertisements and free mobile apps to Hollywood blockbusters costing hundreds of millions of dollars. Generative statistical models of animation attempt to address some of the drawbacks of industry-standard practices, such as labour intensity and creative inflexibility. This work describes one such method for transforming speech animation curves between different expressive styles. Beginning with the assumption that expressive speech animation is a mix of two components, a high-frequency speech component (the content) and a much lower-frequency expressive component (the style), we use Independent Component Analysis (ICA) to identify and manipulate these components independently of one another. Next we learn how the energy for different speaking styles is distributed in terms of the low-dimensional independent-components model. Transforming the speaking style involves projecting new animation curves into the low-dimensional ICA space, redistributing the energy in the independent components, and finally reconstructing the animation curves by inverting the projection. We show that a single ICA model can be used for separating multiple expressive styles into their component parts. Subjective evaluations show that viewers can reliably identify the expressive style generated using our approach, and that they have difficulty distinguishing transformed expressive speech animation from the equivalent ground truth.
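
    A minimal sketch of the project–rescale–reconstruct transformation described above, using scikit-learn's FastICA. The input curves are random placeholders, and the per-component gains stand in for the per-style energy profiles that would be learned from expressive training data.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
curves = rng.standard_normal((500, 20))   # frames x animation parameters

# Project the animation curves into a low-dimensional ICA space.
ica = FastICA(n_components=6, random_state=0)
sources = ica.fit_transform(curves)       # (frames, components)

# Per-component RMS energy of the neutral input, and a target energy
# profile (here a placeholder: style components boosted or damped,
# speech-content components left at gain 1.0).
neutral_energy = np.sqrt((sources ** 2).mean(axis=0))
target_energy = neutral_energy * np.array([1.0, 1.0, 2.5, 2.5, 0.8, 0.8])

# Redistribute the energy among the independent components, then invert
# the projection to recover expressive animation curves.
styled_sources = sources * (target_energy / neutral_energy)
styled_curves = ica.inverse_transform(styled_sources)
```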

    Modulation Spectrum-Based Postfiltering for Speech Synthesis in the Wavelet Domain

    Master's thesis (M.S.), Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, August 2017. Advisor: Nam Soo Kim. This thesis presents a wavelet-domain measure for postfiltering applications. The quality of hidden Markov model (HMM)-based parametric speech synthesis is degraded by the over-smoothing effect, whereby the trajectory of the generated speech parameters is smoothed out and lacks dynamics. The conventional method quantifies over-smoothing using the modulation spectrum (MS), by measuring the spectral tilt of the MS. To enhance performance, a modified version of the MS called the scaled modulation spectrum (SMS), which essentially separates the MS into different bands, is proposed and utilized in postfiltering. Two types of wavelet transform, the discrete wavelet transform (DWT) and the dual-tree complex wavelet transform (DTCWT), are evaluated. The SMS is further extended with a hidden Markov tree (HMT) model, which represents the interdependencies of the wavelet coefficients. Experimental results show that the proposed method outperforms the conventional MS-based postfilter.

    Contents:
    1 Introduction
    2 Modulation Spectrum-Based Postfiltering
      2.1 Modulation Spectrum
      2.2 Conventional Postfiltering
    3 Discrete Wavelet-Based Postfiltering
      3.1 Discrete Wavelet Transform
      3.2 Postfiltering in the Wavelet Domain
    4 Postfiltering Using Dual-Tree Complex Wavelet Transforms
      4.1 Dual-Tree Complex Wavelet Transform
      4.2 Postfiltering Using the DTCWT
    5 Postfiltering Using Hidden Markov Tree Models
      5.1 Statistical Signal Processing Using Hidden Markov Trees
      5.2 Modeling SMS with HMT
    6 Experimental Results
      6.1 Experimental Setup
      6.2 Results
    7 Conclusion and Future Work
      7.1 Conclusion
      7.2 Future Work
    Bibliography
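
    A minimal sketch of plain MS-based postfiltering as described above: the modulation spectrum of one generated parameter trajectory is scaled toward a reference MS from natural speech, counteracting over-smoothing. The thesis's SMS variant would apply analogous scaling to DWT/DTCWT sub-band coefficients instead; `natural_ms` is assumed to have been estimated offline from natural speech.

```python
import numpy as np


def modulation_spectrum(traj: np.ndarray, n_fft: int = 256) -> np.ndarray:
    """Magnitude spectrum of one parameter trajectory (len(traj) <= n_fft)."""
    return np.abs(np.fft.rfft(traj - traj.mean(), n=n_fft))


def ms_postfilter(traj: np.ndarray, natural_ms: np.ndarray,
                  n_fft: int = 256, alpha: float = 1.0) -> np.ndarray:
    """Scale the trajectory's MS toward a natural-speech reference MS."""
    mean = traj.mean()
    spec = np.fft.rfft(traj - mean, n=n_fft)
    generated_ms = np.abs(spec) + 1e-12
    # Interpolate between generated and natural MS; alpha = 1.0 matches
    # the reference fully, alpha = 0.0 leaves the trajectory unchanged.
    gain = (natural_ms / generated_ms) ** alpha
    filtered = np.fft.irfft(spec * gain, n=n_fft)[: len(traj)]
    return filtered + mean
```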