Continuous Interaction with a Virtual Human
Attentive Speaking and Active Listening require that a Virtual Human be capable of simultaneous perception/interpretation and production of communicative behavior. A Virtual Human should be able to signal its attitude and attention while it is listening to its interaction partner, and be able to attend to its interaction partner while it is speaking – and modify its communicative behavior on-the-fly based on what it perceives from its partner. This report presents the results of a four-week summer project that was part of eNTERFACE’10. The project resulted in progress on several aspects of continuous interaction, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and models for appropriate reactions to listener responses. A pilot user study was conducted with ten participants. In addition, the project yielded a number of deliverables that have been released for public access.
Cross-Lingual Voice Conversion with Non-Parallel Data
In this project, a Phonetic Posteriorgram (PPG)-based voice conversion system is implemented. The main goal is to perform and evaluate conversions of the singing voice. Both cross-gender and cross-lingual scenarios are considered. Additionally, the use of spectral-envelope-based MFCCs and a pseudo-singing dataset for ASR training is proposed in order to improve the performance of the system in the singing context.
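As a rough illustration of the PPG idea (not the project's actual implementation), a conversion model can be trained to map frame-level phonetic posteriors to a target singer's spectral features; the data, dimensions, and the ridge-regression mapping below are hypothetical stand-ins for the ASR front end and acoustic model:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical training data for the target singer: frame-level PPGs
# (posteriors over 40 phonetic classes from an ASR model) paired with
# the target's spectral features (e.g. 13 MFCCs per frame).
ppg_train = rng.dirichlet(np.ones(40), size=1000)   # speaker-independent input
mfcc_train = rng.standard_normal((1000, 13))        # target-speaker output

# The conversion model maps PPGs to target spectral features; because PPGs
# are (ideally) speaker- and language-independent, the same mapping serves
# cross-gender and cross-lingual conversion.
model = Ridge(alpha=1.0).fit(ppg_train, mfcc_train)

# At conversion time: run the source singer's audio through the same ASR
# to obtain PPGs, then predict target-speaker features frame by frame.
ppg_source = rng.dirichlet(np.ones(40), size=200)
converted = model.predict(ppg_source)
print(converted.shape)  # (200, 13): one converted feature vector per source frame
```

In a real system the regression would be replaced by a neural network and the predicted features passed to a vocoder; this sketch only shows the PPG-to-features mapping that makes the approach speaker-independent.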
Overcoming the limitations of statistical parametric speech synthesis
At the time of beginning this thesis, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis time, making these systems flexible and their performance stable. However, HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused with natural speech. Many hypotheses for the causes of reduced synthesis quality, and the improvements they would imply, have been proposed in the literature; until this thesis, however, these hypothesised causes were rarely tested.
This thesis makes two types of contributions to the field of speech synthesis; each
of these appears in a separate part of the thesis. Part I introduces a methodology for
testing hypothesised causes of limited quality within HMM speech synthesis systems.
This investigation aims to identify what causes these systems to fall short of natural
speech. Part II uses the findings from Part I of the thesis to make informed improvements
to speech synthesis.
The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause and then construct a new system with the aim of removing that cause. However, this is typically done without prior testing to verify the hypothesised cause of reduced quality. As such, even if improvements in synthesis quality are observed, there is no way of knowing whether a real underlying issue or merely a minor one has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover what the real underlying causes of reduced quality in HMM synthesis are and the extent to which they contribute.
Using the knowledge gained in Part I of the thesis, Part II then looks to improve synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first follows from the identification of averaging across differing linguistic contexts, a practice typically performed during decision-tree regression in HMM synthesis, as a major contributor to reduced synthesis quality. Therefore, a system which removes averaging across differing linguistic contexts and instead averages only across matching linguistic contexts (called rich-context synthesis) is investigated. The second improvement follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. Therefore, the hybrid synthesis paradigm is investigated: these systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit-selection system. Both of the motivated improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of the style of perceptual testing conducted in the thesis.
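As a sketch of the hybrid idea, statistically predicted parameters can drive unit selection by serving as the target cost in a Viterbi search over candidate units; the database, costs, and dimensions below are illustrative placeholders, not the systems built in the thesis:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical database: for each of 6 target positions, 5 candidate
# acoustic units, each described by a small parameter vector.
candidates = [rng.standard_normal((5, 4)) for _ in range(6)]

# Parameters predicted by the statistical model for each target position.
predicted = rng.standard_normal((6, 4))

def target_cost(unit, pred):
    # How well a candidate matches the statistically predicted parameters.
    return float(np.sum((unit - pred) ** 2))

def join_cost(prev_unit, unit):
    # How smoothly two consecutive candidates concatenate.
    return float(np.sum((prev_unit - unit) ** 2))

# Viterbi search: cumulative cost = target cost + best join cost so far.
n_pos, n_cand = len(candidates), candidates[0].shape[0]
cost = np.full((n_pos, n_cand), np.inf)
back = np.zeros((n_pos, n_cand), dtype=int)
for j in range(n_cand):
    cost[0, j] = target_cost(candidates[0][j], predicted[0])
for i in range(1, n_pos):
    for j in range(n_cand):
        joins = [cost[i - 1, k] + join_cost(candidates[i - 1][k], candidates[i][j])
                 for k in range(n_cand)]
        back[i, j] = int(np.argmin(joins))
        cost[i, j] = target_cost(candidates[i][j], predicted[i]) + min(joins)

# Backtrace the lowest-cost unit sequence.
path = [int(np.argmin(cost[-1]))]
for i in range(n_pos - 1, 0, -1):
    path.append(int(back[i, path[-1]]))
path.reverse()
print(path)  # one selected candidate index per target position
```

Because the selected units are natural recorded speech, the vocoding step is avoided entirely; the statistical model contributes only the target costs guiding the search.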
Expressive Modulation of Neutral Visual Speech
The need for animated graphical models of the human face is commonplace in the movie, video game and television industries, appearing in everything from low-budget advertisements and free mobile apps to Hollywood blockbusters costing hundreds of millions of dollars. Generative statistical models of animation attempt to address some of the drawbacks of industry-standard practices, such as labour intensity and creative inflexibility.
This work describes one such method for transforming speech animation curves
between different expressive styles. Beginning with the assumption that
expressive speech animation is a mix of two components, a high-frequency
speech component (the content) and a much lower-frequency expressive
component (the style), we use Independent Component Analysis (ICA) to
identify and manipulate these components independently of one another. Next,
we learn how the energy for different speaking styles is distributed in terms of
the low-dimensional independent components model. Transforming the
speaking style involves projecting new animation curves into the low-dimensional ICA space, redistributing the energy in the independent
components, and finally reconstructing the animation curves by inverting the
projection.
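The three steps above (project, redistribute component energy, invert the projection) can be sketched with scikit-learn's FastICA; the animation curves and per-component gains here are synthetic placeholders, not values learned from the data described:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Hypothetical animation curves: frames x channels (e.g. facial control weights).
curves = rng.standard_normal((500, 10))

# Fit a single ICA model on pooled expressive speech animation data.
ica = FastICA(n_components=6, whiten="unit-variance", random_state=0)
sources = ica.fit_transform(curves)  # project into the low-dimensional ICA space

# Redistribute energy across the independent components: gains > 1 boost
# components associated with the target style, gains < 1 suppress others.
# These gains are purely illustrative.
style_gains = np.array([1.4, 0.8, 1.0, 1.0, 0.6, 1.2])
restyled_sources = sources * style_gains

# Invert the projection to reconstruct animation curves in the new style.
restyled_curves = ica.inverse_transform(restyled_sources)
print(restyled_curves.shape)  # (500, 10): same frames x channels as the input
```

Separating a slowly varying style component from fast speech content would, in practice, also rely on the frequency difference between the two; this sketch shows only the project/rescale/invert mechanics.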
We show that a single ICA model can be used to separate multiple expressive styles into their component parts. Subjective evaluations show that viewers can reliably identify the expressive style generated using our approach, and that they have difficulty distinguishing transformed expressive speech animation from the equivalent ground truth.
Modulation Spectrum-Based Postfiltering for Speech Synthesis in the Wavelet Domain
Master's thesis, Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, August 2017. Advisor: Nam Soo Kim. This thesis presents a wavelet-domain measure used in postfiltering applications. The quality of HMM-based (hidden Markov model-based) parametric speech synthesis is degraded by the over-smoothing effect, whereby the trajectory of generated speech parameters is smoothed out and lacks dynamics. The conventional method uses the modulation spectrum (MS) to quantify the effect of over-smoothing by measuring the spectral tilt in the MS. In order to enhance performance, a modified version of the MS called the scaled modulation spectrum (SMS), which essentially separates the MS into different bands, is proposed and utilized in postfiltering. The performance of two types of wavelets, the discrete wavelet transform (DWT) and the dual-tree complex wavelet transform (DTCWT), is evaluated. We also extend the SMS into a hidden Markov tree (HMT) model, which represents the interdependencies of the coefficients. Experimental results show that the proposed method performs better.
Contents:
1 Introduction
2 Modulation Spectrum-based Postfiltering
2.1 Modulation Spectrum
2.2 Conventional Postfiltering
3 Discrete Wavelet-based Postfiltering
3.1 Discrete Wavelet Transform
3.2 Postfiltering in the Wavelet Domain
4 Postfiltering Using Dual-tree Complex Wavelet Transforms
4.1 Dual-tree Complex Wavelet Transform
4.2 Postfiltering Using the DTCWT
5 Postfiltering Using Hidden Markov Tree Models
5.1 Statistical Signal Processing Using Hidden Markov Trees
5.2 Modeling SMS with HMT
6 Experimental Results
6.1 Experimental Setup
6.2 Results
7 Conclusion and Future Work
7.1 Conclusion
7.2 Future Work
Bibliography
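For context on the measure this thesis builds on: the modulation spectrum of a parameter trajectory is simply its spectrum over time, and over-smoothing appears as missing energy at high modulation frequencies, i.e. a steeper tilt. A minimal sketch with synthetic trajectories (not HMM output):

```python
import numpy as np

def modulation_spectrum(trajectory):
    """Magnitude spectrum of a mean-removed speech-parameter trajectory."""
    x = trajectory - trajectory.mean()
    return np.abs(np.fft.rfft(x))

# Synthetic trajectories over 200 frames: the "natural" one has both slow
# and fast dynamics; the "over-smoothed" one lost its fast component.
t = np.linspace(0, 1, 200)
natural = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
oversmoothed = np.sin(2 * np.pi * 3 * t)

ms_nat = modulation_spectrum(natural)
ms_smooth = modulation_spectrum(oversmoothed)

# The over-smoothed trajectory has far less energy at high modulation
# frequencies; a postfilter would rescale trajectory components to restore
# the natural modulation-spectrum shape.
print(ms_nat[10:].sum() > ms_smooth[10:].sum())  # True
```

The thesis's wavelet-domain variants (SMS via DWT/DTCWT) replace the Fourier transform here with wavelet decompositions so the correction can be applied per band; that per-band machinery is beyond this sketch.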