322 research outputs found

    Generation of prosody and speech for Mandarin Chinese

    Ph.D. (Doctor of Philosophy)

    Unit selection and waveform concatenation strategies in Cantonese text-to-speech.

    Oey Sai Lok. Thesis (M.Phil.), Chinese University of Hong Kong, 2005. Includes bibliographical references. Abstracts in English and Chinese. An illustrative sketch of the spectral-distance concatenation search described in Chapter 4 is given after the table of contents.
    Table of contents:
    Chapter 1. Introduction
    1.1 An overview of Text-to-Speech technology
    1.1.1 Text processing
    1.1.2 Acoustic synthesis
    1.1.3 Prosody modification
    1.2 Trends in Text-to-Speech technologies
    1.3 Objectives of this thesis
    1.4 Outline of the thesis
    References
    Chapter 2. Cantonese Speech
    2.1 The Cantonese dialect
    2.2 Phonology of Cantonese
    2.2.1 Initials
    2.2.2 Finals
    2.2.3 Tones
    2.3 Acoustic-phonetic properties of Cantonese syllables
    References
    Chapter 3. Cantonese Text-to-Speech
    3.1 General overview
    3.1.1 Text processing
    3.1.2 Corpus based acoustic synthesis
    3.1.3 Prosodic control
    3.2 Syllable based Cantonese Text-to-Speech system
    3.3 Sub-syllable based Cantonese Text-to-Speech system
    3.3.1 Definition of sub-syllable units
    3.3.2 Acoustic inventory
    3.3.3 Determination of the concatenation points
    3.4 Problems
    References
    Chapter 4. Waveform Concatenation for Sub-syllable Units
    4.1 Previous work in concatenation methods
    4.1.1 Determination of concatenation point
    4.1.2 Waveform concatenation
    4.2 Problems and difficulties in concatenating sub-syllable units
    4.2.1 Mismatch of acoustic properties
    4.2.2 Allophone problem of Initials /z/, /c/ and /s/
    4.3 General procedures in concatenation strategies
    4.3.1 Concatenation of unvoiced segments
    4.3.2 Concatenation of voiced segments
    4.3.3 Measurement of spectral distance
    4.4 Detailed procedures in concatenation points determination
    4.4.1 Unvoiced segments
    4.4.2 Voiced segments
    4.5 Selected examples in concatenation strategies
    4.5.1 Concatenation at Initial segments
    4.5.1.1 Plosives
    4.5.1.2 Fricatives
    4.5.2 Concatenation at Final segments
    4.5.2.1 V group (long vowel)
    4.5.2.2 D group (diphthong)
    References
    Chapter 5. Unit Selection for Sub-syllable Units
    5.1 Basic requirements in unit selection process
    5.1.1 Availability of multiple copies of sub-syllable units
    5.1.1.1 Levels of "identical"
    5.1.1.2 Statistics on the availability
    5.1.2 Variations in acoustic parameters
    5.1.2.1 Pitch level
    5.1.2.2 Duration
    5.1.2.3 Intensity level
    5.2 Selection process: availability check on sub-syllable units
    5.2.1 Multiple copies found
    5.2.2 Unique copy found
    5.2.3 No matched copy found
    5.2.4 Illustrative examples
    5.3 Selection process: acoustic analysis on candidate units
    References
    Chapter 6. Performance Evaluation
    6.1 General information
    6.1.1 Objective test
    6.1.2 Subjective test
    6.1.3 Test materials
    6.2 Details of the objective test
    6.2.1 Testing method
    6.2.2 Results
    6.2.3 Analysis
    6.3 Details of the subjective test
    6.3.1 Testing method
    6.3.2 Results
    6.3.3 Analysis
    6.4 Summary
    References
    Chapter 7. Conclusions and Future Works
    7.1 Conclusions
    7.2 Suggested future works
    References
    Appendix 1 Mean pitch level of Initials and Finals stored in the inventory
    Appendix 2 Mean durations of Initials and Finals stored in the inventory
    Appendix 3 Mean intensity level of Initials and Finals stored in the inventory
    Appendix 4 Test word used in performance evaluation
    Appendix 5 Test paragraph used in performance evaluation
    Appendix 6 Pitch profile used in the Text-to-Speech system
    Appendix 7 Duration model used in the Text-to-Speech system
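    The chapter headings above (in particular 4.3.3, "Measurement of spectral distance", and 4.4, "Detailed procedures in concatenation points determination") only name the technique, so the following is a generic, illustrative sketch of how a spectral-distance-based concatenation-point search can work, not the thesis's actual procedure. It assumes MFCC features, a Euclidean frame distance, a 5 ms hop and a 30 ms search window; these choices and the helper names are hypothetical.

```python
# Illustrative sketch (not the thesis's exact method): around a nominal unit
# boundary, pick the pair of frames whose MFCC vectors are closest and splice there.
import numpy as np
import librosa

def best_concat_point(left_unit, right_unit, sr, search_ms=30, n_mfcc=13):
    """Return (cut_sample_in_left, cut_sample_in_right) minimising the
    MFCC Euclidean distance within a small window around the boundary."""
    hop = int(0.005 * sr)                    # 5 ms analysis hop (assumed)
    n_search = max(1, search_ms // 5)        # number of frames searched per side

    mfcc_l = librosa.feature.mfcc(y=left_unit, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    mfcc_r = librosa.feature.mfcc(y=right_unit, sr=sr, n_mfcc=n_mfcc, hop_length=hop)

    tail = mfcc_l[:, -n_search:]             # last frames of the left unit
    head = mfcc_r[:, :n_search]              # first frames of the right unit

    # Distance between every (tail frame, head frame) pair in the window.
    dists = np.linalg.norm(tail[:, :, None] - head[:, None, :], axis=0)
    i, j = np.unravel_index(np.argmin(dists), dists.shape)

    cut_left = (mfcc_l.shape[1] - n_search + i) * hop
    cut_right = j * hop
    return cut_left, cut_right

def splice(left_unit, right_unit, sr):
    """Naive concatenation at the best point found above (no cross-fade)."""
    cl, cr = best_concat_point(left_unit, right_unit, sr)
    return np.concatenate([left_unit[:cl], right_unit[cr:]])
```

    In a real concatenative synthesiser the joint would typically also be cross-faded or otherwise smoothed; the sketch only shows the distance-based choice of cut point.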

    An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era

    Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research. In recent years, machines have managed to master the art of generating speech that is intelligible to humans. But the linguistic content of an utterance encompasses only a part of its meaning. Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions: aspects that are essential for engaging and naturalistic interpersonal communication. While the goal of imparting expressivity to synthesised utterances has so far remained elusive, following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion as well. Deep learning, the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts. In the present overview, we outline ongoing trends and summarise state-of-the-art approaches in an attempt to provide a comprehensive overview of this exciting field. Comment: submitted to the Proceedings of the IEEE.

    ์กฐ๊ฑด๋ถ€ ์ž๊ธฐํšŒ๊ท€ํ˜• ์ธ๊ณต์‹ ๊ฒฝ๋ง์„ ์ด์šฉํ•œ ์ œ์–ด ๊ฐ€๋Šฅํ•œ ๊ฐ€์ฐฝ ์Œ์„ฑ ํ•ฉ์„ฑ

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์ง€๋Šฅ์ •๋ณด์œตํ•ฉํ•™๊ณผ, 2022. 8. ์ด๊ต๊ตฌ.Singing voice synthesis aims at synthesizing a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps to more easily and conveniently reflect the creator's intentions. However, there are three challenging problems in designing such a system - 1) It should be possible to independently control the various elements that make up the singing. 2) It must be possible to generate high-quality sound sources, 3) It is difficult to secure sufficient training data. To deal with this problem, we first paid attention to the source-filter theory, which is a representative speech production modeling technique. We tried to secure training data efficiency and controllability at the same time by modeling a singing voice as a convolution of the source, which is pitch information, and filter, which is the pronunciation information, and designing a structure that can model each independently. In addition, we used a conditional autoregressive model-based deep neural network to effectively model sequential data in a situation where conditional inputs such as pronunciation, pitch, and speaker are given. In order for the entire framework to generate a high-quality sound source with a distribution more similar to that of a real singing voice, the adversarial training technique was applied to the training process. Finally, we applied a self-supervised style modeling technique to model detailed unlabeled musical expressions. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and musical expression, while synthesizing high-quality singing that is difficult to distinguish from ground truth singing. Furthermore, we proposed a generation and modification framework that considers the situation applied to the actual music production process, and confirmed that it is possible to apply it to expand the limits of the creator's imagination, such as new voice design and cross-generation.๊ฐ€์ฐฝ ํ•ฉ์„ฑ์€ ์ฃผ์–ด์ง„ ์ž…๋ ฅ ์•…๋ณด๋กœ๋ถ€ํ„ฐ ์ž์—ฐ์Šค๋Ÿฌ์šด ๊ฐ€์ฐฝ ์Œ์„ฑ์„ ํ•ฉ์„ฑํ•ด๋‚ด๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค. ๊ฐ€์ฐฝ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์€ ์Œ์•… ์ œ์ž‘ ๋น„์šฉ์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ์„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ฐฝ์ž‘์ž์˜ ์˜๋„๋ฅผ ๋ณด๋‹ค ์‰ฝ๊ณ  ํŽธ๋ฆฌํ•˜๊ฒŒ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•๋Š”๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ ์‹œ์Šคํ…œ์˜ ์„ค๊ณ„๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋‹ค์Œ ์„ธ ๊ฐ€์ง€์˜ ๋„์ „์ ์ธ ์š”๊ตฌ์‚ฌํ•ญ์ด ์กด์žฌํ•œ๋‹ค. 1) ๊ฐ€์ฐฝ์„ ์ด๋ฃจ๋Š” ๋‹ค์–‘ํ•œ ์š”์†Œ๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ ์ œ์–ดํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. 2) ๋†’์€ ํ’ˆ์งˆ ์ˆ˜์ค€ ๋ฐ ์‚ฌ์šฉ์„ฑ์„ ๋‹ฌ์„ฑํ•ด์•ผ ํ•œ๋‹ค. 3) ์ถฉ๋ถ„ํ•œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ํ™•๋ณดํ•˜๊ธฐ ์–ด๋ ต๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” ๋Œ€ํ‘œ์ ์ธ ์Œ์„ฑ ์ƒ์„ฑ ๋ชจ๋ธ๋ง ๊ธฐ๋ฒ•์ธ ์†Œ์Šค-ํ•„ํ„ฐ ์ด๋ก ์— ์ฃผ๋ชฉํ•˜์˜€๋‹ค. ๊ฐ€์ฐฝ ์‹ ํ˜ธ๋ฅผ ์Œ์ • ์ •๋ณด์— ํ•ด๋‹นํ•˜๋Š” ์†Œ์Šค์™€ ๋ฐœ์Œ ์ •๋ณด์— ํ•ด๋‹นํ•˜๋Š” ํ•„ํ„ฐ์˜ ํ•ฉ์„ฑ๊ณฑ์œผ๋กœ ์ •์˜ํ•˜๊ณ , ์ด๋ฅผ ๊ฐ๊ฐ ๋…๋ฆฝ์ ์œผ๋กœ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋Š” ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•˜์—ฌ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ํšจ์œจ์„ฑ๊ณผ ์ œ์–ด ๊ฐ€๋Šฅ์„ฑ์„ ๋™์‹œ์— ํ™•๋ณดํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ๋˜ํ•œ ์šฐ๋ฆฌ๋Š” ๋ฐœ์Œ, ์Œ์ •, ํ™”์ž ๋“ฑ ์กฐ๊ฑด๋ถ€ ์ž…๋ ฅ์ด ์ฃผ์–ด์ง„ ์ƒํ™ฉ์—์„œ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๋ชจ๋ธ๋งํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์กฐ๊ฑด๋ถ€ ์ž๊ธฐํšŒ๊ท€ ๋ชจ๋ธ ๊ธฐ๋ฐ˜์˜ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ํ™œ์šฉํ•˜์˜€๋‹ค. 
๋งˆ์ง€๋ง‰์œผ๋กœ ๋ ˆ์ด๋ธ”๋ง ๋˜์–ด์žˆ์ง€ ์•Š์€ ์Œ์•…์  ํ‘œํ˜„์„ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋„๋ก ์šฐ๋ฆฌ๋Š” ์ž๊ธฐ์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šคํƒ€์ผ ๋ชจ๋ธ๋ง ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ–ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ œ์•ˆํ•œ ๋ชจ๋ธ์ด ๋ฐœ์Œ, ์Œ์ •, ์Œ์ƒ‰, ์ฐฝ๋ฒ•, ํ‘œํ˜„ ๋“ฑ ๋‹ค์–‘ํ•œ ์š”์†Œ๋ฅผ ์œ ์—ฐํ•˜๊ฒŒ ์ œ์–ดํ•˜๋ฉด์„œ๋„ ์‹ค์ œ ๊ฐ€์ฐฝ๊ณผ ๊ตฌ๋ถ„์ด ์–ด๋ ค์šด ์ˆ˜์ค€์˜ ๊ณ ํ’ˆ์งˆ ๊ฐ€์ฐฝ ํ•ฉ์„ฑ์ด ๊ฐ€๋Šฅํ•จ์„ ํ™•์ธํ–ˆ๋‹ค. ๋‚˜์•„๊ฐ€ ์‹ค์ œ ์Œ์•… ์ œ์ž‘ ๊ณผ์ •์„ ๊ณ ๋ คํ•œ ์ƒ์„ฑ ๋ฐ ์ˆ˜์ • ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•˜์˜€๊ณ , ์ƒˆ๋กœ์šด ๋ชฉ์†Œ๋ฆฌ ๋””์ž์ธ, ๊ต์ฐจ ์ƒ์„ฑ ๋“ฑ ์ฐฝ์ž‘์ž์˜ ์ƒ์ƒ๋ ฅ๊ณผ ํ•œ๊ณ„๋ฅผ ๋„“ํž ์ˆ˜ ์žˆ๋Š” ์‘์šฉ์ด ๊ฐ€๋Šฅํ•จ์„ ํ™•์ธํ–ˆ๋‹ค.1 Introduction 1 1.1 Motivation 1 1.2 Problems in singing voice synthesis 4 1.3 Task of interest 8 1.3.1 Single-singer SVS 9 1.3.2 Multi-singer SVS 10 1.3.3 Expressive SVS 11 1.4 Contribution 11 2 Background 13 2.1 Singing voice 14 2.2 Source-filter theory 18 2.3 Autoregressive model 21 2.4 Related works 22 2.4.1 Speech synthesis 25 2.4.2 Singing voice synthesis 29 3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System 31 3.1 Introduction 31 3.2 Related work 33 3.3 Proposed method 35 3.3.1 Input representation 35 3.3.2 Mel-synthesis network 36 3.3.3 Super-resolution network 38 3.4 Experiments 42 3.4.1 Dataset 42 3.4.2 Training 42 3.4.3 Evaluation 43 3.4.4 Analysis on generated spectrogram 46 3.5 Discussion 49 3.5.1 Limitations of input representation 49 3.5.2 Advantages of using super-resolution network 53 3.6 Conclusion 55 4 Disentangling Timbre and Singing Style with multi-singer Singing Synthesis System 57 4.1Introduction 57 4.2 Related works 59 4.2.1 Multi-singer SVS system 60 4.3 Proposed Method 60 4.3.1 Singer identity encoder 62 4.3.2 Disentangling timbre & singing style 64 4.4 Experiment 64 4.4.1 Dataset and preprocessing 64 4.4.2 Training & inference 65 4.4.3 Analysis on generated spectrogram 65 4.4.4 Listening test 66 4.4.5 Timbre & style classification test 68 4.5 Discussion 70 4.5.1 Query audio selection strategy for singer identity encoder 70 4.5.2 Few-shot adaptation 72 4.6 Conclusion 74 5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder 77 5.1 Introduction 77 5.2 Related work 79 5.3 Proposed method 80 5.3.1 Local style token module 80 5.3.2 Dual-path pitch encoder 85 5.3.3 Bandwidth extension vocoder 85 5.4 Experiment 86 5.4.1 Dataset 86 5.4.2 Training 86 5.4.3 Qualitative evaluation 87 5.4.4 Dual-path reconstruction analysis 89 5.4.5 Qualitative analysis 90 5.5 Discussion 93 5.5.1 Difference between midi pitch and f0 93 5.5.2 Considerations for use in the actual music production process 94 5.6 Conclusion 95 6 Conclusion 97 6.1 Thesis summary 97 6.2 Limitations and future work 99 6.2.1 Improvements to a faster and robust system 99 6.2.2 Explainable and intuitive controllability 101 6.2.3 Extensions to common speech synthesis tools 103 6.2.4 Towards a collaborative and creative tool 104๋ฐ•

    Intonation Modelling for Speech Synthesis and Emphasis Preservation

    Speech-to-speech translation is a framework which recognises speech in an input language, translates it into a target language and synthesises speech in that target language. In such a system, variations in the speech signal which are inherent to natural human speech are lost as the information passes through the different building blocks of the translation process. The work presented in this thesis addresses aspects of speech synthesis which are lost in traditional speech-to-speech translation approaches. The main research axis of this thesis is the study of prosody for speech synthesis and emphasis preservation. A first investigation of regional accents of spoken French is carried out to understand the sensitivity of native listeners to accented speech synthesis. Listening tests show that standard adaptation methods for speech synthesis are not sufficient for listeners to perceive accentedness; combining adaptation with the original prosody, on the other hand, allows accents to be perceived. Addressing the need for a more suitable prosody model, a physiologically plausible intonation model is proposed. Inspired by the command-response model, it is built from basic components that can be related to muscle responses to nerve impulses; these components are taken to represent muscle control of the vocal folds. A motivation for such a model is its theoretical language independence, based on the fact that humans share the same vocal apparatus. An automatic parameter extraction method which integrates a perceptually relevant measure is proposed along with the model. This approach is evaluated and compared with the standard command-response model. Two corpora including sentences with emphasised words are presented in the context of the SIWIS project: the first is a multilingual corpus with speech from multiple speakers; the second is a high-quality, speech-synthesis-oriented corpus from a professional speaker. Two broad uses of the model are evaluated. The first shows that it is difficult to predict model parameters; the second, however, shows that parameters can be transferred in the context of emphasis synthesis. A relation between model parameters and linguistic features such as stress and accent is demonstrated, and similar observations are made between the parameters and emphasis. We then investigate the extraction of atoms from emphasised speech and their transfer to neutral speech, which turns out to elicit emphasis perception. Using clustering methods, this is extended to the emphasis of other words, using linguistic context. The approach is validated by listening tests in the case of English.
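    The command-response idea referred to above is commonly associated with the Fujisaki model, in which the log-F0 contour is a baseline plus the responses of critically damped second-order systems to phrase impulse commands and accent step commands. The sketch below implements that generic textbook formulation, not the atom-based model proposed in the thesis; the parameter values and the example commands are purely illustrative.

```python
# Generic Fujisaki-style command-response F0 contour (illustrative only).
import numpy as np

def phrase_response(t, alpha=2.0):
    """Phrase control: Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0."""
    tt = np.maximum(t, 0.0)
    return np.where(t >= 0.0, alpha ** 2 * tt * np.exp(-alpha * tt), 0.0)

def accent_response(t, beta=20.0, gamma=0.9):
    """Accent control: Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    tt = np.maximum(t, 0.0)
    g = 1.0 - (1.0 + beta * tt) * np.exp(-beta * tt)
    return np.where(t >= 0.0, np.minimum(g, gamma), 0.0)

def f0_contour(t, fb=120.0, phrase_cmds=(), accent_cmds=()):
    """ln F0(t) = ln Fb + sum Ap*Gp(t - T0) + sum Aa*(Ga(t - T1) - Ga(t - T2))."""
    log_f0 = np.full_like(t, np.log(fb))
    for amp, t0 in phrase_cmds:
        log_f0 += amp * phrase_response(t - t0)
    for amp, t1, t2 in accent_cmds:
        log_f0 += amp * (accent_response(t - t1) - accent_response(t - t2))
    return np.exp(log_f0)

# Example: one phrase command at t = 0 s and one accent command on an
# emphasised word between 0.6 s and 1.0 s (amplitudes are made up).
t = np.linspace(0.0, 2.0, 400)
f0 = f0_contour(t, phrase_cmds=[(0.5, 0.0)], accent_cmds=[(0.4, 0.6, 1.0)])
```

    In this picture, the emphasis transfer studied in the thesis roughly corresponds to re-using command (or atom) parameters estimated from emphasised speech when generating a neutral utterance.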
    • โ€ฆ
    corecore