91 research outputs found

    Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech

    We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests was used to evaluate speech quality, emotion identification rates and emotional strength for the six emotions we recorded. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions with context-dependent prosodic patterns. While synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate emotional strength. For emotions characterized by both spectral and prosodic components, unit selection speech was identified more accurately by listeners; for emotions characterized mainly by prosodic components, HMM-based synthetic speech was identified more accurately. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method.
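    The evaluation described above reduces to two per-system measurements: an emotion identification rate (how often listeners pick the intended emotion) and a mean opinion score for quality or emotional strength. A minimal sketch of how such scores might be computed, using hypothetical listener responses and emotion labels (not the six emotions actually recorded in the study):

```python
def identification_rate(intended, responses):
    """Fraction of listener responses that match the intended emotion."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if r == intended) / len(responses)

def mean_opinion_score(ratings):
    """Average of 1-5 listener ratings (e.g. speech quality or emotional strength)."""
    return sum(ratings) / len(ratings)

# Hypothetical responses for one stimulus synthesized with each method.
unit_selection_responses = ["anger", "anger", "sadness", "anger"]
hmm_responses = ["anger", "neutral", "neutral", "anger"]

print(identification_rate("anger", unit_selection_responses))  # 0.75
print(identification_rate("anger", hmm_responses))             # 0.5
print(mean_opinion_score([4, 3, 4, 5]))                        # 4.0
```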

    ์กฐ๊ฑด๋ถ€ ์ž๊ธฐํšŒ๊ท€ํ˜• ์ธ๊ณต์‹ ๊ฒฝ๋ง์„ ์ด์šฉํ•œ ์ œ์–ด ๊ฐ€๋Šฅํ•œ ๊ฐ€์ฐฝ ์Œ์„ฑ ํ•ฉ์„ฑ

    Doctoral dissertation -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Intelligence and Information, August 2022. Advisor: Kyogu Lee. Singing voice synthesis aims at synthesizing a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps to reflect the creator's intentions more easily and conveniently. However, there are three challenging problems in designing such a system: 1) it should be possible to independently control the various elements that make up the singing; 2) it must be possible to generate high-quality sound sources; 3) it is difficult to secure sufficient training data. To deal with these problems, we first paid attention to source-filter theory, a representative model of speech production. We sought to secure training-data efficiency and controllability at the same time by modeling a singing voice as the convolution of a source, carrying pitch information, and a filter, carrying pronunciation information, and by designing a structure that can model each independently. In addition, we used a conditional autoregressive deep neural network to effectively model sequential data in a situation where conditional inputs such as pronunciation, pitch, and speaker are given. In order for the entire framework to generate high-quality sound whose distribution is closer to that of a real singing voice, an adversarial training technique was applied to the training process. Finally, we applied a self-supervised style modeling technique to model detailed, unlabeled musical expressions. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and musical expression, while synthesizing high-quality singing that is difficult to distinguish from ground-truth singing. Furthermore, we proposed a generation and modification framework that considers the situations encountered in the actual music production process, and confirmed that it can be applied to expand the limits of the creator's imagination, for example through new voice design and cross-generation.
๋งˆ์ง€๋ง‰์œผ๋กœ ๋ ˆ์ด๋ธ”๋ง ๋˜์–ด์žˆ์ง€ ์•Š์€ ์Œ์•…์  ํ‘œํ˜„์„ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋„๋ก ์šฐ๋ฆฌ๋Š” ์ž๊ธฐ์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ์Šคํƒ€์ผ ๋ชจ๋ธ๋ง ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ–ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ œ์•ˆํ•œ ๋ชจ๋ธ์ด ๋ฐœ์Œ, ์Œ์ •, ์Œ์ƒ‰, ์ฐฝ๋ฒ•, ํ‘œํ˜„ ๋“ฑ ๋‹ค์–‘ํ•œ ์š”์†Œ๋ฅผ ์œ ์—ฐํ•˜๊ฒŒ ์ œ์–ดํ•˜๋ฉด์„œ๋„ ์‹ค์ œ ๊ฐ€์ฐฝ๊ณผ ๊ตฌ๋ถ„์ด ์–ด๋ ค์šด ์ˆ˜์ค€์˜ ๊ณ ํ’ˆ์งˆ ๊ฐ€์ฐฝ ํ•ฉ์„ฑ์ด ๊ฐ€๋Šฅํ•จ์„ ํ™•์ธํ–ˆ๋‹ค. ๋‚˜์•„๊ฐ€ ์‹ค์ œ ์Œ์•… ์ œ์ž‘ ๊ณผ์ •์„ ๊ณ ๋ คํ•œ ์ƒ์„ฑ ๋ฐ ์ˆ˜์ • ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•˜์˜€๊ณ , ์ƒˆ๋กœ์šด ๋ชฉ์†Œ๋ฆฌ ๋””์ž์ธ, ๊ต์ฐจ ์ƒ์„ฑ ๋“ฑ ์ฐฝ์ž‘์ž์˜ ์ƒ์ƒ๋ ฅ๊ณผ ํ•œ๊ณ„๋ฅผ ๋„“ํž ์ˆ˜ ์žˆ๋Š” ์‘์šฉ์ด ๊ฐ€๋Šฅํ•จ์„ ํ™•์ธํ–ˆ๋‹ค.1 Introduction 1 1.1 Motivation 1 1.2 Problems in singing voice synthesis 4 1.3 Task of interest 8 1.3.1 Single-singer SVS 9 1.3.2 Multi-singer SVS 10 1.3.3 Expressive SVS 11 1.4 Contribution 11 2 Background 13 2.1 Singing voice 14 2.2 Source-filter theory 18 2.3 Autoregressive model 21 2.4 Related works 22 2.4.1 Speech synthesis 25 2.4.2 Singing voice synthesis 29 3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System 31 3.1 Introduction 31 3.2 Related work 33 3.3 Proposed method 35 3.3.1 Input representation 35 3.3.2 Mel-synthesis network 36 3.3.3 Super-resolution network 38 3.4 Experiments 42 3.4.1 Dataset 42 3.4.2 Training 42 3.4.3 Evaluation 43 3.4.4 Analysis on generated spectrogram 46 3.5 Discussion 49 3.5.1 Limitations of input representation 49 3.5.2 Advantages of using super-resolution network 53 3.6 Conclusion 55 4 Disentangling Timbre and Singing Style with multi-singer Singing Synthesis System 57 4.1Introduction 57 4.2 Related works 59 4.2.1 Multi-singer SVS system 60 4.3 Proposed Method 60 4.3.1 Singer identity encoder 62 4.3.2 Disentangling timbre & singing style 64 4.4 Experiment 64 4.4.1 Dataset and preprocessing 64 4.4.2 Training & inference 65 4.4.3 Analysis on generated spectrogram 65 4.4.4 Listening test 66 4.4.5 Timbre & style classification test 68 4.5 Discussion 70 4.5.1 Query audio selection strategy for singer identity encoder 70 4.5.2 Few-shot adaptation 72 4.6 Conclusion 74 5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder 77 5.1 Introduction 77 5.2 Related work 79 5.3 Proposed method 80 5.3.1 Local style token module 80 5.3.2 Dual-path pitch encoder 85 5.3.3 Bandwidth extension vocoder 85 5.4 Experiment 86 5.4.1 Dataset 86 5.4.2 Training 86 5.4.3 Qualitative evaluation 87 5.4.4 Dual-path reconstruction analysis 89 5.4.5 Qualitative analysis 90 5.5 Discussion 93 5.5.1 Difference between midi pitch and f0 93 5.5.2 Considerations for use in the actual music production process 94 5.6 Conclusion 95 6 Conclusion 97 6.1 Thesis summary 97 6.2 Limitations and future work 99 6.2.1 Improvements to a faster and robust system 99 6.2.2 Explainable and intuitive controllability 101 6.2.3 Extensions to common speech synthesis tools 103 6.2.4 Towards a collaborative and creative tool 104๋ฐ•

    Continuous expressive speaking styles synthesis based on CVSM and MR-HMM

    This paper introduces a continuous system capable of automatically producing the most adequate speaking style for synthesizing a desired target text. This is achieved through joint modeling of the acoustic and lexical parameters of the speaker models, adapting the CVSM projection of the training texts using MR-HMM techniques. As such, we consider that, as long as sufficient variety is available in the training data, we should be able to map a continuous lexical space onto a continuous acoustic space. The proposed continuous text-to-speech system was assessed in a perceptual evaluation in order to compare it with traditional approaches to the task. The system proved capable of conveying the correct expressiveness (average adequacy of 3.6) with an expressive strength comparable to oracle traditional expressive speech synthesis (average of 3.6), although with a drop in speech quality, mainly due to the semi-continuous nature of the data (average quality of 2.9). This means that the proposed system can improve on traditional neutral systems without requiring any additional user interaction.
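    The underlying idea is that a target text is projected into a continuous style space and the acoustic parameters are then obtained by weighting the trained style-dependent models according to that projection. A schematic sketch of such an interpolation step, with made-up style coordinates and per-style acoustic parameters; the actual CVSM features and MR-HMM adaptation are considerably more involved:

```python
import numpy as np

# Mean acoustic parameters per training style (hypothetical values).
style_params = {
    "neutral": np.array([0.0, 1.0]),
    "happy":   np.array([0.8, 1.3]),
    "sad":     np.array([-0.6, 0.7]),
}

# Continuous coordinates of each training style, standing in for the
# CVSM projection of the training texts.
style_coords = {
    "neutral": np.array([0.0, 0.0]),
    "happy":   np.array([1.0, 0.2]),
    "sad":     np.array([-0.9, 0.4]),
}

def interpolate_acoustics(target_coord):
    """Weight each style's acoustic parameters by its closeness to the
    target text's projection, so that nearby styles dominate the output."""
    weights = {name: 1.0 / (np.linalg.norm(target_coord - coord) + 1e-6)
               for name, coord in style_coords.items()}
    total = sum(weights.values())
    return sum((w / total) * style_params[name] for name, w in weights.items())

# Projection of a new target text (hypothetical), between neutral and happy.
print(interpolate_acoustics(np.array([0.5, 0.1])))
```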

    Fast Speech in Unit Selection Speech Synthesis

    Moers-Prinz D. Fast Speech in Unit Selection Speech Synthesis. Bielefeld: Universität Bielefeld; 2020. Speech synthesis is part of the everyday life of many people with severe visual disabilities. For those who rely on assistive speech technology, the possibility to choose a fast speaking rate is reported to be essential. Expressive speech synthesis and other spoken language interfaces may also require an integration of fast speech. Architectures such as formant or diphone synthesis can produce synthetic speech at fast speaking rates, but the generated speech does not sound very natural. Unit selection synthesis systems, by contrast, are capable of delivering more natural output; nevertheless, fast speech has not been adequately implemented in such systems to date. Thus, the goal of the work presented here was to determine an optimal strategy for modeling fast speech in unit selection speech synthesis, providing potential users with a more natural-sounding alternative for fast speech output.
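    One way to integrate fast speech into a unit selection framework is to bias the target cost toward candidate units whose natural durations match the shortened durations implied by the requested speaking rate, rather than uniformly time-compressing the output signal. A toy sketch of such a duration-aware target cost, with invented units and weights; this is an illustration of the general mechanism, not the strategy actually chosen in the dissertation:

```python
# Each candidate unit from the database carries its natural duration in ms
# and a precomputed spectral distance to the target specification.
candidates = [
    {"unit": "a_1", "duration": 90.0, "spectral_dist": 0.20},
    {"unit": "a_2", "duration": 60.0, "spectral_dist": 0.35},
    {"unit": "a_3", "duration": 70.0, "spectral_dist": 0.25},
]

def target_cost(candidate, target_duration, w_spec=1.0, w_dur=0.02):
    """Combine spectral distance with a penalty for deviating from the
    duration implied by the requested speaking rate."""
    dur_penalty = abs(candidate["duration"] - target_duration)
    return w_spec * candidate["spectral_dist"] + w_dur * dur_penalty

# At a fast speaking rate the target duration shrinks, so shorter units win.
rate_factor = 1.5                     # 1.5x faster than normal
normal_target = 90.0                  # target duration in ms at normal rate
fast_target = normal_target / rate_factor

best = min(candidates, key=lambda c: target_cost(c, fast_target))
print(best["unit"], target_cost(best, fast_target))
```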
    • โ€ฆ
    corecore