9,429 research outputs found

    Controllable Singing Voice Synthesis Using Conditional Autoregressive Artificial Neural Networks

    Doctoral dissertation -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Intelligence and Information Convergence, August 2022. Advisor: Kyogu Lee.
    Singing voice synthesis aims at synthesizing a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps creators reflect their intentions more easily and conveniently. However, designing such a system poses three challenging problems: 1) it should be possible to independently control the various elements that make up the singing; 2) it must be possible to generate high-quality sound sources; and 3) it is difficult to secure sufficient training data. To address these problems, we first turned to the source-filter theory, a representative model of speech production. By modeling a singing voice as the convolution of a source, which carries the pitch information, and a filter, which carries the pronunciation information, and by designing a structure that can model each of them independently, we sought to secure training-data efficiency and controllability at the same time. In addition, we used a deep neural network based on a conditional autoregressive model to effectively model sequential data when conditional inputs such as pronunciation, pitch, and speaker are given. So that the entire framework generates high-quality sound whose distribution is closer to that of real singing voices, an adversarial training technique was applied to the training process. Finally, we applied a self-supervised style modeling technique to model detailed, unlabeled musical expressions. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and musical expression, while synthesizing high-quality singing that is difficult to distinguish from ground-truth singing. Furthermore, we proposed a generation and modification framework that reflects the actual music production process, and confirmed that it can be applied to expand the limits of the creator's imagination, for example through new voice design and cross-generation.
    Table of contents:
    1 Introduction
      1.1 Motivation
      1.2 Problems in singing voice synthesis
      1.3 Task of interest
        1.3.1 Single-singer SVS
        1.3.2 Multi-singer SVS
        1.3.3 Expressive SVS
      1.4 Contribution
    2 Background
      2.1 Singing voice
      2.2 Source-filter theory
      2.3 Autoregressive model
      2.4 Related works
        2.4.1 Speech synthesis
        2.4.2 Singing voice synthesis
    3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System
      3.1 Introduction
      3.2 Related work
      3.3 Proposed method
        3.3.1 Input representation
        3.3.2 Mel-synthesis network
        3.3.3 Super-resolution network
      3.4 Experiments
        3.4.1 Dataset
        3.4.2 Training
        3.4.3 Evaluation
        3.4.4 Analysis on generated spectrogram
      3.5 Discussion
        3.5.1 Limitations of input representation
        3.5.2 Advantages of using super-resolution network
      3.6 Conclusion
    4 Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System
      4.1 Introduction
      4.2 Related works
        4.2.1 Multi-singer SVS system
      4.3 Proposed Method
        4.3.1 Singer identity encoder
        4.3.2 Disentangling timbre & singing style
      4.4 Experiment
        4.4.1 Dataset and preprocessing
        4.4.2 Training & inference
        4.4.3 Analysis on generated spectrogram
        4.4.4 Listening test
        4.4.5 Timbre & style classification test
      4.5 Discussion
        4.5.1 Query audio selection strategy for singer identity encoder
        4.5.2 Few-shot adaptation
      4.6 Conclusion
    5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder
      5.1 Introduction
      5.2 Related work
      5.3 Proposed method
        5.3.1 Local style token module
        5.3.2 Dual-path pitch encoder
        5.3.3 Bandwidth extension vocoder
      5.4 Experiment
        5.4.1 Dataset
        5.4.2 Training
        5.4.3 Qualitative evaluation
        5.4.4 Dual-path reconstruction analysis
        5.4.5 Qualitative analysis
      5.5 Discussion
        5.5.1 Difference between midi pitch and f0
        5.5.2 Considerations for use in the actual music production process
      5.6 Conclusion
    6 Conclusion
      6.1 Thesis summary
      6.2 Limitations and future work
        6.2.1 Improvements to a faster and robust system
        6.2.2 Explainable and intuitive controllability
        6.2.3 Extensions to common speech synthesis tools
        6.2.4 Towards a collaborative and creative tool
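    The abstract above rests on two standard modeling ideas: the source-filter decomposition of the voice and a conditional autoregressive factorization of the output sequence. As a minimal sketch of the textbook forms (generic symbols, not the thesis' exact notation):

```latex
% Source-filter model: the singing signal y is the pitch-carrying source e
% convolved with the pronunciation-carrying filter h; in the frequency domain
% the two factors separate, which is what makes independent control plausible.
y(t) = (e * h)(t) = \int e(\tau)\, h(t - \tau)\, d\tau
\qquad \Longleftrightarrow \qquad
Y(f) = E(f)\, H(f)

% Conditional autoregressive model: each output frame x_t is generated given
% the previous frames and conditioning inputs c (pronunciation, pitch, singer, ...).
p(x_{1:T} \mid c) = \prod_{t=1}^{T} p(x_t \mid x_{<t},\, c)
```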

    PoLyScriber: Integrated Training of Extractor and Lyrics Transcriber for Polyphonic Music

    Lyrics transcription of polyphonic music is challenging because the background music affects lyrics intelligibility. Typically, lyrics transcription is performed by a two-step pipeline, i.e., a singing vocal extraction front-end followed by a lyrics transcriber back-end, where the front-end and back-end are trained separately. Such a two-step pipeline suffers both from imperfect vocal extraction and from a mismatch between the front-end and the back-end. In this work, we propose a novel end-to-end integrated training framework, which we call PoLyScriber, to globally optimize the vocal extractor front-end and the lyrics transcriber back-end for lyrics transcription in polyphonic music. The experimental results show that our proposed integrated training model achieves substantial improvements over existing approaches on publicly available test datasets. Comment: 13 pages
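    The key idea of integrated training, as described above, is that the transcription loss back-propagates through the extractor so both stages are optimized jointly rather than separately. The sketch below is a hedged illustration of that idea only; the module architectures, loss weighting, and names are placeholder assumptions, not PoLyScriber's actual design.

```python
# Minimal sketch of joint (integrated) training of a vocal-extraction front-end and a
# CTC lyrics-transcription back-end. Placeholder architectures, not the authors' code.
import torch
import torch.nn as nn

class VocalExtractor(nn.Module):
    """Front-end: maps a polyphonic spectrogram to an estimated vocal spectrogram."""
    def __init__(self, n_bins=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU(), nn.Linear(256, n_bins))
    def forward(self, mix):                      # mix: (batch, time, n_bins)
        return self.net(mix)

class LyricsTranscriber(nn.Module):
    """Back-end: maps (estimated) vocals to per-frame token logits for a CTC loss."""
    def __init__(self, n_bins=80, vocab=40):
        super().__init__()
        self.rnn = nn.GRU(n_bins, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, vocab)
    def forward(self, vocals):
        h, _ = self.rnn(vocals)
        return self.out(h)                       # (batch, time, vocab)

extractor, transcriber = VocalExtractor(), LyricsTranscriber()
opt = torch.optim.Adam(list(extractor.parameters()) + list(transcriber.parameters()), lr=1e-4)
ctc = nn.CTCLoss(blank=0)

def training_step(mix, clean_vocals, tokens, in_lens, tok_lens):
    est = extractor(mix)
    log_probs = transcriber(est).log_softmax(-1).transpose(0, 1)   # (time, batch, vocab)
    # Joint objective: the transcription loss flows through the extractor; an optional
    # separation term regularizes it when clean vocal references are available.
    loss = ctc(log_probs, tokens, in_lens, tok_lens) + 0.1 * nn.functional.l1_loss(est, clean_vocals)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```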

    Fusion of Multimodal Information in Music Content Analysis

    Music is often processed through its acoustic realization. This is restrictive in the sense that music is clearly a highly multimodal concept, where various types of heterogeneous information can be associated with a given piece of music (a musical score, musicians' gestures, lyrics, user-generated metadata, etc.). This has recently led researchers to apprehend music through its various facets, giving rise to "multimodal music analysis" studies. This article gives a synthetic overview of methods that have been successfully employed in multimodal signal analysis. In particular, their use in music content processing is discussed in more detail through five case studies that highlight different multimodal integration techniques. The case studies include an example of cross-modal correlation for music video analysis, an audiovisual drum transcription system, a description of the concept of informed source separation, a discussion of multimodal dance-scene analysis, and an example of user-interactive music analysis. In the light of these case studies, some perspectives on multimodality in music processing are finally suggested.
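    One family of multimodal integration techniques such overviews typically contrast with early feature-level fusion is late, score-level fusion. The toy sketch below only illustrates that general idea; the modalities, classes, and weights are invented for the example and are not taken from the article.

```python
# Hedged illustration of late (score-level) multimodal fusion, e.g. combining
# audio- and video-based detectors for the same classification task.
import numpy as np

def late_fusion(scores_by_modality: dict, weights: dict) -> np.ndarray:
    """Weighted average of per-class scores from several modalities."""
    fused = sum(weights[m] * scores_by_modality[m] for m in scores_by_modality)
    return fused / sum(weights[m] for m in scores_by_modality)

# toy usage: fuse audio and video scores for three drum-stroke classes
audio_scores = np.array([0.7, 0.2, 0.1])   # P(kick, snare, hi-hat) from the audio detector
video_scores = np.array([0.5, 0.4, 0.1])   # same classes from the camera-based detector
print(late_fusion({"audio": audio_scores, "video": video_scores},
                  {"audio": 0.6, "video": 0.4}))
```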

    Reactive Statistical Mapping: Towards the Sketching of Performative Control with Data

    Part 1: Fundamental Issues. This paper presents the results of our participation in the ninth eNTERFACE workshop on multimodal user interfaces. Our target for this workshop was to bring some technologies currently used in speech recognition and synthesis to a new level, i.e. to make them the core of a new HMM-based mapping system. The idea of statistical mapping has been investigated, more precisely how to use Gaussian Mixture Models and Hidden Markov Models for the realtime, reactive generation of new trajectories from input labels and for realtime regression in a continuous-to-continuous use case. As a result, we have developed several proofs of concept, including an incremental speech synthesiser, software for exploring stylistic spaces for gait and facial motion in realtime, a reactive audiovisual laughter synthesiser, and a prototype demonstrating the realtime reconstruction of lower-body gait motion strictly from upper-body motion, with conservation of the stylistic properties. This project has been the opportunity to formalise HMM-based mapping, integrate several of these innovations into the Mage library, and explore the development of a realtime gesture recognition tool.
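    The continuous-to-continuous regression mentioned above is commonly realized with GMM-based regression (GMR): fit a GMM on joint input-output vectors, then predict the conditional expectation of the output given each input frame. The sketch below shows that general technique only; the feature choices and dimensions (e.g. "upper body" to "lower body") are illustrative assumptions, not the workshop's actual implementation.

```python
# Minimal GMM-regression (GMR) sketch for continuous-to-continuous statistical mapping.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x, y, n_components=8):
    """Fit a GMM on joint [input, output] vectors."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full", random_state=0)
    gmm.fit(np.hstack([x, y]))
    return gmm

def gmr_predict(gmm, x, dx):
    """Conditional expectation E[y | x] under the joint GMM, computed frame by frame
    so it can run reactively on a stream of input frames."""
    mu, cov, w = gmm.means_, gmm.covariances_, gmm.weights_
    mu_x, mu_y = mu[:, :dx], mu[:, dx:]
    cov_xx, cov_yx = cov[:, :dx, :dx], cov[:, dx:, :dx]
    preds = []
    for xt in x:
        diff = xt - mu_x                                   # (K, dx)
        # responsibility of each mixture component for this input frame
        resp = np.array([w[k] * np.exp(-0.5 * diff[k] @ np.linalg.solve(cov_xx[k], diff[k]))
                         / np.sqrt(np.linalg.det(cov_xx[k])) for k in range(len(w))])
        resp /= resp.sum()
        # component-wise conditional means, mixed by responsibility
        cond = [mu_y[k] + cov_yx[k] @ np.linalg.solve(cov_xx[k], diff[k]) for k in range(len(w))]
        preds.append(np.sum(resp[:, None] * np.array(cond), axis=0))
    return np.array(preds)

# toy usage: 2-D "upper body" features -> 1-D "lower body" feature
rng = np.random.default_rng(0)
x_train = rng.normal(size=(500, 2))
y_train = x_train[:, :1] ** 2 + 0.1 * rng.normal(size=(500, 1))
model = fit_joint_gmm(x_train, y_train)
print(gmr_predict(model, x_train[:3], dx=2))
```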
    • โ€ฆ