19 research outputs found

    CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers

    It is challenging to build a multi-singer, high-fidelity singing voice synthesis system with cross-lingual ability using only monolingual singers in the training stage. In this paper, we propose CrossSinger, a cross-lingual singing voice synthesizer based on Xiaoicesing2. Specifically, we use the International Phonetic Alphabet (IPA) to unify the representation of all languages in the training data. Moreover, we leverage conditional layer normalization to incorporate language information into the model for better pronunciation when singers encounter unseen languages. Additionally, a gradient reversal layer (GRL) is used to remove the singer bias contained in the lyrics: because every singer is monolingual, singer identity is implicitly associated with the text. The experiment is conducted on a combination of three singing voice datasets: the Japanese Kiritan dataset, the English NUS-48E dataset, and an internal Chinese dataset. The results show that CrossSinger can synthesize high-fidelity songs for various singers with cross-lingual ability, including code-switching cases.
    Comment: Accepted by ASRU202
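    The gradient reversal layer mentioned in this abstract is a standard domain-adversarial trick: identity in the forward pass, negated gradients in the backward pass, so the upstream text encoder is pushed to discard singer identity. Below is a minimal PyTorch sketch of that idea, not CrossSinger's actual code; the classifier head, dimensions, and the lambda_ scale are illustrative assumptions.

```python
# Minimal sketch of a gradient reversal layer (GRL) used to strip singer
# identity from text/lyric embeddings. Illustrative only, not CrossSinger code.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales and negates gradients backward."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient trains the upstream encoder to *remove*
        # the information this classifier tries to predict.
        return -ctx.lambda_ * grad_output, None


class SingerAdversary(nn.Module):
    """Predicts singer identity from text embeddings through a GRL."""

    def __init__(self, dim=256, n_singers=3, lambda_=1.0):
        super().__init__()
        self.lambda_ = lambda_
        self.classifier = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, n_singers)
        )

    def forward(self, text_emb):
        reversed_emb = GradReverse.apply(text_emb, self.lambda_)
        return self.classifier(reversed_emb)


if __name__ == "__main__":
    emb = torch.randn(8, 256, requires_grad=True)  # batch of text embeddings
    logits = SingerAdversary()(emb)
    logits.sum().backward()  # gradients flow back reversed into emb
    print(logits.shape, emb.grad.shape)
```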

    WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training

    This paper introduces a robust singing voice synthesis (SVS) system that produces very natural and realistic singing voices efficiently by leveraging an adversarial training strategy. On the one hand, we design simple but generic random-area conditional discriminators to help supervise the acoustic model, which effectively avoids over-smoothed spectrogram prediction and improves the expressiveness of SVS. On the other hand, we combine the spectrogram with a frame-level, linearly interpolated F0 sequence as the input to the neural vocoder, which is then optimized with the help of multiple adversarial conditional discriminators in the waveform domain and multi-scale distance functions in the frequency domain. Experimental results and ablation studies show that, compared with our previous auto-regressive work, the new system can produce high-quality singing voices efficiently by fine-tuning on singing datasets ranging from several minutes to a few hours. A large number of synthesized songs with different timbres are available online at https://zzw922cn.github.io/wesinger2, and we highly recommend that readers listen to them.
    Comment: accepted at ICASSP 202
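    The vocoder input described above, a spectrogram stacked with a frame-level, linearly interpolated F0 track, can be sketched as follows. This is a hedged NumPy illustration under assumed conventions (zero-valued frames mark unvoiced regions, F0 is appended as one extra channel); it is not the WeSinger 2 implementation.

```python
# Sketch: build a vocoder input from a mel-spectrogram plus a frame-level,
# linearly interpolated F0 sequence. Assumption-laden illustration only.
import numpy as np


def interpolate_f0(f0: np.ndarray) -> np.ndarray:
    """Linearly fill unvoiced (zero) frames so F0 is continuous per frame."""
    f0 = f0.astype(np.float64).copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0
    idx = np.arange(len(f0))
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    return f0


def vocoder_input(mel: np.ndarray, f0: np.ndarray) -> np.ndarray:
    """Stack mel (frames x n_mels) with interpolated F0 as an extra channel."""
    assert mel.shape[0] == f0.shape[0], "mel and F0 must share the frame axis"
    f0_cont = interpolate_f0(f0)[:, None]          # (frames, 1)
    return np.concatenate([mel, f0_cont], axis=1)  # (frames, n_mels + 1)


if __name__ == "__main__":
    mel = np.random.randn(200, 80)                      # 200 frames, 80 mel bins
    f0 = np.where(np.random.rand(200) > 0.3, 220.0, 0)  # zeros mark unvoiced frames
    print(vocoder_input(mel, f0).shape)                 # (200, 81)
```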

    쑰건뢀 μžκΈ°νšŒκ·€ν˜• 인곡신경망을 μ΄μš©ν•œ μ œμ–΄ κ°€λŠ₯ν•œ κ°€μ°½ μŒμ„± ν•©μ„±

    Doctoral dissertation -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Intelligence and Information Convergence, August 2022. Advisor: 이ꡐꡬ.
    Singing voice synthesis aims to synthesize a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps reflect the creator's intentions more easily and conveniently. However, designing such a system poses three challenging problems: 1) it should be possible to independently control the various elements that make up the singing; 2) it must be possible to generate high-quality audio; and 3) it is difficult to secure sufficient training data. To deal with these problems, we first turned to source-filter theory, a representative model of speech production. We sought to secure training-data efficiency and controllability at the same time by modeling a singing voice as the convolution of a source, corresponding to pitch information, and a filter, corresponding to pronunciation information, and by designing a structure that can model each independently. In addition, we used a conditional autoregressive deep neural network to effectively model sequential data when conditional inputs such as pronunciation, pitch, and speaker are given. For the entire framework to generate high-quality audio with a distribution closer to that of a real singing voice, an adversarial training technique was applied to the training process. Finally, we applied a self-supervised style modeling technique to model detailed, unlabeled musical expression. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and musical expression, while synthesizing high-quality singing that is difficult to distinguish from ground-truth singing. Furthermore, we proposed a generation and modification framework that reflects how the model would be used in an actual music production process, and confirmed that it can be applied to expand the limits of the creator's imagination, for example through new voice design and cross-generation.
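    The source-filter view described in this abstract (singing as the convolution of a pitch-driven source with a pronunciation-dependent filter) can be illustrated with a toy signal-domain example. The sketch below is a deliberate simplification for intuition, not the thesis architecture; the pulse-train source and the single fixed impulse response are assumptions.

```python
# Toy illustration of the source-filter idea: a pitched source (pulse train)
# convolved with a filter (vocal-tract-like impulse response).
import numpy as np

sr = 16000        # sample rate in Hz (assumed)
f0 = 220.0        # pitch of the source
duration = 0.5    # seconds

# Source: impulse train at the fundamental frequency (carries pitch information).
n = int(sr * duration)
source = np.zeros(n)
source[::int(sr / f0)] = 1.0

# Filter: a short decaying impulse response standing in for the vocal tract
# (carries pronunciation/timbre information).
t = np.arange(int(0.02 * sr)) / sr
impulse_response = np.exp(-t * 400.0) * np.cos(2 * np.pi * 800.0 * t)

# Singing frame ~ source (*) filter: convolution in the time domain.
voice = np.convolve(source, impulse_response)[:n]
print(voice.shape, voice.max())
```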
μš°λ¦¬λŠ” μ œμ•ˆν•œ λͺ¨λΈμ΄ 발음, μŒμ •, μŒμƒ‰, 창법, ν‘œν˜„ λ“± λ‹€μ–‘ν•œ μš”μ†Œλ₯Ό μœ μ—°ν•˜κ²Œ μ œμ–΄ν•˜λ©΄μ„œλ„ μ‹€μ œ κ°€μ°½κ³Ό ꡬ뢄이 μ–΄λ €μš΄ μˆ˜μ€€μ˜ κ³ ν’ˆμ§ˆ κ°€μ°½ 합성이 κ°€λŠ₯함을 ν™•μΈν–ˆλ‹€. λ‚˜μ•„κ°€ μ‹€μ œ μŒμ•… μ œμž‘ 과정을 κ³ λ €ν•œ 생성 및 μˆ˜μ • ν”„λ ˆμž„μ›Œν¬λ₯Ό μ œμ•ˆν•˜μ˜€κ³ , μƒˆλ‘œμš΄ λͺ©μ†Œλ¦¬ λ””μžμΈ, ꡐ차 생성 λ“± μ°½μž‘μžμ˜ 상상λ ₯κ³Ό ν•œκ³„λ₯Ό λ„“νž 수 μžˆλŠ” μ‘μš©μ΄ κ°€λŠ₯함을 ν™•μΈν–ˆλ‹€.1 Introduction 1 1.1 Motivation 1 1.2 Problems in singing voice synthesis 4 1.3 Task of interest 8 1.3.1 Single-singer SVS 9 1.3.2 Multi-singer SVS 10 1.3.3 Expressive SVS 11 1.4 Contribution 11 2 Background 13 2.1 Singing voice 14 2.2 Source-filter theory 18 2.3 Autoregressive model 21 2.4 Related works 22 2.4.1 Speech synthesis 25 2.4.2 Singing voice synthesis 29 3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System 31 3.1 Introduction 31 3.2 Related work 33 3.3 Proposed method 35 3.3.1 Input representation 35 3.3.2 Mel-synthesis network 36 3.3.3 Super-resolution network 38 3.4 Experiments 42 3.4.1 Dataset 42 3.4.2 Training 42 3.4.3 Evaluation 43 3.4.4 Analysis on generated spectrogram 46 3.5 Discussion 49 3.5.1 Limitations of input representation 49 3.5.2 Advantages of using super-resolution network 53 3.6 Conclusion 55 4 Disentangling Timbre and Singing Style with multi-singer Singing Synthesis System 57 4.1Introduction 57 4.2 Related works 59 4.2.1 Multi-singer SVS system 60 4.3 Proposed Method 60 4.3.1 Singer identity encoder 62 4.3.2 Disentangling timbre & singing style 64 4.4 Experiment 64 4.4.1 Dataset and preprocessing 64 4.4.2 Training & inference 65 4.4.3 Analysis on generated spectrogram 65 4.4.4 Listening test 66 4.4.5 Timbre & style classification test 68 4.5 Discussion 70 4.5.1 Query audio selection strategy for singer identity encoder 70 4.5.2 Few-shot adaptation 72 4.6 Conclusion 74 5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder 77 5.1 Introduction 77 5.2 Related work 79 5.3 Proposed method 80 5.3.1 Local style token module 80 5.3.2 Dual-path pitch encoder 85 5.3.3 Bandwidth extension vocoder 85 5.4 Experiment 86 5.4.1 Dataset 86 5.4.2 Training 86 5.4.3 Qualitative evaluation 87 5.4.4 Dual-path reconstruction analysis 89 5.4.5 Qualitative analysis 90 5.5 Discussion 93 5.5.1 Difference between midi pitch and f0 93 5.5.2 Considerations for use in the actual music production process 94 5.6 Conclusion 95 6 Conclusion 97 6.1 Thesis summary 97 6.2 Limitations and future work 99 6.2.1 Improvements to a faster and robust system 99 6.2.2 Explainable and intuitive controllability 101 6.2.3 Extensions to common speech synthesis tools 103 6.2.4 Towards a collaborative and creative tool 104λ°•

    Karaoker: Alignment-free singing voice synthesis with speech training data

    Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice-characteristic features and trained exclusively on spoken data, without requiring time alignments. Karaoker synthesizes singing voice following a multi-dimensional template extracted from a source waveform of an unseen speaker/singer. The model is jointly conditioned, through a single deep convolutional encoder, on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence, and octaves. We extend the text-to-speech training objective with feature reconstruction, classification, and speaker identification tasks that guide the model toward an accurate result. Beyond multi-tasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.
    Comment: Submitted to INTERSPEECH 202
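    A hedged PyTorch sketch of the conditioning path described above, a single convolutional encoder over stacked frame-level feature tracks, might look like the following. The feature ordering, channel widths, and layer count are assumptions for illustration, not Karaoker's published configuration.

```python
# Sketch: a single deep convolutional encoder over a multi-dimensional,
# frame-level conditioning template (pitch, intensity, harmonicity, formants,
# cepstral peak prominence, octave). Sizes and depth are assumptions.
import torch
import torch.nn as nn


class ConditioningEncoder(nn.Module):
    def __init__(self, n_features: int = 8, hidden: int = 256, n_layers: int = 4):
        super().__init__()
        layers = []
        in_ch = n_features
        for _ in range(n_layers):
            layers += [
                nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
            ]
            in_ch = hidden
        self.net = nn.Sequential(*layers)

    def forward(self, template: torch.Tensor) -> torch.Tensor:
        # template: (batch, frames, n_features) -> (batch, frames, hidden)
        return self.net(template.transpose(1, 2)).transpose(1, 2)


if __name__ == "__main__":
    # e.g. pitch, intensity, harmonicity, F1-F3, cepstral peak prominence, octave
    template = torch.randn(2, 300, 8)
    cond = ConditioningEncoder()(template)
    print(cond.shape)  # torch.Size([2, 300, 256])
```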