11,122 research outputs found

    λ”₯λŸ¬λ‹μ„ ν™œμš©ν•œ μŠ€νƒ€μΌ μ μ‘ν˜• μŒμ„± ν•©μ„± 기법

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. Nam Soo Kim.

Neural network-based speech synthesis techniques have developed over the years. Although neural speech synthesis achieves remarkable generated-speech quality, problems remain: limited modeling power in neural statistical parametric speech synthesis systems, limited style expressiveness, and the lack of a robust attention model in end-to-end speech synthesis systems. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems. In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme when training AdVRNN to overcome the oversmoothing problem. The experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques. In the second approach, we propose a novel style modeling method employing the mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase target-style information and suppress text information in the style embedding by adding a MINE term to the loss function. The experimental results show that the MINE-based method achieves promising performance in both speech quality and style similarity for the global style token (GST) Tacotron. In the third approach, we propose a novel attention method for end-to-end speech synthesis called memory attention, inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of LSTM's gating technique, memory attention obtains stable alignment from content-based and location-based features. We evaluate memory attention and compare its performance with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can robustly generate speech with large variability. In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. A conventional single-attention model may be limited in expressing the numerous alignment paths that depend on style. To achieve variation in attention alignment, we propose a multi-attention model with a selection network: the multi-attention generates alignment candidates for the target style, and the selection network chooses the most appropriate one. The experimental results show that selective multi-attention outperforms conventional single-attention techniques in multi-speaker speech synthesis and emotional speech synthesis.

Deep learning-based speech synthesis technology has developed rapidly over the past few years. Although speech synthesis quality has advanced dramatically through various deep learning techniques, several problems still remain in deep learning-based speech synthesis. Deep learning-based statistical parametric methods are limited in modeling power because they rely on deterministic acoustic models, and for end-to-end models, issues of style expressiveness and robust attention are constantly raised.
λ³Έ λ…Όλ¬Έμ—μ„œλŠ” μ΄λŸ¬ν•œ 기쑴의 λ”₯λŸ¬λ‹ 기반 μŒμ„± ν•©μ„± μ‹œμŠ€ν…œμ˜ 단점을 ν•΄κ²°ν•  μƒˆλ‘œμš΄ λŒ€μ•ˆμ„ μ œμ•ˆν•œλ‹€. 첫 번째 μ ‘κ·Όλ²•μœΌλ‘œμ„œ, λ‰΄λŸ΄ 톡계적 νŒŒλΌλ―Έν„° λ°©μ‹μ˜ 음ν–₯ λͺ¨λΈλ§μ„ κ³ λ„ν™”ν•˜κΈ° μœ„ν•œ adversarially trained variational recurrent neural network (AdVRNN) 기법을 μ œμ•ˆν•œλ‹€. AdVRNN 기법은 VRNN을 μŒμ„± 합성에 μ μš©ν•˜μ—¬ μŒμ„±μ˜ λ³€ν™”λ₯Ό stochastic ν•˜κ³  μžμ„Έν•˜κ²Œ λͺ¨λΈλ§ν•  수 μžˆλ„λ‘ ν•˜μ˜€λ‹€. λ˜ν•œ, μ λŒ€μ  ν•™μŠ΅μ (adversarial learning) 기법을 ν™œμš©ν•˜μ—¬ oversmoothing 문제λ₯Ό μ΅œμ†Œν™” μ‹œν‚€λ„λ‘ ν•˜μ˜€λ‹€. μ΄λŸ¬ν•œ μ œμ•ˆλœ μ•Œκ³ λ¦¬μ¦˜μ€ 기쑴의 μˆœν™˜ 신경망 기반의 음ν–₯ λͺ¨λΈκ³Ό λΉ„κ΅ν•˜μ—¬ μ„±λŠ₯이 ν–₯상됨을 ν™•μΈν•˜μ˜€λ‹€. 두 번째 μ ‘κ·Όλ²•μœΌλ‘œμ„œ, μŠ€νƒ€μΌ μ μ‘ν˜• μ’…λ‹¨ν˜• μŒμ„± ν•©μ„± 기법을 μœ„ν•œ μƒν˜Έ μ •λ³΄λŸ‰ 기반의 μƒˆλ‘œμš΄ ν•™μŠ΅ 기법을 μ œμ•ˆν•œλ‹€. 기쑴의 global style token(GST) 기반의 μŠ€νƒ€μΌ μŒμ„± ν•©μ„± κΈ°λ²•μ˜ 경우, 비지도 ν•™μŠ΅μ„ μ‚¬μš©ν•˜λ―€λ‘œ μ›ν•˜λŠ” λͺ©ν‘œ μŠ€νƒ€μΌμ΄ μžˆμ–΄λ„ 이λ₯Ό μ€‘μ μ μœΌλ‘œ ν•™μŠ΅μ‹œν‚€κΈ° μ–΄λ €μ› λ‹€. 이λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ GST의 좜λ ₯κ³Ό λͺ©ν‘œ μŠ€νƒ€μΌ μž„λ² λ”© λ²‘ν„°μ˜ μƒν˜Έ μ •λ³΄λŸ‰μ„ μ΅œλŒ€ν™” ν•˜λ„λ‘ ν•™μŠ΅ μ‹œν‚€λŠ” 기법을 μ œμ•ˆν•˜μ˜€λ‹€. μƒν˜Έ μ •λ³΄λŸ‰μ„ μ’…λ‹¨ν˜• λͺ¨λΈμ˜ μ†μ‹€ν•¨μˆ˜μ— μ μš©ν•˜κΈ° μœ„ν•΄μ„œ mutual information neural estimator(MINE) 기법을 λ„μž…ν•˜μ˜€κ³  λ‹€ν™”μž λͺ¨λΈμ„ 톡해 기쑴의 GST 기법에 λΉ„ν•΄ λͺ©ν‘œ μŠ€νƒ€μΌμ„ 보닀 μ€‘μ μ μœΌλ‘œ ν•™μŠ΅μ‹œν‚¬ 수 μžˆμŒμ„ ν™•μΈν•˜μ˜€λ‹€. μ„Έλ²ˆμ§Έ μ ‘κ·Όλ²•μœΌλ‘œμ„œ, κ°•μΈν•œ μ’…λ‹¨ν˜• μŒμ„± ν•©μ„±μ˜ μ–΄ν…μ…˜μΈ memory attention을 μ œμ•ˆν•œλ‹€. Long-short term memory(LSTM)의 gating κΈ°μˆ μ€ sequenceλ₯Ό λͺ¨λΈλ§ν•˜λŠ”데 높은 μ„±λŠ₯을 보여왔닀. μ΄λŸ¬ν•œ κΈ°μˆ μ„ μ–΄ν…μ…˜μ— μ μš©ν•˜μ—¬ λ‹€μ–‘ν•œ μŠ€νƒ€μΌμ„ 가진 μŒμ„±μ—μ„œλ„ μ–΄ν…μ…˜μ˜ λŠκΉ€, 반볡 등을 μ΅œμ†Œν™”ν•  수 μžˆλŠ” 기법을 μ œμ•ˆν•œλ‹€. 단일 ν™”μžμ™€ 감정 μŒμ„± ν•©μ„± 기법을 ν† λŒ€λ‘œ memory attention의 μ„±λŠ₯을 ν™•μΈν•˜μ˜€μœΌλ©° κΈ°μ‘΄ 기법 λŒ€λΉ„ 보닀 μ•ˆμ •μ μΈ μ–΄ν…μ…˜ 곑선을 얻을 수 μžˆμŒμ„ ν™•μΈν•˜μ˜€λ‹€. λ§ˆμ§€λ§‰ μ ‘κ·Όλ²•μœΌλ‘œμ„œ, selective multi-attention (SMA)을 ν™œμš©ν•œ μŠ€νƒ€μΌ μ μ‘ν˜• μ’…λ‹¨ν˜• μŒμ„± ν•©μ„± μ–΄ν…μ…˜ 기법을 μ œμ•ˆν•œλ‹€. 기쑴의 μŠ€νƒ€μΌ μ μ‘ν˜• μ’…λ‹¨ν˜• μŒμ„± ν•©μ„±μ˜ μ—°κ΅¬μ—μ„œλŠ” 낭독체 λ‹¨μΌν™”μžμ˜ κ²½μš°μ™€ 같은 단일 μ–΄ν…μ…˜μ„ μ‚¬μš©ν•˜μ—¬ μ™”λ‹€. ν•˜μ§€λ§Œ μŠ€νƒ€μΌ μŒμ„±μ˜ 경우 보닀 λ‹€μ–‘ν•œ μ–΄ν…μ…˜ ν‘œν˜„μ„ μš”κ΅¬ν•œλ‹€. 이λ₯Ό μœ„ν•΄ 닀쀑 μ–΄ν…μ…˜μ„ ν™œμš©ν•˜μ—¬ 후보듀을 μƒμ„±ν•˜κ³  이λ₯Ό 선택 λ„€νŠΈμ›Œν¬λ₯Ό ν™œμš©ν•˜μ—¬ 졜적의 μ–΄ν…μ…˜μ„ μ„ νƒν•˜λŠ” 기법을 μ œμ•ˆν•œλ‹€. 
Comparative experiments against conventional attention techniques confirmed that SMA can stably express a wider range of styles.

Contents:
1 Introduction
  1.1 Background
  1.2 Scope of thesis
2 Neural Speech Synthesis System
  2.1 Overview of a Neural Statistical Parametric Speech Synthesis System
  2.2 Overview of End-to-end Speech Synthesis System
  2.3 Tacotron2
  2.4 Attention Mechanism
    2.4.1 Location Sensitive Attention
    2.4.2 Forward Attention
    2.4.3 Dynamic Convolution Attention
3 Neural Statistical Parametric Speech Synthesis using AdVRNN
  3.1 Introduction
  3.2 Background
    3.2.1 Variational Autoencoder
    3.2.2 Variational Recurrent Neural Network
  3.3 Speech Synthesis Using AdVRNN
    3.3.1 AdVRNN based Acoustic Modeling
    3.3.2 Training Procedure
  3.4 Experiments
    3.4.1 Objective performance evaluation
    3.4.2 Subjective performance evaluation
  3.5 Summary
4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis
  4.1 Introduction
  4.2 Background
    4.2.1 Mutual Information
    4.2.2 Mutual Information Neural Estimator
    4.2.3 Global Style Token
  4.3 Style Token End-to-end Speech Synthesis using MINE
  4.4 Experiments
  4.5 Summary
5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis
  5.1 Introduction
  5.2 Background
  5.3 Memory Attention
  5.4 Experiments
    5.4.1 Experiments on Single Speaker Speech Synthesis
    5.4.2 Experiments on Emotional Speech Synthesis
  5.5 Summary
6 Selective Multi-attention for Style-adaptive End-to-End Speech Synthesis
  6.1 Introduction
  6.2 Background
  6.3 Selective multi-attention model
  6.4 Experiments
    6.4.1 Multi-speaker speech synthesis experiments
    6.4.2 Experiments on Emotional Speech Synthesis
  6.5 Summary
7 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgements
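To make the mutual-information idea in the second approach concrete, here is a minimal sketch of how a Donsker-Varadhan lower bound (the quantity MINE estimates) could be wired into a synthesizer's loss. This is an illustrative PyTorch sketch under assumed shapes and names (style, embed, lambda_mi), not the thesis's actual implementation.

```python
import math
import torch
import torch.nn as nn

class MINE(nn.Module):
    """Statistics network T(s, e) for the Donsker-Varadhan bound:
    I(S; E) >= E_joint[T] - log E_marginal[exp(T)]."""

    def __init__(self, style_dim, embed_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def mi_lower_bound(self, style, embed):
        # Joint samples: matched (GST output, target-style embedding) pairs.
        joint = self.net(torch.cat([style, embed], dim=-1))
        # Marginal samples: embeddings shuffled within the batch.
        shuffled = embed[torch.randperm(embed.size(0))]
        marginal = self.net(torch.cat([style, shuffled], dim=-1))
        # log E[exp(T)] computed stably as logsumexp - log(N).
        return joint.mean() - (
            torch.logsumexp(marginal, dim=0) - math.log(marginal.size(0))
        ).squeeze()

# Hypothetical total loss: reconstruct speech while *maximizing* the mutual
# information between the GST output and the target-style embedding.
# loss = reconstruction_loss - lambda_mi * mine.mi_lower_bound(gst_out, style_emb)
```

Subtracting the bound from the loss means gradient descent pushes the style embedding to carry more target-style information, which matches the role the abstract describes for the MINE term.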

    Affective social anthropomorphic intelligent system

    Full text link
    Human conversational styles are characterized by sense of humor, personality, and tone of voice. These characteristics have become essential for conversational intelligent virtual assistants. However, most state-of-the-art intelligent virtual assistants (IVAs) fail to interpret the affective semantics of human voices. This research proposes an anthropomorphic intelligent system that can hold a proper human-like conversation with emotion and personality. A voice style transfer method is also proposed to map the attributes of a specific emotion. Initially, frequency-domain data (a Mel-spectrogram) is created by converting the temporal audio wave data, which comprises discrete patterns for audio features such as notes, pitch, rhythm, and melody. A collateral CNN-Transformer-Encoder is used to predict seven different affective states from the voice. The voice is also fed in parallel to DeepSpeech, an RNN model that generates the text transcription from the spectrogram. The transcribed text is then passed to a multi-domain conversation agent using blended skill talk, a transformer-based retrieve-and-generate strategy, and beam-search decoding, and an appropriate textual response is generated. The system learns an invertible mapping of data to a latent space that can be manipulated, and generates each Mel-spectrogram frame based on previous Mel-spectrogram frames for voice synthesis and style transfer. Finally, the waveform is generated from the spectrogram using WaveGlow. The outcomes of the studies we conducted on the individual models were promising. Furthermore, users who interacted with the system provided positive feedback, demonstrating the system's effectiveness.
    Comment: Multimedia Tools and Applications (2023)
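For the first stage described above, a waveform-to-log-Mel-spectrogram conversion could look like the sketch below (using librosa). The sampling rate, FFT size, hop length, and mel-band count are illustrative assumptions, not values from the paper.

```python
import numpy as np
import librosa

def wav_to_log_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Convert temporal audio wave data into a log-Mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)  # waveform, resampled to sr
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log compression; clip to avoid log(0) on silent frames.
    return np.log(np.clip(mel, 1e-5, None))  # shape: (n_mels, frames)
```

A representation like this would then feed both the affect classifier and the transcription model in parallel, as the abstract outlines.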

    쑰건뢀 μžκΈ°νšŒκ·€ν˜• 인곡신경망을 μ΄μš©ν•œ μ œμ–΄ κ°€λŠ₯ν•œ κ°€μ°½ μŒμ„± ν•©μ„±

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Intelligence and Information, 2022. 8. Kyogu Lee.

Singing voice synthesis aims at synthesizing a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps creators reflect their intentions more easily and conveniently. However, there are three challenging problems in designing such a system: 1) it should be possible to independently control the various elements that make up the singing; 2) it must be possible to generate high-quality audio; and 3) it is difficult to secure sufficient training data. To deal with these problems, we first paid attention to source-filter theory, a representative model of speech production. By modeling a singing voice as the convolution of a source, which carries the pitch information, and a filter, which carries the pronunciation information, and by designing a structure that can model each independently, we aimed to secure training-data efficiency and controllability at the same time. In addition, we used a conditional autoregressive deep neural network to effectively model sequential data in a situation where conditional inputs such as pronunciation, pitch, and speaker are given. So that the entire framework generates high-quality audio whose distribution is closer to that of a real singing voice, an adversarial training technique was applied during training. Finally, we applied a self-supervised style modeling technique to model detailed, unlabeled musical expressions. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and musical expression, while synthesizing high-quality singing that is difficult to distinguish from ground-truth singing. Furthermore, we proposed a generation and modification framework that considers how the system would be applied in the actual music production process, and confirmed that it can be used to expand the limits of a creator's imagination through applications such as new voice design and cross-generation.

Singing voice synthesis aims to synthesize a natural singing voice from a given input score. A singing synthesis system not only greatly reduces music production costs but also helps reflect the creator's intentions more easily and conveniently. However, designing such a system poses three challenging requirements: 1) the various elements that make up singing must be independently controllable; 2) a high level of quality and usability must be achieved; and 3) sufficient training data is difficult to secure. To address these problems, we turned to source-filter theory, a representative model of speech production. We define the singing signal as the convolution of a source corresponding to pitch information and a filter corresponding to pronunciation information, and design a structure that models each independently, securing training-data efficiency and controllability at the same time. We also employ a conditional autoregressive deep neural network to effectively model sequential data given conditional inputs such as pronunciation, pitch, and speaker. Finally, we propose a self-supervised style modeling technique to model unlabeled musical expressions.
We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and expression while synthesizing high-quality singing that is difficult to distinguish from real singing. Furthermore, we proposed a generation and modification framework that considers the actual music production process, and confirmed that applications that broaden the limits of a creator's imagination, such as new voice design and cross-generation, are possible.

Contents:
1 Introduction
  1.1 Motivation
  1.2 Problems in singing voice synthesis
  1.3 Task of interest
    1.3.1 Single-singer SVS
    1.3.2 Multi-singer SVS
    1.3.3 Expressive SVS
  1.4 Contribution
2 Background
  2.1 Singing voice
  2.2 Source-filter theory
  2.3 Autoregressive model
  2.4 Related works
    2.4.1 Speech synthesis
    2.4.2 Singing voice synthesis
3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System
  3.1 Introduction
  3.2 Related work
  3.3 Proposed method
    3.3.1 Input representation
    3.3.2 Mel-synthesis network
    3.3.3 Super-resolution network
  3.4 Experiments
    3.4.1 Dataset
    3.4.2 Training
    3.4.3 Evaluation
    3.4.4 Analysis on generated spectrogram
  3.5 Discussion
    3.5.1 Limitations of input representation
    3.5.2 Advantages of using super-resolution network
  3.6 Conclusion
4 Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System
  4.1 Introduction
  4.2 Related works
    4.2.1 Multi-singer SVS system
  4.3 Proposed method
    4.3.1 Singer identity encoder
    4.3.2 Disentangling timbre & singing style
  4.4 Experiment
    4.4.1 Dataset and preprocessing
    4.4.2 Training & inference
    4.4.3 Analysis on generated spectrogram
    4.4.4 Listening test
    4.4.5 Timbre & style classification test
  4.5 Discussion
    4.5.1 Query audio selection strategy for singer identity encoder
    4.5.2 Few-shot adaptation
  4.6 Conclusion
5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder
  5.1 Introduction
  5.2 Related work
  5.3 Proposed method
    5.3.1 Local style token module
    5.3.2 Dual-path pitch encoder
    5.3.3 Bandwidth extension vocoder
  5.4 Experiment
    5.4.1 Dataset
    5.4.2 Training
    5.4.3 Qualitative evaluation
    5.4.4 Dual-path reconstruction analysis
    5.4.5 Qualitative analysis
  5.5 Discussion
    5.5.1 Difference between midi pitch and f0
    5.5.2 Considerations for use in the actual music production process
  5.6 Conclusion
6 Conclusion
  6.1 Thesis summary
  6.2 Limitations and future work
    6.2.1 Improvements to a faster and robust system
    6.2.2 Explainable and intuitive controllability
    6.2.3 Extensions to common speech synthesis tools
    6.2.4 Towards a collaborative and creative tool
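As a toy illustration of the source-filter decomposition described above: a pulse-train source carrying the pitch, convolved with a filter impulse response carrying the pronunciation. This classical sketch is only for intuition; the thesis models both components with neural networks rather than this closed form.

```python
import numpy as np

def toy_source_filter(f0, filter_ir, sr=22050, length=2048):
    """Synthesize one toy frame as source (pitch) convolved with filter
    (pronunciation/spectral envelope)."""
    # Source: impulse train whose period encodes the fundamental frequency.
    source = np.zeros(length)
    period = max(1, int(round(sr / f0)))
    source[::period] = 1.0
    # Filter: convolution imposes the pronunciation's spectral envelope.
    return np.convolve(source, filter_ir)[:length]

# Example: a 220 Hz pitch through a crude moving-average "vocal tract".
frame = toy_source_filter(220.0, filter_ir=np.ones(32) / 32)
```

Because pitch lives entirely in the source and pronunciation in the filter, each can be swapped or controlled independently, which is the controllability argument the abstract makes.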

    Generative models for music using transformer architectures

    Get PDF
    This thesis focuses on the growth and impact of Transformer architectures, which are mainly used for natural language processing tasks, applied to audio generation. We regard music, with its notes, chords, and volumes, as a language; its symbolic representation can be thought of as human language. A brief history of sound synthesis, which provides the foundation for modern AI-generated music models, is given. The most recent work in AI-generated audio is carefully studied, and instances of AI-generated music are discussed in many contexts. Deep learning models and their applications to real-world problems are among the key subjects covered. The main areas of interest include transformer-based audio generation, including the training procedure, encoding and decoding techniques, and post-processing stages. Transformers have several key advantages, including long-term consistency and the ability to create minute-long audio compositions. Numerous studies on the various representations of music are reviewed, including how neural-network and deep-learning techniques can be applied to symbolic melodies, musical arrangements, style transfer, and sound production. This thesis largely focuses on transformer models, but it also recognises the importance of other AI-based generative models, including GANs. Overall, this thesis advances generative models for music composition and provides a complete understanding of transformer design. It shows the possibilities of AI-generated sound synthesis by emphasising the most recent developments.
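As an illustration of the decoder-only transformer generation the thesis surveys, a minimal language model over a symbolic event vocabulary (notes, chords, volumes) might look as follows. The vocabulary size, layer sizes, and omitted positional encoding are simplifying assumptions for this sketch.

```python
import torch
import torch.nn as nn

class MusicTransformerLM(nn.Module):
    """Minimal decoder-only model over symbolic music tokens
    (positional encodings omitted for brevity)."""

    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each event attends only to earlier events, which is
        # what gives autoregressive generation its long-term consistency.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)  # next-event logits

model = MusicTransformerLM(vocab_size=512)
logits = model(torch.randint(0, 512, (1, 64)))  # (batch, seq, vocab)
```

Sampling from the next-event logits step by step yields a token sequence that a decoder maps back to notes, which is the symbolic-generation pipeline the abstract describes.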

    Controllable Accented Text-to-Speech Synthesis

    Full text link
    Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Accented TTS synthesis is challenging because L2 differs from L1 in terms of both phonetic rendering and prosody pattern. Furthermore, there is no easy solution to controlling the accent intensity of an utterance. In this work, we propose a neural TTS architecture that allows us to control the accent and its intensity during inference. This is achieved through three novel mechanisms: 1) an accent variance adaptor to model the complex accent variance with three prosody-controlling factors, namely pitch, energy, and duration; 2) an accent intensity modeling strategy to quantify the accent intensity; and 3) a consistency constraint module to encourage the TTS system to render the expected accent intensity at a fine level. Experiments show that the proposed system attains superior performance to the baseline models in terms of accent rendering and intensity control. To the best of our knowledge, this is the first study of accented TTS synthesis with explicit intensity control.
    Comment: To be submitted for possible journal publication
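One plausible reading of the intensity control, sketched below: once the accent variance adaptor predicts pitch, energy, and duration for the standard (L1) and accented (L2) renderings, a scalar intensity can interpolate between them. The function and its names are hypothetical; the paper's actual quantification strategy and consistency constraint are more involved.

```python
import torch

def blend_prosody(l1, l2, intensity):
    """Interpolate prosody factors between the standard (L1) and accented
    (L2) predictions; intensity in [0, 1], where 0 keeps the standard
    rendering and 1 renders the full accent."""
    return {key: (1.0 - intensity) * l1[key] + intensity * l2[key]
            for key in ("pitch", "energy", "duration")}

# Example with per-phoneme tensors (hypothetical shapes).
l1 = {k: torch.zeros(10) for k in ("pitch", "energy", "duration")}
l2 = {k: torch.ones(10) for k in ("pitch", "energy", "duration")}
half_accented = blend_prosody(l1, l2, 0.5)
```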