
    The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach

    As part of the field of Human-Computer Interaction, expressive speech synthesis is a rich domain, as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader is given an overview of the main paradigms used in this field through some of its most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of text-to-speech synthesis: concatenative, parametric, and statistical parametric speech synthesis. Finally, we focus on the last of these, where the latest techniques model text-to-speech synthesis as a sequence-to-sequence problem. This enables the use of deep learning blocks such as convolutional and recurrent neural networks, as well as attention mechanisms. The last part of the chapter assembles the different aspects of the theory and summarizes the concepts.
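    As a concrete illustration of the sequence-to-sequence framing, the sketch below shows one decoder step with additive attention, PyTorch-style. The module layout, names, and dimensions are illustrative assumptions, not the chapter's own implementation.

```python
# Minimal sketch (illustrative, not from the chapter): one autoregressive
# decoder step of a seq2seq TTS model with additive (Bahdanau) attention.
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=512, attn_dim=128, n_mels=80):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, attn_dim)    # decoder state -> query
        self.key_proj = nn.Linear(enc_dim, attn_dim)      # encoder outputs -> keys
        self.score = nn.Linear(attn_dim, 1)               # additive attention energy
        self.rnn = nn.GRUCell(enc_dim + n_mels, dec_dim)  # autoregressive decoder
        self.mel_out = nn.Linear(dec_dim, n_mels)         # project to one mel frame

    def forward(self, prev_mel, dec_state, enc_outputs):
        # enc_outputs: (B, T_text, enc_dim) from the text encoder
        q = self.query_proj(dec_state).unsqueeze(1)            # (B, 1, attn_dim)
        k = self.key_proj(enc_outputs)                         # (B, T, attn_dim)
        energies = self.score(torch.tanh(q + k)).squeeze(-1)   # (B, T)
        weights = torch.softmax(energies, dim=-1)              # alignment over text
        context = (weights.unsqueeze(-1) * enc_outputs).sum(1) # (B, enc_dim)
        dec_state = self.rnn(torch.cat([context, prev_mel], -1), dec_state)
        return self.mel_out(dec_state), dec_state, weights
```

    Calling this step frame by frame, feeding each predicted mel frame back in, is the core loop that concatenative and parametric systems did not have: alignment between text and audio is learned by the attention weights rather than imposed.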

    NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

    Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important for capturing the diversity in human speech, such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffers from unstable prosody, word skipping/repeating issues, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important for achieving diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://speechresearch.github.io/naturalspeech2.
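    For readers unfamiliar with the codec component, here is a minimal sketch of residual vector quantization in NumPy: each stage quantizes the residual left by the previous stage, and the sum of the chosen codewords approximates the continuous latent. The codebooks below are random placeholders rather than trained ones (so the residual will not shrink the way it would in a trained codec), and the shapes and sizes are assumptions, not values from the paper.

```python
# Illustrative sketch of residual vector quantization (RVQ), the codec
# mechanism NaturalSpeech 2 builds on; not the paper's implementation.
import numpy as np

def rvq_encode(latent, codebooks):
    """Quantize a latent vector (D,) with a stack of codebooks (each K x D)."""
    residual = latent.copy()
    indices = []
    quantized = np.zeros_like(latent)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # distance to each codeword
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized += cb[idx]   # running sum of codewords approximates the latent
        residual -= cb[idx]    # the next stage refines what is left over
    return indices, quantized

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 256)) for _ in range(8)]  # 8 quantizer stages
z = rng.normal(size=256)               # one frame of continuous codec latent
ids, z_q = rvq_encode(z, codebooks)
print(ids, float(np.linalg.norm(z - z_q)))
```

    The paper's key move is then to generate such latents with a diffusion model conditioned on text, rather than generating the discrete token indices autoregressively with a language model.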

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advances in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches such as MFCCs and HMMs to more recent advances in deep learning architectures such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover the speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
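    As a pointer back to the classical front end the review starts from, here is a short, self-contained MFCC extraction sketch using librosa on a synthetic tone; the parameter values are conventional choices, not ones prescribed by the paper.

```python
# Classical speech front end: MFCC features, extracted with librosa.
# A synthetic tone keeps the snippet self-contained (no audio file needed).
import librosa

sr = 16000
y = librosa.tone(220.0, sr=sr, duration=1.0)  # 1 s test signal

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,       # 13 cepstral coefficients, the classic choice
    n_fft=400,       # 25 ms analysis window at 16 kHz
    hop_length=160,  # 10 ms frame shift
)
print(mfcc.shape)    # (13, n_frames): one coefficient vector per frame
# An HMM-era recognizer would model sequences of these vectors; deep models
# (CNNs/RNNs/transformers) either consume them or learn features directly
# from the waveform.
```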

    AI-generated Content for Various Data Modalities: A Survey

    AI-generated content (AIGC) methods aim to produce text, images, videos, 3D assets, and other media using AI algorithms. Due to its wide range of applications and the demonstrated potential of recent works, AIGC has been attracting significant attention, and AIGC methods have been developed for various data modalities, such as image, video, text, 3D shape (as voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human avatar (body and head), 3D motion, and audio, each presenting different characteristics and challenges. Furthermore, there have been many significant developments in cross-modality AIGC methods, where generative methods receive conditioning input in one modality and produce outputs in another. Examples include going from various modalities to image, video, 3D shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar), and audio modalities. In this paper, we provide a comprehensive review of AIGC methods across different data modalities, including both single-modality and cross-modality methods, highlighting the various challenges, representative works, and recent technical directions in each setting. We also survey the representative datasets across the modalities and present comparative results for various modalities. Finally, we discuss the challenges and potential future research directions.

    μŒμ•…μ  μš”μ†Œμ— λŒ€ν•œ 쑰건뢀 μƒμ„±μ˜ κ°œμ„ μ— κ΄€ν•œ 연ꡬ: ν™”μŒκ³Ό ν‘œν˜„μ„ μ€‘μ‹¬μœΌλ‘œ

    Thesis (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Digital Information Convergence), February 2023. Advisor: Kyogu Lee. Conditional generation of musical components (CGMC) creates part of a piece of music from partial musical components such as a melody or chords. CGMC is beneficial for discovering complex relationships among musical attributes, and it can assist non-experts who face difficulties in making music. However, recent CGMC studies still face two challenges concerning generation quality and model controllability. First, the structure of the generated music is not robust. Second, only limited ranges of musical factors and tasks have been examined as targets for flexible control of generation. In this thesis, we aim to mitigate these two challenges to improve CGMC systems. For musical structure, we focus on intuitive modeling of musical hierarchy to help the model explicitly learn musically meaningful dependencies. To this end, we utilize alignment paths between the raw music data and musical units such as notes or chords. For musical creativity, we facilitate smooth control of novel musical attributes using latent representations, and we attempt to achieve disentangled representations of the intended factors by regularizing them with a data-driven inductive bias. This thesis verifies the proposed approaches on two representative CGMC tasks: melody harmonization and expressive performance rendering. A variety of experimental results show that the proposed approaches can expand musical creativity while maintaining stable generation quality.
    Outline: Chapter 1 Introduction; Chapter 2 Background; Chapter 3 Translating Melody to Chord: Structured and Flexible Harmonization of Melody with Transformer; Chapter 4 Sketching the Expression: Flexible Rendering of Expressive Piano Performance with Self-supervised Learning; Chapter 5 Conclusion and Future Work.
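    To make the latent-regularization idea concrete, here is a minimal sketch, assuming a PyTorch VAE-style model (not the thesis code): one latent dimension is regularized so that its ordering matches a chosen musical attribute (e.g., chord complexity), a common form of data-driven inductive bias for disentanglement. All names and loss weights are illustrative.

```python
# Sketch of an attribute-regularized VAE objective; hypothetical names,
# not the thesis implementation.
import torch
import torch.nn.functional as F

def attribute_regularization(z, attr, dim=0):
    """Push latent dimension `dim` to preserve the ordering of `attr`.

    Compares pairwise differences in the latent dimension against the
    signs of pairwise differences in the attribute values.
    """
    z_d = z[:, dim].unsqueeze(0) - z[:, dim].unsqueeze(1)  # (B, B) latent diffs
    a_d = attr.unsqueeze(0) - attr.unsqueeze(1)            # (B, B) attribute diffs
    return F.l1_loss(torch.tanh(z_d), torch.sign(a_d))

def regularized_vae_loss(recon_logits, target, mu, logvar, z, attr,
                         beta=1.0, gamma=1.0):
    # Reconstruction of the musical surface (e.g., chord tokens).
    recon = F.cross_entropy(recon_logits, target)
    # Standard VAE KL term against a unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Attribute term ties one latent dimension to the controlled factor,
    # so sliding that dimension at inference time adjusts the attribute.
    return recon + beta * kl + gamma * attribute_regularization(z, attr)
```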