27 research outputs found

    You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

    Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. We argue that, when the amount of training data is relatively small, this approach can allow an end-to-end model to reach the quality of hybrid systems. For an artificial low-to-medium-resource setup, we compare the proposed augmentation with a semi-supervised learning technique. We also investigate the influence of the vocoder on final ASR performance by comparing the Griffin-Lim algorithm with our modified LPCNet. When combined with an external language model, our approach outperforms a semi-supervised setup on LibriSpeech test-clean and is only 33% worse than a comparable supervised setup. Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
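    A minimal sketch of how synthesized utterances might be pooled with the real corpus for ASR training, assuming a hypothetical `tts_model.synthesize` interface and a JSON-lines manifest format; this is an illustration of the general augmentation idea, not the authors' implementation.

```python
# Sketch: extend an ASR training manifest with TTS-generated utterances.
# `tts_model.synthesize` and the manifest layout are assumptions for illustration.
import json
import soundfile as sf  # pip install soundfile

def augment_with_tts(real_manifest, extra_texts, tts_model, out_dir, out_manifest):
    """Write synthetic audio for `extra_texts` and merge it with the real
    training manifest into a combined manifest file."""
    entries = [json.loads(line) for line in open(real_manifest)]
    for i, text in enumerate(extra_texts):
        wav, sr = tts_model.synthesize(text)      # assumed to return (waveform, sample_rate)
        path = f"{out_dir}/tts_{i:06d}.wav"
        sf.write(path, wav, sr)                   # store the synthetic utterance
        entries.append({"audio_filepath": path, "text": text, "synthetic": True})
    with open(out_manifest, "w") as f:
        for e in entries:
            f.write(json.dumps(e) + "\n")
```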

    λ”₯λŸ¬λ‹μ„ ν™œμš©ν•œ μŠ€νƒ€μΌ μ μ‘ν˜• μŒμ„± ν•©μ„± 기법

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Nam Soo Kim.
    Neural network-based speech synthesis techniques have developed considerably over the years. Although neural speech synthesis yields speech of remarkable quality, problems remain: limited modeling power in neural statistical parametric speech synthesis systems, and limited style expressiveness and the lack of a robust attention model in end-to-end speech synthesis systems. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems. In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme when training AdVRNN to overcome the oversmoothing problem. Experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques. In the second approach, we propose a novel style modeling method employing a mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase target-style information and suppress text information in the style embedding by adding a MINE loss term to the loss function. The experimental results show that the MINE-based method achieves promising performance in both speech quality and style similarity for the global style token (GST)-Tacotron. In the third approach, we propose a novel attention method for end-to-end speech synthesis called memory attention, inspired by the gating mechanism of the long short-term memory (LSTM). Leveraging the sequence-modeling power of the LSTM gating technique, memory attention obtains stable alignment from content-based and location-based features. We evaluate memory attention and compare its performance with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can robustly generate speech with large variability. In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. A conventional single attention model may limit the expressivity needed to represent the numerous alignment paths that arise depending on style. To achieve variation in attention alignment, we propose a multi-attention model with a selection network. The multi-attention generates candidates for the target style, and the selection network chooses the most suitable attention among them. The experimental results show that selective multi-attention outperforms conventional single-attention techniques in multi-speaker and emotional speech synthesis.
    Deep learning-based speech synthesis technology has been actively developed over the past several years. Although synthesis quality has improved dramatically through various deep learning techniques, several problems still remain. Deep learning-based statistical parametric methods rely on a deterministic acoustic model, which limits their modeling power, and for end-to-end models, issues with style expressiveness and robust attention are continually raised. This thesis proposes new alternatives to resolve these drawbacks of existing deep learning-based speech synthesis systems. As a first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN) to strengthen acoustic modeling in neural statistical parametric synthesis. AdVRNN applies a VRNN to speech synthesis so that the variation of speech can be modeled stochastically and in detail, and adversarial learning is used to minimize the oversmoothing problem; the proposed algorithm was confirmed to outperform conventional recurrent neural network-based acoustic models. As a second approach, we propose a new mutual-information-based training method for style-adaptive end-to-end speech synthesis. Because the conventional global style token (GST)-based approach is trained without supervision, it is difficult to focus training on a desired target style even when one exists. To resolve this, we train the model to maximize the mutual information between the GST output and the target-style embedding vector, introducing the mutual information neural estimator (MINE) to incorporate this term into the loss function of the end-to-end model; multi-speaker experiments confirm that the target style can be learned more intensively than with the conventional GST method. As a third approach, we propose memory attention, a robust attention mechanism for end-to-end speech synthesis. The gating mechanism of the long short-term memory (LSTM) has shown strong sequence-modeling performance; applying it to attention, we minimize attention breaks and repetitions even for speech with diverse styles. Evaluations on single-speaker and emotional speech synthesis confirm that memory attention yields more stable attention trajectories than conventional techniques. As a final approach, we propose an attention scheme for style-adaptive end-to-end speech synthesis using selective multi-attention (SMA). Previous work on style-adaptive end-to-end synthesis has used a single attention, as in read-style single-speaker synthesis, but stylized speech demands more diverse attention patterns. We therefore use multiple attentions to generate candidates and a selection network to choose the best one. Comparative experiments against conventional attentions confirm that SMA can express a wider range of styles stably.
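    As an illustration of the second approach, a minimal sketch of a MINE-style mutual-information term that could be added to a GST-Tacotron loss to tie the style embedding to a target-style representation; the statistics network, tensor shapes, and the weighting factor are assumptions for illustration, not the thesis implementation.

```python
# Sketch: Donsker-Varadhan (MINE) lower bound on I(style_emb; target_emb) in PyTorch.
import torch
import torch.nn as nn

class MineStatistics(nn.Module):
    """Statistics network T(style, target) for the Donsker-Varadhan bound."""
    def __init__(self, style_dim, target_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + target_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, style, target):
        return self.net(torch.cat([style, target], dim=-1))

def mine_lower_bound(stat_net, style, target):
    """I(style; target) >= E[T(s, t)] - log E[exp T(s, t_shuffled)]."""
    joint = stat_net(style, target).mean()
    perm = target[torch.randperm(target.size(0))]          # samples from the product of marginals
    marginal = torch.logsumexp(stat_net(style, perm), dim=0) - torch.log(
        torch.tensor(float(target.size(0))))
    return joint - marginal

# A training objective could then be: tacotron_loss - lambda_mi * mine_lower_bound(...)
# (lambda_mi is a hypothetical weighting hyperparameter).
```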

    DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation

    In the text-to-speech (TTS) task, the latent diffusion model offers excellent fidelity and generalization, but its high resource consumption and slow inference speed have always been a challenge. This paper proposes the Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation (DCTTS). DCTTS makes the following contributions: 1) a TTS diffusion model based on discrete space that significantly lowers the computational cost of the diffusion model and improves sampling speed; 2) a contrastive learning method based on discrete space that strengthens the alignment between speech and text and improves sampling quality; and 3) an efficient text encoder that reduces the model's parameters and increases computational efficiency. The experimental results demonstrate that the proposed approach achieves outstanding speech synthesis quality and sampling speed while significantly reducing the resource consumption of the diffusion model. Synthesized samples are available at https://github.com/lawtherWu/DCTTS.
    Comment: 5 pages, submitted to ICASSP
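    A minimal sketch of an InfoNCE-style contrastive alignment loss between pooled text embeddings and discrete speech-token embeddings, in the spirit of the contrastive learning described above; the pooling, projection, and temperature choices are assumptions, not the released DCTTS code.

```python
# Sketch: symmetric contrastive loss between paired text and speech-token embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, speech_emb, temperature=0.07):
    """text_emb, speech_emb: (batch, dim) pooled embeddings of paired text and
    discrete speech tokens. Matched pairs sit on the diagonal of the similarity
    matrix and are treated as positives."""
    text_emb = F.normalize(text_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: text -> speech and speech -> text.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```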

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
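    For the early feature-extraction approaches mentioned above, a minimal MFCC front-end sketch using librosa; the file path and parameter values are illustrative only.

```python
# Sketch: classical MFCC feature extraction, the front end of pre-deep-learning pipelines.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)        # mono waveform at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # (13, num_frames) coefficient matrix
print(mfcc.shape)
```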