Multi-Resolution Speech Enhancement Using Generative Adversarial Networks for Noisy or Compressed Speech
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021.8. 김남수 (Nam Soo Kim).
Enhancement techniques for noisy speech and speech coding are essential for various speech applications such as robust speech recognition, hearing aids, and mobile communications. The main objective of these techniques is to improve the quality and intelligibility of speech that has been degraded by background noise or by low-rate speech coding. Recently, data modelling based on generative models has shown notable results in the speech processing area. From this perspective, we propose generative-model-based enhancement techniques that take a multi-resolution approach to noisy speech and speech coding.
Generative adversarial networks (GANs) have been successfully applied to speech enhancement. However, there still remain two issues that need to be addressed: (1) GAN-based training is typically unstable due to its non-convex property, and (2) most of the conventional methods do not fully take advantage of the speech characteristics, which could result in a sub-optimal solution. In order to deal with these problems, we propose a progressive generator that can handle speech in a multi-resolution fashion. Additionally, we propose a multi-scale discriminator that discriminates the real and generated speech at various sampling rates to stabilize GAN training. Experimental results showed that the proposed approach could make the training faster and more stable, which improves the performance on various metrics for speech enhancement.
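As a rough illustration of the multi-scale idea above, the sketch below builds several views of one waveform at different sampling rates, one per discriminator. It is only a sketch under stated assumptions: the abstract does not give the sampling rates, the number of scales, or the resampling filter, so the factors `(1, 2, 4)`, the block-averaging filter, and the function names `downsample` and `multi_scale_views` are all hypothetical.

```python
def downsample(signal, factor):
    """Decimate by averaging non-overlapping blocks of `factor` samples.

    Block averaging is a crude low-pass filter chosen for brevity; it is an
    assumption, not the resampling method used in the thesis.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    usable = len(signal) - len(signal) % factor  # drop the incomplete tail block
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]


def multi_scale_views(signal, base_rate, factors=(1, 2, 4)):
    """Return {sampling_rate: waveform} for each (assumed) decimation factor.

    In a multi-scale setup, each view would be scored by its own discriminator.
    """
    return {base_rate // f: downsample(signal, f) for f in factors}


if __name__ == "__main__":
    # Toy waveform: 16 samples standing in for audio at an assumed 16 kHz.
    wave = [float(n % 4) for n in range(16)]
    for rate, view in sorted(multi_scale_views(wave, 16000).items()):
        print(rate, len(view))
```

Each view keeps the same duration but carries a narrower frequency band, so a discriminator on a low-rate view can only judge coarse structure while the full-rate one also sees fine detail.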
Recently, speech synthesis based on generative models has been successfully applied to the speech codec area. Despite their notable improvements in speech quality, conventional neural decoders typically require prior information about the original speech codec, such as its bit allocation or de-quantization methods, so they are not a general solution for various types of codecs. To address this limitation, we propose an imitation neural decoder based on a generative model which can directly reconstruct the speech from the bitstream without any speech codec information. Additionally, we propose a de-quantization network that can find which bits are related and de-quantize the bitstreams to extract a conditional variable which helps the generative model restore the original speech. Through a number of experiments with mixed excitation linear prediction (MELP), advanced multi-band excitation (AMBE), and SPEEX at 2.4 kb/s, it is verified that the proposed method shows better subjective and objective results than the original speech codecs.
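To make the input of such a decoder concrete, the sketch below turns a raw codec bitstream into fixed-size frames of bits, the form in which a de-quantization network could consume it without knowing the bit allocation. It is only an illustration: the frame durations (22.5 ms and 20 ms), the MSB-first bit order, and the helper names are assumptions, not details taken from the abstract.

```python
def bits_per_frame(bitrate_bps, frame_ms):
    """Number of bits a codec emits per frame at a given bitrate."""
    return round(bitrate_bps * frame_ms / 1000)


def bytes_to_bits(data):
    """Unpack bytes into a flat list of 0/1 values, most significant bit first
    (the bit order is an assumption made for this sketch)."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def split_frames(bits, frame_size):
    """Group a flat bit list into frames of `frame_size` bits, dropping any
    incomplete trailing frame."""
    usable = len(bits) - len(bits) % frame_size
    return [bits[i:i + frame_size] for i in range(0, usable, frame_size)]


if __name__ == "__main__":
    # At 2.4 kb/s, an assumed 22.5 ms frame carries 54 bits.
    size = bits_per_frame(2400, 22.5)
    stream = bytes(27)  # placeholder payload: 27 bytes = 216 bits
    frames = split_frames(bytes_to_bits(stream), size)
    print(size, len(frames))
```

A codec-agnostic decoder would receive only these raw bit vectors, with no labels saying which bits encode pitch, gain, or spectral parameters, and would have to learn those groupings from data.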
An integrated model is proposed by applying the progressive approach of Chapter 2 to the neurally optimized decoder proposed in Chapter 3. Because Parallel WaveNet, the generator of Parallel WaveGAN used in Chapter 3, requires a large amount of GPU resources for training, it can only be trained with small batches, which makes training slow. To solve this problem, Parallel WaveNet is transformed into a progressive structure. Experimental results showed that the proposed model achieved better objective results than Parallel WaveNet.
1 Introduction
1.1 Speech Enhancement
1.2 Speech Coding
1.3 Outline of Thesis
2 A Multi-Resolution Approach to GAN-Based Speech Enhancement
2.1 Introduction
2.2 GAN-based Speech Enhancement
2.3 Multi-resolution Approach for Speech Enhancement
2.3.1 Progressive Generator
2.3.2 Multi-scale Discriminator
2.4 Experimental Settings and Results
2.4.1 Dataset
2.4.2 Network Structure
2.4.3 Evaluation Methods
2.4.4 Experiments and Results
2.4.5 Performance of Multi-scale Discriminator
2.4.6 Analysis and Comparison of Spectrograms
2.4.7 Fast and Stable Training of Proposed Model
2.4.8 Comparison with Conventional GAN-based Speech Enhancement Techniques
2.5 Summary
3 Neurally Optimized Decoder for Low Bitrate Speech Codec
3.1 Introduction
3.2 Speech Coding Overview
3.3 Neurally Optimized Decoder
3.4 Experimental Settings and Results
3.4.1 Database of Speech and Codecs
3.4.2 Experimental Setup
3.4.3 Analysis of Training Loss
3.4.4 Objective Test
3.4.5 Subjective Test
3.4.6 Speaker Transparency
3.5 Summary
4 Imitation Neural Decoder Based on Progressive Approach
4.1 Introduction
4.2 Parallel WaveNet
4.3 Progressive WaveNet
4.4 Experiments and Results
4.4.1 Objective Measures
4.4.2 Analysis of Memory Usage and Inference Speed
4.5 Summary
5 Conclusions
Bibliography