
    Multi-Resolution Speech Enhancement Using Generative Adversarial Networks for Noisy or Coded Speech

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021.8. Kim Nam Soo.

    Enhancement techniques for noisy speech and coded speech are essential for speech applications such as robust speech recognition, hearing aids, and mobile communications. Their main objective is to improve the quality and intelligibility of speech degraded by background noise or by low-rate speech coding. Recently, generative model-based data modelling has shown prominent results in speech processing. From this perspective, we propose generative model-based enhancement techniques that take a multi-resolution approach to noisy speech and speech coding.

    Generative adversarial networks (GANs) have been successfully applied to speech enhancement. However, two issues remain: (1) GAN-based training is typically unstable due to its non-convex property, and (2) most conventional methods do not fully exploit the characteristics of speech, which can lead to a sub-optimal solution. To deal with these problems, we propose a progressive generator that handles speech in a multi-resolution fashion. Additionally, we propose a multi-scale discriminator that discriminates between real and generated speech at various sampling rates to stabilize GAN training. Experimental results showed that the proposed approach makes training faster and more stable, which improves performance on various speech enhancement metrics.

    Recently, speech synthesis based on generative models has been successfully applied to speech codecs. Despite notable improvements in speech quality, conventional neural decoders typically require prior information about the original speech codec, such as bit allocation or de-quantization methods, so they are not a general solution for different types of codecs. To address this limitation, we propose an imitation neural decoder based on a generative model that reconstructs speech directly from the bitstream without any speech codec information. Additionally, we propose a de-quantization network that finds which bits are related and de-quantizes the bitstream to extract a conditional variable that helps the generative model restore the original speech. Experiments with mixed excitation linear prediction (MELP), advanced multi-band excitation (AMBE), and SPEEX at 2.4 kb/s verified that the proposed method gives better subjective and objective results than the original speech codecs.

    Finally, we propose an integrated model that applies the progressive approach of Chapter 2 to the neurally optimized decoder of Chapter 3. Parallel WaveNet, the generator of Parallel WaveGAN used in Chapter 3, needs a large amount of GPU memory for training, which forces small batches and makes training slow. To solve this problem, Parallel WaveNet is transformed into a progressive structure. Experimental results showed that the proposed model achieves better objective results than Parallel WaveNet.

    Table of contents

    1 Introduction
        1.1 Speech Enhancement
        1.2 Speech Coding
        1.3 Outline of Thesis
    2 A Multi-Resolution Approach to GAN-Based Speech Enhancement
        2.1 Introduction
        2.2 GAN-based Speech Enhancement
        2.3 Multi-resolution Approach for Speech Enhancement
            2.3.1 Progressive Generator
            2.3.2 Multi-scale Discriminator
        2.4 Experimental Settings and Results
            2.4.1 Dataset
            2.4.2 Network Structure
            2.4.3 Evaluation Methods
            2.4.4 Experiments and Results
            2.4.5 Performance of Multi-scale Discriminator
            2.4.6 Analysis and Comparison of Spectrograms
            2.4.7 Fast and Stable Training of Proposed Model
            2.4.8 Comparison with Conventional GAN-based Speech Enhancement Techniques
        2.5 Summary
    3 Neurally Optimized Decoder for Low Bitrate Speech Codec
        3.1 Introduction
        3.2 Speech Coding Overview
        3.3 Neurally Optimized Decoder
        3.4 Experimental Settings and Results
            3.4.1 Database of Speech and Codecs
            3.4.2 Experimental Setup
            3.4.3 Analysis of Training Loss
            3.4.4 Objective Test
            3.4.5 Subjective Test
            3.4.6 Speaker Transparency
        3.5 Summary
    4 Imitation Neural Decoder Based on Progressive Approach
        4.1 Introduction
        4.2 Parallel WaveNet
        4.3 Progressive WaveNet
        4.4 Experiments and Results
            4.4.1 Objective Measures
            4.4.2 Analysis of Memory Usage and Inference Speed
        4.5 Summary
    5 Conclusions
    Bibliography