8 research outputs found
Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural network (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision, where they have achieved
great success in tasks such as machine translation and image generation.
Owing to this success, these data-driven techniques have been applied to the
audio domain. More specifically, DNN models have been applied to speech
enhancement to achieve denoising, dereverberation and multi-speaker
separation in the monaural setting. In this paper, we review the dominant
DNN techniques employed to achieve speech separation. The review covers the
whole speech enhancement pipeline: feature extraction, how DNN-based tools
model both global and local features of speech, and model training
(supervised and unsupervised). We also review the use of pre-trained models
to boost the speech enhancement process. The review is geared towards
covering the dominant trends in the application of DNNs to the enhancement
of speech captured on a single channel.
Comment: conference
A study on deep learning-based techniques for noise-robust voice activity detection and speech enhancement
Thesis (Ph.D.) -- Graduate School of Seoul National University, Department of Electrical and Computer Engineering, February 2017. Advisor: Nam Soo Kim.
Over the past decades, a number of approaches have been proposed to improve the performance of voice activity detection (VAD) and speech enhancement algorithms, which are crucial for speech communication and speech signal processing systems. In particular, the increasing use of machine learning-based techniques has led to more robust algorithms in low-SNR conditions. Among them, the deep neural network (DNN) has been one of the most popular techniques.
While DNN-based techniques have been successfully applied to these tasks, the characteristics of the VAD and speech enhancement tasks are not fully incorporated into the DNN structures and objective functions. In this thesis, we propose novel training schemes and a post-filter for DNN-based VAD and speech enhancement. Unlike algorithms with a basic DNN-based framework, the proposed algorithms combine knowledge from the signal processing and machine learning communities to develop improved DNN-based VAD and speech enhancement algorithms. In the following chapters, the environmental mismatch problem in VAD is compensated by applying multi-task learning to the DNN-based VAD. In addition, a DNN-based framework is proposed for the speech enhancement scenario, and a novel objective function and post-filter derived from the characteristics of human auditory perception improve the DNN-based speech enhancement algorithm.
In the VAD task, a DNN-based algorithm was recently proposed and outperformed traditional and other machine learning-based VAD algorithms. However, the performance of the DNN-based algorithm sometimes deteriorates when the training and test environments do not match. In order to increase the performance of the DNN-based VAD in unseen environments, we adopt a multi-task learning (MTL) framework which consists of a primary VAD task and a subsidiary feature enhancement task. By employing the MTL framework, the DNN learns a denoising function in the shared hidden layers that helps maintain VAD performance in mismatched noise conditions.
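The shared-layer MTL idea can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual configuration: the layer sizes, the single hidden layer, and the loss weight `alpha` are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40-dim noisy features, 64 shared hidden units.
D_IN, D_SHARED = 40, 64

# Shared hidden layer: both tasks read from this representation, which is
# where the denoising-like function is learned.
W_shared = rng.standard_normal((D_IN, D_SHARED)) * 0.1

# Task-specific heads: primary VAD (speech/non-speech posterior) and
# subsidiary feature enhancement (reconstructs the clean feature vector).
W_vad = rng.standard_normal((D_SHARED, 1)) * 0.1
W_enh = rng.standard_normal((D_SHARED, D_IN)) * 0.1

def forward(x):
    h = np.maximum(0.0, x @ W_shared)               # shared ReLU layer
    p_speech = 1.0 / (1.0 + np.exp(-(h @ W_vad)))   # VAD posterior
    x_clean_hat = h @ W_enh                         # enhanced features
    return p_speech, x_clean_hat

def mtl_loss(p, y, x_hat, x_clean, alpha=0.5):
    # Weighted sum of VAD cross-entropy and enhancement MSE; both gradients
    # flow into W_shared, which is the point of the MTL framework.
    eps = 1e-12
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    mse = np.mean((x_hat - x_clean) ** 2)
    return ce + alpha * mse

x = rng.standard_normal((8, D_IN))      # batch of noisy feature frames
y = rng.integers(0, 2, size=(8, 1))     # frame-level speech labels
p, x_hat = forward(x)
loss = mtl_loss(p, y, x_hat, rng.standard_normal((8, D_IN)))
```

In a real system both heads would be trained jointly by backpropagation; only the forward pass and the combined loss are shown here.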
Second, the DNN-based framework is applied to speech enhancement by treating it as a regression task. The encoding vector of the conventional nonnegative matrix factorization (NMF)-based algorithm is estimated by the proposed DNN, and the performance of the DNN-based algorithm is compared with that of the conventional NMF-based algorithm.
Third, a perceptually motivated objective function is proposed for DNN-based speech enhancement. In the proposed technique, a new objective function, consisting of the Mel-scale weighted mean square error and the temporal and spectral variation similarities between the enhanced and clean speech, is employed in the DNN training stage. The proposed objective function computes gradients on a perceptually motivated non-linear frequency scale and alleviates the over-smoothing of the estimated speech.
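One way a Mel-scale weighted mean square error can be realized is to weight each linear-frequency bin by the local slope of the mel curve, so low-frequency (perceptually finer) errors contribute more. This is a hedged sketch: the weighting scheme in `mel_weights` and the HTK mel formula are assumptions for illustration, and the thesis's full objective also includes the variation-similarity terms omitted here.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_weights(n_bins, sr=16000):
    # Weight each linear-frequency bin by the local slope of the mel curve,
    # so that errors in perceptually finer (low-frequency) regions count
    # more.  Weights are normalized to sum to one.
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    slopes = np.gradient(hz_to_mel(freqs), freqs)
    return slopes / slopes.sum()

def mel_weighted_mse(enhanced, clean, w):
    # Weighted squared error between enhanced and clean log-spectra,
    # averaged over frames and bins.
    return np.mean(((enhanced - clean) ** 2) * w)
```

In training, this scalar would replace the plain MSE term so that the DNN's gradients follow the perceptual frequency scale.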
Furthermore, a post-filter which adjusts the variance over frequency bins compensates for the lack of contrast between spectral peaks and valleys in the enhanced speech. Conventional GV equalization post-filters do not consider the spectral dynamics over frequency bins. To restore the contrast between spectral peaks and valleys in each enhanced speech frame, the proposed algorithm matches the variance over coefficients in the log-power spectral domain.
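The variance-matching step can be sketched as a per-frame rescaling in the log-power spectral domain. The reference variance `target_var` (e.g., measured on clean speech) and the exact normalization are illustrative assumptions, not the thesis's precise formulation.

```python
import numpy as np

def sv_postfilter(log_power, target_var):
    # Per frame (row), rescale deviations around the frame mean so that the
    # variance over frequency bins matches a reference variance, restoring
    # the peak/valley contrast smoothed away by MSE-trained DNNs.
    mu = log_power.mean(axis=1, keepdims=True)
    var = log_power.var(axis=1, keepdims=True)
    gain = np.sqrt(target_var / np.maximum(var, 1e-12))
    return mu + gain * (log_power - mu)
```

The frame mean is left untouched, so only the spectral contrast within each frame changes, not its overall level.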
Finally, in the speech enhancement task, an integrated technique using the proposed perceptually motivated objective function and the post-filter is described. The performance of the conventional and proposed algorithms in matched and mismatched noise conditions is discussed, and subjective preference test results are also provided.
1 Introduction
2 Conventional Approaches for Speech Enhancement
2.1 NMF-Based Speech Enhancement
3 Deep Neural Networks
3.1 Introduction
3.2 Objective Function
3.3 Stochastic Gradient Descent
4 DNN-Based Voice Activity Detection with Multi-Task Learning Framework
4.1 Introduction
4.2 DNN-Based VAD Algorithm
4.3 DNN-Based VAD with MTL Framework
4.4 Experimental Results
4.4.1 Experiments in Matched Noise Conditions
4.4.2 Experiments in Mismatched Noise Conditions
4.5 Summary
5 NMF-Based Speech Enhancement Using Deep Neural Network
5.1 Introduction
5.2 Encoding Vector Estimation Using DNN
5.3 Experiments
5.4 Summary
6 DNN-Based Monaural Speech Enhancement with Temporal and Spectral Variations Equalization
6.1 Introduction
6.2 Conventional DNN-Based Speech Enhancement
6.2.1 Training Stage
6.2.2 Test Stage
6.3 Perceptually Motivated Criteria
6.3.1 Perceptually Motivated Objective Function
6.3.2 Mel-Scale Weighted Mean Square Error
6.3.3 Temporal Variation Similarity
6.3.4 Spectral Variation Similarity
6.3.5 DNN Training with the Proposed Objective Function
6.4 Experiments
6.4.1 Performance Evaluation with Varying Weight Parameters
6.4.2 Performance Evaluation in Matched Noise Conditions
6.4.3 Performance Evaluation in Mismatched Noise Conditions
6.4.4 Comparison Between Variation Analysis Methods
6.4.5 Subjective Test Results
6.5 Summary
7 Spectral Variance Equalization Post-Filter for DNN-Based Speech Enhancement
7.1 Introduction
7.2 GV Equalization Post-Filter
7.3 Spectral Variance (SV) Equalization Post-Filter
7.4 Experiments
7.4.1 Objective Test Results
7.4.2 Subjective Test Results
7.5 Summary
8 Conclusions
Bibliography
Appendix
Abstract (in Korean)
Complex Neural Networks for Audio
Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). There are advantages to the frequency-domain representation; e.g., the human auditory system is known to process sound in the frequency domain. Furthermore, linear time-invariant systems are convolved with sources in the time domain, whereas they may be factorized in the frequency domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependence on high-quality acoustic models. Such models ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex audio. This manuscript is dedicated to addressing this shortcoming. Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables, and that the selection of input and output representations has a significant impact on the performance of a complex network. Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening. The complex model's performance exceeds that of both real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space. In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.
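One plausible form of the complex Adam correction, assumed here from the description above rather than taken from the manuscript itself, is to accumulate the second moment as |g|^2 rather than g^2: squaring a complex gradient yields a complex number, which is not a meaningful variance estimate.

```python
import numpy as np

def complex_adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Naive Adam accumulates g**2 in the second moment; for complex g that
    # quantity is complex-valued.  Using |g|^2 = g.real**2 + g.imag**2
    # keeps v real and nonnegative, as a variance should be.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * (g.real ** 2 + g.imag ** 2)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy run: minimize |w|^2, whose (Wirtinger) gradient is simply w.
w = np.array([1.0 + 1.0j])
m, v = np.zeros_like(w), np.zeros(1)
for t in range(1, 201):
    w, m, v = complex_adam_step(w, w, m, v, t, lr=0.05)
```

The toy loop drives |w| toward zero while v stays real, which is exactly what the naive complex implementation fails to guarantee.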
NMF-based compositional models for audio source separation
Thesis (Ph.D.) -- Graduate School of Seoul National University, Department of Electrical and Computer Engineering, February 2017. Advisor: Nam Soo Kim.
Many classes of data can be represented by constructive combinations of parts.
Most signals and data from nature have nonnegative values and can be explained and
reconstructed by constructive models, in which only additive combination is allowed
and no subtraction of parts occurs. Compositional models include dictionary learning,
exemplar-based approaches, and nonnegative matrix factorization (NMF). They are
desirable in many areas including image and visual signal processing, text
information processing, audio signal processing, and music information retrieval.
In this dissertation, we choose NMF as the compositional model and perform
NMF-based target source separation as the application.
Target source separation is the extraction or reconstruction of the target
signals from mixture signals that consist of target and interfering signals.
It can be viewed as blind source separation (BSS), which aims to extract the
original unknown source signals with no or very limited prior information.
In practice, however, much prior information is frequently utilized, and
various approaches have been proposed for single-channel source separation.
NMF approximates a nonnegative data matrix V with the product of nonnegative
basis and encoding matrices W and H, i.e., V ≈ WH. Since both W and H are
nonnegative, NMF often leads to a parts-based representation of the data.
NMF-based methods have shown impressive results in single-channel source
separation. The objective function of NMF is generally the Euclidean distance,
the Kullback-Leibler divergence, or the Itakura-Saito divergence. Many
optimization methods have been proposed and utilized, e.g., the multiplicative
update rule, projected gradient descent, and NeNMF. However, NMF-based audio
source separation still has several issues: non-uniqueness of the bases, a
high dependence on prior information, overlap between the subspaces spanned by
the target and interfering bases, disregard of the encoding vectors from the
training phase, and insufficient analysis of sparse NMF. In this dissertation,
we propose new approaches to resolve these issues.
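The multiplicative update rule mentioned above can be illustrated for the Euclidean objective. This is the standard Lee-Seung form, shown as a minimal sketch rather than any of the modified algorithms proposed later in the dissertation; the rank and iteration count are arbitrary.

```python
import numpy as np

def nmf_euclidean(V, r, n_iter=200, seed=0):
    # Multiplicative update rules for the Euclidean objective
    # ||V - WH||_F^2.  Nonnegative initialization plus multiplicative
    # updates preserves nonnegativity of W and H automatically.
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 1e-3
    H = rng.random((r, m)) + 1e-3
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In separation, the columns of W trained on isolated sources serve as the bases, and the mixture's encoding matrix H distributes energy among them.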
In Section 4, we propose a novel speech enhancement method that combines the
statistical model-based enhancement scheme with the NMF-based gain function.
For better performance in time-varying noise environments, both the speech and
noise bases of NMF are adapted simultaneously with the help of the estimated
speech presence probability. In Section 5, we propose a discriminative NMF
(DNMF) algorithm which exploits the reconstruction error for the interfering
signals as well as the target signal based on the target bases. In Section 6,
we propose an approach to robust basis estimation in which an incremental
strategy is adopted. Based on an analogy between clustering and NMF analysis,
we incrementally estimate the NMF bases in a manner similar to the modified
k-means and Linde-Buzo-Gray (LBG) algorithms popular in the data clustering
area. In Section 7, the distribution of the encoding vectors is modeled as a
multivariate exponential PDF (MVE) with a single scaling factor for each
source. In Section 8, several sparse penalty terms for NMF are analyzed and
compared in terms of signal-to-distortion ratio, sparseness of the encoding
vectors, reconstruction error, and entropy of the basis vectors. A new
objective function combining sparse representation and discriminative NMF
(DNMF) is also proposed.
1 Introduction
1.1 Audio source separation
1.2 Speech enhancement
1.3 Measurements
1.4 Outline of the dissertation
2 Compositional model and NMF
2.1 Compositional model
2.2 NMF
2.2.1 Update rules: MuR, PGD
2.2.2 Modified NMF
3 NMF-based audio source separation and issues
3.1 NMF-based audio source separation
3.2 Problems of NMF in audio source separation
3.2.1 A high dependency on the prior knowledge
3.2.2 An overlapped subspace between the target and interfering basis matrices
3.2.3 Non-uniqueness of the bases
3.2.4 Prior knowledge of the encoding vectors
3.2.5 Sparse NMF for the source separation
4 Online bases update
4.1 Introduction
4.2 NMF-based speech enhancement using spectral gain function
4.3 Speech enhancement combining statistical model-based and NMF-based methods with the on-line bases update
4.3.1 On-line update of speech and noise bases
4.3.2 Determining maximum update rates
4.4 Experimental results
5 Discriminative NMF
5.1 Introduction
5.2 Discriminative NMF utilizing cross reconstruction error
5.2.1 DNMF using the reconstruction error of the other source
5.2.2 DNMF using the interference factors
5.3 Experimental results
6 Incremental approach for bases estimation
6.1 Introduction
6.2 Incremental approach based on modified k-means clustering and Linde-Buzo-Gray algorithm
6.2.1 Based on modified k-means clustering
6.2.2 LBG-based incremental approach
6.3 Experimental results
6.3.1 Modified k-means clustering-based approach
6.3.2 LBG-based approach
7 Prior model of encoding vectors
7.1 Introduction
7.2 Prior model of encoding vectors based on multivariate exponential distribution
7.3 Experimental results
8 Conclusions
Bibliography
Abstract (in Korean)