8 research outputs found
Audio Source Separation with Discriminative Scattering Networks
In this report we describe an ongoing line of research for solving
single-channel source separation problems. Many monaural signal decomposition
techniques proposed in the literature operate on a feature space consisting of
a time-frequency representation of the input data. A challenge faced by these
approaches is to effectively exploit the temporal dependencies of the signals
at scales larger than the duration of a time-frame. In this work we propose to
tackle this problem by modeling the signals using a time-frequency
representation with multiple temporal resolutions. The proposed representation
consists of a pyramid of wavelet scattering operators, which generalizes
Constant Q Transforms (CQT) with extra layers of convolution and complex
modulus. We first show that learning standard models with this multi-resolution
setting improves source separation results over fixed-resolution methods. As a
case study, we use Non-negative Matrix Factorization (NMF), which has been
widely used in many audio applications. Then, we investigate the inclusion
of the proposed multi-resolution setting into a discriminative training regime.
We discuss several alternatives using different deep neural network
architectures.
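The multi-resolution idea above can be illustrated with a toy stand-in: magnitude STFTs of the same signal computed at several window lengths (the scattering pyramid adds further convolution and complex-modulus layers on top of this; the window sizes below are arbitrary illustrative choices):

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude STFT with a Hann window; returns (frames x bins)."""
    win = np.hanning(win_len)
    frames = [x[i:i + win_len] * win
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

# A toy 440 Hz tone at 16 kHz, analyzed at three temporal resolutions.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)

pyramid = {n: stft_mag(x, n, n // 2) for n in (256, 1024, 4096)}
for n, s in pyramid.items():
    print(n, s.shape)  # longer windows: fewer frames, more frequency bins
```

Each level trades temporal resolution for frequency resolution; a model reading all levels at once can exploit dependencies both within and across time-frames.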
Pre-processing techniques for improved detection of vocalization sounds in a neonatal intensive care unit
The sounds occurring in the noisy acoustical environment of a Neonatal Intensive Care Unit (NICU) are thought to affect the growth and neurodevelopment of preterm infants. Automatic sound detection in a NICU is a novel and challenging problem, and it is an essential step in the investigation of how preterm infants react to auditory stimuli of the NICU environment. In this paper, we present our work on an automatic system for detection of vocalization sounds, which are extensively present in NICUs. The proposed system reduces the presence of irrelevant sounds prior to detection. Several pre-processing techniques are compared, which are based on either spectral subtraction or non-negative matrix factorization, or a combination of both. The vocalization sounds are detected from the enhanced audio signal using either generative or discriminative classification models. An audio database acquired in a real-world NICU environment is used to assess the performance of the detection system in terms of frame-level missing and false alarm rates. The inclusion of the enhancement pre-processing step leads to up to 17.54% relative improvement over the baseline.
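Spectral subtraction, one of the pre-processing techniques compared above, can be sketched in a few lines; the noise estimate and flooring constant here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def spectral_subtraction(mag, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from each frame,
    flooring the result relative to the input to avoid negative
    magnitudes (a common cause of "musical noise")."""
    enhanced = mag - noise_mag[None, :]
    return np.maximum(enhanced, floor * mag)

# Toy example: 5 frames x 4 bins, with a constant noise floor of 1.0.
mag = np.array([[3.0, 0.5, 2.0, 1.0]] * 5)
noise = np.ones(4)
out = spectral_subtraction(mag, noise)
print(out[0])  # bins dominated by noise are floored, not negated
```

In practice the noise spectrum would be estimated from non-vocalization frames before subtraction, and NMF-based enhancement could replace or follow this step.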
A Perceptual Evaluation of Short-Time Fourier Transform Window Duration and Divergence Cost Function on Audio Source Separation using Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) is an established method of performing audio source separation. Previous studies used NMF with supplementary systems to improve performance, but little has been done to investigate the perceptual effects of NMF parameters. The present study aimed to evaluate two NMF parameters for speech enhancement: the short-time Fourier transform (STFT) window duration and the divergence cost function. Two experiments were conducted: the first investigated the effect of STFT window duration on target speech intelligibility in a sentence keyword identification task. The second experiment had participants rate residual noise levels present in target speech using three different cost functions: the Euclidean distance (EU), the Kullback-Leibler (KL) divergence, and the Itakura-Saito (IS) divergence. It was found that a 92.9 ms window duration produced the highest intelligibility scores, while the IS divergence produced significantly lower residual noise levels than the EU and KL divergences. Additionally, significant positive correlations were found between subjective residual noise scores and objective metrics from the Blind Source Separation (BSS_Eval) and Perceptual Evaluation methods for Audio Source Separation (PEASS) toolboxes. Results suggest that longer window durations, with increased frequency resolution, allow more accurate distinction between sources, improving intelligibility scores. Additionally, the IS divergence is able to more accurately approximate high-frequency and transient components of audio, increasing the separation of speech and noise. Correlation results suggest that using full-bandwidth stimuli could increase the reliability of objective measures.
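The three cost functions evaluated above are all members of the beta-divergence family; a direct element-wise implementation might look like this (a generic sketch, not the study's code):

```python
import numpy as np

def divergence(V, Vhat, kind="kl"):
    """Divergence between a nonnegative spectrogram V and its model Vhat."""
    if kind == "eu":                      # squared Euclidean distance
        return 0.5 * np.sum((V - Vhat) ** 2)
    if kind == "kl":                      # generalized Kullback-Leibler
        return np.sum(V * np.log(V / Vhat) - V + Vhat)
    if kind == "is":                      # Itakura-Saito
        return np.sum(V / Vhat - np.log(V / Vhat) - 1)
    raise ValueError(kind)

V = np.array([1.0, 2.0, 4.0])
Vhat = np.array([1.0, 2.0, 4.0])
# All three divergences vanish when the model is exact.
print([divergence(V, Vhat, k) for k in ("eu", "kl", "is")])
```

Note that the IS divergence is scale-invariant: doubling both V and Vhat leaves it unchanged, so low-energy (e.g., high-frequency or transient) components are weighted as strongly as dominant ones, consistent with the residual-noise result above.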
NMF-based compositional models for audio source separation
Doctoral dissertation -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Nam Soo Kim.

Many classes of data can be represented as constructive combinations of parts. Most signals and data found in nature take nonnegative values and can be explained and reconstructed by constructive models, in which only additive combinations are allowed and no subtraction of parts occurs. Compositional models include dictionary learning, exemplar-based approaches, and nonnegative matrix factorization (NMF). They are desirable in many areas, including image and visual signal processing, text information processing, audio signal processing, and music information retrieval. In this dissertation, we choose NMF as the compositional model and apply it to target source separation.
Target source separation is the extraction or reconstruction of the target signals from a mixture consisting of the target and interfering signals. It can be viewed as blind source separation (BSS), which aims to extract the original, unknown source signals with no or very limited prior information. In practice, however, prior information is frequently exploited, and various such approaches have been proposed for single-channel source separation.
NMF approximates a nonnegative data matrix V by the product of nonnegative basis and encoding matrices W and H, i.e., V ≈ WH. Since both W and H are nonnegative, NMF often leads to a parts-based representation of the data. NMF-based methods have shown impressive results in single-channel source separation. The objective function of NMF is generally based on the Euclidean distance, the Kullback-Leibler divergence, or the Itakura-Saito divergence, and many optimization methods have been proposed and utilized, e.g., multiplicative update rules, projected gradient descent, and NeNMF. However, NMF-based audio source separation suffers from several issues: non-uniqueness of the bases, a high dependence on prior information, overlap between the subspaces of the target and interfering bases, disregard of the encoding vectors obtained in the training phase, and insufficient analysis of sparse NMF. In this dissertation, we propose new approaches to resolve these issues.
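For reference, the multiplicative update rules mentioned above can be sketched for the generalized KL divergence (a generic Lee-Seung-style implementation, not the dissertation's code):

```python
import numpy as np

def nmf_kl(V, rank, n_iter=500, seed=0):
    """Factorize V ~ W @ H under the generalized KL divergence using
    multiplicative updates; nonnegativity is preserved automatically
    because each update multiplies by a nonnegative ratio."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 0.1
    H = rng.random((rank, T)) + 0.1
    eps = 1e-12
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

# A small rank-2 nonnegative matrix is recovered almost exactly.
V = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 1.0, 1.0]])
W, H = nmf_kl(V, rank=2)
print(np.abs(V - W @ H).max())  # small reconstruction error
```

The non-uniqueness issue listed above is visible even here: rescaling a column of W and the matching row of H inversely leaves WH unchanged.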
In Section 4, we propose a novel speech enhancement method that combines a statistical model-based enhancement scheme with an NMF-based gain function. For better performance in time-varying noise environments, both the speech and noise bases of NMF are adapted simultaneously with the help of the estimated speech presence probability. In Section 5, we propose a discriminative NMF (DNMF) algorithm which exploits the reconstruction error for the interfering signals as well as for the target signal, based on the target bases. In Section 6, we propose an incremental strategy for robust basis estimation: based on an analogy between clustering and NMF analysis, we incrementally estimate the NMF bases in a manner similar to the modified k-means and Linde-Buzo-Gray (LBG) algorithms popular in data clustering. In Section 7, the distribution of the encoding vector is modeled as a multivariate exponential PDF (MVE) with a single scaling factor for each source. In Section 8, several sparse penalty terms for NMF are analyzed and compared in terms of signal-to-distortion ratio, sparseness of the encoding vectors, reconstruction error, and entropy of the basis vectors; a new objective function combining sparse representation and discriminative NMF is also proposed.
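The NMF-based gain function of Section 4 is typically a Wiener-style ratio of the modeled speech spectrum to the modeled mixture spectrum; a minimal sketch under that assumption (the basis and encoding matrices here are toy values, not trained bases):

```python
import numpy as np

def nmf_wiener_gain(W_s, H_s, W_n, H_n, eps=1e-12):
    """Per-bin gain from NMF-modeled speech and noise spectrograms:
    the enhanced spectrum is gain * mixture spectrum."""
    S = W_s @ H_s          # modeled speech magnitude
    N = W_n @ H_n          # modeled noise magnitude
    return S / (S + N + eps)

# Toy 2-bin example: speech dominates bin 0, noise dominates bin 1.
W_s = np.array([[1.0], [0.1]]); H_s = np.array([[3.0]])
W_n = np.array([[0.1], [1.0]]); H_n = np.array([[3.0]])
gain = nmf_wiener_gain(W_s, H_s, W_n, H_n)
print(gain.ravel())  # near 1 where speech dominates, near 0 elsewhere
```

Adapting W_s and W_n online, as the dissertation describes, lets this gain track time-varying noise instead of relying solely on pre-trained bases.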
Complex Neural Networks for Audio
Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). The frequency-domain representation has several advantages, e.g., the human auditory system is known to process sound in the frequency domain. Furthermore, linear time-invariant systems are convolved with sources in the time domain, whereas they may be factorized in the frequency domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependence on high-quality acoustic models. They ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex audio. This manuscript is dedicated to addressing this shortcoming. Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables, and that the selection of input and output representation has a significant impact on the performance of a complex network. Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening. The complex model's performance is better than that of real-valued models as well as real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space.
In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.
Single channel audio separation using deep neural networks and matrix factorizations
PhD thesis. Source separation has become a significant research topic in the signal processing and machine learning communities. Due to numerous applications, such as automatic speech recognition and speech communication, separation of target speech from a mixed signal is of great importance. In many practical applications, speech separation from a single recording is the most desirable setting from an application standpoint. In this thesis, two novel approaches are proposed to address this single-channel audio separation problem. The thesis first reviews traditional approaches for single-channel source separation, and then develops a more generic approach with a greater capacity for feature learning, i.e., deep graphical models.
In the first part of this thesis, a novel approach based on matrix factorization and a hierarchical model is proposed. In this work, an artificial stereo mixture is formulated to provide extra information. In addition, a hybrid framework that combines the generalized Expectation-Maximization algorithm with a multiplicative update rule is proposed to optimize the parameters of a matrix factorization based approach to approximately separate the mixture. Furthermore, a hierarchical model based on an extreme learning machine is developed to check the validity of the approximately separated sources, followed by an energy minimization method that further improves the quality of the separated sources by generating a time-frequency mask. Various experiments have been conducted, and the obtained results show that the proposed approach outperforms conventional approaches not only in computational complexity but also in separation performance.
In the second part, a deep neural network based ensemble system is proposed. In this work, the complementary properties of different features are fully explored by a ‘wide’ and ‘forward’ ensemble system. In addition, instead of using the features learned from the output layer, the features learned from the penultimate layer are investigated. The final embedded features are classified with an extreme learning machine to generate a binary mask that separates the mixed signal. The experiments focus on speech in the presence of music, and the obtained results demonstrate that the proposed ensemble system is able to explore the complementary properties of various features thoroughly under various conditions, with promising separation performance.
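The binary-mask separation stage described above can be sketched as follows; the magnitude estimates standing in for the classifier's outputs are toy values:

```python
import numpy as np

def binary_mask(target_mag, interferer_mag):
    """Ideal-binary-mask rule: keep a time-frequency cell when the
    target's estimated magnitude dominates the interferer's."""
    return (target_mag > interferer_mag).astype(float)

def apply_mask(mixture_stft, mask):
    """Zero the cells assigned to the interferer; the masked STFT
    would then be inverted back to a waveform."""
    return mixture_stft * mask

# Toy 2x3 example: speech dominates three cells, music the rest.
speech = np.array([[3.0, 0.2, 1.5],
                   [0.1, 0.3, 0.4]])
music = np.array([[0.5, 2.0, 0.5],
                  [1.0, 2.5, 0.2]])
mixture = (speech + music).astype(complex)
mask = binary_mask(speech, music)
separated = apply_mask(mixture, mask)
print(mask)
```

In the ensemble system itself, the mask would come from the extreme learning machine's frame-level decisions rather than from oracle magnitudes as in this sketch.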