8 research outputs found

    Audio Source Separation with Discriminative Scattering Networks

    In this report we describe an ongoing line of research for solving single-channel source separation problems. Many monaural signal decomposition techniques proposed in the literature operate on a feature space consisting of a time-frequency representation of the input data. A challenge faced by these approaches is to effectively exploit the temporal dependencies of the signals at scales larger than the duration of a time-frame. In this work we propose to tackle this problem by modeling the signals using a time-frequency representation with multiple temporal resolutions. The proposed representation consists of a pyramid of wavelet scattering operators, which generalizes Constant Q Transforms (CQT) with extra layers of convolution and complex modulus. We first show that learning standard models with this multi-resolution setting improves source separation results over fixed-resolution methods. As a case study, we use Non-Negative Matrix Factorization (NMF), which has been widely used in many audio applications. Then, we investigate the inclusion of the proposed multi-resolution setting into a discriminative training regime. We discuss several alternatives using different deep neural network architectures.

    Pre-processing techniques for improved detection of vocalization sounds in a neonatal intensive care unit

    The sounds occurring in the noisy acoustical environment of a Neonatal Intensive Care Unit (NICU) are thought to affect the growth and neurodevelopment of preterm infants. Automatic sound detection in a NICU is a novel and challenging problem, and it is an essential step in the investigation of how preterm infants react to auditory stimuli of the NICU environment. In this paper, we present our work on an automatic system for detection of vocalization sounds, which are extensively present in NICUs. The proposed system reduces the presence of irrelevant sounds prior to detection. Several pre-processing techniques are compared, which are based on either spectral subtraction or non-negative matrix factorization, or a combination of both. The vocalization sounds are detected from the enhanced audio signal using either generative or discriminative classification models. An audio database acquired in a real-world NICU environment is used to assess the performance of the detection system in terms of frame-level miss and false alarm rates. The inclusion of the enhancement pre-processing step leads to up to 17.54% relative improvement over the baseline. Peer reviewed. Postprint (published version).
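The frame-level evaluation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the usual definitions (miss rate as the fraction of target frames that go undetected, false alarm rate as the fraction of non-target frames flagged as target) and a relative improvement computed on an error rate. The abstract does not say which metric its 17.54% figure refers to, and the function names here are hypothetical.

```python
import numpy as np

def frame_error_rates(reference, predicted):
    """Return (miss_rate, false_alarm_rate) for per-frame binary labels.

    Assumed definitions, not taken from the paper:
      miss rate        = missed target frames / all target frames
      false alarm rate = falsely flagged frames / all non-target frames
    """
    reference = np.asarray(reference, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    n_target = reference.sum()
    n_non_target = (~reference).sum()
    misses = np.sum(reference & ~predicted)
    false_alarms = np.sum(~reference & predicted)
    # Guard against empty classes to avoid division by zero.
    miss_rate = misses / n_target if n_target else 0.0
    fa_rate = false_alarms / n_non_target if n_non_target else 0.0
    return float(miss_rate), float(fa_rate)

def relative_improvement(baseline_error, new_error):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline_error - new_error) / baseline_error
```

With toy labels, `frame_error_rates([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])` counts one missed frame out of four target frames and one false alarm out of four non-target frames.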

    A Perceptual Evaluation of Short-Time Fourier Transform Window Duration and Divergence Cost Function on Audio Source Separation using Non-negative Matrix Factorization

    Non-negative matrix factorization (NMF) is an established method of performing audio source separation. Previous studies used NMF with supplementary systems to improve performance, but little has been done to investigate perceptual effects of NMF parameters. The present study aimed to evaluate two NMF parameters for speech enhancement: the short-time Fourier transform (STFT) window duration and divergence cost function. Two experiments were conducted: the first investigated the effect of STFT window duration on target speech intelligibility in a sentence keyword identification task. The second experiment had participants rate residual noise levels present in target speech using three different cost functions: the Euclidean Distance (EU), the Kullback-Leibler (KL) divergence, and the Itakura-Saito (IS) divergence. It was found that a 92.9 ms window duration produced the highest intelligibility scores, while the IS divergence produced significantly lower residual noise levels than the EU and KL divergences. Additionally, significant positive correlations were found between subjective residual noise scores and objective metrics from the Blind Source Separation (BSS_Eval) and Perceptual Evaluation method for Audio Source Separation (PEASS) toolboxes. Results suggest longer window durations, with increased frequency resolution, allow more accurate distinction between sources, improving intelligibility scores. Additionally, the IS divergence is able to more accurately approximate high frequency and transient components of audio, increasing separation of speech and noise. Correlation results suggest that using full bandwidth stimuli could increase reliability of objective measures.
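The three cost functions compared in the abstract above can be written down compactly. The following is a minimal sketch using their standard textbook forms, summed over all time-frequency bins. The function name and the small `eps` guard are assumptions for illustration and are not taken from the study.

```python
import numpy as np

def beta_divergences(v, v_hat, eps=1e-12):
    """Return the EU, KL, and IS costs between two non-negative arrays.

    v     : observed magnitudes or powers (e.g. a spectrogram)
    v_hat : model approximation of the same shape
    eps   : small constant (an assumption) to avoid log(0) and division by zero
    """
    v = np.asarray(v, dtype=float) + eps
    v_hat = np.asarray(v_hat, dtype=float) + eps
    ratio = v / v_hat
    eu = 0.5 * np.sum((v - v_hat) ** 2)              # Euclidean distance
    kl = np.sum(v * np.log(ratio) - v + v_hat)       # generalized KL divergence
    is_ = np.sum(ratio - np.log(ratio) - 1.0)        # Itakura-Saito divergence
    return {"EU": eu, "KL": kl, "IS": is_}
```

One property worth noting is that the IS divergence depends only on the ratio `v / v_hat`, so it is unchanged when both arguments are scaled by the same factor. Low-energy bins therefore count as much as high-energy ones, which is consistent with the abstract's remark about high-frequency and transient components.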

    NMF-based compositional models for audio source separation

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Kim Nam-soo (김남수). Many classes of data can be represented by constructive combinations of parts. Most signals and data from nature have nonnegative values and can be explained and reconstructed by constructive models. In constructive models, only additive combinations are allowed, so parts are never subtracted. Compositional models include dictionary learning, exemplar-based approaches, and nonnegative matrix factorization (NMF). Compositional models are desirable in many areas including image or visual signal processing, text information processing, audio signal processing, and music information retrieval. In this dissertation, we choose NMF as the compositional model and apply it to target source separation. Target source separation is the extraction or reconstruction of the target signals from a mixture signal that consists of the target and interfering signals. Target source separation can be regarded as blind source separation (BSS). BSS aims to extract the original unknown source signals with no or very limited prior information. Nowadays, however, considerable prior information is frequently utilized, and various approaches have been proposed for single-channel source separation. NMF approximates a nonnegative data matrix V with a product of nonnegative basis and encoding matrices W and H, i.e., V ≈ WH. Since both W and H are nonnegative, NMF often leads to a part-based representation of the data. Methods based on NMF have shown impressive results in single-channel source separation. The objective function of NMF is generally the Euclidean distance, the Kullback-Leibler divergence, or the Itakura-Saito divergence. Many optimization methods have been proposed and utilized, e.g., the multiplicative update rule, projected gradient descent, and NeNMF.
However, NMF-based audio source separation has the following issues: non-uniqueness of the bases, a high dependence on prior information, overlapping subspaces between the target bases and the interfering bases, disregard of the encoding vectors from the training phase, and insufficient analysis of sparse NMF. In this dissertation, we propose new approaches to resolve these issues. In Section 4, we propose a novel speech enhancement method that combines the statistical model-based enhancement scheme with the NMF-based gain function. For better performance in time-varying noise environments, both the speech and noise bases of NMF are adapted simultaneously with the help of the estimated speech presence probability. In Section 5, we propose a discriminative NMF (DNMF) algorithm which exploits the reconstruction error for the interfering signals as well as the target signal based on target bases. In Section 6, we propose an approach to robust bases estimation in which an incremental strategy is adopted. Based on an analogy between clustering and NMF analysis, we incrementally estimate the NMF bases in a manner similar to the modified k-means and Linde-Buzo-Gray algorithms popular in the data clustering area. In Section 7, the distribution of the encoding vector is modeled as a multivariate exponential PDF (MVE) with a single scaling factor for each source. In Section 8, several sparse penalty terms for NMF are analyzed and compared in terms of signal-to-distortion ratio, sparseness of encoding vectors, reconstruction error, and entropy of basis vectors.
A new objective function that combines sparse representation with discriminative NMF (DNMF) is also proposed.

    Contents:
    1 Introduction
        1.1 Audio source separation
        1.2 Speech enhancement
        1.3 Measurements
        1.4 Outline of the dissertation
    2 Compositional model and NMF
        2.1 Compositional model
        2.2 NMF
            2.2.1 Update rules: MuR, PGD
            2.2.2 Modified NMF
    3 NMF-based audio source separation and issues
        3.1 NMF-based audio source separation
        3.2 Problems of NMF in audio source separation
            3.2.1 A high dependency to the prior knowledge
            3.2.2 A overlapped subspace between the target and interfering basis matrices
            3.2.3 A non-uniqueness of the bases
            3.2.4 A prior knowledge of the encoding vectors
            3.2.5 Sparse NMF for the source separation
    4 Online bases update
        4.1 Introduction
        4.2 NMF-based speech enhancement using spectral gain function
        4.3 Speech enhancement combining statistical model-based and NMF-based methods with the on-line bases update
            4.3.1 On-line update of speech and noise bases
            4.3.2 Determining maximum update rates
        4.4 Experiment result
    5 Discriminative NMF
        5.1 Introduction
        5.2 Discriminative NMF utilizing cross reconstruction error
            5.2.1 DNMF using the reconstruction error of the other source
            5.2.2 DNMF using the interference factors
        5.3 Experiment result
    6 Incremental approach for bases estimate
        6.1 Introduction
        6.2 Incremental approach based on modified k-means clustering and Linde-Buzo-Gray algorithm
            6.2.1 Based on modified k-means clustering
            6.2.2 LBG based incremental approach
        6.3 Experiment result
            6.3.1 Modified k-means clustering based approach
            6.3.2 LBG based approach
    7 Prior model of encoding vectors
        7.1 Introduction
        7.2 Prior model of encoding vectors based on multivariate exponential distribution
        7.3 Experiment result
    8 Conclusions
    Bibliography
    Abstract in Korean
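The basic factorization this dissertation builds on, a nonnegative matrix V approximated by a product of nonnegative matrices W and H, can be sketched with the classic multiplicative update rule for the Kullback-Leibler objective. This is a generic illustration and not the dissertation's own algorithm. The function name, random initialization, iteration count, and `eps` guard are all assumptions.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Factor a nonnegative matrix V (freq x frames) as W @ H.

    Uses the standard multiplicative updates for the generalized KL
    divergence. W holds `rank` basis vectors in its columns, and H holds
    the corresponding encoding (activation) vectors in its rows.
    """
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    # Random positive initialization (an assumption; other schemes exist).
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Update encodings with the bases held fixed.
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        # Update bases with the encodings held fixed.
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H
```

Because every update multiplies by a nonnegative ratio, W and H stay nonnegative without any explicit projection step. For separation, one common setup (assumed here, not detailed in the abstract) trains bases for each source separately, concatenates them, and estimates only H on the mixture.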

    Complex Neural Networks for Audio

    Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). There are advantages to the frequency-domain representation, e.g., the human auditory system is known to process sound in the frequency-domain. Furthermore, linear time-invariant systems are convolved with sources in the time-domain, whereas they may be factorized in the frequency-domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependencies on high quality acoustic models. They ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex audio. This manuscript is dedicated to addressing the shortcoming. Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables and show that selection of input and output representation has a significant impact on the performance of a complex network. Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening. The complex model's performance is better than that of the real-valued models, as well as real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space.
In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.

    Single channel audio separation using deep neural networks and matrix factorizations

    PhD Thesis. Source separation has become a significant research topic in the signal processing community and the machine learning area. Due to numerous applications, such as automatic speech recognition and speech communication, separation of target speech from the mixed signal is of great importance. In many practical applications, speech separation from a single recorder is most desirable. In this thesis, two novel approaches are proposed to address this single-channel audio separation problem. The thesis first reviews traditional approaches for single-channel source separation, and later elicits a generic approach that is more capable of feature learning, i.e. deep graphical models. In the first part of this thesis, a novel approach based on matrix factorization and a hierarchical model is proposed. In this work, an artificial stereo mixture is formulated to provide extra information. In addition, a hybrid framework that combines the generalized Expectation-Maximization algorithm with a multiplicative update rule is proposed to optimize the parameters of a matrix factorization based approach to approximately separate the mixture. Furthermore, a hierarchical model based on an extreme learning machine is developed to check the validity of the approximately separated sources, followed by an energy minimization method that further improves the quality of the separated sources by generating a time-frequency mask. Various experiments have been conducted, and the results show that the proposed approach outperforms conventional approaches not only in reducing computational complexity but also in separation performance. In the second part, a deep neural network based ensemble system is proposed. In this work, the complementary property of different features is fully explored by a ‘wide’ and ‘forward’ ensemble system.
In addition, instead of using the features learned from the output layer, the features learned from the penultimate layer are investigated. The final embedded features are classified with an extreme learning machine to generate a binary mask to separate a mixed signal. The experiments focus on speech in the presence of music, and the results demonstrate that the proposed ensemble system is able to explore the complementary property of various features thoroughly under various conditions with promising separation performance.
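The binary time-frequency mask mentioned in the abstract above can be illustrated with a minimal sketch. It builds an oracle ("ideal") mask from known source magnitudes, which is a swapped-in simplification: in the thesis the mask is produced by an extreme learning machine classifier, which is not reproduced here. The function names are hypothetical.

```python
import numpy as np

def ideal_binary_mask(target_mag, interferer_mag):
    """Return 1.0 where the target dominates a time-frequency bin, else 0.0.

    This oracle mask needs the true source magnitudes, so it only serves
    as an illustration or an upper-bound reference, not as a separator.
    """
    target_mag = np.asarray(target_mag, dtype=float)
    interferer_mag = np.asarray(interferer_mag, dtype=float)
    return (target_mag > interferer_mag).astype(float)

def apply_mask(mixture_spec, mask):
    """Keep only the mixture bins that the mask assigns to the target."""
    return np.asarray(mask) * np.asarray(mixture_spec)
```

In a full pipeline the masked spectrogram would be passed through an inverse STFT together with the mixture phase to obtain a time-domain estimate. That step is omitted here to keep the sketch self-contained.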