7 research outputs found
Single channel audio separation using deep neural networks and matrix factorizations
PhD Thesis
Source separation has become a significant research topic in the signal processing and machine learning communities. Because of its numerous applications, such as automatic speech recognition and speech communication, separating target speech from a mixed signal is of great importance. In many practical applications, separation from a single recording channel is the most desirable setting. In this thesis, two novel approaches are proposed to address this single channel audio separation problem. The thesis first reviews traditional approaches to single channel source separation and then introduces a more generic class of approaches that is better suited to feature learning, namely deep graphical models.
In the first part of this thesis, a novel approach based on matrix factorization and a hierarchical model is proposed. In this work, an artificial stereo mixture is formulated to provide extra information. In addition, a hybrid framework that combines the generalized Expectation-Maximization algorithm with a multiplicative update rule is proposed to optimize the parameters of a matrix factorization based approach and obtain an approximate separation of the mixture. Furthermore, a hierarchical model based on an extreme learning machine is developed to check the validity of the approximately separated sources, followed by an energy minimization method that further improves the quality of the separated sources by generating a time-frequency mask. Various experiments have been conducted, and the results show that the proposed approach outperforms conventional approaches not only in computational complexity but also in separation performance.
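As a rough illustration of the multiplicative update rule mentioned above, the following sketch factorizes a nonnegative matrix with the standard Lee-Seung updates for the squared Euclidean cost. It is not the thesis's hybrid generalized EM scheme; the function name `nmf_multiplicative`, the toy matrix `V`, and all parameter values are assumptions made for the example.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix V (freq x time) as W @ H."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps  # spectral basis vectors
    H = rng.random((rank, n_time)) + eps  # time-varying activations
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for the cost ||V - W H||^2.
        # Factors stay nonnegative because every term is nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy magnitude "spectrogram" built from two nonnegative components
# (hypothetical data, only for illustration).
rng = np.random.default_rng(1)
V = rng.random((16, 2)) @ rng.random((2, 40))
W, H = nmf_multiplicative(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In a separation setting, subsets of the columns of `W` (with the matching rows of `H`) would be assigned to individual sources; how that assignment is made is outside the scope of this sketch.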
In the second part, a deep neural network based ensemble system is proposed. In this work, the complementary properties of different features are fully explored by "wide" and "forward" ensemble systems. In addition, instead of using the features learned from the output layer, the features learned from the penultimate layer are investigated. The final embedded features are classified with an extreme learning machine to generate a binary mask that separates a mixed signal. The experiments focus on speech in the presence of music, and the results demonstrate that the proposed ensemble system can exploit the complementary properties of various features under various conditions, with promising separation performance.
Convex regularizations for the simultaneous recording of room impulse responses
We propose to acquire large sets of room impulse responses (RIRs) by simultaneously playing known source signals on multiple loudspeakers. We then estimate the RIRs via a convex optimization algorithm using convex penalties promoting sparsity and/or exponential amplitude envelope. We validate this approach on real-world recordings. The proposed algorithm makes it possible to estimate the RIRs to a reasonable accuracy even when the number of recorded samples is smaller than the number of RIR samples to be estimated, thereby leading to a speedup of the recording process compared to state-of-the-art RIR acquisition techniques. Moreover, the penalty promoting both sparsity and exponential amplitude envelope provides the best results in terms of robustness to the choice of its parameters, thereby consolidating the evidence in favor of sparse regularization for RIR estimation. Finally, the impact of the choice of the emitted signals is analyzed and evaluated.
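To make the idea of sparsity-regularized RIR estimation from simultaneous playback concrete, here is a minimal sketch that stacks the convolution matrices of two known source signals and solves an l1-penalized least-squares problem with ISTA (proximal gradient). The abstract does not name the solver or the exact penalties, so the plain l1 penalty, the ISTA solver, the helper names, and the toy sizes (48 recorded samples for 64 unknown taps) are all assumptions.

```python
import numpy as np

def conv_matrix(x, n_taps, n_obs):
    """Matrix X such that X @ h equals the first n_obs samples of x * h."""
    X = np.zeros((n_obs, n_taps))
    for k in range(n_taps):
        X[k:, k] = x[: n_obs - k]
    return X

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(A, y, h, lam):
    return 0.5 * np.sum((A @ h - y) ** 2) + lam * np.sum(np.abs(h))

def ista(A, y, lam, n_iter=500):
    """Minimize 0.5 * ||A h - y||^2 + lam * ||h||_1 by proximal gradient."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    h = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ h - y)
        h = soft_threshold(h - step * grad, step * lam)
    return h

# Hypothetical toy setup: two loudspeakers play known noise signals at once,
# and one microphone records fewer samples than there are unknown RIR taps.
rng = np.random.default_rng(0)
n_taps, n_obs = 32, 48
sources = [rng.standard_normal(n_obs) for _ in range(2)]
A = np.hstack([conv_matrix(x, n_taps, n_obs) for x in sources])  # 48 x 64
h_true = np.zeros(2 * n_taps)
h_true[[0, 5, 12, 32, 39, 50]] = [1.0, 0.6, -0.3, 0.9, -0.5, 0.25]
y = A @ h_true                       # noiseless recording
h_est = ista(A, y, lam=0.05)         # stacked estimate [h_1; h_2]
```

The first `n_taps` entries of `h_est` correspond to the first loudspeaker and the rest to the second. A penalty that also encodes an exponential amplitude envelope, as described in the abstract, would need a different proximal step than the plain soft threshold used here.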
An Adaptive Strategy for Sensory Processing
Recognizing objects and detecting associations among them is essential for the survival of organisms. The ability to perform these tasks is derived from the representations of objects obtained through processing information along sensory pathways. Our current understanding of sensory processing is based on two sets of foundational theories: the Efficient Coding Hypothesis and the hierarchical assembly of object representations. These theories suggest that sensory processing aims to identify independent features of the environment and progressively represent objects in terms of comprehensive combinations of these features. Separately, the two sets of theories have successfully explained the detection of associations and perceptual invariance, respectively; however, reconciling them in one unified theory has remained challenging. Independent features are deemed essential for detecting associations by the Efficient Coding Hypothesis, but to achieve consistency in representations, multiple comprehensive structures corresponding to the same object must be hierarchically assembled, ignoring independence among such structures.
Here we propose an alternative framework for sensory processing in which the system, instead of finding the truly independent components of the environment, aims to represent objects based on their most informative structures. Using theoretical arguments, we show that following such a strategy allows the system to efficiently represent sensory cues without necessarily acquiring knowledge about the statistical properties of all possible inputs. Through mathematical simulations, we find that the framework can describe the known characteristics of early sensory processing stages and permits the consistent input representations observed at later stages of processing. We also demonstrate that the framework can be implemented in a biologically plausible neuronal circuit and can explain aspects of experience and learning from corrupted inputs. Thus, this framework provides a novel perspective and a unified description of sensory processing in its entirety.
Online Audio-Visual Multi-Source Tracking and Separation: A Labeled Random Finite Set Approach
The dissertation proposes an online solution for separating an unknown and time-varying number of moving sources using audio and visual data. The random finite set framework is used for the modeling and fusion of audio and visual data. This enables an online tracking algorithm to estimate the source positions and identities at each time point. With this information, a set of beamformers can be designed to separate each desired source and suppress the interfering sources.
Efficient and Robust Methods for Audio and Video Signal Analysis
This thesis presents my research concerning audio and video signal processing and machine learning. Specifically, the topics of my research include computationally efficient classifier compounds, automatic speech recognition (ASR), music dereverberation, video cut point detection and video classification.
The computational efficiency of information retrieval based on multiple measurement modalities is considered in this thesis. Specifically, a cascade processing framework, including a training algorithm to set its parameters, has been developed for combining multiple detectors or binary classifiers in a computationally efficient way. The developed cascade processing framework has been applied to the video information retrieval tasks of video cut point detection and video classification. The results in video classification, compared with others found in the literature, indicate that the developed framework is capable of both accurate and computationally efficient classification. The idea of cascade processing has additionally been adapted for the ASR task. A procedure for combining multiple speech state likelihood estimation methods within an ASR framework in a cascaded manner has been developed. The results obtained clearly show that the computational load of ASR can be reduced using the cascaded speech state likelihood estimation process without impairing the transcription accuracy.
Additionally, this thesis presents my work on the noise robustness of ASR using a nonnegative matrix factorization (NMF) based approach. Specifically, methods for transforming sparse NMF features into speech state likelihoods have been explored. The results reveal that learned transformations from NMF activations to speech state likelihoods provide better ASR transcription accuracy than dictionary label based transformations. The results, compared with others in a noisy speech recognition challenge, show that NMF-based processing is an efficient strategy for noise robustness in ASR.
The thesis also presents my work on audio signal enhancement, specifically on removing the detrimental effect of reverberation from music audio. In this work, a linear prediction based dereverberation algorithm, originally developed for speech signal enhancement, was applied to music. The results obtained show that the algorithm performs well on music signals and indicate that dynamic compression of music does not impair the dereverberation performance.
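The cascade idea described above can be sketched as follows: a cheap detector decides the clear cases, and only ambiguous inputs are passed on to a more expensive stage. The thesis sets the cascade parameters with a training algorithm; in this sketch the thresholds, costs, and the two toy scoring functions are hand-picked assumptions, and the names `Stage` and `cascade_classify` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Stage:
    score: Callable[[Sequence[float]], float]  # returns a score in [0, 1]
    reject_below: float   # scores at or below this are classified negative
    accept_above: float   # scores at or above this are classified positive
    cost: float           # relative computational cost of running the stage

def cascade_classify(x: Sequence[float], stages: List[Stage]) -> Tuple[bool, float]:
    """Return (decision, total cost), stopping at the first confident stage."""
    total_cost = 0.0
    s = 0.5
    for stage in stages:
        total_cost += stage.cost
        s = stage.score(x)
        if s <= stage.reject_below:
            return False, total_cost
        if s >= stage.accept_above:
            return True, total_cost
    # No stage was confident: fall back to the last score.
    return s >= 0.5, total_cost

# Toy stages: a cheap detector that looks at one feature only, and an
# expensive one that averages all features and always decides.
cheap = Stage(score=lambda x: min(max(x[0], 0.0), 1.0),
              reject_below=0.2, accept_above=0.8, cost=1.0)
expensive = Stage(score=lambda x: sum(x) / len(x),
                  reject_below=0.5, accept_above=0.5, cost=10.0)

easy = cascade_classify([0.9, 0.1, 0.1], [cheap, expensive])  # decided by cheap stage
hard = cascade_classify([0.5, 0.9, 0.9], [cheap, expensive])  # needs the expensive stage
```

The saving comes from the fact that the expensive stage only runs on inputs the cheap stage cannot decide, so the average cost per input depends on how many inputs are ambiguous.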