
    A Perceptual Evaluation of Short-Time Fourier Transform Window Duration and Divergence Cost Function on Audio Source Separation using Non-negative Matrix Factorization

    Non-negative matrix factorization (NMF) is an established method of performing audio source separation. Previous studies used NMF with supplementary systems to improve performance, but little has been done to investigate the perceptual effects of NMF parameters. The present study aimed to evaluate two NMF parameters for speech enhancement: the short-time Fourier transform (STFT) window duration and the divergence cost function. Two experiments were conducted: the first investigated the effect of STFT window duration on target speech intelligibility in a sentence keyword identification task. The second experiment had participants rate residual noise levels present in target speech using three different cost functions: the Euclidean distance (EU), the Kullback-Leibler (KL) divergence, and the Itakura-Saito (IS) divergence. It was found that a 92.9 ms window duration produced the highest intelligibility scores, while the IS divergence produced significantly lower residual noise levels than the EU and KL divergences. Additionally, significant positive correlations were found between subjective residual noise scores and objective metrics from the Blind Source Separation Evaluation (BSS_Eval) and Perceptual Evaluation methods for Audio Source Separation (PEASS) toolboxes. Results suggest that longer window durations, with increased frequency resolution, allow more accurate distinction between sources, improving intelligibility scores. Additionally, the IS divergence more accurately approximates high-frequency and transient components of audio, increasing the separation of speech and noise. Correlation results suggest that using full-bandwidth stimuli could increase the reliability of objective measures.
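    The divergence comparison above can be made concrete. The following is a minimal sketch of NMF with multiplicative updates for the Itakura-Saito divergence, using numpy only; the rank, iteration count, and toy spectrogram are illustrative assumptions, not the study's settings.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def is_divergence(V, Vhat, eps=1e-10):
        """Itakura-Saito divergence D_IS(V || Vhat), summed over all bins."""
        R = (V + eps) / (Vhat + eps)
        return float(np.sum(R - np.log(R) - 1.0))

    def nmf_is(V, rank=4, n_iter=100, eps=1e-10):
        """NMF (V ~= W @ H) with multiplicative updates for the IS divergence."""
        F, T = V.shape
        W = rng.random((F, rank)) + eps
        H = rng.random((rank, T)) + eps
        losses = [is_divergence(V, W @ H)]
        for _ in range(n_iter):
            Vhat = W @ H
            H *= (W.T @ (V / (Vhat ** 2))) / (W.T @ (1.0 / Vhat) + eps)
            Vhat = W @ H
            W *= ((V / (Vhat ** 2)) @ H.T) / ((1.0 / Vhat) @ H.T + eps)
            losses.append(is_divergence(V, W @ H))
        return W, H, losses

    # Toy power spectrogram standing in for |STFT|^2 of a noisy speech mixture.
    V = rng.random((64, 50)) + 0.1
    W, H, losses = nmf_is(V)
    ```

    The scale-invariance of the IS divergence (it penalizes relative rather than absolute error) is one reason it is often preferred for power spectrograms, whose dynamic range spans strong low-frequency energy and weak transients alike.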

    Phase and audio signal reconstruction with non-quadratic cost functions

    Audio signal reconstruction consists of recovering sound signals from incomplete or degraded representations. This problem can be cast as an inverse problem. Such problems are frequently tackled with the help of optimization or machine learning strategies. In this thesis, we propose to change the cost function in inverse problems related to audio signal reconstruction. We mainly address the phase retrieval problem, which is common when manipulating audio spectrograms. A first line of work tackles the optimization of non-quadratic cost functions for phase retrieval. We study this problem in two contexts: audio signal reconstruction from a single spectrogram and source separation. We introduce a novel formulation of the problem with Bregman divergences, as well as algorithms for its resolution. A second line of work proposes to learn the cost function from a given dataset. This is done within the framework of unfolded neural networks, which are derived from iterative algorithms. We introduce a neural network based on the unfolding of the Alternating Direction Method of Multipliers, which includes learnable activation functions. We expose the relation between the learning of its parameters and the learning of the cost function for phase retrieval. We conduct numerical experiments for each of the proposed methods to evaluate their performance and their potential for audio signal reconstruction.
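    The standard quadratic-cost baseline that this line of work generalizes is the classic Griffin-Lim algorithm, which alternates projections between the measured magnitude and the set of consistent spectrograms. Below is a minimal sketch using SciPy's STFT; the window length, test tone, and iteration count are illustrative assumptions, not the thesis's experimental settings.

    ```python
    import numpy as np
    from scipy.signal import stft, istft

    def griffin_lim(mag, n_samples, n_iter=50, nperseg=256, seed=0):
        """Estimate a signal whose |STFT| matches `mag` by alternating projections."""
        rng = np.random.default_rng(seed)
        S = mag * np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
        for _ in range(n_iter):
            _, x = istft(S, nperseg=nperseg)       # project onto consistent spectrograms
            x = np.pad(x, (0, max(0, n_samples - len(x))))[:n_samples]
            _, _, S_est = stft(x, nperseg=nperseg)
            S = mag * np.exp(1j * np.angle(S_est))  # re-impose the measured magnitude
        _, x = istft(S, nperseg=nperseg)
        return np.pad(x, (0, max(0, n_samples - len(x))))[:n_samples]

    # Magnitude of a test tone; the phase is discarded and then re-estimated.
    n = 4096
    t = np.arange(n) / 16000.0
    x_true = np.sin(2 * np.pi * 440.0 * t)
    _, _, S_true = stft(x_true, nperseg=256)
    x_rec = griffin_lim(np.abs(S_true), n)
    ```

    Replacing the implicit quadratic distance in the magnitude-matching step with a Bregman divergence, as the thesis proposes, changes which spectrogram mismatches the iterations penalize most heavily.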

    Contributions to multivariate image processing

    This manuscript summarizes my teaching and research activity in support of obtaining the habilitation à diriger des recherches (French accreditation to supervise research).

    Task-specific and interpretable feature learning

    Deep learning models have had tremendous impacts in recent years, while a question has been raised by many: Is deep learning just a triumph of empiricism? There has been emerging interest in reducing the gap between the theoretical soundness and interpretability, and the empirical success of deep models. This dissertation provides a comprehensive discussion on bridging traditional model-based learning approaches that emphasize problem-specific reasoning, and deep models that allow for larger learning capacity. The overall goal is to devise the next-generation feature learning architectures that are: 1) task-specific, namely, optimizing the entire pipeline from end to end while taking advantage of available prior knowledge and domain expertise; and 2) interpretable, namely, being able to learn a representation consisting of semantically sensible variables, and to display predictable behaviors. This dissertation starts by showing how the classical sparse coding models could be improved in a task-specific way, by formulating the entire pipeline as bi-level optimization. Then, it mainly illustrates how to incorporate the structure of classical learning models, e.g., sparse coding, into the design of deep architectures. A few concrete model examples are presented, ranging from the ℓ0 and ℓ1 sparse approximation models, to the ℓ∞-constrained model and the dual-sparsity model. The analytic tools in the optimization problems can be translated to guide the architecture design and performance analysis of deep models. As a result, those customized deep models demonstrate improved performance, intuitive interpretation, and efficient parameter initialization. On the other hand, deep networks are shown to be analogous to brain mechanisms. They exhibit the ability to describe semantic content from the primitive level to the abstract level. 
This dissertation thus also presents a preliminary investigation of the synergy between feature learning and cognitive science and neuroscience. Two novel application domains, image aesthetics assessment and brain encoding, are explored, with promising preliminary results.
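The sparse-coding-to-deep-network correspondence this line of work builds on starts from the iterative shrinkage-thresholding algorithm (ISTA) for the ℓ1 model; unrolling its iterations with learned weights yields LISTA-style networks. A minimal numpy sketch of ISTA follows; the dictionary size, sparsity level, and step count are illustrative assumptions.

```python
import numpy as np

def ista(D, x, lam=0.1, n_iter=200):
    """ISTA for the lasso problem:  min_z  0.5*||x - D z||^2 + lam*||z||_1.
    Each iteration is a gradient step followed by soft-thresholding;
    unrolling these iterations as network layers gives LISTA-style models."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = z - (D.T @ (D @ z - x)) / L    # gradient step on the quadratic term
        z = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))         # overcomplete dictionary
z_true = np.zeros(128)
z_true[rng.choice(128, size=5, replace=False)] = rng.standard_normal(5)
x = D @ z_true                             # observation with a 5-sparse code
z_hat = ista(D, x)
```

The structural analogy is direct: the gradient step corresponds to a linear layer, and the soft-thresholding operator to a (possibly learnable) activation, which is what allows optimization analysis to inform architecture design.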