1,715 research outputs found

    A Deep Generative Model of Speech Complex Spectrograms

    Full text link
    This paper proposes an approach to the joint modeling of the short-time Fourier transform magnitude and phase spectrograms with a deep generative model. We assume that the magnitude follows a Gaussian distribution and the phase follows a von Mises distribution. To improve the consistency of the phase values in the time-frequency domain, we also apply the von Mises distribution to the phase derivatives, i.e., the group delay and the instantaneous frequency. Based on these assumptions, we explore and compare several combinations of loss functions for training our models. Built upon the variational autoencoder framework, our model consists of three convolutional neural networks acting as an encoder, a magnitude decoder, and a phase decoder. In addition to the latent variables, we propose to also condition the phase estimation on the estimated magnitude. Evaluated for a time-domain speech reconstruction task, our models could generate speech with a high perceptual quality and a high intelligibility

    Deep neural network techniques for monaural speech enhancement: state of the art analysis

    Full text link
    Deep neural networks (DNN) techniques have become pervasive in domains such as natural language processing and computer vision. They have achieved great success in these domains in task such as machine translation and image generation. Due to their success, these data driven techniques have been applied in audio domain. More specifically, DNN models have been applied in speech enhancement domain to achieve denosing, dereverberation and multi-speaker separation in monaural speech enhancement. In this paper, we review some dominant DNN techniques being employed to achieve speech separation. The review looks at the whole pipeline of speech enhancement from feature extraction, how DNN based tools are modelling both global and local features of speech and model training (supervised and unsupervised). We also review the use of speech-enhancement pre-trained models to boost speech enhancement process. The review is geared towards covering the dominant trends with regards to DNN application in speech enhancement in speech obtained via a single speaker.Comment: conferenc

    A Sparsity-promoting Dictionary Model for Variational Autoencoders

    Get PDF
    International audienceStructuring the latent space in probabilistic deep generative models, e.g., variational autoencoders (VAEs), is important to yield more expressive models and interpretable representations, and to avoid overfitting. One way to achieve this objective is to impose a sparsity constraint on the latent variables, e.g., via a Laplace prior. However, such approaches usually complicate the training phase, and they sacrifice the reconstruction quality to promote sparsity. In this paper, we propose a simple yet effective methodology to structure the latent space via a sparsity-promoting dictionary model, which assumes that each latent code can be written as a sparse linear combination of a dictionary's columns. In particular, we leverage a computationally efficient and tuning-free method, which relies on a zeromean Gaussian latent prior with learnable variances. We derive a variational inference scheme to train the model. Experiments on speech generative modeling demonstrate the advantage of the proposed approach over competing techniques, since it promotes sparsity while not deteriorating the output speech quality

    Single-Channel Signal Separation Using Spectral Basis Correlation with Sparse Nonnegative Tensor Factorization

    Get PDF
    A novel approach for solving the single-channel signal separation is presented the proposed sparse nonnegative tensor factorization under the framework of maximum a posteriori probability and adaptively fine-tuned using the hierarchical Bayesian approach with a new mixing mixture model. The mixing mixture is an analogy of a stereo signal concept given by one real and the other virtual microphones. An “imitated-stereo” mixture model is thus developed by weighting and time-shifting the original single-channel mixture. This leads to an artificial mixing system of dual channels which gives rise to a new form of spectral basis correlation diversity of the sources. Underlying all factorization algorithms is the principal difficulty in estimating the adequate number of latent components for each signal. This paper addresses these issues by developing a framework for pruning unnecessary components and incorporating a modified multivariate rectified Gaussian prior information into the spectral basis features. The parameters of the imitated-stereo model are estimated via the proposed sparse nonnegative tensor factorization with Itakura–Saito divergence. In addition, the separability conditions of the proposed mixture model are derived and demonstrated that the proposed method can separate real-time captured mixtures. Experimental testing on real audio sources has been conducted to verify the capability of the proposed method

    Recent Progress in Image Deblurring

    Full text link
    This paper comprehensively reviews the recent development of image deblurring, including non-blind/blind, spatially invariant/variant deblurring techniques. Indeed, these techniques share the same objective of inferring a latent sharp image from one or several corresponding blurry images, while the blind deblurring techniques are also required to derive an accurate blur kernel. Considering the critical role of image restoration in modern imaging systems to provide high-quality images under complex environments such as motion, undesirable lighting conditions, and imperfect system components, image deblurring has attracted growing attention in recent years. From the viewpoint of how to handle the ill-posedness which is a crucial issue in deblurring tasks, existing methods can be grouped into five categories: Bayesian inference framework, variational methods, sparse representation-based methods, homography-based modeling, and region-based methods. In spite of achieving a certain level of development, image deblurring, especially the blind case, is limited in its success by complex application conditions which make the blur kernel hard to obtain and be spatially variant. We provide a holistic understanding and deep insight into image deblurring in this review. An analysis of the empirical evidence for representative methods, practical issues, as well as a discussion of promising future directions are also presented.Comment: 53 pages, 17 figure
    • …
    corecore