1,784 research outputs found
A Deep Generative Model of Speech Complex Spectrograms
This paper proposes an approach to the joint modeling of the short-time
Fourier transform magnitude and phase spectrograms with a deep generative
model. We assume that the magnitude follows a Gaussian distribution and the
phase follows a von Mises distribution. To improve the consistency of the phase
values in the time-frequency domain, we also apply the von Mises distribution
to the phase derivatives, i.e., the group delay and the instantaneous
frequency. Based on these assumptions, we explore and compare several
combinations of loss functions for training our models. Built upon the
variational autoencoder framework, our model consists of three convolutional
neural networks acting as an encoder, a magnitude decoder, and a phase decoder.
In addition to the latent variables, we propose to also condition the phase
estimation on the estimated magnitude. Evaluated for a time-domain speech
reconstruction task, our models could generate speech with a high perceptual
quality and a high intelligibility
Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural networks (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision. They have achieved great
success in these domains in task such as machine translation and image
generation. Due to their success, these data driven techniques have been
applied in audio domain. More specifically, DNN models have been applied in
speech enhancement domain to achieve denosing, dereverberation and
multi-speaker separation in monaural speech enhancement. In this paper, we
review some dominant DNN techniques being employed to achieve speech
separation. The review looks at the whole pipeline of speech enhancement from
feature extraction, how DNN based tools are modelling both global and local
features of speech and model training (supervised and unsupervised). We also
review the use of speech-enhancement pre-trained models to boost speech
enhancement process. The review is geared towards covering the dominant trends
with regards to DNN application in speech enhancement in speech obtained via a
single speaker.Comment: conferenc
A Sparsity-promoting Dictionary Model for Variational Autoencoders
International audienceStructuring the latent space in probabilistic deep generative models, e.g., variational autoencoders (VAEs), is important to yield more expressive models and interpretable representations, and to avoid overfitting. One way to achieve this objective is to impose a sparsity constraint on the latent variables, e.g., via a Laplace prior. However, such approaches usually complicate the training phase, and they sacrifice the reconstruction quality to promote sparsity. In this paper, we propose a simple yet effective methodology to structure the latent space via a sparsity-promoting dictionary model, which assumes that each latent code can be written as a sparse linear combination of a dictionary's columns. In particular, we leverage a computationally efficient and tuning-free method, which relies on a zeromean Gaussian latent prior with learnable variances. We derive a variational inference scheme to train the model. Experiments on speech generative modeling demonstrate the advantage of the proposed approach over competing techniques, since it promotes sparsity while not deteriorating the output speech quality
Single-Channel Signal Separation Using Spectral Basis Correlation with Sparse Nonnegative Tensor Factorization
A novel approach for solving the single-channel signal separation is presented the proposed sparse nonnegative tensor factorization under the framework of maximum a posteriori probability and adaptively fine-tuned using the hierarchical Bayesian approach with a new mixing mixture model. The mixing mixture is an analogy of a stereo signal concept given by one real and the other virtual microphones. An “imitated-stereo” mixture model is thus developed by weighting and time-shifting the original single-channel mixture. This leads to an artificial mixing system of dual channels which gives rise to a new form of spectral basis correlation diversity of the sources. Underlying all factorization algorithms is the principal difficulty in estimating the adequate number of latent components for each signal. This paper addresses these issues by developing a framework for pruning unnecessary components and incorporating a modified multivariate rectified Gaussian prior information into the spectral basis features. The parameters of the imitated-stereo model are estimated via the proposed sparse nonnegative tensor factorization with Itakura–Saito divergence. In addition, the separability conditions of the proposed mixture model are derived and demonstrated that the proposed method can separate real-time captured mixtures. Experimental testing on real audio sources has been conducted to verify the capability of the proposed method
Recent Progress in Image Deblurring
This paper comprehensively reviews the recent development of image
deblurring, including non-blind/blind, spatially invariant/variant deblurring
techniques. Indeed, these techniques share the same objective of inferring a
latent sharp image from one or several corresponding blurry images, while the
blind deblurring techniques are also required to derive an accurate blur
kernel. Considering the critical role of image restoration in modern imaging
systems to provide high-quality images under complex environments such as
motion, undesirable lighting conditions, and imperfect system components, image
deblurring has attracted growing attention in recent years. From the viewpoint
of how to handle the ill-posedness which is a crucial issue in deblurring
tasks, existing methods can be grouped into five categories: Bayesian inference
framework, variational methods, sparse representation-based methods,
homography-based modeling, and region-based methods. In spite of achieving a
certain level of development, image deblurring, especially the blind case, is
limited in its success by complex application conditions which make the blur
kernel hard to obtain and be spatially variant. We provide a holistic
understanding and deep insight into image deblurring in this review. An
analysis of the empirical evidence for representative methods, practical
issues, as well as a discussion of promising future directions are also
presented.Comment: 53 pages, 17 figure
- …