Unsupervised speech enhancement based on variational autoencoders has shown
promising performance compared with the commonly used supervised methods. This
approach involves the use of a pre-trained deep speech prior along with a
parametric noise model, where the noise parameters are learned from the noisy
speech signal with an expectation-maximization (EM)-based method. The E-step
involves an intractable latent posterior distribution. Existing algorithms for
this step rely either on computationally heavy Markov chain Monte Carlo (MCMC)
sampling and variational inference, or on inefficient optimization-based
methods. In this paper, we propose a new approach based on
Langevin dynamics that generates multiple sequences of samples and comes with a
total variation-based regularization to incorporate temporal correlations of
latent vectors. Our experiments demonstrate that the developed framework
achieves an effective trade-off between computational efficiency and enhancement
quality, and outperforms existing methods.
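
To make the idea concrete, the following is a minimal sketch of Langevin
dynamics with a total variation (TV) penalty over the time axis, run as several
parallel chains. It is an illustration under stated assumptions, not the
implementation evaluated in this work: the function names, step size, and
regularization weight are hypothetical, and a toy Gaussian log-posterior stands
in for the VAE-based posterior of the latent vectors.

```python
import numpy as np

def tv_subgradient(z):
    """Subgradient of sum_t ||z[t+1] - z[t]||_1 for z of shape (T, D)."""
    s = np.sign(np.diff(z, axis=0))  # signs of temporal differences, (T-1, D)
    g = np.zeros_like(z)
    g[:-1] -= s  # each difference contributes -sign to the earlier frame
    g[1:] += s   # and +sign to the later frame
    return g

def langevin_tv(grad_log_post, z0, n_steps=200, step=1e-2, lam=0.1,
                n_chains=4, seed=0):
    """Run n_chains Langevin chains on a TV-regularized log-posterior.

    grad_log_post: callable mapping a (T, D) array to the gradient of the
    (assumed differentiable) log-posterior, same shape.
    Returns the final state of every chain, shape (n_chains, T, D).
    """
    rng = np.random.default_rng(seed)
    # Start each chain from a slightly perturbed copy of z0.
    z = z0[None] + 0.01 * rng.standard_normal((n_chains,) + z0.shape)
    for _ in range(n_steps):
        drift = np.stack(
            [grad_log_post(c) - lam * tv_subgradient(c) for c in z]
        )
        noise = rng.standard_normal(z.shape)
        z = z + 0.5 * step * drift + np.sqrt(step) * noise
    return z

# Toy stand-in posterior (assumption): N(mu, I) over T=5 frames, D=2 dims.
mu = np.zeros((5, 2))
samples = langevin_tv(lambda c: -(c - mu), mu, n_steps=50)
```

In this sketch the TV term pulls neighbouring latent vectors toward each other,
which is one simple way to encode temporal correlation; averaging over the
returned chains would give a Monte Carlo estimate for use in an E-step.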