Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization
In this paper, we address speaker-independent multichannel speech enhancement
in unknown noisy environments. Our work is based on a well-established
multichannel local Gaussian modeling framework. We propose to use a neural
network for modeling the speech spectro-temporal content. The parameters of
this supervised model are learned using the framework of variational
autoencoders. The noisy recording environment is assumed to be unknown, so the
noise spectro-temporal modeling remains unsupervised and is based on
non-negative matrix factorization (NMF). We develop a Monte Carlo
expectation-maximization algorithm and we experimentally show that the proposed
approach outperforms its NMF-based counterpart, where speech is modeled using
supervised NMF.
Comment: 5 pages, 2 figures, audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae
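As a rough sketch of this kind of model (our notation; the exact parameterization used in the paper may differ), the multichannel local Gaussian framework with a VAE speech model and an NMF noise model can be written as
$$
\mathbf{x}_{fn} = \mathbf{s}_{fn} + \mathbf{b}_{fn}, \qquad
\mathbf{s}_{fn} \sim \mathcal{N}_c\big(\mathbf{0},\, v_{s,fn}\,\mathbf{R}_{s,f}\big), \qquad
\mathbf{b}_{fn} \sim \mathcal{N}_c\big(\mathbf{0},\, v_{b,fn}\,\mathbf{R}_{b,f}\big),
$$
where $f$ and $n$ index frequency and time, $\mathbf{R}_{s,f}$ and $\mathbf{R}_{b,f}$ are spatial covariance matrices, the speech variance $v_{s,fn} = \sigma^2_f(\mathbf{z}_n)$ is produced by the VAE decoder from a latent vector $\mathbf{z}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and the noise variance $v_{b,fn} = (\mathbf{W}\mathbf{H})_{fn}$ comes from the unsupervised NMF model. A Monte Carlo EM algorithm then alternates between sampling the latent vectors $\mathbf{z}_n$ (E-step) and updating $\mathbf{W}$, $\mathbf{H}$, and the spatial parameters (M-step).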
A vector quantized masked autoencoder for speech emotion recognition
Recent years have seen remarkable progress in speech emotion recognition
(SER), thanks to advances in deep learning techniques. However, the limited
availability of labeled data remains a significant challenge in the field.
Self-supervised learning has recently emerged as a promising solution to
address this challenge. In this paper, we propose the vector quantized masked
autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned
to recognize emotions from speech signals. The VQ-MAE-S model is based on a
masked autoencoder (MAE) that operates in the discrete latent space of a
vector-quantized variational autoencoder. Experimental results show that the
proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on
emotional speech data, outperforms an MAE working on the raw spectrogram
representation and other state-of-the-art methods in SER.
Comment: https://samsad35.github.io/VQ-MAE-Speech
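For illustration only, here is a minimal sketch of the token-masking step that such a discrete masked autoencoder relies on; the function name, mask ratio, and masking scheme below are our own assumptions, not necessarily the paper's.

```python
# Hypothetical sketch: randomly mask discrete VQ-VAE token indices before feeding
# them to a masked-autoencoder encoder/decoder (not the paper's exact pipeline).
import torch

def mask_tokens(tokens: torch.Tensor, mask_ratio: float = 0.5, mask_id: int = 0):
    """tokens: (batch, seq_len) integer indices from a pre-trained VQ-VAE codebook."""
    batch, seq_len = tokens.shape
    num_masked = int(mask_ratio * seq_len)
    # Randomly choose which positions to mask in each sequence.
    ids = torch.rand(batch, seq_len).argsort(dim=1)
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    mask[torch.arange(batch).unsqueeze(1), ids[:, :num_masked]] = True
    # Replace masked positions with a dedicated mask token; the MAE is then
    # trained to recover the original discrete tokens at these positions.
    return tokens.masked_fill(mask, mask_id), mask
```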
Motion-DVAE: Unsupervised learning for fast human motion denoising
Pose and motion priors are crucial for recovering realistic and accurate
human motion from noisy observations. Substantial progress has been made on
pose and shape estimation from images, and recent works showed impressive
results using priors to refine frame-wise predictions. However, many motion
priors only model transitions between consecutive poses and are used in
time-consuming optimization procedures, which is problematic for many
applications requiring real-time motion capture. We introduce Motion-DVAE, a
motion prior that captures the short-term dependencies of human motion. As part of
the dynamical variational autoencoder (DVAE) family of models, Motion-DVAE
combines the generative capability of VAE models with the temporal modeling of
recurrent architectures. Together with Motion-DVAE, we introduce an
unsupervised, learned denoising method that unifies regression- and
optimization-based approaches in a single framework for real-time 3D human pose
estimation. Experiments show that the proposed approach reaches performance
competitive with state-of-the-art methods while being much faster.
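To illustrate the general DVAE idea (not the specific Motion-DVAE architecture, whose conditioning structure is defined in the paper), a minimal recurrent VAE step in PyTorch could look as follows; all layer sizes and the conditioning choices are our own assumptions.

```python
# Minimal sketch of one dynamical-VAE time step: an RNN carries temporal context,
# a latent z_t is sampled per frame, and the decoder reconstructs the observation.
import torch
import torch.nn as nn

class DVAEStep(nn.Module):
    def __init__(self, x_dim: int, z_dim: int, h_dim: int):
        super().__init__()
        self.rnn = nn.GRUCell(x_dim + z_dim, h_dim)      # temporal context
        self.enc = nn.Linear(h_dim + x_dim, 2 * z_dim)   # q(z_t | x_t, h_t)
        self.dec = nn.Linear(h_dim + z_dim, x_dim)       # p(x_t | z_t, h_t)

    def forward(self, x_t, z_prev, h_prev):
        h_t = self.rnn(torch.cat([x_t, z_prev], dim=-1), h_prev)
        mu, logvar = self.enc(torch.cat([h_t, x_t], dim=-1)).chunk(2, dim=-1)
        z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        x_hat = self.dec(torch.cat([h_t, z_t], dim=-1))
        return x_hat, z_t, h_t, (mu, logvar)
```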
Notes on the use of variational autoencoders for speech and audio spectrogram modeling
Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have recently been used in several papers for speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, very little theoretical support is given to justify the chosen data representation and decoder likelihood function, or the corresponding cost function used for training the VAE. Yet, a well-established theoretical statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling. This provides the latter with insights on the choice and interpretability of data representation and model parameterization.
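A standard instance of this framework (stated here in our notation) links the VAE decoder likelihood to the Itakura-Saito (IS) divergence commonly used in NMF: if the complex STFT coefficients are modeled as $x_{fn} \sim \mathcal{N}_c\big(0, \sigma^2_f(\mathbf{z}_n)\big)$, i.e. the power spectrogram follows an exponential distribution with mean $\sigma^2_f(\mathbf{z}_n)$, then the negative log-likelihood equals, up to terms that do not depend on the model parameters,
$$
d_{\mathrm{IS}}\big(|x_{fn}|^2,\, \sigma^2_f(\mathbf{z}_n)\big) = \frac{|x_{fn}|^2}{\sigma^2_f(\mathbf{z}_n)} - \log\frac{|x_{fn}|^2}{\sigma^2_f(\mathbf{z}_n)} - 1.
$$
Training the VAE with this reconstruction term therefore inherits the scale-invariance property that motivates IS-NMF for audio spectrograms.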
LatentForensics: Towards frugal deepfake detection in the StyleGAN latent space
The classification of forged videos has been a challenge for the past few
years. Deepfake classifiers can now reliably predict whether or not video
frames have been tampered with. However, their performance is tied to both the
dataset used for training and the analyst's computational power. We propose a
deepfake detection method that operates in the latent space of a
state-of-the-art generative adversarial network (GAN) trained on high-quality
face images. The proposed method leverages the structure of the latent space of
StyleGAN to learn a lightweight binary classification model. Experimental
results on standard datasets reveal that the proposed approach outperforms
other state-of-the-art deepfake classification methods, especially in contexts
where the data available to train the models is scarce, such as when a new
manipulation method is introduced. To the best of our knowledge, this is the
first study showing the interest of the latent space of StyleGAN for deepfake
classification. Combined with other recent studies on the interpretation and
manipulation of this latent space, we believe that the proposed approach can
further help in developing frugal deepfake classification methods based on
interpretable high-level properties of face images.
Comment: 7 pages, 3 figures, 5 tables
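As a hedged illustration of how lightweight such a classifier can be, the sketch below fits a linear classifier on precomputed latent codes; the file names, latent dimensionality, and classifier choice are our assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: a lightweight binary classifier over (hypothetical) StyleGAN
# W-space latent codes of face frames, obtained beforehand by GAN inversion.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

w_latents = np.load("w_latents.npy")   # hypothetical file: (num_frames, 512) codes
labels = np.load("labels.npy")         # hypothetical file: 1 = manipulated, 0 = genuine

w_train, w_test, y_train, y_test = train_test_split(
    w_latents, labels, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(w_train, y_train)
print("held-out accuracy:", clf.score(w_test, y_test))
```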
Audio-noise Power Spectral Density Estimation Using Long Short-term Memory
We propose a method using a long short-term memory (LSTM) network to estimate the noise power spectral density (PSD) of single-channel audio signals represented in the short-time Fourier transform (STFT) domain. An LSTM network common to all frequency bands is trained, which processes each frequency band individually by mapping the noisy STFT magnitude sequence to its corresponding noise PSD sequence. Unlike deep-learning-based speech enhancement methods that learn the full-band spectral structure of speech segments, the proposed method exploits the sub-band STFT magnitude evolution of noise with a long time dependency, in the spirit of the unsupervised noise estimators described in the literature. Speaker- and speech-independent experiments with different types of noise show that the proposed method outperforms the unsupervised estimators and generalizes well to noise types that are not present in the training set.
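A minimal sketch of the sub-band idea, assuming each frequency band is processed as an independent sequence by a single shared LSTM; the layer sizes and the log-domain output are our own choices, not necessarily the paper's.

```python
# Rough sketch: one LSTM shared across all frequency bands, applied to each band's
# noisy magnitude sequence independently to predict that band's noise PSD track.
import torch
import torch.nn as nn

class SubbandNoisePSD(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (freq_bins, num_frames) STFT magnitudes of the noisy signal.
        x = noisy_mag.unsqueeze(-1)          # each band becomes one sequence in the batch
        h, _ = self.lstm(x)                  # (freq_bins, num_frames, hidden)
        log_psd = self.out(h).squeeze(-1)    # per-band, per-frame log noise PSD
        return log_psd.exp()
```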
Alpha-Stable Multichannel Audio Source Separation
In this paper, we focus on modeling multichannel audio signals in the short-time Fourier transform domain for the purpose of source separation. We propose a probabilistic model based on a class of heavy-tailed distributions, in which the observed mixtures and the latent sources are jointly modeled using a certain class of multivariate alpha-stable distributions. As opposed to conventional Gaussian models, where the observations are constrained to lie within a few standard deviations of the mean, the proposed heavy-tailed model allows us to account for spurious data or important uncertainties in the model. We develop a Monte Carlo expectation-maximization algorithm for inference in the proposed model. We show that our approach leads to significant improvements in audio source separation under corrupted mixtures and in spatial audio object coding.
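One common way to make such heavy-tailed models tractable (written here in our notation, and only as a sketch of the general idea rather than the paper's exact model) is the Gaussian scale-mixture representation of symmetric $\alpha$-stable vectors:
$$
\mathbf{s}_{fn} \mid \phi_{fn} \sim \mathcal{N}_c\big(\mathbf{0},\, \phi_{fn}\,\mathbf{R}_{s,f}\big), \qquad \phi_{fn} \sim P^{+}_{\alpha/2},
$$
where $P^{+}_{\alpha/2}$ is a positive $\alpha/2$-stable distribution and $\mathbf{R}_{s,f}$ a spatial covariance matrix. Conditionally on the impulse variables $\phi_{fn}$, the model is Gaussian, which is what makes a Monte Carlo EM algorithm practical (sampling $\phi_{fn}$ in the E-step, updating the remaining parameters in the M-step); large realizations of $\phi_{fn}$ absorb spurious or highly uncertain observations.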
The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement
Supervised speech enhancement models are trained using artificially generated
mixtures of clean speech and noise signals, which may not match real-world
recording conditions at test time. This mismatch can lead to poor performance
if the test domain significantly differs from the synthetic training domain. In
this paper, we introduce the unsupervised domain adaptation for conversational
speech enhancement (UDASE) task of the 7th CHiME challenge. This task aims to
leverage real-world noisy speech recordings from the target test domain for
unsupervised domain adaptation of speech enhancement models. The target test
domain corresponds to the multi-speaker reverberant conversational speech
recordings of the CHiME-5 dataset, for which the ground-truth clean speech
reference is not available. Given a CHiME-5 recording, the task is to estimate
the clean, potentially multi-speaker, reverberant speech, removing the additive
background noise. We discuss the motivation for the CHiME-7 UDASE task and
describe the data, the task, and the baseline system.