Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization
In this paper we address speaker-independent multichannel speech enhancement
in unknown noisy environments. Our work is based on a well-established
multichannel local Gaussian modeling framework. We propose to use a neural
network for modeling the speech spectro-temporal content. The parameters of
this supervised model are learned using the framework of variational
autoencoders. The noisy recording environment is assumed to be unknown, so the
noise spectro-temporal modeling remains unsupervised and is based on
non-negative matrix factorization (NMF). We develop a Monte Carlo
expectation-maximization algorithm and experimentally show that the proposed
approach outperforms its NMF-based counterpart, in which speech is modeled using
supervised NMF.
Comment: 5 pages, 2 figures; audio examples and code available online at
https://team.inria.fr/perception/icassp-2019-mvae
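The unsupervised noise model in this abstract rests on NMF. As a minimal illustrative sketch only (not the authors' implementation, which embeds NMF inside a Monte Carlo EM loop together with a VAE speech prior), the classic multiplicative-update NMF for the KL divergence can be written in plain NumPy; the function name and parameter choices below are hypothetical:

```python
import numpy as np

def nmf(V, K, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative spectrogram V (F x N) into spectral
    templates W (F x K) and activations H (K x N) using the standard
    multiplicative updates for the KL divergence."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # update activations, then templates (both stay non-negative)
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H
```

In the paper's semi-supervised setting only the noise is modeled this way; the speech spectro-temporal content comes from the pre-trained VAE instead of a second NMF.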
Partially Adaptive Multichannel Joint Reduction of Ego-noise and Environmental Noise
Human-robot interaction relies on a noise-robust audio processing module
capable of estimating target speech from audio recordings impacted by
environmental noise, as well as self-induced noise, so-called ego-noise. While
external ambient noise sources vary from environment to environment, ego-noise
is mainly caused by the internal motors and joints of a robot. Ego-noise and
environmental noise reduction are often decoupled, i.e., ego-noise reduction is
performed without considering environmental noise. Recently, a variational
autoencoder (VAE)-based speech model has been combined with a fully adaptive
non-negative matrix factorization (NMF) noise model to recover clean speech
under different environmental noise disturbances. However, its enhancement
performance is limited in adverse acoustic scenarios involving, e.g., ego-noise.
In this paper, we propose a multichannel partially adaptive scheme to jointly
model ego-noise and environmental noise utilizing the VAE-NMF framework, where
we take advantage of spatially and spectrally structured characteristics of
ego-noise by pre-training the ego-noise model, while retaining the ability to
adapt to unknown environmental noise. Experimental results show that our
proposed approach outperforms the methods based on a completely fixed scheme
and a fully adaptive scheme when ego-noise and environmental noise are present
simultaneously.
Comment: Accepted to the 2023 IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP 2023).
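The "partially adaptive" idea above, applied to the NMF noise part, amounts to freezing pre-trained ego-noise templates while adapting only the environmental ones. As a hedged sketch of that single idea (the full method additionally involves a multichannel VAE speech model, which is omitted here; names and dictionary sizes are hypothetical):

```python
import numpy as np

def partially_adaptive_nmf(V, W_ego, K_env, n_iter=200, eps=1e-9, seed=0):
    """KL-NMF of V (F x N) where W_ego (F x K_ego) holds pre-trained,
    FIXED ego-noise templates; only K_env environmental templates and
    all activations H are updated."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    K_ego = W_ego.shape[1]
    W_env = rng.random((F, K_env)) + eps
    H = rng.random((K_ego + K_env, N)) + eps
    for _ in range(n_iter):
        W = np.hstack([W_ego, W_env])      # full dictionary
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = np.hstack([W_ego, W_env]) @ H + eps
        # multiplicative update restricted to the environmental columns;
        # the ego-noise templates are never modified
        H_env = H[K_ego:]
        W_env *= ((V / WH) @ H_env.T) / (np.ones_like(V) @ H_env.T + eps)
    return W_env, H
```

Freezing `W_ego` exploits the structured, motor-driven nature of ego-noise, while the free columns `W_env` retain the fully adaptive behavior needed for unknown environmental noise.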
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e., audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
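Among the feature representations the review names, the log-mel spectrum is the most widely used. As a self-contained, plain-NumPy illustration (not code from the article; the frame length, hop, and mel-band count below are arbitrary defaults), it is obtained by projecting the framed power spectrum onto a triangular mel filterbank and taking the logarithm:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, sr, n_fft=512, hop=256, n_mels=40):
    """Log-mel spectrogram of a mono signal x sampled at sr Hz:
    windowed STFT power spectrum -> mel filterbank -> log."""
    # frame the signal and apply a Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (T, n_fft//2 + 1)
    # build a triangular mel filterbank with edges equally spaced in mel
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l:
            fb[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return np.log(power @ fb.T + 1e-10)                # (T, n_mels)
```

The mel warping compresses high frequencies the way human hearing does, which is one reason the review finds log-mel features dominant across speech, music, and environmental-sound tasks alike.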
Innovating with Artificial Intelligence: Capturing the Constructive Functional Capabilities of Deep Generative Learning
As an emerging species of artificial intelligence, deep generative learning (DGL) models can generate an unprecedented variety of new outputs. Examples include the creation of music, text-to-image translation, and the imputation of missing data. Similar to other AI models that already evoke significant changes in society and the economy, there is a need to structure the constructive functional capabilities of DGL. To derive and discuss them, we conducted an extensive and structured literature review. Our results reveal a substantial scope of six constructive functional capabilities, demonstrating that DGL is not used exclusively to generate unseen outputs. Our paper further guides companies in capturing and evaluating DGL's potential for innovation. In addition, our paper fosters an understanding of DGL and provides a conceptual basis for further research.