Training Sound Event Detection On A Heterogeneous Dataset
Training a sound event detection algorithm on a heterogeneous dataset,
including both recorded and synthetic soundscapes with varying labeling
granularity, is a non-trivial task that requires several technical choices.
These choices are often passed from one system to another without being
questioned. We perform a detailed analysis of the DCASE 2020 task 4 sound
event detection baseline with regard to several aspects, such as the type of
data used for training, the parameters of the mean-teacher algorithm, and the
transformations applied while generating the synthetic soundscapes. Some of
the parameters that are usually used as defaults are shown to be sub-optimal.
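As context for the mean-teacher parameters mentioned above, here is a minimal sketch of a mean-teacher training step (illustrative PyTorch code with assumed model, optimizer, and batch names; the actual baseline uses its own architecture and loss weighting schedule):

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.999):
    """Update teacher weights as an exponential moving average of the student's."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def mean_teacher_step(student, teacher, optimizer, x_lab, y_lab, x_unlab,
                      consistency_weight=2.0):
    """One training step: supervised BCE on labelled clips plus an MSE
    consistency term between student and teacher predictions on unlabelled clips."""
    sup_loss = F.binary_cross_entropy_with_logits(student(x_lab), y_lab)
    with torch.no_grad():
        teacher_probs = torch.sigmoid(teacher(x_unlab))
    student_probs = torch.sigmoid(student(x_unlab))
    cons_loss = F.mse_loss(student_probs, teacher_probs)

    loss = sup_loss + consistency_weight * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)   # the teacher slowly tracks the student
    return loss.item()
```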
Posterior sampling algorithms for unsupervised speech enhancement with recurrent variational autoencoder
In this paper, we address the unsupervised speech enhancement problem based
on recurrent variational autoencoder (RVAE). This approach offers promising
generalization performance over the supervised counterpart. Nevertheless, the
involved iterative variational expectation-maximization (VEM) process at test
time, which relies on a variational inference method, results in high
computational complexity. To tackle this issue, we present efficient sampling
techniques based on Langevin dynamics and Metropolis-Hastings algorithms,
adapted to the EM-based speech enhancement with RVAE. By directly sampling from
the intractable posterior distribution within the EM process, we circumvent the
intricacies of variational inference. We conduct a series of experiments,
comparing the proposed methods with VEM and a state-of-the-art supervised
speech enhancement approach based on diffusion models. The results reveal that
our sampling-based algorithms significantly outperform VEM, not only in terms
of computational efficiency but also in overall performance. Furthermore, when
compared to the supervised baseline, our methods showcase robust generalization
performance in mismatched test conditions.
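To illustrate the family of samplers involved, the following is a toy random-walk Metropolis-Hastings sampler targeting an unnormalized log-posterior; the function name and the Gaussian toy target are assumptions for illustration and do not reproduce the paper's RVAE posterior or its Langevin variant:

```python
import numpy as np

def metropolis_hastings(log_post, z0, n_steps=500, step_size=0.1, rng=None):
    """Random-walk Metropolis-Hastings sampler targeting an unnormalized
    log-posterior log_post(z).  Returns the chain of retained samples."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z0, dtype=float)
    logp = log_post(z)
    chain = [z.copy()]
    for _ in range(n_steps):
        proposal = z + step_size * rng.standard_normal(z.shape)
        logp_prop = log_post(proposal)
        # Accept with probability min(1, p(proposal) / p(z)).
        if np.log(rng.uniform()) < logp_prop - logp:
            z, logp = proposal, logp_prop
        chain.append(z.copy())
    return np.stack(chain)

# Toy usage: a 2-D Gaussian stands in for the intractable latent posterior.
samples = metropolis_hastings(lambda z: -0.5 * np.sum(z ** 2), z0=np.zeros(2))
print(samples.mean(axis=0))
```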
Fast and efficient speech enhancement with variational autoencoders
Unsupervised speech enhancement based on variational autoencoders has shown
promising performance compared with the commonly used supervised methods. This
approach involves the use of a pre-trained deep speech prior along with a
parametric noise model, where the noise parameters are learned from the noisy
speech signal with an expectation-maximization (EM)-based method. The E-step
involves an intractable latent posterior distribution. Existing algorithms to
solve this step are based either on computationally heavy Markov chain Monte
Carlo (MCMC) sampling and variational inference, or on inefficient
optimization-based methods. In this paper, we propose a new approach based on
Langevin dynamics that generates multiple sequences of samples and comes with a
total variation-based regularization to incorporate temporal correlations of
latent vectors. Our experiments demonstrate that the developed framework makes
an effective compromise between computational efficiency and enhancement
quality, and outperforms existing methods.
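A rough sketch of the kind of update involved: unadjusted Langevin dynamics over a sequence of latent vectors with a total-variation penalty encouraging temporal smoothness (PyTorch autograd toy; the paper's exact objective, multi-sequence scheme, and hyperparameters are not reproduced):

```python
import torch

def langevin_tv(log_post, z_init, n_steps=100, step_size=1e-2, tv_weight=0.1):
    """Unadjusted Langevin dynamics over a sequence of latent vectors z of shape
    (T, D), with a total-variation penalty on consecutive frames so that sampled
    latents stay temporally correlated."""
    z = z_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        tv = torch.sum(torch.abs(z[1:] - z[:-1]))      # total variation along time
        objective = log_post(z) - tv_weight * tv       # regularized log-density
        grad, = torch.autograd.grad(objective, z)
        with torch.no_grad():
            z = z + step_size * grad + (2.0 * step_size) ** 0.5 * torch.randn_like(z)
        z.requires_grad_(True)
    return z.detach()

# Toy usage: a standard Gaussian stands in for the VAE-based posterior log-density.
z0 = torch.zeros(50, 16)                               # T = 50 frames, D = 16 latent dims
samples = langevin_tv(lambda z: -0.5 * (z ** 2).sum(), z0)
```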
Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments
Multichannel linear filters, such as the Multichannel Wiener Filter (MWF) and
the Generalized Eigenvalue (GEV) beamformer, are popular signal processing
techniques which can improve speech recognition performance. In this paper, we
present an experimental study on these linear filters in a specific speech
recognition task, namely the CHiME-4 challenge, which features real recordings
in multiple noisy environments. Specifically, the rank-1 MWF is employed for
noise reduction and a new constant residual noise power constraint is derived
which enhances the recognition performance. To fulfill the underlying rank-1
assumption, the speech covariance matrix is reconstructed based on eigenvectors
or generalized eigenvectors. Then the rank-1 constrained MWF is evaluated with
alternative multichannel linear filters under the same framework, which
involves a Bidirectional Long Short-Term Memory (BLSTM) network for mask
estimation. The proposed filter outperforms alternative ones, leading to a 40%
relative Word Error Rate (WER) reduction compared with the baseline Weighted
Delay and Sum (WDAS) beamformer on the real test set, and a 15% relative WER
reduction compared with the GEV-BAN method. The results also suggest that the
speech recognition accuracy correlates more with the Mel-frequency cepstral
coefficients (MFCC) feature variance than with the noise reduction or the
speech distortion level.
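A rough numpy sketch of a rank-1 MWF for a single frequency bin, where the speech covariance is first reduced to its principal eigen-component (illustrative only; the constant residual-noise-power constraint and the generalized-eigenvector reconstruction studied in the paper are not reproduced):

```python
import numpy as np

def rank1_mwf(phi_s, phi_n, mu=1.0, ref_mic=0):
    """Rank-1 multichannel Wiener filter for one frequency bin.

    phi_s, phi_n : (M, M) speech and noise spatial covariance matrices.
    mu           : trade-off between noise reduction and speech distortion.
    Returns an (M,) filter w; the enhanced reference-channel STFT is w^H x.
    """
    # Enforce the rank-1 assumption: keep only the principal eigen-component of phi_s.
    eigvals, eigvecs = np.linalg.eigh(phi_s)
    a = eigvecs[:, -1]                                  # principal eigenvector
    phi_s_r1 = eigvals[-1] * np.outer(a, a.conj())      # rank-1 speech covariance

    # Speech-distortion-weighted MWF, reference-microphone column.
    return np.linalg.solve(phi_s_r1 + mu * phi_n, phi_s_r1[:, ref_mic])

# Toy usage with random Hermitian covariances for M = 4 microphones.
rng = np.random.default_rng(0)
n = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
phi_n = n @ n.conj().T / 8 + np.eye(4)                  # well-conditioned noise covariance
a_true = rng.standard_normal(4) + 1j * rng.standard_normal(4)
phi_s = np.outer(a_true, a_true.conj())                 # already rank-1 speech covariance
w = rank1_mwf(phi_s, phi_n)
```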
Foreground-Background Ambient Sound Scene Separation
Ambient sound scenes typically comprise multiple short events occurring on
top of a somewhat stationary background. We consider the task of separating
these events from the background, which we call foreground-background ambient
sound scene separation. We propose a deep learning-based separation framework
with a suitable feature normalization scheme and an optional auxiliary network
capturing the background statistics, and we investigate its ability to handle
the great variety of sound classes encountered in ambient sound scenes, which
have often not been seen in training. To do so, we create single-channel
foreground-background mixtures using isolated sounds from the DESED and
Audioset datasets, and we conduct extensive experiments with mixtures of seen
or unseen sound classes at various signal-to-noise ratios. Our experimental
findings demonstrate the generalization ability of the proposed approach.
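As an aside on how such mixtures can be generated, here is a minimal sketch of scaling a foreground event so that the mixture reaches a target signal-to-noise ratio (the function and signals are hypothetical; the actual DESED/AudioSet mixing protocol may differ):

```python
import numpy as np

def mix_at_snr(foreground, background, snr_db):
    """Scale the foreground so the foreground-to-background ratio equals snr_db,
    then return the single-channel mixture (both signals assumed same length)."""
    fg_power = np.mean(foreground ** 2) + 1e-12
    bg_power = np.mean(background ** 2) + 1e-12
    gain = np.sqrt(bg_power / fg_power * 10 ** (snr_db / 10.0))
    return gain * foreground + background

# Toy usage: a short synthetic event over a white-noise background at 0 dB SNR.
rng = np.random.default_rng(0)
fg = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s tone as the "event"
bg = 0.1 * rng.standard_normal(16000)                      # stationary background
mixture = mix_at_snr(fg, bg, snr_db=0.0)
```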
Contributions to speech processing and ambient sound analysis
We are constantly surrounded by sounds that we continuously exploit to adapt
our actions to the situations we are facing. Some sounds, like speech, have a
particular structure from which we can infer information, explicit or not.
This is one reason why speech is possibly the most intuitive way for humans to
communicate. Within the last decade, there has been significant progress in
the domain of speech and audio processing, and in particular in the domain of
machine learning applied to speech and audio processing. Thanks to this
progress, speech has become a central element in many human-to-human distant
communication tools as well as in human-to-machine communication systems.
These solutions work well on clean speech or under controlled conditions.
However, in scenarios that involve acoustic perturbations such as noise or
reverberation, system performance tends to degrade severely. In this thesis we
focus on processing speech and its environment from an audio perspective. The
algorithms proposed here rely on a variety of solutions, from signal
processing based approaches to data-driven solutions based on supervised
matrix factorization or deep neural networks. We propose solutions to problems
ranging from speech recognition to speech enhancement and ambient sound
analysis. The aim is to offer a panorama of the different aspects that could
improve a speech processing algorithm operating in real environments. We start
by describing automatic speech recognition as a potential end application and
progressively unravel the limitations and the proposed solutions, ending up
with the more general problem of ambient sound analysis.
Regularized Contrastive Pre-training for Few-shot Bioacoustic Sound Detection
Bioacoustic sound event detection allows for a better understanding of animal
behavior and for better biodiversity monitoring using audio. Deep learning
systems can help achieve this goal; however, it is difficult to acquire
sufficient annotated data to train these systems from scratch. To address this
limitation, the Detection and Classification of Acoustic Scenes and Events
(DCASE) community has recast the problem within the framework of few-shot
learning and organizes an annual challenge for learning to detect animal
sounds from only five annotated examples. In this work, we regularize
supervised contrastive pre-training to learn features that transfer well to
new target tasks with animal sounds unseen during training, achieving a high
F-score of 61.52% (0.48) when no feature adaptation is applied, and an F-score
of 68.19% (0.75) when we further adapt the learned features for each new
target task. This work aims to lower the entry bar to few-shot bioacoustic
sound event detection by proposing a simple yet effective framework for this
task and by providing open-source code.
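For reference, a minimal PyTorch sketch of the plain supervised contrastive objective that such pre-training builds on (the regularization proposed in the paper is not included, and the batch construction is a simplification):

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: for each anchor, pull together embeddings
    sharing its label and push away the others.

    embeddings : (N, D) feature vectors (L2-normalized inside the function).
    labels     : (N,)   integer class labels.
    """
    z = F.normalize(embeddings, dim=1)
    sim = (z @ z.T) / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-similarities

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    # Average log-probability over each anchor's positives.
    loss_per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss_per_anchor[pos_mask.any(dim=1)].mean()

# Toy usage: 8 random embeddings from 2 classes.
emb = torch.randn(8, 32, requires_grad=True)
lab = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = supcon_loss(emb, lab)
```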
Unsupervised speech enhancement with diffusion-based generative models
Recently, conditional score-based diffusion models have gained significant
attention in the field of supervised speech enhancement, yielding
state-of-the-art performance. However, these methods may face challenges when
generalising to unseen conditions. To address this issue, we introduce an
alternative approach that operates in an unsupervised manner, leveraging the
generative power of diffusion models. Specifically, in a training phase, a
clean speech prior distribution is learnt in the short-time Fourier transform
(STFT) domain using score-based diffusion models, allowing it to
unconditionally generate clean speech from Gaussian noise. Then, we develop a
posterior sampling methodology for speech enhancement by combining the learnt
clean speech prior with a noise model for speech signal inference. The noise
parameters are simultaneously learnt along with clean speech estimation through
an iterative expectation-maximisation (EM) approach. To the best of our
knowledge, this is the first work exploring diffusion-based generative models
for unsupervised speech enhancement, demonstrating promising results compared
to a recent variational auto-encoder (VAE)-based unsupervised approach and a
state-of-the-art diffusion-based supervised method. It thus opens a new
direction for future research in unsupervised speech enhancement.
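As a very rough illustration of how an unconditional prior score can be combined with a separate noise model during inference, here is a toy posterior-guided Langevin step (the names, the Gaussian noise model, and the update rule are assumptions for illustration; the paper's sampler and the EM update of the noise parameters are not reproduced):

```python
import torch

def posterior_guided_step(x, t, score_model, log_likelihood, step_size=1e-3):
    """One Langevin-style update that combines an unconditional clean-speech
    score with the gradient of a noise-model log-likelihood of the noisy
    observation (a toy illustration of posterior sampling)."""
    x = x.detach().requires_grad_(True)
    grad_ll, = torch.autograd.grad(log_likelihood(x), x)   # data-fit gradient
    with torch.no_grad():
        score = score_model(x, t)                          # prior score ~ grad_x log p_t(x)
        x_next = (x + step_size * (score + grad_ll)
                  + (2.0 * step_size) ** 0.5 * torch.randn_like(x))
    return x_next

# Toy usage: a standard-Gaussian prior score and a Gaussian noise model on y = x + n.
y = torch.randn(1, 257)                                    # stand-in for a noisy STFT frame
score_model = lambda x, t: -x
log_likelihood = lambda x: -0.5 * ((y - x) ** 2).sum()
x = torch.randn_like(y)
for t in reversed(range(100)):
    x = posterior_guided_step(x, t, score_model, log_likelihood)
```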
Pretraining Representations for Bioacoustic Few-shot Detection using Supervised Contrastive Learning
Deep learning has been widely used recently for sound event detection and
classification. Its success is linked to the availability of sufficiently large
datasets, possibly with corresponding annotations when supervised learning is
considered. In bioacoustic applications, most tasks come with little labelled
training data, because annotating long recordings is time consuming and costly.
Therefore, supervised learning is not the best-suited approach to solve
bioacoustic tasks. The bioacoustic community recast the problem of sound
event detection within the framework of few-shot learning, i.e., training a
system with only a few labelled examples. The few-shot bioacoustic sound event
detection task in the DCASE challenge focuses on detecting events in long audio
recordings given only five annotated examples for each class of interest. In
this paper, we show that learning a rich feature extractor from scratch can be
achieved by leveraging data augmentation using a supervised contrastive
learning framework. We highlight the ability of this framework to transfer well
for five-shot event detection on previously unseen classes in the training
data. We obtain an F-score of 63.46% on the validation set and 42.7% on the
test set, ranking second in the DCASE challenge. We provide an ablation study
for the critical choices of data augmentation techniques as well as for the
learning strategy applied to the training set.
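Few-shot evaluation in this challenge is often handled with a prototypical scheme on top of the learned features; the sketch below shows nearest-prototype scoring from five positive examples (an illustrative assumption, not necessarily the exact protocol used for the reported scores):

```python
import numpy as np

def five_shot_prototype_detection(pos_embeddings, neg_embeddings, query_embeddings):
    """Nearest-prototype detection: build a positive prototype from the five
    annotated events and a negative prototype from background segments, then
    score each query frame by its distance to the two prototypes.

    pos_embeddings  : (5, D)  embeddings of the five annotated examples.
    neg_embeddings  : (K, D)  embeddings of background segments.
    query_embeddings: (T, D)  embeddings of the frames to be scored.
    Returns a (T,) array of event probabilities.
    """
    pos_proto = pos_embeddings.mean(axis=0)
    neg_proto = neg_embeddings.mean(axis=0)
    d_pos = np.linalg.norm(query_embeddings - pos_proto, axis=1)
    d_neg = np.linalg.norm(query_embeddings - neg_proto, axis=1)
    # Softmax over negative distances gives a probability for the positive class.
    logits = np.stack([-d_pos, -d_neg], axis=1)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 0]

# Toy usage with random embeddings (D = 64).
rng = np.random.default_rng(0)
scores = five_shot_prototype_detection(rng.standard_normal((5, 64)),
                                        rng.standard_normal((20, 64)),
                                        rng.standard_normal((100, 64)))
```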