On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems
The performance of neural network-based speech enhancement systems is
primarily influenced by the model architecture, whereas training times and
computational resource utilization are primarily affected by training
parameters such as the batch size. Since noisy and reverberant speech mixtures
can have different duration, a batching strategy is required to handle variable
size inputs during training, in particular for state-of-the-art end-to-end
systems. Such strategies usually strive for a compromise between zero-padding and
data randomization, and can be combined with a dynamic batch size for a more
consistent amount of data in each batch. However, the effect of these practices
on resource utilization and, more importantly, network performance is not well
documented. This paper is an empirical study of the effect of different
batching strategies and batch sizes on the training statistics and speech
enhancement performance of a Conv-TasNet, evaluated in both matched and
mismatched conditions. We find that using a small batch size during training
improves performance in both conditions for all batching strategies. Moreover,
using sorted or bucket batching with a dynamic batch size allows for reduced
training time and GPU memory usage while achieving similar performance compared
to random batching with a fixed batch size.
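
To make the batching strategies concrete, here is a minimal sketch of bucket batching with a dynamic batch size; the function name, the duration budget, and the bucket count are illustrative assumptions, not the paper's actual implementation:

```python
import random

def bucket_batches(lengths, batch_seconds, num_buckets=10):
    """Group variable-size inputs into batches with a dynamic batch size.

    Indices are sorted by duration and split into buckets of similar
    lengths; each batch is then filled until its zero-padded duration
    (number of items times the longest item) would exceed a fixed
    budget, keeping the amount of data per batch roughly constant
    while limiting padding.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    step = max(1, len(order) // num_buckets)
    buckets = [order[i:i + step] for i in range(0, len(order), step)]

    batches = []
    for bucket in buckets:
        random.shuffle(bucket)  # retain some data randomization within buckets
        batch, longest = [], 0.0
        for idx in bucket:
            longest = max(longest, lengths[idx])
            if batch and (len(batch) + 1) * longest > batch_seconds:
                batches.append(batch)
                batch, longest = [], lengths[idx]
            batch.append(idx)
        if batch:
            batches.append(batch)
    random.shuffle(batches)  # shuffle batch order between epochs
    return batches
```

With `num_buckets=1` this reduces to random batching with a dynamic batch size, while a large `num_buckets` approaches fully sorted batching; the duration budget replaces a fixed number of items per batch.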
Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler
Diffusion models are a new class of generative models that have recently been
applied to speech enhancement successfully. Previous works have demonstrated
their superior performance in mismatched conditions compared to
state-of-the-art discriminative models. However, this was investigated with a
single
database for training and another one for testing, which makes the results
highly dependent on the particular databases. Moreover, recent developments
from the image generation literature remain largely unexplored for speech
enhancement. These include several design aspects of diffusion models, such as
the noise schedule or the reverse sampler. In this work, we systematically
assess the generalization performance of a diffusion-based speech enhancement
model by using multiple speech, noise and binaural room impulse response (BRIR)
databases to simulate mismatched acoustic conditions. We also experiment with a
noise schedule and a sampler that have not been applied to speech enhancement
before. We show that the proposed system substantially benefits from using
multiple databases for training, and achieves superior performance compared to
state-of-the-art discriminative models in both matched and mismatched
conditions. We also show that a Heun-based sampler achieves superior
performance at a smaller computational cost compared to a sampler commonly used
for speech enhancement.
Comment: Accepted to ICASSP 202
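
For orientation, here is a minimal sketch of a Heun-based (second-order) sampler in the style of the image generation literature; the `denoise` function and the `sigmas` schedule are assumed inputs, and the sampler actually used for speech enhancement would additionally condition on the noisy input:

```python
def heun_sample(x, sigmas, denoise):
    """Deterministic Heun (second-order) sampler over a noise schedule.

    `sigmas` is a decreasing sequence of noise levels ending at 0, and
    `denoise(x, sigma)` is the learned denoiser. Each step takes an
    Euler step along the probability flow ODE, then corrects it with
    the slope re-evaluated at the next noise level.
    """
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(x, sigma)) / sigma           # ODE slope at sigma
        x_euler = x + (sigma_next - sigma) * d        # first-order (Euler) step
        if sigma_next > 0:
            d_next = (x_euler - denoise(x_euler, sigma_next)) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d + d_next)  # Heun correction
        else:
            x = x_euler                               # final step: Euler only
    return x
```

The second denoiser evaluation per step is what makes the update second-order accurate, which is why comparable quality can be reached with fewer steps than a first-order Euler sampler.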
Investigating the Design Space of Diffusion Models for Speech Enhancement
Diffusion models are a new class of generative models that have shown
outstanding performance in image generation literature. As a consequence,
studies have attempted to apply diffusion models to other tasks, such as speech
enhancement. A popular approach in adapting diffusion models to speech
enhancement consists in modelling a progressive transformation between the
clean and noisy speech signals. However, one popular diffusion model framework
previously established in the image generation literature does not account for
such a transformation towards the system input, which prevents relating
existing diffusion-based speech enhancement systems to that framework. To
address this, we extend the framework to account
for the progressive transformation between the clean and noisy speech signals.
This allows us to apply recent developments from image generation literature,
and to systematically investigate design aspects of diffusion models that
remain largely unexplored for speech enhancement, such as the neural network
preconditioning, the training loss weighting, the stochastic differential
equation (SDE), or the amount of stochasticity injected in the reverse process.
We show that the performance of previous diffusion-based speech enhancement
systems cannot be attributed to the progressive transformation between the
clean and noisy speech signals. Moreover, we show that a proper choice of
preconditioning, training loss weighting, SDE and sampler allows us to outperform
a popular diffusion-based speech enhancement system in terms of perceptual
metrics while using fewer sampling steps, thus reducing the computational cost
by a factor of four.
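
As a point of reference, one SDE commonly used in diffusion-based speech enhancement (the Ornstein-Uhlenbeck variance-exploding SDE of Richter et al.) makes the progressive transformation explicit: the mean of the process drifts from the clean speech towards the noisy speech while noise is injected. It is shown here for illustration only and is not necessarily the exact SDE investigated in this work:

```latex
% x_t: process state, y: noisy speech, gamma: stiffness,
% sigma_min, sigma_max: noise schedule extremes, w_t: Wiener process
\mathrm{d}x_t = \gamma\,(y - x_t)\,\mathrm{d}t
  + \sigma_{\min} \left( \frac{\sigma_{\max}}{\sigma_{\min}} \right)^{t}
    \sqrt{2 \ln \frac{\sigma_{\max}}{\sigma_{\min}}}\; \mathrm{d}w_t
```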
Bubble merging in breathing DNA as a vicious walker problem in opposite potentials
We investigate the coalescence of two DNA-bubbles initially located at weak
domains and separated by a more stable barrier region in a designed construct
of double-stranded DNA. In a continuum Fokker-Planck approach, the
characteristic time for bubble coalescence and the corresponding distribution
are derived, as well as the distribution of coalescence positions along the
barrier. Below the melting temperature, we find a Kramers-type barrier crossing
behavior, while at high temperatures, the bubble corners perform
drift-diffusion towards coalescence. In the calculations, we map the bubble
dynamics on the problem of two vicious walkers in opposite potentials. We also
present a discrete master equation approach to the bubble coalescence problem.
Numerical evaluation and stochastic simulation of the master equation show
excellent agreement with the results from the continuum approach. Given that
the coalesced state is thermodynamically stabilized against a state where only
one or a few base pairs of the barrier region are re-established, it appears
likely that this type of setup could be useful for the quantitative
investigation of thermodynamic DNA stability data as well as the rate constants
involved in the unzipping and zipping dynamics of DNA, in single molecule
fluorescence experiments.
Comment: 24 pages, 11 figures; substantially extended version of
cond-mat/0610752; v2: minor text changes, virtually identical to the published
version
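
To illustrate the discrete picture, here is a toy Gillespie-style simulation of the master equation for the two approaching bubble corners, modelled as vicious walkers whose steps into the barrier are suppressed by a Boltzmann factor; the positions, rates, and boundary handling are illustrative assumptions rather than the paper's actual model:

```python
import math
import random

def coalescence_time(barrier_len, u, rate=1.0):
    """Simulate two 'vicious walkers' approaching across a barrier.

    The inner corners of the two bubbles sit at lattice positions
    x1 < x2 on a barrier of `barrier_len` base pairs. Steps that open
    a barrier base pair (moving the corners together) are suppressed
    by the Boltzmann factor exp(-u), where u > 0 is the free energy
    cost per base pair below the melting temperature. Returns the
    time at which the walkers meet, i.e. the bubbles coalesce.
    """
    x1, x2, t = 0, barrier_len, 0.0
    bias = math.exp(-u)  # penalty for opening a barrier base pair
    while x2 - x1 > 1:
        # rates: x1 right (opening), x1 left, x2 left (opening), x2 right
        rates = [rate * bias, rate, rate * bias, rate]
        total = sum(rates)
        t += random.expovariate(total)       # Gillespie waiting time
        r = random.uniform(0.0, total)
        if r < rates[0]:
            x1 += 1
        elif r < rates[0] + rates[1]:
            x1 = max(0, x1 - 1)              # reflecting at the left bubble
        elif r < rates[0] + rates[1] + rates[2]:
            x2 -= 1
        else:
            x2 = min(barrier_len, x2 + 1)    # reflecting at the right bubble
    return t
```

Averaging the returned times over many runs gives an estimate of the coalescence-time distribution; for u > 0 (below melting) the walkers move uphill towards each other, reproducing the Kramers-type barrier crossing regime described above.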
Race By Hearts
Part 2: Entertainment for Purpose and Persuasion
In this paper, we explore the qualities of sharing biometric data in real-time between athletes, in order to increase two motivational factors for gym-goers: enjoyment and social interaction. We present a novel smartphone application, called Race By Hearts, which enables competition based on heart rate data sharing between users in real-time. Through an empirical study conducted in the gym, we show that sharing biometric data in real-time can strengthen social relations between participants, increase motivation, and improve the enjoyment of the fitness activity. Nevertheless, we found that introducing competition based on real-time sharing of biometric data can cause exasperation and discouragement for some athletes. Based on our findings from the study, we discuss how technology can facilitate and modify competition in fitness exercises in general.
A follow-up study of a successful assistive technology for children with ADHD and their families
Little research on assistive technologies for families of children with attention deficit hyperactivity disorder (ADHD) has investigated the long-term impact after the assistive technology is returned to the researchers. In this paper, we report the outcomes of a follow-up study, conducted four weeks after a field study of 13 children with ADHD and their families who used an assistive technology designed to help establish and change family practices. We show that some of the positive effects on parent frustration level and conflict level around morning and bedtime routines that we observed in the first phase of the study continued even after the study period, when the technology was no longer available. We furthermore present insights into family practices in families of children with ADHD, and how these could lead to unexpected challenges and implications related to the adoption, use, and outcome of the assistive technology.