Foreground-Background Ambient Sound Scene Separation
Ambient sound scenes typically comprise multiple short events occurring on
top of a somewhat stationary background. We consider the task of separating
these events from the background, which we call foreground-background ambient
sound scene separation. We propose a deep learning-based separation framework
with a suitable feature normalization scheme and an optional auxiliary network
capturing the background statistics, and we investigate its ability to handle
the great variety of sound classes encountered in ambient sound scenes, which
have often not been seen in training. To do so, we create single-channel
foreground-background mixtures using isolated sounds from the DESED and
Audioset datasets, and we conduct extensive experiments with mixtures of seen
or unseen sound classes at various signal-to-noise ratios. Our experimental
findings demonstrate the generalization ability of the proposed approach.
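As an illustration of how a foreground-background mixture at a chosen signal-to-noise ratio might be built, here is a minimal sketch. It assumes the foreground event is treated as the "signal", the background as the "noise", and that power is averaged over the whole clip; the abstract does not state the exact convention used. The function names `mix_at_snr` and `measured_snr_db` are hypothetical, and random noise stands in for the isolated sounds.

```python
import numpy as np

def mix_at_snr(foreground, background, snr_db):
    """Scale the background so the foreground-to-background power ratio is snr_db.

    Assumption: power is averaged over the full clip, which may differ from
    the convention used in the paper.
    """
    fg_power = np.mean(foreground ** 2)
    bg_power = np.mean(background ** 2)
    # Solve 10*log10(fg_power / (gain**2 * bg_power)) == snr_db for gain.
    gain = np.sqrt(fg_power / (bg_power * 10 ** (snr_db / 10)))
    scaled_background = gain * background
    return foreground + scaled_background, scaled_background

def measured_snr_db(foreground, background):
    """Foreground-to-background power ratio in dB."""
    return 10 * np.log10(np.mean(foreground ** 2) / np.mean(background ** 2))

# Synthetic stand-ins for an isolated event and a background recording.
rng = np.random.default_rng(0)
fg = rng.standard_normal(16000)
bg = 0.1 * rng.standard_normal(16000)
mixture, scaled_bg = mix_at_snr(fg, bg, snr_db=5.0)
```

Returning the scaled background alongside the mixture keeps the reference signals available for training or evaluating a separation model.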
Reverberant Audio Source Separation via Sparse and Low-Rank Modeling
The performance of audio source separation from underdetermined convolutive
mixture assuming known mixing filters can be significantly improved by using an
analysis sparse prior optimized by a reweighting ℓ1 scheme and a wideband
data-fidelity term, as demonstrated by a recent article. In this letter, we show
that the performance can be improved even more significantly by exploiting a
low-rank prior on the source spectrograms. We present a new algorithm to
estimate the sources based on i) an analysis sparse prior, ii) a reweighting
scheme so as to increase the sparsity, iii) a wideband data-fidelity term in a
constrained form, and iv) a low-rank constraint on the source spectrograms.
Evaluation on reverberant music mixtures shows that the resulting algorithm
improves state-of-the-art methods by more than 2 dB of signal-to-distortion
ratio.
Deep Clustering and Conventional Networks for Music Separation: Stronger Together
Deep clustering is the first method to handle general audio separation
scenarios with multiple sources of the same type and an arbitrary number of
sources, performing impressively in speaker-independent speech separation
tasks. However, little is known about its effectiveness in other challenging
situations such as music source separation. Contrary to conventional networks
that directly estimate the source signals, deep clustering generates an
embedding for each time-frequency bin, and separates sources by clustering the
bins in the embedding space. We show that deep clustering outperforms
conventional networks on a singing voice separation task, in both matched and
mismatched conditions, even though conventional networks have the advantage of
end-to-end training for best signal approximation, presumably because its more
flexible objective engenders better regularization. Since the strengths of deep
clustering and conventional network architectures appear complementary, we
explore combining them in a single hybrid network trained via an approach akin
to multi-task learning. Remarkably, the combination significantly outperforms
either of its components.
Comment: Published in ICASSP 201
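The clustering step described in the abstract (one embedding per time-frequency bin, then grouping bins in the embedding space) can be sketched roughly as follows. This is only an illustration under stated assumptions: the embeddings are synthetic rather than produced by a trained network, a small hand-written k-means with farthest-point initialization stands in for whatever clustering the authors use, and the names `kmeans_labels` and `masks_from_embeddings` are hypothetical.

```python
import numpy as np

def kmeans_labels(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def masks_from_embeddings(embeddings, n_sources):
    """Cluster per-bin embeddings and return one binary mask per source."""
    n_frames, n_bins, dim = embeddings.shape
    labels = kmeans_labels(embeddings.reshape(-1, dim), n_sources)
    labels = labels.reshape(n_frames, n_bins)
    return np.stack([labels == j for j in range(n_sources)]).astype(float)

# Synthetic embeddings: each bin sits near one of two prototype vectors.
rng = np.random.default_rng(0)
n_frames, n_bins, dim = 10, 8, 4
true_labels = rng.integers(0, 2, size=(n_frames, n_bins))
prototypes = np.eye(dim)[:2]
embeddings = prototypes[true_labels] + 0.05 * rng.standard_normal((n_frames, n_bins, dim))
masks = masks_from_embeddings(embeddings, n_sources=2)
```

Each mask would then be multiplied with the mixture spectrogram to obtain one source estimate. The cluster order is arbitrary, so the masks match the true assignment only up to a permutation of the sources.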
Improved Multiple Birdsong Tracking with Distribution Derivative Method and Markov Renewal Process Clustering
DS & MP are supported by an EPSRC Leadership Fellowship EP/G007144/1
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
In this work, we address the problem of musical timbre transfer, where the
goal is to manipulate the timbre of a sound sample from one instrument to match
another instrument while preserving other musical content, such as pitch,
rhythm, and loudness. In principle, one could apply image-based style transfer
techniques to a time-frequency representation of an audio signal, but this
depends on having a representation that allows independent manipulation of
timbre as well as high-quality waveform generation. We introduce TimbreTron, a
method for musical timbre transfer which applies "image" domain style transfer
to a time-frequency representation of the audio signal, and then produces a
high-quality waveform using a conditional WaveNet synthesizer. We show that the
Constant Q Transform (CQT) representation is particularly well-suited to
convolutional architectures due to its approximate pitch equivariance. Based on
human perceptual evaluations, we confirmed that TimbreTron recognizably
transferred the timbre while otherwise preserving the musical content, for both
monophonic and polyphonic samples.
Comment: 17 pages, published as a conference paper at ICLR 201
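The pitch equivariance mentioned in the abstract can be illustrated with simple bin-index arithmetic: on a log-spaced frequency axis, transposing a harmonic tone moves all of its partials by the same number of bins. The sketch below is not a CQT implementation and is not taken from the paper; the constants `BINS_PER_OCTAVE` and `F_MIN` and the function names are assumptions chosen for the example, and it covers only the idealized case of a perfectly harmonic tone.

```python
import numpy as np

BINS_PER_OCTAVE = 12   # assumed resolution: one bin per semitone
F_MIN = 55.0           # assumed frequency of bin 0, in Hz

def cqt_bin(freq_hz):
    """Index of the log-spaced bin nearest to each frequency."""
    ratio = np.asarray(freq_hz) / F_MIN
    return np.round(BINS_PER_OCTAVE * np.log2(ratio)).astype(int)

def harmonic_bins(f0_hz, n_harmonics=6):
    """Bin indices of the first few harmonics of a tone with fundamental f0_hz."""
    return cqt_bin(f0_hz * np.arange(1, n_harmonics + 1))

shift = 3                                   # transpose up by three semitones
low = harmonic_bins(220.0)
high = harmonic_bins(220.0 * 2 ** (shift / BINS_PER_OCTAVE))
bin_offsets = high - low                    # same offset for every harmonic

# On a linear frequency axis the same transposition spreads the partials apart.
linear_offsets = 220.0 * (2 ** (shift / BINS_PER_OCTAVE) - 1) * np.arange(1, 7)
```

Because the whole harmonic pattern translates rigidly along the log-frequency axis, a convolutional filter that responds to it at one pitch responds to it at another, which is the property the abstract refers to. Real instruments keep this only approximately, since timbre also varies with pitch.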