Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures
Separating an audio scene into isolated sources is a fundamental problem in
computer audition, analogous to image segmentation in visual scene analysis.
Source separation systems based on deep learning are currently the most
successful approaches for solving the underdetermined separation problem, where
there are more sources than channels. Traditionally, such systems are trained
on sound mixtures where the ground truth decomposition is already known. Since
most real-world recordings do not have such a decomposition available, this
limits the range of mixtures one can train on, and the range of mixtures the
learned models may successfully separate. In this work, we use a simple blind
spatial source separation algorithm to generate estimated decompositions of
stereo mixtures. These estimates, together with a time-frequency weighting
scheme based on confidence in the separation quality, are used to train a deep
learning model for single-channel separation, where no source direction
information is available. This demonstrates how a simple cue, such as a
source's direction of origin, can be used to bootstrap a separation model for
situations where that cue is not available.
Comment: 5 pages, 2 figures
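To make the pipeline concrete, the following is a minimal sketch of the bootstrapping idea: cluster the time-frequency bins of a stereo mixture by a spatial cue (here the inter-channel phase difference), derive a binary mask per source, and attach a per-bin confidence weight. The function name, the two-source assumption, the STFT settings, and the exp(-distance) confidence heuristic are illustrative choices, not the paper's exact algorithm.

import numpy as np
from scipy.signal import stft
from sklearn.cluster import KMeans

def spatial_cluster_masks(left, right, fs, n_sources=2, nperseg=1024):
    # STFTs of both channels of the stereo mixture
    _, _, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)

    # Spatial feature per time-frequency bin: inter-channel phase
    # difference, embedded on the unit circle so -pi and pi coincide.
    ipd = np.angle(L * np.conj(R))
    feats = np.stack([np.cos(ipd).ravel(), np.sin(ipd).ravel()], axis=1)

    # Unsupervised spatial clustering: each cluster is one estimated source.
    km = KMeans(n_clusters=n_sources, n_init=10).fit(feats)
    labels = km.labels_.reshape(L.shape)

    # Confidence weight: bins close to their cluster centroid are more
    # trustworthy separation targets than ambiguous bins between clusters.
    dists = km.transform(feats).min(axis=1).reshape(L.shape)
    confidence = np.exp(-dists)  # positive, at most 1; higher = more confident

    masks = [(labels == k).astype(float) for k in range(n_sources)]
    return masks, confidence

The resulting (mask, confidence) pairs would then serve as weighted training targets for a single-channel separation model, which at inference time needs no spatial information at all.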
Improving Universal Sound Separation Using Sound Classification
Deep learning approaches have recently achieved impressive performance on
both audio source separation and sound classification. Most audio source
separation approaches focus only on separating sources belonging to a
restricted domain of source classes, such as speech and music. However, recent
work has demonstrated the possibility of "universal sound separation", which
aims to separate acoustic sources from an open domain, regardless of their
class. In this paper, we utilize the semantic information learned by sound
classifier networks trained on vast amounts of diverse sounds to improve
universal sound separation. In particular, we show that semantic embeddings
extracted from a sound classifier can be used to condition a separation
network, providing it with useful additional information. This approach is
especially useful in an iterative setup, where source estimates from an initial
separation stage and their corresponding classifier-derived embeddings are fed
to a second separation network. By performing a thorough hyperparameter search
consisting of over a thousand experiments, we find that classifier embeddings
from clean sources provide nearly one dB of SNR gain, and our best iterative
models achieve a significant fraction of this oracle performance, establishing
a new state-of-the-art for universal sound separation.
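As a rough illustration of embedding-conditioned separation, the sketch below lets a sound-classifier embedding modulate a separator's hidden features through feature-wise scale and shift (FiLM-style) conditioning. The class name, layer sizes, and the FiLM mechanism are assumptions for illustration; the paper's hyperparameter search covers many conditioning variants, and this is not its exact architecture.

import torch
import torch.nn as nn

class ConditionedSeparator(nn.Module):
    def __init__(self, n_freq=257, emb_dim=128, hidden=256, n_sources=2):
        super().__init__()
        self.encode = nn.Linear(n_freq, hidden)
        # Map the classifier embedding to a per-channel scale and shift.
        self.film = nn.Linear(emb_dim, 2 * hidden)
        self.decode = nn.Linear(hidden, n_freq * n_sources)
        self.n_sources = n_sources
        self.n_freq = n_freq

    def forward(self, mix_spec, embedding):
        # mix_spec:   (batch, time, n_freq) mixture magnitude spectrogram
        # embedding:  (batch, emb_dim) from a pretrained sound classifier
        h = torch.relu(self.encode(mix_spec))
        scale, shift = self.film(embedding).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)  # FiLM conditioning
        masks = torch.sigmoid(self.decode(h))
        masks = masks.view(*mix_spec.shape[:2], self.n_sources, self.n_freq)
        return masks.unbind(dim=2)  # one spectral mask per estimated source

In the iterative setup described above, one would run a first separation pass, embed each source estimate with the classifier, and feed the mixture together with those estimate-derived embeddings to a second-stage separator.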