Multi-modal Blind Source Separation with Microphones and Blinkies
We propose a blind source separation algorithm that jointly exploits
measurements by a conventional microphone array and an ad hoc array of low-rate
sound power sensors called blinkies. While providing less information than
microphones, blinkies circumvent some difficulties of microphone arrays in
terms of manufacturing, synchronization, and deployment. The algorithm is
derived from a joint probabilistic model of the microphone and sound power
measurements. We assume the separated sources to follow a time-varying
spherical Gaussian distribution, and the non-negative power measurement
space-time matrix to have a low-rank structure. We show that alternating
updates similar to those of independent vector analysis and Itakura-Saito
non-negative matrix factorization decrease the negative log-likelihood of the
joint distribution. The proposed algorithm is validated via numerical
experiments. Its median separation performance is found to be up to 8 dB
higher than that of independent vector analysis, with significantly reduced
variability.
Comment: Accepted at IEEE ICASSP 2019, Brighton, UK. 5 pages, 3 figures.
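The alternating optimization lends itself to a compact implementation. Below is a minimal sketch, assuming NumPy, of the Itakura-Saito NMF half of such a scheme applied to a blinky power matrix; the function name is_nmf_updates, the random initialization, and the omission of the IVA demixing half are illustrative assumptions rather than the paper's actual code.

import numpy as np

def is_nmf_updates(P, rank, n_iter=100, eps=1e-12):
    """Majorization-minimization updates that decrease the Itakura-Saito
    divergence D_IS(P || W @ H) of a non-negative power matrix P
    (blinkies x frames). Sketch only: the joint algorithm in the paper
    alternates updates like these with IVA-style demixing updates."""
    rng = np.random.default_rng(0)
    W = rng.random((P.shape[0], rank)) + eps
    H = rng.random((rank, P.shape[1])) + eps
    for _ in range(n_iter):
        V = W @ H + eps
        # MM exponent for the IS divergence (beta = 0) is 1/2
        W *= (((P / V**2) @ H.T) / ((1.0 / V) @ H.T + eps)) ** 0.5
        V = W @ H + eps
        H *= ((W.T @ (P / V**2)) / (W.T @ (1.0 / V) + eps)) ** 0.5
    return W, H

# Example: factor a random 40-blinky x 500-frame power matrix with rank 4
W, H = is_nmf_updates(np.abs(np.random.randn(40, 500)) ** 2, rank=4)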
Deep Clustering and Conventional Networks for Music Separation: Stronger Together
Deep clustering is the first method to handle general audio separation
scenarios with multiple sources of the same type and an arbitrary number of
sources, performing impressively in speaker-independent speech separation
tasks. However, little is known about its effectiveness in other challenging
situations such as music source separation. Contrary to conventional networks
that directly estimate the source signals, deep clustering generates an
embedding for each time-frequency bin, and separates sources by clustering the
bins in the embedding space. We show that deep clustering outperforms
conventional networks on a singing voice separation task, in both matched and
mismatched conditions, even though conventional networks have the advantage of
end-to-end training for best signal approximation, presumably because deep clustering's more
flexible objective engenders better regularization. Since the strengths of deep
clustering and conventional network architectures appear complementary, we
explore combining them in a single hybrid network trained via an approach akin
to multi-task learning. Remarkably, the combination significantly outperforms
either of its components.
Comment: Published in ICASSP 2017.
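To make the clustering step concrete, here is a minimal sketch, assuming NumPy/SciPy, of the standard deep clustering objective ||VV^T - YY^T||_F^2 computed without forming the TF x TF affinity matrices, together with k-means mask inference over the embeddings; the function names are illustrative, not the paper's code.

import numpy as np
from scipy.cluster.vq import kmeans2

def dc_loss(V, Y):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2, expanded so the
    (TF x TF) affinity matrices are never built explicitly.
    V: (TF, D) embeddings, Y: (TF, C) one-hot source memberships."""
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)

def masks_from_embeddings(V, n_sources, shape):
    """Cluster the T-F bin embeddings with k-means and turn the labels
    into binary separation masks, one (F, T) mask per source."""
    _, labels = kmeans2(V, n_sources, minit='++')
    return np.stack([(labels == c).reshape(shape) for c in range(n_sources)])

# Example: random unit-norm embeddings for a 257 x 100 spectrogram, 2 sources
V = np.random.randn(257 * 100, 20)
V /= np.linalg.norm(V, axis=1, keepdims=True)
masks = masks_from_embeddings(V, n_sources=2, shape=(257, 100))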
Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders
We show that a Modular Neural Network (MNN) can combine various speech
enhancement modules, each of which is a Deep Neural Network (DNN) specialized
on a particular enhancement job. Unlike an ordinary ensemble technique that
averages over model variations, the proposed MNN selects the best module for
the unseen test signal, producing a greedy ensemble. We see this as
Collaborative Deep Learning (CDL), because it can reuse various already-trained
DNN models without any further refinement. In the proposed MNN, selecting the
best module at run time is challenging. To this end, we employ a speech
AutoEncoder (AE) as an arbitrator, whose input and output are trained to be as
similar as possible when its input is clean speech. Therefore, the AE can
gauge the quality of each module's denoised output from its reconstruction
error: a low error means that the module output is similar to
clean speech. We propose an MNN structure with various modules that are
specialized on dealing with a specific noise type, gender, and input
Signal-to-Noise Ratio (SNR) value, and empirically show that it almost always
works better than an arbitrarily chosen DNN module and sometimes performs as
well as an oracle result.
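The run-time selection rule reduces to a few lines. The sketch below, with stand-in callables for the specialized modules and the clean-speech autoencoder, picks the module whose output the AE reconstructs with the lowest mean squared error; every name here is hypothetical.

import numpy as np

def select_module(noisy_feats, modules, autoencoder):
    """Run each specialized enhancement module on the test features and
    keep the output that the clean-speech autoencoder reconstructs best.
    A low AE reconstruction error suggests the module output lies close
    to the clean-speech manifold the AE was trained on."""
    best_err, best_out = np.inf, None
    for module in modules:
        denoised = module(noisy_feats)
        err = np.mean((autoencoder(denoised) - denoised) ** 2)
        if err < best_err:
            best_err, best_out = err, denoised
    return best_out, best_err

# Toy example: gain "modules" and a damping "autoencoder" stand-in
modules = [lambda x, g=g: g * x for g in (0.5, 0.9, 1.0)]
out, err = select_module(np.random.randn(257, 100), modules, lambda x: 0.95 * x)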