5,289 research outputs found
Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction
The state of the art in music source separation employs neural networks
trained in a supervised fashion on multi-track databases to estimate the
sources from a given mixture. With only few datasets available, often extensive
data augmentation is used to combat overfitting. Mixing random tracks, however,
can even reduce separation performance as instruments in real music are
strongly correlated. The key concept in our approach is that source estimates
of an optimal separator should be indistinguishable from real source signals.
Based on this idea, we drive the separator towards outputs deemed as realistic
by discriminator networks that are trained to tell apart real from separator
samples. This way, we can also use unpaired source and mixture recordings
without the drawbacks of creating unrealistic music mixtures. Our framework is
widely applicable as it does not assume a specific network architecture or
number of sources. To our knowledge, this is the first adoption of adversarial
training for music source separation. In a prototype experiment for singing
voice separation, separation performance increases with our approach compared
to purely supervised training.Comment: 5 pages, 2 figures, 1 table. Final version of manuscript accepted for
2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). Implementation available at
https://github.com/f90/AdversarialAudioSeparatio
A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders
Recent studies have explored the use of deep generative models of speech
spectra based of variational autoencoders (VAEs), combined with unsupervised
noise models, to perform speech enhancement. These studies developed iterative
algorithms involving either Gibbs sampling or gradient descent at each step,
making them computationally expensive. This paper proposes a variational
inference method to iteratively estimate the power spectrogram of the clean
speech. Our main contribution is the analytical derivation of the variational
steps in which the en-coder of the pre-learned VAE can be used to estimate the
varia-tional approximation of the true posterior distribution, using the very
same assumption made to train VAEs. Experiments show that the proposed method
produces results on par with the afore-mentioned iterative methods using
sampling, while decreasing the computational cost by a factor 36 to reach a
given performance .Comment: Submitted to INTERSPEECH 201
Hierarchically Clustered Representation Learning
The joint optimization of representation learning and clustering in the
embedding space has experienced a breakthrough in recent years. In spite of the
advance, clustering with representation learning has been limited to flat-level
categories, which often involves cohesive clustering with a focus on instance
relations. To overcome the limitations of flat clustering, we introduce
hierarchically-clustered representation learning (HCRL), which simultaneously
optimizes representation learning and hierarchical clustering in the embedding
space. Compared with a few prior works, HCRL firstly attempts to consider a
generation of deep embeddings from every component of the hierarchy, not just
leaf components. In addition to obtaining hierarchically clustered embeddings,
we can reconstruct data by the various abstraction levels, infer the intrinsic
hierarchical structure, and learn the level-proportion features. We conducted
evaluations with image and text domains, and our quantitative analyses showed
competent likelihoods and the best accuracies compared with the baselines.Comment: 10 pages, 7 figures, Under review as a conference pape
From neural PCA to deep unsupervised learning
A network supporting deep unsupervised learning is presented. The network is
an autoencoder with lateral shortcut connections from the encoder to decoder at
each level of the hierarchy. The lateral shortcut connections allow the higher
levels of the hierarchy to focus on abstract invariant features. While standard
autoencoders are analogous to latent variable models with a single layer of
stochastic variables, the proposed network is analogous to hierarchical latent
variables models. Learning combines denoising autoencoder and denoising sources
separation frameworks. Each layer of the network contributes to the cost
function a term which measures the distance of the representations produced by
the encoder and the decoder. Since training signals originate from all levels
of the network, all layers can learn efficiently even in deep networks. The
speedup offered by cost terms from higher levels of the hierarchy and the
ability to learn invariant features are demonstrated in experiments.Comment: A revised version of an article that has been accepted for
publication in Advances in Independent Component Analysis and Learning
Machines (2015), edited by Ella Bingham, Samuel Kaski, Jorma Laaksonen and
Jouko Lampine
- …