2 research outputs found
TzK Flow - Conditional Generative Model
We introduce TzK (pronounced "task"), a conditional probability flow-based
model that exploits attributes (e.g., style, class membership, or other side
information) to learn tight conditional priors around manifolds of the
target observations. The model is trained via approximate maximum likelihood
(ML), and offers efficient approximation of arbitrary data sample
distributions (similar to GANs and flow-based ML), and stable training
(similar to VAEs and ML), while avoiding
variational approximations. TzK exploits meta-data to facilitate a bottleneck,
similar to autoencoders, thereby producing a low-dimensional representation.
Unlike autoencoders, the bottleneck does not limit model expressiveness,
similar to flow-based ML. Supervised, unsupervised, and semi-supervised
learning are supported by replacing missing observations with samples from
learned priors. We demonstrate TzK by training jointly on MNIST and Omniglot
datasets with minimal preprocessing and weak supervision, with results
comparable to the state of the art.
Comment: 5 pages, 4 figures. Accepted to the Bayesian Deep Learning Workshop,
NIPS 2018 (camera-ready). NOTE: this workshop paper has been replaced; please
refer to the following work: arXiv:1902.0189
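The abstract above describes exact-likelihood training of a conditional flow. As a generic illustration (not TzK's actual architecture), the sketch below uses a per-class affine transform to a standard-normal base distribution; the exact conditional log-likelihood follows from the change-of-variables formula. All names and shapes here are illustrative assumptions.

```python
# Minimal sketch of conditional flow-based maximum likelihood (illustrative
# assumption, not the TzK architecture). A per-class affine transform
# z = (x - mu_c) / sigma_c maps data to a standard-normal base; the exact
# log-likelihood adds the log-determinant of the Jacobian.
import numpy as np

def log_likelihood(x, cond, mu, log_sigma):
    """Exact log p(x | cond) for a per-class affine flow.

    x: (N, D) data; cond: (N,) integer class labels;
    mu, log_sigma: (C, D) per-class parameters.
    """
    m, ls = mu[cond], log_sigma[cond]                 # select per-sample params
    z = (x - m) * np.exp(-ls)                         # map to base space
    log_base = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=1)  # N(0, I)
    log_det = -ls.sum(axis=1)                         # log |det dz/dx|
    return log_base + log_det

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [3.0, 3.0]])               # two toy classes
log_sigma = np.zeros((2, 2))
x = rng.normal(size=(4, 2))
cond = np.array([0, 0, 1, 1])
ll = log_likelihood(x, cond, mu, log_sigma)
print(ll.shape)  # (4,)
```

Training would ascend the mean of `ll` over a labeled batch; for missing labels, the abstract's semi-supervised recipe substitutes samples from the learned priors.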
Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion
End-to-end models for raw audio generation are challenging, especially if they
have to work with non-parallel data, which is a desirable setup in many
situations. Voice conversion, in which a model has to impersonate a speaker in
a recording, is one of those situations. In this paper, we propose Blow, a
single-scale normalizing flow using hypernetwork conditioning to perform
many-to-many voice conversion on raw audio. Blow is trained end-to-end,
with non-parallel data, on a frame-by-frame basis using a single speaker
identifier. We show that Blow compares favorably to existing flow-based
architectures and other competitive baselines, obtaining equal or better
performance in both objective and subjective evaluations. We further assess the
impact of its main components with an ablation study, and quantify a number of
properties such as the necessary amount of training data or the preference for
source or target speakers.
Comment: Includes appendix. Accepted for NeurIPS201
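The key mechanism named in the abstract is hypernetwork conditioning: a speaker embedding generates the weights of the coupling network inside the flow. The sketch below is an illustrative toy version (Blow's real layers are deep convolutional networks over audio frames); all dimensions and names are assumptions.

```python
# Minimal sketch of hypernetwork conditioning in an affine coupling layer
# (illustrative assumption, not Blow's actual layer). The speaker embedding
# is mapped by a linear hypernetwork to the weights of the small net that
# predicts scale and shift for the transformed half of the frame.
import numpy as np

D = 4          # toy frame dimension
half = D // 2
E = 3          # speaker-embedding dimension

def coupling_forward(x, emb, W_hyper):
    """One affine coupling step with speaker-conditioned weights.

    x: (D,) audio frame; emb: (E,) speaker embedding;
    W_hyper: (E, 2*half*half) hypernetwork parameters.
    Returns the transformed frame and the exact log |det Jacobian|.
    """
    x1, x2 = x[:half], x[half:]
    # Hypernetwork: speaker embedding -> coupling-net weight matrix.
    W = (emb @ W_hyper).reshape(2 * half, half)
    params = W @ x1
    log_s = np.tanh(params[:half])          # bounded log-scale for stability
    t = params[half:]
    y2 = x2 * np.exp(log_s) + t             # affine transform of second half
    return np.concatenate([x1, y2]), log_s.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=D)
emb = rng.normal(size=E)                    # one learned vector per speaker
W_hyper = rng.normal(scale=0.1, size=(E, 2 * half * half))
y, log_det = coupling_forward(x, emb, W_hyper)
print(y.shape)
```

Because the first half passes through unchanged, the step is exactly invertible given the same speaker embedding, which is what lets a flow like this be trained frame-by-frame by maximum likelihood and then run in reverse with a different speaker identifier for conversion.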