2 research outputs found
TzK Flow - Conditional Generative Model
We introduce TzK (pronounced "task"), a conditional probability flow-based
model that exploits attributes (e.g., style, class membership, or other side
information) to learn tight conditional priors around manifolds of the
target observations. The model is trained via approximate maximum likelihood
(ML), and offers efficient approximation of arbitrary data sample
distributions (similar to GANs and flow-based ML), and stable training
(similar to VAEs and ML), while avoiding
variational approximations. TzK exploits meta-data to facilitate a bottleneck,
similar to autoencoders, thereby producing a low-dimensional representation.
Unlike autoencoders, the bottleneck does not limit model expressiveness,
similar to flow-based ML. Supervised, unsupervised, and semi-supervised
learning are supported by replacing missing observations with samples from
learned priors. We demonstrate TzK by training jointly on MNIST and Omniglot
datasets with minimal preprocessing and weak supervision, with results
comparable to the state of the art.
Comment: 5 pages, 4 figures. Accepted to the Bayesian Deep Learning Workshop,
NIPS 2018 (camera-ready). NOTE: this workshop paper has been replaced; please
refer to the following work: arXiv:1902.0189
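The abstract above describes exact-likelihood training of a conditional flow. As a generic illustration (not TzK's actual architecture), the sketch below uses a per-class affine transform to a standard-normal base distribution; the exact conditional log-likelihood follows from the change-of-variables formula. All names and shapes here are illustrative assumptions.

```python
# Minimal sketch of conditional flow-based maximum likelihood (illustrative
# assumption, not the TzK architecture). A per-class affine transform
# z = (x - mu_c) / sigma_c maps data to a standard-normal base; the exact
# log-likelihood adds the log-determinant of the Jacobian.
import numpy as np

def log_likelihood(x, cond, mu, log_sigma):
    """Exact log p(x | cond) for a per-class affine flow.

    x: (N, D) data; cond: (N,) integer class labels;
    mu, log_sigma: (C, D) per-class parameters.
    """
    m, ls = mu[cond], log_sigma[cond]                 # select per-sample params
    z = (x - m) * np.exp(-ls)                         # map to base space
    log_base = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=1)  # N(0, I)
    log_det = -ls.sum(axis=1)                         # log |det dz/dx|
    return log_base + log_det

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [3.0, 3.0]])               # two toy classes
log_sigma = np.zeros((2, 2))
x = rng.normal(size=(4, 2))
cond = np.array([0, 0, 1, 1])
ll = log_likelihood(x, cond, mu, log_sigma)
print(ll.shape)  # (4,)
```

Training would ascend the mean of `ll` over a labeled batch; for missing labels, the abstract's semi-supervised recipe substitutes samples from the learned priors.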
Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion
End-to-end models for raw audio generation are challenging, especially if they
have to work with non-parallel data, which is a desirable setup in many
situations. Voice conversion, in which a model has to impersonate a speaker in
a recording, is one of those situations. In this paper, we propose Blow, a
single-scale normalizing flow using hypernetwork conditioning to perform
many-to-many voice conversion on raw audio. Blow is trained end-to-end,
with non-parallel data, on a frame-by-frame basis using a single speaker
identifier. We show that Blow compares favorably to existing flow-based
architectures and other competitive baselines, obtaining equal or better
performance in both objective and subjective evaluations. We further assess the
impact of its main components with an ablation study, and quantify a number of
properties such as the necessary amount of training data or the preference for
source or target speakers.
Comment: Includes appendix. Accepted for NeurIPS201
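The key mechanism named in the abstract is hypernetwork conditioning: a speaker embedding generates the weights of the coupling network inside the flow. The sketch below is an illustrative toy version (Blow's real layers are deep convolutional networks over audio frames); all dimensions and names are assumptions.

```python
# Minimal sketch of hypernetwork conditioning in an affine coupling layer
# (illustrative assumption, not Blow's actual layer). The speaker embedding
# is mapped by a linear hypernetwork to the weights of the small net that
# predicts scale and shift for the transformed half of the frame.
import numpy as np

D = 4          # toy frame dimension
half = D // 2
E = 3          # speaker-embedding dimension

def coupling_forward(x, emb, W_hyper):
    """One affine coupling step with speaker-conditioned weights.

    x: (D,) audio frame; emb: (E,) speaker embedding;
    W_hyper: (E, 2*half*half) hypernetwork parameters.
    Returns the transformed frame and the exact log |det Jacobian|.
    """
    x1, x2 = x[:half], x[half:]
    # Hypernetwork: speaker embedding -> coupling-net weight matrix.
    W = (emb @ W_hyper).reshape(2 * half, half)
    params = W @ x1
    log_s = np.tanh(params[:half])          # bounded log-scale for stability
    t = params[half:]
    y2 = x2 * np.exp(log_s) + t             # affine transform of second half
    return np.concatenate([x1, y2]), log_s.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=D)
emb = rng.normal(size=E)                    # one learned vector per speaker
W_hyper = rng.normal(scale=0.1, size=(E, 2 * half * half))
y, log_det = coupling_forward(x, emb, W_hyper)
print(y.shape)
```

Because the first half passes through unchanged, the step is exactly invertible given the same speaker embedding, which is what lets a flow like this be trained frame-by-frame by maximum likelihood and then run in reverse with a different speaker identifier for conversion.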