Can we develop a model that can synthesize realistic speech directly from a
latent space, without explicit conditioning? Despite several efforts over the
last decade, previous adversarial and diffusion-based approaches still struggle
to achieve this, even on small-vocabulary datasets. To address this, we propose
AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional
speech synthesis tailored to learn a disentangled latent space. Building upon
the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a
disentangled latent vector, which is then mapped to a sequence of audio features
so that signal aliasing is suppressed at every layer. To successfully train
ASGAN, we introduce a number of new techniques, including a modification to
adaptive discriminator augmentation which probabilistically skips discriminator
updates. We apply ASGAN to the small-vocabulary Google Speech Commands digits
dataset, where it achieves state-of-the-art results in unconditional speech
synthesis. It is also substantially faster than existing top-performing
diffusion models. We confirm that ASGAN's latent space is disentangled: we
demonstrate how simple linear operations in the space can be used to perform
several tasks unseen during training. Specifically, we perform evaluations in
voice conversion, speech enhancement, speaker verification, and keyword
classification. Our work indicates that GANs are still highly competitive in
the unconditional speech synthesis landscape, and that disentangled latent
spaces can be used to aid generalization to unseen tasks. Code, models,
samples: https://github.com/RF5/simple-asgan/
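
The abstract only summarizes the generator at a high level: sampled noise is mapped to a disentangled latent vector, which in turn drives a synthesis network that outputs a sequence of audio features while suppressing aliasing. The sketch below illustrates that flow only; the layer sizes, the simple channel-wise modulation, the 80-dimensional feature output, and the 3-tap low-pass filter standing in for proper anti-aliasing are illustrative assumptions, not ASGAN's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    """Maps normalized noise z to a (hopefully disentangled) latent vector w."""
    def __init__(self, z_dim=512, w_dim=512, n_layers=4):
        super().__init__()
        layers, dim = [], z_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
            dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()  # normalize z
        return self.net(z)

class SynthesisNetwork(nn.Module):
    """Turns a learned constant into an audio-feature sequence, conditioned on w.
    A 3-tap low-pass filter after each nonlinearity stands in for anti-aliasing."""
    def __init__(self, w_dim=512, channels=128, feat_dim=80, init_len=16, n_blocks=3):
        super().__init__()
        self.const = nn.Parameter(torch.randn(channels, init_len))
        self.convs = nn.ModuleList(nn.Conv1d(channels, channels, 3, padding=1)
                                   for _ in range(n_blocks))
        self.mods = nn.ModuleList(nn.Linear(w_dim, channels) for _ in range(n_blocks))
        self.to_feats = nn.Conv1d(channels, feat_dim, 1)
        lp = torch.tensor([0.25, 0.5, 0.25]).repeat(channels, 1, 1)  # (C, 1, 3)
        self.register_buffer("lp_kernel", lp)

    def forward(self, w):
        x = self.const.unsqueeze(0).expand(w.size(0), -1, -1)
        for conv, mod in zip(self.convs, self.mods):
            x = F.interpolate(x, scale_factor=2, mode="linear")       # upsample in time
            x = conv(x) * (1 + mod(w)).unsqueeze(-1)                  # w modulates block
            x = F.leaky_relu(x, 0.2)
            x = F.conv1d(x, self.lp_kernel, padding=1, groups=x.size(1))  # low-pass
        return self.to_feats(x)  # (batch, feat_dim, time) audio-feature sequence

z = torch.randn(2, 512)          # sampled noise
w = MappingNetwork()(z)          # disentangled latent vector
feats = SynthesisNetwork()(w)    # torch.Size([2, 80, 128])
```

The low-pass step after each nonlinearity is where a StyleGAN3-style design would suppress the aliasing the abstract refers to; the real model would use windowed-sinc filtering rather than the placeholder kernel above.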
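
The modified adaptive discriminator augmentation is named but not specified in the abstract. The following is a hedged sketch of the general idea of probabilistically skipping discriminator updates: the overfitting heuristic used to set the skip probability, its target value, the adjustment rate, and the toy models are all assumptions rather than the paper's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, feat_dim = 16, 32
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

p_skip = 0.0       # probability of skipping a discriminator update
target_rt = 0.6    # target for the overfitting heuristic r_t = E[sign(D(real))]
step_size = 0.01   # how fast p_skip adapts

for step in range(1000):
    real = torch.randn(8, feat_dim)   # stand-in for a batch of real audio features
    z = torch.randn(8, latent_dim)

    # Generator update (always performed), non-saturating GAN loss.
    g_loss = F.softplus(-D(G(z))).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    # Discriminator update, skipped with probability p_skip.
    if torch.rand(()).item() >= p_skip:
        real_logits = D(real)
        fake_logits = D(G(z).detach())
        d_loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # ADA-style heuristic: if D is too confident on real data it is
        # overfitting, so skip its updates more often; otherwise less often.
        r_t = torch.sign(real_logits.detach()).mean().item()
        p_skip += step_size if r_t > target_rt else -step_size
        p_skip = min(max(p_skip, 0.0), 1.0)
```

In standard adaptive discriminator augmentation the heuristic r_t controls the strength of data augmentation; this sketch instead lets it control how often the discriminator update is skipped, which is the spirit of the modification the abstract mentions, not necessarily its exact mechanism.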