Generative adversarial network (GAN)-based neural vocoders have been widely
used in audio synthesis tasks due to their high generation quality, efficient
inference, and small computational footprint. However, it remains challenging to
train a universal vocoder that generalizes well to out-of-domain (OOD)
scenarios, such as unseen speaking styles, non-speech vocalization, singing,
and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal
vocoder, which can synthesize high-fidelity audio in various OOD scenarios.
SnakeGAN takes a coarse-grained signal generated by a differentiable digital
signal processing (DDSP) model as prior knowledge, aiming to recover a
high-fidelity waveform from a mel-spectrogram. We introduce periodic
nonlinearities into the generator through the Snake activation function and
anti-aliased representations, which further bring the desired inductive bias
for audio synthesis and significantly improve the extrapolation capacity
for universal vocoding in unseen scenarios. To validate the effectiveness of
our proposed method, we train SnakeGAN with only speech data and evaluate its
performance for various OOD distributions with both subjective and objective
metrics. Experimental results show that SnakeGAN significantly outperforms the
compared approaches and can generate high-fidelity audio samples including
unseen speakers with unseen styles, singing voices, instrumental pieces, and
nonverbal vocalization.

Comment: Accepted by ICME 202
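For concreteness, the Snake activation referenced above follows the standard form snake_a(x) = x + (1/a) sin²(ax), whose sin² term makes the function periodic around the identity. The sketch below is a minimal NumPy illustration, not the paper's implementation; in practice `alpha` is typically a trainable per-channel parameter of the generator, and the parameter name here is our own.

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: identity plus a periodic sin^2 term.

    snake_a(x) = x + (1/a) * sin(a*x)^2. For small alpha it approaches
    the identity; larger alpha strengthens the periodic component.
    """
    return x + (1.0 / alpha) * np.sin(alpha * x) ** 2

# Example: evaluate over one period; snake(0) == 0 and the periodic
# deviation from the identity repeats with period pi/alpha.
x = np.linspace(-np.pi, np.pi, 9)
y = snake(x, alpha=1.0)
```

Because the sin² term is bounded, the activation stays close to the identity at large |x|, which is the inductive bias credited with better extrapolation on periodic signals such as audio.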