DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP
Recent neural vocoders based on generative adversarial networks (GANs) can
generate raw waveforms conditioned on mel-spectrograms with fast inference
and lightweight networks. However, it remains challenging to train a
universal neural vocoder
that can synthesize high-fidelity speech from various scenarios with unseen
speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a
GAN-based universal vocoder for high-fidelity speech synthesis by applying the
time-frequency domain supervision from digital signal processing (DSP). To
eliminate the mismatch between the ground-truth spectrograms used in the
training phase and the predicted spectrograms used in the inference phase,
we leverage
the mel-spectrogram extracted from the waveform generated by a DSP module,
rather than the predicted mel-spectrogram from the Text-to-Speech (TTS)
acoustic model, as the time-frequency domain supervision to the GAN-based
vocoder. We also utilize sine excitation as the time-domain supervision to
improve the harmonic modeling and eliminate various artifacts of the GAN-based
vocoder. Experimental results show that DSPGAN significantly outperforms the
compared approaches and can generate high-fidelity speech based on diverse data
in TTS.
Comment: Submitted to ICASSP 202
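
The abstract mentions sine excitation as time-domain supervision but does not say how it is built. One common construction, shown here only as a minimal sketch and not as the authors' implementation, accumulates phase from a frame-level F0 contour and emits silence in unvoiced frames. The function name `sine_excitation`, the parameters `hop_size`, `sample_rate`, and `amplitude`, and their default values are all assumptions made for illustration.

```python
import math


def sine_excitation(f0_frames, hop_size=256, sample_rate=22050, amplitude=0.1):
    """Build a sample-level sine excitation from frame-level F0 values in Hz.

    Assumptions (not taken from the paper): each frame's F0 is held constant
    for hop_size samples, and frames with F0 <= 0 are treated as unvoiced and
    produce zeros.
    """
    signal = []
    phase = 0.0
    for f0 in f0_frames:
        for _ in range(hop_size):
            if f0 > 0:
                # Advance the phase by one sample at the current pitch so the
                # sine stays continuous across frame boundaries.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                signal.append(amplitude * math.sin(phase))
            else:
                # Unvoiced frame: no harmonic component.
                signal.append(0.0)
    return signal


if __name__ == "__main__":
    # Two voiced frames at 100 Hz followed by one unvoiced frame.
    excitation = sine_excitation([100.0, 100.0, 0.0])
    print(len(excitation))
```

Carrying the phase across frames, rather than restarting the sine at every frame, avoids discontinuities when F0 changes. Practical systems often add noise in unvoiced regions or include higher harmonics; this sketch leaves both out.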