Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks
The state-of-the-art in text-to-speech synthesis has recently improved
considerably due to novel neural waveform generation methods, such as WaveNet.
However, these methods suffer from their slow sequential inference process,
while their parallel versions are difficult to train and even more expensive
computationally. Meanwhile, generative adversarial networks (GANs) have
achieved impressive results in image generation and are making their way into
audio applications; parallel inference is among their attractive properties. By
adopting recent advances in GAN training techniques, this investigation studies
waveform generation for TTS in two domains (speech signal and glottal
excitation). Listening test results show that while direct waveform generation
with GANs still lags behind WaveNet, a GAN-based glottal excitation model can
achieve quality and voice similarity on par with a WaveNet vocoder.
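To make the adversarial setup concrete, below is a minimal sketch of conditional waveform generation trained with a least-squares GAN objective. It is not the paper's pitch-synchronous multi-scale architecture; the layer sizes, upsampling factors, and the random tensors standing in for acoustic features and real audio are illustrative assumptions.

```python
# Minimal conditional GAN waveform sketch (LSGAN losses). All shapes,
# layer sizes, and the random "data" below are illustrative assumptions,
# not the paper's pitch-synchronous multi-scale model.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Upsamples frame-rate acoustic features to a raw waveform (256x)."""
    def __init__(self, cond_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(cond_dim, 256, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(128, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),  # waveform in [-1, 1]
        )
    def forward(self, c):      # c: (batch, cond_dim, frames)
        return self.net(c)     # -> (batch, 1, frames * 256)

class Discriminator(nn.Module):
    """Scores waveform segments as real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=16, stride=4, padding=6),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=16, stride=4, padding=6),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

cond = torch.randn(4, 80, 32)              # stand-in acoustic features
real = torch.rand(4, 1, 32 * 256) * 2 - 1  # stand-in "real" audio

# One LSGAN step: D pushes real -> 1 and fake -> 0; G pushes fake -> 1.
fake = G(cond)
loss_d = ((D(real) - 1) ** 2).mean() + (D(fake.detach()) ** 2).mean()
opt_d.zero_grad(); loss_d.backward(); opt_d.step()
loss_g = ((D(fake) - 1) ** 2).mean()
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```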
Speaker-independent raw waveform model for glottal excitation
Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets on acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to train a speaker-independent waveform generator more effectively with limited resources. We present a multi-speaker 'GlotNet' vocoder, which uses a WaveNet to generate glottal excitation waveforms that are then used to excite the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model compares favourably to a direct WaveNet vocoder trained with the same model architecture and data.
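The source-filter decomposition the abstract relies on can be sketched in a few lines: an excitation signal (here a crude impulse train plus noise, not a WaveNet-generated glottal waveform) drives an all-pole vocal tract filter. The formant frequencies and bandwidths below are illustrative assumptions.

```python
# Minimal source-filter synthesis sketch: an excitation signal is passed
# through an all-pole "vocal tract" filter. The excitation, formant
# frequencies, and bandwidths are illustrative, not GlotNet's learned model.
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sample rate (Hz)
f0 = 120                        # fundamental frequency (Hz)
n = fs // 2                     # half a second of audio

# Source: a crude glottal excitation, an impulse train plus a little noise.
excitation = np.zeros(n)
excitation[::fs // f0] = 1.0
excitation += 0.01 * np.random.randn(n)

def resonator(freq, bw):
    """Denominator of a two-pole resonator at `freq` Hz with bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    return np.array([1.0, -2 * r * np.cos(theta), r * r])

# Filter: cascade two resonances (formants) into one all-pole denominator.
a = np.convolve(resonator(500, 100), resonator(1500, 150))
speech = lfilter([1.0], a, excitation)   # excite the vocal tract filter
speech /= np.max(np.abs(speech))         # normalize to [-1, 1]
```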
Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis
Neural vocoders model the raw audio waveform and synthesize high-quality
audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to
run real-time on a low-end device like a smartglass. A pure digital signal
processing (DSP) based vocoder can be implemented via lightweight fast Fourier
transforms (FFT) and is therefore an order of magnitude faster than any neural
vocoder. However, a DSP vocoder often yields lower audio quality because it
consumes over-smoothed acoustic model predictions of approximate vocal tract
representations.
In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder
in which the acoustic model and the DSP vocoder are jointly optimized, so that
no extracted spectral feature is needed to represent the vocal tract. The model achieves
audio quality comparable to neural vocoders with a high average MOS of 4.36
while being as efficient as a DSP vocoder. Our C++ implementation, without any
hardware-specific optimization, runs at 15 MFLOPS, surpassing MB-MelGAN by a
factor of 340 in FLOPS, and achieves a vocoder-only real-time factor (RTF) of
0.003 and an overall RTF of 0.044 while running single-threaded on a 2 GHz
Intel Xeon CPU.
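The real-time factors quoted above are processing time divided by the duration of audio produced (RTF < 1 means faster than real time). A minimal measurement sketch follows, with a hypothetical `synthesize` placeholder standing in for the vocoder call:

```python
# Real-time factor: RTF = synthesis time / duration of audio produced.
# `synthesize` is a hypothetical stand-in for any vocoder; the sample
# rate and duration are illustrative.
import time
import numpy as np

def synthesize(n_samples, fs=24000):
    """Placeholder vocoder: returns silence in place of real synthesis."""
    return np.zeros(n_samples)

fs = 24000
n_samples = 10 * fs                  # request 10 seconds of audio

start = time.perf_counter()
audio = synthesize(n_samples, fs)
elapsed = time.perf_counter() - start

rtf = elapsed / (len(audio) / fs)    # e.g. 0.044 -> ~23x faster than real time
print(f"RTF = {rtf:.4f}")
```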