Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder
In this paper, we investigate the effectiveness of a quasi-periodic WaveNet
(QPNet) vocoder combined with a statistical spectral conversion technique for a
voice conversion task. The WaveNet (WN) vocoder has been applied as the
waveform generation module in many different voice conversion frameworks and
achieves significant improvement over conventional vocoders. However, because
of the fixed dilated convolution and generic network architecture, the WN
vocoder lacks robustness against unseen input features and often requires a
huge network size to achieve acceptable speech quality. Such limitations
usually lead to performance degradation in the voice conversion task. To
overcome this problem, the QPNet vocoder is applied, which includes a
pitch-dependent dilated convolution component to enhance the pitch
controllability and attain a more compact network than the WN vocoder. In the
proposed method, input spectral features are first converted using a framewise
deep neural network, and then the QPNet vocoder generates converted speech
conditioned on the linearly converted prosodic and transformed spectral
features. The experimental results confirm that the QPNet vocoder achieves
significantly better performance than a same-size WN vocoder while
maintaining speech quality comparable to a double-size WN vocoder.
Index Terms: WaveNet, vocoder, voice conversion, pitch-dependent dilated convolution, pitch controllability
Comment: 6 pages, 7 figures, Proc. SSW10, 201
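The core idea behind the pitch-dependent dilated convolution can be sketched as follows: the effective dilation at each frame scales with the pitch period (sample rate divided by F0), divided by a dense factor. This is a minimal illustrative sketch, not the paper's implementation; the names `sample_rate` and `dense_factor` and the unvoiced fallback are assumptions.

```python
import numpy as np

def pitch_dependent_dilations(f0, sample_rate=24000, dense_factor=4):
    """Return an integer dilation per frame from a framewise F0 track.

    Sketch of the QPNet idea: the dilation tracks the pitch period
    (sample_rate / f0) scaled by a dense factor, so the receptive
    field adapts to the fundamental frequency. Unvoiced frames
    (f0 == 0) fall back to dilation 1, i.e. an ordinary fixed
    dilated convolution. `dense_factor` is an illustrative name.
    """
    f0 = np.asarray(f0, dtype=float)
    dilations = np.ones_like(f0, dtype=int)
    voiced = f0 > 0
    dilations[voiced] = np.maximum(
        1, np.round(sample_rate / (f0[voiced] * dense_factor)).astype(int)
    )
    return dilations

# Higher pitch -> shorter period -> smaller dilation.
print(pitch_dependent_dilations([0.0, 100.0, 200.0]))  # [ 1 60 30]
```

Because the dilation shrinks as F0 rises, a small network can cover one pitch cycle at any fundamental frequency, which is how the QPNet vocoder stays compact while remaining pitch-controllable.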
Neural Voice Cloning with a Few Samples
Voice cloning is a highly desired feature for personalized speech interfaces.
Neural network based speech synthesis has been shown to generate high quality
speech for a large number of speakers. In this paper, we introduce a neural
voice cloning system that takes a few audio samples as input. We study two
approaches: speaker adaptation and speaker encoding. Speaker adaptation is
based on fine-tuning a multi-speaker generative model with a few cloning
samples. Speaker encoding is based on training a separate model to directly
infer a new speaker embedding from cloning audios and to be used with a
multi-speaker generative model. In terms of naturalness of the speech and its
similarity to original speaker, both approaches can achieve good performance,
better naturalness and similarity, speaker encoding requires significantly
less cloning time and memory, making it favorable for low-resource
deployment.
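The speaker-encoding approach described above can be sketched as pooling per-utterance embeddings over the few cloning samples. This is a toy illustration, not the paper's model: `encode_utterance` is a stand-in for a trained encoder, and mean-pooling the per-utterance embeddings is one plausible aggregation choice.

```python
import numpy as np

EMB_DIM = 16  # illustrative embedding size

def encode_utterance(features):
    """Stand-in speaker encoder: mean-pool frames, then apply a fixed
    random projection. A real encoder would be a trained network."""
    rng = np.random.default_rng(0)  # fixed seed so the projection is constant
    proj = rng.standard_normal((features.shape[1], EMB_DIM))
    return features.mean(axis=0) @ proj

def infer_speaker_embedding(cloning_audios):
    """Infer one speaker embedding directly from a few cloning samples
    by averaging the per-utterance embeddings (no model fine-tuning,
    which is what distinguishes encoding from adaptation)."""
    embs = np.stack([encode_utterance(a) for a in cloning_audios])
    return embs.mean(axis=0)

# Three short "cloning samples" of framewise features (frames x feature dims).
samples = [np.random.default_rng(i).standard_normal((50, 8)) for i in range(3)]
emb = infer_speaker_embedding(samples)
print(emb.shape)  # (16,)
```

The contrast with speaker adaptation is that adaptation would instead fine-tune the generative model's weights on the same samples, which costs more time and memory per new speaker but can yield better naturalness and similarity.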