Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions
This paper introduces an improved generative model for statistical parametric
speech synthesis (SPSS) based on WaveNet under a multi-task learning framework.
Unlike the original WaveNet model, the proposed Multi-task WaveNet employs
frame-level acoustic feature prediction as a secondary task, which allows the
external fundamental frequency (F0) prediction model required by the original
WaveNet to be removed. The improved WaveNet can therefore generate high-quality
speech waveforms conditioned only on linguistic features. By avoiding the
accumulation of pitch prediction errors, Multi-task WaveNet produces more
natural and expressive speech, and its inference procedure is simpler than that
of the original WaveNet. Experimental results show that the proposed SPSS
method outperforms the state-of-the-art approach based on the original WaveNet
in both objective evaluations and subjective preference tests.

Comment: Accepted by Interspeech 201
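The multi-task setup described above combines a primary sample-level waveform objective with a secondary frame-level acoustic-feature objective. A minimal sketch of such a combined loss, assuming a categorical cross-entropy over quantized waveform samples, a mean-squared error on frame-level features, and a hypothetical weighting factor `alpha` (all names and the weighting scheme are assumptions, not the paper's exact formulation):

```python
import numpy as np

def multitask_loss(sample_logits, sample_targets, frame_pred, frame_targets, alpha=0.5):
    """Hypothetical combined loss for a multi-task waveform model.

    sample_logits : (N, Q) unnormalized scores over Q quantization levels
    sample_targets: (N,)   integer target level per waveform sample
    frame_pred    : (T, D) predicted frame-level acoustic features
    frame_targets : (T, D) reference frame-level acoustic features
    alpha         : assumed weight of the secondary task
    """
    # primary task: softmax cross-entropy over quantized waveform samples
    shifted = sample_logits - sample_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(sample_targets)), sample_targets].mean()
    # secondary task: mean-squared error on frame-level acoustic features
    mse = np.mean((frame_pred - frame_targets) ** 2)
    return ce + alpha * mse
```

With uniform logits the cross-entropy term reduces to `log(Q)`, and a perfect frame prediction zeroes the secondary term, so the two contributions can be inspected independently.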
Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension
This paper presents a waveform modeling and generation method using
hierarchical recurrent neural networks (HRNNs) for speech bandwidth extension
(BWE). Unlike conventional BWE methods, which predict spectral parameters for
reconstructing wideband speech waveforms, this method models and predicts
waveform samples directly, without using vocoders. Inspired by SampleRNN, an
unconditional neural audio generator, the HRNN model represents the
distribution of each wideband or high-frequency waveform sample conditioned on
the input narrowband waveform samples, using a neural network composed of long
short-term memory (LSTM) layers and feed-forward (FF) layers. The LSTM layers
form a hierarchical structure in which each layer operates at a specific
temporal resolution, efficiently capturing long-span dependencies between
temporal sequences. Furthermore, additional conditions, such as bottleneck (BN)
features derived from narrowband speech by a deep neural network (DNN)-based
state classifier, are employed as auxiliary inputs to further improve the
quality of the generated wideband speech. Experimental comparisons of several
waveform modeling methods show that the HRNN-based method achieves better
speech quality and run-time efficiency than the dilated convolutional neural
network (DCNN)-based method and the plain sample-level recurrent neural network
(SRNN)-based method. The proposed method also outperforms a conventional
vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of
the reconstructed wideband speech.

Comment: Accepted by IEEE Transactions on Audio, Speech and Language Processing
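The hierarchical structure described above can be illustrated with a toy two-tier generator: a slow tier that runs once per frame of narrowband input, and a fast tier that runs once per sample and is conditioned on the slow tier's state. This is a sketch of the tiered-resolution idea only, not the paper's model: simple tanh RNN cells stand in for the LSTM layers, all weights are random placeholders, and the frame size and hidden width are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x, h, Wx, Wh):
    # simple tanh RNN cell standing in for an LSTM layer
    return np.tanh(x @ Wx + h @ Wh)

def hrnn_generate(narrowband, frame_size=4, hidden=8):
    """Toy two-tier hierarchy: a slow tier steps once per frame of
    narrowband samples (coarse temporal resolution); its hidden state
    conditions a fast per-sample tier that emits one output per sample.
    Weights are random placeholders (shapes and sizes are assumptions)."""
    Wx_slow = rng.standard_normal((frame_size, hidden)) * 0.1
    Wh_slow = rng.standard_normal((hidden, hidden)) * 0.1
    Wx_fast = rng.standard_normal((1 + hidden, hidden)) * 0.1
    Wh_fast = rng.standard_normal((hidden, hidden)) * 0.1
    W_out = rng.standard_normal((hidden, 1)) * 0.1

    h_slow = np.zeros(hidden)
    h_fast = np.zeros(hidden)
    out = []
    for start in range(0, len(narrowband) - frame_size + 1, frame_size):
        frame = narrowband[start:start + frame_size]
        # slow tier: one step per frame
        h_slow = rnn_step(frame, h_slow, Wx_slow, Wh_slow)
        for s in frame:
            # fast tier: one step per sample, conditioned on the slow state
            x = np.concatenate(([s], h_slow))
            h_fast = rnn_step(x, h_fast, Wx_fast, Wh_fast)
            out.append(float(h_fast @ W_out))
    return np.array(out)
```

Because the slow tier advances only once per frame while the fast tier advances per sample, the slow tier can carry longer-span context cheaply, which is the efficiency argument the abstract makes for the hierarchical design.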