491 research outputs found
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech
synthesis directly from text. The system is composed of a recurrent
sequence-to-sequence feature prediction network that maps character embeddings
to mel-scale spectrograms, followed by a modified WaveNet model acting as a
vocoder to synthesize timedomain waveforms from those spectrograms. Our model
achieves a mean opinion score (MOS) of comparable to a MOS of for
professionally recorded speech. To validate our design choices, we present
ablation studies of key components of our system and evaluate the impact of
using mel spectrograms as the input to WaveNet instead of linguistic, duration,
and features. We further demonstrate that using a compact acoustic
intermediate representation enables significant simplification of the WaveNet
architecture.Comment: Accepted to ICASSP 201
A Fully Time-domain Neural Model for Subband-based Speech Synthesizer
This paper introduces a deep neural network model for subband-based speech
synthesizer. The model benefits from the short bandwidth of the subband signals
to reduce the complexity of the time-domain speech generator. We employed the
multi-level wavelet analysis/synthesis to decompose/reconstruct the signal into
subbands in time domain. Inspired from the WaveNet, a convolutional neural
network (CNN) model predicts subband speech signals fully in time domain. Due
to the short bandwidth of the subbands, a simple network architecture is enough
to train the simple patterns of the subbands accurately. In the ground truth
experiments with teacher-forcing, the subband synthesizer outperforms the
fullband model significantly in terms of both subjective and objective
measures. In addition, by conditioning the model on the phoneme sequence using
a pronunciation dictionary, we have achieved the fully time-domain neural model
for subband-based text-to-speech (TTS) synthesizer, which is nearly end-to-end.
The generated speech of the subband TTS shows comparable quality as the
fullband one with a slighter network architecture for each subband.Comment: 5 pages, 3 figur
- …