1,615 research outputs found
ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems
This paper proposes a WaveNet-based neural excitation model (ExcitNet) for
statistical parametric speech synthesis systems. Conventional WaveNet-based
neural vocoding systems significantly improve the perceptual quality of
synthesized speech by statistically generating a time sequence of speech
waveforms through an auto-regressive framework. However, they often suffer from
noisy outputs because of the difficulties in capturing the complicated
time-varying nature of speech signals. To improve modeling efficiency, the
proposed ExcitNet vocoder employs an adaptive inverse filter to decouple
spectral components from the speech signal. The residual component, i.e. the excitation signal, is then modeled and generated within the WaveNet framework.
In this way, the quality of the synthesized speech signal can be further
improved since the spectral component is well represented by a deep learning
framework and, moreover, the residual component is efficiently generated by the
WaveNet framework. Experimental results show that the proposed ExcitNet
vocoder, trained both speaker-dependently and speaker-independently,
outperforms traditional linear prediction vocoders and similarly configured
conventional WaveNet vocoders.

Comment: Accepted to the conference of EUSIPCO 2019. arXiv admin note: text overlap with arXiv:1811.0331
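To make the decoupling step concrete, below is a minimal sketch of frame-wise linear-prediction inverse filtering using librosa and scipy; the analysis order, windowing, and example audio are assumptions for illustration, not details from the paper. At synthesis time, the WaveNet-generated excitation would be passed back through the corresponding synthesis filter 1/A(z) to restore the spectral envelope.

```python
# A sketch of excitation extraction via LPC inverse filtering (an assumed
# stand-in for ExcitNet's adaptive inverse filter, not the paper's code).
import numpy as np
import librosa
import scipy.signal

def extract_excitation(frame, order=24):
    # librosa.lpc returns filter coefficients [1, a_1, ..., a_p]; filtering
    # the frame with them yields the prediction residual (the excitation).
    a = librosa.lpc(frame, order=order)
    return scipy.signal.lfilter(a, [1.0], frame)

wav, sr = librosa.load(librosa.example("trumpet"), sr=16000)  # stand-in audio
frame = wav[:1024] * np.hanning(1024)      # one analysis frame
excitation = extract_excitation(frame)     # spectrally flattened residual
```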
Singing voice synthesis based on convolutional neural networks
The present paper describes a singing voice synthesis technique based on convolutional
neural networks (CNNs). Singing voice synthesis systems based on deep neural
networks (DNNs) are currently being proposed and are improving the naturalness
of synthesized singing voices. In these systems, the relationship between
musical score feature sequences and acoustic feature sequences extracted from
singing voices is modeled by DNNs. Then, an acoustic feature sequence of an
arbitrary musical score is output in units of frames by the trained DNNs, and a
natural trajectory of a singing voice is obtained by using a parameter
generation algorithm. As singing voices contain rich expression, a powerful
technique to model them accurately is required. In the proposed technique,
long-term dependencies of singing voices are modeled by CNNs. An acoustic
feature sequence is generated in units of segments that consist of long-term
frames, and a natural trajectory is obtained without the parameter generation
algorithm. Experimental results in a subjective listening test show that the
proposed architecture can synthesize natural-sounding singing voices.

Comment: Singing voice samples (Japanese, English, Chinese): https://www.techno-speech.com/news-20181214a-en
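As a rough, hypothetical illustration of segment-wise CNN generation (none of the layer sizes or feature dimensions below come from the paper), a stack of 1-D convolutions can emit acoustic features for a whole long-term segment in one pass, with the stacked receptive field covering the long-term dependencies:

```python
# A toy sketch of segment-level generation (assumed dimensions; PyTorch).
import torch
import torch.nn as nn

class SegmentCNN(nn.Module):
    """Maps score features to acoustic features for a whole segment at
    once, so no frame-wise parameter generation algorithm is required."""
    def __init__(self, score_dim=60, acoustic_dim=187, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(score_dim, channels, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(channels, acoustic_dim, kernel_size=1))

    def forward(self, score):       # (batch, score_dim, frames)
        return self.net(score)      # (batch, acoustic_dim, frames)

model = SegmentCNN()
segment = model(torch.randn(1, 60, 400))  # one long-term, 400-frame segment
```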
Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis
Recent studies have shown that text-to-speech synthesis quality can be
improved by using glottal vocoding. This refers to vocoders that parameterize speech into the two parts of the human speech production apparatus: the glottal excitation and the vocal tract. Current glottal vocoders generate the
glottal excitation waveform by using deep neural networks (DNNs). However, the
squared error-based training of the present glottal excitation models is
limited to generating conditional average waveforms, which fails to capture the
stochastic variation of the waveforms. As a result, shaped noise is added as
post-processing. In this study, we propose a new method for predicting glottal
waveforms by generative adversarial networks (GANs). GANs are generative models
that aim to embed the data distribution in a latent space, enabling generation
of new instances very similar to the original by randomly sampling the latent
distribution. The glottal pulses generated by GANs show a stochastic component
similar to natural glottal pulses. In our experiments, we compare synthetic
speech generated using glottal waveforms produced by both DNNs and GANs. The
results show that the newly proposed GANs achieve synthesis quality comparable
to that of widely used DNNs, without using an additive noise component.

Comment: Accepted in Interspeech
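A toy sketch of the adversarial setup is given below, assuming PyTorch and invented pulse/feature dimensions; the noise input to the generator is what supplies the stochastic component that squared-error training averages away:

```python
# A toy sketch of adversarial glottal-pulse modeling (assumed shapes and
# architectures; not the authors' configuration).
import torch
import torch.nn as nn

class PulseGenerator(nn.Module):
    """Maps acoustic conditioning plus noise to one glottal pulse; the
    noise input provides the stochastic component."""
    def __init__(self, cond_dim=47, noise_dim=100, pulse_len=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, pulse_len), nn.Tanh())
    def forward(self, cond, z):
        return self.net(torch.cat([cond, z], dim=-1))

generator = PulseGenerator()
discriminator = nn.Sequential(nn.Linear(400, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

cond = torch.randn(8, 47)                     # acoustic features per pulse
fake = generator(cond, torch.randn(8, 100))   # sampled glottal pulses
# Generator loss: make the discriminator judge generated pulses as natural.
g_loss = bce(discriminator(fake), torch.ones(8, 1))
```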
A Waveform Representation Framework for High-quality Statistical Parametric Speech Synthesis
State-of-the-art statistical parametric speech synthesis (SPSS) generally
uses a vocoder to represent speech signals and parameterize them into features
for subsequent modeling. Magnitude spectrum has been a dominant feature over
the years. Although perceptual studies have shown that the phase spectrum is essential to the quality of synthesized speech, it is often discarded by assuming a minimum-phase filter during synthesis, and the speech quality suffers. To bypass
this bottleneck in vocoded speech, this paper proposes a phase-embedded
waveform representation framework and establishes a magnitude-phase joint
modeling platform for high-quality SPSS. Our experiments on waveform
reconstruction show that the performance is better than that of the widely-used
STRAIGHT. Furthermore, the proposed modeling and synthesis platform outperforms
a leading-edge, vocoded, deep bidirectional long short-term memory recurrent
neural network (DBLSTM-RNN)-based baseline system in various objective
evaluation metrics.

Comment: Accepted and will appear in APSIPA 2015; keywords: speech synthesis, LSTM-RNN, vocoder, phase, waveform, modeling
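One way to realize such a phase-embedded representation is sketched below, assuming an STFT parameterization with librosa; the cos/sin phase encoding is an illustrative choice to avoid 2π wrapping, not necessarily the paper's feature set:

```python
# A sketch of a phase-embedded waveform representation (assumed
# parameterization; not necessarily the paper's exact features).
import numpy as np
import librosa

wav, sr = librosa.load(librosa.example("libri1"), sr=16000)
stft = librosa.stft(wav, n_fft=1024, hop_length=256)

log_mag = np.log(np.abs(stft) + 1e-8)   # the conventional magnitude feature
phase = np.angle(stft)                  # usually thrown away at synthesis

# Encode phase as cos/sin to avoid 2*pi wrapping, then stack it with the
# magnitude so both can be modeled jointly per frame.
features = np.concatenate([log_mag, np.cos(phase), np.sin(phase)], axis=0)

# With the true phase kept, the waveform is reconstructed exactly rather
# than through a minimum-phase approximation.
recon = librosa.istft(np.abs(stft) * np.exp(1j * phase), hop_length=256)
```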
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation
WaveCycleGAN has recently been proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis. It provides fast inference by using a moving-average model rather than an autoregressive one, and high-quality synthesis through adversarial training. However, the human ear can still distinguish the processed speech
waveforms from natural ones. One possible cause of this distinguishability is
the aliasing observed in the processed speech waveform via down/up-sampling
modules. To solve the aliasing and provide higher quality speech synthesis, we
propose WaveCycleGAN2, which 1) uses generators without down/up-sampling
modules and 2) combines discriminators of the waveform domain and acoustic
parameter domain. The results show that the proposed method 1) alleviates the
aliasing well, 2) is useful for both speech waveforms generated by
analysis-and-synthesis and statistical parametric speech synthesis, and 3)
achieves a mean opinion score comparable to those of natural speech and speech
synthesized by WaveNet (open WaveNet) and WaveGlow while processing speech
samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.

Comment: Submitted to INTERSPEECH 2019
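The combination of the two discriminator domains can be sketched as follows, assuming PyTorch/torchaudio and invented architectures; summing LSGAN-style losses from a waveform discriminator and a mel-spectrogram discriminator penalizes artifacts visible in either view:

```python
# A toy sketch of dual-domain discrimination (assumed architectures;
# requires torch and torchaudio).
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=24000, n_mels=80)

wave_disc = nn.Sequential(              # judges realism of the raw waveform
    nn.Conv1d(1, 64, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
    nn.Conv1d(64, 1, 3, padding=1))
feat_disc = nn.Sequential(              # judges realism of acoustic parameters
    nn.Conv2d(1, 32, 3, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 3, padding=1))

def generator_adv_loss(fake_wav):       # fake_wav: (batch, 1, samples)
    # LSGAN-style losses from both domains are summed, so the generator
    # is penalized for artifacts visible in either representation.
    adv_wave = (1.0 - wave_disc(fake_wav)).pow(2).mean()
    adv_feat = (1.0 - feat_disc(mel(fake_wav))).pow(2).mean()
    return adv_wave + adv_feat

loss = generator_adv_loss(torch.randn(2, 1, 24000))
```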
Speaker-adaptive neural vocoders for parametric speech synthesis systems
This paper proposes speaker-adaptive neural vocoders for parametric
text-to-speech (TTS) systems. Recently proposed WaveNet-based neural vocoding
systems successfully generate a time sequence of speech samples within an
autoregressive framework. However, it remains a challenge to synthesize
high-quality speech when the amount of a target speaker's training data is
insufficient. To generate more natural speech signals with the constraint of
limited training data, we propose a speaker adaptation task with an effective
variation of neural vocoding models. In the proposed method, a
speaker-independent training method is applied to capture universal attributes
embedded in multiple speakers, and the trained model is then optimized to
represent the specific characteristics of the target speaker. Experimental
results verify that the proposed TTS systems with speaker-adaptive neural
vocoders outperform those with traditional source-filter model-based vocoders
and those with WaveNet vocoders, trained either speaker-dependently or
speaker-independently. In particular, our TTS system achieves 3.80 and 3.77 MOS
for the Korean male and Korean female speakers, respectively, even though we used only a ten-minute speech corpus to train the model.

Comment: Accepted to the IEEE MMSP 2020 workshop
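The two-stage scheme reduces to pretrain-then-fine-tune. Below is a toy sketch with a stand-in model, an invented checkpoint name, and dummy data; the reduced learning rate in stage 2 is an assumption meant to let ten minutes of target data refine, rather than overwrite, the universal weights:

```python
# A toy sketch of speaker-adaptive vocoder training (stand-in model and
# dummy data; the checkpoint name and sizes are invented).
import torch
import torch.nn as nn

# Stand-in vocoder: one mel frame in, one hop of waveform samples out.
vocoder = nn.Sequential(nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, 256))

# Stage 1: speaker-independent training on pooled multi-speaker data
# captures universal attributes; here we just mimic saving its result.
torch.save(vocoder.state_dict(), "si_vocoder.pt")

# Stage 2: re-load the universal weights and fine-tune on the target
# speaker's small corpus with a reduced learning rate.
vocoder.load_state_dict(torch.load("si_vocoder.pt"))
optimizer = torch.optim.Adam(vocoder.parameters(), lr=1e-5)

mel = torch.randn(64, 80)     # dummy target-speaker mel frames
wav = torch.randn(64, 256)    # dummy aligned waveform chunks
for _ in range(10):           # a few adaptation steps
    loss = nn.functional.mse_loss(vocoder(mel), wav)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```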
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks
A method for statistical parametric speech synthesis incorporating generative
adversarial networks (GANs) is proposed. Although powerful deep neural network (DNN) techniques can be applied to artificially synthesize speech waveforms,
the synthetic speech quality is low compared with that of natural speech. One
of the issues causing the quality degradation is an over-smoothing effect often
observed in the generated speech parameters. A GAN introduced in this paper
consists of two neural networks: a discriminator to distinguish natural and
generated samples, and a generator to deceive the discriminator. In the
proposed framework incorporating the GANs, the discriminator is trained to
distinguish natural and generated speech parameters, while the acoustic models
are trained to minimize the weighted sum of the conventional minimum generation
loss and an adversarial loss for deceiving the discriminator. Since the
objective of the GANs is to minimize the divergence (i.e., distribution
difference) between the natural and generated speech parameters, the proposed
method effectively alleviates the over-smoothing effect on the generated speech
parameters. We evaluated the effectiveness for text-to-speech and voice
conversion, and found that the proposed method can generate more natural spectral parameters and F0 than the conventional minimum generation error training algorithm, regardless of its hyper-parameter settings. Furthermore, we
investigated the effect of the divergence of various GANs, and found that a
Wasserstein GAN minimizing the Earth-Mover's distance works the best in terms
of improving synthetic speech quality.

Comment: Preprint manuscript of IEEE/ACM Transactions on Audio, Speech and Language Processing
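The training objective described above reduces to a weighted sum of two terms. The sketch below, with invented dimensions and a stand-in weight w_adv, shows the generation-error term pulling outputs toward targets while the adversarial term pushes their distribution toward natural speech:

```python
# A toy sketch of the combined objective (invented sizes; PyTorch).
import torch
import torch.nn as nn

acoustic_model = nn.Linear(300, 60)   # linguistic features -> speech params
discriminator = nn.Sequential(nn.Linear(60, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

linguistic = torch.randn(32, 300)
natural_params = torch.randn(32, 60)

generated = acoustic_model(linguistic)
# The generation-error term pulls outputs toward the natural targets; the
# adversarial term pushes their distribution toward that of natural
# speech, which is what counteracts over-smoothing.
l_gen = nn.functional.mse_loss(generated, natural_params)
l_adv = bce(discriminator(generated), torch.ones(32, 1))  # deceive D
w_adv = 1.0                           # weighting hyper-parameter
loss = l_gen + w_adv * l_adv
```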
WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks
We propose a learning-based filter that allows us to directly modify a
synthetic speech waveform into a natural speech waveform. Speech-processing
systems using a vocoder framework such as statistical parametric speech
synthesis and voice conversion are convenient, especially when training data are limited, because they can represent and process interpretable acoustic features in a compact space, such as the fundamental frequency (F0) and
mel-cepstrum. However, a well-known problem that leads to the quality
degradation of generated speech is an over-smoothing effect that eliminates
some detailed structure of generated/converted acoustic features. To address
this issue, we propose a synthetic-to-natural speech waveform conversion
technique that uses cycle-consistent adversarial networks and which does not
require any explicit assumption about speech waveform in adversarial learning.
In contrast to current techniques, since our modification is performed at the
waveform level, we expect that the proposed method will also make it possible
to generate `vocoder-less' sounding speech even if the input speech is
synthesized using a vocoder framework. The experimental results demonstrate
that our proposed method can 1) alleviate the over-smoothing effect of the
acoustic features despite the direct modification method used for the waveform
and 2) greatly improve the naturalness of the generated speech sounds.

Comment: SLT 2018
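The cycle-consistency constraint at the heart of the method can be sketched as follows (single-layer stand-ins for the two generators; adversarial terms omitted): converting a waveform to the other domain and back must reproduce the input, which makes training possible without paired data or explicit waveform assumptions:

```python
# A toy sketch of the cycle-consistency term on waveform segments.
import torch
import torch.nn as nn

G_s2n = nn.Conv1d(1, 1, 15, padding=7)   # synthetic -> natural
G_n2s = nn.Conv1d(1, 1, 15, padding=7)   # natural -> synthetic

synthetic = torch.randn(4, 1, 16000)     # vocoded speech segments
natural = torch.randn(4, 1, 16000)       # natural speech segments

# Mapping to the other domain and back must return the input; this
# constrains training without paired data or explicit assumptions about
# the waveform (the adversarial terms are omitted for brevity).
l_cycle = (G_n2s(G_s2n(synthetic)) - synthetic).abs().mean() \
        + (G_s2n(G_n2s(natural)) - natural).abs().mean()
```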
Analysing Shortcomings of Statistical Parametric Speech Synthesis
Output from statistical parametric speech synthesis (SPSS) remains noticeably
worse than natural speech recordings in terms of quality, naturalness, speaker
similarity, and intelligibility in noise. There are many hypotheses regarding
the origins of these shortcomings, but these hypotheses are often kept vague
and presented without empirical evidence that could confirm and quantify how a
specific shortcoming contributes to imperfections in the synthesised speech.
Throughout speech synthesis literature, surprisingly little work is dedicated
towards identifying the perceptually most important problems in speech
synthesis, even though such knowledge would be of great value for creating
better SPSS systems.
In this book chapter, we analyse some of the shortcomings of SPSS. In
particular, we discuss issues with vocoding and present a general methodology
for quantifying the effect of any of the many assumptions and design choices
that hold SPSS back. The methodology is accompanied by an example that
carefully measures and compares the severity of perceptual limitations imposed
by vocoding as well as other factors such as the statistical model and its use.

Comment: 34 pages with 4 figures; draft book chapter
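A concrete instance of the methodology is copy-synthesis: analyzing natural speech into vocoder parameters and resynthesizing immediately, so that any perceptual degradation is attributable to vocoding alone. The sketch below uses the WORLD vocoder via pyworld as an assumed example; the chapter's actual vocoder and stimuli may differ:

```python
# A sketch of copy-synthesis to isolate vocoder degradation (WORLD is an
# assumed example vocoder; requires librosa, pyworld, and soundfile).
import numpy as np
import librosa
import pyworld
import soundfile as sf

wav, fs = librosa.load(librosa.example("libri1"), sr=16000)  # stand-in speech
wav = wav.astype(np.float64)          # pyworld expects double precision

# Analysis into vocoder parameters and immediate resynthesis: no
# statistical model intervenes, so any quality drop is due to vocoding.
f0, sp, ap = pyworld.wav2world(wav, fs)
copy_synth = pyworld.synthesize(f0, sp, ap, fs)

sf.write("copy_synth.wav", copy_synth, fs)  # compare with the original by ear
```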
Fast and High-Quality Singing Voice Synthesis System based on Convolutional Neural Networks
The present paper describes singing voice synthesis based on convolutional
neural networks (CNNs). Singing voice synthesis systems based on deep neural
networks (DNNs) are currently being proposed and are improving the naturalness
of synthesized singing voices. As singing voices represent a rich form of
expression, a powerful technique to model them accurately is required. In the
proposed technique, long-term dependencies of singing voices are modeled by
CNNs. An acoustic feature sequence is generated for each segment that consists
of long-term frames, and a natural trajectory is obtained without the parameter
generation algorithm. Furthermore, a computational complexity reduction
technique, which drives the DNNs in different time units depending on the type of musical score features, is proposed. Experimental results show that the proposed method can synthesize natural-sounding singing voices much faster than the conventional method.

Comment: Accepted to ICASSP 2020. Singing voice samples (Japanese, English, Chinese): https://www.techno-speech.com/news-20181214a-en. arXiv admin note: substantial text overlap with arXiv:1904.0686
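The multi-rate idea can be sketched as follows (all sizes invented): slowly varying score features are processed at a coarse time unit and upsampled to the frame rate, so the bulk of the computation runs at a fraction of the frame rate:

```python
# A toy sketch of driving networks at different time units (PyTorch;
# dimensions and the 8x rate ratio are invented for the example).
import torch
import torch.nn as nn

coarse_net = nn.Conv1d(30, 64, 3, padding=1)      # note-level features, 1/8 rate
fine_net = nn.Conv1d(64 + 10, 187, 3, padding=1)  # frame-level features

note_feats = torch.randn(1, 30, 50)    # 50 coarse steps
frame_feats = torch.randn(1, 10, 400)  # 400 frames (8 frames per coarse step)

coarse = coarse_net(note_feats)                          # cheap: 50 steps only
coarse_up = nn.functional.interpolate(coarse, size=400)  # broadcast to frames
acoustic = fine_net(torch.cat([coarse_up, frame_feats], dim=1))
```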