A comparison of Vietnamese Statistical Parametric Speech Synthesis Systems
In recent years, statistical parametric speech synthesis (SPSS) systems have
been widely utilized in many interactive speech-based systems (e.g., Amazon's
Alexa, Bose's headphones). To select a suitable SPSS system, both speech
quality and performance efficiency (e.g., decoding time) must be taken into
account. In this paper, we compare four popular Vietnamese SPSS techniques in
terms of speech quality and performance efficiency: 1) hidden Markov models
(HMM), 2) deep neural networks (DNN), 3) generative adversarial networks
(GAN), and 4) end-to-end (E2E) architectures, the last consisting of
Tacotron 2 and the WaveGlow vocoder. We show that the E2E systems achieved the
best quality but required a GPU to reach real-time performance, whereas the
HMM-based system had inferior speech quality but was the most efficient.
Surprisingly, the E2E systems were more efficient than the DNN and GAN systems
for inference on a GPU, and the GAN-based system did not outperform the DNN
system in terms of quality.
Comment: 9 pages, submitted to KSE 202
A Waveform Representation Framework for High-quality Statistical Parametric Speech Synthesis
State-of-the-art statistical parametric speech synthesis (SPSS) generally
uses a vocoder to represent speech signals and parameterize them into features
for subsequent modeling. Magnitude spectrum has been a dominant feature over
the years. Although perceptual studies have shown that phase spectrum is
essential to the quality of synthesized speech, it is often ignored by using a
minimum phase filter during synthesis and the speech quality suffers. To bypass
this bottleneck in vocoded speech, this paper proposes a phase-embedded
waveform representation framework and establishes a magnitude-phase joint
modeling platform for high-quality SPSS. Our experiments on waveform
reconstruction show that the performance is better than that of the widely-used
STRAIGHT. Furthermore, the proposed modeling and synthesis platform outperforms
a leading-edge, vocoded, deep bidirectional long short-term memory recurrent
neural network (DBLSTM-RNN)-based baseline system on various objective
evaluation metrics.
Comment: accepted and will appear in APSIPA2015; keywords: speech synthesis,
LSTM-RNN, vocoder, phase, waveform, modeling
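The motivating observation, that discarding the phase spectrum degrades quality, can be illustrated by resynthesizing a signal from its STFT with the true phase versus with the phase thrown away (zero phase here, as a crude proxy for the minimum-phase assumption). A small sketch using NumPy/SciPy; the signal is a synthetic stand-in for speech:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
# Synthetic "speech-like" signal: two harmonics plus light noise.
x = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t + 1.0)
x += 0.05 * np.random.randn(fs)

_, _, X = stft(x, fs=fs, nperseg=512)
magnitude, phase = np.abs(X), np.angle(X)

# Reconstruction with the true phase is near-perfect...
_, x_true = istft(magnitude * np.exp(1j * phase), fs=fs, nperseg=512)
# ...while discarding the phase audibly distorts the signal.
_, x_nophase = istft(magnitude.astype(complex), fs=fs, nperseg=512)

n = min(len(x), len(x_true), len(x_nophase))
err_true = np.mean((x[:n] - x_true[:n]) ** 2)
err_nophase = np.mean((x[:n] - x_nophase[:n]) ** 2)
print(f"MSE with true phase: {err_true:.2e}, with zero phase: {err_nophase:.2e}")
```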
Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
Generating versatile and appropriate synthetic speech requires control over
the output expression separate from the spoken text. Important non-textual
speech variation is seldom annotated, in which case output control must be
learned in an unsupervised fashion. In this paper, we perform an in-depth study
of methods for unsupervised learning of control in statistical speech
synthesis. For example, we show that popular unsupervised training heuristics
can be interpreted as variational inference in certain autoencoder models. We
additionally connect these models to VQ-VAEs, another recently proposed class
of deep variational autoencoders, which we show can be derived from a very
similar mathematical argument. The implications of these new probabilistic
interpretations are discussed. We illustrate the utility of the various
approaches with an application to acoustic modelling for emotional speech
synthesis, where the unsupervised methods for learning expression control
(without access to emotional labels) are found to give results that in many
aspects match or surpass the previous best supervised approach.
Comment: 17 pages, 4 figures
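The models in question are variational autoencoders over acoustic features, where a latent code learned without labels acts as the control variable. A minimal sketch of such a model in PyTorch; the dimensions, architecture, and KL weight are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ControlVAE(nn.Module):
    """Minimal VAE: the latent z can serve as an unsupervised control code."""

    def __init__(self, feat_dim=80, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.Tanh())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.Tanh(), nn.Linear(256, feat_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(z)
        # KL divergence of the approximate posterior from the prior N(0, I).
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return recon, kl.mean()

model = ControlVAE()
x = torch.randn(4, 80)  # a batch of acoustic feature frames
recon, kl = model(x)
loss = nn.functional.mse_loss(recon, x) + 0.01 * kl  # beta-weighted KL, assumed
```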
Deep Learning for Singing Processing: Achievements, Challenges and Impact on Singers and Listeners
This paper summarizes recent advances in a set of tasks related to the
processing of singing using state-of-the-art deep learning techniques. We
discuss their achievements in terms of accuracy and sound quality, and the
current challenges, such as availability of data and computing resources. We
also discuss the current and future impact of these advances on listeners and
singers as they are integrated into commercial applications.
Comment: Keynote speech, 2018 Joint Workshop on Machine Learning for Music.
The Federated Artificial Intelligence Meeting (FAIM), a joint workshop
program of ICML, IJCAI/ECAI, and AAMAS
Analysing Shortcomings of Statistical Parametric Speech Synthesis
Output from statistical parametric speech synthesis (SPSS) remains noticeably
worse than natural speech recordings in terms of quality, naturalness, speaker
similarity, and intelligibility in noise. There are many hypotheses regarding
the origins of these shortcomings, but these hypotheses are often kept vague
and presented without empirical evidence that could confirm and quantify how a
specific shortcoming contributes to imperfections in the synthesised speech.
Throughout the speech synthesis literature, surprisingly little work is
dedicated to identifying the perceptually most important problems in speech
synthesis, even though such knowledge would be of great value for creating
better SPSS systems.
In this book chapter, we analyse some of the shortcomings of SPSS. In
particular, we discuss issues with vocoding and present a general methodology
for quantifying the effect of any of the many assumptions and design choices
that hold SPSS back. The methodology is accompanied by an example that
carefully measures and compares the severity of perceptual limitations imposed
by vocoding as well as other factors such as the statistical model and its use.
Comment: 34 pages with 4 figures; draft book chapter
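The quantification methodology rests on isolating one factor at a time; for vocoding, the standard probe is copy-synthesis, where natural speech is analysed and immediately resynthesised so that any degradation is attributable to the vocoder alone. A sketch using the WORLD vocoder via the pyworld package; the choice of pyworld and soundfile, and the file names, are assumptions for illustration rather than the chapter's prescribed tools:

```python
import numpy as np
import pyworld
import soundfile as sf

# Load a natural recording (placeholder path); pyworld expects float64.
x, fs = sf.read("natural.wav")
x = x.astype(np.float64)

# Analysis: decompose into F0, spectral envelope, and aperiodicity.
f0, sp, ap = pyworld.wav2world(x, fs)

# Synthesis: resynthesize directly from the extracted parameters.
y = pyworld.synthesize(f0, sp, ap, fs)

# Any perceptual gap between natural.wav and copy_synthesis.wav now
# quantifies the ceiling this vocoder imposes on an SPSS system.
sf.write("copy_synthesis.wav", y, fs)
```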
Probability density distillation with generative adversarial networks for high-quality parallel waveform generation
This paper proposes an effective probability density distillation (PDD)
algorithm for WaveNet-based parallel waveform generation (PWG) systems.
Recently proposed teacher-student frameworks in the PWG system have
successfully achieved real-time generation of speech signals. However, the
difficulty of optimizing the PDD criterion without auxiliary losses results in
quality degradation of the synthesized speech. To generate more natural speech
signals within the teacher-student framework, we propose a novel optimization
criterion based on generative adversarial networks (GANs). In the proposed
method, the inverse autoregressive flow-based student model is incorporated as
a generator in the GAN framework, and jointly optimized by the PDD mechanism
with the proposed adversarial learning method. As this process encourages the
student to model the distribution of realistic speech waveforms, the perceptual
quality of the synthesized speech becomes much more natural. Our experimental
results verify that PWG systems with the proposed method outperform both those
using conventional approaches and autoregressive generation systems with a
well-trained teacher WaveNet.
Comment: Accepted to the conference of INTERSPEECH 201
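At its core, the proposed criterion augments the distillation objective with a GAN term, so the IAF student is trained both to match the teacher's density and to fool a discriminator on waveforms. A hedged PyTorch sketch of how the two terms could be combined; the least-squares GAN formulation, the KL estimator, and the weight lambda_adv are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def student_loss(student_wave, discriminator, kl_term, lambda_adv=4.0):
    """Combine probability density distillation with an adversarial term.

    kl_term: a Monte-Carlo estimate of KL(student || teacher), computed
    elsewhere from the teacher WaveNet's output distribution (assumed given).
    """
    # Least-squares GAN generator loss: push D(student output) toward 1.
    adv_term = torch.mean((discriminator(student_wave) - 1.0) ** 2)
    return kl_term + lambda_adv * adv_term

def discriminator_loss(discriminator, real_wave, student_wave):
    """Least-squares GAN discriminator loss on real vs. generated waveforms."""
    real_term = torch.mean((discriminator(real_wave) - 1.0) ** 2)
    fake_term = torch.mean(discriminator(student_wave.detach()) ** 2)
    return real_term + fake_term
```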
High quality voice conversion using prosodic and high-resolution spectral features
Voice conversion methods have advanced rapidly over the last decade. Studies
have shown that speaker characteristics are captured by spectral features as
well as various prosodic features. Most existing conversion methods focus on
spectral features, as they directly represent timbre characteristics, while
some methods have focused only on the prosodic feature represented by the
fundamental frequency. In this paper, a comprehensive
framework using deep neural networks to convert both timbre and prosodic
features is proposed. The timbre feature is represented by a high-resolution
spectral feature. The prosodic features include F0, intensity and duration. It
is well known that DNNs are useful for modeling high-dimensional features.
In this work, we show that DNNs initialized by our proposed autoencoder
pretraining yield good-quality conversion models. This pretraining is
tailor-made for voice conversion and leverages an autoencoder to capture the
generic spectral shape of source speech. Additionally, our framework uses
segmental DNN models to capture the evolution of the prosodic features over
time. To reconstruct the converted speech, the spectral feature produced by the
DNN model is combined with the three prosodic features produced by the DNN
segmental models. Our experimental results show that the application of both
prosodic and high-resolution spectral features leads to high-quality converted
speech, as measured by objective evaluations and subjective listening tests.
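The described pretraining first trains an autoencoder on source-speaker spectra and then reuses its weights to initialize the conversion DNN. A hedged PyTorch sketch of that weight-transfer step; the layer sizes, feature dimension, and exact transfer scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn

SPEC_DIM = 513  # assumed high-resolution spectral feature dimension

def make_net():
    return nn.Sequential(
        nn.Linear(SPEC_DIM, 512), nn.Tanh(),
        nn.Linear(512, 512), nn.Tanh(),
        nn.Linear(512, SPEC_DIM),
    )

# Step 1: pretrain as an autoencoder on source-speaker spectra so the
# network first learns the generic spectral shape of the source speech.
autoencoder = make_net()
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)
for spectra in [torch.randn(32, SPEC_DIM) for _ in range(100)]:  # toy data
    loss = nn.functional.mse_loss(autoencoder(spectra), spectra)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 2: initialize the source-to-target conversion DNN with the
# pretrained weights, then fine-tune on paired (source, target) frames.
conversion_dnn = make_net()
conversion_dnn.load_state_dict(autoencoder.state_dict())
```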
ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems
This paper proposes a WaveNet-based neural excitation model (ExcitNet) for
statistical parametric speech synthesis systems. Conventional WaveNet-based
neural vocoding systems significantly improve the perceptual quality of
synthesized speech by statistically generating a time sequence of speech
waveforms through an auto-regressive framework. However, they often suffer from
noisy outputs because of the difficulties in capturing the complicated
time-varying nature of speech signals. To improve modeling efficiency, the
proposed ExcitNet vocoder employs an adaptive inverse filter to decouple
spectral components from the speech signal. The residual component, i.e., the
excitation signal, is then modeled and generated within the WaveNet framework.
In this way, the quality of the synthesized speech signal can be further
improved since the spectral component is well represented by a deep learning
framework and, moreover, the residual component is efficiently generated by the
WaveNet framework. Experimental results show that the proposed ExcitNet
vocoder, trained both speaker-dependently and speaker-independently,
outperforms traditional linear prediction vocoders and similarly configured
conventional WaveNet vocoders.
Comment: Accepted to the conference of EUSIPCO 2019. arXiv admin note: text
overlap with arXiv:1811.0331
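The adaptive inverse filter in ExcitNet is, in essence, linear-prediction analysis: filtering speech through A(z) removes the spectral envelope and leaves the excitation residual. A sketch of that decoupling with librosa and SciPy; frame-wise adaptation is omitted for brevity, and the fixed filter over a short synthetic segment is an illustrative simplification:

```python
import numpy as np
import librosa
from scipy.signal import lfilter

sr = 16000
t = np.arange(int(0.03 * sr)) / sr
# Synthetic voiced-speech stand-in: a harmonic signal with light noise.
x = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 360 * t)
x += 0.01 * np.random.randn(len(t))

# Estimate LPC coefficients [1, a1, ..., ap] of the spectral envelope.
a = librosa.lpc(x, order=16)

# Inverse filtering with A(z) strips the envelope, leaving the excitation.
excitation = lfilter(a, [1.0], x)

# Filtering the excitation through 1/A(z) restores the original signal,
# which is why modeling only the excitation with WaveNet suffices.
reconstructed = lfilter([1.0], a, excitation)
print("max reconstruction error:", np.max(np.abs(x - reconstructed)))
```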
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
Emotional voice conversion aims to convert the spectrum and prosody to change
the emotional patterns of speech, while preserving the speaker identity and
linguistic content. Many studies require parallel speech data between different
emotional patterns, which is not practical in real life. Moreover, they often
model the conversion of fundamental frequency (F0) with a simple linear
transform. As F0 is a key aspect of intonation that is hierarchical in nature,
we believe it is more adequate to model F0 at different temporal scales using
the wavelet transform. We propose a CycleGAN network to find an optimal
pseudo pair from non-parallel training data by learning forward and inverse
mappings simultaneously using adversarial and cycle-consistency losses. We also
study the use of the continuous wavelet transform (CWT) to decompose F0 into
ten temporal scales that describe speech prosody at different time resolutions,
for effective F0 conversion. Experimental results show that our proposed
framework outperforms the baselines in both objective and subjective
evaluations.
Comment: accepted by Speaker Odyssey 2020 in Tokyo, Japan
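The CycleGAN objective combines adversarial losses for the forward and inverse mappings with a cycle-consistency term that forces a round trip to reproduce the input, which is what permits training without parallel pairs. A hedged PyTorch sketch of the generator-side objective; the network handles G, F, D_src, D_tgt, the least-squares GAN form, and the weight lambda_cyc are illustrative assumptions:

```python
import torch

def generator_objective(G, F, D_src, D_tgt, x_src, x_tgt, lambda_cyc=10.0):
    """Adversarial + cycle-consistency losses for non-parallel training.

    G, F are the forward/inverse mappings; D_src, D_tgt are discriminators
    on each domain; x_src, x_tgt are unpaired feature batches.
    """
    # Adversarial terms: converted features should fool the discriminator
    # of the opposite emotional domain.
    adv = torch.mean((D_tgt(G(x_src)) - 1.0) ** 2)
    adv += torch.mean((D_src(F(x_tgt)) - 1.0) ** 2)

    # Cycle-consistency: mapping forward then back must recover the input,
    # which ties the learned pseudo pairs to the original content.
    cyc = torch.mean(torch.abs(F(G(x_src)) - x_src))
    cyc += torch.mean(torch.abs(G(F(x_tgt)) - x_tgt))

    return adv + lambda_cyc * cyc
```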
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion
Emotional voice conversion aims to convert the emotion of speech from one
state to another while preserving the linguistic content and speaker identity.
The prior studies on emotional voice conversion are mostly carried out under
the assumption that emotion is speaker-dependent. We consider that there is a
common code among speakers for emotional expression in a spoken language;
therefore, a speaker-independent mapping between emotional states is possible.
In this paper, we propose a speaker-independent emotional voice conversion
framework that can convert anyone's emotion without the need for parallel
data. We propose a VAW-GAN based encoder-decoder structure to learn the
spectrum and prosody mapping. We perform prosody conversion by using continuous
wavelet transform (CWT) to model the temporal dependencies. We also investigate
the use of F0 as an additional input to the decoder to improve emotion
conversion performance. Experiments show that the proposed speaker-independent
framework achieves competitive results for both seen and unseen speakers.
Comment: Accepted by Interspeech 202
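Both emotional conversion frameworks above rely on a CWT decomposition of the F0 contour so that prosody can be converted at several temporal scales. A sketch of such a decomposition with PyWavelets; the Mexican-hat wavelet, the ten dyadic scales, and the log/normalization preprocessing are common choices in this line of work, assumed here rather than taken from either paper:

```python
import numpy as np
import pywt

# A toy voiced F0 contour (Hz); real contours would first have unvoiced
# gaps interpolated before the log transform and normalization.
frames = np.arange(400)
f0 = 120 + 20 * np.sin(2 * np.pi * frames / 200) + np.random.randn(400)
log_f0 = np.log(f0)
log_f0 = (log_f0 - log_f0.mean()) / log_f0.std()  # per-utterance normalization

# Ten dyadic scales: each row of `coeffs` describes prosody at one time
# resolution, from fast local movements to slow phrase-level trends.
scales = 2.0 ** np.arange(1, 11)
coeffs, _ = pywt.cwt(log_f0, scales, "mexh")
print(coeffs.shape)  # (10, 400): ten temporal scales over the contour
```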