Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
Generating versatile and appropriate synthetic speech requires control over
the output expression separate from the spoken text. Important non-textual
speech variation is seldom annotated, in which case output control must be
learned in an unsupervised fashion. In this paper, we perform an in-depth study
of methods for unsupervised learning of control in statistical speech
synthesis. For example, we show that popular unsupervised training heuristics
can be interpreted as variational inference in certain autoencoder models. We
additionally connect these models to VQ-VAEs, another recently proposed class
of deep variational autoencoders, which we show can be derived from a very
similar mathematical argument. The implications of these new probabilistic
interpretations are discussed. We illustrate the utility of the various
approaches with an application to acoustic modelling for emotional speech
synthesis, where the unsupervised methods for learning expression control
(without access to emotional labels) are found to give results that in many
aspects match or surpass the previous best supervised approach.
Comment: 17 pages, 4 figures
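To make the autoencoder view concrete, here is a minimal sketch of unsupervised control learning with a VAE: an utterance-level latent summarizing the acoustics is inferred by an encoder, and the decoder reconstructs the acoustics from text features plus that latent. The architecture, dimensions, and loss weighting below are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class ControlVAE(nn.Module):
    def __init__(self, acoustic_dim=80, text_dim=256, latent_dim=16):
        super().__init__()
        self.encoder = nn.GRU(acoustic_dim, 128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.GRU(text_dim + latent_dim, 128, batch_first=True)
        self.out = nn.Linear(128, acoustic_dim)

    def forward(self, acoustics, text_feats):
        _, h = self.encoder(acoustics)                # utterance-level summary
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        z_seq = z.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        y, _ = self.decoder(torch.cat([text_feats, z_seq], dim=-1))
        return self.out(y), mu, logvar

def elbo_loss(recon, target, mu, logvar, beta=1.0):
    recon_term = torch.mean((recon - target) ** 2)   # Gaussian likelihood term
    kl = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())
    return recon_term + beta * kl
```

At synthesis time, z can be sampled from the prior or set by hand, which is what makes the learned embedding act as an output control.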
CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
The prosodic aspects of speech signals produced by current text-to-speech
systems are typically averaged over training material, and as such lack the
variety and liveliness found in natural speech. To avoid monotony and averaged
prosody contours, it is desirable to have a way of modeling the variation in
the prosodic aspects of speech, so audio signals can be synthesized in multiple
ways for a given text. We present a new, hierarchically structured conditional
variational autoencoder to generate prosodic features (fundamental frequency,
energy and duration) suitable for use with a vocoder or a generative model like
WaveNet. At inference time, an embedding representing the prosody of a sentence
may be sampled from the variational layer to allow for prosodic variation. To
efficiently capture the hierarchical nature of the linguistic input (words,
syllables and phones), both the encoder and decoder parts of the auto-encoder
are hierarchical, in line with the linguistic structure, with layers being
clocked dynamically at the respective rates. We show in our experiments that
our dynamic hierarchical network outperforms a non-hierarchical
state-of-the-art baseline, and, additionally, that prosody transfer across
sentences is possible by employing the prosody embedding of one sentence to
generate the speech signal of another.
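The dynamically clocked hierarchy can be illustrated with a small sketch: phone-rate states are pooled once per syllable (or word) and broadcast back, so slower layers tick at the linguistic rate. The grouping scheme and shapes are assumptions for illustration, not the CHiVE implementation.

```python
import torch

def pool_to_rate(phone_states, group_ids, num_groups):
    """Average phone-rate states within each syllable (or word) group."""
    d = phone_states.size(-1)
    sums = torch.zeros(num_groups, d).index_add_(0, group_ids, phone_states)
    counts = torch.zeros(num_groups).index_add_(
        0, group_ids, torch.ones(len(group_ids)))
    return sums / counts.unsqueeze(-1)

def broadcast_to_phones(group_states, group_ids):
    """Repeat each syllable-rate state for every phone it spans."""
    return group_states[group_ids]

# e.g. 5 phones forming 2 syllables: group ids [0, 0, 1, 1, 1]
phones = torch.randn(5, 8)
syll_ids = torch.tensor([0, 0, 1, 1, 1])
syllables = pool_to_rate(phones, syll_ids, num_groups=2)   # (2, 8)
back = broadcast_to_phones(syllables, syll_ids)            # (5, 8)
```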
HpRNet: Incorporating Residual Noise Modeling for Violin in a Variational Parametric Synthesizer
Generative Models for Audio Synthesis have been gaining momentum in the last
few years. More recently, parametric representations of the audio signal have
been incorporated to facilitate better musical control of the synthesized
output. In this work, we investigate a parametric model for violin tones, in
particular the generative modeling of the residual bow noise to achieve a more
natural tone quality. To aid in our analysis, we introduce a dataset of
Carnatic Violin Recordings where bow noise is an integral part of the playing
style of higher pitched notes in specific gestural contexts. We obtain insights
about each of the harmonic and residual components of the signal, as well as
their interdependence, via observations on the latent space derived in the
course of variational encoding of the spectral envelopes of the sustained
sounds.
Comment: https://github.com/SubramaniKrishna/HpRNet
Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework
In this paper, we aim to improve the quality of synthesized speech in
statistical parametric speech synthesis (SPSS) based on a generative
adversarial network (GAN). In particular, we propose a novel architecture
combining the traditional acoustic loss function and the GAN's discriminative
loss under a multi-task learning (MTL) framework. The mean squared error (MSE)
is usually used to estimate the parameters of deep neural networks, which only
considers the numerical difference between the raw audio and the synthesized
one. To mitigate this problem, we introduce the GAN as a second task to
determine whether the input is natural speech under the given conditions. In this
MTL framework, the MSE optimization improves the stability of GAN, and at the
same time GAN produces samples with a distribution closer to natural speech.
Listening tests show that the multi-task architecture generates speech that
listeners perceive as more natural than that of conventional methods.
Comment: Submitted to the Automatic Speech Recognition and Understanding
(ASRU) 2017 Workshop
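A minimal sketch of the multi-task combination: the acoustic model minimizes MSE against target features plus an adversarial term from a discriminator judging naturalness. The discriminator interface and the 0.25 weight are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_feats, target_feats, disc, adv_weight=0.25):
    # Task 1: conventional acoustic loss on the feature values.
    mse = F.mse_loss(pred_feats, target_feats)
    # Task 2: adversarial term pushing the discriminator to call fakes real.
    adv = F.binary_cross_entropy_with_logits(
        disc(pred_feats), torch.ones(pred_feats.size(0), 1))
    return mse + adv_weight * adv

def discriminator_loss(disc, real_feats, fake_feats):
    real = F.binary_cross_entropy_with_logits(
        disc(real_feats), torch.ones(real_feats.size(0), 1))
    fake = F.binary_cross_entropy_with_logits(
        disc(fake_feats.detach()), torch.zeros(fake_feats.size(0), 1))
    return real + fake
```

The MSE term anchors training (stabilizing the GAN), while the adversarial term pulls the generated feature distribution toward that of natural speech.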
Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder
Recent advances in neural autoregressive models have improved the performance
of speech synthesis (SS). However, they lack the ability to model global
characteristics of speech (such as speaker individuality or speaking style),
particularly when these characteristics have not been labeled, so making neural
autoregressive SS systems more expressive is still an open issue. In this
paper, we propose to combine VoiceLoop, an autoregressive SS model, with
Variational Autoencoder (VAE). This approach, unlike traditional autoregressive
SS systems, uses VAE to model the global characteristics explicitly, enabling
the expressiveness of the synthesized speech to be controlled in an
unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show
the VAE helps VoiceLoop to generate higher quality speech and to control the
expressions in its synthesized speech by incorporating global characteristics
into the speech generating process.
Comment: Accepted by Interspeech 2018
VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019
We describe our submitted system for the ZeroSpeech Challenge 2019. The
current challenge theme addresses the difficulty of constructing a speech
synthesizer without any text or phonetic labels and requires a system that can
(1) discover subword units in an unsupervised way, and (2) synthesize the
speech with a target speaker's voice. Moreover, the system should also balance
the ABX discrimination score, the bit rate of the compressed representation, and
the naturalness and intelligibility of the constructed voice. To tackle these
problems and achieve the best trade-off, we utilize a vector quantized
variational autoencoder (VQ-VAE) and a multi-scale codebook-to-spectrogram
(Code2Spec) inverter trained by mean square error and adversarial loss. The
VQ-VAE encodes the speech into a latent space, maps each latent vector to its
nearest codebook entry, and produces a compressed representation. Next, the
inverter generates a magnitude spectrogram in the target voice, given the codebook
vectors from VQ-VAE. In our experiments, we also investigated several other
clustering algorithms, including K-Means and GMM, and compared them with the
VQ-VAE result on ABX scores and bit rates. Our proposed approach significantly
improved the intelligibility (in CER), the MOS, and discrimination ABX scores
compared to the official ZeroSpeech 2019 baseline or even the topline.
Comment: Submitted to Interspeech 2019
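The quantization step at the heart of such a system can be sketched as follows: each encoder frame snaps to its nearest codebook vector, with a straight-through estimator so gradients reach the encoder. The commitment weight follows the original VQ-VAE paper's convention; the submitted system's exact settings are not shown here.

```python
import torch

def vector_quantize(z_e, codebook, beta=0.25):
    """z_e: (frames, dim) encoder output; codebook: (K, dim)."""
    dists = torch.cdist(z_e, codebook)          # distance to every code
    idx = dists.argmin(dim=1)                   # nearest code per frame
    z_q = codebook[idx]
    # Codebook loss moves codes toward encoder outputs; commitment loss
    # keeps encoder outputs near their chosen codes.
    codebook_loss = ((z_q - z_e.detach()) ** 2).mean()
    commit_loss = beta * ((z_e - z_q.detach()) ** 2).mean()
    z_q = z_e + (z_q - z_e).detach()            # straight-through gradient
    return z_q, idx, codebook_loss + commit_loss
```

The discrete indices `idx` are what get measured for bit rate and ABX; the quantized vectors `z_q` feed the spectrogram inverter.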
Bayesian Subspace HMM for the Zerospeech 2020 Challenge
In this paper we describe our submission to the Zerospeech 2020 challenge,
where the participants are required to discover latent representations from
unannotated speech, and to use those representations to perform speech
synthesis, with synthesis quality used as a proxy metric for the unit quality.
In our system, we use the Bayesian Subspace Hidden Markov Model (SHMM) for unit
discovery. The SHMM models each unit as an HMM whose parameters are constrained
to lie in a low dimensional subspace of the total parameter space which is
trained to model phonetic variability. Our system compares favorably with the
baseline on the human-evaluated character error rate while maintaining
significantly lower unit bitrate.
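A minimal sketch of the subspace constraint: each unit's HMM parameters are generated from a low-dimensional unit embedding through a shared affine map, so all units lie in a subspace of the full parameter space. The shapes and the restriction to per-state Gaussian means are illustrative simplifications.

```python
import torch

subspace_dim, n_states, feat_dim = 10, 3, 40
# Shared subspace parameters, trained across units to capture phonetic
# variability; each unit then only needs a 10-dimensional embedding.
W_mean = torch.randn(n_states * feat_dim, subspace_dim)
b_mean = torch.randn(n_states * feat_dim)

def unit_hmm_means(h_u):
    """Map a unit embedding to its per-state Gaussian mean vectors."""
    return (W_mean @ h_u + b_mean).view(n_states, feat_dim)

h_unit = torch.randn(subspace_dim)    # one discovered unit's embedding
means = unit_hmm_means(h_unit)        # (3, 40): one mean per HMM state
```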
Probability density distillation with generative adversarial networks for high-quality parallel waveform generation
This paper proposes an effective probability density distillation (PDD)
algorithm for WaveNet-based parallel waveform generation (PWG) systems.
Recently proposed teacher-student frameworks in the PWG system have
successfully achieved real-time generation of speech signals. However, the
difficulty of optimizing the PDD criterion without auxiliary losses results in
quality degradation of the synthesized speech. To generate more natural speech
signals within the teacher-student framework, we propose a novel optimization
criterion based on generative adversarial networks (GANs). In the proposed
method, the inverse autoregressive flow-based student model is incorporated as
a generator in the GAN framework, and jointly optimized by the PDD mechanism
with the proposed adversarial learning method. As this process encourages the
student to model the distribution of realistic speech waveforms, the perceptual
quality of the synthesized speech becomes much more natural. Our experimental
results verify that the PWG systems with the proposed method outperform both
those using conventional approaches and autoregressive generation systems
with a well-trained teacher WaveNet.
Comment: Accepted to INTERSPEECH 2019
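A minimal sketch of combining distillation with adversarial learning: the student's samples are scored against the teacher's density (distillation term) and by a discriminator (adversarial term). The sampled KL surrogate and the weight are illustrative assumptions; the paper's exact PDD estimator and auxiliary losses differ in detail.

```python
import torch
import torch.nn.functional as F

def student_loss(student_wave, student_logprob, teacher_logprob_fn,
                 disc, adv_weight=4.0):
    # Distillation: sampled estimate of KL(student || teacher), i.e.
    # E_student[log q(x) - log p_teacher(x)] over generated waveforms.
    distill = (student_logprob - teacher_logprob_fn(student_wave)).mean()
    # Adversarial: encourage the discriminator to rate samples as real speech.
    adv = F.binary_cross_entropy_with_logits(
        disc(student_wave), torch.ones(student_wave.size(0), 1))
    return distill + adv_weight * adv
```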
MaCow: Masked Convolutional Generative Flow
Flow-based generative models are conceptually attractive due to the tractability
of both exact log-likelihood computation and latent-variable inference, and the
efficiency of both training and sampling; they have led to a number of impressive
empirical successes and spawned many advanced variants and theoretical
investigations. Despite their computational efficiency, the density estimation
performance of flow-based generative models significantly falls behind those of
state-of-the-art autoregressive models. In this work, we introduce masked
convolutional generative flow (MaCow), a simple yet effective architecture of
generative flow using masked convolution. By restricting the local connectivity
in a small kernel, MaCow enjoys the properties of fast and stable training, and
efficient sampling, while achieving significant improvements over Glow for
density estimation on standard image benchmarks, considerably narrowing the gap
to autoregressive models.
Comment: In Proceedings of the Thirty-third Conference on Neural Information
Processing Systems (NeurIPS 2019)
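The masked-convolution ingredient can be sketched as a small-kernel convolution whose weights at and after the centre are zeroed, so each output depends only on preceding pixels within a local window. This shows the building block only, not MaCow's full flow architecture.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0     # centre pixel and everything right of it
        mask[kh // 2 + 1:, :] = 0       # all rows below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Apply the mask to the kernel so connectivity stays strictly causal.
        return nn.functional.conv2d(
            x, self.weight * self.mask, self.bias,
            self.stride, self.padding, self.dilation, self.groups)

conv = MaskedConv2d(3, 16, kernel_size=3, padding=1)
y = conv(torch.randn(1, 3, 8, 8))
```

Restricting connectivity to a small causal window is what keeps the flow step invertible with a tractable Jacobian while remaining far cheaper than a full autoregressive model.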
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently-developed WaveNet architecture is the current state of the art
in realistic speech synthesis, consistently rated as more natural sounding for
many different languages than any previous system. However, because WaveNet
relies on sequential generation of one audio sample at a time, it is poorly
suited to today's massively parallel computers, and therefore hard to deploy in
a real-time production setting. This paper introduces Probability Density
Distillation, a new method for training a parallel feed-forward network from a
trained WaveNet with no significant difference in quality. The resulting system
is capable of generating high-fidelity speech samples more than 20 times
faster than real time, and is deployed online by Google Assistant, including
serving multiple English and Japanese voices.
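Why the distilled student samples in parallel can be sketched in a few lines: an inverse-autoregressive-flow step conditions its shift and scale only on the input noise, which is available up front, so all audio samples are produced in one pass. The toy convolutional conditioner below is an assumption for illustration, not Parallel WaveNet's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IAFStep(nn.Module):
    def __init__(self):
        super().__init__()
        # Conditioner over the noise sequence; strict causality is enforced
        # by the left padding in forward().
        self.net = nn.Conv1d(1, 2, kernel_size=3)

    def forward(self, z):
        h = F.pad(z, (3, 0))                      # pad left: see only past noise
        mu, log_s = self.net(h)[:, :, :z.size(-1)].chunk(2, dim=1)
        return z * torch.exp(log_s) + mu          # all samples in one pass

z = torch.randn(1, 1, 16000)       # 1 s of white noise at 16 kHz
x = IAFStep()(z)                   # entire waveform generated at once
```

Because nothing depends on previously generated audio, the whole second of speech is computed in a single forward pass, which is what makes the feed-forward student deployable in real time.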