Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
Generating versatile and appropriate synthetic speech requires control over
the output expression separate from the spoken text. Important non-textual
speech variation is seldom annotated, in which case output control must be
learned in an unsupervised fashion. In this paper, we perform an in-depth study
of methods for unsupervised learning of control in statistical speech
synthesis. For example, we show that popular unsupervised training heuristics
can be interpreted as variational inference in certain autoencoder models. We
additionally connect these models to VQ-VAEs, another recently proposed class
of deep variational autoencoders, which we show can be derived from a very
similar mathematical argument. The implications of these new probabilistic
interpretations are discussed. We illustrate the utility of the various
approaches with an application to acoustic modelling for emotional speech
synthesis, where the unsupervised methods for learning expression control
(without access to emotional labels) are found to give results that in many
aspects match or surpass the previous best supervised approach. Comment: 17 pages, 4 figures
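As a rough illustration of the kind of model this paper analyses, the sketch below (PyTorch; layer sizes and names are illustrative assumptions, not the authors' code) shows a conditional-VAE acoustic model in which an unsupervised control latent z is inferred from the target acoustics and decoded together with text-derived features, trained with the usual negative ELBO.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ControlVAE(nn.Module):
    def __init__(self, acoustic_dim=80, text_dim=256, latent_dim=16, hidden=256):
        super().__init__()
        # Recognition network q(z | acoustics): summarises an utterance into mu, log-variance.
        self.enc = nn.GRU(acoustic_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # Decoder p(acoustics | text features, z): frame-level regression.
        self.dec = nn.Sequential(
            nn.Linear(text_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, acoustics, text_feats):
        # acoustics: (B, T, acoustic_dim), text_feats: (B, T, text_dim)
        _, h = self.enc(acoustics)                                  # h: (1, B, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterisation trick
        z_tiled = z.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        recon = self.dec(torch.cat([text_feats, z_tiled], dim=-1))
        # Negative ELBO: reconstruction error plus KL to the standard-normal prior.
        rec_loss = F.mse_loss(recon, acoustics)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return rec_loss + kl, z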
Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis
Recent studies have shown that text-to-speech synthesis quality can be
improved by using glottal vocoding. This refers to vocoders that parameterize
speech into two parts, the glottal excitation and vocal tract, that occur in
the human speech production apparatus. Current glottal vocoders generate the
glottal excitation waveform by using deep neural networks (DNNs). However, the
squared error-based training of the present glottal excitation models is
limited to generating conditional average waveforms, which fails to capture the
stochastic variation of the waveforms. As a result, shaped noise is added as
post-processing. In this study, we propose a new method for predicting glottal
waveforms by generative adversarial networks (GANs). GANs are generative models
that aim to embed the data distribution in a latent space, enabling generation
of new instances very similar to the original by randomly sampling the latent
distribution. The glottal pulses generated by GANs show a stochastic component
similar to natural glottal pulses. In our experiments, we compare synthetic
speech generated using glottal waveforms produced by both DNNs and GANs. The
results show that the newly proposed GANs achieve synthesis quality comparable
to that of widely-used DNNs, without using an additive noise component. Comment: Accepted in Interspeech 2017
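A minimal sketch of the idea, not the paper's architecture: a generator maps noise plus acoustic conditioning to a fixed-length glottal pulse while a discriminator scores real versus generated pulses, so the generator can learn the stochastic fine structure that squared-error training averages away. All sizes and the BCE objective below are assumptions.

import torch
import torch.nn as nn

PULSE_LEN, COND_DIM, NOISE_DIM = 400, 22, 64     # illustrative sizes

G = nn.Sequential(                               # generator: (noise, conditioning) -> glottal pulse
    nn.Linear(NOISE_DIM + COND_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, PULSE_LEN), nn.Tanh(),
)
D = nn.Sequential(                               # discriminator: (pulse, conditioning) -> realness score
    nn.Linear(PULSE_LEN + COND_DIM, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1),
)

def gan_step(real_pulse, cond, opt_g, opt_d, bce=nn.BCEWithLogitsLoss()):
    noise = torch.randn(real_pulse.size(0), NOISE_DIM)
    fake = G(torch.cat([noise, cond], dim=-1))
    # Discriminator update: real pulses labelled 1, generated pulses labelled 0.
    d_loss = bce(D(torch.cat([real_pulse, cond], -1)), torch.ones(real_pulse.size(0), 1)) + \
             bce(D(torch.cat([fake.detach(), cond], -1)), torch.zeros(real_pulse.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: try to fool the discriminator.
    g_loss = bce(D(torch.cat([fake, cond], -1)), torch.ones(real_pulse.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()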
Adversarial Audio Synthesis
Audio signals are sampled at high temporal resolutions, and learning to
synthesize audio requires capturing structure across a range of timescales.
Generative adversarial networks (GANs) have seen wide success at generating
images that are both locally and globally coherent, but they have seen little
application to audio generation. In this paper we introduce WaveGAN, a first
attempt at applying GANs to unsupervised synthesis of raw-waveform audio.
WaveGAN is capable of synthesizing one-second slices of audio waveforms with
global coherence, suitable for sound effect generation. Our experiments
demonstrate that, without labels, WaveGAN learns to produce intelligible words
when trained on a small-vocabulary speech dataset, and can also synthesize
audio from other domains such as drums, bird vocalizations, and piano. We
compare WaveGAN to a method that applies GANs designed for image generation to image-like audio feature representations, finding both approaches to be promising. Comment: Published as a conference paper at ICLR 2019
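The sketch below illustrates a WaveGAN-style generator under assumed layer sizes (it is not the published implementation): stacked one-dimensional transposed convolutions upsample a noise vector to 16,384 samples, roughly one second of 16 kHz audio, so coherence must be learned across sample-level and global timescales.

import torch
import torch.nn as nn

class WaveGANGenerator(nn.Module):
    def __init__(self, noise_dim=100, model_dim=64):
        super().__init__()
        self.fc = nn.Linear(noise_dim, 16 * model_dim * 16)        # -> (B, 16*d, 16) after reshape
        self.net = nn.Sequential(
            # each layer upsamples time by 4: 16 -> 64 -> 256 -> 1024 -> 4096 -> 16384
            nn.ConvTranspose1d(16 * model_dim, 8 * model_dim, 25, stride=4, padding=11, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(8 * model_dim, 4 * model_dim, 25, stride=4, padding=11, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(4 * model_dim, 2 * model_dim, 25, stride=4, padding=11, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(2 * model_dim, model_dim, 25, stride=4, padding=11, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(model_dim, 1, 25, stride=4, padding=11, output_padding=1), nn.Tanh(),
        )

    def forward(self, z):
        x = self.fc(z).view(z.size(0), -1, 16)                     # short, wide feature map
        return self.net(x)                                         # (B, 1, 16384) raw waveform

# Example: one second of audio from a random latent vector.
wave = WaveGANGenerator()(torch.randn(1, 100))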
NAUTILUS: a Versatile Voice Cloning System
We introduce a novel speech synthesis system, called NAUTILUS, that can
generate speech with a target voice either from a text input or a reference
utterance of an arbitrary source speaker. By using a multi-speaker speech
corpus to train all requisite encoders and decoders in the initial training
stage, our system can clone unseen voices using untranscribed speech of target
speakers on the basis of the backpropagation algorithm. Moreover, depending on
the data circumstance of the target speaker, the cloning strategy can be
adjusted to take advantage of additional data and modify the behaviors of
text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the
situation. We test the performance of the proposed framework by using deep
convolution layers to model the encoders, decoders and WaveNet vocoder.
Evaluations show that it achieves quality comparable to state-of-the-art TTS
and VC systems when cloning with just five minutes of untranscribed speech.
Moreover, it is demonstrated that the proposed framework has the ability to
switch between TTS and VC with high speaker consistency, which will be useful
for many applications. Comment: Submitted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing
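A heavily simplified sketch of the shared-latent idea behind such voice-cloning systems (module names, sizes, and the reconstruction loss are assumptions, not NAUTILUS itself): a text encoder and a speech encoder map into one latent space, a shared decoder generates acoustics, and cloning an unseen voice fine-tunes the decoder on untranscribed target speech through the speech-encoder (voice-conversion) path only.

import torch
import torch.nn as nn
import torch.nn.functional as F

latent, acoustic_dim, text_dim = 128, 80, 256
text_enc   = nn.GRU(text_dim, latent, batch_first=True)        # TTS path input
speech_enc = nn.GRU(acoustic_dim, latent, batch_first=True)    # VC / cloning path input
decoder    = nn.GRU(latent, acoustic_dim, batch_first=True)    # shared acoustic decoder

def tts(text_feats):                  # text -> shared latent -> acoustics
    h, _ = text_enc(text_feats)
    y, _ = decoder(h)
    return y

def vc(src_acoustics):                # speech -> shared latent -> acoustics (no transcript needed)
    h, _ = speech_enc(src_acoustics)
    y, _ = decoder(h)
    return y

def clone_step(target_acoustics, opt):
    # Adaptation on untranscribed target-speaker speech: reconstruct it through the
    # speech-encoder path and backpropagate into the shared decoder.
    loss = F.mse_loss(vc(target_acoustics), target_acoustics)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)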
Listening while Speaking: Speech Chain by Deep Learning
Despite the close relationship between speech perception and production,
research in automatic speech recognition (ASR) and text-to-speech synthesis
(TTS) has progressed more or less independently without exerting much mutual
influence on each other. In human communication, on the other hand, a
closed-loop speech chain mechanism with auditory feedback from the speaker's
mouth to her ear is crucial. In this paper, we take a step further and develop
a closed-loop speech chain model based on deep learning. The
sequence-to-sequence model in a closed-loop architecture allows us to train our
model on the concatenation of both labeled and unlabeled data. While ASR
transcribes the unlabeled speech features, TTS attempts to reconstruct the
original speech waveform based on the text from ASR. In the opposite direction,
ASR also attempts to reconstruct the original text transcription given the
synthesized speech. To the best of our knowledge, this is the first deep
learning model that integrates human speech perception and production
behaviors. Our experimental results show that the proposed approach
significantly improved performance compared to separate systems trained only on labeled data.
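A toy sketch of the loss composition in such a speech chain (the stand-in linear models and MSE losses are assumptions; the actual systems are sequence-to-sequence models): paired data trains both models directly, unpaired speech is cycled ASR -> TTS against a reconstruction loss, and unpaired text is cycled TTS -> ASR.

import torch
import torch.nn as nn
import torch.nn.functional as F

SPEECH_DIM, TEXT_DIM = 80, 64
asr = nn.Linear(SPEECH_DIM, TEXT_DIM)    # stand-in for a seq2seq ASR model
tts = nn.Linear(TEXT_DIM, SPEECH_DIM)    # stand-in for a seq2seq TTS model

def chain_losses(paired_speech, paired_text, unpaired_speech, unpaired_text):
    # Supervised terms on the paired subset.
    sup = F.mse_loss(asr(paired_speech), paired_text) + F.mse_loss(tts(paired_text), paired_speech)
    # Unpaired speech: transcribe with ASR, then ask TTS to reconstruct the input speech.
    speech_cycle = F.mse_loss(tts(asr(unpaired_speech)), unpaired_speech)
    # Unpaired text: synthesize with TTS, then ask ASR to recover the input text.
    text_cycle = F.mse_loss(asr(tts(unpaired_text)), unpaired_text)
    return sup + speech_cycle + text_cycle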
Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder
Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, because they lack the ability to model global characteristics of speech (such as speaker identity or speaking style), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive is still an open issue. In this
paper, we propose to combine VoiceLoop, an autoregressive SS model, with
Variational Autoencoder (VAE). This approach, unlike traditional autoregressive
SS systems, uses VAE to model the global characteristics explicitly, enabling
the expressiveness of the synthesized speech to be controlled in an
unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show
the VAE helps VoiceLoop to generate higher quality speech and to control the
expressions in its synthesized speech by incorporating global characteristics
into the speech generating process. Comment: Accepted by Interspeech 2018
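At inference time the global latent can be used in two ways, sketched below under an assumed model interface (encode_reference and decode are hypothetical names, not the paper's API): sample z from the prior for unsupervised expression variation, or infer it from a reference utterance to transfer that utterance's style.

import torch

latent_dim = 32                                     # illustrative size

def synthesize(model, text, reference=None):
    if reference is None:
        z = torch.randn(1, latent_dim)              # sample the prior: varied, uncontrolled expression
    else:
        mu, logvar = model.encode_reference(reference)
        z = mu                                      # posterior mean: copy the reference style
    return model.decode(text, z)                    # autoregressive decoder conditioned on z at every step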
CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
The prosodic aspects of speech signals produced by current text-to-speech
systems are typically averaged over training material, and as such lack the
variety and liveliness found in natural speech. To avoid monotony and averaged
prosody contours, it is desirable to have a way of modeling the variation in
the prosodic aspects of speech, so audio signals can be synthesized in multiple
ways for a given text. We present a new, hierarchically structured conditional
variational autoencoder to generate prosodic features (fundamental frequency,
energy and duration) suitable for use with a vocoder or a generative model like
WaveNet. At inference time, an embedding representing the prosody of a sentence
may be sampled from the variational layer to allow for prosodic variation. To
efficiently capture the hierarchical nature of the linguistic input (words,
syllables and phones), both the encoder and decoder parts of the auto-encoder
are hierarchical, in line with the linguistic structure, with layers being
clocked dynamically at the respective rates. We show in our experiments that
our dynamic hierarchical network outperforms a non-hierarchical
state-of-the-art baseline, and, additionally, that prosody transfer across
sentences is possible by employing the prosody embedding of one sentence to
generate the speech signal of another.
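One way to emulate the dynamic clocking described above is to pool lower-rate states over the known linguistic segmentation, as in this sketch (mean pooling and the sizes are assumptions, not the CHiVE implementation): phone-level states advance every phone, while syllable- and word-level states advance once per syllable or word.

import torch

def pool_by_segments(states, seg_ids):
    # states: (T, D) lower-level states; seg_ids: (T,) index of the higher-level unit
    # each state belongs to. Returns one averaged state per higher-level unit.
    n_segs = int(seg_ids.max().item()) + 1
    sums = torch.zeros(n_segs, states.size(1)).index_add_(0, seg_ids, states)
    counts = torch.zeros(n_segs).index_add_(0, seg_ids, torch.ones_like(seg_ids, dtype=torch.float))
    return sums / counts.unsqueeze(1)

# Example: 6 phone states grouped into 3 syllables, then into 2 words.
phone_states = torch.randn(6, 8)
syl_states  = pool_by_segments(phone_states, torch.tensor([0, 0, 1, 1, 2, 2]))
word_states = pool_by_segments(syl_states,  torch.tensor([0, 0, 1]))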
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper we
investigate under what conditions the neural sequence-to-sequence TTS can work
well in Japanese and English along with comparisons with deep neural network
(DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline
systems also use autoregressive probabilistic modeling and a neural vocoder. We
investigated systems from three aspects: a) model architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high-quality speech, b) it should also use a powerful encoder when it takes characters as input, and c) the encoder still has room for improvement and needs a better architecture to learn supra-segmental features more appropriately.
VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019
We describe our submitted system for the ZeroSpeech Challenge 2019. The
current challenge theme addresses the difficulty of constructing a speech
synthesizer without any text or phonetic labels and requires a system that can
(1) discover subword units in an unsupervised way, and (2) synthesize the
speech with a target speaker's voice. Moreover, the system should also balance
the discrimination score ABX, the bit-rate compression rate, and the
naturalness and the intelligibility of the constructed voice. To tackle these
problems and achieve the best trade-off, we utilize a vector quantized
variational autoencoder (VQ-VAE) and a multi-scale codebook-to-spectrogram
(Code2Spec) inverter trained by mean square error and adversarial loss. The
VQ-VAE encodes the speech into a latent space and maps each latent vector to its nearest codebook entry, producing a compressed representation. Next, the inverter generates a magnitude spectrogram in the target voice, given the codebook vectors from the VQ-VAE. In our experiments, we also investigated several other
clustering algorithms, including K-Means and GMM, and compared them with the
VQ-VAE result on ABX scores and bit rates. Our proposed approach significantly
improved intelligibility (measured by CER), MOS, and ABX discrimination scores compared to the official ZeroSpeech 2019 baseline and even the topline. Comment: Submitted to Interspeech 2019
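The central VQ-VAE step, sketched below with illustrative sizes (not the submitted system), maps each encoder output to its nearest codebook vector and passes gradients straight through, yielding the discrete unit sequence whose bit rate and ABX score the challenge measures.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z_e):
        # z_e: (B, T, dim) continuous encoder outputs.
        dist = torch.cdist(z_e, self.codebook.weight.unsqueeze(0).expand(z_e.size(0), -1, -1))
        codes = dist.argmin(dim=-1)                      # (B, T) discrete unit indices
        z_q = self.codebook(codes)                       # quantized latents
        # Straight-through estimator: copy decoder gradients to the encoder.
        z_q_st = z_e + (z_q - z_e).detach()
        # Codebook and commitment terms from the VQ-VAE objective.
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() + 0.25 * ((z_e - z_q.detach()) ** 2).mean()
        return z_q_st, codes, vq_loss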
Investigating context features hidden in End-to-End TTS
Recent studies have introduced end-to-end TTS, which integrates the
production of context and acoustic features in statistical parametric speech
synthesis. As a result, a single neural network replaced laborious feature
engineering with automated feature learning. However, little is known about
what types of context information end-to-end TTS extracts from text input
before synthesizing speech, and the previous knowledge about context features
is barely utilized. In this work, we first point out the model similarity
between end-to-end TTS and parametric TTS. Based on the similarity, we evaluate
the quality of encoder outputs from an end-to-end TTS system against eight
criteria that are derived from a standard set of context information used in
parametric TTS. We conduct experiments using an evaluation procedure that has
been newly developed in the machine learning literature for quantitative
analysis of neural representations, while adapting it to the TTS domain.
Experimental results show that the encoder outputs reflect both linguistic and
phonetic contexts, such as vowel reduction at phoneme level, lexical stress at
syllable level, and part-of-speech at word level, possibly due to the joint
optimization of context and acoustic features. Comment: Accepted to ICASSP 2019
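A common way to run such an analysis is a probing classifier, sketched below with assumed names and sizes (the paper's own evaluation procedure differs in detail): freeze the TTS encoder, fit a small classifier that predicts a linguistic attribute such as lexical stress from the encoder outputs, and read above-chance accuracy as evidence that the attribute is encoded.

import torch
import torch.nn as nn
import torch.nn.functional as F

def probe_accuracy(encoder_outputs, labels, num_classes, epochs=100):
    # encoder_outputs: (N, D) frozen encoder representations; labels: (N,) attribute ids (long).
    clf = nn.Linear(encoder_outputs.size(1), num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(clf(encoder_outputs), labels).backward()
        opt.step()
    # Training-set accuracy of the probe; a held-out split would be used in practice.
    return (clf(encoder_outputs).argmax(dim=1) == labels).float().mean().item()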