Uncovering Latent Style Factors for Expressive Speech Synthesis
Prosodic modeling is a core problem in speech synthesis. The key challenge is
producing desirable prosody from textual input containing only phonetic
information. In this preliminary study, we introduce the concept of "style
tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis
model. Using style tokens, we aim to extract independent prosodic styles from
training data. We show that without annotation data or an explicit supervision
signal, our approach can automatically learn a variety of prosodic variations
in a purely data-driven way. Importantly, each style token corresponds to a
fixed style factor regardless of the given text sequence. As a result, we can
control the prosodic style of synthetic speech in a somewhat predictable and
globally consistent way.
Comment: Submitted to NIPS ML4Audio workshop and ICASSP.
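To make the idea concrete, here is a minimal PyTorch sketch of a learned style token bank; the attention form, `num_tokens`, and `token_dim` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Minimal sketch of a bank of learned style tokens (illustrative,
    not the paper's exact architecture). Attention weights over the bank
    select a combination of prosodic style factors."""

    def __init__(self, num_tokens=10, token_dim=256, query_dim=256):
        super().__init__()
        # The token bank is learned with no style labels or supervision.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(query_dim, token_dim)

    def forward(self, query):
        # query: (batch, query_dim), e.g. a summary of the training utterance.
        q = self.query_proj(query)                       # (batch, token_dim)
        scores = q @ self.tokens.t() / self.tokens.shape[1] ** 0.5
        weights = F.softmax(scores, dim=-1)              # (batch, num_tokens)
        style_embedding = weights @ torch.tanh(self.tokens)
        return style_embedding, weights
```

Because each token is a fixed parameter vector, the style factor it encodes stays the same regardless of the input text, which is what makes the control globally consistent.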
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
We present an extension to the Tacotron speech synthesis architecture that
learns a latent embedding space of prosody, derived from a reference acoustic
representation containing the desired prosody. We show that conditioning
Tacotron on this learned embedding space results in synthesized audio that
matches the prosody of the reference signal with fine time detail even when the
reference and synthesis speakers are different. Additionally, we show that a
reference prosody embedding can be used to synthesize text that is different
from that of the reference utterance. We define several quantitative and
subjective metrics for evaluating prosody transfer, and report results with
accompanying audio samples from single-speaker and 44-speaker Tacotron models
on a prosody transfer task.
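As a rough illustration of how such a prosody embedding can be computed, the sketch below compresses a reference mel spectrogram into a fixed-size vector; the convolution and GRU shapes are assumptions, not the paper's published hyperparameters. The resulting embedding would then be broadcast and combined with the text encoder states.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Sketch of a prosody reference encoder (assumed shapes). It
    compresses a reference mel spectrogram into one fixed-size
    prosody embedding, independent of the reference's text."""

    def __init__(self, n_mels=80, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), embed_dim, batch_first=True)

    def forward(self, mel):
        # mel: (batch, frames, n_mels) from the reference utterance.
        x = self.conv(mel.unsqueeze(1))       # (batch, 32, frames/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, frames/4, 32 * n_mels/4)
        _, h = self.gru(x)
        return h.squeeze(0)                   # (batch, embed_dim) prosody embedding
```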
Exploring Neural Transducers for End-to-End Speech Recognition
In this work, we perform an empirical comparison among the CTC,
RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech
recognition. We show that, without any language model, Seq2Seq and
RNN-Transducer models both outperform the best reported CTC models with a
language model, on the popular Hub5'00 benchmark. On our internal diverse
dataset, these trends continue - RNN-Transducer models rescored with a language
model after beam search outperform our best CTC models. These results simplify
the speech recognition pipeline so that decoding can now be expressed purely as
neural network operations. We also study how the choice of encoder architecture
affects the performance of the three models, both when all encoder layers are
forward-only and when encoders downsample the input representation
aggressively.
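For orientation, the snippet below shows the CTC objective, one of the three training criteria being compared, using PyTorch's built-in loss; the shapes and vocabulary size are toy values, and this is not the authors' code.

```python
import torch
import torch.nn as nn

# Toy example of the CTC objective. Shapes are illustrative:
# T=50 encoder frames, N=4 utterances, C=29 output symbols (blank=0).
T, N, C = 50, 4, 29
log_probs = torch.randn(T, N, C).log_softmax(-1).requires_grad_()
targets = torch.randint(1, C, (N, 12))     # label sequences, blank excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # trainable end to end; decoding is pure tensor ops
```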
Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
We describe a sequence-to-sequence neural network which directly generates
speech waveforms from text inputs. The architecture extends the Tacotron model
by incorporating a normalizing flow into the autoregressive decoder loop.
Output waveforms are modeled as a sequence of non-overlapping fixed-length
blocks, each one containing hundreds of samples. The interdependencies of
waveform samples within each block are modeled using the normalizing flow,
enabling parallel training and synthesis. Longer-term dependencies are handled
autoregressively by conditioning each flow on preceding blocks. This model can
be optimized directly with maximum likelihood, without using intermediate,
hand-designed features or additional loss terms. Contemporary state-of-the-art
text-to-speech (TTS) systems use a cascade of separately learned models: one
(such as Tacotron) which generates intermediate features (such as spectrograms)
from text, followed by a vocoder (such as WaveRNN) which generates waveform
samples from the intermediate features. The proposed system, in contrast, does
not use a fixed intermediate representation, and learns all parameters
end-to-end. Experiments show that the proposed model generates speech with
quality approaching a state-of-the-art neural TTS system, with significantly
improved generation speed.
Comment: 6 pages including supplement, 3 figures. Accepted to ICASSP 2021.
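The following sketch shows the general normalizing-flow machinery the abstract relies on: an affine coupling step over one waveform block, trained by maximum likelihood. The single-step flow and layer sizes are simplifying assumptions, not Wave-Tacotron's actual decoder.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One normalizing-flow step over a fixed-length waveform block
    (a sketch of the general technique). Half the samples are
    transformed conditioned on the other half, so the Jacobian
    log-determinant is cheap to compute."""

    def __init__(self, block_size=960, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(block_size // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, block_size),   # predicts scale and shift
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep scales well-behaved
        yb = xb * log_s.exp() + t
        logdet = log_s.sum(dim=-1)           # change-of-variables term
        return torch.cat([xa, yb], dim=-1), logdet

# Maximum-likelihood objective: the flow maps each block to z ~ N(0, I).
x = torch.randn(8, 960)                      # one block of waveform samples
flow = AffineCoupling()
z, logdet = flow(x)
# NLL under a standard normal prior, up to an additive constant.
nll = 0.5 * (z ** 2).sum(-1).mean() - logdet.mean()
```

Stacking several such steps, and conditioning `self.net` on the decoder state and the preceding blocks, yields the autoregressive-over-blocks structure the abstract describes.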
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis
Despite the ability to produce human-level speech for in-domain text,
attention-based end-to-end text-to-speech (TTS) systems suffer from text
alignment failures that increase in frequency for out-of-domain text. We show
that these failures can be addressed using simple location-relative attention
mechanisms that do away with content-based query/key comparisons. We compare
two families of attention mechanisms: location-relative GMM-based mechanisms
and additive energy-based mechanisms. We suggest simple modifications to
GMM-based attention that allow it to align quickly and consistently during
training, and introduce a new location-relative attention mechanism to the
additive energy-based family, called Dynamic Convolution Attention (DCA). We
compare the various mechanisms in terms of alignment speed and consistency
during training, naturalness, and ability to generalize to long utterances, and
conclude that GMM attention and DCA can generalize to very long utterances,
while preserving naturalness for shorter, in-domain utterances.
Comment: Accepted to ICASSP 2020.
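A minimal sketch of the GMM-based family is below, following a Graves-style parameterization; the softplus choices and variable names are assumptions, and DCA itself (dynamic convolutions over the previous alignment) is not shown.

```python
import torch
import torch.nn.functional as F

def gmm_attention(params, prev_mu, positions):
    """Location-relative GMM attention (a sketch, not the paper's exact
    parameterization). Mixture means can only move forward, which is
    what makes alignment monotonic and robust for long utterances."""
    # params: (batch, 3 * K) raw outputs from the attention RNN.
    w_hat, delta_hat, sigma_hat = params.chunk(3, dim=-1)
    w = F.softmax(w_hat, dim=-1)              # mixture weights, (B, K)
    mu = prev_mu + F.softplus(delta_hat)      # means move forward only
    sigma = F.softplus(sigma_hat) + 1e-4      # positive std devs
    # positions: (T,) encoder timestep indices 0..T-1.
    diff = positions[None, None, :] - mu[..., None]            # (B, K, T)
    phi = w[..., None] * torch.exp(-0.5 * (diff / sigma[..., None]) ** 2)
    return phi.sum(dim=1), mu                 # alignment (B, T), new means

B, K, T = 2, 5, 40
alpha, mu = gmm_attention(torch.randn(B, 3 * K), torch.zeros(B, K),
                          torch.arange(T, dtype=torch.float))
```

Note that the alignment depends only on location (position relative to the moving means), with no content-based query/key comparison to fail on out-of-domain text.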
Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
Recent work has explored sequence-to-sequence latent variable models for
expressive speech synthesis (supporting control and transfer of prosody and
style), but has not presented a coherent framework for understanding the
trade-offs between the competing methods. In this paper, we propose embedding
capacity (the amount of information the embedding contains about the data) as a
unified method of analyzing the behavior of latent variable models of speech,
comparing existing heuristic (non-variational) methods to variational methods
that are able to explicitly constrain capacity using an upper bound on
representational mutual information. In our proposed model (Capacitron), we
show that by adding conditional dependencies to the variational posterior such
that it matches the form of the true posterior, the same model can be used for
high-precision prosody transfer, text-agnostic style transfer, and generation
of natural-sounding prior samples. For multi-speaker models, Capacitron is able
to preserve target speaker identity during inter-speaker prosody transfer and
when drawing samples from the latent prior. Lastly, we introduce a method for
decomposing embedding capacity hierarchically across two sets of latents,
allowing a portion of the latent variability to be specified and the remaining
variability sampled from a learned prior. Audio examples are available on the
web.
Comment: Submitted to ICLR 2020.
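The capacity idea can be sketched as a penalty that steers the posterior KL term, which upper-bounds the representational mutual information, toward a target number of nats. The fixed-beta penalty below is a simplification of the paper's constrained optimization, and all names are illustrative.

```python
import torch

def capacity_limited_elbo(recon_nll, mu, logvar, capacity_nats, beta):
    """Sketch of training toward a target embedding capacity
    (illustrative; the paper uses a Lagrange-multiplier scheme rather
    than a fixed beta). The KL term upper-bounds the mutual information
    between data and embedding, so penalizing |KL - C| steers the
    effective capacity toward C nats."""
    # KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian posterior.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
    return recon_nll + beta * (kl - capacity_nats).abs(), kl

mu = torch.zeros(8, 16, requires_grad=True)
logvar = torch.zeros(8, 16)
loss, kl = capacity_limited_elbo(torch.tensor(1.0), mu, logvar,
                                 capacity_nats=20.0, beta=1.0)
```

Low capacity targets favor text-agnostic style transfer; higher targets preserve enough reference detail for high-precision prosody transfer.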
Semi-Supervised Generative Modeling for Controllable Speech Synthesis
We present a novel generative model that combines state-of-the-art neural
text-to-speech (TTS) with semi-supervised probabilistic latent variable models.
By providing partial supervision to some of the latent variables, we are able
to force them to take on consistent and interpretable purposes, which
previously hasn't been possible with purely unsupervised TTS models. We
demonstrate that our model is able to reliably discover and control important
but rarely labelled attributes of speech, such as affect and speaking rate,
with as little as 1% (30 minutes) of supervision. Even at such low supervision
levels we do not observe a degradation of synthesis quality compared to a
state-of-the-art baseline. Audio samples are available on the web.
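One way to picture the partial supervision is a loss term that only fires on the few labeled examples, as in the hypothetical sketch below; the regression form and variable names are assumptions, and unlabeled examples would be trained with the usual unsupervised objective.

```python
import torch
import torch.nn.functional as F

def semisupervised_latent_loss(predicted, labels, label_mask):
    """Sketch of partial supervision on a latent variable (illustrative
    names). For the ~1% of examples that carry a label, e.g. speaking
    rate, an extra term ties the latent to that attribute; the mask
    zeroes the term out everywhere else."""
    per_example = F.mse_loss(predicted, labels, reduction="none").mean(-1)
    denom = label_mask.sum().clamp(min=1.0)
    return (per_example * label_mask).sum() / denom

pred = torch.randn(16, 1)                  # latent tied to speaking rate
labels = torch.randn(16, 1)                # ground truth where available
mask = (torch.rand(16) < 0.01).float()     # ~1% of examples labeled
loss = semisupervised_latent_loss(pred, labels, mask)
```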
Reducing Bias in Production Speech Models
Replacing hand-engineered pipelines with end-to-end deep learning systems has
enabled strong results in applications like speech and object recognition.
However, the causality and latency constraints of production systems put
end-to-end speech models back into the underfitting regime and expose biases in
the model that we show cannot be overcome by "scaling up", i.e., training
bigger models on more data. In this work we systematically identify and address
sources of bias, reducing error rates by up to 20% while remaining practical
for deployment. We achieve this by utilizing improved neural architectures for
streaming inference, solving optimization issues, and employing strategies that
increase audio and label modelling versatility.
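The abstract does not detail the architectures, but a generic example of a streaming-friendly building block is a causal convolution, sketched below under that assumption; this is not the paper's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Sketch of a streaming-friendly layer (a generic causal
    convolution). Padding only on the left means each output frame
    depends on past audio alone, satisfying the causality constraint
    of a production streaming system."""

    def __init__(self, channels=128, kernel_size=5):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time); pad the past, never the future.
        return self.conv(F.pad(x, (self.pad, 0)))

y = CausalConv1d()(torch.randn(2, 128, 100))   # output: (2, 128, 100)
```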
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
In this work, we propose "global style tokens" (GSTs), a bank of embeddings
that are jointly trained within Tacotron, a state-of-the-art end-to-end speech
synthesis system. The embeddings are trained with no explicit labels, yet learn
to model a large range of acoustic expressiveness. GSTs lead to a rich set of
significant results. The soft interpretable "labels" they generate can be used
to control synthesis in novel ways, such as varying speed and speaking style -
independently of the text content. They can also be used for style transfer,
replicating the speaking style of a single audio clip across an entire
long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn
to factorize noise and speaker identity, providing a path towards highly
scalable but robust speech synthesis.
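A hypothetical snippet illustrating the two control modes described, manual token weighting and style transfer from a single clip, reusing the token-bank shapes from the earlier style token sketch:

```python
import torch

# Hypothetical tensors; shapes follow the StyleTokenLayer sketch above.
num_tokens, token_dim = 10, 256
tokens = torch.tanh(torch.randn(num_tokens, token_dim))  # trained GST bank

# Control: instead of attending over a reference clip, set weights by hand.
weights = torch.zeros(num_tokens)
weights[3] = 1.0                        # e.g. a token that learned "fast"
style_embedding = weights @ tokens      # condition the Tacotron decoder on this

# Transfer: infer the weights once from a single audio clip, then reuse
# that same weight vector for every sentence of a long-form text corpus.
```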
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech--two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a diverse variety of speech
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
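Batch Dispatch itself is not specified in the abstract, but the underlying idea of dynamic batching can be sketched as follows; the queue-and-timeout scheme and all names here are illustrative assumptions, not the paper's implementation.

```python
import queue
import threading

# Sketch of dynamic batching for serving: requests arriving within a
# short window are grouped so the GPU runs one batched forward pass
# instead of many small, inefficient ones.
requests = queue.Queue()

def dispatcher(run_batch, max_batch=8, timeout_s=0.01):
    while True:
        batch = [requests.get()]                  # block for the first item
        try:
            while len(batch) < max_batch:
                batch.append(requests.get(timeout=timeout_s))
        except queue.Empty:
            pass                                  # window closed; run partial batch
        run_batch(batch)                          # one GPU call for all requests

threading.Thread(target=dispatcher,
                 args=(lambda b: print(f"batched {len(b)} requests"),),
                 daemon=True).start()
```

The trade-off is a small added queueing delay in exchange for much higher GPU utilization, which is how low latency at scale stays affordable.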