In Other News: A Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data
Neural text-to-speech synthesis (NTTS) models have shown significant progress
in generating high-quality speech; however, they require a large quantity of
training data, which makes creating models for multiple styles expensive and
time-consuming. In this paper, different styles of speech are analysed in terms
of their prosodic variations, and from this analysis a model is proposed that
synthesises speech in the style of a newscaster with just a few hours of
supplementary data. We pose the
problem of synthesising in a target style using limited data as that of
creating a bi-style model that can synthesise both neutral-style and
newscaster-style speech via a one-hot vector which factorises the two styles.
We also propose conditioning the model on contextual word embeddings, and
extensively evaluate it against both neutral NTTS and neutral
concatenative-based synthesis. The proposed model closes the gap in perceived
style-appropriateness between natural newscaster-style recordings and neutral
speech synthesis by approximately two-thirds.
Comment: Accepted at NAACL-HLT 2019
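The bi-style conditioning described above can be pictured with a small sketch. The following is a minimal, hypothetical illustration (not the paper's code) of feeding a one-hot style flag and pre-computed contextual word embeddings into a TTS encoder; all module names, dimensions and the use of PyTorch are my own assumptions.

    # Hypothetical sketch (not the paper's implementation): a TTS encoder whose
    # input is a phone sequence, contextual word embeddings upsampled to the
    # phone rate, and a one-hot style flag (0 = neutral, 1 = newscaster).
    import torch
    import torch.nn as nn

    class StyleConditionedEncoder(nn.Module):
        def __init__(self, phone_vocab=100, phone_dim=256, word_ctx_dim=768,
                     num_styles=2, hidden=256):
            super().__init__()
            self.num_styles = num_styles
            self.phone_emb = nn.Embedding(phone_vocab, phone_dim)
            self.rnn = nn.GRU(phone_dim + word_ctx_dim + num_styles, hidden,
                              batch_first=True, bidirectional=True)

        def forward(self, phones, word_ctx, style_id):
            # phones: (B, T) phone ids; word_ctx: (B, T, word_ctx_dim); style_id: (B,)
            style = torch.nn.functional.one_hot(style_id, self.num_styles).float()
            style = style.unsqueeze(1).expand(-1, phones.size(1), -1)
            x = torch.cat([self.phone_emb(phones), word_ctx, style], dim=-1)
            out, _ = self.rnn(x)
            return out  # (B, T, 2*hidden), consumed by the acoustic decoder

    enc = StyleConditionedEncoder()
    out = enc(torch.randint(0, 100, (2, 12)), torch.randn(2, 12, 768),
              torch.tensor([0, 1]))  # one neutral and one newscaster example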
CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
The prosodic aspects of speech signals produced by current text-to-speech
systems are typically averaged over training material, and as such lack the
variety and liveliness found in natural speech. To avoid monotony and averaged
prosody contours, it is desirable to have a way of modeling the variation in
the prosodic aspects of speech, so audio signals can be synthesized in multiple
ways for a given text. We present a new, hierarchically structured conditional
variational autoencoder to generate prosodic features (fundamental frequency,
energy and duration) suitable for use with a vocoder or a generative model like
WaveNet. At inference time, an embedding representing the prosody of a sentence
may be sampled from the variational layer to allow for prosodic variation. To
efficiently capture the hierarchical nature of the linguistic input (words,
syllables and phones), both the encoder and decoder parts of the auto-encoder
are hierarchical, in line with the linguistic structure, with layers being
clocked dynamically at the respective rates. We show in our experiments that
our dynamic hierarchical network outperforms a non-hierarchical
state-of-the-art baseline, and, additionally, that prosody transfer across
sentences is possible by employing the prosody embedding of one sentence to
generate the speech signal of another.
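As an illustration of the variational layer from which a prosody embedding is sampled, here is a minimal sketch under my own assumptions (plain PyTorch, flat rather than hierarchical, arbitrary dimensions); CHiVE's dynamically clocked, linguistically structured encoder and decoder are not reproduced.

    # Illustrative conditional variational layer: train-time sampling of a
    # prosody embedding with the reparameterisation trick plus a KL penalty;
    # at inference, z can be drawn from the prior to vary prosody for one text.
    import torch
    import torch.nn as nn

    class ProsodyVAELayer(nn.Module):
        def __init__(self, enc_dim=256, z_dim=64):
            super().__init__()
            self.to_mu = nn.Linear(enc_dim, z_dim)
            self.to_logvar = nn.Linear(enc_dim, z_dim)

        def forward(self, sentence_encoding):
            mu = self.to_mu(sentence_encoding)
            logvar = self.to_logvar(sentence_encoding)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterise
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return z, kl  # z drives f0/energy/duration prediction; kl joins the loss

    layer = ProsodyVAELayer()
    z, kl = layer(torch.randn(4, 256))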
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis
Global Style Tokens (GSTs) are a recently-proposed method to learn latent
disentangled representations of high-dimensional data. GSTs can be used within
Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to
uncover expressive factors of variation in speaking style. In this work, we
introduce the Text-Predicted Global Style Token (TP-GST) architecture, which
treats GST combination weights or style embeddings as "virtual" speaking style
labels within Tacotron. TP-GST learns to predict stylistic renderings from text
alone, requiring neither explicit labels during training nor auxiliary inputs
for inference. We show that, when trained on a dataset of expressive speech,
our system generates audio with more pitch and energy variation than two
state-of-the-art baseline models. We further demonstrate that TP-GSTs can
synthesize speech with background noise removed, and corroborate these analyses
with positive results on human-rated listener preference audiobook tasks.
Finally, we demonstrate that multi-speaker TP-GST models successfully factorize
speaker identity and speaking style. We provide a website with audio samples
for each of our findings.
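The text-prediction idea can be sketched as a small head that predicts combination weights over a bank of style tokens from the text encoding alone. The sketch below is illustrative only; the token count, dimensions and the way the resulting embedding is consumed are assumptions, not the TP-GST implementation.

    # Hypothetical sketch: predict softmax weights over a fixed bank of style
    # tokens from a text summary, so no reference audio is needed at inference.
    import torch
    import torch.nn as nn

    class TextPredictedGST(nn.Module):
        def __init__(self, text_dim=256, num_tokens=10, token_dim=256):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
            self.weight_head = nn.Linear(text_dim, num_tokens)

        def forward(self, text_summary):
            # text_summary: (B, text_dim), e.g. the final state of the text encoder
            weights = torch.softmax(self.weight_head(text_summary), dim=-1)
            style_embedding = weights @ self.tokens   # (B, token_dim)
            return style_embedding, weights

    tpgst = TextPredictedGST()
    style, w = tpgst(torch.randn(3, 256))  # style conditions the synthesis decoder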
Modality Dropout for Improved Performance-driven Talking Faces
We describe our novel deep learning approach for driving animated faces using
both acoustic and visual information. In particular, speech-related facial
movements are generated using audiovisual information, and non-speech facial
movements are generated using only visual information. To ensure that our model
exploits both modalities during training, batches are generated that contain
audio-only, video-only, and audiovisual input features. The probability of
dropping a modality allows control over the degree to which the model exploits
audio and visual information during training. Our trained model runs in
real-time on resource-limited hardware (e.g. a smartphone), it is
user-agnostic, and it is not dependent on a potentially error-prone transcription of
the speech. We use subjective testing to demonstrate: 1) the improvement of
audiovisual-driven animation over the equivalent video-only approach, and 2)
the improvement in the animation of speech-related facial movements after
introducing modality dropout. Before introducing dropout, viewers prefer
audiovisual-driven animation in 51% of the test sequences compared with only
18% for video-driven. After introducing dropout, viewer preference for
audiovisual-driven animation increases to 74%, while preference for video-only
decreases to 8%.
Comment: Pre-print
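A minimal sketch of modality dropout as a batching-time operation, under my own assumptions (zero-masking one modality per example with a fixed probability); the paper's exact batch construction may differ.

    # Illustrative modality dropout: with some probability, the audio or the
    # video features of an example are zeroed so the model also learns to drive
    # the face from a single modality. Probabilities are placeholder values.
    import torch

    def modality_dropout(audio_feats, video_feats, p_drop_audio=0.25,
                         p_drop_video=0.25):
        # audio_feats: (B, T, Da), video_feats: (B, T, Dv)
        B = audio_feats.size(0)
        drop_a = (torch.rand(B) < p_drop_audio).view(B, 1, 1)
        drop_v = (torch.rand(B) < p_drop_video).view(B, 1, 1)
        keep_both = drop_a & drop_v            # never drop both modalities at once
        drop_a = drop_a & ~keep_both
        drop_v = drop_v & ~keep_both
        audio_feats = audio_feats.masked_fill(drop_a, 0.0)
        video_feats = video_feats.masked_fill(drop_v, 0.0)
        return audio_feats, video_feats

    a, v = modality_dropout(torch.randn(8, 100, 80), torch.randn(8, 100, 128))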
CAMP: a Two-Stage Approach to Modelling Prosody in Context
Prosody is an integral part of communication, but remains an open problem in
state-of-the-art speech synthesis. There are two major issues faced when
modelling prosody: (1) prosody varies at a slower rate compared with other
content in the acoustic signal (e.g. segmental information and background
noise); (2) determining appropriate prosody without sufficient context is an
ill-posed problem. In this paper, we propose solutions to both these issues. To
mitigate the challenge of modelling a slow-varying signal, we learn to
disentangle prosodic information using a word level representation. To
alleviate the ill-posed nature of prosody modelling, we use syntactic and
semantic information derived from text to learn a context-dependent prior over
our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the
state-of-the-art technique, closing the gap with natural speech by 26%. We also
find that replacing attention with a jointly-trained duration model improves
prosody significantly.
Comment: 5 pages. Published in the 2021 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP 2021).
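One way to picture the context-dependent prior is as a small network that maps word-level syntactic/semantic features to the parameters of a Gaussian over the word-level prosody space, trained to stay close to the posterior extracted from audio. The sketch below is a hedged illustration with assumed dimensions and loss form, not CAMP's implementation.

    # Illustrative context-dependent prior over word-level prosody embeddings,
    # matched to an audio-derived posterior with a Gaussian KL divergence.
    import torch
    import torch.nn as nn

    class ContextPrior(nn.Module):
        def __init__(self, ctx_dim=768, z_dim=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(ctx_dim, 256), nn.Tanh(),
                                     nn.Linear(256, 2 * z_dim))

        def forward(self, word_context):
            mu, logvar = self.net(word_context).chunk(2, dim=-1)
            return mu, logvar

    def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
        # KL( N(mu_q, var_q) || N(mu_p, var_p) ), averaged over all dimensions
        return 0.5 * torch.mean(logvar_p - logvar_q
                                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                                - 1.0)

    prior = ContextPrior()
    mu_p, logvar_p = prior(torch.randn(2, 20, 768))          # (B, words, ctx_dim)
    loss = gaussian_kl(torch.randn(2, 20, 32), torch.zeros(2, 20, 32),
                       mu_p, logvar_p)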
Learning Singing From Speech
We propose an algorithm that is capable of synthesizing a high-quality singing
voice for a target speaker given only their normal speech samples. The proposed
algorithm first integrates speech and singing synthesis into a unified
framework, and learns universal speaker embeddings that are shareable between
speech and singing synthesis tasks. Specifically, the speaker embeddings
learned from normal speech via the speech synthesis objective are shared with
those learned from singing samples via the singing synthesis objective in the
unified training framework. This makes the learned speaker embedding a
transferable representation for both speaking and singing. We evaluate the
proposed algorithm on a singing voice conversion task, where the content of the
original singing is rendered with the timbre of another speaker's voice learned
purely from their normal speech samples. Our experiments indicate that the
proposed algorithm generates high-quality singing voices that sound highly
similar to the target speaker's voice given only his or her normal speech
samples. We believe that the proposed algorithm will open up new opportunities
for singing synthesis and conversion for a broader range of users and
applications.
Comment: Submitted to ICASSP 2020
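A minimal sketch, under my own assumptions, of the sharing idea: a single speaker-embedding table feeds both a speech-synthesis branch and a singing-synthesis branch, so gradients from both objectives update the same embedding. The placeholder decoders below stand in for the real synthesis networks.

    # Hypothetical shared speaker embedding used by two synthesis branches.
    import torch
    import torch.nn as nn

    class SharedSpeakerEmbedding(nn.Module):
        def __init__(self, num_speakers=100, spk_dim=128, content_dim=256, feat_dim=80):
            super().__init__()
            self.spk = nn.Embedding(num_speakers, spk_dim)
            self.speech_dec = nn.Linear(spk_dim + content_dim, feat_dim)  # placeholders
            self.sing_dec = nn.Linear(spk_dim + content_dim, feat_dim)

        def forward(self, speaker_id, content, singing: bool):
            # content: (B, T, content_dim) linguistic/score features
            s = self.spk(speaker_id).unsqueeze(1).expand(-1, content.size(1), -1)
            x = torch.cat([content, s], dim=-1)
            return self.sing_dec(x) if singing else self.speech_dec(x)

    model = SharedSpeakerEmbedding()
    speech_out = model(torch.tensor([3]), torch.randn(1, 50, 256), singing=False)
    sing_out = model(torch.tensor([3]), torch.randn(1, 60, 256), singing=True)
    # Joint training would sum a speech loss on speech_out and a singing loss on
    # sing_out, so the one embedding table serves both tasks.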
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
In this paper, we present a novel system that separates the voice of a target
speaker from multi-speaker signals, by making use of a reference signal from
the target speaker. We achieve this by training two separate neural networks:
(1) A speaker recognition network that produces speaker-discriminative
embeddings; (2) A spectrogram masking network that takes both noisy spectrogram
and speaker embedding as input, and produces a mask. Our system significantly
reduces the speech recognition WER on multi-speaker signals, with minimal WER
degradation on single-speaker signals.
Comment: To appear in Interspeech 2019
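The speaker-conditioned masking step can be sketched as follows: the reference speaker's embedding is concatenated to every frame of the noisy spectrogram and a sigmoid mask is predicted and applied. Layer types and sizes in this sketch are assumptions, not the published VoiceFilter architecture.

    # Illustrative speaker-conditioned spectrogram masking network.
    import torch
    import torch.nn as nn

    class SpectrogramMasker(nn.Module):
        def __init__(self, n_mels=80, dvec_dim=256, hidden=400):
            super().__init__()
            self.rnn = nn.LSTM(n_mels + dvec_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_mels)

        def forward(self, noisy_spec, dvec):
            # noisy_spec: (B, T, n_mels); dvec: (B, dvec_dim) reference-speaker embedding
            d = dvec.unsqueeze(1).expand(-1, noisy_spec.size(1), -1)
            h, _ = self.rnn(torch.cat([noisy_spec, d], dim=-1))
            mask = torch.sigmoid(self.out(h))
            return mask * noisy_spec          # enhanced spectrogram for the target speaker

    masker = SpectrogramMasker()
    enhanced = masker(torch.rand(2, 120, 80), torch.randn(2, 256))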
Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages
Recently, sequence-to-sequence models with attention have been successfully
applied in Text-to-speech (TTS). These models can generate near-human speech
with a large accurately-transcribed speech corpus. However, preparing such a
large data-set is both expensive and laborious. To alleviate the problem of
heavy data demand, we propose a novel unsupervised pre-training mechanism in
this paper. Specifically, we first use a vector-quantized variational
autoencoder (VQ-VAE) to extract unsupervised linguistic units from large-scale,
publicly available, untranscribed speech. We then pre-train the
sequence-to-sequence TTS model using the resulting <unsupervised linguistic
units, audio> pairs. Finally, we fine-tune the model with a small amount of
<text, audio> paired data from the target speaker. As a result, both objective and
subjective evaluations show that our proposed method can synthesize more
intelligible and natural speech with the same amount of paired training data.
Furthermore, we extend the proposed method to hypothesized low-resource
languages and verify its effectiveness using objective evaluation.
Comment: Accepted to INTERSPEECH 2020
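The unit-extraction step can be illustrated by the nearest-codebook lookup at the heart of any VQ-VAE: each encoder frame is replaced by the index of its closest codebook vector, giving discrete units that can be paired with audio for pre-training. Codebook size and feature dimension below are assumptions.

    # Illustrative vector-quantisation step for extracting discrete units.
    import torch

    def quantise(encoder_frames, codebook):
        # encoder_frames: (T, D) continuous frames; codebook: (K, D) learned vectors
        dists = torch.cdist(encoder_frames, codebook)   # (T, K) Euclidean distances
        unit_ids = dists.argmin(dim=-1)                 # one discrete unit per frame
        return unit_ids

    codebook = torch.randn(512, 64)
    units = quantise(torch.randn(200, 64), codebook)
    # <units, audio> pairs pre-train the seq2seq TTS model; <text, audio> pairs
    # from the target speaker are then used for fine-tuning.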
PPSpeech: Phrase based Parallel End-to-End TTS System
Current end-to-end autoregressive TTS systems (e.g. Tacotron 2) have
outperformed traditional parallel approaches on the quality of synthesized
speech. However, they also introduce new problems. Due to their autoregressive
nature, inference time grows in proportion to the length of the text, which
poses a great challenge for online serving. Moreover, the style of the
synthesized speech becomes unstable and may change noticeably across sentences.
In this paper, we propose a Phrase based Parallel End-to-End TTS System
(PPSpeech) to address these issues. PPSpeech uses an autoregressive approach
within each phrase and synthesizes different phrases in parallel, achieving both
high quality and high efficiency. In addition, we propose conditioning the
encoder on an acoustic embedding and a text context embedding to keep synthesis
consistent across phrases and to prevent abrupt changes in style or timbre.
Experiments show that the synthesis speed of PPSpeech is much faster than that
of sentence-level autoregressive Tacotron 2 when a sentence has more than five
phrases, and the speed advantage grows with sentence length. Subjective
experiments show that, with the acoustic embedding and context embedding as
conditions, the proposed system makes the style transition across sentences
gradual and natural, clearly outperforming Global Style Tokens (GST) in MOS.
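A hedged sketch of the phrase-parallel idea: the sentence is split at phrase boundaries, each phrase is decoded autoregressively, but all phrases are decoded together rather than one after another. The phrase splitter and the decoder stand-ins below are toy assumptions, not PPSpeech's components.

    # Toy sketch: split a sentence into phrases, encode each, and decode all
    # phrases together instead of decoding the whole sentence left to right.
    import torch

    def split_into_phrases(text: str):
        return [p.strip() for p in text.replace(";", ",").split(",") if p.strip()]

    def synthesize(text, encode, decode_phrases):
        phrases = split_into_phrases(text)
        encodings = [encode(p) for p in phrases]   # per-phrase encoder outputs
        mels = decode_phrases(encodings)           # decode all phrases in parallel
        return torch.cat(mels, dim=0)              # stitch phrases back in order

    # Stand-ins so the sketch runs end to end (a real system would use an
    # encoder network and an autoregressive acoustic decoder here):
    encode = lambda p: torch.randn(len(p), 256)
    decode_phrases = lambda encs: [torch.randn(e.size(0) * 5, 80) for e in encs]
    mel = synthesize("In the morning, after the rain stopped, we left",
                     encode, decode_phrases)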
Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
Generating versatile and appropriate synthetic speech requires control over
the output expression separate from the spoken text. Important non-textual
speech variation is seldom annotated, in which case output control must be
learned in an unsupervised fashion. In this paper, we perform an in-depth study
of methods for unsupervised learning of control in statistical speech
synthesis. For example, we show that popular unsupervised training heuristics
can be interpreted as variational inference in certain autoencoder models. We
additionally connect these models to VQ-VAEs, another recently proposed class
of deep variational autoencoders, which we show can be derived from a very
similar mathematical argument. The implications of these new probabilistic
interpretations are discussed. We illustrate the utility of the various
approaches with an application to acoustic modelling for emotional speech
synthesis, where the unsupervised methods for learning expression control
(without access to emotional labels) are found to give results that in many
aspects match or surpass the previous best supervised approach.
Comment: 17 pages, 4 figures
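As a simple picture of unsupervised control learning, the sketch below (my own assumptions, not a model from the paper) infers an utterance-level code from the target acoustics during training and feeds it to the decoder; at synthesis time the code can be set by hand or sampled to control expression without any emotion labels.

    # Illustrative acoustic model with an unsupervised utterance-level control code.
    import torch
    import torch.nn as nn

    class ControllableAcousticModel(nn.Module):
        def __init__(self, text_dim=256, feat_dim=80, code_dim=8):
            super().__init__()
            self.code_enc = nn.GRU(feat_dim, code_dim, batch_first=True)
            self.decoder = nn.GRU(text_dim + code_dim, feat_dim, batch_first=True)

        def forward(self, text_enc, target_feats=None, code=None):
            if code is None:                     # training: infer the code from audio
                _, h = self.code_enc(target_feats)
                code = h[-1]                     # (B, code_dim)
            c = code.unsqueeze(1).expand(-1, text_enc.size(1), -1)
            out, _ = self.decoder(torch.cat([text_enc, c], dim=-1))
            return out, code

    model = ControllableAcousticModel()
    out, code = model(torch.randn(2, 40, 256), target_feats=torch.randn(2, 200, 80))
    out2, _ = model(torch.randn(1, 40, 256), code=torch.zeros(1, 8))  # manual control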