Learning latent representations for style control and transfer in end-to-end speech synthesis
In this paper, we introduce the variational autoencoder (VAE) into an end-to-end speech synthesis model to learn latent representations of speaking styles in an unsupervised manner. The style representation learned through the VAE shows good properties such as disentangling, scaling, and combination, which make style control easy. Style transfer can be achieved in this framework by first inferring a style representation through the recognition network of the VAE, then feeding it into the TTS network to guide the style of the synthesized speech. To avoid Kullback-Leibler (KL) divergence collapse during training, several techniques are adopted. Finally, the proposed model shows good style-control performance and outperforms the Global Style Token (GST) model in ABX preference tests on style transfer.
Comment: Paper accepted by ICASSP 201
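As a concrete illustration of the recognition-network-plus-annealing recipe above, here is a minimal PyTorch sketch; the layer sizes, the GRU reference encoder, and the linear annealing schedule are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a VAE-style reference encoder for TTS, assuming
# mel-spectrogram inputs of shape (batch, frames, n_mels).
import torch
import torch.nn as nn

class StyleVAE(nn.Module):
    def __init__(self, n_mels=80, hidden=256, style_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, style_dim)       # posterior mean
        self.to_logvar = nn.Linear(hidden, style_dim)   # posterior log-variance

    def forward(self, mels):
        _, h = self.rnn(mels)                 # summarize the reference audio
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        # KL divergence to the standard normal prior, per batch element
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl.mean()

def kl_weight(step, warmup=10_000):
    """Linear KL annealing: one common trick against KL collapse."""
    return min(1.0, step / warmup)

vae = StyleVAE()
z, kl = vae(torch.randn(4, 120, 80))   # 4 reference clips, 120 frames each
loss = kl_weight(step=2_000) * kl      # added to the TTS reconstruction loss
```

At inference, z can be inferred from a reference clip (transfer) or manipulated directly (control), which is what the disentangling and scaling properties above refer to.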
GraphTTS: graph-to-sequence modelling in neural text-to-speech
This paper leverages the graph-to-sequence method in neural text-to-speech (GraphTTS), mapping the graph embedding of the input sequence to spectrograms. The graphical inputs consist of node and edge representations constructed from the input text. Encoding these graphical inputs with a graph neural network (GNN) encoder module incorporates syntactic information. In addition, applying the encoder of GraphTTS as a graph auxiliary encoder (GAE) can analyse prosodic information from the semantic structure of the text. This removes the manual selection of reference audio and makes prosody modelling an end-to-end procedure. Experimental analysis shows that GraphTTS outperforms state-of-the-art sequence-to-sequence models by 0.24 in Mean Opinion Score (MOS). GAE can automatically adjust the pauses, ventilation, and tones of the synthesised audio. These conclusions may offer some inspiration to researchers working on improving prosody in speech synthesis.
Comment: Accepted to ICASSP 202
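The GNN encoder is the core of this design. Below is a minimal sketch of one message-passing layer over node features and a syntax-derived adjacency matrix; the mean-aggregation-plus-GRU update is an assumption standing in for whatever GNN variant GraphTTS actually uses.

```python
# Minimal sketch of a graph encoder over text inputs: node features
# (e.g., phoneme or word embeddings) are mixed along syntax edges.
import torch
import torch.nn as nn

class GraphEncoderLayer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.msg = nn.Linear(dim, dim)    # transform neighbour messages
        self.upd = nn.GRUCell(dim, dim)   # update node states

    def forward(self, nodes, adj):
        # adj: (n, n) 0/1 adjacency built from the parse of the input text
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        agg = (adj @ self.msg(nodes)) / deg   # mean over neighbours
        return self.upd(agg, nodes)           # GRU-style node update

dim = 128
nodes = torch.randn(6, dim)            # 6 nodes from the input sequence
adj = torch.eye(6)                     # self-loops ...
adj[0, 1] = adj[1, 0] = 1.0            # ... plus one syntax edge
out = GraphEncoderLayer(dim)(nodes, adj)   # contextualized node embeddings
```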
Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement
This paper presents a novel neural network design for fine-grained style modeling, transfer, and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied to achieve effective disentanglement of content and style factors in speech and to alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. Objective and subjective evaluation results show that our system outperforms other fine-grained speech style transfer models, especially in content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration: https://daxintan-cuhk.github.io/pl-csd-speech
Comment: Accepted by Interspeech 202
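One standard way to implement the adversarial part of such content-style disentanglement is a gradient-reversal layer: a phone classifier is trained on the style embeddings while the reversed gradient pushes the style encoder to discard phone (content) information. The sketch below assumes this construction; the paper's exact adversarial setup may differ.

```python
# Minimal sketch of adversarial content-style disentanglement at the
# phone level via gradient reversal.
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        # Flip the gradient so the style encoder UNLEARNS content cues
        return -ctx.lam * grad, None

style_emb = torch.randn(32, 16, requires_grad=True)  # phone-level style embeddings
phone_classifier = torch.nn.Linear(16, 60)           # adversary: predict phone id
logits = phone_classifier(GradReverse.apply(style_emb, 1.0))
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 60, (32,)))
loss.backward()   # classifier learns content; style encoder is pushed away from it
```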
Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech
In this paper, we introduce Kathaka, a model trained with a novel two-stage process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from the mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in the text. To do this, we use BERT on the text and graph-attention networks on parse trees extracted from the text. We show a statistically significant relative improvement in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique and show a statistically significant improvement over the baseline in each case.
Comment: 5 pages and 3 figures
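A minimal sketch of the Stage II idea, under the assumption that the sampler predicts a Gaussian over sentence-level prosody embeddings from a context vector (standing in for the BERT and graph-attention features):

```python
# Minimal sketch of contextual sampling from a learnt sentence-level
# prosody distribution; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ContextualSampler(nn.Module):
    def __init__(self, ctx_dim=768, prosody_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim, 256), nn.Tanh())
        self.to_mu = nn.Linear(256, prosody_dim)
        self.to_logvar = nn.Linear(256, prosody_dim)

    def forward(self, ctx):
        h = self.net(ctx)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Sample a prosody embedding appropriate to this textual context
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

ctx = torch.randn(1, 768)           # e.g., pooled BERT features for the sentence
prosody = ContextualSampler()(ctx)  # conditions the TTS decoder at inference
```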
Emotional speech synthesis with rich and granularized control
This paper proposes an effective emotion control method for an end-to-end text-to-speech (TTS) system. To flexibly control the distinct characteristics of a target emotion category, it is essential to determine the embedding vectors that represent the TTS input. We introduce an inter-to-intra emotional distance ratio algorithm that yields embedding vectors which minimize the distance to the target emotion category while maximizing the distance to the other emotion categories. To further enhance the expressiveness of the target speech, we also introduce an effective interpolation technique that lets the intensity of a target emotion be changed gradually to that of neutral speech. Subjective evaluation results in terms of emotional expressiveness and controllability show the superiority of the proposed algorithm over conventional methods.
Comment: Submitted to ICASSP 202
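The two components described above can be sketched directly. In the snippet below, the exact form of the inter-to-intra ratio and the linear interpolation toward neutral are assumptions based on the description, with random vectors standing in for real emotion embeddings.

```python
# Sketch of (1) scoring candidate emotion embeddings by an
# inter-to-intra distance ratio and (2) grading intensity toward neutral.
import numpy as np

def inter_to_intra_ratio(candidate, target_centroid, other_centroids):
    intra = np.linalg.norm(candidate - target_centroid)          # stay close
    inter = min(np.linalg.norm(candidate - c) for c in other_centroids)
    return inter / max(intra, 1e-8)   # larger is better

def grade_intensity(emotion_vec, neutral_vec, alpha):
    """alpha=1 -> full target emotion, alpha=0 -> neutral speech."""
    return alpha * emotion_vec + (1.0 - alpha) * neutral_vec

rng = np.random.default_rng(0)
cands = rng.normal(size=(100, 16))             # candidate embeddings ("happy")
happy, sad, angry = rng.normal(size=(3, 16))   # per-category centroids
best = max(cands, key=lambda v: inter_to_intra_ratio(v, happy, [sad, angry]))
half_happy = grade_intensity(best, neutral_vec=np.zeros(16), alpha=0.5)
```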
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable as its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted i-ETTS, which seeks to directly improve emotion discriminability by interacting with a speech emotion recognition (SER) model. Moreover, we formulate an iterative training strategy with reinforcement learning to ensure the quality of the i-ETTS optimization. Experimental results demonstrate that the proposed i-ETTS outperforms state-of-the-art baselines by rendering speech with more accurate emotion styles. To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
Comment: 5 pages, 4 figures, in Proceedings of the INTERSPEECH 2021 conference. Speech samples: https://ttslr.github.io/i-ETT
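A heavily simplified sketch of the interaction loop, with a REINFORCE-style update in which a (stubbed) SER model's confidence in the intended emotion serves as the reward; i-ETTS's actual models and objective are certainly richer than this.

```python
# Sketch: policy-gradient training where SER confidence is the reward.
# The TTS "policy" and the SER reward are stubs, not the paper's models.
import torch

tts = torch.nn.Linear(16, 8)          # stand-in TTS policy over 8 actions
ser_reward = lambda audio, emo: torch.rand(audio.shape[0])  # stub SER confidence
opt = torch.optim.Adam(tts.parameters(), lr=1e-4)

text, target_emotion = torch.randn(4, 16), 2
logits = tts(text)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                             # stochastic synthesis decision
reward = ser_reward(action.unsqueeze(-1).float(), target_emotion)
loss = -(dist.log_prob(action) * reward).mean()    # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
```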
GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis
This paper introduces a graphical representation of prosody boundaries (GraphPB) for Chinese speech synthesis, aiming to parse the semantic and syntactic relationships of input sequences in a graphical domain to improve prosody performance. The nodes of the graph embedding are formed by prosodic words, and the edges are formed by the other prosodic boundaries, namely prosodic phrase boundaries (PPH) and intonation phrase boundaries (IPH). Different graph neural networks (GNNs), such as the Gated Graph Neural Network (GGNN) and Graph Long Short-Term Memory (G-LSTM), are utilised as graph encoders to exploit the graphical prosody boundary information. A graph-to-sequence model is proposed, formed by a graph encoder and an attentional decoder, and two techniques are proposed to embed sequential information into this graph-to-sequence text-to-speech model. The experimental results show that the proposed approach can encode the phonetic and prosodic rhythm of an utterance. The mean opinion scores (MOS) of these GNN models are comparable to those of state-of-the-art sequence-to-sequence models, with better performance in prosody. This provides an alternative approach to prosody modelling in end-to-end speech synthesis.
Comment: Accepted to SLT 202
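A minimal sketch of how such a prosody-boundary graph might be assembled, with prosodic words as nodes and boundary-typed edges; the data structures are illustrative assumptions, not the paper's implementation.

```python
# Sketch: prosodic words as nodes, boundary types (PPH/IPH) on edges.
from dataclasses import dataclass, field

@dataclass
class ProsodyGraph:
    nodes: list = field(default_factory=list)   # prosodic words (PW)
    edges: list = field(default_factory=list)   # (src, dst, boundary_type)

    def add_word(self, word):
        self.nodes.append(word)
        return len(self.nodes) - 1

def build_graph(prosodic_words, boundaries):
    """boundaries[i] is the boundary after word i: 'PPH' or 'IPH'."""
    g = ProsodyGraph()
    ids = [g.add_word(w) for w in prosodic_words]
    for i, btype in enumerate(boundaries):
        g.edges.append((ids[i], ids[i + 1], btype))
    return g

g = build_graph(["今天", "天气", "很好"], ["PPH", "IPH"])
print(g.edges)   # [(0, 1, 'PPH'), (1, 2, 'IPH')]
```

A GNN encoder like those named above would then propagate information along these typed edges before an attentional decoder generates the spectrogram.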
Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis
This paper proposes a unified model for emotion transfer, control, and prediction in sequence-to-sequence fine-grained emotional speech synthesis. Conventional emotional speech synthesis often needs manual labels or reference audio to determine the emotional expression of the synthesized speech. Such coarse labels cannot control the details of speech emotion, often resulting in an averaged emotional delivery, and it is also hard to choose suitable reference audio during inference. For fine-grained emotion expression generation, we introduce phoneme-level emotion strength representations, obtained through a learned ranking function, to describe local emotion details, while a sentence-level emotion category renders the global emotion of the synthesized speech. With the global label and local descriptors of emotion, we can obtain fine-grained emotion expressions from reference audio via its emotion descriptors (for transfer) or directly from phoneme-level manual labels (for control). For emotional speech synthesis from arbitrary text inputs, the proposed model can also predict phoneme-level emotion expressions from the text, requiring no reference audio or manual labels.
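The learned ranking function can be sketched with a standard margin-ranking objective: segments from emotional speech should score above neutral ones, and the resulting scalar serves as a per-phone strength. The features and pairing below are random stand-ins, and the paper's ranking formulation may differ.

```python
# Sketch: learn a strength scorer by ranking emotional above neutral.
import torch

scorer = torch.nn.Linear(40, 1)            # maps acoustic features -> strength
opt = torch.optim.SGD(scorer.parameters(), lr=0.01)
rank_loss = torch.nn.MarginRankingLoss(margin=1.0)

emotional = torch.randn(64, 40) + 0.5      # fake features of emotional phones
neutral = torch.randn(64, 40)              # fake features of neutral phones
for _ in range(100):
    s_e, s_n = scorer(emotional), scorer(neutral)
    # target=1 asks the first input to rank above the second
    loss = rank_loss(s_e, s_n, torch.ones_like(s_e))
    opt.zero_grad(); loss.backward(); opt.step()

strength = torch.sigmoid(scorer(neutral[:1]))  # per-phone strength in [0, 1]
```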
Energy Disaggregation using Variational Autoencoders
Non-intrusive load monitoring (NILM) is a technique that uses a single sensor to measure the total power consumption of a building. Using an energy disaggregation method, the consumption of individual appliances can be estimated from the aggregate measurement. Recent disaggregation algorithms have significantly improved the performance of NILM systems. However, the generalization of these methods to different houses, as well as the disaggregation of multi-state appliances, remain major challenges. In this paper we address these issues and propose an energy disaggregation approach based on the variational autoencoder (VAE) framework. The probabilistic encoder makes this an efficient model for encoding information relevant to reconstructing the target appliance's consumption. In particular, the proposed model accurately generates more complex load profiles, improving the power signal reconstruction of multi-state appliances, and its regularized latent space improves the generalization of the model across different houses. The proposed model is compared to state-of-the-art NILM approaches on the UK-DALE and REFIT datasets and yields competitive results: the mean absolute error is reduced by 18% on average across all appliances compared to the state-of-the-art, and the F1-score increases by more than 11%, showing improved detection of the target appliance in the aggregate measurement.
Comment: 13 pages, 2 figures, results for the REFIT dataset added
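A minimal sketch of a VAE-based disaggregator in this spirit, assuming fixed-length windows of the aggregate mains signal as input and the target appliance's consumption window as the reconstruction target; the sizes and KL weight are illustrative assumptions.

```python
# Sketch: VAE mapping aggregate power windows to one appliance's signal.
import torch
import torch.nn as nn

class DisaggVAE(nn.Module):
    def __init__(self, window=256, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(window, 128), nn.ReLU())
        self.mu, self.logvar = nn.Linear(128, latent), nn.Linear(128, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                 nn.Linear(128, window))

    def forward(self, aggregate):
        h = self.enc(aggregate)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar   # appliance estimate + posterior stats

model = DisaggVAE()
agg = torch.rand(8, 256)                       # 8 windows of mains readings
pred, mu, logvar = model(agg)
recon = torch.nn.functional.mse_loss(pred, torch.rand(8, 256))  # vs. appliance
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + 0.1 * kl                        # weighted ELBO-style objective
```

The regularized latent space referred to above is exactly the KL term here: it keeps the encoding distribution close to the prior, which is what aids generalization across houses.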
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Comment: ICML 202
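The one-to-many behaviour of the stochastic duration predictor can be illustrated compactly. The paper's predictor is flow-based; the sketch below substitutes a simple log-normal sampler as an assumption, to show the key property that repeated synthesis of the same text yields different rhythms.

```python
# Sketch: sample (rather than regress) per-phoneme durations so the
# same text can be spoken with different rhythms on every call.
import torch
import torch.nn as nn

class StochasticDurations(nn.Module):
    def __init__(self, dim=192):
        super().__init__()
        self.to_mu = nn.Linear(dim, 1)
        self.to_logvar = nn.Linear(dim, 1)

    def forward(self, phoneme_hidden):
        mu, logvar = self.to_mu(phoneme_hidden), self.to_logvar(phoneme_hidden)
        log_dur = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Durations are positive frame counts, hence exp + ceil
        return torch.clamp(torch.ceil(torch.exp(log_dur)), min=1)

h = torch.randn(1, 10, 192)            # hidden states for 10 phonemes
sampler = StochasticDurations()
print(sampler(h).squeeze(-1))          # a different rhythm on every call
```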