Fine-grained robust prosody transfer for single-speaker neural text-to-speech
We present a neural text-to-speech system for fine-grained prosody transfer
from one speaker to another. Conventional approaches to end-to-end prosody
transfer typically encode the reference signal with either a fixed-dimensional
or a variable-length prosody embedding obtained via a secondary attention
mechanism. However, when trained on a single-speaker dataset, such systems are
not robust to speaker variability, especially when the reference signal comes
from an unseen speaker. Therefore, we propose decoupling the alignment of the
reference signal from the overall system. For this
purpose, we pre-compute phoneme-level time stamps and use them to aggregate
prosodic features per phoneme, injecting them into a sequence-to-sequence
text-to-speech system. We incorporate a variational auto-encoder to further
enhance the latent representation of prosody embeddings. We show that our
proposed approach is significantly more stable and achieves reliable prosody
transplantation from an unseen speaker. We also propose a solution to the use
case in which the transcription of the reference signal is absent. We evaluate
all our proposed methods using both objective and subjective listening tests.
Comment: 5 pages, 7 figures, Accepted for Interspeech 201
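To make the decoupled alignment concrete, here is a minimal sketch of the phoneme-level aggregation step the abstract describes, assuming frame-level prosodic features and forced-alignment timestamps are already available; all names and the hop size are illustrative, not the authors' code.

```python
import numpy as np

def aggregate_per_phoneme(frame_feats, phoneme_spans, hop_s=0.0125):
    """Average frame-level prosodic features (e.g. F0, energy) over each
    phoneme's pre-computed time span, so the reference signal needs no
    secondary attention. hop_s is an illustrative frame hop in seconds."""
    pooled = []
    for start_s, end_s in phoneme_spans:
        lo = int(round(start_s / hop_s))
        hi = max(lo + 1, int(round(end_s / hop_s)))   # keep at least one frame
        pooled.append(frame_feats[lo:hi].mean(axis=0))
    return np.stack(pooled)                           # (n_phonemes, n_feats)

# The (n_phonemes, n_feats) output can be concatenated with the phoneme
# encoder states of a sequence-to-sequence TTS model, or fed to a VAE to
# obtain the latent prosody embeddings mentioned in the abstract.
```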
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
Prosody Transfer (PT) is a technique that aims to use the prosody from a
source audio as a reference while synthesising speech. Fine-grained PT aims to
capture prosodic aspects like rhythm, emphasis, melody, duration, and loudness
from the source audio at a very granular level and to transfer them when
synthesising speech in a different target speaker's voice. Current
approaches for fine-grained PT suffer from source speaker leakage, where the
synthesised speech has the voice identity of the source speaker as opposed to
the target speaker. In order to mitigate this issue, they compromise on the
quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT
system that is robust to source speaker leakage, without using parallel data.
We achieve this through a novel reference encoder architecture capable of
capturing temporal prosodic representations which are robust to source speaker
leakage. We compare CopyCat against a state-of-the-art fine-grained PT model
through various subjective evaluations, where we show a relative improvement in
the quality of prosody transfer and in the preservation of the target speaker's
identity, while still maintaining the same naturalness.
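One plausible reading of the reference encoder idea, sketched below in PyTorch: strided convolutions retain a temporal prosody representation while a narrow bottleneck limits how much speaker identity can pass through. The layer sizes are assumptions for illustration, not CopyCat's published architecture.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Illustrative temporal reference encoder: strided convolutions
    downsample the reference mel-spectrogram in time, and a narrow
    bottleneck restricts how much speaker identity can leak through.
    Dimensions are assumptions for the sketch, not CopyCat's exact ones."""
    def __init__(self, n_mels=80, hidden=128, bottleneck=8):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        self.bottleneck = nn.Conv1d(hidden, bottleneck, kernel_size=1)

    def forward(self, mel):                 # mel: (batch, n_mels, n_frames)
        h = self.convs(mel)                 # 4x temporal downsampling
        return self.bottleneck(h)           # (batch, bottleneck, n_frames // 4)

z = ReferenceEncoder()(torch.randn(2, 80, 400))   # -> (2, 8, 100)
```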
Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
This paper presents a simple yet effective method to achieve prosody transfer
from a reference speech signal to synthesized speech. The main idea is to
incorporate well-known acoustic correlates of prosody such as pitch and
loudness contours of the reference speech into a modern neural text-to-speech
(TTS) synthesizer such as Tacotron2 (TC2). More specifically, a small set of
acoustic features is extracted from the reference audio and then used to
condition a TC2 synthesizer. The trained model is evaluated using subjective
listening tests, and a novel objective evaluation of prosody transfer is proposed.
Listening tests show that the synthesized speech is rated as highly natural and
that prosody is successfully transferred from the reference speech signal to
the synthesized signal.
Comment: 5 pages, in review for conference publication
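A hedged sketch of the feature extraction this method relies on, using librosa; the exact feature set and frame parameters in the paper may differ.

```python
import librosa
import numpy as np

def prosody_features(wav_path, sr=22050):
    """Extract the two global prosody correlates the paper builds on:
    a pitch (F0) contour and a loudness (RMS energy) contour."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = np.where(voiced, f0, 0.0)                 # zero out unvoiced frames
    loudness = librosa.feature.rms(y=y)[0]         # frame-level RMS energy
    return f0, loudness

# These contours (or low-dimensional summaries of them) can then be fed as
# extra conditioning inputs to a Tacotron2-style decoder.
```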
CUHK-EE Voice Cloning System for ICASSP 2021 M2VoC Challenge
This paper presents the CUHK-EE voice cloning system for ICASSP 2021 M2VoC
challenge. The challenge provides two Mandarin speech corpora: the AIShell-3
corpus of 218 speakers, with noise and reverberation, and the MST corpus, which
includes high-quality speech from one male and one female speaker. 100 and 5
utterances of 3 target speakers in different voices and styles are provided in
tracks 1 and 2, respectively, and the participants are required to synthesize
speech in the target speakers' voice and style. We take part in track 1 and
carry out voice cloning based on the 100 utterances of the target speakers. An
end-to-end voice cloning system is developed to accomplish the task, which
includes: 1) a text and speech front-end module aided by forced alignment,
2) an acoustic model combining Tacotron2 and DurIAN to predict
mel-spectrograms, and 3) a HiFi-GAN vocoder for waveform generation. Our system
comprises three stages: a multi-speaker training stage, a target speaker
adaptation stage, and a target speaker synthesis stage. Our team is identified
as T17. The
subjective evaluation results provided by the challenge organizer demonstrate
the effectiveness of our system. Audio samples are available at our demo page:
https://daxintan-cuhk.github.io/CUHK-EE-system-M2VoC-challenge/
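As a rough illustration of the adaptation stage (stage two of the recipe), the toy loop below fine-tunes a stand-in acoustic model on target-speaker data at a low learning rate; the model, data shapes, and hyperparameters are placeholders, not the CUHK-EE system.

```python
import torch
import torch.nn as nn

# Stand-in acoustic model; the paper's actual model combines Tacotron2 and
# DurIAN, which this tiny GRU does not attempt to reproduce.
acoustic_model = nn.GRU(input_size=256, hidden_size=80, batch_first=True)
opt = torch.optim.Adam(acoustic_model.parameters(), lr=1e-5)  # low LR for adaptation

phone_embeds = torch.randn(100, 50, 256)   # stand-in for 100 target utterances
target_mels = torch.randn(100, 50, 80)     # stand-in mel-spectrogram targets

for epoch in range(3):                     # brief adaptation pass
    pred, _ = acoustic_model(phone_embeds)
    loss = nn.functional.l1_loss(pred, target_mels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```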
Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement
This paper presents a novel neural network system design for fine-grained
style modeling, transfer and prediction in expressive text-to-speech (TTS)
synthesis. Fine-grained modeling is realized by extracting style embeddings
from the mel-spectrograms of phone-level speech segments. Collaborative
learning and adversarial learning strategies are applied in order to achieve
effective disentanglement of content and style factors in speech and alleviate
the "content leakage" problem in style modeling. The proposed system can be
used for varying-content speech style transfer in the single-speaker scenario.
The results of objective and subjective evaluation show that our system
performs better than other fine-grained speech style transfer models,
especially in the aspect of content preservation. By incorporating a style
predictor, the proposed system can also be used for text-to-speech synthesis.
Audio samples are provided for system demonstration at
https://daxintan-cuhk.github.io/pl-csd-speech.
Comment: Accepted by Interspeech 202
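The abstract's adversarial learning strategy is often realised with a gradient reversal layer; the sketch below shows that common construction, not necessarily the authors' exact setup.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, negated
    gradient in the backward pass. A common way to train an adversary
    that strips unwanted information (here, phone content) out of
    style embeddings."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

style = torch.randn(32, 16, requires_grad=True)   # phone-level style embeddings
phone_classifier = nn.Linear(16, 60)              # adversary predicting phone id
logits = phone_classifier(GradReverse.apply(style))
# Minimising this loss trains the classifier while the reversed gradient
# pushes the style encoder to discard content information.
loss = nn.functional.cross_entropy(logits, torch.randint(0, 60, (32,)))
loss.backward()
```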
Singing Synthesis: with a little help from my attention
We present UTACO, a singing synthesis model based on an attention-based
sequence-to-sequence mechanism and a vocoder based on dilated causal
convolutions. These two classes of models have significantly affected the field
of text-to-speech, but have never been thoroughly applied to the task of
singing synthesis. UTACO demonstrates that attention can be successfully
applied to the singing synthesis field and improves naturalness over the state
of the art. The system requires considerably less explicit modelling of voice
features such as F0 patterns, vibratos, and note and phoneme durations, than
previous models in the literature. Despite this, it shows a strong improvement
in naturalness with respect to previous neural singing synthesis models. The
model does not require any durations or pitch patterns as inputs, and learns to
insert vibrato autonomously according to the musical context. However, we
observe that, by completely dispensing with explicit duration modelling, it
becomes harder to obtain the fine control of timing needed to exactly match the
tempo of a song.
Comment: Submitted to Interspeech 202
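For reference, a toy stack of dilated causal convolutions of the WaveNet family to which such vocoders belong; channel counts and depth here are arbitrary.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Minimal stack of dilated causal 1-D convolutions of the kind used
    in WaveNet-style vocoders; a toy sketch, not UTACO's actual vocoder."""
    def __init__(self, channels=32, layers=6):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(layers):
            d = 2 ** i                              # dilation doubles per layer
            self.layers.append(
                nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            )

    def forward(self, x):                           # x: (batch, channels, time)
        for conv in self.layers:
            pad = conv.dilation[0]                  # left-pad so output is causal
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

y = DilatedCausalStack()(torch.randn(1, 32, 1000))  # time length preserved
```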
Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech
This paper proposes a hierarchical and multi-scale variational
autoencoder-based non-autoregressive text-to-speech model (HiMuV-TTS) to
generate natural speech with diverse speaking styles. Recent advances in
non-autoregressive TTS (NAR-TTS) models have significantly improved the
inference speed and robustness of synthesized speech. However, the diversity
of speaking styles and the naturalness of the synthesized speech still need
improvement. To address this, we propose the HiMuV-TTS model, which first
determines the global-scale prosody and then determines the local-scale prosody
by conditioning on the global-scale prosody and the learned text
representation. In addition, we improve speech quality by adopting adversarial
training.
Experimental results verify that the proposed HiMuV-TTS model can generate more
diverse and natural speech as compared to TTS models with single-scale
variational autoencoders, and can represent different prosody information in
each scale.
Comment: Submitted to INTERSPEECH 202
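A minimal two-scale sketch of the hierarchy described above: sample a global latent first, then a local latent conditioned on it and the text encoding. For brevity both distributions here are text-conditioned; a real posterior would also see the speech. Sizes and layers are assumptions.

```python
import torch
import torch.nn as nn

class TwoScaleVAE(nn.Module):
    """Toy global-then-local latent hierarchy; not the HiMuV-TTS model."""
    def __init__(self, txt_dim=128, zg=16, zl=8):
        super().__init__()
        self.global_stats = nn.Linear(txt_dim, 2 * zg)      # -> mean, log-var
        self.local_stats = nn.Linear(txt_dim + zg, 2 * zl)

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, text_enc):                            # (B, T, txt_dim)
        z_g = self.sample(self.global_stats(text_enc.mean(dim=1)))  # (B, zg)
        cond = torch.cat(
            [text_enc, z_g.unsqueeze(1).expand(-1, text_enc.size(1), -1)], dim=-1
        )
        z_l = self.sample(self.local_stats(cond))           # (B, T, zl)
        return z_g, z_l

z_global, z_local = TwoScaleVAE()(torch.randn(4, 20, 128))
```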
Controllable neural text-to-speech synthesis using intuitive prosodic features
Modern neural text-to-speech (TTS) synthesis can generate speech that is
indistinguishable from natural speech. However, the prosody of generated
utterances often represents the average prosodic style of the database instead
of having wide prosodic variation. Moreover, the generated prosody is solely
defined by the input text, which does not allow for different styles for the
same sentence. In this work, we train a sequence-to-sequence neural network
conditioned on acoustic speech features to learn a latent prosody space with
intuitive and meaningful dimensions. Experiments show that a model conditioned
on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt
can effectively control each prosodic dimension and generate a wide variety of
speaking styles, while maintaining similar mean opinion score (4.23) to our
Tacotron baseline (4.26).
Comment: Accepted for publication in Interspeech 202
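A hedged sketch of how such sentence-wise conditioning features could be computed with librosa; the definitions below (e.g. the spectral-tilt fit) are illustrative assumptions, not the authors' exact recipe.

```python
import librosa
import numpy as np

def sentence_prosody(wav_path, sr=22050):
    """Sentence-level features in the spirit of the paper: mean pitch,
    pitch range, a duration proxy, energy, and spectral tilt."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60.0, fmax=500.0, sr=sr)
    f0 = f0[voiced_flag]                              # keep voiced frames only
    spec = np.abs(librosa.stft(y)).mean(axis=1)       # average magnitude spectrum
    freqs = librosa.fft_frequencies(sr=sr)
    tilt = np.polyfit(freqs[1:], 20 * np.log10(spec[1:] + 1e-8), 1)[0]
    return {
        "pitch_mean": float(f0.mean()),
        "pitch_range": float(f0.max() - f0.min()),
        "duration_s": len(y) / sr,                    # crude proxy for phone duration
        "energy": float(np.mean(y ** 2)),
        "spectral_tilt": float(tilt),                 # dB-per-Hz slope of the spectrum
    }
```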
Emphasis control for parallel neural TTS
Recent parallel neural text-to-speech (TTS) synthesis methods can generate
speech with high fidelity while maintaining high inference efficiency. However,
these systems often lack control over the output prosody, thus restricting the
semantic information conveyable for a given text. This paper proposes a
hierarchical parallel neural TTS system for prosodic emphasis control by
learning a latent space that directly corresponds to a change in emphasis.
Three candidate features for the latent space are compared: 1) the variance of
pitch and duration within the words of a sentence, 2) a wavelet-based feature
computed from pitch, energy, and duration, and 3) a learned combination of the
two aforementioned approaches. At inference time, word-level prosodic emphasis
is achieved by increasing the feature values of the latent space for the given
words. Experiments show that all the proposed methods are able to achieve the
perception of increased emphasis with little loss in overall quality. Moreover,
emphasized utterances were preferred in a pairwise comparison test over the
non-emphasized utterances, indicating promise for real-world applications.
Comment: 5 pages, 5 figures, submitted to Interspeech 202
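The inference-time control described above can be pictured as scaling the latent feature values on the chosen words; a toy sketch, with the scale factor and data layout as assumptions.

```python
import torch

def emphasize(latent, word_ids, target_words, scale=1.5):
    """Boost latent emphasis features on chosen words. `latent` is a
    (T, D) tensor of word-aligned latent features and `word_ids` maps
    each position to a word index; the scale factor is an assumption."""
    latent = latent.clone()
    for t, w in enumerate(word_ids):
        if w in target_words:
            latent[t] = latent[t] * scale   # increase emphasis for this word
    return latent

out = emphasize(torch.randn(6, 4), word_ids=[0, 0, 1, 1, 2, 2], target_words={1})
```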
CAMP: a Two-Stage Approach to Modelling Prosody in Context
Prosody is an integral part of communication, but remains an open problem in
state-of-the-art speech synthesis. There are two major issues faced when
modelling prosody: (1) prosody varies at a slower rate than other
content in the acoustic signal (e.g. segmental information and background
noise); (2) determining appropriate prosody without sufficient context is an
ill-posed problem. In this paper, we propose solutions to both these issues. To
mitigate the challenge of modelling a slow-varying signal, we learn to
disentangle prosodic information using a word level representation. To
alleviate the ill-posed nature of prosody modelling, we use syntactic and
semantic information derived from text to learn a context-dependent prior over
our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the
state-of-the-art technique, closing the gap with natural speech by 26%. We also
find that replacing attention with a jointly-trained duration model improves
prosody significantly.
Comment: 5 pages. Published in the 2021 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP 2021).
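To illustrate the context-dependent prior, the sketch below maps word-level text features (e.g. from a pretrained language model, an assumption here) to a Gaussian over a word-level prosody space; the sizes and architecture are illustrative, not CAMP's.

```python
import torch
import torch.nn as nn

class ContextPrior(nn.Module):
    """Illustrative context-dependent prior: map syntactic/semantic word
    features to a Gaussian over the word-level prosody space."""
    def __init__(self, ctx_dim=768, prosody_dim=16):
        super().__init__()
        self.net = nn.GRU(ctx_dim, 128, batch_first=True, bidirectional=True)
        self.to_stats = nn.Linear(256, 2 * prosody_dim)   # mean and log-variance

    def forward(self, word_ctx):                          # (B, n_words, ctx_dim)
        h, _ = self.net(word_ctx)
        mu, logvar = self.to_stats(h).chunk(2, dim=-1)
        return mu, logvar                                 # prior over prosody per word

mu, logvar = ContextPrior()(torch.randn(2, 12, 768))      # -> (2, 12, 16) each
```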