EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System
We present EMPHASIS, an emotional phoneme-based acoustic model for speech
synthesis systems. EMPHASIS comprises a phoneme duration prediction model and an
acoustic parameter prediction model. It uses a CBHG-based regression network to
model the dependencies between linguistic features and acoustic features. We
modify the input and output layer structures of the network to improve the
performance. For the linguistic features, we apply a feature grouping strategy
to enhance emotional and prosodic features. The acoustic parameters are
designed to be suitable for the regression task and waveform reconstruction.
EMPHASIS can synthesize speech in real time and generate expressive
interrogative and exclamatory speech with high audio quality. It is designed
as a multi-lingual model and currently synthesizes mixed Mandarin-English
speech. In emotional speech synthesis experiments, it achieves better
subjective results than other real-time speech synthesis systems.
Comment: Accepted by Interspeech 201
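The abstract does not spell out the feature-grouping strategy; below is a minimal sketch of one plausible reading, in which segmental and emotional/prosodic linguistic features receive separate input projections so the small prosodic group is enhanced rather than drowned out by the much larger segmental group. All names and dimensions are assumptions for illustration, not EMPHASIS's configuration.

```python
import torch
import torch.nn as nn

class GroupedLinguisticEncoder(nn.Module):
    """Illustrative feature-grouping input layer: the small
    emotional/prosodic group gets its own projection so it is not
    swamped by the larger segmental group. Dimensions are invented."""

    def __init__(self, seg_dim=300, pros_dim=20, hidden=256):
        super().__init__()
        self.seg_proj = nn.Linear(seg_dim, hidden)
        # dedicated projection for the emotional/prosodic group, wide
        # relative to its input, to enhance its influence downstream
        self.pros_proj = nn.Linear(pros_dim, hidden // 2)

    def forward(self, seg_feats, pros_feats):
        # seg_feats: (batch, time, seg_dim); pros_feats: (batch, time, pros_dim)
        h = torch.cat([torch.relu(self.seg_proj(seg_feats)),
                       torch.relu(self.pros_proj(pros_feats))], dim=-1)
        return h  # (batch, time, hidden + hidden // 2), fed to the regression stack
```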
A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis
In a Mandarin text-to-speech (TTS) system, the front-end text processing module
significantly influences the intelligibility and naturalness of synthesized
speech. Building a typical pipeline-based front-end which consists of multiple
individual components requires extensive effort. In this paper, we propose a
unified sequence-to-sequence front-end model for Mandarin TTS that converts raw
texts to linguistic features directly. Compared to the pipeline-based
front-end, our unified front-end can achieve comparable performance in
polyphone disambiguation and prosody word prediction, and improve intonation
phrase prediction by 0.0738 in F1 score. We also implemented the unified
front-end with Tacotron and WaveRNN to build a Mandarin TTS system. The
synthesized speech obtained a MOS (4.38) comparable to that of the
pipeline-based front-end (4.37) and close to human recordings (4.49).
Comment: Submitted to ICASSP 202
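To make "raw text in, linguistic features out" concrete, here is a toy sketch that reduces the idea to joint sequence tagging over characters; note the paper itself uses a sequence-to-sequence formulation, and every vocabulary size and label set below is an invented placeholder.

```python
import torch
import torch.nn as nn

class UnifiedFrontEnd(nn.Module):
    """Toy unified front-end: one text encoder with joint heads for
    per-character pinyin (polyphone disambiguation) and prosodic
    boundary labels (prosodic word / PPH / IPH / none)."""

    def __init__(self, n_chars=5000, n_pinyin=1500, n_boundary=4, d=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, d)
        self.encoder = nn.LSTM(d, d // 2, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.pinyin_head = nn.Linear(d, n_pinyin)
        self.boundary_head = nn.Linear(d, n_boundary)

    def forward(self, char_ids):  # char_ids: (batch, chars)
        h, _ = self.encoder(self.embed(char_ids))
        return self.pinyin_head(h), self.boundary_head(h)

model = UnifiedFrontEnd()
pinyin_logits, boundary_logits = model(torch.randint(0, 5000, (2, 12)))
print(pinyin_logits.shape, boundary_logits.shape)  # (2, 12, 1500) (2, 12, 4)
```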
A Survey on Neural Speech Synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize
intelligible and natural speech given text, is a hot research topic in speech,
language, and machine learning communities, and has broad applications in
industry. With the development of deep learning and artificial intelligence,
neural network-based TTS has significantly improved the quality of synthesized
speech in recent years. In this paper, we conduct a comprehensive survey on
neural TTS, aiming to provide a good understanding of current research and
future trends. We focus on the key components in neural TTS, including text
analysis, acoustic models and vocoders, and several advanced topics, including
fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS.
We further summarize resources related to TTS (e.g., datasets, open-source
implementations) and discuss future research directions. This survey can serve
both academic researchers and industry practitioners working on TTS.
Comment: A comprehensive survey on TTS, 63 pages, 18 tables, 7 figures, 457 references
DurIAN: Duration Informed Attention Network For Multimodal Synthesis
In this paper, we present a generic and robust multimodal synthesis system
that produces highly natural speech and facial expression simultaneously. The
key component of this system is the Duration Informed Attention Network
(DurIAN), an autoregressive model in which the alignments between the input
text and the output acoustic features are inferred from a duration model. This
differs from the end-to-end attention mechanism used in existing end-to-end
speech synthesis systems such as Tacotron, and avoids the various unavoidable
artifacts of those systems. Furthermore, DurIAN can generate high-quality
facial expressions that can be synchronized with the generated speech, with or
without parallel speech and face data. To improve the efficiency of speech generation,
we also propose a multi-band parallel generation strategy on top of the WaveRNN
model. The proposed Multi-band WaveRNN effectively reduces the total
computational complexity from 9.8 to 5.5 GFLOPS, and is able to generate audio
that is 6 times faster than real time on a single CPU core. We show that
DurIAN can generate highly natural speech on par with current
state-of-the-art end-to-end systems, while avoiding the word
skipping/repeating errors of those systems. Finally, a simple yet effective
approach for fine-grained control of the expressiveness of speech and facial
expression is introduced.
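The duration-informed alignment at the heart of DurIAN can be illustrated in a few lines: each phoneme encoding is repeated for its predicted number of acoustic frames, so no attention alignment has to be learned. A minimal sketch of the idea, not the paper's implementation:

```python
import torch

def expand_by_duration(phone_encodings, durations):
    """Duration-informed alignment in its simplest form: repeat each
    phoneme encoding for its predicted number of frames, replacing the
    learned attention alignment (the step that causes word skipping and
    repetition in attention-based models).

    phone_encodings: (num_phones, dim); durations: (num_phones,) ints.
    """
    frames = [enc.repeat(int(d), 1)
              for enc, d in zip(phone_encodings, durations) if d > 0]
    return torch.cat(frames, dim=0)  # (sum(durations), dim)

# e.g. 3 phonemes lasting 2, 1 and 3 frames -> a 6-frame decoder input
enc = torch.randn(3, 8)
print(expand_by_duration(enc, torch.tensor([2, 1, 3])).shape)  # torch.Size([6, 8])
```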
GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis
This paper introduces a graphical representation approach of prosody boundary
(GraphPB) for Chinese speech synthesis, aiming to parse the semantic and
syntactic relationships of input sequences in a graphical domain to improve
prosody performance. The nodes of the graph embedding are
formed by prosodic words, and the edges are formed by the other prosodic
boundaries, namely prosodic phrase boundary (PPH) and intonation phrase
boundary (IPH). Different Graph Neural Networks (GNN) like Gated Graph Neural
Network (GGNN) and Graph Long Short-term Memory (G-LSTM) are utilised as graph
encoders to exploit the graphical prosody boundary information.
A graph-to-sequence model, formed by a graph encoder and an attentional
decoder, is proposed. Two techniques are proposed to embed sequential
information into the graph-to-sequence text-to-speech model. The experimental
results show that the proposed approach can encode the phonetic and prosodic
rhythm of an utterance. The mean opinion scores (MOS) of these GNN models are
comparable to those of state-of-the-art sequence-to-sequence models, with
better prosody performance. This provides an alternative approach to prosody
modelling in end-to-end speech synthesis.
Comment: Accepted to SLT 202
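A hypothetical sketch of the graph construction and one round of message passing: prosodic words become nodes, PPH/IPH boundaries define typed edges, and a mean-aggregation update per edge type stands in for the paper's GGNN/G-LSTM encoders. The edge layout and update rule are illustrative assumptions.

```python
import torch

def message_pass(node_feats, edges_by_type, W_by_type):
    """One message-passing step over a prosody-boundary graph.
    node_feats: (N, d) prosodic-word embeddings;
    edges_by_type: {edge type: list of (src, dst) index pairs}."""
    new = node_feats.clone()
    for etype, edges in edges_by_type.items():
        agg = torch.zeros_like(node_feats)
        count = torch.zeros(node_feats.size(0), 1)
        for s, t in edges:          # mean-aggregate neighbor features
            agg[t] += node_feats[s]
            count[t] += 1
        agg = agg / count.clamp(min=1)
        new = new + torch.relu(agg @ W_by_type[etype])
    return new

d = 16
nodes = torch.randn(4, d)  # 4 prosodic words
edges = {"PPH": [(0, 1), (1, 0)],
         "IPH": [(1, 2), (2, 1), (2, 3), (3, 2)]}
weights = {t: torch.randn(d, d) * 0.1 for t in edges}
print(message_pass(nodes, edges, weights).shape)  # torch.Size([4, 16])
```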
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper we
investigate under what conditions the neural sequence-to-sequence TTS can work
well in Japanese and English along with comparisons with deep neural network
(DNN) based pipeline TTS systems. Unlike in past comparative studies, the
pipeline systems here also use autoregressive probabilistic modeling and a
neural vocoder. We
investigated systems from three aspects: a) model architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high-quality speech, b) it should also use a powerful
encoder when it takes characters as inputs, and c) the encoder still has room
for improvement and needs an improved architecture to learn supra-segmental
features more appropriately.
Building Multilingual TTS using Cross-Lingual Voice Conversion
In this paper we propose a new cross-lingual Voice Conversion (VC) approach
which can generate all speech parameters (MCEP, LF0, BAP) from one DNN model
using PPGs (Phonetic PosteriorGrams) extracted from the input speech with
several ASR acoustic models. Using the proposed VC method, we tried three
different approaches to build a multilingual TTS system without recording a
multilingual speech corpus. A listening test was carried out to evaluate both
speech quality (naturalness) and voice similarity between converted speech and
target speech. The results show that Approach 1 achieved the highest level of
naturalness (3.28 MOS on a 5-point scale) and similarity (2.77 MOS).
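A minimal sketch of the conversion model the abstract describes: a single network mapping speaker-independent PPG frames to all three acoustic parameter streams. The stream dimensions follow common WORLD-vocoder setups and are assumptions, not the paper's exact sizes.

```python
import torch
import torch.nn as nn

class PPGToAcoustic(nn.Module):
    """One conversion DNN: PPG frames in, all acoustic parameter
    streams (MCEP, LF0, BAP) out. All dimensions are placeholders."""

    def __init__(self, ppg_dim=218, mcep_dim=60, bap_dim=5, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ppg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, mcep_dim + 1 + bap_dim),  # +1 for LF0
        )
        self.mcep_dim, self.bap_dim = mcep_dim, bap_dim

    def forward(self, ppg):  # ppg: (batch, frames, ppg_dim)
        out = self.net(ppg)
        mcep = out[..., :self.mcep_dim]
        lf0 = out[..., self.mcep_dim:self.mcep_dim + 1]
        bap = out[..., self.mcep_dim + 1:]
        return mcep, lf0, bap
```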
Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram
Cross-lingual voice conversion (VC) is an important and challenging problem
due to significant mismatches of the phonetic set and the speech prosody of
different languages. In this paper, we build upon the neural text-to-speech
(TTS) model, i.e., FastSpeech, and LPCNet neural vocoder to design a new
cross-lingual VC framework named FastSpeech-VC. We address the mismatches of
the phonetic set and the speech prosody by applying Phonetic PosteriorGrams
(PPGs), which have been shown to bridge speaker and language boundaries.
Moreover, we add normalized logarithm-scale fundamental frequency
(Log-F0) to further compensate for the prosodic mismatches and significantly
improve naturalness. Our experiments on English and Mandarin demonstrate that,
with only mono-lingual corpora, the proposed FastSpeech-VC can achieve
high-quality converted speech with a mean opinion score (MOS) close to that of
professional recordings, while maintaining good speaker similarity. Compared to
baselines using Tacotron2 and Transformer TTS models, FastSpeech-VC offers a
controllable speech rate and much faster inference. More importantly,
FastSpeech-VC can easily be adapted to a speaker with limited training
utterances.
Comment: 5 pages, 2 figures, 4 tables, accepted by ICASSP 202
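One plausible reading of the normalized Log-F0 compensation is a standard speaker-level mean/variance transformation in the log domain; a sketch under that assumption:

```python
import numpy as np

def convert_lf0(source_f0, src_stats, tgt_stats):
    """Speaker-level log-F0 conversion as commonly done in PPG-based VC:
    z-normalize the source speaker's log-F0, then rescale with the target
    speaker's statistics. Unvoiced frames (F0 == 0) are left untouched."""
    voiced = source_f0 > 0
    lf0 = np.zeros_like(source_f0)
    lf0[voiced] = np.log(source_f0[voiced])
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    lf0[voiced] = (lf0[voiced] - src_mean) / src_std * tgt_std + tgt_mean
    converted = np.zeros_like(source_f0)
    converted[voiced] = np.exp(lf0[voiced])
    return converted

# e.g. mapping a higher-pitched source onto a lower-pitched target
f0 = np.array([0.0, 220.0, 230.0, 0.0, 210.0])
print(convert_lf0(f0, (np.log(220), 0.2), (np.log(120), 0.15)))
```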
Data Processing for Optimizing Naturalness of Vietnamese Text-to-speech System
End-to-end text-to-speech (TTS) systems have proved highly successful given a
large amount of high-quality training data recorded in an anechoic room with a
high-quality microphone. Another approach is to use available sources of found
data, such as radio broadcast news. We aim to optimize the naturalness of a
TTS system trained on found data using a novel data processing method. The
method includes 1) utterance selection and 2) prosodic punctuation insertion
to prepare training data that can optimize the naturalness of TTS systems. We
showed that, using the proposed data processing method, an
end-to-end TTS achieved a mean opinion score (MOS) of 4.1, compared to 4.3 for
natural speech. We showed that the punctuation insertion contributed the most
to this result. To facilitate the research and development of TTS systems, we
distributed the processed data of one speaker at
https://forms.gle/6Hk5YkqgDxAaC2BU6.
Comment: 8 pages, 2 figures, submitted to Oriental Cocosd
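A hypothetical sketch of the two processing steps applied to found broadcast data: filter utterances by duration and speaking rate, then insert a prosodic comma wherever forced alignment shows a silent pause. All thresholds and the word-timing format are assumptions, not the paper's recipe.

```python
MIN_DUR, MAX_DUR = 1.0, 15.0   # seconds; assumed selection bounds
PAUSE_THRESHOLD = 0.25         # seconds of silence -> prosodic break

def select_utterances(utts):
    """utts: list of dicts with 'duration' (sec) and 'n_syllables'.
    Keep utterances with sane duration and speaking rate."""
    kept = []
    for u in utts:
        rate = u["n_syllables"] / u["duration"]
        if MIN_DUR <= u["duration"] <= MAX_DUR and 2.0 <= rate <= 8.0:
            kept.append(u)
    return kept

def insert_prosodic_punctuation(words):
    """words: list of (text, start_sec, end_sec) from forced alignment."""
    out = []
    for (w, s, e), nxt in zip(words, words[1:] + [None]):
        out.append(w)
        if nxt is not None and nxt[1] - e >= PAUSE_THRESHOLD:
            out.append(",")  # mark the audible pause for the TTS model
    return " ".join(out)

print(insert_prosodic_punctuation(
    [("xin", 0.0, 0.3), ("chào", 0.35, 0.7),
     ("các", 1.1, 1.3), ("bạn", 1.32, 1.6)]))
# -> "xin chào , các bạn"
```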
Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion
Current voice conversion (VC) methods can successfully convert the timbre of
the audio. However, since effectively modeling the prosody of the source audio
is challenging, transferring the source style to the converted speech remains
limited. This study proposes a source style transfer method based on a
recognition-synthesis framework. In previous speech generation work, prosody
has been modeled either explicitly with prosodic features or implicitly with a
latent prosody extractor. In this paper, taking advantage of both, we model
prosody in a hybrid manner that effectively combines explicit and implicit
methods in a proposed prosody module. Specifically, prosodic features are used
to explicitly model prosody, while a VAE and a reference encoder, which take
the Mel spectrum and bottleneck features as input respectively, are used to
implicitly model prosody. Furthermore, adversarial training is introduced to remove
speaker-related information from the VAE outputs, avoiding leaking source
speaker information while transferring style. Finally, we use a modified
self-attention based encoder to extract sentential context from bottleneck
features, which also implicitly aggregates the prosodic aspects of source
speech from the layered representations. Experiments show that our approach is
superior to the baseline and a competitive system in terms of style transfer;
meanwhile, the speech quality and speaker similarity are well maintained.
Comment: Accepted by Interspeech 202
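The adversarial removal of speaker information from the VAE outputs is commonly realized with a gradient reversal layer: a speaker classifier is trained on the prosody latent, and the reversed gradient pushes the encoder toward a speaker-agnostic latent. A sketch of that mechanism (the paper does not specify this exact form):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated
    gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None

speaker_clf = nn.Linear(64, 10)  # predicts speaker id from the latent

def adversarial_speaker_loss(prosody_latent, speaker_ids, lamb=1.0):
    # the classifier tries to identify the speaker; the reversed gradient
    # pushes the prosody encoder to make the latent speaker-agnostic
    logits = speaker_clf(GradReverse.apply(prosody_latent, lamb))
    return nn.functional.cross_entropy(logits, speaker_ids)

loss = adversarial_speaker_loss(torch.randn(8, 64, requires_grad=True),
                                torch.randint(0, 10, (8,)))
loss.backward()
```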