Context-aware Coherent Speaking Style Prediction with Hierarchical
  Transformers for Audiobook Speech Synthesis

Chen, Liyang; Kang, Shiyin; Lei, Shun; Meng, Helen; Wu, Zhiyong; Zhou, Yixuan

Publication date: 13 April 2023

Abstract

Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it is still challenging to generate speech with a contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, we design a hierarchical transformer-based context-aware style predictor with a mixture attention mask, which considers both text-side context information and speech-side style information from previous utterances. Based on this, we can generate long-form speech with coherent style and prosody sentence by sentence. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed model generates speech with a more expressive and coherent speaking style than the baselines, in both single-sentence and multi-sentence tests.

Comment: Accepted by ICASSP 202

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2304.06359

Last time updated on 16/04/2023