A comparison of Vietnamese Statistical Parametric Speech Synthesis Systems
In recent years, statistical parametric speech synthesis (SPSS) systems have
been widely utilized in many interactive speech-based systems (e.g., Amazon's
Alexa, Bose's headphones). To select a suitable SPSS system, both speech
quality and performance efficiency (e.g., decoding time) must be taken into
account. In this paper, we compare four popular Vietnamese SPSS techniques in
terms of speech quality and performance efficiency: 1) hidden Markov models
(HMM), 2) deep neural networks (DNN), 3) generative adversarial networks
(GAN), and 4) end-to-end (E2E) architectures, the last consisting of
Tacotron 2 and the WaveGlow vocoder. We show that the E2E systems achieved the
best quality but required a GPU to reach real-time performance, whereas the
HMM-based system had inferior speech quality but was the most efficient.
Surprisingly, the E2E systems were more efficient than the DNN and GAN systems
for inference on a GPU, and the GAN-based system did not outperform the DNN
system in terms of quality.
Comment: 9 pages, submitted to KSE 202
A Waveform Representation Framework for High-quality Statistical Parametric Speech Synthesis
State-of-the-art statistical parametric speech synthesis (SPSS) generally
uses a vocoder to represent speech signals and parameterize them into features
for subsequent modeling. Magnitude spectrum has been a dominant feature over
the years. Although perceptual studies have shown that phase spectrum is
essential to the quality of synthesized speech, it is often ignored by using a
minimum phase filter during synthesis and the speech quality suffers. To bypass
this bottleneck in vocoded speech, this paper proposes a phase-embedded
waveform representation framework and establishes a magnitude-phase joint
modeling platform for high-quality SPSS. Our experiments on waveform
reconstruction show that the performance is better than that of the widely-used
STRAIGHT. Furthermore, the proposed modeling and synthesis platform outperforms
a leading-edge, vocoded, deep bidirectional long short-term memory recurrent
neural network (DBLSTM-RNN)-based baseline system on various objective
evaluation metrics.
Comment: accepted and will appear in APSIPA2015; keywords: speech synthesis,
LSTM-RNN, vocoder, phase, waveform, modeling
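The motivating observation, that discarding the phase spectrum degrades quality, can be illustrated by resynthesizing a signal from its STFT with the true phase versus with the phase thrown away (zero phase here, as a crude proxy for the minimum-phase assumption). A small sketch using NumPy/SciPy; the signal is a synthetic stand-in for speech:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
# Synthetic "speech-like" signal: two harmonics plus light noise.
x = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t + 1.0)
x += 0.05 * np.random.randn(fs)

_, _, X = stft(x, fs=fs, nperseg=512)
magnitude, phase = np.abs(X), np.angle(X)

# Reconstruction with the true phase is near-perfect...
_, x_true = istft(magnitude * np.exp(1j * phase), fs=fs, nperseg=512)
# ...while discarding the phase audibly distorts the signal.
_, x_nophase = istft(magnitude.astype(complex), fs=fs, nperseg=512)

n = min(len(x), len(x_true), len(x_nophase))
err_true = np.mean((x[:n] - x_true[:n]) ** 2)
err_nophase = np.mean((x[:n] - x_nophase[:n]) ** 2)
print(f"MSE with true phase: {err_true:.2e}, with zero phase: {err_nophase:.2e}")
```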
Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
Generating versatile and appropriate synthetic speech requires control over
the output expression separate from the spoken text. Important non-textual
speech variation is seldom annotated, in which case output control must be
learned in an unsupervised fashion. In this paper, we perform an in-depth study
of methods for unsupervised learning of control in statistical speech
synthesis. For example, we show that popular unsupervised training heuristics
can be interpreted as variational inference in certain autoencoder models. We
additionally connect these models to VQ-VAEs, another recently proposed class
of deep variational autoencoders, which we show can be derived from a very
similar mathematical argument. The implications of these new probabilistic
interpretations are discussed. We illustrate the utility of the various
approaches with an application to acoustic modelling for emotional speech
synthesis, where the unsupervised methods for learning expression control
(without access to emotional labels) are found to give results that in many
aspects match or surpass the previous best supervised approach.
Comment: 17 pages, 4 figures
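The models in question are variational autoencoders over acoustic features, where a latent code learned without labels acts as the control variable. A minimal sketch of such a model in PyTorch; the dimensions, architecture, and KL weight are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ControlVAE(nn.Module):
    """Minimal VAE: the latent z can serve as an unsupervised control code."""

    def __init__(self, feat_dim=80, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.Tanh())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.Tanh(), nn.Linear(256, feat_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(z)
        # KL divergence of the approximate posterior from the prior N(0, I).
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return recon, kl.mean()

model = ControlVAE()
x = torch.randn(4, 80)  # a batch of acoustic feature frames
recon, kl = model(x)
loss = nn.functional.mse_loss(recon, x) + 0.01 * kl  # beta-weighted KL, assumed
```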
Deep Learning for Singing Processing: Achievements, Challenges and Impact on Singers and Listeners
This paper summarizes recent advances in a set of tasks related to the
processing of singing using state-of-the-art deep learning techniques. We
discuss their achievements in terms of accuracy and sound quality, and the
current challenges, such as availability of data and computing resources. We
also discuss the current and future impact of these advances on listeners and
singers as they are integrated into commercial applications.
Comment: Keynote speech, 2018 Joint Workshop on Machine Learning for Music.
The Federated Artificial Intelligence Meeting (FAIM), a joint workshop
program of ICML, IJCAI/ECAI, and AAMAS
Analysing Shortcomings of Statistical Parametric Speech Synthesis
Output from statistical parametric speech synthesis (SPSS) remains noticeably
worse than natural speech recordings in terms of quality, naturalness, speaker
similarity, and intelligibility in noise. There are many hypotheses regarding
the origins of these shortcomings, but these hypotheses are often kept vague
and presented without empirical evidence that could confirm and quantify how a
specific shortcoming contributes to imperfections in the synthesised speech.
Throughout the speech synthesis literature, surprisingly little work is
dedicated to identifying the perceptually most important problems in speech
synthesis, even though such knowledge would be of great value for creating
better SPSS systems.
In this book chapter, we analyse some of the shortcomings of SPSS. In
particular, we discuss issues with vocoding and present a general methodology
for quantifying the effect of any of the many assumptions and design choices
that hold SPSS back. The methodology is accompanied by an example that
carefully measures and compares the severity of perceptual limitations imposed
by vocoding as well as other factors such as the statistical model and its use.
Comment: 34 pages with 4 figures; draft book chapter
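The quantification methodology rests on isolating one factor at a time; for vocoding, the standard probe is copy-synthesis, where natural speech is analysed and immediately resynthesised so that any degradation is attributable to the vocoder alone. A sketch using the WORLD vocoder via the pyworld package; the choice of pyworld and soundfile, and the file names, are assumptions for illustration rather than the chapter's prescribed tools:

```python
import numpy as np
import pyworld
import soundfile as sf

# Load a natural recording (placeholder path); pyworld expects float64.
x, fs = sf.read("natural.wav")
x = x.astype(np.float64)

# Analysis: decompose into F0, spectral envelope, and aperiodicity.
f0, sp, ap = pyworld.wav2world(x, fs)

# Synthesis: resynthesize directly from the extracted parameters.
y = pyworld.synthesize(f0, sp, ap, fs)

# Any perceptual gap between natural.wav and copy_synthesis.wav now
# quantifies the ceiling this vocoder imposes on an SPSS system.
sf.write("copy_synthesis.wav", y, fs)
```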
Probability density distillation with generative adversarial networks for high-quality parallel waveform generation
This paper proposes an effective probability density distillation (PDD)
algorithm for WaveNet-based parallel waveform generation (PWG) systems.
Recently proposed teacher-student frameworks in the PWG system have
successfully achieved real-time generation of speech signals. However, the
difficulty of optimizing the PDD criterion without auxiliary losses results in
quality degradation of the synthesized speech. To generate more natural speech
signals within the teacher-student framework, we propose a novel optimization
criterion based on generative adversarial networks (GANs). In the proposed
method, the inverse autoregressive flow-based student model is incorporated as
a generator in the GAN framework, and jointly optimized by the PDD mechanism
with the proposed adversarial learning method. As this process encourages the
student to model the distribution of realistic speech waveforms, the perceptual
quality of the synthesized speech becomes much more natural. Our experimental
results verify that PWG systems with the proposed method outperform both those
using conventional approaches and autoregressive generation systems with a
well-trained teacher WaveNet.
Comment: Accepted to the conference of INTERSPEECH 201
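At its core, the proposed criterion augments the distillation objective with a GAN term, so the IAF student is trained both to match the teacher's density and to fool a discriminator on waveforms. A hedged PyTorch sketch of how the two terms could be combined; the least-squares GAN formulation, the KL estimator, and the weight lambda_adv are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def student_loss(student_wave, discriminator, kl_term, lambda_adv=4.0):
    """Combine probability density distillation with an adversarial term.

    kl_term: a Monte-Carlo estimate of KL(student || teacher), computed
    elsewhere from the teacher WaveNet's output distribution (assumed given).
    """
    # Least-squares GAN generator loss: push D(student output) toward 1.
    adv_term = torch.mean((discriminator(student_wave) - 1.0) ** 2)
    return kl_term + lambda_adv * adv_term

def discriminator_loss(discriminator, real_wave, student_wave):
    """Least-squares GAN discriminator loss on real vs. generated waveforms."""
    real_term = torch.mean((discriminator(real_wave) - 1.0) ** 2)
    fake_term = torch.mean(discriminator(student_wave.detach()) ** 2)
    return real_term + fake_term
```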
High quality voice conversion using prosodic and high-resolution spectral features
Voice conversion methods have advanced rapidly over the last decade. Studies
have shown that speaker characteristics are captured by spectral features as
well as various prosodic features. Most existing conversion methods focus on
spectral features, as they directly represent timbre characteristics, while
some methods have focused only on the prosodic feature represented by the
fundamental frequency. In this paper, a comprehensive
framework using deep neural networks to convert both timbre and prosodic
features is proposed. The timbre feature is represented by a high-resolution
spectral feature. The prosodic features include F0, intensity and duration. It
is well known that DNNs are useful for modeling high-dimensional features.
In this work, we show that DNNs initialized by our proposed autoencoder
pretraining yield good-quality conversion models. This pretraining is
tailor-made for voice conversion and leverages an autoencoder to capture the
generic spectral shape of source speech. Additionally, our framework uses
segmental DNN models to capture the evolution of the prosodic features over
time. To reconstruct the converted speech, the spectral feature produced by the
DNN model is combined with the three prosodic features produced by the DNN
segmental models. Our experimental results show that the application of both
prosodic and high-resolution spectral features leads to high-quality converted
speech, as measured by objective evaluations and subjective listening tests.
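The described pretraining first trains an autoencoder on source-speaker spectra and then reuses its weights to initialize the conversion DNN. A hedged PyTorch sketch of that weight-transfer step; the layer sizes, feature dimension, and exact transfer scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn

SPEC_DIM = 513  # assumed high-resolution spectral feature dimension

def make_net():
    return nn.Sequential(
        nn.Linear(SPEC_DIM, 512), nn.Tanh(),
        nn.Linear(512, 512), nn.Tanh(),
        nn.Linear(512, SPEC_DIM),
    )

# Step 1: pretrain as an autoencoder on source-speaker spectra so the
# network first learns the generic spectral shape of the source speech.
autoencoder = make_net()
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)
for spectra in [torch.randn(32, SPEC_DIM) for _ in range(100)]:  # toy data
    loss = nn.functional.mse_loss(autoencoder(spectra), spectra)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 2: initialize the source-to-target conversion DNN with the
# pretrained weights, then fine-tune on paired (source, target) frames.
conversion_dnn = make_net()
conversion_dnn.load_state_dict(autoencoder.state_dict())
```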
ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems
This paper proposes a WaveNet-based neural excitation model (ExcitNet) for
statistical parametric speech synthesis systems. Conventional WaveNet-based
neural vocoding systems significantly improve the perceptual quality of
synthesized speech by statistically generating a time sequence of speech
waveforms through an auto-regressive framework. However, they often suffer from
noisy outputs because of the difficulties in capturing the complicated
time-varying nature of speech signals. To improve modeling efficiency, the
proposed ExcitNet vocoder employs an adaptive inverse filter to decouple
spectral components from the speech signal. The residual component, i.e., the
excitation signal, is then modeled and generated within the WaveNet framework.
In this way, the quality of the synthesized speech signal can be further
improved since the spectral component is well represented by a deep learning
framework and, moreover, the residual component is efficiently generated by the
WaveNet framework. Experimental results show that the proposed ExcitNet
vocoder, trained both speaker-dependently and speaker-independently,
outperforms traditional linear prediction vocoders and similarly configured
conventional WaveNet vocoders.
Comment: Accepted to the conference of EUSIPCO 2019. arXiv admin note: text
overlap with arXiv:1811.0331
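The adaptive inverse filter in ExcitNet is, in essence, linear-prediction analysis: filtering speech through A(z) removes the spectral envelope and leaves the excitation residual. A sketch of that decoupling with librosa and SciPy; frame-wise adaptation is omitted for brevity, and the fixed filter over a short synthetic segment is an illustrative simplification:

```python
import numpy as np
import librosa
from scipy.signal import lfilter

sr = 16000
t = np.arange(int(0.03 * sr)) / sr
# Synthetic voiced-speech stand-in: a harmonic signal with light noise.
x = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 360 * t)
x += 0.01 * np.random.randn(len(t))

# Estimate LPC coefficients [1, a1, ..., ap] of the spectral envelope.
a = librosa.lpc(x, order=16)

# Inverse filtering with A(z) strips the envelope, leaving the excitation.
excitation = lfilter(a, [1.0], x)

# Filtering the excitation through 1/A(z) restores the original signal,
# which is why modeling only the excitation with WaveNet suffices.
reconstructed = lfilter([1.0], a, excitation)
print("max reconstruction error:", np.max(np.abs(x - reconstructed)))
```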
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
Emotional voice conversion aims to convert the spectrum and prosody to change
the emotional patterns of speech, while preserving the speaker identity and
linguistic content. Many studies require parallel speech data between different
emotional patterns, which is not practical in real life. Moreover, they often
model the conversion of fundamental frequency (F0) with a simple linear
transform. As F0 is a key aspect of intonation that is hierarchical in nature,
we believe it is more adequate to model F0 at different temporal scales using
the wavelet transform. We propose a CycleGAN network to find an optimal
pseudo pair from non-parallel training data by learning forward and inverse
mappings simultaneously using adversarial and cycle-consistency losses. We also
study the use of the continuous wavelet transform (CWT) to decompose F0 into
ten temporal scales that describe speech prosody at different time resolutions,
for effective F0 conversion. Experimental results show that our proposed
framework outperforms the baselines in both objective and subjective
evaluations.
Comment: accepted by Speaker Odyssey 2020 in Tokyo, Japan
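The CycleGAN objective combines adversarial losses for the forward and inverse mappings with a cycle-consistency term that forces a round trip to reproduce the input, which is what permits training without parallel pairs. A hedged PyTorch sketch of the generator-side objective; the network handles G, F, D_src, D_tgt, the least-squares GAN form, and the weight lambda_cyc are illustrative assumptions:

```python
import torch

def generator_objective(G, F, D_src, D_tgt, x_src, x_tgt, lambda_cyc=10.0):
    """Adversarial + cycle-consistency losses for non-parallel training.

    G, F are the forward/inverse mappings; D_src, D_tgt are discriminators
    on each domain; x_src, x_tgt are unpaired feature batches.
    """
    # Adversarial terms: converted features should fool the discriminator
    # of the opposite emotional domain.
    adv = torch.mean((D_tgt(G(x_src)) - 1.0) ** 2)
    adv += torch.mean((D_src(F(x_tgt)) - 1.0) ** 2)

    # Cycle-consistency: mapping forward then back must recover the input,
    # which ties the learned pseudo pairs to the original content.
    cyc = torch.mean(torch.abs(F(G(x_src)) - x_src))
    cyc += torch.mean(torch.abs(G(F(x_tgt)) - x_tgt))

    return adv + lambda_cyc * cyc
```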
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion
Emotional voice conversion aims to convert the emotion of speech from one
state to another while preserving the linguistic content and speaker identity.
The prior studies on emotional voice conversion are mostly carried out under
the assumption that emotion is speaker-dependent. We consider that there is a
common code among speakers for emotional expression in a spoken language;
therefore, a speaker-independent mapping between emotional states is possible.
In this paper, we propose a speaker-independent emotional voice conversion
framework that can convert anyone's emotion without the need for parallel
data. We propose a VAW-GAN based encoder-decoder structure to learn the
spectrum and prosody mapping. We perform prosody conversion by using continuous
wavelet transform (CWT) to model the temporal dependencies. We also investigate
the use of F0 as an additional input to the decoder to improve emotion
conversion performance. Experiments show that the proposed speaker-independent
framework achieves competitive results for both seen and unseen speakers.
Comment: Accepted by Interspeech 202
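Both emotional conversion frameworks above rely on a CWT decomposition of the F0 contour so that prosody can be converted at several temporal scales. A sketch of such a decomposition with PyWavelets; the Mexican-hat wavelet, the ten dyadic scales, and the log/normalization preprocessing are common choices in this line of work, assumed here rather than taken from either paper:

```python
import numpy as np
import pywt

# A toy voiced F0 contour (Hz); real contours would first have unvoiced
# gaps interpolated before the log transform and normalization.
frames = np.arange(400)
f0 = 120 + 20 * np.sin(2 * np.pi * frames / 200) + np.random.randn(400)
log_f0 = np.log(f0)
log_f0 = (log_f0 - log_f0.mean()) / log_f0.std()  # per-utterance normalization

# Ten dyadic scales: each row of `coeffs` describes prosody at one time
# resolution, from fast local movements to slow phrase-level trends.
scales = 2.0 ** np.arange(1, 11)
coeffs, _ = pywt.cwt(log_f0, scales, "mexh")
print(coeffs.shape)  # (10, 400): ten temporal scales over the contour
```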