34 research outputs found

Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder

Authors: Hayashi Tomoki; Kobayashi Kazuhiro; Tobing Patrick Lumban; Toda Tomoki; Wu Yi-Chiao
Publication date: 22/03/2020

In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality. Such limitations usually lead to performance degradation in the voice conversion task. To overcome this problem, the QPNet vocoder is applied, which includes a pitch-dependent dilated convolution component to enhance the pitch controllability and attain a more compact network than the WN vocoder. In the proposed method, input spectral features are first converted using a framewise deep neural network, and then the QPNet vocoder generates converted speech conditioned on the linearly converted prosodic and transformed spectral features. The experimental results confirm that the QPNet vocoder achieves significantly better performance than the same-size WN vocoder while maintaining comparable speech quality to the double-size WN vocoder.

Index Terms: WaveNet, vocoder, voice conversion, pitch-dependent dilated convolution, pitch controllability

Comment: 6 pages, 7 figures, Proc. SSW10, 201
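The abstract above mentions "linearly converted prosodic" features without giving the formula. A minimal sketch follows, assuming the commonly used mean-variance transformation of log-F0 between source and target speakers; the function names `log_f0_stats` and `convert_f0` are hypothetical, and the paper may define the conversion differently.

```python
import math

def log_f0_stats(f0_values):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)

def convert_f0(src_f0, src_stats, tgt_stats):
    """Linearly map source log-F0 onto the target speaker's log-F0 statistics.

    Assumption: a mean-variance transform in the log domain; unvoiced
    frames (F0 <= 0) are passed through as 0.0.
    """
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    converted = []
    for f in src_f0:
        if f <= 0:
            converted.append(0.0)  # unvoiced frames stay unvoiced
        else:
            log_f = (math.log(f) - src_mean) / src_std * tgt_std + tgt_mean
            converted.append(math.exp(log_f))
    return converted
```

After conversion, the voiced frames of the output have the same log-F0 mean and standard deviation as the target statistics, while the frame-to-frame contour shape of the source is preserved.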

arXiv.org e-Print Archive

A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems

Authors: Matsunaga Noriyuki; Ohtani Yamato; Tobing Patrick Lumban; Toda Tomoki; Wu Yi-Chiao; Yasuhara Kazuki
Publication date: 06/08/2020

Recently, the effectiveness of text-to-speech (TTS) systems combined with neural vocoders to generate high-fidelity speech has been shown. However, collecting the required training data and building these advanced systems from scratch are time- and resource-consuming. An economical approach is to develop a neural vocoder to enhance the speech generated by existing or low-cost TTS systems. Nonetheless, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the neural vocoders. Because of its generality, this framework can be applied to arbitrary TTS systems and neural vocoders. In this paper, we apply the proposed method with a state-of-the-art WaveNet vocoder for two different basic TTS systems, and both objective and subjective experimental results confirm the effectiveness of the proposed framework.

Comment: 5 pages, 8 figures, 1 table. Proc. Interspeech, 202

arXiv.org e-Print Archive

Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Authors: Hayashi Tomoki; Kobayashi Kazuhiro; Tobing Patrick Lumban; Toda Tomoki; Wu Yi-Chiao
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 27/03/2021

In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, the pure-data-driven nature and the lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency ($F_{0}$) features are outside the $F_{0}$ range observed in the training data. To address this problem, QPNet with two novel designs is proposed. First, the PDCNN component is applied to dynamically change the network architecture of WN according to the given auxiliary $F_{0}$ features. Second, a cascaded network structure is utilized to simultaneously model the long- and short-term dependencies of quasi-periodic signals such as speech. The performances of single-tone sinusoid and speech generations are evaluated. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary $F_{0}$ features and the effectiveness of the cascaded structure for speech generation.

Comment: 15 pages, 12 figures, 11 table
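The abstract above says the PDCNN changes the network architecture according to the auxiliary F0 but does not state how. The sketch below shows one plausible reading, in which each sample's dilation is scaled by the pitch period in samples divided by a constant. The formula, the parameter name `dense_factor`, the function name, and the fallback for unvoiced samples are all assumptions for illustration and are not taken from the abstract.

```python
def pitch_dependent_dilations(f0_values, sample_rate, dense_factor, base_dilation=1):
    """Per-sample dilation sizes that follow the instantaneous pitch period.

    Assumed rule: scale = round(sample_rate / (f0 * dense_factor)), so a
    lower F0 (longer pitch period) yields a larger dilation. Unvoiced
    samples (f0 <= 0) fall back to the fixed base dilation.
    """
    dilations = []
    for f0 in f0_values:
        if f0 <= 0:
            scale = 1  # assumption: no pitch information, keep fixed dilation
        else:
            scale = max(1, round(sample_rate / (f0 * dense_factor)))
        dilations.append(base_dilation * scale)
    return dilations
```

For example, with a 16 kHz sampling rate and `dense_factor=4`, an F0 of 100 Hz gives a scale of 40 samples and an F0 of 200 Hz gives 20, so halving the pitch period halves the dilation and the receptive field tracks the pitch period rather than staying fixed.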

arXiv.org e-Print Archive

Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation

Authors: Hayashi Tomoki; Kobayashi Kazuhiro; Tobing Patrick Lumban; Toda Tomoki; Wu Yi-Chiao
Publication date: 22/03/2020

In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of the WaveNet (WN) vocoder. The effectiveness of the WN vocoder in generating high-fidelity speech samples from given acoustic features has been demonstrated recently. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder can hardly generate speech with given F0 values that are outside the range observed in the training data. Consequently, the WN vocoder lacks pitch controllability, which is one of the essential capabilities of conventional vocoders. To address this limitation, we propose the PDCNN component, which has a time-variant adaptive dilation size related to the given F0 values, and a cascade network structure of the QPNet vocoder to generate quasi-periodic signals such as speech. Both objective and subjective tests are conducted, and the experimental results demonstrate the better pitch controllability of the QPNet vocoder compared to the same- and double-sized WN vocoders while attaining comparable speech quality.

Index Terms: WaveNet, vocoder, quasi-periodic signal, pitch-dependent dilated convolution, pitch controllability

Comment: 5 pages, 4 figures, Proc. Interspeech, 201

arXiv.org e-Print Archive

Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Authors: Hayashi Tomoki; Kawai Hisashi; Okamoto Takuma; Toda Tomoki; Wu Yi-Chiao
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 19/02/2021

In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real time because of its compact model and non-autoregressive (non-AR) and non-causal mechanisms. Although PWG achieves high-fidelity speech generation, the generic and simple network architecture lacks pitch controllability for an unseen auxiliary fundamental frequency ($F_{0}$) feature such as a scaled $F_{0}$. To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information to the network by dynamically changing the network architecture corresponding to the auxiliary $F_{0}$ feature. Both objective and subjective experimental results show that QPPWG outperforms PWG when the auxiliary $F_{0}$ feature is scaled. Moreover, analyses of the intermediate outputs of QPPWG also show better tractability and interpretability of QPPWG, which respectively models spectral and excitation-like signals using the cascaded fixed and adaptive blocks of the QP structure.

Comment: 15 pages, 10 figures, 8 table

arXiv.org e-Print Archive

Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

Authors: Hayashi Tomoki; Kawai Hisashi; Okamoto Takuma; Toda Tomoki; Wu Yi-Chiao
Publication date: 06/08/2020

In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model, whose generation speed is much faster than real time. When used as a vocoder to generate speech from acoustic features such as spectral and prosodic features, PWG generates high-fidelity speech. However, when the input acoustic features include unseen pitches, the pitch accuracy of PWG-generated speech degrades because of the fixed and generic network of PWG without prior knowledge of speech periodicity. The proposed QPPWG adopts a pitch-dependent dilated convolution network (PDCNN) module, which introduces the pitch information into PWG via the dynamically changed network architecture, to improve the pitch controllability and speech modeling capability of vanilla PWG. Both objective and subjective evaluation results show the higher pitch accuracy and comparable speech quality of QPPWG-generated speech when the QPPWG model size is only 70% of that of vanilla PWG.

Comment: 5 pages, 6 figures, 2 tables. Proc. Interspeech, 202

arXiv.org e-Print Archive

Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression

Authors: Hayashi Tomoki; Kobayashi Kazuhiro; Tobing Patrick Lumban; Toda Tomoki; Wu Yi-Chiao
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 06/04/2020

In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, making WN generate some very noisy speech segments called collapsed speech. To tackle the problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) to avoid the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is only applied to the problematic segments to limit the loss of quality to short periods. Objective and subjective evaluations are conducted, and the experimental results confirm the effectiveness of the proposed method, which further improves the speech quality of our previous non-parallel VC system submitted to Voice Conversion Challenge 2018.

Comment: 13 pages, 13 figures, 1 table, accepted to publish in IEEE Acces

arXiv.org e-Print Archive

Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN

Authors: Toda Tomoki; Wu Yi-Chiao; Yoneyama Reo
Publication date: 27/06/2021

We propose a unified approach to data-driven source-filter modeling using a single neural network for developing a neural vocoder capable of generating high-quality synthetic speech waveforms while retaining the flexibility of the source-filter model to control their voice characteristics. Our proposed network, called unified source-filter generative adversarial networks (uSFGAN), is developed by factorizing quasi-periodic parallel WaveGAN (QPPWG), one of the neural vocoders based on a single neural network, into a source excitation generation network and a vocal tract resonance filtering network by additionally implementing a regularization loss. Moreover, inspired by the neural source-filter (NSF) model, only a sinusoidal waveform is additionally used as the simplest clue to generate a periodic source excitation waveform while minimizing the effect of approximations in the source-filter model. The experimental results demonstrate that uSFGAN outperforms conventional neural vocoders, such as QPPWG and NSF, in both speech quality and pitch controllability.

Comment: Submitted to INTERSPEECH 202
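The abstract above uses a sinusoidal waveform as the input clue for the periodic excitation. A minimal sketch of generating such a sinusoid from a per-sample F0 contour by phase accumulation is shown here; the function name `sine_from_f0`, the amplitude value, and the choice to output silence for unvoiced samples are assumptions, not details from the abstract.

```python
import math

def sine_from_f0(f0_per_sample, sample_rate, amplitude=0.1):
    """Generate a sinusoid whose instantaneous frequency follows F0.

    The phase is accumulated sample by sample so the waveform stays
    continuous when F0 changes. Unvoiced samples (f0 <= 0) produce 0.0
    and leave the phase untouched (an assumption for this sketch).
    """
    phase = 0.0
    wave = []
    for f0 in f0_per_sample:
        if f0 <= 0:
            wave.append(0.0)
            continue
        phase += 2.0 * math.pi * f0 / sample_rate
        wave.append(amplitude * math.sin(phase))
    return wave
```

Accumulating phase, rather than computing `sin(2 * pi * f0 * t)` directly, avoids discontinuities at points where the F0 value changes between samples.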

arXiv.org e-Print Archive

Online Speaker Adaptation for WaveNet-based Neural Vocoders

Authors: Ai Yang; Huang Qiuchen; Ling Zhenhua
Publication date: 13/08/2020

In this paper, we propose an online speaker adaptation method for WaveNet-based neural vocoders in order to improve their performance on speaker-independent waveform generation. In this method, a speaker encoder that can extract a speaker embedding vector from an utterance pronounced by an arbitrary speaker is first constructed using a large speaker-verification dataset. At the training stage, a speaker-aware WaveNet vocoder that adopts both acoustic feature sequences and speaker embedding vectors as conditions is then built using a multi-speaker dataset. At the generation stage, we first feed the acoustic feature sequence from a test speaker into the speaker encoder to obtain the speaker embedding vector of the utterance. Then, both the speaker embedding vector and acoustic features pass through the speaker-aware WaveNet vocoder to reconstruct speech waveforms. Experimental results demonstrate that our method can achieve better objective and subjective performance on reconstructing waveforms of unseen speakers than the conventional speaker-independent WaveNet vocoder.

Comment: 6 pages, 2 figures, 4 table
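The abstract above states that the vocoder is conditioned on both acoustic features and a speaker embedding, without saying how they are combined. One simple possibility, sketched below, is to repeat the utterance-level embedding for every frame and concatenate it with that frame's acoustic features; the function name `build_conditions` and the concatenation itself are assumptions for illustration.

```python
def build_conditions(acoustic_frames, speaker_embedding):
    """Append one utterance-level speaker embedding to every acoustic frame.

    acoustic_frames: list of per-frame feature vectors (lists of floats).
    speaker_embedding: a single vector shared by the whole utterance.
    Returns a new list of frames; the inputs are not modified.
    """
    return [list(frame) + list(speaker_embedding) for frame in acoustic_frames]
```

Each output frame then has dimension equal to the acoustic feature dimension plus the embedding dimension, so the conditioning network sees the same speaker information at every time step.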

arXiv.org e-Print Archive

A Survey on Neural Speech Synthesis

Authors: Liu Tie-Yan; Qin Tao; Soong Frank; Tan Xu
Publication date: 23/07/2021

Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in the speech, language, and machine learning communities and has broad applications in industry. With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models, and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS. We further summarize resources related to TTS (e.g., datasets, open-source implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.

Comment: A comprehensive survey on TTS, 63 pages, 18 tables, 7 figures, 457 reference

arXiv.org e-Print Archive