Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder
In this paper, we investigate the effectiveness of a quasi-periodic WaveNet
(QPNet) vocoder combined with a statistical spectral conversion technique for a
voice conversion task. The WaveNet (WN) vocoder has been applied as the
waveform generation module in many different voice conversion frameworks and
achieves significant improvement over conventional vocoders. However, because
of the fixed dilated convolution and generic network architecture, the WN
vocoder lacks robustness against unseen input features and often requires a
huge network size to achieve acceptable speech quality. Such limitations
usually lead to performance degradation in the voice conversion task. To
overcome this problem, the QPNet vocoder is applied, which includes a
pitch-dependent dilated convolution component to enhance the pitch
controllability and attain a more compact network than the WN vocoder. In the
proposed method, input spectral features are first converted using a framewise
deep neural network, and then the QPNet vocoder generates converted speech
conditioned on the linearly converted prosodic and transformed spectral
features. The experimental results confirm that the QPNet vocoder achieves
significantly better performance than the same-size WN vocoder while
maintaining comparable speech quality to the double-size WN vocoder. Index
Terms: WaveNet, vocoder, voice conversion, pitch-dependent dilated convolution,
pitch controllability
Comment: 6 pages, 7 figures, Proc. SSW10, 201
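The abstract above does not spell out how the prosodic features are "linearly converted". A common choice in voice conversion is a mean-variance transformation of log-F0, and the sketch below assumes that choice. The function name `convert_f0` and its statistics arguments are hypothetical and not taken from the paper.

```python
import math


def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map source F0 values (Hz) to the target speaker's log-F0 statistics.

    Assumption: means and standard deviations are computed over log-F0 of
    voiced frames. Unvoiced frames are marked with 0 and are left unchanged.
    """
    converted = []
    for f0 in f0_src:
        if f0 <= 0.0:
            # Unvoiced frame: keep the unvoiced marker.
            converted.append(0.0)
            continue
        # Normalize with source statistics, then rescale to the target.
        log_f0 = (math.log(f0) - src_mean) / src_std * tgt_std + tgt_mean
        converted.append(math.exp(log_f0))
    return converted


# Example: a source speaker centred at 100 Hz mapped to a target centred at 200 Hz.
result = convert_f0([100.0, 0.0], math.log(100.0), 0.2, math.log(200.0), 0.2)
```

Working in the log domain keeps the converted F0 positive and treats pitch changes as ratios rather than absolute differences.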
A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems
Recently, the effectiveness of text-to-speech (TTS) systems combined with
neural vocoders to generate high-fidelity speech has been shown. However,
collecting the required training data and building these advanced systems from
scratch are time- and resource-consuming. An economical approach is to develop a
neural vocoder to enhance the speech generated by existing or low-cost TTS
systems. Nonetheless, this approach usually suffers from two issues: 1)
temporal mismatches between TTS and natural waveforms and 2) acoustic
mismatches between training and testing data. To address these issues, we adopt
a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC
data for training and acoustically matched enhanced data for testing the neural
vocoders. Because of its generality, this framework can be applied to arbitrary
TTS systems and neural vocoders. In this paper, we apply the proposed method
with a state-of-the-art WaveNet vocoder for two different basic TTS systems,
and both objective and subjective experimental results confirm the
effectiveness of the proposed framework.
Comment: 5 pages, 8 figures, 1 table. Proc. Interspeech, 202
Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network
In this paper, a pitch-adaptive waveform generative model named
Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch
controllability of vanilla WaveNet (WN) using pitch-dependent dilated
convolution neural networks (PDCNNs). Specifically, as a probabilistic
autoregressive generation model with stacked dilated convolution layers, WN
achieves high-fidelity audio waveform generation. However, the pure-data-driven
nature and the lack of prior knowledge of audio signals degrade the pitch
controllability of WN. For instance, it is difficult for WN to precisely
generate the periodic components of audio signals when the given auxiliary
fundamental frequency (F0) features are outside the range observed
in the training data. To address this problem, QPNet with two novel designs is
proposed. First, the PDCNN component is applied to dynamically change the
network architecture of WN according to the given auxiliary F0 features.
Second, a cascaded network structure is utilized to simultaneously model the
long- and short-term dependencies of quasi-periodic signals such as speech. The
performances of single-tone sinusoid and speech generations are evaluated. The
experimental results show the effectiveness of the PDCNNs for unseen auxiliary
F0 features and the effectiveness of the cascaded structure for speech
generation.
Comment: 15 pages, 12 figures, 11 tables
Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation
In this paper, we propose a quasi-periodic neural network (QPNet) vocoder
with a novel network architecture named pitch-dependent dilated convolution
(PDCNN) to improve the pitch controllability of WaveNet (WN) vocoder. The
effectiveness of the WN vocoder to generate high-fidelity speech samples from
given acoustic features has been proved recently. However, because of the fixed
dilated convolution and generic network architecture, the WN vocoder hardly
generates speech with given F0 values which are outside the range observed in
training data. Consequently, the WN vocoder lacks the pitch controllability
which is one of the essential capabilities of conventional vocoders. To address
this limitation, we propose the PDCNN component which has the time-variant
adaptive dilation size related to the given F0 values and a cascade network
structure of the QPNet vocoder to generate quasi-periodic signals such as
speech. Both objective and subjective tests are conducted, and the experimental
results demonstrate the better pitch controllability of the QPNet vocoder
compared to the same and double sized WN vocoders while attaining comparable
speech qualities. Index Terms: WaveNet, vocoder, quasi-periodic signal,
pitch-dependent dilated convolution, pitch controllability
Comment: 5 pages, 4 figures, Proc. Interspeech, 201
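The abstract above describes a "time-variant adaptive dilation size related to the given F0 values" without giving a formula. The sketch below shows one plausible way to derive such a dilation: scale a base dilation by the number of samples in one pitch period, divided by a constant. The names `fs`, `dense_factor`, and `base_dilation`, the rounding, and the fallback for unvoiced steps are illustrative assumptions, not details confirmed by the abstract.

```python
def pitch_dependent_dilations(f0, fs=16000, dense_factor=4, base_dilation=1):
    """Return one dilation size per time step, scaled by the pitch period.

    Assumptions: f0 is given in Hz per time step, 0 marks an unvoiced step,
    and dense_factor controls how many taps fall within one pitch period.
    """
    dilations = []
    for f in f0:
        if f <= 0:
            # Unvoiced step: fall back to the fixed base dilation.
            scale = 1
        else:
            # Lower pitch -> longer period -> larger dilation.
            scale = max(1, round(fs / (f * dense_factor)))
        dilations.append(base_dilation * scale)
    return dilations


# Example: halving F0 doubles the dilation, so the receptive field
# follows the pitch period instead of staying fixed.
example = pitch_dependent_dilations([200.0, 100.0, 0.0], base_dilation=2)
```

Because the dilation follows the pitch period, the same kernel can in principle cover about one cycle of the waveform even for F0 values outside the training range, which is the intuition behind the improved pitch controllability.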
Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network
In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform
generative model, which applies a quasi-periodic (QP) structure to a parallel
WaveGAN (PWG) model using pitch-dependent dilated convolution networks
(PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model,
whose generation time is much faster than real time because of its compact
model and non-autoregressive (non-AR) and non-causal mechanisms. Although PWG
achieves high-fidelity speech generation, the generic and simple network
architecture lacks pitch controllability for an unseen auxiliary fundamental
frequency (F0) feature such as a scaled F0. To improve the pitch
controllability and speech modeling capability, we apply a QP structure with
PDCNNs to PWG, which introduces pitch information to the network by dynamically
changing the network architecture corresponding to the auxiliary F0
feature. Both objective and subjective experimental results show that QPPWG
outperforms PWG when the auxiliary F0 feature is scaled. Moreover,
analyses of the intermediate outputs of QPPWG also show better tractability and
interpretability of QPPWG, which respectively models spectral and
excitation-like signals using the cascaded fixed and adaptive blocks of the QP
structure.
Comment: 15 pages, 10 figures, 8 tables
Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation
In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a
quasi-periodic (QP) architecture to improve the pitch controllability of PWG.
PWG is a compact non-autoregressive (non-AR) speech generation model, whose
generative speed is much faster than real time. When PWG is used as a
vocoder to generate speech on the basis of acoustic features such as spectral
and prosodic features, it generates high-fidelity speech. However, when the
input acoustic features include unseen pitches, the pitch accuracy of
PWG-generated speech degrades because of the fixed and generic network of PWG
without prior knowledge of speech periodicity. The proposed QPPWG adopts a
pitch-dependent dilated convolution network (PDCNN) module, which introduces
the pitch information into PWG via the dynamically changed network
architecture, to improve the pitch controllability and speech modeling
capability of vanilla PWG. Both objective and subjective evaluation results
show the higher pitch accuracy and comparable speech quality of QPPWG-generated
speech when the QPPWG model size is only 70% of that of vanilla PWG.
Comment: 5 pages, 6 figures, 2 tables. Proc. Interspeech, 202
Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression
In this paper, we integrate a simple non-parallel voice conversion (VC)
system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression
technique. The effectiveness of WN as a vocoder for generating high-fidelity
speech waveforms on the basis of acoustic features has been confirmed in recent
works. However, when combining the WN vocoder with a VC system, the distorted
acoustic features, acoustic and temporal mismatches, and exposure bias usually
lead to significant speech quality degradation, making WN generate some very
noisy speech segments called collapsed speech. To tackle the problem, we take
conventional-vocoder-generated speech as the reference speech to derive a
linear predictive coding distribution constraint (LPCDC) to avoid the collapsed
speech problem. Furthermore, to mitigate the negative effects introduced by the
LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the
LPCDC is only applied to the problematic segments to limit the loss of quality
to short periods. Objective and subjective evaluations are conducted, and the
experimental results confirm the effectiveness of the proposed method, which
further improves the speech quality of our previous non-parallel VC system
submitted to Voice Conversion Challenge 2018.
Comment: 13 pages, 13 figures, 1 table, accepted for publication in IEEE Access
Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN
We propose a unified approach to data-driven source-filter modeling using a
single neural network for developing a neural vocoder capable of generating
high-quality synthetic speech waveforms while retaining flexibility of the
source-filter model to control their voice characteristics. Our proposed
network called unified source-filter generative adversarial networks (uSFGAN)
is developed by factorizing quasi-periodic parallel WaveGAN (QPPWG), one of the
neural vocoders based on a single neural network, into a source excitation
generation network and a vocal tract resonance filtering network by
additionally implementing a regularization loss. Moreover, inspired by neural
source filter (NSF), only a sinusoidal waveform is additionally used as the
simplest clue to generate a periodic source excitation waveform while
minimizing the effect of approximations in the source filter model. The
experimental results demonstrate that uSFGAN outperforms conventional neural
vocoders, such as QPPWG and NSF, in both speech quality and pitch
controllability.
Comment: Submitted to INTERSPEECH 202
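The abstract above mentions using "only a sinusoidal waveform" as the cue for generating a periodic source excitation. A minimal sketch of how such a sinusoid could be derived from an F0 contour by phase accumulation follows. The function name `sine_from_f0`, the per-sample F0 input, and the zero output for unvoiced samples are assumptions for illustration, not the authors' implementation.

```python
import math


def sine_from_f0(f0_per_sample, fs=16000):
    """Generate a sinusoid whose instantaneous frequency follows F0.

    Assumptions: f0_per_sample holds one F0 value (Hz) per output sample,
    and 0 marks unvoiced samples, which are rendered as silence here.
    """
    phase = 0.0
    samples = []
    for f0 in f0_per_sample:
        if f0 <= 0:
            # Unvoiced sample: emit zero and leave the phase untouched.
            samples.append(0.0)
            continue
        # Accumulate phase so the waveform stays continuous as F0 changes.
        phase += 2.0 * math.pi * f0 / fs
        samples.append(math.sin(phase))
    return samples


# Example: a 4 kHz tone at 16 kHz sampling advances a quarter cycle per sample.
tone = sine_from_f0([4000.0] * 4 + [0.0], fs=16000)
```

Accumulating phase, rather than evaluating `sin(2*pi*f0*t)` directly, avoids discontinuities when F0 varies over time.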
Online Speaker Adaptation for WaveNet-based Neural Vocoders
In this paper, we propose an online speaker adaptation method for
WaveNet-based neural vocoders in order to improve their performance on
speaker-independent waveform generation. In this method, a speaker encoder is
first constructed using a large speaker-verification dataset which can extract
a speaker embedding vector from an utterance pronounced by an arbitrary
speaker. At the training stage, a speaker-aware WaveNet vocoder is then built
using a multi-speaker dataset which adopts both acoustic feature sequences and
speaker embedding vectors as conditions. At the generation stage, we first feed
the acoustic feature sequence from a test speaker into the speaker encoder to
obtain the speaker embedding vector of the utterance. Then, both the speaker
embedding vector and acoustic features pass the speaker-aware WaveNet vocoder
to reconstruct speech waveforms. Experimental results demonstrate that our
method can achieve a better objective and subjective performance on
reconstructing waveforms of unseen speakers than the conventional
speaker-independent WaveNet vocoder.
Comment: 6 pages, 2 figures, 4 tables
A Survey on Neural Speech Synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize
intelligible and natural speech given text, is a hot research topic in speech,
language, and machine learning communities and has broad applications in the
industry. With the development of deep learning and artificial intelligence,
neural network-based TTS has significantly improved the quality of synthesized
speech in recent years. In this paper, we conduct a comprehensive survey on
neural TTS, aiming to provide a good understanding of current research and
future trends. We focus on the key components in neural TTS, including text
analysis, acoustic models and vocoders, and several advanced topics, including
fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
We further summarize resources related to TTS (e.g., datasets, opensource
implementations) and discuss future research directions. This survey can serve
both academic researchers and industry practitioners working on TTS.
Comment: A comprehensive survey on TTS, 63 pages, 18 tables, 7 figures, 457
references