RNN-based speech synthesis using a continuous sinusoidal model
In statistical parametric speech synthesis, we recently proposed a continuous sinusoidal model (CSM) that uses continuous F0 (contF0) in combination with the Maximum Voiced Frequency (MVF), and which achieved performance comparable to state-of-the-art vocoders (e.g. STRAIGHT) in synthesized speech. In this paper, we address the use of sequence-to-sequence modeling with recurrent neural networks (RNNs). A bidirectional long short-term memory (Bi-LSTM) network is investigated and applied with our CSM to model contF0, MVF, and the Mel-Generalized Cepstrum (MGC) for more natural-sounding synthesized speech. To refine the contF0 estimation, a post-processing step based on a time-warping approach is applied to reduce unwanted voiced components in unvoiced speech sounds, resulting in an enhanced contF0 track. An objective evaluation and a subjective listening test show that the proposed framework provides satisfactory results in terms of naturalness and intelligibility, and is comparable to RNNs based on the high-quality WORLD vocoder.
Comment: 8 pages, 4 figures, Accepted to IJCNN 201
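As a concrete illustration of the acoustic-model part, the following is a minimal sketch of a bidirectional LSTM that maps frame-level linguistic features to the three CSM parameter streams (contF0, MVF, MGC). The layer sizes, feature dimensions, and single linear output head are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal PyTorch sketch of a Bi-LSTM acoustic model mapping linguistic features
# to CSM parameters (contF0, MVF, MGC). All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    def __init__(self, in_dim=420, hidden=256, mgc_dim=60):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        # One output stream per CSM parameter: 1-dim contF0, 1-dim MVF, MGC vector.
        self.proj = nn.Linear(2 * hidden, 1 + 1 + mgc_dim)

    def forward(self, linguistic_feats):           # (batch, frames, in_dim)
        h, _ = self.rnn(linguistic_feats)          # (batch, frames, 2*hidden)
        out = self.proj(h)
        contf0, mvf, mgc = out[..., :1], out[..., 1:2], out[..., 2:]
        return contf0, mvf, mgc

# Example: one utterance of 500 frames with 420-dim linguistic features.
model = BiLSTMAcousticModel()
contf0, mvf, mgc = model(torch.randn(1, 500, 420))
```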
A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data
In a typical voice conversion system, a vocoder is commonly used for speech-to-feature analysis and feature-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach using WaveNet for non-parallel training data. Instead of dealing with intermediate features, the proposed approach uses WaveNet to map Phonetic PosteriorGrams (PPGs) directly to waveform samples. In this way, we avoid the estimation errors introduced by the vocoder and by feature conversion. Additionally, as PPGs are assumed to be speaker-independent, the proposed method also reduces the feature-mismatch problem of WaveNet-vocoder-based approaches. Experimental results on the CMU-ARCTIC database show that the proposed approach significantly outperforms the baseline approaches in terms of speech quality.
Comment: 5 pages, 4 figures, This paper is submitted to INTERSPEECH 201
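The central idea of skipping the vocoder can be sketched as a conditional WaveNet-style network that maps upsampled PPG frames directly to waveform sample predictions. The sketch below keeps only the dilated causal convolutions; the gated activations, residual/skip connections, and mu-law quantisation of a full WaveNet are omitted, and all dimensions are assumptions.

```python
# Heavily simplified sketch of mapping upsampled PPG frames to waveform-sample
# logits with a stack of dilated causal convolutions. Not the paper's full WaveNet:
# gated units, residual/skip paths and mu-law output are omitted; sizes are assumed.
import torch
import torch.nn as nn

class TinyConditionalWaveNet(nn.Module):
    def __init__(self, ppg_dim=144, channels=64, layers=8):
        super().__init__()
        self.cond = nn.Conv1d(ppg_dim, channels, 1)        # project PPGs to channels
        self.input = nn.Conv1d(1, channels, 1)             # past waveform samples
        self.dilated = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers))
        self.out = nn.Conv1d(channels, 256, 1)              # 256-way quantised sample logits

    def forward(self, wave, ppg_upsampled):
        # wave: (batch, 1, T) past samples; ppg_upsampled: (batch, ppg_dim, T)
        x = self.input(wave) + self.cond(ppg_upsampled)
        for conv in self.dilated:
            pad = conv.dilation[0]                           # left-pad to keep causality
            x = x + torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return self.out(x)                                   # (batch, 256, T)

# Example: 4000 samples conditioned on frame-upsampled 144-dim PPGs.
net = TinyConditionalWaveNet()
logits = net(torch.randn(1, 1, 4000), torch.randn(1, 144, 4000))
```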
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of the fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe it is more adequate to model F0 at different temporal scales using a wavelet transform. We propose a CycleGAN network that finds an optimal pseudo pair from non-parallel training data by learning forward and inverse mappings simultaneously with adversarial and cycle-consistency losses. We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales, which describe speech prosody at different time resolutions, for effective F0 conversion. Experimental results show that our proposed framework outperforms the baselines in both objective and subjective evaluations.
Comment: accepted by Speaker Odyssey 2020 in Tokyo, Japan
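The ten-scale F0 decomposition can be sketched with PyWavelets. The Mexican-hat wavelet, dyadic scales, and interpolation of log-F0 through unvoiced regions below follow common practice in wavelet-based prosody modelling; they are assumptions rather than the paper's exact configuration.

```python
# Sketch of decomposing an F0 contour into ten temporal scales with a continuous
# wavelet transform (PyWavelets). Wavelet choice and scales are assumptions.
import numpy as np
import pywt

def f0_to_cwt(f0, frame_shift=0.005, num_scales=10):
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    # Interpolate log-F0 through unvoiced frames, then z-normalise.
    logf0 = np.interp(np.arange(len(f0)), np.flatnonzero(voiced), np.log(f0[voiced]))
    logf0 = (logf0 - logf0.mean()) / logf0.std()
    # Dyadic scales roughly spanning phone- to utterance-level variation.
    scales = 2.0 ** np.arange(1, num_scales + 1)
    coeffs, _ = pywt.cwt(logf0, scales, "mexh", sampling_period=frame_shift)
    return coeffs                       # shape: (num_scales, num_frames)

# Example with a synthetic contour: 200 frames, the second half voiced.
f0 = np.concatenate([np.zeros(100), 120 + 10 * np.sin(np.linspace(0, 6, 100))])
cwt10 = f0_to_cwt(f0)                   # (10, 200)
```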
Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data
This paper introduces Taco-VC, a novel architecture for voice conversion based on the Tacotron synthesizer, a sequence-to-sequence model with attention. Training multi-speaker voice conversion systems requires substantial resources, in both training time and corpus size. Taco-VC is implemented using a single-speaker Tacotron synthesizer based on Phonetic PosteriorGrams (PPGs) and a single-speaker WaveNet vocoder conditioned on mel spectrograms. To enhance the converted speech quality and to overcome over-smoothing, the outputs of Tacotron are passed through a novel speech-enhancement network, composed of a combination of the phoneme-recognition and Tacotron networks. Our system is trained on just a single-speaker corpus and adapts to new speakers using only a few minutes of training data. On mid-size public datasets, our method outperforms the baseline in the VCC 2018 SPOKE non-parallel voice conversion task and achieves results competitive with multi-speaker networks trained on large private datasets.
Comment: Accepted to EUSIPCO 202
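A high-level sketch of how the components described above fit together is given below. Every component name (ppg_estimator, tacotron, enhancer, wavenet_vocoder) is a hypothetical placeholder standing in for the corresponding trained model, not a real API.

```python
# High-level sketch of the Taco-VC conversion pipeline. All callables are
# hypothetical placeholders for the trained components, not real library APIs.
def convert(source_wav, ppg_estimator, tacotron, enhancer, wavenet_vocoder):
    ppg = ppg_estimator(source_wav)     # speaker-independent phonetic posteriors
    mel = tacotron(ppg)                 # single-speaker Tacotron: PPG -> mel spectrogram
    mel = enhancer(mel)                 # speech-enhancement net to reduce over-smoothing
    return wavenet_vocoder(mel)         # single-speaker WaveNet: mel -> waveform
```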
TTS Skins: Speaker Conversion via ASR
We present a fully convolutional wav-to-wav network for converting between
speakers' voices, without relying on text. Our network is based on an
encoder-decoder architecture, where the encoder is pre-trained for the task of
Automatic Speech Recognition, and a multi-speaker waveform decoder is trained
to reconstruct the original signal in an autoregressive manner. We train the network on narrated audiobooks and demonstrate multi-voice TTS in those voices by converting the voice of a TTS robot.
Speech-to-Singing Conversion based on Boundary Equilibrium GAN
This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing speech-to-singing conversion as a style transfer problem. Specifically, given a speech input, and optionally the F0 contour of the target singing, the proposed model generates a singing signal using a progressively growing encoder/decoder architecture and boundary equilibrium GAN loss functions. Our quantitative and qualitative analyses show that the proposed model generates singing voices with much higher naturalness than an existing, non-adversarially trained baseline. For reproducibility, the code will be made publicly available in a GitHub repository upon publication.
Comment: Accepted for publication at INTERSPEECH 202
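The boundary equilibrium GAN objective mentioned above uses an autoencoder as the discriminator and a running coefficient k that balances the reconstruction errors on real and generated samples. The sketch below is the generic BEGAN update with typical default hyper-parameters, not the paper's exact formulation.

```python
# Generic sketch of the boundary equilibrium GAN (BEGAN) objective. The
# discriminator D is an autoencoder; k balances real vs. fake reconstruction
# errors. gamma and lambda_k are common defaults, not the paper's values.
import torch

def began_step(D, real, fake, k, gamma=0.5, lambda_k=0.001):
    # Autoencoder reconstruction error acts as the discriminator "energy".
    loss_real = torch.mean(torch.abs(D(real) - real))
    loss_fake = torch.mean(torch.abs(D(fake) - fake))
    d_loss = loss_real - k * loss_fake      # discriminator objective
    g_loss = loss_fake                      # generator objective
    # Proportional control keeps E[loss_fake] near gamma * E[loss_real];
    # in practice the two losses are back-propagated separately.
    k = float(min(max(k + lambda_k * (gamma * loss_real - loss_fake).item(), 0.0), 1.0))
    return d_loss, g_loss, k
```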
ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
This paper proposes a voice conversion (VC) method based on sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is well suited to parallel computation on GPUs, and it allows effective normalization techniques such as batch normalization to be used in all the hidden layers of the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers with a single model, instead of separately learning the mapping between each speaker pair with a different model. This enables the model to fully utilize the training data collected from multiple speakers by capturing common latent features that can be shared across speakers. Owing to this structure, our model works reasonably well even without source-speaker information, which makes it able to handle any-to-many conversion tasks. Third, we introduce a mechanism called conditional batch normalization, which switches the batch normalization layers according to the target speaker. This mechanism has proven extremely effective for our many-to-many conversion model. Speaker identity conversion experiments show that ConvS2S-VC obtains higher sound quality and speaker similarity than baseline methods. Audio examples also show that it performs well in various tasks, including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.
Comment: Published in IEEE/ACM Trans. ASLP, https://ieeexplore.ieee.org/document/911344
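The conditional batch normalization mechanism can be sketched as a shared, non-affine batch normalization followed by a per-speaker scale and bias selected by the target-speaker index. The sizes below are illustrative assumptions.

```python
# Minimal PyTorch sketch of conditional batch normalization: shared BatchNorm
# without affine parameters, plus per-speaker gamma/beta chosen by speaker index.
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    def __init__(self, num_channels, num_speakers):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_channels, affine=False)
        self.gamma = nn.Embedding(num_speakers, num_channels)
        self.beta = nn.Embedding(num_speakers, num_channels)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, speaker_id):              # x: (batch, channels, frames)
        g = self.gamma(speaker_id).unsqueeze(-1)   # (batch, channels, 1)
        b = self.beta(speaker_id).unsqueeze(-1)
        return g * self.bn(x) + b

# Example: normalise hidden activations for target speaker 3 of 4.
cbn = ConditionalBatchNorm1d(num_channels=128, num_speakers=4)
y = cbn(torch.randn(8, 128, 100), torch.full((8,), 3, dtype=torch.long))
```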
Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks
For our submission to the ZeroSpeech 2019 challenge, we apply discrete
latent-variable neural networks to unlabelled speech and use the discovered
units for speech synthesis. Unsupervised discrete subword modelling could be
useful for studies of phonetic category learning in infants or in low-resource
speech technology requiring symbolic input. We use an autoencoder (AE)
architecture with intermediate discretisation. We decouple acoustic unit
discovery from speaker modelling by conditioning the AE's decoder on the
training speaker identity. At test time, unit discovery is performed on speech
from an unseen speaker, followed by unit decoding conditioned on a known target
speaker to obtain reconstructed filterbanks. This output is fed to a neural
vocoder to synthesise speech in the target speaker's voice. For discretisation,
categorical variational autoencoders (CatVAEs), vector-quantised VAEs (VQ-VAEs)
and straight-through estimation are compared at different compression levels on
two languages. Our final model uses convolutional encoding, VQ-VAE
discretisation, deconvolutional decoding and an FFTNet vocoder. We show that
decoupled speaker conditioning intrinsically improves discrete acoustic
representations, yielding competitive synthesis quality compared to the
challenge baseline.
Comment: Interspeech 201
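The VQ-VAE discretisation with straight-through gradients can be sketched as follows; the codebook size, embedding dimension, and commitment weight are illustrative assumptions, not the values used in the submission.

```python
# Sketch of a vector-quantisation bottleneck with straight-through gradients,
# the kind of discretisation layer compared in the paper. Sizes are assumed.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                           # z: (batch, frames, dim)
        w = self.codebook.weight
        # Squared distance from every frame encoding to every codebook entry.
        dist = (z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.t() + w.pow(2).sum(-1))
        idx = dist.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Codebook and commitment losses; straight-through copy of gradients.
        loss = ((z_q - z.detach()) ** 2).mean() + self.beta * ((z - z_q.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

# Example: quantise 50 frames of 64-dim encoder output.
vq = VectorQuantizer()
z_q, codes, vq_loss = vq(torch.randn(2, 50, 64))
```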
Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder
In this paper, we investigate the effectiveness of a quasi-periodic WaveNet
(QPNet) vocoder combined with a statistical spectral conversion technique for a
voice conversion task. The WaveNet (WN) vocoder has been applied as the
waveform generation module in many different voice conversion frameworks and
achieves significant improvement over conventional vocoders. However, because
of the fixed dilated convolution and generic network architecture, the WN
vocoder lacks robustness against unseen input features and often requires a
huge network size to achieve acceptable speech quality. Such limitations
usually lead to performance degradation in the voice conversion task. To
overcome this problem, the QPNet vocoder is applied, which includes a
pitch-dependent dilated convolution component to enhance the pitch
controllability and attain a more compact network than the WN vocoder. In the
proposed method, input spectral features are first converted using a framewise
deep neural network, and then the QPNet vocoder generates converted speech
conditioned on the linearly converted prosodic and transformed spectral
features. The experimental results confirm that the QPNet vocoder achieves
significantly better performance than the same-size WN vocoder while
maintaining comparable speech quality to the double-size WN vocoder.
Index Terms: WaveNet, vocoder, voice conversion, pitch-dependent dilated convolution, pitch controllability
Comment: 6 pages, 7 figures, Proc. SSW10, 201
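The pitch-dependent dilated convolution scales the dilation of each layer by a per-frame factor derived from F0 and the sampling rate. The sketch below computes such factors under the assumption E_t = fs / (a * F0_t) with a dense factor a; the dense-factor value and the handling of unvoiced frames are assumptions for illustration, not guaranteed to match the QPNet implementation.

```python
# Sketch of per-frame pitch-dependent dilation factors, assuming
# E_t = fs / (dense_factor * F0_t). Unvoiced handling is an assumption.
import numpy as np

def pitch_dependent_dilations(f0, fs=24000, dense_factor=4, base_dilation=1):
    f0 = np.asarray(f0, dtype=float)
    # Fall back to a nominal F0 for unvoiced frames (illustrative choice).
    f0 = np.where(f0 > 0, f0, np.median(f0[f0 > 0]) if np.any(f0 > 0) else 100.0)
    e_t = fs / (dense_factor * f0)                 # pitch-dependent dilation factor
    return np.maximum(1, np.round(base_dilation * e_t).astype(int))

# Example: frames at 100 Hz and 200 Hz get dilations 60 and 30 at fs = 24 kHz.
print(pitch_dependent_dilations([100.0, 200.0, 0.0]))
```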
Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension
This paper presents a waveform modeling and generation method using
hierarchical recurrent neural networks (HRNN) for speech bandwidth extension
(BWE). Different from conventional BWE methods which predict spectral
parameters for reconstructing wideband speech waveforms, this BWE method models
and predicts waveform samples directly without using vocoders. Inspired by
SampleRNN, an unconditional neural audio generator, the HRNN model
represents the distribution of each wideband or high-frequency waveform sample
conditioned on the input narrowband waveform samples using a neural network
composed of long short-term memory (LSTM) layers and feed-forward (FF) layers.
The LSTM layers form a hierarchical structure and each layer operates at a
specific temporal resolution to efficiently capture long-span dependencies
between temporal sequences. Furthermore, additional conditions, such as the
bottleneck (BN) features derived from narrowband speech using a deep neural
network (DNN)-based state classifier, are employed as auxiliary input to
further improve the quality of generated wideband speech. The experimental
results of comparing several waveform modeling methods show that the HRNN-based
method can achieve better speech quality and run-time efficiency than the
dilated convolutional neural network (DCNN)-based method and the plain
sample-level recurrent neural network (SRNN)-based method. Our proposed method
also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in
terms of the subjective quality of the reconstructed wideband speech.
Comment: Accepted by IEEE Transactions on Audio, Speech and Language Processing
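A heavily simplified two-tier version of the hierarchical model can be sketched as a frame-level LSTM running once per block of narrowband samples, whose output conditions a sample-level feed-forward stack predicting quantised wideband samples. The tier count, block size, layer sizes, and the omission of autoregressive conditioning on past wideband samples are all simplifying assumptions.

```python
# Simplified two-tier sketch of a hierarchical RNN for bandwidth extension:
# a coarse frame-level LSTM conditions a fine sample-level FF stack. The real
# model's autoregressive wideband conditioning is omitted; sizes are assumed.
import torch
import torch.nn as nn

class TinyHRNN(nn.Module):
    def __init__(self, block=16, hidden=256, levels=256):
        super().__init__()
        self.block = block
        self.frame_rnn = nn.LSTM(block, hidden, batch_first=True)   # coarse tier
        self.sample_ff = nn.Sequential(                              # fine (sample) tier
            nn.Linear(hidden + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, levels))                               # quantised wideband sample

    def forward(self, narrowband):          # narrowband: (batch, T), T divisible by block
        b, t = narrowband.shape
        frames = narrowband.view(b, t // self.block, self.block)
        ctx, _ = self.frame_rnn(frames)                      # (batch, frames, hidden)
        ctx = ctx.repeat_interleave(self.block, dim=1)       # back to the sample rate
        x = torch.cat([ctx, narrowband.unsqueeze(-1)], dim=-1)
        return self.sample_ff(x)                             # (batch, T, levels) logits

# Example: 160 narrowband samples -> 160 wideband-sample logit vectors.
out = TinyHRNN()(torch.randn(2, 160))
```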