High quality voice conversion using prosodic and high-resolution spectral features
Voice conversion methods have advanced rapidly over the last decade. Studies
have shown that speaker characteristics are captured by spectral features as
well as by various prosodic features. Most existing conversion methods focus
on spectral features, since they directly represent timbre, while some
methods have focused only on the prosodic feature represented by the
fundamental frequency (F0). In this paper, a comprehensive framework using
deep neural networks (DNNs) to convert both timbre and prosodic features is
proposed. The timbre feature is represented by a high-resolution spectral
feature, while the prosodic features include F0, intensity, and duration.
DNNs are well suited to modeling such high-dimensional features. In this
work, we show that a DNN initialized with our proposed autoencoder
pretraining yields good-quality conversion models. This pretraining is
tailor-made for voice conversion and leverages an autoencoder to capture the
generic spectral shape of the source speech. Additionally, our framework uses
segmental DNN models to capture the evolution of the prosodic features over
time. To reconstruct the converted speech, the spectral feature produced by
the DNN model is combined with the three prosodic features produced by the
segmental DNN models. Our experimental results show that using both prosodic
and high-resolution spectral features leads to high-quality converted speech,
as measured by objective evaluation and subjective listening tests.
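As a rough illustration of the autoencoder-pretraining idea described above
(not the authors' implementation), the sketch below first trains an
autoencoder to reconstruct source-speaker spectra and then copies its encoder
weights into the first layer of the spectral conversion DNN. The feature
dimension, layer sizes, and the use of PyTorch are assumptions.

```python
# Minimal sketch: autoencoder pretraining used to initialize a spectral
# conversion DNN. Sizes are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

SPEC_DIM = 513   # assumed high-resolution spectral feature size
HIDDEN = 256

class SpectralAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(SPEC_DIM, HIDDEN), nn.Tanh())
        self.decoder = nn.Linear(HIDDEN, SPEC_DIM)

    def forward(self, x):
        return self.decoder(self.encoder(x))

class ConversionDNN(nn.Module):
    def __init__(self, pretrained_encoder):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SPEC_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, SPEC_DIM),
        )
        # Initialize the first layer from the pretrained encoder, so the
        # conversion DNN starts from the generic spectral shape of the source.
        self.net[0].load_state_dict(pretrained_encoder[0].state_dict())

    def forward(self, x):
        return self.net(x)

# Stand-in data: random tensors in place of aligned source/target spectra.
src = torch.randn(1024, SPEC_DIM)
tgt = torch.randn(1024, SPEC_DIM)

# 1) Pretrain the autoencoder on source spectra (reconstruction loss).
ae = SpectralAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    nn.functional.mse_loss(ae(src), src).backward()
    opt.step()

# 2) Fine-tune the initialized conversion DNN on parallel frames.
dnn = ConversionDNN(ae.encoder)
opt = torch.optim.Adam(dnn.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    nn.functional.mse_loss(dnn(src), tgt).backward()
    opt.step()
```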
A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems
Recently, the effectiveness of text-to-speech (TTS) systems combined with
neural vocoders to generate high-fidelity speech has been shown. However,
collecting the required training data and building these advanced systems from
scratch are time- and resource-consuming. An economical approach is to develop a
neural vocoder to enhance the speech generated by existing or low-cost TTS
systems. Nonetheless, this approach usually suffers from two issues: 1)
temporal mismatches between TTS and natural waveforms and 2) acoustic
mismatches between training and testing data. To address these issues, we adopt
a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC
data for training and acoustically matched enhanced data for testing the neural
vocoders. Because of its generality, this framework can be applied to arbitrary
TTS systems and neural vocoders. In this paper, we apply the proposed method
with a state-of-the-art WaveNet vocoder for two different basic TTS systems,
and both objective and subjective experimental results confirm the
effectiveness of the proposed framework. Comment: 5 pages, 8 figures, 1 table. Proc. Interspeech, 202
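The data flow described above can be pictured with the hypothetical sketch
below, in which the three callables (tts_acoustic_model, cyclic_vc,
neural_vocoder_*) are placeholders standing in for trained models, not the
paper's actual components.

```python
# Hypothetical sketch of the cyclic-VC mismatch-refinement data flow.
import numpy as np

def tts_acoustic_model(text):            # text -> TTS mel-spectrogram (stub)
    return np.zeros((120, 80))

def cyclic_vc(mel, direction):           # VC between natural and TTS domains (stub)
    return mel

def neural_vocoder_train(mel, wav):      # one vocoder training step (stub)
    pass

def neural_vocoder_infer(mel):           # vocoder synthesis (stub)
    return np.zeros(120 * 256)

# Training: natural mel is mapped into the TTS domain ("pseudo-VC" data), so
# the vocoder sees TTS-like features that remain time-aligned with the
# natural waveform, avoiding the temporal mismatch.
natural_mel, natural_wav = np.zeros((120, 80)), np.zeros(120 * 256)
pseudo_vc_mel = cyclic_vc(natural_mel, direction="natural->tts")
neural_vocoder_train(pseudo_vc_mel, natural_wav)

# Testing: the TTS output is mapped back toward the natural domain
# ("enhanced" data), reducing the acoustic mismatch before vocoding.
tts_mel = tts_acoustic_model("hello world")
enhanced_mel = cyclic_vc(tts_mel, direction="tts->natural")
waveform = neural_vocoder_infer(enhanced_mel)
```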
Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis
Recent studies have shown that text-to-speech synthesis quality can be
improved by using glottal vocoding, i.e., by vocoders that parameterize
speech into two parts, the glottal excitation and the vocal tract,
corresponding to the human speech production apparatus. Current glottal
vocoders generate the
glottal excitation waveform by using deep neural networks (DNNs). However, the
squared error-based training of the present glottal excitation models is
limited to generating conditional average waveforms, which fails to capture the
stochastic variation of the waveforms. As a result, shaped noise is added as
post-processing. In this study, we propose a new method for predicting glottal
waveforms by generative adversarial networks (GANs). GANs are generative models
that aim to embed the data distribution in a latent space, enabling generation
of new instances very similar to the original by randomly sampling the latent
distribution. The glottal pulses generated by GANs show a stochastic component
similar to natural glottal pulses. In our experiments, we compare synthetic
speech generated using glottal waveforms produced by both DNNs and GANs. The
results show that the newly proposed GANs achieve synthesis quality comparable
to that of widely used DNNs, without using an additive noise component. Comment: Accepted in Interspeech
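As a rough illustration of the adversarial setup described above (not the
authors' architecture), the sketch below trains a generator to map noise plus
an acoustic conditioning vector to a fixed-length glottal pulse, while a
discriminator distinguishes generated from natural pulses. Pulse length,
conditioning dimension, and layer sizes are assumptions.

```python
# Minimal conditional-GAN sketch for glottal pulse generation (illustrative).
import torch
import torch.nn as nn

PULSE_LEN, COND_DIM, NOISE_DIM = 400, 20, 100

G = nn.Sequential(nn.Linear(NOISE_DIM + COND_DIM, 256), nn.ReLU(),
                  nn.Linear(256, PULSE_LEN), nn.Tanh())
D = nn.Sequential(nn.Linear(PULSE_LEN + COND_DIM, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real_pulses = torch.randn(64, PULSE_LEN)   # stand-in for natural glottal pulses
cond = torch.randn(64, COND_DIM)           # stand-in acoustic conditioning

for _ in range(200):
    # Discriminator step: separate real pulses from generated ones.
    z = torch.randn(64, NOISE_DIM)
    fake = G(torch.cat([z, cond], dim=1)).detach()
    d_loss = (bce(D(torch.cat([real_pulses, cond], 1)), torch.ones(64, 1)) +
              bce(D(torch.cat([fake, cond], 1)), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator, so sampling the noise input
    # yields pulses with a natural-like stochastic component.
    z = torch.randn(64, NOISE_DIM)
    fake = G(torch.cat([z, cond], dim=1))
    g_loss = bce(D(torch.cat([fake, cond], 1)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```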
RNN-based speech synthesis using a continuous sinusoidal model
Recently, in statistical parametric speech synthesis, we proposed a continuous
sinusoidal model (CSM) using continuous F0 (contF0) in combination with the
Maximum Voiced Frequency (MVF), which achieved performance similar to
state-of-the-art vocoders (e.g., STRAIGHT) in synthesized speech. In this
paper, we address the use of sequence-to-sequence modeling with recurrent
neural networks (RNNs). Bidirectional long short-term memory (Bi-LSTM) is
investigated and applied with our CSM to model contF0, MVF, and the
Mel-Generalized Cepstrum (MGC) for more natural-sounding synthesized speech.
To refine the output of the contF0 estimation, post-processing based on a
time-warping approach is applied to reduce unwanted voiced components in
unvoiced speech sounds, resulting in an enhanced contF0 track. The overall
conclusion is supported by an objective evaluation and a subjective listening
test, showing that the proposed framework provides satisfactory results in
terms of naturalness and intelligibility, and is comparable to RNNs based on
the high-quality WORLD vocoder. Comment: 8 pages, 4 figures, Accepted to IJCNN 201
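A minimal sketch, under assumed feature dimensions, of a Bi-LSTM acoustic
model of the kind described above: linguistic input frames are mapped jointly
to the CSM parameter streams (contF0, MVF, MGC). This is illustrative, not
the authors' exact configuration.

```python
# Illustrative Bi-LSTM acoustic model predicting CSM parameter streams
# (contF0, MVF, MGC) from linguistic features; all sizes are assumptions.
import torch
import torch.nn as nn

LING_DIM, MGC_DIM = 300, 60
OUT_DIM = 1 + 1 + MGC_DIM        # contF0 + MVF + MGC per frame

class CSMAcousticModel(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(LING_DIM, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, OUT_DIM)

    def forward(self, ling):                    # ling: (batch, frames, LING_DIM)
        h, _ = self.blstm(ling)
        out = self.proj(h)                      # (batch, frames, OUT_DIM)
        contf0, mvf, mgc = out[..., :1], out[..., 1:2], out[..., 2:]
        return contf0, mvf, mgc

model = CSMAcousticModel()
ling = torch.randn(8, 500, LING_DIM)            # dummy linguistic features
contf0, mvf, mgc = model(ling)
# Stand-in targets; in practice these come from CSM analysis of speech.
loss = sum(nn.functional.mse_loss(p, torch.zeros_like(p))
           for p in (contf0, mvf, mgc))
loss.backward()
```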
Modeling Singing F0 With Neural Network Driven Transition-Sustain Models
This study focuses on generating fundamental frequency (F0) curves of singing
voice from musical scores stored in a MIDI-like notation. Current statistical
parametric approaches to singing F0 modeling have difficulty reproducing
vibrato and the temporal details at note boundaries due to the oversmoothing
tendency of statistical models. This paper presents a neural network based
solution that models a pair of neighboring notes at a time (the transition
model) and uses a separate network for generating vibratos (the sustain model).
Predictions from the two models are combined by summation after proper
enveloping to enforce continuity. In the training phase, mild misalignment
between the scores and the target F0 is addressed by back-propagating the
gradients to the networks' inputs. Subjective listening tests on the NITech
singing database show that transition-sustain models are able to generate F0
trajectories close to the original performance. Comment: 5 pages, 5 figures
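The combination step can be pictured with the hypothetical sketch below: the
transition model's output is enveloped to act near the note boundary and the
sustain model's vibrato is enveloped inside the note, and the two are summed
onto the nominal note pitch. The envelope shapes, frame rate, and treating
both outputs as F0 deviations in Hz are assumptions, not the paper's scheme.

```python
# Hypothetical sketch of combining transition- and sustain-model outputs by
# summation after enveloping (assumes 100 frames/s and Hz-valued deviations).
import numpy as np

frames = 200                                   # frames in one note
note_f0 = np.full(frames, 440.0)               # nominal note pitch (Hz)

transition = np.linspace(-30.0, 0.0, frames)   # stand-in transition-model output (Hz)
vibrato = 5.0 * np.sin(2 * np.pi * 6.0 * np.arange(frames) / 100.0)  # stand-in sustain output (Hz)

# Envelopes: the transition dominates near the note onset, the vibrato
# fades in once the note has settled, enforcing a continuous F0 curve.
t = np.linspace(0.0, 1.0, frames)
env_transition = np.clip(1.0 - 4.0 * t, 0.0, 1.0)
env_sustain = np.clip(4.0 * t - 1.0, 0.0, 1.0)

f0_curve = note_f0 + env_transition * transition + env_sustain * vibrato
print(f0_curve[:5])
```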
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper, we
investigate under what conditions the neural sequence-to-sequence TTS can work
well in Japanese and English along with comparisons with deep neural network
(DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline
systems also use autoregressive probabilistic modeling and a neural vocoder. We
investigated systems from three aspects: a) model architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high-quality speech, b) it should also use a powerful
encoder when it takes characters as inputs, and c) the encoder still has room
for improvement and needs an improved architecture to learn supra-segmental
features more appropriately.
Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks
The fundamental frequency (F0) represents pitch in speech, determines its
prosodic characteristics, and is needed in various speech analysis and
synthesis tasks. Despite decades of research on this topic, F0
estimation at low signal-to-noise ratios (SNRs) in unexpected noise conditions
remains difficult. This work proposes a new approach to noise robust F0
estimation using a recurrent neural network (RNN) trained in a supervised
manner. Recent studies employ deep neural networks (DNNs) for F0 tracking as a
frame-by-frame classification task into quantised frequency states, but we
propose waveform-to-sinusoid regression instead to achieve both noise
robustness and accurate estimation with increased frequency resolution.
Experimental results with the PTDB-TUG corpus contaminated by additive noise
(NOISEX-92) demonstrate that the proposed method improves the gross pitch
error (GPE) rate and fine pitch error (FPE) by more than 35 % at SNRs between
-10 dB and +10 dB compared with the well-known noise-robust F0 tracker PEFAC.
Furthermore, the proposed method also outperforms state-of-the-art DNN-based
approaches by more than 15 % in terms of both FPE and GPE rate over the same
SNR range. Comment: Accepted by peer review for Interspeech 201
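A minimal sketch of the waveform-to-sinusoid regression idea (illustrative,
not the paper's model): an RNN regresses noisy waveform frames onto samples
of a clean sinusoid at the reference F0, and F0 is later recovered from the
predicted sinusoid at inference time (not shown). Frame length and the GRU
architecture are assumptions.

```python
# Illustrative waveform-to-sinusoid regression for noise-robust F0 estimation.
import torch
import torch.nn as nn

FS, FRAME = 16000, 320                       # 20 ms frames at 16 kHz (assumed)

class Wave2Sine(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(FRAME, hidden, batch_first=True)
        self.out = nn.Linear(hidden, FRAME)   # per-frame sinusoid samples

    def forward(self, frames):                # frames: (batch, n_frames, FRAME)
        h, _ = self.rnn(frames)
        return self.out(h)

def sine_target(f0_hz, n_frames):
    # Unit-amplitude sinusoid at the reference F0, cut into frames.
    t = torch.arange(n_frames * FRAME) / FS
    return torch.sin(2 * torch.pi * f0_hz * t).reshape(n_frames, FRAME)

model = Wave2Sine()
noisy = torch.randn(1, 50, FRAME)             # stand-in noisy speech frames
target = sine_target(200.0, 50).unsqueeze(0)  # reference F0 = 200 Hz
loss = nn.functional.mse_loss(model(noisy), target)
loss.backward()
```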
Error Reduction Network for DBLSTM-based Voice Conversion
So far, many deep learning approaches to voice conversion produce
good-quality speech by using a large amount of training data. This paper
presents a Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice
conversion framework that can work with a limited amount of training data. We
propose to implement a DBLSTM-based average model that is trained with data
from many speakers. Then, we propose to perform adaptation with a limited
amount of target data. Last but not least, we propose an error reduction
network that can improve the voice conversion quality even further. The
proposed framework is motivated by three observations. Firstly, a DBLSTM can
achieve remarkable voice conversion quality by considering the long-term
dependencies of the speech utterance. Secondly, a DBLSTM-based average model
can be easily adapted with a small amount of data to produce speech that
sounds closer to the target. Thirdly, an error reduction network can be
trained with a small amount of training data and can improve the conversion
quality effectively. The experiments show that the proposed voice conversion
framework is flexible enough to work with limited training data and
outperforms traditional frameworks in both objective and subjective
evaluations. Comment: Accepted by APSIPA 201
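A schematic sketch of the three-stage recipe (assumptions throughout, not the
paper's implementation): train an average bidirectional-LSTM conversion model
on many speakers, fine-tune it on a small target set, and train a second
error reduction network on the residual between the adapted model's output
and the target features.

```python
# Schematic sketch: average model, adaptation, and error reduction network.
import torch
import torch.nn as nn

FEAT = 60                                     # assumed spectral feature size

class DBLSTMConverter(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(FEAT, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, FEAT)

    def forward(self, x):                     # x: (batch, frames, FEAT)
        h, _ = self.lstm(x)
        return self.out(h)

def train(model, src, tgt, steps=100, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(src), tgt).backward()
        opt.step()

# 1) Average model trained on many speakers (dummy tensors stand in for data).
multi_src, multi_tgt = torch.randn(32, 200, FEAT), torch.randn(32, 200, FEAT)
average_model = DBLSTMConverter()
train(average_model, multi_src, multi_tgt)

# 2) Adaptation: fine-tune the average model on a small target set.
small_src, small_tgt = torch.randn(2, 200, FEAT), torch.randn(2, 200, FEAT)
train(average_model, small_src, small_tgt, steps=20, lr=1e-4)

# 3) Error reduction network learns to predict the adapted model's residual.
with torch.no_grad():
    converted = average_model(small_src)
error_net = DBLSTMConverter()
train(error_net, converted, small_tgt - converted)

# At conversion time, the predicted residual is added back to the output.
refined = average_model(small_src) + error_net(average_model(small_src))
```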
A comparison of Vietnamese Statistical Parametric Speech Synthesis Systems
In recent years, statistical parametric speech synthesis (SPSS) systems have
been widely utilized in many interactive speech-based systems (e.g., Amazon's
Alexa, Bose's headphones). To select a suitable SPSS system, both speech
quality and performance efficiency (e.g., decoding time) must be taken into
account. In this paper, we compared four popular Vietnamese SPSS techniques:
1) hidden Markov models (HMM), 2) deep neural networks (DNN), 3) generative
adversarial networks (GAN), and 4) end-to-end (E2E) architectures, which
consist of Tacotron 2 and a WaveGlow vocoder, in terms of speech quality and
performance efficiency. We showed that the E2E systems accomplished the best
quality, but required a GPU to achieve real-time performance. We also showed
that the HMM-based system had inferior speech quality, but was the most
efficient system. Surprisingly, the E2E systems were more efficient than the
DNN and GAN systems in GPU inference, and the GAN-based system did not
outperform the DNN in terms of quality. Comment: 9 pages, submitted to KSE 202
Probabilistic Binary-Mask Cocktail-Party Source Separation in a Convolutional Deep Neural Network
Separation of competing speech is a key challenge in signal processing and a
feat routinely performed by the human auditory brain. A long-standing
benchmark of the spectrogram approach to source separation is known as the
ideal binary
mask. Here, we train a convolutional deep neural network, on a two-speaker
cocktail party problem, to make probabilistic predictions about binary masks.
Our results approach ideal binary mask performance, illustrating that
relatively simple deep neural networks are capable of robust binary mask
prediction. We also illustrate the trade-off between prediction statistics and
separation quality.
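As a minimal sketch of the approach described in the abstract (architecture
and sizes are assumptions), a small convolutional network maps a mixture
spectrogram to per-bin probabilities that are trained against the ideal
binary mask with a cross-entropy loss; thresholding the probabilities yields
the estimated binary mask used for separation.

```python
# Minimal sketch: CNN predicting probabilistic binary masks for a two-speaker
# mixture spectrogram; sizes and architecture are illustrative assumptions.
import torch
import torch.nn as nn

FREQ_BINS, FRAMES = 257, 100

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),            # logits per T-F bin
)

# Dummy data: mixture spectrogram and its ideal binary mask (1 where the
# target speaker dominates the mixture, 0 elsewhere).
mixture = torch.randn(8, 1, FREQ_BINS, FRAMES)
ideal_binary_mask = (torch.rand(8, 1, FREQ_BINS, FRAMES) > 0.5).float()

logits = cnn(mixture)
loss = nn.functional.binary_cross_entropy_with_logits(logits, ideal_binary_mask)
loss.backward()

# Probabilistic mask, a hard estimate by thresholding, and the masked mixture.
prob_mask = torch.sigmoid(logits)
est_mask = (prob_mask > 0.5).float()
separated = mixture * est_mask
```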