Neural Voice Cloning with a Few Samples
Voice cloning is a highly desired feature for personalized speech interfaces.
Neural network based speech synthesis has been shown to generate high quality
speech for a large number of speakers. In this paper, we introduce a neural
voice cloning system that takes a few audio samples as input. We study two
approaches: speaker adaptation and speaker encoding. Speaker adaptation is
based on fine-tuning a multi-speaker generative model with a few cloning
samples. Speaker encoding is based on training a separate model to directly
infer a new speaker embedding from cloning audio, which is then used with a
multi-speaker generative model. In terms of naturalness of the speech and its
similarity to the original speaker, both approaches can achieve good
performance, even with very few cloning audio samples. While speaker adaptation
can achieve better naturalness and similarity, the cloning time and required
memory for the speaker encoding approach are significantly lower, making it
favorable for low-resource deployment.
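
A minimal PyTorch sketch of the two cloning strategies described above (module names, sizes and the dummy data are hypothetical, not the authors' implementation): speaker adaptation fine-tunes the generative model together with a fresh embedding on the cloning samples, while speaker encoding predicts the embedding in a single forward pass.

    import torch
    import torch.nn as nn

    class MultiSpeakerTTS(nn.Module):
        """Toy multi-speaker generative model: text + speaker embedding -> mel frames."""
        def __init__(self, vocab=64, emb_dim=32, mel_dim=80):
            super().__init__()
            self.text_emb = nn.Embedding(vocab, 128)
            self.spk_proj = nn.Linear(emb_dim, 128)
            self.dec = nn.GRU(128, 256, batch_first=True)
            self.to_mel = nn.Linear(256, mel_dim)

        def forward(self, text_ids, spk_emb):
            h, _ = self.dec(self.text_emb(text_ids) + self.spk_proj(spk_emb).unsqueeze(1))
            return self.to_mel(h)

    class SpeakerEncoder(nn.Module):
        """Predicts a speaker embedding directly from cloning audio (mel frames)."""
        def __init__(self, mel_dim=80, emb_dim=32):
            super().__init__()
            self.rnn = nn.GRU(mel_dim, 128, batch_first=True)
            self.head = nn.Linear(128, emb_dim)

        def forward(self, mels):
            _, h = self.rnn(mels)
            return self.head(h[-1])

    tts = MultiSpeakerTTS()
    text = torch.randint(0, 64, (1, 20))
    cloning_mels = torch.randn(1, 200, 80)   # a few seconds of cloning audio (dummy)
    target_mel = torch.randn(1, 20, 80)      # dummy cloning target, for illustration only

    # (a) Speaker adaptation: fine-tune the model weights plus a new embedding.
    new_emb = nn.Parameter(torch.zeros(1, 32))
    opt = torch.optim.Adam(list(tts.parameters()) + [new_emb], lr=1e-4)
    loss = nn.functional.l1_loss(tts(text, new_emb), target_mel)
    loss.backward()
    opt.step()

    # (b) Speaker encoding: a separately trained encoder infers the embedding in one pass.
    with torch.no_grad():
        emb = SpeakerEncoder()(cloning_mels)
    mel_out = tts(text, emb)
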
Neural voice cloning with a few low-quality samples
In this paper, we explore the possibility of speech synthesis from
low-quality found data using only a limited number of samples of the target
speaker. We try to extract only the speaker embedding from found data of the
target speaker, unlike previous works which try to train the entire
text-to-speech system on found data. In addition, the two speaker mimicking
approaches, adaptation-based and speaker-encoder-based, are applied to the
newly released LibriTTS dataset and the previously released VCTK corpus to
examine the impact of speaker variety on clarity and target-speaker similarity.
NAUTILUS: a Versatile Voice Cloning System
We introduce a novel speech synthesis system, called NAUTILUS, that can
generate speech with a target voice either from a text input or a reference
utterance of an arbitrary source speaker. By using a multi-speaker speech
corpus to train all requisite encoders and decoders in the initial training
stage, our system can clone unseen voices using untranscribed speech of target
speakers on the basis of the backpropagation algorithm. Moreover, depending on
the data circumstance of the target speaker, the cloning strategy can be
adjusted to take advantage of additional data and modify the behaviors of
text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the
situation. We test the performance of the proposed framework by using deep
convolution layers to model the encoders, decoders and WaveNet vocoder.
Evaluations show that it achieves comparable quality with state-of-the-art TTS
and VC systems when cloning with just five minutes of untranscribed speech.
Moreover, it is demonstrated that the proposed framework has the ability to
switch between TTS and VC with high speaker consistency, which will be useful
for many applications.
Comment: Submitted to The IEEE/ACM Transactions on Audio, Speech, and Language Processing
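
A rough sketch of the cloning-by-backpropagation idea mentioned above (heavily simplified, with hypothetical modules; not the NAUTILUS code): a frozen speech encoder maps untranscribed target-speaker audio to a latent representation, and speaker-dependent decoder parameters are updated by backpropagation to reconstruct that audio.

    import torch
    import torch.nn as nn

    speech_encoder = nn.GRU(80, 128, batch_first=True)   # shared audio-to-latent encoder, kept frozen
    decoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 80))

    for p in speech_encoder.parameters():
        p.requires_grad_(False)

    opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
    untranscribed = torch.randn(4, 300, 80)               # target-speaker mel frames (dummy)

    for step in range(100):                               # short adaptation loop via backprop
        latent, _ = speech_encoder(untranscribed)
        loss = nn.functional.l1_loss(decoder(latent), untranscribed)
        opt.zero_grad()
        loss.backward()
        opt.step()
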
Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement
With the popularity of deep neural networks, the speech synthesis task has
recently achieved significant improvements based on the end-to-end
encoder-decoder framework. More and more applications relying on speech
synthesis technology are used in our daily life. A robust speech synthesis
model depends on high-quality, customized data, which requires substantial
collection effort. It is therefore worth investigating how to take advantage of
low-quality, low-resource voice data, which can be easily obtained from the
Internet, for synthesizing personalized voices. In this paper, the
proposed end-to-end speech synthesis model uses both speaker embedding and
noise representation as conditional inputs to model speaker and noise
information respectively. Firstly, the speech synthesis model is pre-trained
with both multi-speaker clean data and noisy augmented data; then the
pre-trained model is adapted on noisy low-resource new speaker data; finally,
by setting the clean speech condition, the model can synthesize the new
speaker's clean voice. Experimental results show that the speech generated by
the proposed approach has better subjective evaluation results than the method
directly fine-tuning pre-trained multi-speaker speech synthesis model with
denoised new speaker data
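
A minimal sketch of the conditioning scheme described above (hypothetical names and dimensions, not the paper's model): the synthesis model takes both a speaker embedding and a noise representation as conditional inputs, and at inference the noise condition is set to "clean" so the adapted model produces the new speaker's clean voice.

    import torch
    import torch.nn as nn

    class CondTTS(nn.Module):
        def __init__(self, vocab=64, spk_dim=32, noise_dim=8, mel_dim=80):
            super().__init__()
            self.text_emb = nn.Embedding(vocab, 128)
            self.cond = nn.Linear(spk_dim + noise_dim, 128)
            self.dec = nn.GRU(128, 256, batch_first=True)
            self.out = nn.Linear(256, mel_dim)

        def forward(self, text_ids, spk_emb, noise_emb):
            c = self.cond(torch.cat([spk_emb, noise_emb], dim=-1)).unsqueeze(1)
            h, _ = self.dec(self.text_emb(text_ids) + c)
            return self.out(h)

    model = CondTTS()                       # would be pre-trained on clean + noise-augmented data
    text = torch.randint(0, 64, (1, 20))
    spk = torch.randn(1, 32)                # embedding adapted on the noisy low-resource speaker
    CLEAN = torch.zeros(1, 8)               # convention here: all-zero noise code means "clean"
    mel = model(text, spk, CLEAN)           # synthesize the new speaker's clean voice
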
High quality, lightweight and adaptable TTS using LPCNet
We present a lightweight adaptable neural TTS system with high quality
output. The system is composed of three separate neural network blocks: prosody
prediction, acoustic feature prediction and Linear Prediction Coding Net as a
neural vocoder. This system can synthesize speech with close to natural quality
while running 3 times faster than real-time on a standard CPU. The modular
setup of the system allows for simple adaptation to new voices with a small
amount of data. We first demonstrate the ability of the system to produce high
quality speech when trained on large, high quality datasets. Following that, we
demonstrate its adaptability by mimicking unseen voices using 5 to 20 minutes
long datasets with lower recording quality. Large-scale Mean Opinion Score
quality and similarity tests are presented, showing that the system can adapt
to unseen voices with a quality gap of 0.12 and a similarity gap of 3% compared
to natural speech for male voices, and a quality gap of 0.35 and a similarity
gap of 9% for female voices.
Comment: Accepted to Interspeech 201
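
A conceptual sketch of the three-block modular setup described above (function names and shapes are placeholders, not the authors' code): prosody prediction, acoustic feature prediction, and an LPCNet-style neural vocoder are separate stages, which is what makes adapting to a new voice with little data straightforward, since only the stages that need it are retrained.

    from typing import Dict, List

    def predict_prosody(phonemes: List[str]) -> List[Dict[str, float]]:
        # Stage 1 (stub): per-phoneme duration and pitch targets.
        return [{"duration_ms": 80.0, "f0_hz": 120.0} for _ in phonemes]

    def predict_acoustic_features(prosody: List[Dict[str, float]]) -> List[List[float]]:
        # Stage 2 (stub): acoustic feature frames conditioned on the predicted prosody.
        return [[0.0] * 20 for _ in prosody]

    def lpcnet_vocode(frames: List[List[float]]) -> List[float]:
        # Stage 3 (stub): waveform samples from acoustic frames via an LPCNet-style vocoder.
        return [0.0] * (len(frames) * 160)

    # End-to-end synthesis is just the composition of the three stages.
    audio = lpcnet_vocode(predict_acoustic_features(predict_prosody(["h", "ə", "l", "oʊ"])))
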
Sample Efficient Adaptive Text-to-Speech
We present a meta-learning approach for adaptive text-to-speech (TTS) with
little data. During training, we learn a multi-speaker model using a shared
conditional WaveNet core and independent learned embeddings for each speaker.
The aim of training is not to produce a neural network with fixed weights,
which is then deployed as a TTS system. Instead, the aim is to produce a
network that requires little data at deployment time to rapidly adapt to new
speakers. We introduce and benchmark three strategies: (i) learning the speaker
embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire
architecture with stochastic gradient descent, and (iii) predicting the speaker
embedding with a trained neural network encoder. The experiments show that
these approaches are successful at adapting the multi-speaker neural network to
new speakers, obtaining state-of-the-art results in both sample naturalness and
voice similarity with merely a few minutes of audio data from new speakers.
Comment: Accepted by ICLR 201
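
A hedged sketch of the three adaptation strategies benchmarked above, using a generic conditional model as a stand-in for the shared WaveNet core (names, sizes and the dummy adaptation data are assumptions): the essential difference is which parameters are optimised at deployment time, and whether any gradient steps are needed at all.

    import torch
    import torch.nn as nn

    core = nn.GRU(80 + 32, 256, batch_first=True)      # stand-in for the shared WaveNet core
    embedding = nn.Parameter(torch.zeros(1, 32))        # embedding for the new speaker

    # (i) embedding-only: optimise the speaker embedding, keep the core frozen
    opt_i = torch.optim.SGD([embedding], lr=1e-2)

    # (ii) whole-architecture fine-tuning with stochastic gradient descent
    opt_ii = torch.optim.SGD([embedding] + list(core.parameters()), lr=1e-3)

    # (iii) encoder prediction: a trained encoder maps adaptation audio straight to the
    # embedding, so no gradient steps are taken at deployment time
    encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 32))
    with torch.no_grad():
        adaptation_mels = torch.randn(200, 80)           # a few minutes of audio (dummy frames)
        predicted_embedding = encoder(adaptation_mels).mean(dim=0, keepdim=True)
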
Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks
We propose the multi-head convolutional neural network (MCNN) architecture
for waveform synthesis from spectrograms. Nonlinear interpolation in MCNN is
employed with transposed convolution layers in parallel heads. MCNN achieves
more than an order of magnitude higher compute intensity than commonly-used
iterative algorithms such as Griffin-Lim, yielding efficient utilization of
modern multi-core processors, and very fast (more than 300x real-time) waveform
synthesis. For training of MCNN, we use a large-scale speech recognition
dataset and losses defined on waveforms that are related to perceptual audio
quality. We demonstrate that MCNN constitutes a very promising approach for
high-quality speech synthesis, without any iterative algorithms or
autoregression in computations.
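
A minimal sketch of the multi-head idea (simplified; kernel sizes, strides and head count here are assumptions, not the paper's configuration): several parallel stacks of transposed convolutions upsample the spectrogram to waveform length in one feed-forward pass and their outputs are summed, so no Griffin-Lim-style iteration or autoregression is involved.

    import torch
    import torch.nn as nn

    class Head(nn.Module):
        def __init__(self, n_mels=80):
            super().__init__()
            # two transposed-conv layers, each upsampling time by 16x (256x total ~ hop length)
            self.net = nn.Sequential(
                nn.ConvTranspose1d(n_mels, 64, kernel_size=32, stride=16, padding=8),
                nn.ELU(),
                nn.ConvTranspose1d(64, 1, kernel_size=32, stride=16, padding=8),
            )

        def forward(self, spec):            # spec: (B, n_mels, T)
            return self.net(spec)           # (B, 1, ~256*T)

    class MCNNSketch(nn.Module):
        def __init__(self, num_heads=4):
            super().__init__()
            self.heads = nn.ModuleList(Head() for _ in range(num_heads))

        def forward(self, spec):
            # sum the parallel heads' waveform estimates
            return torch.stack([h(spec) for h in self.heads]).sum(dim=0)

    wave = MCNNSketch()(torch.randn(1, 80, 100))   # (1, 1, 25600) samples, non-iterative
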
In Other News: A Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data
Neural text-to-speech synthesis (NTTS) models have shown significant progress
in generating high-quality speech; however, they require a large quantity of
training data. This makes creating models for multiple styles expensive and
time-consuming. In this paper, different styles of speech are analysed based on
prosodic variations; from this, a model is proposed to synthesise speech in the
style of a newscaster with just a few hours of supplementary data. We pose the
problem of synthesising in a target style using limited data as that of
creating a bi-style model that can synthesise both neutral-style and
newscaster-style speech via a one-hot vector which factorises the two styles.
We also propose conditioning the model on contextual word embeddings, and
extensively evaluate it against neutral NTTS, and neutral concatenative-based
synthesis. This model closes the gap in perceived style-appropriateness between
natural recordings of newscaster-style speech and neutral speech synthesis by
approximately two-thirds.
Comment: Accepted at NAACL-HLT 201
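
A simplified sketch of the bi-style conditioning (hypothetical modules and dimensions): a two-dimensional one-hot vector factorises neutral versus newscaster style, and contextual word embeddings, for example from a pretrained language model, act as an additional conditioning input. The word embeddings are assumed here to be already upsampled to the phone rate.

    import torch
    import torch.nn as nn

    class BiStyleTTS(nn.Module):
        def __init__(self, vocab=64, word_emb_dim=768, mel_dim=80):
            super().__init__()
            self.phone_emb = nn.Embedding(vocab, 128)
            self.word_proj = nn.Linear(word_emb_dim, 128)   # contextual word embeddings
            self.style_proj = nn.Linear(2, 128)             # one-hot: [neutral, newscaster]
            self.dec = nn.GRU(128, 256, batch_first=True)
            self.out = nn.Linear(256, mel_dim)

        def forward(self, phones, word_embs, style_onehot):
            h = (self.phone_emb(phones) + self.word_proj(word_embs)
                 + self.style_proj(style_onehot).unsqueeze(1))
            y, _ = self.dec(h)
            return self.out(y)

    model = BiStyleTTS()
    phones = torch.randint(0, 64, (1, 30))
    word_embs = torch.randn(1, 30, 768)          # dummy contextual embeddings at phone rate
    newscaster = torch.tensor([[0.0, 1.0]])      # select the newscaster style
    mel = model(phones, word_embs, newscaster)
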
Few Shot Speaker Recognition using Deep Neural Networks
Recent advances in deep learning are mostly driven by the availability of
large amounts of training data. However, such data are not always available for
specific tasks such as speaker recognition, where collecting large amounts of
data is not feasible in practical scenarios. Therefore, in this
paper, we propose to identify speakers by learning from only a few training
examples. To achieve this, we use a deep neural network with prototypical loss
where the input to the network is a spectrogram. For output, we project the
class feature vectors into a common embedding space, followed by
classification. Further, we show the effectiveness of capsule networks in a few-shot
learning setting. To this end, we utilize an auto-encoder to learn generalized
feature embeddings from class-specific embeddings obtained from capsule
network. We provide exhaustive experiments on publicly available datasets and
competitive baselines, demonstrating the superiority and generalization ability
of the proposed few-shot learning pipelines.
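
A minimal sketch of a prototypical loss for few-shot speaker identification (this is the generic prototypical-network recipe, not necessarily the authors' exact formulation): each speaker's prototype is the mean of its support embeddings, and query utterances are classified by their distance to the prototypes.

    import torch
    import torch.nn.functional as F

    def prototypical_loss(support, support_labels, query, query_labels, num_classes):
        # support: (Ns, D) embeddings of labelled utterances; query: (Nq, D)
        prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                                  for c in range(num_classes)])      # (C, D)
        dists = torch.cdist(query, prototypes)                       # (Nq, C) Euclidean distances
        return F.cross_entropy(-dists, query_labels)                 # nearer prototype => higher logit

    # toy usage with random stand-ins for spectrogram embeddings (5 speakers, 3 shots each)
    support = torch.randn(15, 64)
    support_labels = torch.arange(5).repeat(3)
    query, query_labels = torch.randn(10, 64), torch.randint(0, 5, (10,))
    loss = prototypical_loss(support, support_labels, query, query_labels, num_classes=5)
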
Non-Autoregressive Neural Text-to-Speech
In this work, we propose ParaNet, a non-autoregressive seq2seq model that
converts text to spectrogram. It is fully convolutional and brings 46.7 times
speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining
reasonably good speech quality. ParaNet also produces stable alignment between
text and speech on the challenging test sentences by iteratively improving the
attention in a layer-by-layer manner. Furthermore, we build the parallel
text-to-speech system and test various parallel neural vocoders, which can
synthesize speech from text through a single feed-forward pass. We also explore
a novel VAE-based approach to train the inverse autoregressive flow (IAF) based
parallel vocoder from scratch, which avoids the need for distillation from a
separately trained WaveNet, as in previous work.
Comment: Published at ICML 2020 (v3 changed paper title).
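
A rough sketch of non-autoregressive decoding with layer-by-layer attention refinement (simplified; not the ParaNet implementation, and module sizes are assumptions): every decoder layer re-attends to the text encoding, so the text-to-speech alignment is refined with depth while all spectrogram frames are produced in a single feed-forward pass.

    import torch
    import torch.nn as nn

    class RefineLayer(nn.Module):
        def __init__(self, d=128):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
            self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)

        def forward(self, dec_h, enc_h):
            ctx, attn_w = self.attn(dec_h, enc_h, enc_h)           # re-attend to the text encoding
            h = self.conv((dec_h + ctx).transpose(1, 2)).transpose(1, 2)
            return torch.relu(h), attn_w

    enc_h = torch.randn(1, 40, 128)          # text encoder output (dummy)
    dec_h = torch.randn(1, 200, 128)         # positional "query" frames, decoded in parallel
    layers = nn.ModuleList(RefineLayer() for _ in range(4))
    for layer in layers:                     # one feed-forward pass; no autoregression
        dec_h, alignment = layer(dec_h, enc_h)
    mel = nn.Linear(128, 80)(dec_h)          # (1, 200, 80) spectrogram frames
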