9 research outputs found
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Recently, voice conversion (VC) without parallel data has been successfully
adapted to the multi-target scenario, in which a single model is trained to
convert an input voice to many different speakers. However, such a model can
only convert voices to speakers seen in the training data, which narrows the
applicable scenarios of VC. In this paper, we propose a novel one-shot VC
approach that performs VC given only one example utterance each from the
source and target speakers, neither of whom needs to be seen during
training. This is
achieved by disentangling speaker and content representations with instance
normalization (IN). Objective and subjective evaluations show that our model
is able to generate voices similar to the target speaker's. In addition to the
performance measurement, we also demonstrate that this model is able to learn
meaningful speaker representations without any supervision.
Comment: Interspeech 201
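The disentanglement mechanism lends itself to a minimal PyTorch sketch (module names, dimensions, and layer sizes below are illustrative assumptions, not the authors' exact architecture): the content encoder applies affine-free instance normalization to strip the global per-channel statistics that carry speaker identity, and the decoder re-injects the target speaker's statistics via adaptive instance normalization.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Affine-free IN removes per-channel statistics that carry speaker cues."""
    def __init__(self, n_mels=80, hid=256):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hid, kernel_size=5, padding=2)
        self.norm = nn.InstanceNorm1d(hid, affine=False)  # normalize only

    def forward(self, mel):               # mel: (B, n_mels, T)
        return self.norm(torch.relu(self.conv(mel)))

class AdaIN(nn.Module):
    """Re-inject target-speaker statistics into the normalized content."""
    def __init__(self, channels=256, spk_dim=128):
        super().__init__()
        self.to_scale = nn.Linear(spk_dim, channels)
        self.to_bias = nn.Linear(spk_dim, channels)

    def forward(self, content, spk):      # content: (B, C, T), spk: (B, spk_dim)
        normed = nn.functional.instance_norm(content)
        return (self.to_scale(spk).unsqueeze(-1) * normed
                + self.to_bias(spk).unsqueeze(-1))
```

Because the content branch keeps no learned affine parameters, any speaker-dependent channel statistics can only re-enter through the speaker embedding, which is what makes unseen source/target speakers workable at test time.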
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
Recently, the cycle-consistent adversarial network (Cycle-GAN) has been
successfully applied to voice conversion to a different speaker without
parallel data, although in those approaches an individual model is needed for
each target speaker. In this paper, we propose an adversarial learning
framework for voice conversion, with which a single model can be trained to
convert the voice to many different speakers, all without parallel data, by
separating the speaker characteristics from the linguistic content in speech
signals. An autoencoder is first trained to separately extract a
speaker-independent latent representation and a speaker embedding, with an
auxiliary speaker classifier regularizing the latent representation. The
decoder then
takes the speaker-independent latent representation and the target speaker
embedding as the input to generate the voice of the target speaker with the
linguistic content of the source utterance. The quality of the decoder output
is further improved by patching it with a residual signal produced by an
additional generator-discriminator pair. A target speaker set size of 20 was
tested in
the preliminary experiments, and very good voice quality was obtained.
Conventional voice conversion metrics are reported. We also show that the
speaker information has been properly reduced from the latent representations.
Comment: Accepted to Interspeech 201
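A rough PyTorch sketch of the adversarial regularization described above (all module interfaces, the negative cross-entropy adversarial term, and the loss weight are illustrative assumptions; the paper's exact objective may differ): the classifier learns to recover the speaker from the latent code, while the autoencoder learns to reconstruct the input and defeat that classifier.

```python
import torch
import torch.nn.functional as F

# enc, dec: autoencoder; clf: auxiliary speaker classifier on the latent;
# spk_emb: an nn.Embedding of speaker vectors. Shapes are illustrative.
def train_step(mel, spk_id, enc, dec, clf, spk_emb, opt_ae, opt_clf, lam=0.1):
    z = enc(mel)                                  # intended: speaker-free latent

    # 1) train the classifier to recover the speaker from a detached latent
    clf_loss = F.cross_entropy(clf(z.detach()), spk_id)
    opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()

    # 2) train the autoencoder to reconstruct while *fooling* the classifier,
    #    which pushes speaker information out of z and into the embedding
    recon = dec(z, spk_emb(spk_id))
    adv = -F.cross_entropy(clf(z), spk_id)        # maximize classifier error
    ae_loss = F.l1_loss(recon, mel) + lam * adv
    opt_ae.zero_grad(); ae_loss.backward(); opt_ae.step()
    return ae_loss.item(), clf_loss.item()
```

At convergence the classifier should perform near chance on z, which is one way to check the claim that speaker information has been reduced from the latent representation.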
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
We propose a parallel-data-free voice-conversion (VC) method that can learn a
mapping from source to target speech without relying on parallel data. The
proposed method is general purpose, high quality, and parallel-data free: it
works without any extra data, modules, or alignment procedures. It also avoids
over-smoothing, which occurs in many conventional statistical model-based VC
methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial
network (CycleGAN) with gated convolutional neural networks (CNNs) and an
identity-mapping loss. A CycleGAN learns forward and inverse mappings
simultaneously using adversarial and cycle-consistency losses. This makes it
possible to find an optimal pseudo pair from unpaired data. Furthermore, the
adversarial loss contributes to reducing over-smoothing of the converted
feature sequence. The gated CNNs allow the mapping function to capture
sequential and hierarchical structures, while the identity-mapping loss helps
preserve linguistic information. We
evaluated our method on a parallel-data-free VC task. An objective evaluation
showed that the converted feature sequence was near natural in terms of global
variance and modulation spectra. A subjective evaluation showed that the
quality of the converted speech was comparable to that obtained with a
Gaussian mixture model-based method trained under advantageous conditions,
i.e., with parallel data and twice the amount of data.
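The three losses named in the abstract can be summarized in a short PyTorch sketch (the generator/discriminator modules, least-squares adversarial form, and loss weights are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

# One CycleGAN-VC-style generator update. G_XY, G_YX: forward/inverse
# mappings between speaker domains X and Y; D_X, D_Y: discriminators.
def generator_loss(x, y, G_XY, G_YX, D_X, D_Y, lam_cyc=10.0, lam_id=5.0):
    fake_y = G_XY(x)
    fake_x = G_YX(y)

    # adversarial: fool both discriminators (least-squares form)
    adv = ((D_Y(fake_y) - 1) ** 2).mean() + ((D_X(fake_x) - 1) ** 2).mean()

    # cycle consistency: X -> Y -> X should recover the input,
    # which lets unpaired data stand in for parallel pairs
    cyc = F.l1_loss(G_YX(fake_y), x) + F.l1_loss(G_XY(fake_x), y)

    # identity mapping: a target-domain input should pass through unchanged,
    # which helps preserve linguistic content
    idt = F.l1_loss(G_XY(y), y) + F.l1_loss(G_YX(x), x)

    return adv + lam_cyc * cyc + lam_id * idt
```

The adversarial term is what counteracts over-smoothing: unlike a pure L1/L2 regression target, the discriminator penalizes converted features whose distribution (e.g., global variance) deviates from real speech.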
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion
Non-parallel voice conversion (VC) is a technique for learning the mapping
from source to target speech without relying on parallel data. This is an
important task, but it has been challenging owing to the disadvantageous
training conditions. Recently, CycleGAN-VC has provided a breakthrough and
performed comparably to a parallel VC method without relying on any extra data,
modules, or time alignment procedures. However, there is still a large gap
between the real target and converted speech, and bridging this gap remains a
challenge. To reduce this gap, we propose CycleGAN-VC2, which is an improved
version of CycleGAN-VC incorporating three new techniques: an improved
objective (two-step adversarial losses), an improved generator (2-1-2D CNN),
and an improved discriminator (PatchGAN). We evaluated our method on a
non-parallel VC
task and analyzed the effect of each technique in detail. An objective
evaluation showed that these techniques help bring the converted feature
sequence closer to the target in terms of both global and local structures,
which we assess by using Mel-cepstral distortion and modulation spectra
distance, respectively. A subjective evaluation showed that CycleGAN-VC2
outperforms CycleGAN-VC in terms of naturalness and similarity for every
speaker pair, including intra-gender and inter-gender pairs.
Comment: Accepted to ICASSP 2019. Project page:
http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.htm
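Of the three techniques, the PatchGAN discriminator is the easiest to sketch; the version below is a generic 2D PatchGAN in PyTorch (layer sizes are illustrative, not necessarily the paper's configuration). Instead of one real/fake score per utterance, it emits a grid of scores, each judging a local time-frequency patch, which encourages locally realistic structure.

```python
import torch.nn as nn

# Minimal 2D PatchGAN discriminator over mel-cepstral feature "images".
class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=1, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, 1, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, feat):        # feat: (B, 1, n_features, T)
        return self.net(feat)       # (B, 1, H', W') grid of patch scores
```

The two-step adversarial loss then applies an adversarial term not only to the converted features but also to the circularly converted (source-target-source) features, mitigating the over-smoothing induced by the cycle-consistency loss.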
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
This paper proposes a method that allows non-parallel many-to-many voice
conversion (VC) by using a variant of a generative adversarial network (GAN)
called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it
(1) requires no parallel utterances, transcriptions, or time alignment
procedures for speech generator training, (2) simultaneously learns
many-to-many mappings across different attribute domains using a single
generator network, (3) is able to generate converted speech signals quickly
enough to allow real-time implementations, and (4) requires only several minutes
of training examples to generate reasonably realistic-sounding speech.
Subjective evaluation experiments on a non-parallel many-to-many speaker
identity conversion task revealed that the proposed method obtained higher
sound quality and speaker similarity than a state-of-the-art method based on
variational autoencoding GANs.
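A condensed PyTorch sketch of the many-to-many objective (module interfaces, the least-squares adversarial form, and the loss weights are assumptions for illustration): a single generator G(x, c) is steered by a target attribute code c, with a conditional discriminator D and a domain classifier C.

```python
import torch
import torch.nn.functional as F

# StarGAN-VC-style generator objective. c_src, c_tgt: one-hot domain
# (speaker) codes of shape (B, n_domains).
def stargan_g_loss(x, c_src, c_tgt, G, D, C, lam_cls=1.0, lam_cyc=10.0, lam_id=5.0):
    fake = G(x, c_tgt)                        # convert x toward domain c_tgt
    adv = ((D(fake, c_tgt) - 1) ** 2).mean()  # fool the conditional critic
    cls = F.cross_entropy(C(fake), c_tgt.argmax(dim=1))  # land in target domain
    cyc = F.l1_loss(G(fake, c_src), x)        # convert back to the source domain
    idt = F.l1_loss(G(x, c_src), x)           # same-domain input is unchanged
    return adv + lam_cls * cls + lam_cyc * cyc + lam_id * idt
```

Because the domain code is just another input, one generator serves every speaker pair, which is what lets the model pool all training data and keep the parameter count independent of the number of speakers.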
A Multi-Discriminator CycleGAN for Unsupervised Non-Parallel Speech Domain Adaptation
Domain adaptation plays an important role in speech recognition models,
particularly for low-resource domains. We propose a novel generative
model based on the cycle-consistent generative adversarial network (CycleGAN) for
unsupervised non-parallel speech domain adaptation. The proposed model employs
multiple independent discriminators on the power spectrogram, each in charge of
different frequency bands. As a result, we obtain (1) discriminators that
better focus on fine-grained details of the frequency features, and (2) a
generator capable of producing more realistic domain-adapted spectrograms. We
demonstrate the effectiveness of our method on speech recognition with gender
adaptation, where the model only has access to supervised data from one gender
during training, but is evaluated on the other at test time. Our model
achieves relative improvements in average phoneme error rate and word error
rate over the baseline on the TIMIT and WSJ datasets, respectively.
Qualitatively, our model also generates more natural-sounding speech when
conditioned on data from the other domain.
Comment: Accepted to Interspeech 201
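The band-wise discriminator idea reduces to a few lines of PyTorch (the band edges, tensor shapes, and score aggregation below are illustrative assumptions):

```python
import torch

# Score a power spectrogram with one independent discriminator per
# frequency band. spec: (B, 1, n_freq, T); discriminators: list of modules.
def multi_disc_scores(spec, discriminators, band_edges=(0, 64, 128, 257)):
    scores = []
    for d, (lo, hi) in zip(discriminators, zip(band_edges, band_edges[1:])):
        scores.append(d(spec[:, :, lo:hi, :]))  # each D sees one band only
    return scores  # combined (e.g., averaged) inside the adversarial loss
```

Restricting each discriminator to a narrow band keeps its task easy enough that it stays sensitive to fine-grained spectral detail, rather than being dominated by coarse full-band cues.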
MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms
Traditional voice conversion methods rely on parallel recordings of multiple
speakers pronouncing the same sentences. For real-world applications however,
parallel data is rarely available. We propose MelGAN-VC, a voice conversion
method that relies on non-parallel speech data and is able to convert audio
signals of arbitrary length from a source voice to a target voice. We first
compute spectrograms from waveform data and then perform a domain translation
using a Generative Adversarial Network (GAN) architecture. An additional
siamese network helps preserve speech information in the translation process,
without sacrificing the ability to flexibly model the style of the target
speaker. We test our framework with a dataset of clean speech recordings, as
well as with a collection of noisy real-world speech examples. Finally, we
apply the same method to perform music style transfer, translating arbitrarily
long music samples from one genre to another and showing that our framework is
flexible and can be used for audio manipulation applications beyond voice
conversion.
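The siamese constraint resembles a TraVeL (transformation vector learning) loss; the PyTorch sketch below (the exact loss combination is an assumption for illustration) keeps the vector between two source-segment embeddings aligned with the vector between their translated counterparts, so content geometry survives translation while the generator stays free to restyle the output.

```python
import torch
import torch.nn.functional as F

# G: spectrogram-to-spectrogram generator; S: siamese embedding network.
# src_a, src_b: two source spectrogram segments from the same batch.
def travel_loss(src_a, src_b, G, S):
    t_src = S(src_a) - S(src_b)          # transformation vector, source space
    t_gen = S(G(src_a)) - S(G(src_b))    # same vector after translation
    # demand agreement in both angle and magnitude
    cos = 1 - F.cosine_similarity(t_src, t_gen, dim=1).mean()
    return cos + F.mse_loss(t_gen, t_src)
```

Splitting long inputs into fixed-size segments, translating each, and enforcing this pair-wise consistency is what allows arbitrarily long samples to be converted coherently.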
ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder
This paper proposes a non-parallel many-to-many voice conversion (VC) method
using a variant of the conditional variational autoencoder (VAE) called an
auxiliary classifier VAE (ACVAE). The proposed method has three key features.
First, it adopts fully convolutional architectures to construct the encoder and
decoder networks so that the networks can learn conversion rules that capture
time dependencies in the acoustic feature sequences of source and target
speech. Second, it uses an information-theoretic regularization during model
training to ensure that the information in the attribute class label will not
be lost in the conversion process. With regular CVAEs, the encoder and decoder
are free to ignore the attribute class label input. This can be problematic
since in such a situation, the attribute class label will have little effect on
controlling the voice characteristics of input speech at test time. Such
situations can be avoided by introducing an auxiliary classifier and training
the encoder and decoder so that the attribute classes of the decoder outputs
are correctly predicted by the classifier. Third, it avoids producing
buzzy-sounding speech at test time by simply transplanting the spectral details
of the input speech into its converted version. Subjective evaluation
experiments revealed that this simple method worked reasonably well in a
non-parallel many-to-many speaker identity conversion task.
Comment: arXiv admin note: substantial text overlap with arXiv:1806.0216
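A compact PyTorch sketch of the auxiliary-classifier idea (the encoder/decoder interfaces, L1 reconstruction term, and weights are illustrative; the paper derives the regularizer information-theoretically): the classifier must recover the attribute label from the decoder output, so the decoder cannot ignore the label.

```python
import torch
import torch.nn.functional as F

# enc, dec: conditional VAE; clf: auxiliary attribute classifier.
# label: integer attribute (e.g., speaker) class per example.
def acvae_loss(x, label, enc, dec, clf, lam_cls=1.0):
    mu, logvar = enc(x, label)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
    recon = dec(z, label)

    rec = F.l1_loss(recon, x)                                 # reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    cls = F.cross_entropy(clf(recon), label)  # label must survive decoding
    return rec + kl + lam_cls * cls
```

Without the classifier term, the CVAE can reach a good reconstruction while routing all attribute information through z, which is exactly the failure mode the abstract describes.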
Nonparallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks
We previously proposed a method that allows for nonparallel voice conversion
(VC) by using a variant of generative adversarial networks (GANs) called
StarGAN. The main features of our method, called StarGAN-VC, are as follows:
First, it requires no parallel utterances, transcriptions, or time alignment
procedures for speech generator training. Second, it can simultaneously learn
mappings across multiple domains using a single generator network and thus
fully exploit available training data collected from multiple domains to
capture latent features that are common to all the domains. Third, it can
generate converted speech signals quickly enough to allow real-time
implementations and requires only several minutes of training examples to
generate reasonably realistic-sounding speech. In this paper, we describe three
formulations of StarGAN, including a newly introduced variant called
"Augmented classifier StarGAN (A-StarGAN)", and compare them in a
nonparallel VC task. We also compare them with several baseline methods.
Comment: Submitted to IEEE/ACM Trans. ASLP. This paper is an extended
full-paper version of arXiv:1806.0216