ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
This paper proposes a voice conversion (VC) method using sequence-to-sequence
(seq2seq or S2S) learning, which flexibly converts not only the voice
characteristics but also the pitch contour and duration of input speech. The
proposed method, called ConvS2S-VC, has three key features. First, it uses a
model with a fully convolutional architecture. This is particularly
advantageous in that it is suitable for parallel computations using GPUs. It is
also beneficial since it enables effective normalization techniques such as
batch normalization to be used for all the hidden layers in the networks.
Second, it achieves many-to-many conversion by simultaneously learning mappings
among multiple speakers using only a single model instead of separately
learning mappings between each speaker pair using a different model. This
enables the model to fully utilize available training data collected from
multiple speakers by capturing common latent features that can be shared across
different speakers. Owing to this structure, our model works reasonably well
even without source speaker information, thus making it able to handle
any-to-many conversion tasks. Third, we introduce a mechanism, called conditional batch normalization, that switches the batch normalization layers in accordance with the target speaker. This mechanism has been found to
be extremely effective for our many-to-many conversion model. We conducted
speaker identity conversion experiments and found that ConvS2S-VC obtained
higher sound quality and speaker similarity than baseline methods. We also
found from audio examples that it could perform well in various tasks including
emotional expression conversion, electrolaryngeal speech enhancement, and
English accent conversion.
Comment: Published in IEEE/ACM Trans. ASLP, https://ieeexplore.ieee.org/document/911344
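The conditional batch normalization mechanism can be pictured with a short PyTorch sketch: the normalization itself is shared, but each target speaker selects its own scale and shift. This is a minimal illustration under assumed layer sizes and an assumed embedding-based parameterization, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    """Minimal sketch: batch norm whose affine parameters are switched
    by the target speaker (parameterization assumed, not from the paper)."""

    def __init__(self, num_features: int, num_speakers: int):
        super().__init__()
        # Normalize without learnable affine parameters ...
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        # ... then apply a per-speaker gain and bias.
        self.gain = nn.Embedding(num_speakers, num_features)
        self.bias = nn.Embedding(num_speakers, num_features)
        nn.init.ones_(self.gain.weight)
        nn.init.zeros_(self.bias.weight)

    def forward(self, x: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); speaker: (batch,) integer IDs
        h = self.bn(x)
        g = self.gain(speaker).unsqueeze(-1)  # (batch, channels, 1)
        b = self.bias(speaker).unsqueeze(-1)
        return g * h + b

# Usage: the same hidden activations, normalized per target speaker.
cbn = ConditionalBatchNorm1d(num_features=80, num_speakers=4)
y = cbn(torch.randn(4, 80, 128), torch.tensor([0, 1, 2, 3]))
```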
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion
Emotional voice conversion aims to convert the emotion of speech from one
state to another while preserving the linguistic content and speaker identity.
The prior studies on emotional voice conversion are mostly carried out under
the assumption that emotion is speaker-dependent. We consider that there is a
common code between speakers for emotional expression in a spoken language,
therefore, a speaker-independent mapping between emotional states is possible.
In this paper, we propose a speaker-independent emotional voice conversion
framework, that can convert anyone's emotion without the need for parallel
data. We propose a VAW-GAN based encoder-decoder structure to learn the
spectrum and prosody mapping. We perform prosody conversion by using continuous
wavelet transform (CWT) to model the temporal dependencies. We also investigate
the use of F0 as an additional input to the decoder to improve emotion
conversion performance. Experiments show that the proposed speaker-independent
framework achieves competitive results for both seen and unseen speakers.
Comment: Accepted by Interspeech 2020
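The CWT-based prosody modelling can be sketched as follows: the log-F0 contour is decomposed into multi-scale wavelet coefficients that expose temporal dependencies at different spans. A minimal sketch using PyWavelets; the Mexican-hat wavelet and dyadic scales are assumptions, not the paper's exact configuration.

```python
import numpy as np
import pywt

def cwt_f0_features(logf0: np.ndarray, num_scales: int = 10) -> np.ndarray:
    """Decompose a (normalized, interpolated) log-F0 contour into
    multi-scale CWT coefficients, one vector per frame."""
    # Dyadic scales capture short- to long-span temporal dependencies.
    scales = 2.0 ** np.arange(1, num_scales + 1)
    coeffs, _ = pywt.cwt(logf0, scales, "mexh")  # (num_scales, frames)
    return coeffs.T                              # (frames, num_scales)

logf0 = np.random.randn(200)    # placeholder contour, 200 frames
feats = cwt_f0_features(logf0)  # prosody features for the decoder
```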
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion
Non-parallel voice conversion (VC) is a technique for learning the mapping
from source to target speech without relying on parallel data. This is an
important task, but it has been challenging because of the disadvantageous training conditions, i.e., the lack of parallel data. Recently, CycleGAN-VC has provided a breakthrough and
performed comparably to a parallel VC method without relying on any extra data,
modules, or time alignment procedures. However, there is still a large gap
between the real target and converted speech, and bridging this gap remains a
challenge. To reduce this gap, we propose CycleGAN-VC2, which is an improved
version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN). We evaluated our method on a non-parallel VC
task and analyzed the effect of each technique in detail. An objective
evaluation showed that these techniques help bring the converted feature
sequence closer to the target in terms of both global and local structures,
which we assess by using Mel-cepstral distortion and modulation spectra
distance, respectively. A subjective evaluation showed that CycleGAN-VC2
outperforms CycleGAN-VC in terms of naturalness and similarity for every
speaker pair, including intra-gender and inter-gender pairs.
Comment: Accepted to ICASSP 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.htm
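A PatchGAN discriminator decides real/fake once per local patch of the feature "image" rather than once per utterance, which sharpens local structure. Below is a minimal PyTorch sketch; the channel counts and kernel sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Minimal PatchGAN sketch: one real/fake score per local patch."""

    def __init__(self, in_channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # 1-channel output: a grid of patch-wise scores instead of
            # a single utterance-level decision.
            nn.Conv2d(128, 1, 3, stride=1, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (batch, 1, H/4, W/4) patch scores

d = PatchDiscriminator()
scores = d(torch.randn(2, 1, 36, 128))  # e.g. 36 mel-cepstra x 128 frames
```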
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
We propose a parallel-data-free voice-conversion (VC) method that can learn a
mapping from source to target speech without relying on parallel data. The
proposed method is general purpose, high quality, and parallel-data free and
works without any extra data, modules, or alignment procedure. It also avoids
over-smoothing, which occurs in many conventional statistical model-based VC
methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial
network (CycleGAN) with gated convolutional neural networks (CNNs) and an
identity-mapping loss. A CycleGAN learns forward and inverse mappings
simultaneously using adversarial and cycle-consistency losses. This makes it
possible to find an optimal pseudo pair from unpaired data. Furthermore, the
adversarial loss contributes to reducing over-smoothing of the converted
feature sequence. We configure a CycleGAN with gated CNNs and train it with an
identity-mapping loss. This allows the mapping function to capture sequential
and hierarchical structures while preserving linguistic information. We
evaluated our method on a parallel-data-free VC task. An objective evaluation
showed that the converted feature sequence was near natural in terms of global
variance and modulation spectra. A subjective evaluation showed that the
quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method trained under advantageous conditions, i.e., with parallel data and twice the amount of data.
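The three losses named above combine as sketched below in PyTorch. The toy generators, discriminators, and loss weights are placeholder assumptions, but the structure, with adversarial, cycle-consistency, and identity-mapping terms, follows the description in the abstract.

```python
import torch
import torch.nn.functional as F

def cyclegan_vc_losses(g_xy, g_yx, d_x, d_y, x, y,
                       lam_cyc: float = 10.0, lam_id: float = 5.0):
    """Minimal sketch of the generator-side CycleGAN-VC objective."""
    fake_y, fake_x = g_xy(x), g_yx(y)
    pred_fy, pred_fx = d_y(fake_y), d_x(fake_x)

    # Adversarial: converted features should fool the discriminators
    # (this term also counteracts over-smoothing).
    adv = (F.binary_cross_entropy_with_logits(pred_fy, torch.ones_like(pred_fy))
           + F.binary_cross_entropy_with_logits(pred_fx, torch.ones_like(pred_fx)))

    # Cycle-consistency: x -> y -> x must reconstruct the input, which is
    # what lets the model find pseudo pairs in unpaired data.
    cyc = F.l1_loss(g_yx(fake_y), x) + F.l1_loss(g_xy(fake_x), y)

    # Identity-mapping: target-domain inputs should pass through unchanged,
    # helping preserve linguistic information.
    idt = F.l1_loss(g_xy(y), y) + F.l1_loss(g_yx(x), x)

    return adv + lam_cyc * cyc + lam_id * idt

# Toy usage with linear stand-ins for the gated-CNN networks:
g_xy, g_yx = torch.nn.Linear(24, 24), torch.nn.Linear(24, 24)
d_x, d_y = torch.nn.Linear(24, 1), torch.nn.Linear(24, 1)
loss = cyclegan_vc_losses(g_xy, g_yx, d_x, d_y,
                          torch.randn(8, 24), torch.randn(8, 24))
```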
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
This paper proposes a method that allows non-parallel many-to-many voice
conversion (VC) by using a variant of a generative adversarial network (GAN)
called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it
(1) requires no parallel utterances, transcriptions, or time alignment
procedures for speech generator training, (2) simultaneously learns
many-to-many mappings across different attribute domains using a single
generator network, (3) is able to generate converted speech signals quickly
enough to allow real-time implementations, and (4) requires only several minutes
of training examples to generate reasonably realistic-sounding speech.
Subjective evaluation experiments on a non-parallel many-to-many speaker
identity conversion task revealed that the proposed method obtained higher
sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.
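The single-generator, many-to-many idea can be sketched as below: a one-hot target-speaker code is broadcast over time and concatenated with the acoustic features, so one network serves every mapping. The layer sizes and the tiny gated-CNN stack are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Minimal sketch: one generator shared across all target speakers."""

    def __init__(self, feat_dim: int = 36, num_speakers: int = 4):
        super().__init__()
        self.num_speakers = num_speakers
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim + num_speakers, 128, 5, padding=2),
            nn.GLU(dim=1),  # gated CNN unit: halves channels to 64
            nn.Conv1d(64, feat_dim, 5, padding=2),
        )

    def forward(self, x: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, time); target: (batch,) speaker IDs
        code = nn.functional.one_hot(target, self.num_speakers).float()
        code = code.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return self.net(torch.cat([x, code], dim=1))

g = ConditionalGenerator()
out = g(torch.randn(2, 36, 100), torch.tensor([1, 3]))  # two target speakers
```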
AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms
This paper describes a method based on sequence-to-sequence (Seq2Seq) learning with attention and context preservation mechanisms for voice
conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving
sequence modeling such as speech synthesis and recognition, machine
translation, and image captioning. In contrast to current VC techniques, our
method 1) stabilizes and accelerates the training procedure by using guided attention and the proposed context preservation losses, 2) allows not only
spectral envelopes but also fundamental frequency contours and durations of
speech to be converted, 3) requires no context information such as phoneme
labels, and 4) requires no time-aligned source and target speech data in
advance. In our experiments, the proposed VC framework can be trained in only one day using a single NVIDIA Tesla K80 GPU, while the quality of the
synthesized speech is higher than that of speech converted by Gaussian mixture
model-based VC and is comparable to that of speech generated by recurrent
neural network-based text-to-speech synthesis, which can be regarded as an
upper limit on VC performance.
Comment: Submitted to ICASSP 2019
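Guided attention penalizes attention weights that stray from the roughly diagonal source-target alignment expected in VC, which stabilizes Seq2Seq training. A minimal sketch of such a loss follows; the width parameter g and the exact Gaussian form are assumptions patterned on the guided attention idea, not the paper's verbatim definition.

```python
import torch

def guided_attention_loss(attn: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """Penalize attention mass far from the diagonal alignment."""
    # attn: (batch, target_len, source_len) attention matrices
    _, T, N = attn.shape
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1) / T
    n = torch.arange(N, dtype=torch.float32).unsqueeze(0) / N
    # Weight is ~0 on the diagonal and grows away from it.
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g * g))
    return (attn * w.unsqueeze(0)).mean()

loss = guided_attention_loss(torch.softmax(torch.randn(2, 50, 40), dim=-1))
```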
A Multi-Discriminator CycleGAN for Unsupervised Non-Parallel Speech Domain Adaptation
Domain adaptation plays an important role for speech recognition models, particularly for low-resource domains. We propose a novel generative model based on the cycle-consistent generative adversarial network (CycleGAN) for
unsupervised non-parallel speech domain adaptation. The proposed model employs
multiple independent discriminators on the power spectrogram, each in charge of
different frequency bands. As a result, we have 1) better discriminators that focus on fine-grained details of the frequency features and 2) a generator that is capable of generating more realistic domain-adapted spectrograms. We
demonstrate the effectiveness of our method on speech recognition with gender
adaptation, where the model only has access to supervised data from one gender
during training, but is evaluated on the other at test time. Our model achieves relative improvements in phoneme error rate and word error rate over the baseline on the TIMIT and WSJ datasets, respectively. Qualitatively, our model also generates more natural-sounding speech when conditioned on data from the other domain.
Comment: Accepted to Interspeech 2018
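The multi-discriminator design can be pictured as follows: the power spectrogram is split along the frequency axis, and each band is judged by its own small CNN. The three-way split and the tiny discriminators are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BandDiscriminators(nn.Module):
    """Minimal sketch: one independent discriminator per frequency band."""

    def __init__(self, num_bands: int = 3):
        super().__init__()
        self.num_bands = num_bands
        self.discs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1),
                nn.LeakyReLU(0.2),
                nn.Conv2d(32, 1, 3, stride=1, padding=1),
            )
            for _ in range(num_bands)
        )

    def forward(self, spec: torch.Tensor):
        # spec: (batch, 1, freq_bins, time); split along frequency.
        bands = spec.chunk(self.num_bands, dim=2)
        # Each discriminator sees only its band's fine-grained details.
        return [d(b) for d, b in zip(self.discs, bands)]

scores = BandDiscriminators()(torch.randn(2, 1, 129, 100))
```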
High quality voice conversion using prosodic and high-resolution spectral features
Voice conversion methods have advanced rapidly over the last decade. Studies
have shown that speaker characteristics are captured by spectral features as well as various prosodic features. Most existing conversion methods focus on spectral features, as they directly represent the timbre characteristics, while some conversion methods have focused only on the prosodic feature
represented by the fundamental frequency. In this paper, a comprehensive
framework using deep neural networks to convert both timbre and prosodic
features is proposed. The timbre feature is represented by a high-resolution
spectral feature. The prosodic features include F0, intensity, and duration. DNNs are well known to be useful for modeling high-dimensional features. In this work, we show that a DNN initialized by our proposed autoencoder pretraining yields good-quality DNN conversion models. This pretraining is tailor-made for voice conversion and leverages an autoencoder to capture the generic spectral shape of the source speech. Additionally, our framework uses
segmental DNN models to capture the evolution of the prosodic features over
time. To reconstruct the converted speech, the spectral feature produced by the
DNN model is combined with the three prosodic features produced by the DNN
segmental models. Our experimental results show that the application of both prosodic and high-resolution spectral features leads to high-quality converted speech, as measured by objective evaluations and subjective listening tests.
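The autoencoder pretraining step can be pictured as below: an autoencoder first learns to reconstruct source spectra, and its encoder then initializes the conversion DNN. The dimensions, optimizer settings, and single-layer networks are assumptions for illustration only.

```python
import torch
import torch.nn as nn

feat_dim, hidden = 513, 256  # e.g. high-resolution spectral frames

encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
decoder = nn.Linear(hidden, feat_dim)
autoenc = nn.Sequential(encoder, decoder)

# Step 1: pretrain by reconstructing source spectra, so the encoder
# captures the generic spectral shape of the source speech.
opt = torch.optim.Adam(autoenc.parameters(), lr=1e-3)
src = torch.randn(64, feat_dim)  # placeholder batch of source frames
opt.zero_grad()
nn.functional.mse_loss(autoenc(src), src).backward()
opt.step()

# Step 2: reuse the pretrained encoder inside the source-to-target
# conversion DNN, then fine-tune on conversion data.
converter = nn.Sequential(encoder, nn.Linear(hidden, feat_dim))
```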
VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture
Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another speaker's while preserving the linguistic content. It remains challenging, especially in a one-shot setting. Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity, so these methods can further generalize to unseen speakers. The disentanglement capability is achieved by vector
quantization (VQ), adversarial training, or instance normalization (IN).
However, the imperfect disentanglement may harm the quality of output speech.
In this work, to further improve audio quality, we use the U-Net architecture
within an auto-encoder-based VC system. We find that to leverage the U-Net
architecture, a strong information bottleneck is necessary. The VQ-based
method, which quantizes the latent vectors, can serve the purpose. The
objective and subjective evaluations show that the proposed method performs well in both audio naturalness and speaker similarity.
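The VQ information bottleneck can be sketched as below: each latent frame is snapped to its nearest codebook entry, which discards fine speaker detail so that mostly content survives the bottleneck. The codebook size, latent dimension, and straight-through placement are assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal sketch of a VQ bottleneck with a straight-through gradient."""

    def __init__(self, num_codes: int = 64, dim: int = 128):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, time, dim) -> nearest codebook vector per frame
        book = self.codebook.unsqueeze(0).expand(z.size(0), -1, -1)
        idx = torch.cdist(z, book).argmin(dim=-1)  # (batch, time)
        q = self.codebook[idx]                     # quantized latents
        # Straight-through estimator: gradients bypass the argmin.
        return z + (q - z).detach()

vq = VectorQuantizer()
content = vq(torch.randn(2, 100, 128))  # speaker detail squeezed out
```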
Many-to-Many Voice Conversion using Conditional Cycle-Consistent Adversarial Networks
Voice conversion (VC) refers to transforming the speaker characteristics of
an utterance without altering its linguistic contents. Many works on voice conversion require parallel training data, which is highly expensive to acquire. Recently, the cycle-consistent adversarial network (CycleGAN), which
does not require parallel training data, has been applied to voice conversion,
showing state-of-the-art performance. The CycleGAN-based voice conversion,
however, can be used only for a pair of speakers, i.e., one-to-one voice
conversion between two speakers. In this paper, we extend the CycleGAN by
conditioning the network on speakers. As a result, the proposed method can
perform many-to-many voice conversion among multiple speakers using a single
generative adversarial network (GAN). Compared to building multiple CycleGANs
for each pair of speakers, the proposed method reduces the computational and
spatial cost significantly without compromising the sound quality of the
converted voice. Experimental results using the VCC2018 corpus confirm the efficiency of the proposed method.
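Conditioning a single CycleGAN on the speaker pair can be sketched as below: source and target speaker embeddings are concatenated with the input features, so one generator replaces the quadratic number of pairwise models. The embedding size and single-layer network are assumptions.

```python
import torch
import torch.nn as nn

class PairConditionedGenerator(nn.Module):
    """Minimal sketch: one generator conditioned on (source, target) IDs."""

    def __init__(self, feat_dim: int = 36, num_speakers: int = 4, emb: int = 8):
        super().__init__()
        self.spk = nn.Embedding(num_speakers, emb)
        self.net = nn.Conv1d(feat_dim + 2 * emb, feat_dim, 5, padding=2)

    def forward(self, x, src, tgt):
        # x: (batch, feat_dim, time); src/tgt: (batch,) speaker IDs
        cond = torch.cat([self.spk(src), self.spk(tgt)], dim=-1)
        cond = cond.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return self.net(torch.cat([x, cond], dim=1))

g = PairConditionedGenerator()
y = g(torch.randn(2, 36, 100), torch.tensor([0, 1]), torch.tensor([2, 3]))
```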