Reducing one-to-many problem in Voice Conversion by equalizing the formant locations using dynamic frequency warping
In this study, we investigate a solution that reduces the effect of the
one-to-many problem in voice conversion (VC). The one-to-many problem arises
when two very similar speech segments from the source speaker correspond to
target-speaker segments that are not similar to each other. As a result, the
mapping function usually over-smooths the generated features so that they
resemble both target segments. We propose to equalize the formant locations of
source-target frame pairs using dynamic frequency warping in order to reduce
the complexity of the mapping. After conversion, a second dynamic frequency
warping is applied to reverse the formant-location equalization performed
during training. Subjective experiments showed that the proposed approach
significantly improves speech quality.
Comment: 5 pages, 5 figures
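A minimal sketch of the dynamic frequency warping idea, assuming two
log-spectral envelopes given as 1-D numpy arrays; the plain dynamic-programming
formulation and function names below are illustrative, not the authors'
implementation:

    import numpy as np

    def dfw_path(src_env, tgt_env):
        """Dynamic frequency warping: DTW over frequency bins of two
        log-spectral envelopes, aligning formant locations."""
        n, m = len(src_env), len(tgt_env)
        cost = np.abs(src_env[:, None] - tgt_env[None, :])  # local distance
        acc = np.full((n, m), np.inf)
        acc[0, 0] = cost[0, 0]
        for i in range(n):
            for j in range(m):
                if i == 0 and j == 0:
                    continue
                prev = min(acc[i - 1, j] if i > 0 else np.inf,
                           acc[i, j - 1] if j > 0 else np.inf,
                           acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
                acc[i, j] = cost[i, j] + prev
        # Backtrack to recover the frequency-warping path.
        i, j = n - 1, m - 1
        path = [(i, j)]
        while (i, j) != (0, 0):
            cands = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            cands = [(a, b) for a, b in cands if a >= 0 and b >= 0]
            i, j = min(cands, key=lambda ab: acc[ab])
            path.append((i, j))
        return path[::-1]

    def warp_envelope(src_env, path, m):
        """Map the source envelope onto the target frequency axis,
        averaging source bins that collapse onto the same target bin."""
        out, counts = np.zeros(m), np.zeros(m)
        for i, j in path:
            out[j] += src_env[i]
            counts[j] += 1
        return out / np.maximum(counts, 1)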
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Recently, voice conversion (VC) without parallel data has been successfully
adapted to the multi-target scenario, in which a single model is trained to
convert an input voice to many different speakers. However, such a model can
only convert voices to the speakers seen in the training data, which narrows
the applicable scenarios of VC. In this paper, we propose a novel one-shot VC
approach that performs conversion given only one example utterance each from
the source and target speakers, neither of whom needs to be seen during
training. This is achieved by disentangling speaker and content representations
with instance normalization (IN). Objective and subjective evaluations show
that our model generates voices similar to the target speaker. In addition to
this performance measurement, we also demonstrate that the model learns
meaningful speaker representations without any supervision.
Comment: Interspeech 201
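A minimal numpy sketch of the disentanglement mechanism: instance
normalization strips per-utterance (speaker) statistics from content features,
and an adaptive re-scaling step injects target-speaker statistics back in.
Shapes and function names are assumptions for illustration:

    import numpy as np

    def instance_norm(x, eps=1e-5):
        """Normalize each channel of a (channels, time) feature map to zero
        mean and unit variance, removing per-utterance speaker statistics."""
        mu = x.mean(axis=1, keepdims=True)
        sigma = x.std(axis=1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def adaptive_instance_norm(content, spk_mean, spk_std, eps=1e-5):
        """Re-inject target-speaker statistics (shape (channels, 1)) into
        normalized content features, AdaIN-style."""
        return spk_std * instance_norm(content, eps) + spk_mean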
ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
This paper proposes a voice conversion (VC) method using sequence-to-sequence
(seq2seq or S2S) learning, which flexibly converts not only the voice
characteristics but also the pitch contour and duration of input speech. The
proposed method, called ConvS2S-VC, has three key features. First, it uses a
model with a fully convolutional architecture. This is particularly
advantageous in that it is suitable for parallel computations using GPUs. It is
also beneficial since it enables effective normalization techniques such as
batch normalization to be used for all the hidden layers in the networks.
Second, it achieves many-to-many conversion by simultaneously learning mappings
among multiple speakers using only a single model instead of separately
learning mappings between each speaker pair using a different model. This
enables the model to fully utilize available training data collected from
multiple speakers by capturing common latent features that can be shared across
different speakers. Owing to this structure, our model works reasonably well
even without source speaker information, thus making it able to handle
any-to-many conversion tasks. Third, we introduce a mechanism called
conditional batch normalization, which switches batch normalization layers
according to the target speaker. This mechanism has been found to
be extremely effective for our many-to-many conversion model. We conducted
speaker identity conversion experiments and found that ConvS2S-VC obtained
higher sound quality and speaker similarity than baseline methods. We also
found from audio examples that it could perform well in various tasks including
emotional expression conversion, electrolaryngeal speech enhancement, and
English accent conversion.
Comment: Published in IEEE/ACM Trans. ASLP, https://ieeexplore.ieee.org/document/911344
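A minimal PyTorch sketch of conditional batch normalization as described,
switching per-speaker scale and shift parameters; the module name and
initialization are assumptions, not the paper's code:

    import torch
    import torch.nn as nn

    class ConditionalBatchNorm1d(nn.Module):
        """Batch normalization whose affine scale/shift are selected by a
        speaker ID, so one network can be modulated per target speaker."""
        def __init__(self, num_channels, num_speakers):
            super().__init__()
            self.bn = nn.BatchNorm1d(num_channels, affine=False)
            self.gamma = nn.Embedding(num_speakers, num_channels)
            self.beta = nn.Embedding(num_speakers, num_channels)
            nn.init.ones_(self.gamma.weight)
            nn.init.zeros_(self.beta.weight)

        def forward(self, x, speaker_id):
            # x: (batch, channels, time); speaker_id: (batch,) long tensor
            h = self.bn(x)
            g = self.gamma(speaker_id).unsqueeze(-1)  # (batch, channels, 1)
            b = self.beta(speaker_id).unsqueeze(-1)
            return g * h + b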
Error Reduction Network for DBLSTM-based Voice Conversion
Many deep learning approaches for voice conversion produce good-quality speech
only when a large amount of training data is available. This paper presents a
Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice conversion
framework that can work with a limited amount of training data. We first train
a DBLSTM-based average model on data from many speakers, then adapt it with a
limited amount of target data, and finally add an error reduction network that
improves the conversion quality even further. The proposed framework is
motivated by three observations. Firstly, a DBLSTM can achieve remarkable
voice conversion by modeling the long-term dependencies of a speech utterance.
Secondly, a DBLSTM-based average model can easily be adapted with a small
amount of data to produce speech that sounds closer to the target. Thirdly, an
error reduction network can be trained with a small amount of data and
effectively improves the conversion quality. Experiments show that the
proposed framework works flexibly with limited training data and outperforms
traditional frameworks in both objective and subjective evaluations.
Comment: Accepted by APSIPA 201
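A minimal PyTorch sketch of the framework's ingredients, assuming
40-dimensional acoustic features; the class and the training notes in comments
are illustrative, not the authors' code:

    import torch
    import torch.nn as nn

    class DBLSTM(nn.Module):
        """Deep bidirectional LSTM mapping source acoustic features to
        target acoustic features, frame by frame."""
        def __init__(self, feat_dim, hidden=256, layers=3):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, feat_dim)

        def forward(self, x):          # x: (batch, time, feat_dim)
            h, _ = self.lstm(x)
            return self.out(h)

    # Average model trained on the multi-speaker corpus, then fine-tuned
    # on the limited target data (same loop, typically a lower LR).
    average_model = DBLSTM(feat_dim=40)
    # A second network learns to predict the residual of the adapted
    # model's output: residual = target - converted.
    error_net = DBLSTM(feat_dim=40)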
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
We propose a parallel-data-free voice-conversion (VC) method that can learn a
mapping from source to target speech without relying on parallel data. The
proposed method is general purpose, high quality, and parallel-data free and
works without any extra data, modules, or alignment procedure. It also avoids
over-smoothing, which occurs in many conventional statistical model-based VC
methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial
network (CycleGAN) with gated convolutional neural networks (CNNs) and an
identity-mapping loss. A CycleGAN learns forward and inverse mappings
simultaneously using adversarial and cycle-consistency losses. This makes it
possible to find an optimal pseudo pair from unpaired data. Furthermore, the
adversarial loss contributes to reducing over-smoothing of the converted
feature sequence. We configure a CycleGAN with gated CNNs and train it with an
identity-mapping loss. This allows the mapping function to capture sequential
and hierarchical structures while preserving linguistic information. We
evaluated our method on a parallel-data-free VC task. An objective evaluation
showed that the converted feature sequence was near natural in terms of global
variance and modulation spectra. A subjective evaluation showed that the
quality of the converted speech was comparable to that obtained with a Gaussian
mixture model-based method trained under advantageous conditions, i.e., with
parallel data and twice the amount of data.
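A minimal PyTorch sketch of one direction of such a CycleGAN objective
(LSGAN-style adversarial loss plus cycle-consistency and identity-mapping
losses); the loss weights and function names are assumptions:

    import torch
    import torch.nn.functional as F

    def cyclegan_vc_loss(G_xy, G_yx, D_y, x, y, lam_cyc=10.0, lam_id=5.0):
        """One direction of a CycleGAN-VC-style objective for generators
        G_xy, G_yx and target-domain discriminator D_y."""
        fake_y = G_xy(x)
        score = D_y(fake_y)
        adv = F.mse_loss(score, torch.ones_like(score))  # fool D_y
        cyc = F.l1_loss(G_yx(fake_y), x)   # x -> y -> x should recover x
        idt = F.l1_loss(G_xy(y), y)        # G_xy should leave real y intact
        return adv + lam_cyc * cyc + lam_id * idt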
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion
Non-parallel voice conversion (VC) is a technique for learning the mapping
from source to target speech without relying on parallel data. This is an
important task, but it has been challenging due to its disadvantageous
training conditions. Recently, CycleGAN-VC provided a breakthrough and
performed comparably to a parallel VC method without relying on any extra data,
modules, or time alignment procedures. However, there is still a large gap
between the real target and converted speech, and bridging this gap remains a
challenge. To reduce this gap, we propose CycleGAN-VC2, which is an improved
version of CycleGAN-VC incorporating three new techniques: an improved
objective (two-step adversarial losses), an improved generator (2-1-2D CNN),
and an improved discriminator (PatchGAN). We evaluated our method on a non-parallel VC
task and analyzed the effect of each technique in detail. An objective
evaluation showed that these techniques help bring the converted feature
sequence closer to the target in terms of both global and local structures,
which we assess by using Mel-cepstral distortion and modulation spectra
distance, respectively. A subjective evaluation showed that CycleGAN-VC2
outperforms CycleGAN-VC in terms of naturalness and similarity for every
speaker pair, including intra-gender and inter-gender pairs.
Comment: Accepted to ICASSP 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.htm
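A minimal PyTorch sketch of a PatchGAN-style discriminator over spectrograms,
which scores local patches rather than emitting a single scalar per utterance;
the channel sizes are assumptions, not the paper's configuration:

    import torch.nn as nn

    class PatchDiscriminator(nn.Module):
        """Convolutional discriminator producing a map of patch-wise
        realness scores over a (batch, 1, freq, time) spectrogram."""
        def __init__(self, in_ch=1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 64, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2),
                nn.Conv2d(128, 1, 3, stride=1, padding=1),  # patch scores
            )

        def forward(self, spec):
            return self.net(spec)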
Voice conversion using coefficient mapping and neural network
This research presents a voice conversion model using coefficient mapping and
a neural network. Most previous works on parametric speech synthesis did not
account for losses in spectral detail, causing over-smoothing and, invariably,
an appreciable deviation of the converted speech from the target speaker. An
improved model that uses both linear predictive coding (LPC) and line spectral
frequency (LSF) coefficients to parametrize the source speech signal was
developed in this work to reveal the effect of over-smoothing. The non-linear
mapping ability of a neural network was employed to map the source speech
vectors into the acoustic vector space of the target. Training LPC
coefficients with a neural network yielded poor results due to the instability
of the LPC filter poles, so the LPC coefficients were converted to line
spectral frequency coefficients before being trained with a 3-layer neural
network. The algorithm was tested on noisy data, with the results evaluated
using the Mel-cepstral distance. The cepstral distance evaluation shows a 35.7
percent reduction in the spectral distance between the target and the
converted speech.
Comment: 5 pages
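A minimal numpy sketch of the LPC-to-LSF conversion step, using the standard
sum/difference polynomial construction; the function name is illustrative:

    import numpy as np

    def lpc_to_lsf(a):
        """Convert LPC coefficients a = [1, a1, ..., ap] to line spectral
        frequencies (radians in (0, pi)). LSFs are the unit-circle root
        angles of the sum/difference polynomials P(z) and Q(z)."""
        a = np.asarray(a, dtype=float)
        # P(z) = A(z) + z^-(p+1) A(1/z); Q(z) = A(z) - z^-(p+1) A(1/z)
        a_ext = np.concatenate([a, [0.0]])
        P = a_ext + a_ext[::-1]
        Q = a_ext - a_ext[::-1]
        lsf = []
        for poly in (P, Q):
            roots = np.roots(poly)
            roots = roots[np.imag(roots) >= 0]   # one root per conjugate pair
            ang = np.angle(roots)
            lsf.extend(ang[(ang > 0) & (ang < np.pi)])  # drop trivial roots
        return np.sort(np.array(lsf))

Unlike raw LPC coefficients, the resulting LSFs stay well behaved under
interpolation and mapping, which is why conversion in the LSF domain avoids
the filter-pole instability mentioned above.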
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
This paper proposes a method that allows non-parallel many-to-many voice
conversion (VC) by using a variant of a generative adversarial network (GAN)
called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it
(1) requires no parallel utterances, transcriptions, or time alignment
procedures for speech generator training, (2) simultaneously learns
many-to-many mappings across different attribute domains using a single
generator network, (3) is able to generate converted speech signals quickly
enough to allow real-time implementation, and (4) requires only several
minutes of training examples to generate reasonably realistic-sounding speech.
Subjective evaluation experiments on a non-parallel many-to-many speaker
identity conversion task revealed that the proposed method obtained higher
sound quality and speaker similarity than a state-of-the-art method based on
variational autoencoding GANs.
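A minimal PyTorch sketch of conditioning a single generator on the target
speaker, here by concatenating a broadcast one-hot speaker code to the input
features; this conditioning style is a common illustration and an assumption,
not necessarily StarGAN-VC's exact mechanism:

    import torch

    def condition_on_speaker(features, speaker_idx, num_speakers):
        """Concatenate a broadcast one-hot target-speaker code to
        (batch, channels, time) features so one generator can serve
        every conversion direction."""
        b, c, t = features.shape
        code = torch.zeros(b, num_speakers, t, device=features.device)
        code[torch.arange(b), speaker_idx, :] = 1.0
        return torch.cat([features, code], dim=1)  # (b, c + num_speakers, t)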
A Multi-Discriminator CycleGAN for Unsupervised Non-Parallel Speech Domain Adaptation
Domain adaptation plays an important role for speech recognition models,
particularly for low-resource domains. We propose a novel generative model
based on a cycle-consistent generative adversarial network (CycleGAN) for
unsupervised non-parallel speech domain adaptation. The proposed model employs
multiple independent discriminators on the power spectrogram, each in charge of
different frequency bands. As a result we have 1) better discriminators that
focus on fine-grained details of the frequency features, and 2) a generator
that is capable of generating more realistic domain-adapted spectrograms. We
demonstrate the effectiveness of our method on speech recognition with gender
adaptation, where the model only has access to supervised data from one gender
during training, but is evaluated on the other at test time. Our model
achieves average relative improvements in phoneme error rate and word error
rate over the baseline on the TIMIT and WSJ datasets, respectively.
Qualitatively, our model also generates more natural-sounding speech when
conditioned on data from the other domain.
Comment: Accepted to Interspeech 201
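A minimal PyTorch sketch of the multi-band discriminator idea, summing an
LSGAN-style loss over independent discriminators that each see one frequency
band of the power spectrogram; the band edges and the loss form are
assumptions:

    import torch

    def multi_band_disc_loss(discriminators, spec, band_edges):
        """Sum per-band real-losses of independent discriminators over a
        (batch, freq, time) power spectrogram."""
        total = 0.0
        for disc, (lo, hi) in zip(discriminators, band_edges):
            band = spec[:, lo:hi, :]              # one frequency band
            score = disc(band)
            total = total + torch.mean((score - 1.0) ** 2)
        return total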
Learning in your voice: Non-parallel voice conversion based on speaker consistency loss
In this paper, we propose a novel voice conversion strategy to resolve the
mismatch between the training and conversion scenarios when a parallel speech
corpus is unavailable for training. Based on auto-encoder and disentanglement
frameworks, we design the proposed model to extract identity and content
representations while reconstructing the input speech signal itself. Since we
use another speaker's identity information during training, the training
philosophy naturally matches the objective of the voice conversion process. In
addition, we design the disentanglement framework to reliably preserve
linguistic information and to enhance the quality of the converted speech
signals. The superiority of the proposed method is shown in subjective
listening tests as well as objective measures.
Comment: Submitted to ICASSP 2021
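A minimal PyTorch sketch of a training loss combining reconstruction with a
speaker consistency term, assuming hypothetical content-encoder enc_c,
speaker-encoder enc_s, and decoder dec callables; this illustrates the idea,
not the authors' exact objective:

    import torch
    import torch.nn.functional as F

    def vc_training_losses(enc_c, enc_s, dec, x_self, x_other):
        """Reconstruction plus speaker consistency: converting x_self with
        another speaker's identity should yield speech whose speaker
        embedding matches that other speaker."""
        content = enc_c(x_self)
        recon = dec(content, enc_s(x_self))       # reconstruct the input
        converted = dec(content, enc_s(x_other))  # convert to other speaker
        loss_recon = F.l1_loss(recon, x_self)
        # Speaker consistency: embedding of converted speech should match
        # the target speaker's embedding.
        loss_spk = F.l1_loss(enc_s(converted), enc_s(x_other))
        return loss_recon + loss_spk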