Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations
This paper presents a method of sequence-to-sequence (seq2seq) voice
conversion using non-parallel training data. In this method, disentangled
linguistic and speaker representations are extracted from acoustic features,
and voice conversion is achieved by preserving the linguistic representations
of source utterances while replacing the speaker representations with the
target ones. Our model is built under the framework of encoder-decoder neural
networks. A recognition encoder is designed to learn the disentangled
linguistic representations with two strategies. First, phoneme transcriptions
of training data are introduced to provide references for learning
linguistic representations of audio signals. Second, an adversarial training
strategy is employed to further wipe out speaker information from the
linguistic representations. Meanwhile, speaker representations are extracted
from audio signals by a speaker encoder. The model parameters are estimated by
two-stage training, including a pretraining stage using a multi-speaker dataset
and a fine-tuning stage using the dataset of a specific conversion pair. Since
both the recognition encoder and the decoder for recovering acoustic features
are seq2seq neural networks, there are no constraints of frame alignment or
frame-by-frame conversion in our proposed method. Experimental results showed
that our method obtained higher similarity and naturalness than the best
non-parallel voice conversion method in Voice Conversion Challenge 2018.
Besides, the performance of our proposed method was close to that of the
state-of-the-art parallel seq2seq voice conversion method.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing
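The adversarial strategy mentioned above is commonly realized with a gradient-reversal layer: a speaker classifier is trained on the linguistic representations, while reversed gradients push the recognition encoder to remove speaker cues. Below is a minimal PyTorch sketch of that idea; the classifier, pooling, and loss weight `lam` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scaled, sign-flipped gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def adversarial_speaker_loss(linguistic_rep, speaker_ids, classifier, lam=0.5):
    """The classifier learns to predict the speaker; the encoder upstream of
    linguistic_rep receives reversed gradients and learns to hide speaker info."""
    reversed_rep = GradReverse.apply(linguistic_rep, lam)
    logits = classifier(reversed_rep.mean(dim=1))        # pool over time frames
    return nn.functional.cross_entropy(logits, speaker_ids)

# Hypothetical usage: (batch, frames, dim) representations from the recognition encoder.
classifier = nn.Linear(256, 100)                          # 100 training speakers (assumed)
rep = torch.randn(8, 120, 256, requires_grad=True)
loss = adversarial_speaker_loss(rep, torch.randint(0, 100, (8,)), classifier)
loss.backward()
```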
Transferring Source Style in Non-Parallel Voice Conversion
Voice conversion (VC) techniques aim to modify speaker identity of an
utterance while preserving the underlying linguistic information. Most VC
approaches ignore modeling of the speaking style (e.g. emotion and emphasis),
which may contain the factors intentionally added by the speaker and should be
retained during conversion. This study proposes a sequence-to-sequence based
non-parallel VC approach, which has the capability of transferring the speaking
style from the source speech to the converted speech by modeling it explicitly.
Objective evaluation and subjective listening tests show the superiority of the
proposed VC approach in terms of speech naturalness and speaker similarity of
the converted speech. Experiments are also conducted to show the source-style
transferability of the proposed approach.
Comment: 5 pages, 8 figures, submitted to INTERSPEECH 202
crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder
In this paper, we present an open-source software for developing a
nonparallel voice conversion (VC) system named crank. Although we released an
open-source VC software based on the Gaussian mixture model, named sprocket, for
the last VC Challenge, it cannot be applied straightforwardly to an arbitrary
speech corpus because parallel utterances of the source and target speakers are
required to model a statistical conversion function. To address this issue, in
this study, we developed a new open-source VC software that enables users to
model the conversion function by using only a nonparallel speech corpus. For
implementing the VC software, we used a vector-quantized variational
autoencoder (VQVAE). To rapidly examine the effectiveness of recent
technologies developed in this research field, crank also supports several
representative techniques from autoencoder-based VC methods, such as the use of
hierarchical architectures, cyclic architectures, generative adversarial
networks, speaker adversarial training, and neural vocoders. Moreover, it is
possible to automatically estimate objective measures such as mel-cepstrum
distortion and pseudo mean opinion score based on MOSNet. In this paper, we
describe representative functions developed in crank and make brief comparisons
by objective evaluations.
Comment: Accepted to ICASSP 202
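As a rough illustration of the core VQ-VAE operation that crank builds on, the sketch below quantizes encoder outputs to the nearest codebook entries with a straight-through gradient; the codebook size, dimensions, and loss weighting are arbitrary assumptions and this is not crank's actual implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z_e):                                   # z_e: (batch, frames, dim)
        flat = z_e.reshape(-1, z_e.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)        # pairwise L2 distances
        idx = dist.argmin(dim=1)
        z_q = self.codebook(idx).view_as(z_e)                 # quantized latents
        # codebook and commitment losses, as in the original VQ-VAE formulation
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() \
                  + self.beta * ((z_q.detach() - z_e) ** 2).mean()
        z_q = z_e + (z_q - z_e).detach()                      # straight-through gradient
        return z_q, vq_loss, idx.view(z_e.shape[:-1])

# Hypothetical usage on encoder outputs:
vq = VectorQuantizer()
z_q, vq_loss, codes = vq(torch.randn(4, 100, 64))
```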
Intra-class variation reduction of speaker representation in disentanglement framework
In this paper, we propose an effective training strategy to extract robust
speaker representations from a speech signal. One of the key challenges in
speaker recognition tasks is to learn latent representations or embeddings
containing solely speaker-characteristic information in order to be robust in
terms of intra-speaker variations. By modifying the network architecture to
generate both speaker-related and speaker-unrelated representations, we exploit
a learning criterion which minimizes the mutual information between these
disentangled embeddings. We also introduce an identity change loss criterion
which utilizes a reconstruction error across different utterances spoken by the
same speaker. Since the proposed criteria reduce the variation of speaker
characteristics caused by changes in background environment or spoken content,
the resulting embeddings of each speaker become more consistent. The
effectiveness of the proposed method is demonstrated through two tasks:
disentanglement performance, and improvement of speaker recognition accuracy
compared to the baseline model on a benchmark dataset, VoxCeleb1. Ablation
studies also show the impact of each criterion on overall performance.
Comment: Accepted for INTERSPEECH 202
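One plausible reading of the identity change criterion is that an utterance should remain reconstructable when its speaker embedding is swapped with the embedding from a different utterance of the same speaker. The sketch below illustrates only that reading; the encoder/decoder interfaces and the L1 reconstruction are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def identity_change_loss(feat_a, feat_b, encoder, decoder):
    """feat_a, feat_b: two utterances by the SAME speaker, shape (batch, frames, dim).
    Reconstruct utterance A from its speaker-unrelated part plus the speaker
    embedding of utterance B; a consistent speaker embedding makes the swap harmless."""
    spk_a, res_a = encoder(feat_a)     # (speaker-related, speaker-unrelated) splits
    spk_b, _ = encoder(feat_b)
    recon_a = decoder(res_a, spk_b)    # decode A with B's speaker embedding
    return F.l1_loss(recon_a, feat_a)

# Hypothetical usage with toy placeholder modules:
enc = lambda x: (x.mean(dim=1), x)                 # placeholder (speaker emb, residual)
dec = lambda res, spk: res + spk.unsqueeze(1)      # placeholder decoder
loss = identity_change_loss(torch.randn(2, 50, 80), torch.randn(2, 50, 80), enc, dec)
```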
Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data
We propose Cotatron, a transcription-guided speech encoder for
speaker-independent linguistic representation. Cotatron is based on the
multispeaker TTS architecture and can be trained with conventional TTS
datasets. We train a voice conversion system to reconstruct speech with
Cotatron features, which is similar to the previous methods based on Phonetic
Posteriorgram (PPG). By training and evaluating our system with 108 speakers
from the VCTK dataset, we outperform the previous method in terms of both
naturalness and speaker similarity. Our system can also convert speech from
speakers that are unseen during training, and utilize ASR to automate the
transcription with minimal performance degradation. Audio samples are
available at https://mindslab-ai.github.io/cotatron, and the code with a
pre-trained model will be made available soon.
Comment: To appear in INTERSPEECH 202
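One way to read the "transcription-guided" encoding is as an attention-weighted sum of text-encoder states from a pretrained multispeaker TTS: the TTS attention aligns each mel frame to the transcript, yielding frame-level, largely speaker-independent linguistic features. The sketch below shows that reading only; the tensor names and shapes are assumptions.

```python
import torch

def transcription_guided_features(alignment, text_encoder_states):
    """alignment: (frames, tokens) attention weights from a pretrained TTS model,
    one row per mel frame summing to 1 over transcript tokens.
    text_encoder_states: (tokens, dim) outputs of the TTS text encoder.
    Returns frame-level linguistic features of shape (frames, dim)."""
    return alignment @ text_encoder_states

# Hypothetical shapes: 200 mel frames aligned against 40 transcript tokens.
A = torch.softmax(torch.randn(200, 40), dim=-1)
H = torch.randn(40, 256)
linguistic = transcription_guided_features(A, H)   # (200, 256)
```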
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
This paper presents the sequence-to-sequence (seq2seq) baseline system for
the voice conversion challenge (VCC) 2020. We consider a naive approach for
voice conversion (VC), which is to first transcribe the input speech with an
automatic speech recognition (ASR) model, followed by using the transcriptions to
generate the voice of the target with a text-to-speech (TTS) model. We revisit
this method under a sequence-to-sequence (seq2seq) framework by utilizing
ESPnet, an open-source end-to-end speech processing toolkit, and the many
well-configured pretrained models provided by the community. Official
evaluation results show that our system comes out top among the participating
systems in terms of conversion similarity, demonstrating the promising ability
of seq2seq models to convert speaker identity. The implementation is made
open-source at: https://github.com/espnet/espnet/tree/master/egs/vcc20.
Comment: Accepted to the ISCA Joint Workshop for the Blizzard Challenge and
Voice Conversion Challenge 202
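The cascade itself is conceptually simple, as the short sketch below illustrates: transcribe the source utterance, then resynthesize the transcript in the target speaker's voice. The `transcribe` and `synthesize` helpers are hypothetical placeholders standing in for pretrained seq2seq ASR and TTS models (for example, ones built with ESPnet); they are not actual ESPnet API calls.

```python
from typing import Callable
import numpy as np

def cascade_vc(source_wav: np.ndarray,
               transcribe: Callable[[np.ndarray], str],
               synthesize: Callable[[str], np.ndarray]) -> np.ndarray:
    """Naive ASR+TTS voice conversion: the transcription discards speaker identity,
    and the target-speaker TTS model re-imposes the target identity."""
    text = transcribe(source_wav)      # seq2seq ASR: speech -> transcript
    return synthesize(text)            # seq2seq TTS: transcript -> target voice
```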
Accent and Speaker Disentanglement in Many-to-many Voice Conversion
This paper proposes an interesting voice and accent joint conversion
approach, which can convert an arbitrary source speaker's voice to a target
speaker with non-native accent. This problem is challenging as each target
speaker only has training data in native accent and we need to disentangle
accent and speaker information in the conversion model training and re-combine
them in the conversion stage. In our recognition-synthesis conversion
framework, we manage to solve this problem with two proposed tricks. First, we
use accent-dependent speech recognizers to obtain bottleneck (BN) features for
different accented speakers. This aims to wipe out other factors beyond the
linguistic information in the BN features for conversion model training.
Second, we propose to use adversarial training to better disentangle the
speaker and accent information in our encoder-decoder based conversion model.
Specifically, we plug an auxiliary speaker classifier into the encoder, trained
with an adversarial loss to wipe out speaker information from the encoder
output. Experiments show that our approach is superior to the baseline. The
proposed tricks are quite effective in improving accentedness, while audio
quality and speaker similarity are well maintained.
Comment: Accepted to ISCSLP202
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training
Emotional voice conversion (EVC) aims to change the emotional state of an
utterance while preserving the linguistic content and speaker identity. In this
paper, we propose a novel 2-stage training strategy for sequence-to-sequence
emotional voice conversion with a limited amount of emotional speech data. We
note that the proposed EVC framework leverages text-to-speech (TTS), as the two
share a common goal: generating high-quality, expressive voice. In stage
1, we perform style initialization with a multi-speaker TTS corpus, to
disentangle speaking style and linguistic content. In stage 2, we perform
emotion training with a limited amount of emotional speech data, to learn how
to disentangle emotional style and linguistic information from the speech. The
proposed framework can perform both spectrum and prosody conversion and
achieves significant improvement over the state-of-the-art baselines in both
objective and subjective evaluation.
Comment: Accepted by Interspeech 202
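At a high level, the two-stage recipe amounts to pretraining on a large multi-speaker TTS corpus and then fine-tuning the same model on the small emotional set, typically at a reduced learning rate. A minimal, generic PyTorch sketch of such a schedule is shown below; the model, data loaders, loss function, and hyperparameters are placeholders, not the authors' configuration.

```python
import torch

def two_stage_training(model, tts_loader, emotion_loader, loss_fn, device="cpu"):
    """Stage 1: style initialization on a multi-speaker TTS corpus.
    Stage 2: emotion training on a limited emotional speech set at a lower LR."""
    model.to(device)
    for loader, lr, epochs in [(tts_loader, 1e-3, 10), (emotion_loader, 1e-4, 20)]:
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for batch in loader:
                opt.zero_grad()
                loss = loss_fn(model, batch)
                loss.backward()
                opt.step()
    return model
```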
Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion
Current voice conversion (VC) methods can successfully convert the timbre of
audio. However, as effectively modeling the prosody of the source audio is a
challenging task, there are still limitations in transferring the source style to the converted
speech. This study proposes a source style transfer method based on
recognition-synthesis framework. Previously, in speech generation tasks, prosody
has been modeled explicitly with prosodic features or implicitly with a latent
prosody extractor. In this paper, taking advantage of both, we model the
prosody in a hybrid manner, which effectively combines explicit and implicit
methods in a proposed prosody module. Specifically, prosodic features are used
to model prosody explicitly, while a VAE and a reference encoder, which take the
Mel spectrogram and bottleneck features as input respectively, are used to model
prosody implicitly. Furthermore, adversarial training is introduced to remove
speaker-related information from the VAE outputs, avoiding leaking source
speaker information while transferring style. Finally, we use a modified
self-attention based encoder to extract sentential context from bottleneck
features, which also implicitly aggregates the prosodic aspects of source
speech from the layered representations. Experiments show that our approach is
superior to the baseline and a competitive system in terms of style transfer;
meanwhile, the speech quality and speaker similarity are well maintained.
Comment: Accepted by Interspeech 202
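A rough sketch of the hybrid idea described above: explicit prosodic features (e.g., log-F0 and energy) are concatenated with an implicit latent drawn from a VAE over the Mel spectrogram and a reference-encoder embedding of the bottleneck features. The module sizes and interfaces below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HybridProsodyModule(nn.Module):
    """Combines explicit prosodic features with implicit latents from a VAE
    (over Mel frames) and a reference encoder (over bottleneck features)."""
    def __init__(self, mel_dim=80, bn_dim=256, latent_dim=16, ref_dim=128):
        super().__init__()
        self.vae_enc = nn.Linear(mel_dim, 2 * latent_dim)   # outputs mean and log-variance
        self.ref_enc = nn.GRU(bn_dim, ref_dim, batch_first=True)

    def forward(self, prosodic_feats, mel, bn_feats):
        # prosodic_feats: (B, T, 2) e.g. log-F0 and energy; mel: (B, T, 80); bn_feats: (B, T, 256)
        mu, logvar = self.vae_enc(mel).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        _, ref = self.ref_enc(bn_feats)                            # final state as utterance-level style
        ref = ref[-1].unsqueeze(1).expand(-1, mel.size(1), -1)
        return torch.cat([prosodic_feats, z, ref], dim=-1)        # hybrid prosody representation

module = HybridProsodyModule()
out = module(torch.randn(2, 100, 2), torch.randn(2, 100, 80), torch.randn(2, 100, 256))
```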
VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics
In this paper, we propose a non-parallel any-to-many voice conversion (VC)
method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel
waveform generation method, VoiceGrad is based upon the concepts of score
matching and Langevin dynamics. It uses weighted denoising score matching to
train a score approximator: a fully convolutional network with a U-Net structure
designed to predict the gradient of the log density of the speech feature
sequences of multiple speakers. VC is then performed by using annealed Langevin
dynamics to iteratively update an input feature sequence towards the nearest
stationary point of the target distribution under the trained score
approximator network. Thanks to the nature of this concept, VoiceGrad enables
any-to-many VC, a VC scenario in which the speaker of input speech can be
arbitrary, and allows for non-parallel training, which requires no parallel
utterances or transcriptions.
Comment: arXiv admin note: text overlap with arXiv:2008.1260
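The annealed Langevin update at the heart of this approach is compact: at each noise level, the feature sequence is nudged along the learned score plus Gaussian noise, with a step size that shrinks as the noise anneals. The sketch below follows the standard annealed Langevin dynamics schedule of Song and Ermon (2019); the `score_fn` signature and the conditioning on a target-speaker code are assumptions about VoiceGrad's interface, not its actual code.

```python
import torch

def annealed_langevin_vc(x, score_fn, target_spk, sigmas, steps_per_level=20, eps=2e-5):
    """x: source speech feature sequence (frames, dim), iteratively moved toward the
    target speaker's distribution. score_fn(x, sigma, spk) approximates
    grad_x log p_sigma(x | spk); sigmas is a decreasing sequence of noise levels."""
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2        # step size shrinks with the noise level
        for _ in range(steps_per_level):
            noise = torch.randn_like(x)
            x = x + 0.5 * alpha * score_fn(x, sigma, target_spk) + (alpha ** 0.5) * noise
    return x
```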