ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
This paper proposes a voice conversion (VC) method using sequence-to-sequence
(seq2seq or S2S) learning, which flexibly converts not only the voice
characteristics but also the pitch contour and duration of input speech. The
proposed method, called ConvS2S-VC, has three key features. First, it uses a
model with a fully convolutional architecture. This is particularly
advantageous in that it is suitable for parallel computations using GPUs. It is
also beneficial since it enables effective normalization techniques such as
batch normalization to be used for all the hidden layers in the networks.
Second, it achieves many-to-many conversion by simultaneously learning mappings
among multiple speakers using only a single model instead of separately
learning mappings between each speaker pair using a different model. This
enables the model to fully utilize available training data collected from
multiple speakers by capturing common latent features that can be shared across
different speakers. Owing to this structure, our model works reasonably well
even without source speaker information, thus making it able to handle
any-to-many conversion tasks. Third, we introduce a mechanism called
conditional batch normalization, which switches the batch normalization layers
in accordance with the target speaker. This particular mechanism has been found to
be extremely effective for our many-to-many conversion model. We conducted
speaker identity conversion experiments and found that ConvS2S-VC obtained
higher sound quality and speaker similarity than baseline methods. We also
found from audio examples that it could perform well in various tasks including
emotional expression conversion, electrolaryngeal speech enhancement, and
English accent conversion.
Comment: Published in IEEE/ACM Trans. ASLP, https://ieeexplore.ieee.org/document/911344
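A minimal sketch of the conditional batch normalization idea described above, assuming PyTorch; the embedding-lookup formulation and all sizes are illustrative, not the paper's exact implementation:

```python
# Sketch: batch normalization whose scale/shift are switched per target speaker.
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    """Shared normalization statistics with per-speaker gamma/beta."""
    def __init__(self, num_features: int, num_speakers: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.gamma = nn.Embedding(num_speakers, num_features)
        self.beta = nn.Embedding(num_speakers, num_features)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), speaker: (batch,) integer speaker IDs
        h = self.bn(x)
        g = self.gamma(speaker).unsqueeze(-1)   # (batch, channels, 1)
        b = self.beta(speaker).unsqueeze(-1)
        return g * h + b

cbn = ConditionalBatchNorm1d(num_features=256, num_speakers=4)
hidden = torch.randn(8, 256, 100)               # conv feature maps over time
target = torch.randint(0, 4, (8,))              # target speaker IDs
out = cbn(hidden, target)
```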
Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation
We describe Parrotron, an end-to-end-trained speech-to-speech conversion
model that maps an input spectrogram directly to another spectrogram, without
utilizing any intermediate discrete representation. The network is composed of
an encoder, spectrogram and phoneme decoders, followed by a vocoder to
synthesize a time-domain waveform. We demonstrate that this model can be
trained to normalize speech from any speaker regardless of accent, prosody, and
background noise, into the voice of a single canonical target speaker with a
fixed accent and consistent articulation and prosody. We further show that this
normalization model can be adapted to normalize highly atypical speech from a
deaf speaker, resulting in significant improvements in intelligibility and
naturalness, measured via a speech recognizer and listening tests. Finally,
demonstrating the utility of this model on other speech tasks, we show that the
same model architecture can be trained to perform a speech separation task.
Comment: 5 pages, submitted to Interspeech 201
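A toy sketch of the spectrogram-to-spectrogram structure described above, assuming PyTorch: an encoder, a spectrogram decoder, and an auxiliary phoneme decoder. This frame-synchronous simplification ignores the attention-based decoding of the actual model; dimensions and module choices are assumptions.

```python
# Sketch: encoder with spectrogram and phoneme decoders, no intermediate text.
import torch
import torch.nn as nn

class Seq2SeqSpectrogramConverter(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_phonemes=64):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Main decoder predicts target-voice spectrogram frames.
        self.spec_decoder = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_mels))
        # Auxiliary decoder predicts phoneme posteriors to keep content intact.
        self.phone_decoder = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, mel_in):
        enc, _ = self.encoder(mel_in)           # (batch, time, 2*hidden)
        mel_out = self.spec_decoder(enc)        # converted spectrogram
        phones = self.phone_decoder(enc)        # auxiliary phoneme logits
        return mel_out, phones

model = Seq2SeqSpectrogramConverter()
src = torch.randn(2, 120, 80)                   # input mel spectrogram
mel_out, phone_logits = model(src)
```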
Towards Robust Neural Vocoding for Speech Generation: A Survey
Recently, neural vocoders have been widely used in speech synthesis tasks,
including text-to-speech and voice conversion. However, when encountering data
distribution mismatch between training and inference, neural vocoders trained
on real data often degrade in voice quality for unseen scenarios. In this
paper, we train four common neural vocoders, namely WaveNet, WaveRNN, FFTNet,
and Parallel WaveGAN, alternately on five different datasets. To study the
robustness of neural vocoders, we evaluate the models using acoustic features
from seen/unseen speakers, seen/unseen languages, a text-to-speech model, and a
voice conversion model. We found that speaker variety is much more important
than language variety for achieving a universal vocoder. Through our
experiments, we show that WaveNet and WaveRNN are more suitable for
text-to-speech models, while Parallel WaveGAN is more suitable for voice
conversion applications. A large set of subjective MOS results on naturalness
for all vocoders is presented to support future studies.
Comment: Submitted to INTERSPEECH 202
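A hypothetical sketch of the evaluation grid the survey describes, crossing each vocoder with seen/unseen conditions; the helper names (load_features, synthesize, rate_naturalness) are placeholders, not the authors' code.

```python
# Sketch: score each vocoder on acoustic features from several conditions.
from itertools import product

vocoders = ["WaveNet", "WaveRNN", "FFTNet", "Parallel WaveGAN"]
conditions = ["seen speaker", "unseen speaker", "seen language",
              "unseen language", "TTS features", "VC features"]

def evaluate(vocoder: str, condition: str) -> float:
    """Placeholder: synthesize waveforms for this condition and return a score."""
    # features = load_features(condition)
    # audio = synthesize(vocoder, features)
    # return rate_naturalness(audio)
    return 0.0

results = {(v, c): evaluate(v, c) for v, c in product(vocoders, conditions)}
for (v, c), score in results.items():
    print(f"{v:18s} | {c:16s} | {score:.2f}")
```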
Direct speech-to-speech translation with a sequence-to-sequence model
We present an attention-based sequence-to-sequence neural network which can
directly translate speech from one language into speech in another language,
without relying on an intermediate text representation. The network is trained
end-to-end, learning to map speech spectrograms into target spectrograms in
another language, corresponding to the translated content (in a different
canonical voice). We further demonstrate the ability to synthesize translated
speech using the voice of the source speaker. We conduct experiments on two
Spanish-to-English speech translation datasets, and find that the proposed
model slightly underperforms a baseline cascade of a direct speech-to-text
translation model and a text-to-speech synthesis model, demonstrating the
feasibility of the approach on this very challenging task.
Comment: Accepted to Interspeech 201
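A minimal sketch of the additive attention used in attention-based sequence-to-sequence spectrogram models of this kind, assuming PyTorch; dimensions are illustrative and this is not the paper's exact architecture.

```python
# Sketch: score each encoder frame against the current decoder state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, T_src, enc_dim), dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)))
        weights = F.softmax(scores, dim=1)           # over source time steps
        context = (weights * enc_states).sum(dim=1)  # (batch, enc_dim)
        return context, weights

attn = AdditiveAttention(enc_dim=512, dec_dim=256)
enc = torch.randn(2, 75, 512)   # encoder outputs for source-language speech
dec = torch.randn(2, 256)       # current decoder state
context, weights = attn(enc, dec)
```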
NAUTILUS: a Versatile Voice Cloning System
We introduce a novel speech synthesis system, called NAUTILUS, that can
generate speech with a target voice either from a text input or a reference
utterance of an arbitrary source speaker. By using a multi-speaker speech
corpus to train all requisite encoders and decoders in the initial training
stage, our system can clone unseen voices using untranscribed speech of target
speakers on the basis of the backpropagation algorithm. Moreover, depending on
the data available for the target speaker, the cloning strategy can be
adjusted to take advantage of additional data and modify the behaviors of
text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the
situation. We test the performance of the proposed framework by using deep
convolution layers to model the encoders, decoders and WaveNet vocoder.
Evaluations show that it achieves comparable quality with state-of-the-art TTS
and VC systems when cloning with just five minutes of untranscribed speech.
Moreover, it is demonstrated that the proposed framework has the ability to
switch between TTS and VC with high speaker consistency, which will be useful
for many applications.
Comment: Submitted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing
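A hypothetical sketch of the untranscribed cloning step: a pretrained shared speech encoder is frozen and the decoder is fine-tuned by backpropagation on the target speaker's speech alone. Module names, losses, and the toy stand-in networks are illustrative assumptions, not the NAUTILUS implementation.

```python
# Sketch: adapt the decoder so it reconstructs the target speaker's spectrograms.
import torch
import torch.nn as nn

def clone_voice(speech_encoder: nn.Module, decoder: nn.Module,
                target_mels: torch.Tensor, steps: int = 200, lr: float = 1e-4):
    speech_encoder.eval()
    for p in speech_encoder.parameters():       # keep the shared encoder fixed
        p.requires_grad_(False)
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(steps):
        latent = speech_encoder(target_mels)    # speaker-independent linguistic code
        recon = decoder(latent)                 # decoder learns the target voice
        loss = loss_fn(recon, target_mels)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy stand-ins so the sketch runs end to end.
enc = nn.Sequential(nn.Linear(80, 128), nn.Tanh())
dec = nn.Sequential(nn.Linear(128, 80))
mels = torch.randn(16, 120, 80)                 # untranscribed target speech (illustrative)
clone_voice(enc, dec, mels)
```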
Towards Fine-Grained Prosody Control for Voice Conversion
In a typical voice conversion system, prior works utilize various acoustic
features (e.g., pitch, voiced/unvoiced flags, and aperiodicity) of the source
speech to control the prosody of the generated waveform. However, prosody is
related to many factors, such as intonation, stress, and rhythm, and it is
challenging to describe prosody fully through acoustic features.
To deal with this problem, we propose prosody embeddings to model prosody.
These embeddings are learned from the source speech in an unsupervised manner.
We conduct experiments on our Mandarin corpus recorded by professional speakers.
Experimental results demonstrate that the proposed method enables fine-grained
control of the prosody. In challenging situations (such as when the source
speech is singing), our proposed method can also achieve promising results.
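An illustrative sketch of a fine-grained prosody encoder in the spirit of the embeddings described above: the source spectrogram is compressed into a low-dimensional embedding sequence that conditions the conversion decoder. Assumes PyTorch; the bottleneck size, stride, and layer choices are assumptions.

```python
# Sketch: learn a compact prosody embedding sequence from the source speech.
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    def __init__(self, n_mels=80, prosody_dim=3, stride=4):
        super().__init__()
        # Strided convolution downsamples in time; the tiny channel count acts
        # as a bottleneck so the embedding carries prosody rather than content.
        self.conv = nn.Conv1d(n_mels, 32, kernel_size=5, stride=stride, padding=2)
        self.proj = nn.Conv1d(32, prosody_dim, kernel_size=1)

    def forward(self, mel):                     # mel: (batch, time, n_mels)
        x = mel.transpose(1, 2)                 # (batch, n_mels, time)
        x = torch.tanh(self.conv(x))
        e = self.proj(x)                        # (batch, prosody_dim, time/stride)
        return e.transpose(1, 2)                # embedding sequence over time

enc = ProsodyEncoder()
src_mel = torch.randn(2, 200, 80)
prosody = enc(src_mel)                          # (2, 50, 3) fine-grained prosody codes
print(prosody.shape)
```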
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
Recently, cycle-consistent adversarial network (Cycle-GAN) has been
successfully applied to voice conversion to a different speaker without
parallel data, although in those approaches an individual model is needed for
each target speaker. In this paper, we propose an adversarial learning
framework for voice conversion, with which a single model can be trained to
convert the voice to many different speakers, all without parallel data, by
separating the speaker characteristics from the linguistic content in speech
signals. An autoencoder is first trained to extract the speaker-independent
latent representation and the speaker embedding separately, using an auxiliary
speaker classifier to regularize the latent representation. The decoder then
takes the speaker-independent latent representation and the target speaker
embedding as the input to generate the voice of the target speaker with the
linguistic content of the source utterance. The quality of decoder output is
further improved by patching with the residual signal produced by another pair
of generator and discriminator. A target speaker set size of 20 was tested in
the preliminary experiments, and very good voice quality was obtained.
Conventional voice conversion metrics are reported. We also show that the
speaker information has been properly reduced from the latent representations.
Comment: Accepted to Interspeech 201
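A minimal sketch of the adversarial disentanglement idea described above: a speaker classifier is trained to identify the speaker from the latent code, while the encoder is trained to make that prediction fail, so the latent keeps only linguistic content. Assumes PyTorch; the loss formulation, module sizes, and weights are assumptions.

```python
# Sketch: adversarial speaker classifier regularizing an autoencoder latent.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_speakers, latent_dim, n_mels = 20, 64, 80
encoder = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim + n_speakers, 128), nn.ReLU(),
                        nn.Linear(128, n_mels))
speaker_clf = nn.Linear(latent_dim, n_speakers)

mel = torch.randn(8, n_mels)                    # one frame per utterance, for brevity
spk = torch.randint(0, n_speakers, (8,))

latent = encoder(mel)
# (1) Classifier step: learn to recognize the speaker from the latent code.
clf_loss = F.cross_entropy(speaker_clf(latent.detach()), spk)
# (2) Autoencoder step: reconstruct with the speaker embedding while the
#     encoder tries to fool the classifier, removing speaker information.
spk_onehot = F.one_hot(spk, n_speakers).float()
recon = decoder(torch.cat([latent, spk_onehot], dim=-1))
adv_loss = -F.cross_entropy(speaker_clf(latent), spk)
ae_loss = F.l1_loss(recon, mel) + 0.01 * adv_loss   # weight is illustrative
```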
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
We propose a parallel-data-free voice-conversion (VC) method that can learn a
mapping from source to target speech without relying on parallel data. The
proposed method is general purpose, high quality, and parallel-data free and
works without any extra data, modules, or alignment procedure. It also avoids
over-smoothing, which occurs in many conventional statistical model-based VC
methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial
network (CycleGAN) with gated convolutional neural networks (CNNs) and an
identity-mapping loss. A CycleGAN learns forward and inverse mappings
simultaneously using adversarial and cycle-consistency losses. This makes it
possible to find an optimal pseudo pair from unpaired data. Furthermore, the
adversarial loss contributes to reducing over-smoothing of the converted
feature sequence. We configure a CycleGAN with gated CNNs and train it with an
identity-mapping loss. This allows the mapping function to capture sequential
and hierarchical structures while preserving linguistic information. We
evaluated our method on a parallel-data-free VC task. An objective evaluation
showed that the converted feature sequence was near natural in terms of global
variance and modulation spectra. A subjective evaluation showed that the
quality of the converted speech was comparable to that obtained with a Gaussian
mixture model-based method trained under advantageous conditions, with
parallel data and twice the amount of data.
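A sketch of the three CycleGAN-VC training terms: adversarial, cycle-consistency, and identity-mapping losses. The generators and discriminator here are toy linear layers rather than the gated CNNs used in the paper, and the loss weights are assumptions.

```python
# Sketch: adversarial + cycle-consistency + identity losses on unpaired features.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 24                                        # e.g. mel-cepstral coefficients
G_xy = nn.Linear(dim, dim)                      # source -> target mapping
G_yx = nn.Linear(dim, dim)                      # target -> source mapping
D_y = nn.Linear(dim, 1)                         # discriminator on target domain

x = torch.randn(16, dim)                        # unpaired source features
y = torch.randn(16, dim)                        # unpaired target features

fake_y = G_xy(x)
# Adversarial loss: converted features should look like real target features.
adv = F.binary_cross_entropy_with_logits(D_y(fake_y), torch.ones(16, 1))
# Cycle-consistency loss: converting forward and back should recover the input.
cyc = F.l1_loss(G_yx(fake_y), x)
# Identity-mapping loss: target features passed through G_xy should be
# unchanged, which helps preserve linguistic information.
idt = F.l1_loss(G_xy(y), y)
g_loss = adv + 10.0 * cyc + 5.0 * idt           # weights are illustrative
```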
Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram
Cross-lingual voice conversion (VC) is an important and challenging problem
due to significant mismatches of the phonetic set and the speech prosody of
different languages. In this paper, we build upon the neural text-to-speech
(TTS) model, i.e., FastSpeech, and LPCNet neural vocoder to design a new
cross-lingual VC framework named FastSpeech-VC. We address the mismatches of
the phonetic set and the speech prosody by applying Phonetic PosteriorGrams
(PPGs), which have been shown to bridge across speaker and language
boundaries. Moreover, we add normalized logarithm-scale fundamental frequency
(Log-F0) to further compensate for the prosodic mismatches and significantly
improve naturalness. Our experiments on English and Mandarin demonstrate
that, with only monolingual corpora, the proposed FastSpeech-VC can achieve
high-quality converted speech with a mean opinion score (MOS) close to that of
professional recordings while maintaining good speaker similarity. Compared to
the baselines using Tacotron2 and Transformer TTS models, the FastSpeech-VC can
achieve controllable converted speech rate and much faster inference speed.
More importantly, the FastSpeech-VC can easily be adapted to a speaker with
limited training utterances.
Comment: 5 pages, 2 figures, 4 tables, accepted by ICASSP 202
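A sketch of the normalized log-F0 transform commonly used in such cross-lingual VC pipelines: the source speaker's log-F0 is standardized with source statistics and rescaled with the target speaker's statistics. NumPy, purely illustrative of the log-F0 compensation the abstract mentions.

```python
# Sketch: map voiced F0 values from the source to the target speaker's range.
import numpy as np

def convert_log_f0(f0_src: np.ndarray,
                   src_mean: float, src_std: float,
                   tgt_mean: float, tgt_std: float) -> np.ndarray:
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0                          # unvoiced frames stay at 0
    log_f0 = np.log(f0_src[voiced])
    normalized = (log_f0 - src_mean) / src_std   # speaker-normalized log-F0
    f0_out[voiced] = np.exp(normalized * tgt_std + tgt_mean)
    return f0_out

f0 = np.array([0.0, 180.0, 200.0, 0.0, 210.0])
converted = convert_log_f0(f0, src_mean=np.log(190.0), src_std=0.15,
                           tgt_mean=np.log(120.0), tgt_std=0.20)
```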
Building a mixed-lingual neural TTS system with only monolingual data
When deploying a Chinese neural text-to-speech (TTS) synthesis system, one of
the challenges is to synthesize Chinese utterances with English phrases or
words embedded. This paper looks into the problem in the encoder-decoder
framework when only monolingual data from a target speaker is available.
Specifically, we view the problem from two aspects: speaker consistency within
an utterance and naturalness. We start the investigation with an Average Voice
Model which is built from multi-speaker monolingual data, i.e. Mandarin and
English data. On the basis of that, we look into speaker embedding for speaker
consistency within an utterance and phoneme embedding for naturalness and
intelligibility and study the choice of data for model training. We report the
findings and discuss the challenges to build a mixed-lingual TTS system with
only monolingual data.
Comment: To appear in INTERSPEECH 201
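An illustrative sketch of how a mixed-lingual encoder input could combine a shared phoneme embedding table (covering Mandarin and English phones) with a speaker embedding for speaker consistency within an utterance. Assumes PyTorch; the merged phone set size and additive conditioning are assumptions, not the paper's exact design.

```python
# Sketch: shared phoneme embeddings plus a broadcast speaker embedding.
import torch
import torch.nn as nn

n_phones, n_speakers, emb_dim = 120, 10, 256    # merged phone set, illustrative
phone_emb = nn.Embedding(n_phones, emb_dim)
speaker_emb = nn.Embedding(n_speakers, emb_dim)

phones = torch.randint(0, n_phones, (2, 30))    # mixed Mandarin/English phone IDs
speaker = torch.tensor([3, 3])                  # target monolingual speaker

# The speaker embedding is broadcast over the phone sequence and added to each
# position, so the decoder keeps one consistent voice across code-switched text.
enc_in = phone_emb(phones) + speaker_emb(speaker).unsqueeze(1)
print(enc_in.shape)                             # (2, 30, 256)
```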