Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
Emotional voice conversion aims to convert the spectrum and prosody to change
the emotional patterns of speech, while preserving the speaker identity and
linguistic content. Many studies require parallel speech data between different
emotional patterns, which is not practical in real life. Moreover, they often
model the conversion of fundamental frequency (F0) with a simple linear
transform. As F0 is a key aspect of intonation that is hierarchical in nature,
we believe that it is more appropriate to model F0 at different temporal scales by
using wavelet transform. We propose a CycleGAN network to find an optimal
pseudo pair from non-parallel training data by learning forward and inverse
mappings simultaneously using adversarial and cycle-consistency losses. We also
study the use of continuous wavelet transform (CWT) to decompose F0 into ten
temporal scales that describe speech prosody at different time resolutions,
for effective F0 conversion. Experimental results show that our proposed
framework outperforms the baselines both in objective and subjective
evaluations. Comment: accepted by Speaker Odyssey 2020 in Tokyo, Japan
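A minimal sketch of the ten-scale F0 decomposition described above, using the `pywt` package; the Mexican-hat mother wavelet and dyadic scales here are assumptions for illustration, not necessarily the authors' exact configuration.

```python
import numpy as np
import pywt

def decompose_f0(f0, num_scales=10):
    """Decompose an F0 contour into multiple temporal scales via CWT.

    f0: 1-D array of log-F0 values (assumed already interpolated over
        unvoiced frames and z-normalized).
    Returns an array of shape (num_scales, len(f0)).
    """
    # Dyadic scales: each row captures prosody at a coarser time resolution.
    scales = 2 ** np.arange(1, num_scales + 1)
    coefs, _ = pywt.cwt(f0, scales, "mexh")
    return coefs

# Example: a synthetic F0 contour of 500 frames.
f0 = np.random.randn(500)
cwt_features = decompose_f0(f0)
print(cwt_features.shape)  # (10, 500)
```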
Is Neuromorphic MNIST neuromorphic? Analyzing the discriminative power of neuromorphic datasets in the time domain
The advantage of spiking neural networks (SNNs) over their predecessors is
their ability to spike, enabling them to use spike timing for coding and
efficient computing. A neuromorphic dataset should allow a neuromorphic
algorithm to clearly show that an SNN is able to perform better on the dataset
than an ANN. We have analyzed both N-MNIST and N-Caltech101 along these lines,
but focus our study on N-MNIST. First, we evaluate whether additional information is
encoded in the time domain of a neuromorphic dataset. We show that an ANN
trained with backpropagation on frame-based versions of N-MNIST and
N-Caltech101 achieves 99.23% and 78.01% accuracy, respectively. These are the best
classification accuracies obtained on these datasets to date. Second, we present
the first unsupervised SNN to be trained on N-MNIST and demonstrate results of
91.78%. We also use this SNN in further experiments on N-MNIST to show that
rate-based SNNs perform better and that precise spike timings are not
important on N-MNIST. N-MNIST does not, therefore, highlight the unique
ability of SNNs. The conclusion of this study raises an important question in
neuromorphic engineering: what, then, constitutes a good neuromorphic dataset?
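A minimal sketch of the frame-based conversion the study relies on: accumulating the events of a neuromorphic recording into a single intensity frame that a conventional ANN can classify. The event-tuple layout `(x, y, timestamp, polarity)` is an assumption for illustration; N-MNIST readers differ in how they expose events.

```python
import numpy as np

def events_to_frame(events, height=34, width=34):
    """Accumulate DVS events into one frame (N-MNIST sensors are 34x34).

    events: array of shape (n_events, 4) with columns
            (x, y, timestamp_us, polarity) -- an assumed layout.
    Returns a float32 frame normalized to [0, 1].
    """
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, _t, _p in events:
        # Ignore polarity and timing: count events per pixel,
        # which is exactly the rate-based view of the data.
        frame[int(y), int(x)] += 1.0
    if frame.max() > 0:
        frame /= frame.max()
    return frame
```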
A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data
In a typical voice conversion system, a vocoder is commonly used for
speech-to-features analysis and features-to-speech synthesis. However, the
vocoder can be a source of speech quality degradation. This paper presents a
vocoder-free voice conversion approach using WaveNet for non-parallel training
data. Instead of dealing with the intermediate features, the proposed approach
utilizes WaveNet to map the Phonetic PosteriorGrams (PPGs) to the waveform
samples directly. In this way, we avoid the estimation errors caused by the
vocoder and feature conversion. Additionally, as PPGs are assumed to be speaker
independent, the proposed method also reduces the feature mismatch problem in
WaveNet vocoder based approaches. Experimental results conducted on the
CMU-ARCTIC database show that the proposed approach significantly outperforms
the baseline approaches in terms of speech quality. Comment: 5 pages, 4 figures. This paper is submitted to INTERSPEECH 201
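A minimal sketch, in PyTorch, of the core idea: a stack of dilated causal convolutions whose gated activations are locally conditioned on upsampled PPG frames. Layer counts, channel sizes, and the mu-law output resolution are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPGWaveNet(nn.Module):
    """Toy WaveNet that maps PPG frames directly to waveform samples."""

    def __init__(self, ppg_dim=144, channels=64, n_layers=10, n_classes=256):
        super().__init__()
        self.input_conv = nn.Conv1d(1, channels, kernel_size=1)
        self.filters = nn.ModuleList()
        self.gates = nn.ModuleList()
        self.cond = nn.ModuleList()
        for i in range(n_layers):
            d = 2 ** i  # exponentially growing dilation
            self.filters.append(nn.Conv1d(channels, channels, 2, dilation=d))
            self.gates.append(nn.Conv1d(channels, channels, 2, dilation=d))
            self.cond.append(nn.Conv1d(ppg_dim, 2 * channels, 1))
        self.out = nn.Conv1d(channels, n_classes, 1)  # mu-law class logits

    def forward(self, wav, ppg):
        # wav: (B, 1, T) past samples; ppg: (B, ppg_dim, T) upsampled to T.
        x = self.input_conv(wav)
        for filt, gate, cond in zip(self.filters, self.gates, self.cond):
            d = filt.dilation[0]
            pad = F.pad(x, (d, 0))  # left-pad keeps the stack causal
            c_f, c_g = cond(ppg).chunk(2, dim=1)
            h = torch.tanh(filt(pad) + c_f) * torch.sigmoid(gate(pad) + c_g)
            x = x + h  # residual connection
        return self.out(x)  # (B, n_classes, T) logits per sample

logits = PPGWaveNet()(torch.randn(2, 1, 400), torch.randn(2, 144, 400))
```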
Error Reduction Network for DBLSTM-based Voice Conversion
So far, many of the deep learning approaches for voice conversion produce
good quality speech by using a large amount of training data. This paper
presents a Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice
conversion framework that can work with a limited amount of training data. We
propose to implement a DBLSTM based average model that is trained with data
from many speakers. Then, we propose to perform adaptation with a limited
amount of target data. Last but not least, we propose an error reduction
network that can improve the voice conversion quality even further. The
proposed framework is motivated by three observations. Firstly, DBLSTM can
achieve remarkable voice conversion by considering the long-term dependencies
of a speech utterance. Secondly, a DBLSTM-based average model can be easily
adapted with a small amount of data to produce speech that sounds closer to
the target. Thirdly, an error reduction network can be trained with a small
amount of training data, and can improve the conversion quality effectively.
The experiments show that the proposed voice conversion framework works
flexibly with limited training data and outperforms the traditional frameworks
in both objective and subjective evaluations. Comment: Accepted by APSIPA 201
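A minimal sketch of a DBLSTM regression model of the kind the framework builds on: stacked bidirectional LSTMs mapping source spectral features to target features frame by frame. Layer and feature sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DBLSTMConverter(nn.Module):
    """Stacked bidirectional LSTMs for frame-wise feature conversion."""

    def __init__(self, feat_dim=40, hidden=256, n_layers=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=n_layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, src):
        # src: (B, T, feat_dim) source spectral features.
        out, _ = self.blstm(src)   # (B, T, 2*hidden)
        return self.proj(out)      # (B, T, feat_dim) converted features

# Average model -> adaptation: fine-tune the same network on the small
# target set; an error reduction network can then be trained on the
# residual between converted and target features.
model = DBLSTMConverter()
loss = nn.MSELoss()(model(torch.randn(2, 100, 40)), torch.randn(2, 100, 40))
```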
Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss
The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It
attempts to overcome the problem of an unknown number of speakers in an audio
recording during source separation. The mask approximation loss of SBF is
sub-optimal, as it neither computes the direct signal reconstruction error nor
considers the speech context. To address these problems, this paper proposes a
magnitude and temporal spectrum approximation loss to estimate a phase
sensitive mask for the target speaker with the speaker characteristics.
Moreover, this paper explores a concatenation framework instead of the context
adaptive deep neural network in the SBF method to encode a speaker embedding
into the mask estimation network. Experimental results under open evaluation
condition show that the proposed method achieves 70.4% and 17.7% relative
improvement over the SBF baseline on signal-to-distortion ratio (SDR) and
perceptual evaluation of speech quality (PESQ), respectively. A further
analysis demonstrates 69.1% and 72.3% relative SDR improvements obtained by the
proposed method for different- and same-gender mixtures. Comment: Accepted in ICASSP 201
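A minimal sketch of a magnitude and temporal spectrum approximation loss of the kind described: the estimated mask is applied to the mixture magnitude and compared against a phase-sensitive target, with an additional term on delta (temporal) features. The delta computation and equal weighting are illustrative assumptions.

```python
import torch

def delta(x):
    """First-order temporal difference along the frame axis (B, T, F)."""
    return x[:, 1:] - x[:, :-1]

def mag_temporal_psa_loss(mask, mix_mag, tgt_mag, cos_theta):
    """Phase-sensitive magnitude loss plus a temporal-spectrum term.

    mask, mix_mag, tgt_mag: (B, T, F); cos_theta: cosine of the phase
    difference between mixture and target, same shape.
    """
    est = mask * mix_mag
    psa_target = tgt_mag * cos_theta          # phase-sensitive target
    static = torch.mean((est - psa_target) ** 2)
    temporal = torch.mean((delta(est) - delta(psa_target)) ** 2)
    return static + temporal                  # equal weighting assumed

# Usage with random placeholders:
B, T, F = 2, 100, 257
loss = mag_temporal_psa_loss(torch.rand(B, T, F), torch.rand(B, T, F),
                             torch.rand(B, T, F), torch.rand(B, T, F))
```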
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion
Emotional voice conversion aims to convert the emotion of speech from one
state to another while preserving the linguistic content and speaker identity.
The prior studies on emotional voice conversion are mostly carried out under
the assumption that emotion is speaker-dependent. We believe that speakers
share a common code for emotional expression in a spoken language; therefore,
a speaker-independent mapping between emotional states is possible. In this
paper, we propose a speaker-independent emotional voice conversion framework
that can convert anyone's emotion without the need for parallel data. We
propose a VAW-GAN-based encoder-decoder structure to learn the
spectrum and prosody mapping. We perform prosody conversion by using continuous
wavelet transform (CWT) to model the temporal dependencies. We also investigate
the use of F0 as an additional input to the decoder to improve emotion
conversion performance. Experiments show that the proposed speaker-independent
framework achieves competitive results for both seen and unseen speakers. Comment: Accepted by Interspeech 202
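A minimal sketch of the decoder-side F0 conditioning idea: an encoder produces an emotion-independent latent code, and the decoder reconstructs the spectrum from that code concatenated with an emotion ID and the F0 contour. Dimensions and the concatenation scheme are illustrative assumptions, not the paper's exact VAW-GAN architecture.

```python
import torch
import torch.nn as nn

class EncoderDecoderVC(nn.Module):
    """Encoder-decoder with emotion and F0 conditioning at the decoder."""

    def __init__(self, spec_dim=36, latent_dim=16, n_emotions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(spec_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.emo_embed = nn.Embedding(n_emotions, 8)
        # Decoder input: latent code + emotion embedding + 1-dim F0 per frame.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 8 + 1, 128), nn.ReLU(),
            nn.Linear(128, spec_dim))

    def forward(self, spec, emotion_id, f0):
        # spec: (B, T, spec_dim); emotion_id: (B,); f0: (B, T, 1).
        z = self.encoder(spec)                               # (B, T, latent)
        emo = self.emo_embed(emotion_id)                     # (B, 8)
        emo = emo.unsqueeze(1).expand(-1, spec.size(1), -1)  # (B, T, 8)
        return self.decoder(torch.cat([z, emo, f0], dim=-1))

out = EncoderDecoderVC()(torch.randn(2, 50, 36),
                         torch.tensor([0, 2]), torch.randn(2, 50, 1))
```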
Spoofing detection under noisy conditions: a preliminary investigation and an initial database
Spoofing detection for automatic speaker verification (ASV), which is to
discriminate between live speech and attacks, has received increasing
attention recently. However, all the previous studies have been done on
clean data without significant additive noise. To simulate real-life
scenarios, we perform a preliminary investigation of spoofing detection under
additive noisy conditions, and also describe an initial database for this task.
The noisy database is based on the ASVspoof challenge 2015 database and
generated by artificially adding background noises at different signal-to-noise
ratios (SNRs). Five different additive noises are included. Our preliminary
results show that, with models trained on clean data, system performance
degrades significantly under noisy conditions. Phase-based features are more
noise-robust than magnitude-based features, and the systems perform
significantly differently under different noise scenarios. Comment: Submitted
to Odyssey: The Speaker and Language Recognition Workshop 201
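A minimal sketch of how a noisy database like this is typically generated: scale a noise clip so the mixture reaches a target SNR, then add it to the clean utterance. This is a standard recipe under stated assumptions, not the exact tooling used for the described database.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix noise into clean speech at a target SNR (both 1-D float arrays)."""
    # Tile or truncate the noise to match the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_clean / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

noisy = add_noise_at_snr(np.random.randn(16000),
                         np.random.randn(8000), snr_db=10)
```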
Generative x-vectors for text-independent speaker verification
Speaker verification (SV) systems using deep neural network embeddings, the
so-called x-vector systems, are becoming popular due to their performance,
which is superior to that of i-vector systems. The fusion of these systems
provides improved performance, benefiting from both the discriminatively
trained x-vectors and the generative i-vectors capturing distinct speaker
characteristics. In this paper, we propose a novel method to combine the
complementary information of the i-vector and the x-vector, which we call the
generative x-vector. The
generative x-vector utilizes a transformation model learned from the i-vector
and x-vector representations of the background data. Canonical correlation
analysis is applied to derive this transformation model, which is later used to
transform the standard x-vectors of the enrollment and test segments to the
corresponding generative x-vectors. The SV experiments performed on the NIST
SRE 2010 dataset demonstrate that the system using generative x-vectors
provides considerably better performance than the baseline i-vector and
x-vector systems. Furthermore, the generative x-vectors outperform the fusion
of i-vector and x-vector systems for long-duration utterances, while yielding
comparable results for short-duration utterances. Comment: Accepted for publication at SLT 201
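A minimal sketch of the CCA-based transformation described above, using scikit-learn: fit CCA on paired i-vectors and x-vectors from background data, then project standard x-vectors into the learned space. Dimensions and the use of the x-vector-side canonical projection are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Paired background embeddings: one i-vector and one x-vector per utterance.
rng = np.random.default_rng(0)
ivecs = rng.standard_normal((1000, 400))   # background i-vectors
xvecs = rng.standard_normal((1000, 512))   # background x-vectors

# Learn a transformation that maximally correlates the two views.
cca = CCA(n_components=150)
cca.fit(xvecs, ivecs)

# Transform enrollment/test x-vectors into the "generative" space.
enroll_x = rng.standard_normal((10, 512))
generative_x = cca.transform(enroll_x)     # (10, 150)
```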
VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019
We describe our submitted system for the ZeroSpeech Challenge 2019. The
current challenge theme addresses the difficulty of constructing a speech
synthesizer without any text or phonetic labels and requires a system that can
(1) discover subword units in an unsupervised way, and (2) synthesize the
speech with a target speaker's voice. Moreover, the system should also balance
the discrimination score ABX, the bit-rate compression rate, and the
naturalness and the intelligibility of the constructed voice. To tackle these
problems and achieve the best trade-off, we utilize a vector quantized
variational autoencoder (VQ-VAE) and a multi-scale codebook-to-spectrogram
(Code2Spec) inverter trained by mean square error and adversarial loss. The
VQ-VAE encodes the speech into a latent space, maps each latent vector to its
nearest codebook entry, and produces a compressed representation. Next, the
inverter generates a magnitude spectrogram in the target voice, given the
codebook vectors from the VQ-VAE. In our experiments, we also investigated several other
clustering algorithms, including K-Means and GMM, and compared them with the
VQ-VAE result on ABX scores and bit rates. Our proposed approach significantly
improved the intelligibility (in CER), the MOS, and discrimination ABX scores
compared to the official ZeroSpeech 2019 baseline or even the topline. Comment: Submitted to Interspeech 201
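A minimal sketch of the VQ-VAE quantization step at the heart of the system: each encoder output is snapped to its nearest codebook vector, with a straight-through estimator so gradients flow to the encoder. Codebook size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes=256, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):
        # z: (B, T, code_dim) encoder outputs.
        flat = z.reshape(-1, z.size(-1))                     # (B*T, D)
        # Squared distances to every codebook entry.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        codes = d.argmin(dim=1)                              # (B*T,)
        q = self.codebook(codes).view_as(z)
        # Straight-through: copy gradients from q back to z.
        q_st = z + (q - z).detach()
        commit = torch.mean((z - q.detach()) ** 2)           # commitment loss
        return q_st, codes.view(z.shape[:-1]), commit

q, codes, commit = VectorQuantizer()(torch.randn(2, 50, 64))
```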
Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification
The performance of speaker verification degrades significantly when the test
speech is corrupted by interference speakers. Speaker diarization separates
speakers well when they speak in turn. However, if multiple talkers speak at
the same time, we need a technique that separates the speech in the spectral
domain. This paper proposes an overlapped multi-talker
speaker verification framework by using target speaker extraction methods.
Specifically, given the target speaker information, the target speaker's speech
is first extracted from the overlapped multi-talker speech by a target
speaker extraction module. Then, the extracted speech is passed to the speaker
verification system. Experimental results show that the proposed approach
significantly improves the performance of overlapped multi-talker speaker
verification and achieves a 65.7% relative EER reduction. Comment: 5 pages, 3 figures. This paper is submitted to Interspeech 201
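A minimal sketch of the two-stage pipeline described: a speaker-conditioned mask network extracts the target speech from the mixture's magnitude spectrogram, and the result is handed to a verification back end. All module shapes and the cosine-scoring stand-in are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetExtractor(nn.Module):
    """Estimate a target-speaker mask from the mixture and a speaker embedding."""

    def __init__(self, n_bins=257, emb_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_bins + emb_dim, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_bins)

    def forward(self, mix_mag, spk_emb):
        # mix_mag: (B, T, n_bins); spk_emb: (B, emb_dim) of the target speaker.
        emb = spk_emb.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_mag, emb], dim=-1))
        return torch.sigmoid(self.mask(h)) * mix_mag  # extracted magnitude

# Stage 2 (verification) is represented here by a placeholder cosine score;
# a real system would embed the extracted speech with an SV model first.
extract = TargetExtractor()
extracted = extract(torch.rand(1, 100, 257), torch.randn(1, 128))
score = F.cosine_similarity(extracted.mean(1), torch.randn(1, 257))
```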
