Collapsed speech segment detection and suppression for WaveNet vocoder
In this paper, we propose a technique to alleviate the quality degradation
caused by collapsed speech segments sometimes generated by the WaveNet vocoder.
The effectiveness of the WaveNet vocoder for generating natural speech from
acoustic features has been demonstrated in recent work. However, it sometimes
generates very noisy speech with collapsed speech segments when only a limited
amount of training data is available or significant acoustic mismatches exist
between the training and testing data. Such corpus limitations and model
capacity constraints easily arise in speech generation applications such as
voice conversion and speech enhancement. To address this
problem, we propose a technique to automatically detect collapsed speech
segments. Moreover, to refine the detected segments, we also propose a waveform
generation technique for WaveNet using a linear predictive coding constraint.
Verification and subjective tests are conducted to investigate the
effectiveness of the proposed techniques. The verification results indicate
that the detection technique can detect most collapsed segments. The subjective
evaluations of voice conversion demonstrate that the generation technique
significantly improves the speech quality while maintaining the same speaker
similarity.
Comment: 5 pages, 6 figures. Proc. Interspeech, 201
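As an illustration of the detection idea, here is a minimal sketch (not the
authors' implementation): it flags frames whose short-term energy deviates
strongly from the energy implied by the conditioning features, one plausible
symptom of a collapsed segment. The function name, framing parameters, and
threshold are all hypothetical.

```python
import numpy as np

def detect_collapsed_segments(wav, ref_energy_db, frame_len=400, hop=80,
                              margin_db=10.0):
    """Hypothetical energy-based detector: flag frames whose short-term
    energy deviates from the energy implied by the conditioning acoustic
    features (ref_energy_db, one value per frame) by more than margin_db."""
    n_frames = min(len(ref_energy_db), 1 + (len(wav) - frame_len) // hop)
    flags = np.zeros(n_frames, dtype=bool)
    for t in range(n_frames):
        frame = wav[t * hop: t * hop + frame_len]
        e_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-10)
        flags[t] = abs(e_db - ref_energy_db[t]) > margin_db
    return flags
```

A detector like this would only gate where the LPC-constrained regeneration is
applied; the paper's actual criterion and thresholds may differ.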
The Effect of Explicit Structure Encoding of Deep Neural Networks for Symbolic Music Generation
With recent breakthroughs in artificial neural networks, deep generative
models have become one of the leading techniques for computational creativity.
Despite very promising progress on image and short sequence generation,
symbolic music generation remains a challenging problem, since the structure
of compositions is usually complicated. In this study, we attempt to solve the
melody generation problem constrained by the given chord progression. This
music meta-creation problem can also be incorporated into a plan recognition
system with user inputs and predictive structural outputs. In particular, we
explore the effect of explicit architectural encoding of musical structure by
comparing two sequential generative models: an LSTM (a type of RNN) and WaveNet
(dilated temporal-CNN). As far as we know, this is the first study of applying
WaveNet to symbolic music generation, as well as the first systematic
comparison between a temporal-CNN and an RNN for music generation. We conducted
a survey to evaluate the generated pieces and applied the Variable Markov
Oracle to music pattern discovery. Experimental results show that encoding
structure more explicitly with a stack of dilated convolution layers improves
performance significantly, and that globally encoding the underlying chord
progression into the generation procedure yields a further gain.
Comment: 8 pages, 13 figures
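For readers unfamiliar with the dilated temporal-CNN side of the comparison,
the sketch below shows the core mechanism in PyTorch: a causal stack of 1-D
convolutions whose dilation doubles per layer, so the receptive field grows
exponentially with depth. Layer count, channel width, and vocabulary size are
illustrative, and chord conditioning is omitted; this is not the paper's exact
architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedMelodyCNN(nn.Module):
    """WaveNet-style stack of causal dilated 1-D convolutions over a
    symbolic pitch sequence. Dilation doubles per layer, so the
    receptive field grows exponentially with depth."""
    def __init__(self, n_pitches=130, channels=64, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(n_pitches, channels)
        self.layers = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers)
        ])
        self.out = nn.Conv1d(channels, n_pitches, kernel_size=1)

    def forward(self, tokens):                    # tokens: (batch, time)
        x = self.embed(tokens).transpose(1, 2)    # (batch, channels, time)
        for conv in self.layers:
            pad = conv.dilation[0]                # left-pad only => causal
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return self.out(x)                        # per-step pitch logits
```

With six layers of kernel size 2, the receptive field already spans 64 steps,
which is the kind of long-range structural context an LSTM must instead carry
in its hidden state.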
MFCC-GAN Codec: A New AI-based Audio Coding
In this paper, we propose an AI-based audio coding scheme that uses MFCC
features in an adversarial setting. We combine a conventional encoder with an
adversarial learning decoder to better reconstruct the original waveform.
Since GANs perform implicit density estimation, such models are less prone to
overfitting. We compare our work with five well-known codecs, namely AAC, AC3,
Opus, Vorbis, and Speex, operating at bitrates from 2 kbps to 128 kbps.
MFCCGAN_36k achieved state-of-the-art results in terms of SNR despite a lower
bitrate than AC3_128k, AAC_112k, Vorbis_48k, Opus_48k, and Speex_48k.
MFCCGAN_13k likewise achieved a high SNR of 27, equal to that of AC3_128k and
AAC_112k, at a significantly lower bitrate (13 kbps). MFCCGAN_36k also achieved
higher NISQA-MOS results than AAC_48k while having a 20% lower bitrate.
Furthermore, MFCCGAN_13k obtained a NISQA-MOS of 3.9, much higher than that of
AAC_24k, AAC_32k, AC3_32k, and AAC_48k. As future work, we suggest adopting
loss functions that optimize intelligibility and perceptual metrics in the
MFCCGAN structure, to improve quality and intelligibility simultaneously.
Comment: Accepted in ABU Technical Review journal, 2023
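To make the adversarial setup concrete, here is a minimal sketch of one
training step, assuming a generator that decodes waveforms from the
transmitted MFCC frames and a waveform discriminator. Both modules, and the
hinge loss used here, are hypothetical stand-ins; the paper's architecture and
losses are not reproduced.

```python
import torch

def gan_step(gen, disc, mfcc, wav_real, opt_g, opt_d):
    """One adversarial update: `gen` decodes a waveform from the coded
    MFCC frames; `disc` scores real vs. reconstructed waveforms.
    `gen` and `disc` are hypothetical nn.Modules, not the paper's models."""
    # Discriminator step (hinge loss): push real scores up, fakes down.
    wav_fake = gen(mfcc).detach()
    d_loss = (torch.relu(1.0 - disc(wav_real)).mean()
              + torch.relu(1.0 + disc(wav_fake)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: fool the discriminator. Because the GAN models the
    # waveform distribution implicitly, no explicit density term appears,
    # though a reconstruction loss is often added in practice.
    g_loss = -disc(gen(mfcc)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```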
Voice Conversion with Conditional SampleRNN
Here we present a novel approach to conditioning the SampleRNN generative
model for voice conversion (VC). Conventional methods for VC modify the
perceived speaker identity by converting between source and target acoustic
features. Our approach focuses on preserving voice content and depends on the
generative network to learn voice style. We first train a multi-speaker
SampleRNN model conditioned on linguistic features, pitch contour, and speaker
identity using a multi-speaker speech corpus. Voice-converted speech is
generated using linguistic features and pitch contour extracted from the source
speaker, and the target speaker identity. We demonstrate that our system is
capable of many-to-many voice conversion without requiring parallel data,
enabling broad applications. Subjective evaluation demonstrates that our
approach outperforms conventional VC methods.
Comment: Accepted at Interspeech 2018, Hyderabad, India. This version matches
the final version submitted to the conference.
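The key design point is what the generative network is conditioned on: content
features from the source utterance and an identity embedding for the target
speaker. The sketch below assembles such a frame-rate conditioning sequence;
all dimensions and module names are illustrative assumptions, not the paper's
interface.

```python
import torch
import torch.nn as nn

class VCConditioner(nn.Module):
    """Hypothetical conditioning module: linguistic features and the F0
    contour come from the *source* utterance, while the speaker embedding
    selects the *target* voice for the generative vocoder."""
    def __init__(self, n_ling=300, n_speakers=10, spk_dim=16, out_dim=128):
        super().__init__()
        self.spk_table = nn.Embedding(n_speakers, spk_dim)
        self.proj = nn.Linear(n_ling + 1 + spk_dim, out_dim)

    def forward(self, ling, f0, target_spk):
        # ling: (batch, frames, n_ling); f0: (batch, frames, 1)
        spk = self.spk_table(target_spk)                  # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, ling.size(1), -1)
        return self.proj(torch.cat([ling, f0, spk], dim=-1))
```

Because the speaker is just an embedding index, swapping `target_spk` at
inference time is what makes many-to-many conversion possible without
parallel data.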
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder
Non-parallel many-to-many voice conversion remains an interesting but
challenging speech processing task. Many style-transfer-inspired methods such
as generative adversarial networks (GANs) and variational autoencoders (VAEs)
have been proposed. Recently, AutoVC, a conditional autoencoder (CAE) based
method, achieved state-of-the-art results by disentangling speaker identity
and speech content using information-constraining bottlenecks, and it achieves
zero-shot conversion by swapping in a different speaker's identity embedding to
synthesize a new voice. However, we found that while speaker identity is
disentangled from speech content, a significant amount of prosodic information,
such as source F0, leaks through the bottleneck, causing target F0 to fluctuate
unnaturally. Furthermore, AutoVC offers no control over the converted F0 and
is thus unsuitable for many applications. In this paper, we modify and improve
autoencoder-based voice conversion to disentangle content, F0, and speaker
identity at the same time. As a result, we can control the F0 contour,
generate speech with an F0 consistent with the target speaker, and
significantly improve quality and similarity. We support our improvements
through quantitative and qualitative analysis.
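The fix described here is to feed F0 to the decoder explicitly rather than
letting it leak through the content bottleneck. One common encoding for such
conditioning, assumed here for illustration and not necessarily the authors'
exact scheme, is speaker-normalized log-F0 quantized into one-hot bins:

```python
import numpy as np

def quantize_f0(f0_hz, n_bins=256):
    """Hypothetical F0 encoding: speaker-normalize log-F0 and quantize it
    into one-hot bins, reserving bin 0 for unvoiced frames. Conditioning
    the decoder on this removes the incentive for F0 to pass through the
    content bottleneck."""
    voiced = f0_hz > 0
    log_f0 = np.zeros_like(f0_hz, dtype=np.float64)
    log_f0[voiced] = np.log(f0_hz[voiced])
    if voiced.any():
        mu, sigma = log_f0[voiced].mean(), log_f0[voiced].std() + 1e-8
        norm = np.clip((log_f0 - mu) / (4 * sigma) + 0.5, 0.0, 1.0)
    else:
        norm = log_f0
    idx = np.zeros(len(f0_hz), dtype=np.int64)
    idx[voiced] = 1 + np.minimum(
        (norm[voiced] * (n_bins - 1)).astype(np.int64), n_bins - 2)
    return np.eye(n_bins, dtype=np.float32)[idx]   # (frames, n_bins)
```

At conversion time one can then supply any desired contour, which is what
gives the model explicit F0 control.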
Refined WaveNet Vocoder for Variational Autoencoder Based Voice Conversion
This paper presents a refinement framework of WaveNet vocoders for
variational autoencoder (VAE) based voice conversion (VC), which reduces the
quality distortion caused by the mismatch between the training data and testing
data. Conventional WaveNet vocoders are trained with natural acoustic features
but conditioned on the converted features in the conversion stage for VC, and
such a mismatch often causes significant quality and similarity degradation. In
this work, we take advantage of the particular structure of VAEs to refine
WaveNet vocoders with the self-reconstructed features generated by VAE, which
have characteristics similar to those of the converted features while sharing
the same temporal structure as the target natural features. We analyze these
features and show that the self-reconstructed features are similar to the
converted features. Objective and subjective experimental results demonstrate
the effectiveness of our proposed framework.
Comment: 5 pages, 7 figures, 1 table. Accepted to EUSIPCO 201
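In training-loop terms, the refinement amounts to fine-tuning the vocoder on
pairs of (self-reconstructed features, natural waveform). A minimal sketch
follows, assuming hypothetical `vae.encode`/`vae.decode` methods and a vocoder
exposing a teacher-forced negative log-likelihood; none of these interfaces
come from the paper.

```python
import torch

def refine_vocoder(vae, vocoder, loader, optimizer, n_steps=10000):
    """Fine-tune the vocoder on VAE self-reconstructed features: each
    utterance is encoded and decoded back with its *own* speaker code,
    yielding features that resemble converted ones yet stay time-aligned
    with the natural target waveform. All interfaces are hypothetical."""
    vae.eval()
    vocoder.train()
    step = 0
    while step < n_steps:
        for feats, wav, spk in loader:
            with torch.no_grad():
                z = vae.encode(feats)
                feats_sr = vae.decode(z, spk)        # self-reconstruction
            loss = vocoder.nll(wav, cond=feats_sr)   # teacher-forced NLL
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            step += 1
            if step >= n_steps:
                break
```

The point of the self-reconstruction pass is that the conditioning features
seen in fine-tuning match the statistics of converted features, closing the
train/test mismatch the abstract describes.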