Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer
In this paper, we propose a differentiable WORLD synthesizer and demonstrate
its use in end-to-end audio style transfer tasks such as (singing) voice
conversion and the DDSP timbre transfer task. Accordingly, our baseline
differentiable synthesizer has no model parameters, yet it yields adequate
synthesis quality. We can extend the baseline synthesizer by appending
lightweight black-box postnets which apply further processing to the baseline
output in order to improve fidelity. An alternative differentiable approach
considers extraction of the source excitation spectrum directly, which can
improve naturalness albeit for a narrower class of style transfer applications.
The acoustic feature parameterization used by our approaches has the added
benefit that it naturally disentangles pitch and timbral information so that
they can be modeled separately. Moreover, as there exists a robust means of
estimating these acoustic features from monophonic audio sources, it allows for
parameter loss terms to be added to an end-to-end objective function, which can
help convergence and/or further stabilize (adversarial) training.
Comment: A revised version of this work has been accepted to the 154th AES Convention; 12 pages, 4 figures
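As a hedged illustration of the parameter loss terms mentioned above, the sketch below combines a waveform reconstruction loss with auxiliary losses on WORLD-style acoustic features (F0 and spectral envelope). All names and weights are illustrative placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

# Sketch of the composite objective the abstract alludes to: an end-to-end
# waveform loss plus auxiliary losses on WORLD-style acoustic parameters.
# Tensors and weights are placeholders, not the paper's configuration.
def style_transfer_loss(y_hat, y, f0_hat, f0, sp_hat, sp,
                        w_f0=1.0, w_sp=1.0):
    recon = F.l1_loss(y_hat, y)      # waveform reconstruction loss
    f0_loss = F.l1_loss(f0_hat, f0)  # pitch parameter loss
    sp_loss = F.l1_loss(sp_hat, sp)  # spectral-envelope parameter loss
    return recon + w_f0 * f0_loss + w_sp * sp_loss
```

Because F0 and the spectral envelope can be estimated robustly from monophonic audio, the two auxiliary terms can be computed on both target and generated audio, which is what makes them usable as additional supervision in an otherwise end-to-end objective.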
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era
Speech is the fundamental mode of human communication, and its synthesis has
long been a core priority in human-computer interaction research. In recent
years, machines have managed to master the art of generating speech that is
understandable by humans. But the linguistic content of an utterance
encompasses only a part of its meaning. Affect, or expressivity, has the
capacity to turn speech into a medium capable of conveying intimate thoughts,
feelings, and emotions -- aspects that are essential for engaging and
naturalistic interpersonal communication. While the goal of imparting
expressivity to synthesised utterances has so far remained elusive, following
recent advances in text-to-speech synthesis, a paradigm shift is well under way
in the fields of affective speech synthesis and conversion as well. Deep
learning, as the technology which underlies most of the recent advances in
artificial intelligence, is spearheading these efforts. In the present
overview, we outline ongoing trends and summarise state-of-the-art approaches
in an attempt to provide a comprehensive overview of this exciting field.
Comment: Submitted to the Proceedings of the IEEE
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis
Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can be trained with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a tighter lower bound on the log marginal likelihood than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs, and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher-quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave). We release our code at https://github.com/tencent-ailab/bddm
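For intuition, the sketch below shows a generic DDPM-style ancestral sampler driven by a short noise schedule. It is not BDDM's exact update rule; in BDDM, the schedule would be produced by the learned schedule network rather than fixed by hand, and the score network here is a dummy stand-in.

```python
import torch

# Generic DDPM-style ancestral sampling with a short noise schedule.
# Illustrative only: BDDM's actual reverse update and learned schedule differ.
@torch.no_grad()
def few_step_sample(score_net, betas, shape):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                   # start from pure noise
    for t in reversed(range(len(betas))):
        eps = score_net(x, t)                # predicted noise at step t
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                            # no noise added on the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# e.g. a 3-step schedule, matching the few-step regime reported in the abstract
waveform = few_step_sample(lambda x, t: torch.zeros_like(x),  # dummy score net
                           betas=torch.tensor([1e-4, 1e-3, 1e-2]),
                           shape=(1, 16000))
```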
MFCCGAN: A Novel MFCC-Based Speech Synthesizer Using Adversarial Learning
In this paper, we introduce MFCCGAN, a novel speech synthesizer based on adversarial learning that adopts MFCCs as input and generates raw speech waveforms. Benefiting from the capabilities of the GAN model, it produces speech with higher intelligibility than WORLD, a rule-based MFCC-based speech synthesizer. We evaluated the model with a popular intrusive objective speech intelligibility measure (STOI) and a quality measure (the NISQA score). Experimental results show that our proposed system outperforms Librosa MFCC inversion (by an increase of about 26% to 53% in STOI and 16% to 78% in NISQA score) and achieves a rise of about 10% in intelligibility and about 4% in naturalness compared with the conventional rule-based vocoder WORLD used in the CycleGAN-VC family. However, WORLD requires additional data such as F0. Finally, using a STOI-based perceptual loss in the discriminators could further improve quality. WebMUSHRA-based subjective tests also confirm the quality of the proposed approach.
Comment: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
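The Librosa MFCC-inversion baseline referenced above can be approximated as follows; the input signal and hyperparameters are illustrative, not the paper's evaluation setup.

```python
import numpy as np
import librosa

# Rough reproduction of the Librosa MFCC-inversion baseline the abstract
# compares against: MFCCs -> mel spectrogram -> Griffin-Lim waveform.
sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220 * t)                 # toy input signal

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)    # analysis
y_hat = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)  # inversion
```

Because this pipeline discards phase and fine spectral detail, Griffin-Lim reconstruction is a weak baseline, which is consistent with the large STOI and NISQA margins the abstract reports.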
Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks
The state-of-the-art in text-to-speech synthesis has recently improved
considerably due to novel neural waveform generation methods, such as WaveNet.
However, these methods suffer from their slow sequential inference process,
while their parallel versions are difficult to train and even more expensive
computationally. Meanwhile, generative adversarial networks (GANs) have
achieved impressive results in image generation and are making their way into
audio applications; parallel inference is among their lucrative properties. By
adopting recent advances in GAN training techniques, this investigation studies
waveform generation for TTS in two domains (speech signal and glottal
excitation). Listening test results show that while direct waveform generation
with GAN is still far behind WaveNet, a GAN-based glottal excitation model can
achieve quality and voice similarity on par with a WaveNet vocoder.
Comment: Submitted to ICASSP 2019
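The glottal excitation approach can be pictured as a source-filter pipeline. The toy sketch below substitutes noise for the GAN-generated excitation and a fixed all-pole filter for the predicted vocal tract filter, purely to illustrate the decomposition.

```python
import numpy as np
from scipy.signal import lfilter

# Toy source-filter view of glottal excitation synthesis: the generator's
# excitation (here: noise as a stand-in for the GAN output) is shaped by an
# all-pole vocal tract filter (here: fixed toy coefficients; in practice the
# filter comes from the acoustic model).
excitation = np.random.randn(16000)      # stand-in for the GAN excitation
a = [1.0, -1.2, 0.5]                     # toy stable all-pole coefficients
speech = lfilter([1.0], a, excitation)   # vocal tract filtering
```

Modeling only the excitation leaves the comparatively easy spectral-envelope shaping to a conventional filter, which is one plausible reason the glottal GAN closes the gap to WaveNet while direct waveform GANs lag behind.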
SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias
Generative adversarial network (GAN)-based neural vocoders have been widely
used in audio synthesis tasks due to their high generation quality, efficient
inference, and small computation footprint. However, it is still challenging to
train a universal vocoder which can generalize well to out-of-domain (OOD)
scenarios, such as unseen speaking styles, non-speech vocalization, singing,
and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal
vocoder, which can synthesize high-fidelity audio in various OOD scenarios.
SnakeGAN takes a coarse-grained signal generated by a differentiable digital
signal processing (DDSP) model as prior knowledge, aiming at recovering
high-fidelity waveform from a Mel-spectrogram. We introduce periodic
nonlinearities through the Snake activation function and anti-aliased
representation into the generator, which further brings the desired inductive
bias for audio synthesis and significantly improves the extrapolation capacity
for universal vocoding in unseen scenarios. To validate the effectiveness of
our proposed method, we train SnakeGAN with only speech data and evaluate its
performance for various OOD distributions with both subjective and objective
metrics. Experimental results show that SnakeGAN significantly outperforms the
compared approaches and can generate high-fidelity audio samples including
unseen speakers with unseen styles, singing voices, instrumental pieces, and
nonverbal vocalization.
Comment: Accepted by ICME 2023
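The Snake activation mentioned above has a simple closed form, snake(x) = x + (1/α)·sin²(αx); a minimal version is below, with α fixed rather than learned per channel as it typically is in practice.

```python
import torch

def snake(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # Snake activation: x + (1/alpha) * sin^2(alpha * x).
    # The periodic sin^2 term supplies the periodic inductive bias for
    # audio; alpha controls the frequency of the nonlinearity.
    return x + torch.sin(alpha * x) ** 2 / alpha
```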
Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition
Vocoder models have recently achieved substantial progress in generating authentic audio comparable to human quality while significantly reducing memory requirements and inference time. However, these data-hungry generative models
require large-scale audio data for learning good representations. In this
require large-scale audio data for learning good representations. In this
paper, we apply contrastive learning methods in training the vocoder to improve
the perceptual quality of the vocoder without modifying its architecture or
adding more data. We design an auxiliary task with mel-spectrogram contrastive
learning to enhance the utterance-level quality of the vocoder model under
data-limited conditions. We also extend the task to include waveforms to
improve the multi-modality comprehension of the model and address the
discriminator overfitting problem. We optimize the additional task
simultaneously with the GAN training objectives. Our results show that these tasks substantially improve model performance in data-limited settings. Our analysis indicates that the proposed design successfully alleviates discriminator overfitting and produces audio of higher fidelity.
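One plausible shape for such a contrastive auxiliary objective is an InfoNCE-style loss over paired mel-spectrogram and waveform embeddings; the sketch below is an assumption about the general form, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(z_mel, z_wav, tau=0.1):
    # InfoNCE-style contrastive loss: the matched (mel, waveform) embedding
    # pairs in a batch are positives; every other pairing is a negative.
    z_mel = F.normalize(z_mel, dim=-1)
    z_wav = F.normalize(z_wav, dim=-1)
    logits = z_mel @ z_wav.t() / tau                 # (B, B) similarity matrix
    labels = torch.arange(z_mel.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Adding such a term to the GAN objective gives the model a training signal that does not depend on the discriminator, which is one way an auxiliary task can counteract discriminator overfitting when data is scarce.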
Methods for speaking style conversion from normal speech to high vocal effort speech
This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics on the conversion of normal speech to high vocal effort. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with a vocal effort mismatch between test and enrollment utterances (shouted speech vs. normal speech). The mismatch degrades the system's speaker identification performance. As a solution, we proposed an SSC system that included a novel spectral mapping, used alongside a statistical mapping technique, to transform the mel-frequency spectral energies of normal speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing speaker identification rates for a state-of-the-art i-vector-based speaker recognition system, with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves the speaker identification rates.
The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. This system first extracts speech features using the vocoder. Next, a mapping technique, robust to data scarcity, maps the features. Finally, the vocoder synthesizes the mapped features into speech. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoder cases with two subjective listening tests that measured similarity to Lombard speech and naturalness. The similarity test showed that, for both vocoder cases, our proposed SSC system was able to convert normal speech to Lombard speech. The naturalness test showed that the converted samples using the glottal vocoder were clearly more natural than those obtained with STRAIGHT.
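The analyze-map-synthesize pipeline described above can be sketched with the WORLD vocoder (via pyworld) standing in for the thesis's glottal and STRAIGHT vocoders; the F0 scaling is a toy stand-in for the learned, data-scarcity-robust mapping.

```python
import numpy as np
import pyworld as pw

# Analyze-map-synthesize SSC sketch. WORLD (pyworld) stands in for the
# thesis's glottal/STRAIGHT vocoders; the mapping step is a toy F0 shift.
fs = 16000
x = 0.3 * np.sin(2 * np.pi * 150 * np.arange(fs) / fs)  # toy "normal" speech

f0, t = pw.harvest(x, fs)                # 1) analysis: F0 contour
sp = pw.cheaptrick(x, f0, t, fs)         #    spectral envelope
ap = pw.d4c(x, f0, t, fs)                #    aperiodicity

f0_mapped = f0 * 1.2                     # 2) toy normal-to-Lombard mapping

y = pw.synthesize(f0_mapped, sp, ap, fs) # 3) synthesis of mapped features
```

In the thesis, the mapping stage operates on the full vocoder parameterization rather than F0 alone, and the choice of vocoder (glottal vs. STRAIGHT) is exactly what the naturalness test compares.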