Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer
In this paper, we propose a differentiable WORLD synthesizer and demonstrate
its use in end-to-end audio style transfer tasks such as (singing) voice
conversion and the DDSP timbre transfer task. Notably, our baseline
differentiable synthesizer has no model parameters, yet it yields adequate
synthesis quality. We can extend the baseline synthesizer by appending
lightweight black-box postnets which apply further processing to the baseline
output in order to improve fidelity. An alternative differentiable approach
considers extraction of the source excitation spectrum directly, which can
improve naturalness albeit for a narrower class of style transfer applications.
The acoustic feature parameterization used by our approaches has the added
benefit that it naturally disentangles pitch and timbral information so that
they can be modeled separately. Moreover, as there exists a robust means of
estimating these acoustic features from monophonic audio sources, it allows for
parameter loss terms to be added to an end-to-end objective function, which can
help convergence and/or further stabilize (adversarial) training.
Comment: A revised version of this work has been accepted to the 154th AES Convention; 12 pages, 4 figures.
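No code accompanies the abstract here, so the following is only a rough sketch, in PyTorch, of the general idea: WORLD-style features (per-frame F0 and a spectral envelope) driving a parameter-free harmonic-plus-noise synthesizer that stays differentiable end to end. The hop size, harmonic count, noise level, and crude per-frame filtering are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch: a parameter-free, differentiable synthesizer driven
# by WORLD-style features. Hop size, harmonic count, and the per-frame
# filtering (no overlap-add) are assumptions chosen for brevity.
import torch

def synthesize(f0, env, hop=256, sr=24000, n_harm=64):
    """f0: (T,) Hz per frame; env: (T, n_fft//2+1) linear magnitudes."""
    T = f0.shape[0]
    # Upsample F0 to sample rate and integrate it to get the phase.
    f0_up = torch.repeat_interleave(f0, hop)                    # (T*hop,)
    phase = 2 * torch.pi * torch.cumsum(f0_up / sr, dim=0)
    k = torch.arange(1, n_harm + 1).view(1, -1)                 # harmonic index
    alive = (f0_up.view(-1, 1) * k < sr / 2).float()            # kill aliasing
    source = (torch.sin(phase.view(-1, 1) * k) * alive).sum(-1)
    voiced = torch.repeat_interleave((f0 > 0).float(), hop)
    source = source * voiced + 3e-3 * torch.randn_like(source)  # noise floor
    # Impose the spectral envelope frame by frame (crude, but differentiable).
    frames = source.reshape(T, hop)
    n_fft = 2 * (env.shape[1] - 1)
    spec = torch.fft.rfft(frames, n=n_fft) * env
    return torch.fft.irfft(spec, n=n_fft)[:, :hop].reshape(-1)
```

Because F0 and the envelope can be estimated robustly from monophonic audio, a feature-domain term such as an L1 loss between predicted and reference envelopes can be added next to any audio-domain loss, which is the stabilizing effect the abstract alludes to.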
In search of the optimal acoustic features for statistical parametric speech synthesis
In the Statistical Parametric Speech Synthesis (SPSS) paradigm, speech is generally
represented as acoustic features and the waveform is generated by a vocoder. A comprehensive
summary of state-of-the-art vocoding techniques is presented, highlighting
their characteristics, advantages, and drawbacks, primarily when used in SPSS. We
conclude that state-of-the-art vocoding methods are suboptimal and are a cause of significant loss of quality, even though numerous vocoders have been proposed in the last
decade. In fact, it seems that the most complicated methods perform worse than simpler
ones based on more robust analysis/synthesis algorithms. Typical methods, based on
the source-filter or sinusoidal models, rely on excessive simplifying assumptions. They
perform what we call an "extreme decomposition" of speech (e.g., source+filter or sinusoids+
noise), which we believe to be a major drawback. Problems include: difficulties
in the estimation of components; modelling of complex non-linear mechanisms; a lack
of ground truth. In addition, the statistical dependence that exists between stochastic
and deterministic components of speech is not modelled.
We start by improving just the waveform generation stage of SPSS, using standard
acoustic features. We propose a new method of waveform generation tailored for SPSS,
based on neither source-filter separation nor sinusoidal modelling. The proposed waveform
generator avoids unnecessary assumptions and decompositions as far as possible,
and uses only the fundamental frequency and spectral envelope as acoustic features. A
very small speech database is used as a source of base speech signals which are subsequently
\reshaped" to match the specifications output by the acoustic model in the
SPSS framework. All of this is done without any decomposition, such as source+filter
or harmonics+noise. A comprehensive description of the waveform generation process
is presented, along with implementation issues. Two SPSS voices, a female and a male,
were built to test the proposed method by using a standard TTS toolkit, Merlin. In
a subjective evaluation, listeners preferred the proposed waveform generator over a
state-of-the-art vocoder, STRAIGHT.
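Reading between the lines, the "reshaping" can be pictured as per-frame envelope substitution. The sketch below is only an interpretation, in NumPy, with a cepstral-liftering envelope estimate standing in for whatever the thesis actually uses: a target spectral envelope is imposed on a base speech frame while its fine structure and phase are left intact.

```python
# Hypothetical sketch of envelope "reshaping"; the cepstral envelope
# estimate and the liftering cutoff are illustrative assumptions.
import numpy as np

def spectral_envelope(frame, n_fft=1024, lifter=30):
    """Crude cepstral envelope estimate of one windowed frame."""
    mag = np.abs(np.fft.rfft(frame, n_fft)) + 1e-8
    ceps = np.fft.irfft(np.log(mag))
    ceps[lifter:-lifter] = 0.0                 # keep only low quefrencies
    return np.exp(np.fft.rfft(ceps).real)

def reshape_frame(base_frame, target_env, n_fft=1024):
    """Swap the base frame's envelope for the target, keep everything else."""
    spec = np.fft.rfft(base_frame, n_fft)
    out = spec / spectral_envelope(base_frame, n_fft) * target_env
    return np.fft.irfft(out)[: len(base_frame)]
```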
Even though the proposed "waveform reshaping" generator produces higher speech
quality than STRAIGHT, the improvement is not large enough. Consequently, we propose
a new acoustic representation, whose implementation involves feature extraction
and waveform generation, i.e., a complete vocoder. The new representation encodes
the complex spectrum derived from the Fourier Transform in a way explicitly designed
for SPSS, rather than for speech coding or copy-synthesis. The feature set comprises
four feature streams describing magnitude spectrum, phase spectrum, and fundamental
frequency; all of these are represented by real numbers. It avoids heuristic or
unstable phase-unwrapping methods. The new feature extraction does not attempt to
decompose the speech structure, and thus the "phasiness" and "buzziness" found in a
typical vocoder, such as STRAIGHT, are dramatically reduced. Our method works at
a lower frame rate than a typical vocoder. To demonstrate the proposed method, two
DNN-based voices, a male and a female, were built using the Merlin toolkit. Subjective
comparisons were performed with a state-of-the-art baseline. The proposed vocoder
substantially outperformed the baseline for both voices and under all configurations
tested. Furthermore, several enhancements were made over the original design, which
are beneficial for either sound quality or compatibility with other tools. In addition to
its use in SPSS, the proposed vocoder is also demonstrated for join smoothing
in unit selection-based systems, and can be used for voice conversion or automatic
speech recognition.
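The abstract does not spell out the phase encoding, but one standard way to represent phase with real numbers and no unwrapping, consistent with the description, is to store the cosine and sine of the phase (the real and imaginary parts of the unit-normalized spectrum) alongside the log magnitude. A minimal NumPy sketch, purely an assumption about the representation:

```python
# Hypothetical sketch: real-valued, unwrap-free spectral features.
import numpy as np

def encode(frame, n_fft=2048):
    spec = np.fft.rfft(frame, n_fft)
    mag = np.abs(spec) + 1e-8
    unit = spec / mag                          # e^{j*phase}, magnitude one
    return np.log(mag), unit.real, unit.imag   # three real-valued streams

def decode(log_mag, ph_re, ph_im, n_fft=2048):
    unit = ph_re + 1j * ph_im
    unit /= np.abs(unit) + 1e-8                # renormalize after modelling
    return np.fft.irfft(np.exp(log_mag) * unit, n_fft)
```

Because cosine and sine vary smoothly where wrapped phase jumps by 2π, streams like these are far better behaved under statistical modelling.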
HMM-based speech synthesis using an acoustic glottal source model
Parametric speech synthesis has received increased attention in recent years following
the development of statistical HMM-based speech synthesis. However, the speech
produced using this method still does not sound as natural as human speech and there
is limited parametric flexibility to replicate voice quality aspects, such as breathiness.
The hypothesis of this thesis is that speech naturalness and voice quality can be
more accurately replicated by a HMM-based speech synthesiser using an acoustic glottal
source model, the Liljencrants-Fant (LF) model, to represent the source component
of speech instead of the traditional impulse train.
Two different analysis-synthesis methods were developed during this thesis, in order
to integrate the LF-model into a baseline HMM-based speech synthesiser, which is
based on the popular HTS system and uses the STRAIGHT vocoder. The first method,
which is called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model
signal through a glottal post-filter to obtain the source signal and then generating
speech, by passing this source signal through the spectral envelope filter. The system
which uses the GPF method (HTS-GPF system) is similar to the baseline system,
but it uses a different source signal instead of the impulse train used by STRAIGHT.
The second method, called Glottal Spectral Separation (GSS), generates speech by
passing the LF-model signal through the vocal tract filter. The major advantage of the
synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic
properties of the LF-model parameters are automatically learnt by the HMMs.
In this thesis, an initial perceptual experiment was conducted to compare the
LF-model to the impulse train. The results showed that the LF-model was significantly
better, both in terms of speech naturalness and replication of two basic voice qualities
better, both in terms of speech naturalness and replication of two basic voice qualities
(breathy and tense). In a second perceptual evaluation, the HTS-LF system was rated
better than the baseline system, although the difference was smaller than
expected. A third experiment was conducted to evaluate the HTS-GPF system
and an improved HTS-LF system, in terms of speech naturalness, voice similarity
and intelligibility. The results showed that the HTS-GPF system performed similarly
to the baseline. However, the HTS-LF system was significantly outperformed by the
baseline. Finally, acoustic measurements were performed on the synthetic speech to
investigate the speech distortion in the HTS-LF system. The results indicated that a
problem in replicating the rapid variations of the vocal tract filter parameters at transitions
between voiced and unvoiced sounds is the most significant cause of speech
distortion. This finding motivates future work to further improve the system.
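For reference, the LF model describes one cycle of the glottal flow derivative as an exponentially growing sinusoid up to the main excitation instant Te, followed by an exponential return phase. A minimal NumPy sketch with illustrative parameter values; in particular, the growth rate alpha is fixed here for brevity, whereas a full implementation solves for it so that the flow integrates to zero over the cycle:

```python
# Hypothetical sketch of one LF-model cycle; timing parameters and the
# fixed alpha are illustrative assumptions.
import numpy as np

def lf_cycle(sr=16000, T0=0.008, Tp=0.004, Te=0.0055, Ta=0.0003, Ee=1.0):
    t = np.arange(int(sr * T0)) / sr
    wg = np.pi / Tp                             # open-phase frequency
    # Solve eps*Ta = 1 - exp(-eps*(T0 - Te)) by fixed-point iteration.
    eps = 1.0 / Ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (T0 - Te))) / Ta
    alpha = 3000.0                              # assumed, not solved for
    E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))  # so E(Te) = -Ee
    open_phase = E0 * np.exp(alpha * t) * np.sin(wg * t)
    ret = -(Ee / (eps * Ta)) * (np.exp(-eps * (t - Te))
                                - np.exp(-eps * (T0 - Te)))
    return np.where(t <= Te, open_phase, ret)
```

In a GSS-style system this waveform replaces the impulse train as the source signal before vocal tract filtering.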
Object coding of music using expressive MIDI
Structured audio uses a high-level representation of a signal to produce audio output.
When it was first introduced in 1998, creating a structured audio representation
from an audio signal was beyond the state-of-the-art. Inspired by object coding and
structured audio, we present a system to reproduce audio using Expressive MIDI,
in which high-level parameters represent pitch expression extracted from an audio signal.
This allows a low bit-rate MIDI sketch of the original audio to be produced.
We examine optimisation techniques which may be suitable for inferring Expressive
MIDI parameters from estimated pitch trajectories, considering the effect of data
codings on the difficulty of optimisation. We look at some less common Gray codes
and examine their effect on algorithm performance on standard test problems.
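For concreteness, the baseline such studies start from is the standard binary-reflected Gray code, in which adjacent integers differ in exactly one bit, so small parameter steps flip few bits; the less common codes the thesis examines are not reproduced here.

```python
# Standard binary-reflected Gray code (sketch for illustration only).
def gray_encode(n: int) -> int:
    """Adjacent integers map to codewords differing in one bit."""
    return n ^ (n >> 1)

def gray_decode(g: int) -> int:
    """Invert by folding the XOR prefix back down."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

assert all(gray_decode(gray_encode(i)) == i for i in range(1024))
assert all(bin(gray_encode(i) ^ gray_encode(i + 1)).count("1") == 1
           for i in range(1023))
```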
We build an expressive MIDI system, estimating parameters from audio and synthesising
output from those parameters. When the parameter estimation succeeds,
we find that the system produces note pitch trajectories which match source audio to
within 10 pitch cents. We consider the quality of the system in terms of both parameter
estimation and the final output, finding that improvements to core components
(audio segmentation and pitch estimation, both active research fields) would produce
a better system.
We examine the current state-of-the-art in pitch estimation, and find that some
estimators produce high precision estimates but are prone to harmonic errors, whilst
other estimators produce fewer harmonic errors but are less precise. Inspired by this,
we produce a novel pitch estimator that combines the outputs of existing estimators.
Speech Enhancement Using Speech Synthesis Techniques
Traditional speech enhancement systems reduce noise by modifying the noisy signal to make it more like a clean signal, an approach that suffers from two problems: under-suppression of noise and over-suppression of speech. These problems create distortions in the enhanced speech and hurt the quality of the enhanced signal. We propose to utilize speech synthesis techniques for a higher quality speech enhancement system. Synthesizing clean speech based on the noisy signal could produce outputs that are both noise-free and high quality. We first show that we can replace the noisy speech with a clean resynthesis built from a previously recorded clean speech dictionary from the same speaker (concatenative resynthesis). Next, we show that using a speech synthesizer (vocoder) we can create a clean resynthesis of the noisy speech for more than one speaker. We term this parametric resynthesis (PR). PR can generate better prosody from noisy speech than a TTS system that uses textual information only. Additionally, we can use the high quality speech generation capability of neural vocoders for better quality speech enhancement. When trained on data from enough speakers, these vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training. Finally, we show that using neural vocoders we can achieve better objective signal and overall quality than state-of-the-art speech enhancement systems, and better subjective quality than an oracle mask-based system.
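A minimal sketch of the parametric resynthesis idea, with an assumed feature choice and network shape rather than the thesis's actual architecture: a recurrent network maps noisy acoustic frames to the clean acoustic features a vocoder expects, and a pretrained vocoder then synthesizes noise-free speech from those predictions.

```python
# Hypothetical sketch of parametric resynthesis (PR); feature dimensions
# and the BLSTM architecture are illustrative assumptions.
import torch
import torch.nn as nn

class PRNet(nn.Module):
    def __init__(self, n_in=80, n_out=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_in, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_out)

    def forward(self, noisy_feats):              # (B, T, n_in) noisy frames
        h, _ = self.rnn(noisy_feats)
        return self.out(h)                       # predicted clean features

# Training: regress onto the clean signal's features instead of a mask,
#   loss = nn.functional.l1_loss(model(noisy_feats), clean_feats)
# Inference: audio = vocoder(model(noisy_feats))  # any pretrained vocoder
```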