HMM-based speech synthesiser using the LF-model of the glottal source
A major factor that degrades speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper sets out a new approach to using an acoustic glottal source model in HMM-based synthesisers instead of the traditional pulse signal. The goal is to improve speech quality and to better model and transform voice characteristics. We have found that the new method decreases buzziness and also improves prosodic modelling. A perceptual evaluation has supported this finding by showing a 55.6 % preference for the new system over the baseline. This improvement, while not as large as we had initially expected, does encourage us to develop the proposed speech synthesiser further.
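For readers unfamiliar with the baseline this abstract criticises, the following is a minimal sketch of a delta pulse train excitation for voiced speech. The function name `pulse_train` and the chosen F0 and sample rate are assumptions for illustration only; the abstract gives no implementation details, and the LF-model that replaces the pulse is not shown here.

```python
def pulse_train(f0_hz, sample_rate, n_samples):
    """Return a list with a unit impulse at the start of every pitch period.

    Hypothetical helper: a constant F0 is assumed for the whole signal.
    """
    period = round(sample_rate / f0_hz)  # pitch period in samples
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


# 50 ms of excitation at 100 Hz, sampled at 16 kHz (illustrative values).
excitation = pulse_train(f0_hz=100.0, sample_rate=16000, n_samples=800)
n_pulses = sum(excitation)
```

Every sample between pulses is exactly zero, so the excitation carries no information about the shape of the glottal pulse. That missing information is what an acoustic glottal source model such as the LF-model is meant to supply.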
RawNet: Fast End-to-End Neural Vocoder
Neural network based vocoders have recently demonstrated a powerful
ability to synthesize high-quality speech. These models usually generate
samples by conditioning on spectral features, such as the Mel-spectrum.
However, these features are extracted by a speech analysis module that includes
processing based on human knowledge. In this work, we propose RawNet,
a truly end-to-end neural vocoder, which uses a coder network to learn a
higher-level representation of the signal, and an autoregressive voder network to
generate speech sample by sample. The coder and voder together act like an
auto-encoder network, and can be jointly trained directly on the raw waveform
without any human-designed features. Experiments on copy-synthesis
tasks show that RawNet achieves synthesized speech quality comparable
to LPCNet, with a smaller model architecture and faster speech generation at
the inference step. Comment: Submitted to Interspeech 2019, Graz, Austria
In search of the optimal acoustic features for statistical parametric speech synthesis
In the Statistical Parametric Speech Synthesis (SPSS) paradigm, speech is generally
represented as acoustic features and the waveform is generated by a vocoder. A comprehensive
summary of state-of-the-art vocoding techniques is presented, highlighting
their characteristics, advantages, and drawbacks, primarily when used in SPSS. We
conclude that state-of-the-art vocoding methods are suboptimal and are a cause of significant loss of quality, even though numerous vocoders have been proposed in the last
decade. In fact, it seems that the most complicated methods perform worse than simpler
ones based on more robust analysis/synthesis algorithms. Typical methods, based on
the source-filter or sinusoidal models, rely on excessive simplifying assumptions. They
perform what we call an "extreme decomposition" of speech (e.g., source+filter or
sinusoids+noise), which we believe to be a major drawback. Problems include: difficulties
in the estimation of components; modelling of complex non-linear mechanisms; a lack
of ground truth. In addition, the statistical dependence that exists between stochastic
and deterministic components of speech is not modelled.
We start by improving just the waveform generation stage of SPSS, using standard
acoustic features. We propose a new method of waveform generation tailored for SPSS,
based on neither source-filter separation nor sinusoidal modelling. The proposed waveform
generator avoids unnecessary assumptions and decompositions as far as possible,
and uses only the fundamental frequency and spectral envelope as acoustic features. A
very small speech database is used as a source of base speech signals which are subsequently
"reshaped" to match the specifications output by the acoustic model in the
SPSS framework. All of this is done without any decomposition, such as source+filter
or harmonics+noise. A comprehensive description of the waveform generation process
is presented, along with implementation issues. Two SPSS voices, a female and a male,
were built to test the proposed method by using a standard TTS toolkit, Merlin. In
a subjective evaluation, listeners preferred the proposed waveform generator over a
state-of-the-art vocoder, STRAIGHT.
Even though the proposed "waveform reshaping" generator generates higher speech
quality than STRAIGHT, the improvement is not large enough. Consequently, we propose
a new acoustic representation, whose implementation involves feature extraction
and waveform generation, i.e., a complete vocoder. The new representation encodes
the complex spectrum derived from the Fourier Transform in a way explicitly designed
for SPSS, rather than for speech coding or copy-synthesis. The feature set comprises
four feature streams describing magnitude spectrum, phase spectrum, and fundamental
frequency; all of these are represented by real numbers. It avoids heuristics or unstable
methods for phase unwrapping. The new feature extraction does not attempt to
decompose the speech structure and thus the "phasiness" and "buzziness" found in a
typical vocoder, such as STRAIGHT, are dramatically reduced. Our method works at
a lower frame rate than a typical vocoder. To demonstrate the proposed method, two
DNN-based voices, a male and a female, were built using the Merlin toolkit. Subjective
comparisons were performed with a state-of-the-art baseline. The proposed vocoder
substantially outperformed the baseline for both voices and under all configurations
tested. Furthermore, several enhancements were made over the original design, which
are beneficial for either sound quality or compatibility with other tools. In addition to
its use in SPSS, the proposed vocoder is also demonstrated for join smoothing
in unit selection-based systems, and can be used for voice conversion or automatic
speech recognition.
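The idea of describing phase with real numbers, without phase unwrapping, can be sketched as follows. This is one plausible reading and an assumption for illustration, not the author's stated formulation: each complex spectral bin is stored as its magnitude plus its real and imaginary parts normalised by that magnitude (the cosine and sine of the phase). Together with the fundamental frequency, which is not computed here, that would give four real-valued streams. The helper names `dft`, `encode` and `decode` are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame (fine for a small demo)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def encode(spectrum, floor=1e-12):
    """Split each complex bin into magnitude and normalised real/imaginary parts.

    The floor (an assumed safeguard) prevents division by zero in empty bins.
    """
    mags, reals, imags = [], [], []
    for c in spectrum:
        m = max(abs(c), floor)
        mags.append(m)
        reals.append(c.real / m)  # cosine of the phase for non-empty bins
        imags.append(c.imag / m)  # sine of the phase for non-empty bins
    return mags, reals, imags


def decode(mags, reals, imags):
    """Rebuild the complex spectrum from the three real-valued streams."""
    return [m * complex(r, i) for m, r, i in zip(mags, reals, imags)]


# Toy frame: two sinusoids, 32 samples long (illustrative values).
frame = [
    math.sin(2 * math.pi * 3 * t / 32) + 0.5 * math.cos(2 * math.pi * 5 * t / 32)
    for t in range(32)
]
spectrum = dft(frame)
mags, reals, imags = encode(spectrum)
rebuilt = decode(mags, reals, imags)
max_error = max(abs(a - b) for a, b in zip(spectrum, rebuilt))
```

Because the normalised real and imaginary parts are bounded and continuous in the underlying phase, no angle ever has to be unwrapped, which is the property the abstract highlights.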
Normal-to-Lombard Adaptation of Speech Synthesis Using Long Short-Term Memory Recurrent Neural Networks
In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context. The evaluated vocoders are: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In objective evaluations and speaking style similarity tests, we found that the FT method outperformed the other two adaptation methods. In speech intelligibility tests, we found that there were no significant differences between vocoders although the PML vocoder showed slightly better performance compared to the three other vocoders. Peer reviewed
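Of the three adaptation methods named above, LHUC is the least self-explanatory, so a minimal sketch may help. It assumes the commonly cited formulation in which each hidden unit's activation is rescaled by an amplitude 2·sigmoid(a), with one learnable parameter a per unit. The abstract does not state which parametrisation the article uses, and the function name `lhuc_scale` and all values are hypothetical.

```python
import math


def lhuc_scale(hidden, lhuc_params):
    """Rescale hidden activations unit by unit.

    Each amplitude is 2 * sigmoid(a), so it lies in (0, 2); a = 0 gives an
    amplitude of exactly 1 and leaves the unit unchanged.
    """
    return [
        2.0 / (1.0 + math.exp(-a)) * h
        for h, a in zip(hidden, lhuc_params)
    ]


hidden = [0.5, -1.2, 0.0, 2.0]  # illustrative activations of one layer

# Before adaptation: all parameters are zero, so the layer output is untouched.
unadapted = lhuc_scale(hidden, [0.0, 0.0, 0.0, 0.0])

# After adaptation: only the per-unit parameters have changed, not the weights.
adapted = lhuc_scale(hidden, [1.0, -1.0, 0.0, 3.0])
```

In this formulation only the per-unit parameters are learned from the adaptation data while the network weights stay fixed, which contrasts with fine-tuning, where the weights themselves are updated.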