A novel framework for high-quality voice source analysis and synthesis
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The analysis, parameterization and modeling of voice source estimates obtained via inverse filtering of recorded speech are among the most challenging areas of speech processing, because humans produce a wide range of voice source realizations and because the voice source estimates commonly contain artifacts caused by the non-linear, time-varying source-filter coupling. Currently, the most widely adopted representation of the voice source signal is the Liljencrants-Fant (LF) model, developed in 1985. Owing to its overly simplistic treatment of voice source dynamics, the LF model can neither represent the fine temporal structure of glottal flow derivative realizations nor carry sufficient spectral richness to support truly natural-sounding speech synthesis. In this thesis we have introduced Characteristic Glottal Pulse Waveform Parameterization and Modeling (CGPWPM), an entirely novel framework for voice source analysis, parameterization and reconstruction. In a comparative evaluation of CGPWPM and the LF model, we have demonstrated that the proposed method preserves more speaker-dependent information from the voice source estimates and realizes more natural-sounding speech synthesis. In general, we have shown that CGPWPM-based speech synthesis rates highly on the scale of absolute perceptual acceptability and that speech signals are faithfully reconstructed on a consistent basis, across speakers and genders. We have applied CGPWPM to voice quality profiling and to a text-independent voice quality conversion method. The proposed voice conversion method achieves the desired perceptual effects, and the modified speech remains as natural-sounding and intelligible as natural speech. In this thesis, we have also developed an optimal wavelet thresholding strategy for voice source signals which suppresses aspiration noise while retaining both the slow and the rapid variations in the voice source estimate.
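Since the LF model is the baseline that CGPWPM is compared against, the following is a minimal sketch of generating one LF glottal flow derivative pulse. The parameter names (Ee, tp, te, ta, tc) follow the usual LF conventions; the fixed-point solution of the return-phase constant and the simplified choice of the open-phase growth constant (the area-balance constraint is skipped) are assumptions made for illustration, not part of the thesis.

```python
import numpy as np

def lf_pulse(fs=16000, f0=120.0, Ee=1.0, tp=0.45, te=0.55, ta=0.03, tc=1.0,
             alpha=None):
    """Sketch of one LF glottal flow derivative pulse.

    tp, te, ta, tc are fractions of the fundamental period T0.
    NOTE: a full LF fit solves alpha from the area-balance (zero net flow)
    constraint; here alpha is a plausible constant unless supplied, so the
    pulse shape is illustrative only.
    """
    T0 = 1.0 / f0
    t = np.arange(0.0, T0, 1.0 / fs)
    Tp, Te, Ta, Tc = tp * T0, te * T0, ta * T0, tc * T0
    wg = np.pi / Tp                      # open-phase angular frequency

    # Return-phase constant epsilon from: eps*Ta = 1 - exp(-eps*(Tc - Te))
    eps = 1.0 / Ta
    for _ in range(50):                  # simple fixed-point iteration
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta

    if alpha is None:
        alpha = 3.0 / Te                 # crude guess (assumption)
    E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))   # forces E(Te) = -Ee

    e = np.zeros_like(t)
    open_phase = t <= Te
    e[open_phase] = E0 * np.exp(alpha * t[open_phase]) * np.sin(wg * t[open_phase])
    ret = (t > Te) & (t <= Tc)
    e[ret] = -(Ee / (eps * Ta)) * (np.exp(-eps * (t[ret] - Te))
                                   - np.exp(-eps * (Tc - Te)))
    return t, e

if __name__ == "__main__":
    t, e = lf_pulse()
    print(len(t), float(e.min()))        # pulse length and negative peak (~ -Ee)
```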
A Log Domain Pulse Model for Parametric Speech Synthesis
Most of the degradation in current Statistical Parametric Speech Synthesis (SPSS) results from the form of the vocoder, and one of the main causes of degradation is the reconstruction of the noise. In this article, a new signal model is proposed that leads to a simple synthesizer without the need for ad-hoc tuning of model parameters. The model is not based on the traditional additive linear source-filter model; instead, it adopts a combination of speech components that are additive in the log domain. The same representation is used for voiced and unvoiced segments, rather than relying on binary voicing decisions, which avoids the voicing-error discontinuities that occur in many current vocoders. A simple binary mask is used to denote the presence of noise in the time-frequency domain, which is less sensitive to classification errors. Four experiments have been carried out to evaluate this new model. The first examines the noise reconstruction issue. Three listening tests demonstrate the advantages of this model: a comparison with the STRAIGHT vocoder; the direct prediction of the binary noise mask using a mixed output configuration; and partial improvements of creakiness using a mask correction mechanism.
Funding: European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie; 10.13039/501100000266-EPSR
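To make the two stated ideas concrete (components that are additive in the log domain, gated by a binary time-frequency noise mask), here is a toy sketch. It is not the published model: the spectral shapes, grid sizes and the mask built from a made-up maximum voiced frequency are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 200, 257            # hypothetical time-frequency grid

# Hypothetical log-amplitude components (invented for illustration):
log_env = -2.0 * np.linspace(0, 6, n_bins)[None, :] * np.ones((n_frames, 1))
log_noise = rng.normal(-3.0, 0.5, size=(n_frames, n_bins))

# Binary mask: 1 where a frame/bin is treated as noise-dominated,
# e.g. above a (made-up) maximum voiced frequency that rises over time.
mvf_bin = np.linspace(60, 180, n_frames).astype(int)
mask = np.zeros((n_frames, n_bins))
for i, b in enumerate(mvf_bin):
    mask[i, b:] = 1.0

# Components are combined additively in the log-amplitude domain,
# with the noise term gated by the binary mask.
log_amp = log_env + mask * log_noise
amplitude = np.exp(log_amp)            # back to linear amplitude

print(amplitude.shape, float(amplitude.mean()))
```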
Voice source characterization for prosodic and spectral manipulation
The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: the voice source and the vocal tract. Our main effort is on glottal pulse analysis and characterization, and we explore the utility of this model in different areas of speech processing, including speech synthesis, voice conversion and emotion detection. To that end, we study different techniques for prosodic and spectral manipulation, with the requirement that the methods be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase.
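As a rough illustration of what inverse filtering does (not the dissertation's particular decomposition method), the sketch below estimates a vocal tract filter with linear prediction and inverse-filters the frame to obtain a source residual. The frame length, LPC order and window choice are assumptions, and the LP residual only roughly approximates the glottal flow derivative (no closed-phase analysis or lip-radiation handling).

```python
import numpy as np
from scipy.signal import lfilter

def lpc_autocorr(frame, order=18):
    """Autocorrelation-method LPC via Levinson-Durbin (illustrative)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0] + 1e-12                     # small floor to avoid divide-by-zero
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / e
        a[1:i] += k * a[1:i][::-1]
        a[i] = k
        e *= (1.0 - k * k)
    return a

def inverse_filter(frame, order=18):
    """Remove the estimated vocal tract filter, leaving a source residual."""
    windowed = frame * np.hanning(len(frame))
    a = lpc_autocorr(windowed, order)
    return lfilter(a, [1.0], frame)      # apply A(z) to obtain the residual

# Toy usage on a synthetic 32 ms "speech" frame (assumption).
fs = 16000
t = np.arange(0, 0.032, 1.0 / fs)
frame = np.sin(2 * np.pi * 150 * t)
residual = inverse_filter(frame)
print(residual.shape)
```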
In order to validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method performs satisfactorily across a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to the reference method, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking third, but still above the acceptance threshold (Fair-Good).
Next, we proposed two methods for prosody modification, one for each of the residual representations explained above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. The second method used resampling of the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods were rated similarly (Fair-Good) and that more work is needed to achieve quality levels similar to the reference methods.
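The residual-resampling idea in the second method can be illustrated as follows. The pitch-synchronous framing, the pitch factor and the helper name are hypothetical; this conveys only the general principle (resample each residual period to the new period length), not the dissertation's exact procedure.

```python
import numpy as np
from scipy.signal import resample

def modify_pitch_residual(residual_periods, pitch_factor):
    """Toy pitch modification by resampling each residual period.

    residual_periods: list of 1-D arrays, one per glottal cycle (assumption:
    the residual has already been cut pitch-synchronously).
    pitch_factor > 1 raises F0 (shorter periods), < 1 lowers it.
    """
    out = []
    for period in residual_periods:
        new_len = max(2, int(round(len(period) / pitch_factor)))
        out.append(resample(period, new_len))   # stretch/compress one cycle
    return np.concatenate(out)

# Toy usage: three identical "periods" of a decaying impulse, pitch raised 20%.
period = np.r_[1.0, np.zeros(99)] * np.exp(-np.arange(100) / 20.0)
new_residual = modify_pitch_residual([period] * 3, pitch_factor=1.2)
print(len(new_residual))   # 3 * 83 samples instead of 3 * 100
```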
As part of this dissertation, we have studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We have included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. The results showed that the evaluators preferred our method over the original one, giving it a higher score on the MOS scale. To study voice quality, we recorded a small database consisting of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky and falsetto). Comparing the results with those reported in the literature, we found them to generally agree with previous findings; some differences existed, but they could be attributed to the difficulties in comparing voice qualities produced by different speakers. At the same time, we conducted experiments in voice quality identification, with very good results. We have also evaluated the performance of an automatic emotion classifier based on GMMs using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. Detection accuracy for the individual emotions was also high, improving on previously reported results using the same database. Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact on automatic emotion classification.
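A per-emotion GMM classifier of the kind described (one model per emotion, decision by maximum log-likelihood) can be sketched as below. The label set, feature dimensionality, number of mixture components and use of scikit-learn are assumptions for illustration, not the dissertation's actual configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
emotions = ["neutral", "happy", "angry", "sad"]   # hypothetical label set
n_feat = 12                                       # e.g. glottal measures per frame (assumption)

# Toy training data: frames of glottal features per emotion (invented).
train = {e: rng.normal(loc=i, scale=1.0, size=(500, n_feat))
         for i, e in enumerate(emotions)}

# One GMM per emotion, trained only on that emotion's frames.
models = {e: GaussianMixture(n_components=8, covariance_type="diag",
                             random_state=0).fit(X)
          for e, X in train.items()}

def classify(utterance_frames):
    """Pick the emotion whose GMM gives the highest total log-likelihood."""
    scores = {e: m.score_samples(utterance_frames).sum()
              for e, m in models.items()}
    return max(scores, key=scores.get)

test_utt = rng.normal(loc=2.0, scale=1.0, size=(80, n_feat))  # resembles "angry"
print(classify(test_utt))
```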
HMM-based synthesis of creaky voice
Creaky voice, also referred to as vocal fry, is a voice quality frequently produced in many languages, in both read and conversational speech. To enhance naturalness, speech synthesis systems should be able to generate speech in all its expressive diversity, including creaky voice. The present study integrates our recent developments, including creaky voice detection, prediction of creaky voice from context, and rendering of the creaky excitation, into a fully functioning and automatic HMM-based synthesis system. HMM-based synthetic creaky voices are built and evaluated in subjective listening tests, which show that the best synthetic creaky voices are rated more natural and more creaky than a conventional voice. A non-creaky voice is also successfully transformed to use creak by modifying the F0 contour and excitation of the predicted creaky parts. The transformed voice is rated equal in naturalness and clearly more creaky compared to the original voice. Index Terms: speech synthesis, creaky voice, contextual factors, F0 estimation, excitation modeling
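The voice transformation step (modifying the F0 contour in frames predicted as creaky) could look roughly like the sketch below. The octave drop, the jitter amount and the frame-level creak predictions are illustrative assumptions, not the paper's actual rendering of the creaky excitation.

```python
import numpy as np

def creak_f0_transform(f0, creak_mask, octave_drop=1.0, jitter=0.05, seed=0):
    """Toy creaky-voice F0 transformation.

    f0:         frame-level F0 contour in Hz (0 = unvoiced)
    creak_mask: boolean array, True where creak is predicted from context
    Creaky frames get their F0 lowered by `octave_drop` octaves and perturbed
    by multiplicative jitter to mimic the irregular pulses of creak.
    """
    rng = np.random.default_rng(seed)
    f0_out = f0.astype(float).copy()
    idx = creak_mask & (f0_out > 0)
    f0_out[idx] /= 2.0 ** octave_drop                             # drop the pitch
    f0_out[idx] *= 1.0 + jitter * rng.standard_normal(idx.sum())  # add jitter
    return f0_out

# Toy usage: a flat 180 Hz contour with creak predicted at the utterance end.
f0 = np.full(100, 180.0)
creak_mask = np.arange(100) >= 70
print(creak_f0_transform(f0, creak_mask)[65:75].round(1))
```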
Ultrasound-Based Silent Speech Interface Built on a Continuous Vocoder
Recently it was shown that, within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which does not apply a strict voiced/unvoiced decision. Continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a convolutional neural network, with UTI as input. The results demonstrate that during the articulatory-to-acoustic mapping experiments, the continuous F0 is predicted with lower error, and the continuous vocoder produces slightly more natural synthesized speech than the baseline vocoder using standard discontinuous F0.
Comment: 5 pages, 3 figures, accepted for publication at Interspeech 201
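The articulatory-to-acoustic mapping described (a CNN taking an ultrasound tongue image and regressing continuous vocoder parameters) might be organized as in the following sketch. The image size, layer sizes and the output split (1 ContF0 value, 1 Maximum Voiced Frequency value, 24 MGC coefficients) are invented for illustration and are not the paper's architecture.

```python
import torch
import torch.nn as nn

class UTIToVocoder(nn.Module):
    """Toy CNN: one ultrasound tongue image -> continuous vocoder parameters."""
    def __init__(self, n_mgc=24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Output: 1 ContF0 + 1 Maximum Voiced Frequency + n_mgc MGC coefficients
        self.head = nn.Linear(32 * 4 * 4, 1 + 1 + n_mgc)

    def forward(self, x):                  # x: (batch, 1, H, W) ultrasound frame
        h = self.features(x).flatten(1)
        out = self.head(h)
        contf0, mvf, mgc = out[:, :1], out[:, 1:2], out[:, 2:]
        return contf0, mvf, mgc

# Toy usage with a made-up 64x128 ultrasound frame size.
model = UTIToVocoder()
dummy = torch.randn(8, 1, 64, 128)
contf0, mvf, mgc = model(dummy)
print(contf0.shape, mvf.shape, mgc.shape)  # (8,1) (8,1) (8,24)
```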