Detection of Irregular Phonation in Speech
This work addresses the detection and characterization of irregular phonation in spontaneous speech. While published work treats this as a two-hypothesis detection problem restricted to phonated regions of speech, this work focuses on distinguishing aperiodicity due to frication from aperiodicity due to irregular voicing. It also corrects a current pitch-tracking algorithm in regions of irregular phonation, where most pitch trackers perform poorly. Building on the detection of regions of irregular phonation, an acoustic parameter is developed to characterize these regions for speaker identification applications. The detection rate of the algorithm is 91.8% on a clean speech corpus (TIMIT), with a false-detection rate of 17.42%, and 89.2% on a telephone speech corpus (NIST 98), with a false-detection rate of 12.8%. Pitch detection accuracy increased from 95.4% to 98.3% on TIMIT and from 94.8% to 97.4% on NIST 98. When the creakiness parameter was added to a set of seven acoustic parameters for speaker identification on the NIST 98 database, performance improved by 1.5% for female speakers and 0.4% for male speakers over a population of 250 speakers.
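As an illustration of the kind of frame-level decision this involves, the sketch below combines an autocorrelation periodicity measure with a low-band energy ratio to separate aperiodicity due to irregular voicing from aperiodicity due to frication. This is a minimal sketch under assumed features and thresholds, not the algorithm evaluated above; all function names and parameter values are illustrative.

```python
import numpy as np

def frame_periodicity(frame, fs, f0_min=60.0, f0_max=400.0):
    """Peak of the normalized autocorrelation within a plausible pitch range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac /= ac[0]
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(ac) - 1)
    return float(np.max(ac[lag_min:lag_max]))

def low_band_ratio(frame, fs, split_hz=2000.0):
    """Share of energy below split_hz; frication concentrates energy higher up."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    return spec[freqs < split_hz].sum() / (spec.sum() + 1e-12)

def irregular_phonation_flags(x, fs, frame_ms=30, hop_ms=10,
                              periodicity_thr=0.5, low_band_thr=0.6):
    """Flag frames that are aperiodic yet dominated by low-frequency energy;
    aperiodic frames with mostly high-band energy are treated as frication.
    Thresholds are illustrative assumptions, not tuned values."""
    n, h = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    flags = []
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n]
        flags.append(frame_periodicity(frame, fs) < periodicity_thr
                     and low_band_ratio(frame, fs) > low_band_thr)
    return np.array(flags)
```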
In search of the optimal acoustic features for statistical parametric speech synthesis
In the Statistical Parametric Speech Synthesis (SPSS) paradigm, speech is generally
represented as acoustic features and the waveform is generated by a vocoder. A comprehensive
summary of state-of-the-art vocoding techniques is presented, highlighting
their characteristics, advantages, and drawbacks, primarily when used in SPSS. We
conclude that state-of-the-art vocoding methods are suboptimal and are a cause of significant loss of quality, even though numerous vocoders have been proposed in the last
decade. In fact, it seems that the most complicated methods perform worse than simpler
ones based on more robust analysis/synthesis algorithms. Typical methods, based on
the source-filter or sinusoidal models, rely on excessive simplifying assumptions. They
perform what we call an "extreme decomposition" of speech (e.g., source+filter or sinusoids+
noise), which we believe to be a major drawback. Problems include: difficulties
in the estimation of components; modelling of complex non-linear mechanisms; a lack
of ground truth. In addition, the statistical dependence that exists between stochastic
and deterministic components of speech is not modelled.
We start by improving just the waveform generation stage of SPSS, using standard
acoustic features. We propose a new method of waveform generation tailored for SPSS,
based on neither source-filter separation nor sinusoidal modelling. The proposed waveform
generator avoids unnecessary assumptions and decompositions as far as possible,
and uses only the fundamental frequency and spectral envelope as acoustic features. A
very small speech database is used as a source of base speech signals which are subsequently
\reshaped" to match the specifications output by the acoustic model in the
SPSS framework. All of this is done without any decomposition, such as source+filter
or harmonics+noise. A comprehensive description of the waveform generation process
is presented, along with implementation issues. Two SPSS voices, a female and a male,
were built to test the proposed method by using a standard TTS toolkit, Merlin. In
a subjective evaluation, listeners preferred the proposed waveform generator over a
state-of-the-art vocoder, STRAIGHT.
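The core "reshaping" idea, imposing a target spectral envelope on a base speech frame while keeping its fine structure and phase, can be sketched as follows. This is a simplified illustration, not the thesis's pipeline: base-segment selection and F0 matching are omitted, and the moving-average envelope estimate is an assumption.

```python
import numpy as np

def smooth_envelope(mag, width=9):
    """Crude spectral envelope: moving-average smoothing of the magnitude."""
    kernel = np.ones(width) / width
    return np.convolve(mag, kernel, mode="same")

def reshape_frame(base_frame, target_env, eps=1e-10):
    """Impose a target magnitude envelope (sampled at rfft bins) on a base
    speech frame while keeping its harmonic fine structure and phase."""
    spec = np.fft.rfft(base_frame)
    base_env = smooth_envelope(np.abs(spec))
    gain = target_env / (base_env + eps)   # per-bin envelope correction
    return np.fft.irfft(spec * gain, n=len(base_frame))
```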
Even though the proposed "waveform reshaping" generator produces higher speech
quality than STRAIGHT, the improvement is not large enough. Consequently, we propose
a new acoustic representation, whose implementation involves feature extraction
and waveform generation, i.e., a complete vocoder. The new representation encodes
the complex spectrum derived from the Fourier Transform in a way explicitly designed
for SPSS, rather than for speech coding or copy-synthesis. The feature set comprises
four feature streams describing magnitude spectrum, phase spectrum, and fundamental
frequency; all of these are represented by real numbers. It avoids heuristics or unstable
methods for phase unwrapping. The new feature extraction does not attempt to
decompose the speech structure, and thus the "phasiness" and "buzziness" found in a
typical vocoder, such as STRAIGHT, are dramatically reduced. Our method works at
a lower frame rate than a typical vocoder. To demonstrate the proposed method, two
DNN-based voices, a male and a female, were built using the Merlin toolkit. Subjective
comparisons were performed with a state-of-the-art baseline. The proposed vocoder
substantially outperformed the baseline for both voices and under all configurations
tested. Furthermore, several enhancements were made over the original design, which
are beneficial for either sound quality or compatibility with other tools. In addition to
its use in SPSS, the proposed vocoder is also demonstrated for join smoothing in
unit selection-based systems, and it can be used for voice conversion or automatic
speech recognition.
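The abstract does not spell out the phase encoding, but one standard way to represent phase with real numbers while avoiding unwrapping is to store the real and imaginary parts of the unit-magnitude phase factor alongside the log magnitude. A hedged sketch of that idea (not necessarily the feature set described above):

```python
import numpy as np

def encode_frame(frame):
    """Encode a frame's complex spectrum as real-valued streams:
    log magnitude plus cos/sin of the phase (no unwrapping needed)."""
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    phase = spec / np.maximum(mag, 1e-10)    # unit-magnitude phase factor
    return np.log(mag + 1e-10), phase.real, phase.imag

def decode_frame(log_mag, phase_re, phase_im, n_fft):
    """Invert the encoding back to a time-domain frame."""
    phase = phase_re + 1j * phase_im
    phase /= np.maximum(np.abs(phase), 1e-10)  # renormalize after modelling error
    spec = np.exp(log_mag) * phase
    return np.fft.irfft(spec, n=n_fft)
```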
Classification and Separation Techniques based on Fundamental Frequency for Speech Enhancement
This thesis is focused on the development of new classification and speech enhancement algorithms based, explicitly or implicitly, on the fundamental frequency (F0). The F0 of speech has a number of properties that enable speech discrimination from the remaining signals in the acoustic scene, either by defining F0-based signal features (for classification) or F0-based signal models (for separation). Three main contributions are included in this work: 1) an acoustic environment classification algorithm for hearing aids based on F0 to classify the input signal into speech and nonspeech classes; 2) a frame-by-frame voiced speech detection algorithm based on the aperiodicity measure, able to work under non-stationary noise and applicable to speech enhancement; 3) a speech denoising algorithm based on a regularized NMF decomposition, in which the background noise is described in a generic way with mathematical constraints. Thesis, Universidad de Jaén, Departamento de Ingeniería de Telecomunicación. Defended on 11 January 201
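A minimal sketch of the NMF-based denoising idea follows: speech bases learned from clean speech are held fixed, noise bases are estimated from the mixture, and a Wiener-style mask retains the speech part. The multiplicative updates and Euclidean cost are generic assumptions; the thesis's regularized formulation and its mathematical noise constraints are not reproduced here.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Basic NMF with multiplicative updates (Euclidean cost): V ~= W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def denoise(V_noisy, W_speech, n_noise_atoms=8, n_iter=200, eps=1e-10):
    """Semi-supervised denoising of a magnitude spectrogram V_noisy.

    W_speech holds speech bases learned offline, e.g. W_speech, _ = nmf(V_clean, 32);
    they stay fixed while generic noise bases are estimated from the mixture."""
    rng = np.random.default_rng(0)
    k = W_speech.shape[1]
    W = np.hstack([W_speech, rng.random((V_noisy.shape[0], n_noise_atoms)) + eps])
    H = rng.random((W.shape[1], V_noisy.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V_noisy) / (W.T @ W @ H + eps)
        W[:, k:] *= (V_noisy @ H[k:].T) / (W @ H @ H[k:].T + eps)  # noise bases only
    speech_part = W[:, :k] @ H[:k]
    mask = speech_part / (W @ H + eps)   # Wiener-style soft mask
    return mask * V_noisy
```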
Analysis/Synthesis Comparison of Vocoders Utilized in Statistical Parametric Speech Synthesis
This thesis presents a literature study followed by an experimental part on the state-of-the-art vocoders utilized in statistical parametric speech synthesis. In the experimental part, the analysis/synthesis properties of three selected vocoders (GlottHMM, STRAIGHT and Harmonic/Stochastic Model) are examined. The performed tests were the analysis of vocoder parameter distributions, statistical testing of the effect of emotion on the vocoder parameter distributions, and a subjective listening test evaluating the vocoders' relative analysis/synthesis quality.
The results indicate that the STRAIGHT vocoder has the most Gaussian parameter distributions and the most robust synthesis quality, whereas the GlottHMM vocoder has the most emotion-sensitive parameters and the best, but least consistent, synthesis quality. The HSM vocoder's LSF parameters were found to be more Gaussian than the GlottHMM vocoder's LSF parameters. HSM was found to be sensitive to noise, and it received the lowest score in the subjective listening test.
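A sketch of how the Gaussianity of vocoder parameter streams might be checked, assuming frame-aligned parameter matrices; the specific statistical tests used in the thesis are not stated in this abstract, so D'Agostino-Pearson's normaltest below is an illustrative choice.

```python
import numpy as np
from scipy import stats

def gaussianity_report(params, names, alpha=0.05):
    """Test each vocoder parameter stream for normality.

    params: array of shape (n_frames, n_params), one column per parameter
    names:  list of parameter names
    Reports skewness and excess kurtosis as descriptive statistics
    alongside the normaltest p-value."""
    for i, name in enumerate(names):
        x = params[:, i]
        stat, p = stats.normaltest(x)
        verdict = "looks Gaussian" if p > alpha else "non-Gaussian"
        print(f"{name:>12s}: skew={stats.skew(x):+.2f} "
              f"kurtosis={stats.kurtosis(x):+.2f} p={p:.3g} ({verdict})")
```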
HMM-based speech synthesis using an acoustic glottal source model
Parametric speech synthesis has received increased attention in recent years following
the development of statistical HMM-based speech synthesis. However, the speech
produced using this method still does not sound as natural as human speech and there
is limited parametric flexibility to replicate voice quality aspects, such as breathiness.
The hypothesis of this thesis is that speech naturalness and voice quality can be
more accurately replicated by a HMM-based speech synthesiser using an acoustic glottal
source model, the Liljencrants-Fant (LF) model, to represent the source component
of speech instead of the traditional impulse train.
Two different analysis-synthesis methods were developed during this thesis, in order
to integrate the LF-model into a baseline HMM-based speech synthesiser, which is
based on the popular HTS system and uses the STRAIGHT vocoder. The first method,
which is called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model
signal through a glottal post-filter to obtain the source signal and then generating
speech, by passing this source signal through the spectral envelope filter. The system
which uses the GPF method (HTS-GPF system) is similar to the baseline system,
but it uses a different source signal instead of the impulse train used by STRAIGHT.
The second method, called Glottal Spectral Separation (GSS), generates speech by
passing the LF-model signal through the vocal tract filter. The major advantage of the
synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic
properties of the LF-model parameters are automatically learnt by the HMMs.
In this thesis, an initial perceptual experiment was conducted to compare the LF-model
to the impulse train. The results showed that the LF-model was significantly
better, both in terms of speech naturalness and replication of two basic voice qualities
(breathy and tense). In a second perceptual evaluation, the HTS-LF system was better
than the baseline system, although the difference between the two had been expected to
be more significant. A third experiment was conducted to evaluate the HTS-GPF system
and an improved HTS-LF system, in terms of speech naturalness, voice similarity
and intelligibility. The results showed that the HTS-GPF system performed similarly
to the baseline. However, the HTS-LF system was significantly outperformed by the
baseline. Finally, acoustic measurements were performed on the synthetic speech to
investigate the speech distortion in the HTS-LF system. The results indicated that a
problem in replicating the rapid variations of the vocal tract filter parameters at transitions
between voiced and unvoiced sounds is the most significant cause of speech
distortion. This problem motivates future work to further improve the system.
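For reference, a simplified implementation of the LF glottal flow derivative pulse is sketched below. It follows the standard two-segment LF equations, but as a simplification alpha is treated as a free parameter rather than solved from the usual zero-net-flow (area balance) constraint, and all default parameter values are illustrative.

```python
import numpy as np

def lf_pulse(fs, T0=0.01, tp=0.45, te=0.6, ta=0.028, tc=1.0, Ee=1.0, alpha=300.0):
    """One period of the LF glottal flow derivative.

    tp, te, ta, tc are fractions of the period T0 (flow peak, main excitation,
    return-phase time constant, closure). Ee is the negative-peak magnitude.
    Simplification: alpha is a free parameter here instead of being solved
    from the area-balance constraint."""
    Tp, Te, Ta, Tc = tp * T0, te * T0, ta * T0, tc * T0
    wg = np.pi / Tp                       # open-phase frequency; flow peak at Tp
    # Solve eps*Ta = 1 - exp(-eps*(Tc - Te)) by fixed-point iteration.
    eps = 1.0 / Ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta
    # Scale the open phase so the pulse reaches -Ee at t = Te.
    E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
    t = np.arange(int(T0 * fs)) / fs
    e = np.zeros_like(t)
    opening = t <= Te
    e[opening] = E0 * np.exp(alpha * t[opening]) * np.sin(wg * t[opening])
    closing = (t > Te) & (t <= Tc)
    e[closing] = -(Ee / (eps * Ta)) * (np.exp(-eps * (t[closing] - Te))
                                       - np.exp(-eps * (Tc - Te)))
    return e
```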
Effects of Noise on a Speaker-Adaptive Statistical Speech Synthesis System
In this project we study the effects of noise on a speaker-adaptive HMM-based synthesis system based on the GlottHMM vocoder. The average voice model is trained with clean data, but it is adapted to the target speaker using speech samples that have been corrupted by artificially adding background noise to simulate low-quality recordings. The speech synthesized from such adaptation data, played without background noise, should not be compromised in intelligibility or naturalness. A comparison is made to a system based on the STRAIGHT vocoder when the background noise is babble noise. Both objective and subjective evaluation methods were conducted. GlottHMM is found to be less robust against severe noise. When the noise is less intrusive, the objective measures used gave contradictory results and no preference for either vocoder was shown in the listening tests. At moderate noise levels, GlottHMM performs as well as the STRAIGHT vocoder.
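The noise-corruption step described above amounts to mixing background noise into clean adaptation data at a controlled SNR; a minimal sketch (the SNR levels and noise handling used in the project are not specified here):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix background noise into clean speech at a target SNR (in dB),
    e.g. to simulate low-quality adaptation recordings."""
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```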
Reconstruction of intelligible audio speech from visual speech information
The aim of the work conducted in this thesis is to reconstruct audio speech signals using information which can be extracted solely from a visual stream of a speaker's face, with applications in surveillance scenarios and silent speech interfaces. Visual speech is limited to what can be seen of the mouth, lips, teeth, and tongue, and these visual articulators convey considerably less information than the audio domain, making the task difficult. Accordingly, the emphasis is on the reconstruction of intelligible speech, with less regard given to quality.
A speech production model is used to reconstruct audio speech, where methods are presented in this work for generating or estimating the necessary parameters for the model. Three approaches are explored for producing spectral-envelope estimates from visual features as this parameter provides the greatest contribution to speech intelligibility. The first approach uses regression to perform the visual-to-audio mapping, and then two further approaches are explored using vector quantisation techniques and classification models, with long-range temporal information incorporated at the feature and model-level. Excitation information, namely fundamental frequency and aperiodicity, is generated using artificial methods and joint-feature clustering approaches.
Evaluations are first performed using mean squared error analyses and objective measures of speech intelligibility to refine the various system configurations, and then subjective listening tests are conducted to determine the word-level accuracy of reconstructed speech, giving real intelligibility scores. The best performing visual-to-audio domain mapping approach, using a clustering-and-classification framework with feature-level temporal encoding, is able to achieve audio-only intelligibility scores of 77%, and audiovisual intelligibility scores of 84%, on the GRID dataset. Furthermore, the methods are applied to a larger and more continuous dataset, with less favourable results, but with the belief that extensions to the work presented will yield a further increase in intelligibility.
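As a sketch of the first, regression-based visual-to-audio mapping with feature-level temporal context, one might stack neighbouring visual frames and fit a linear model to spectral-envelope targets. The ridge regressor, context width, and feature choices below are assumptions for illustration, not the configurations evaluated in the thesis.

```python
import numpy as np
from sklearn.linear_model import Ridge

def stack_context(frames, width=5):
    """Concatenate each frame with +/- width neighbours to give the
    regressor temporal context (edges are clamp-padded)."""
    n = len(frames)
    idx = np.clip(np.arange(n)[:, None] + np.arange(-width, width + 1), 0, n - 1)
    return frames[idx].reshape(n, -1)

def train_visual_to_audio(visual_feats, spectral_envs, width=5, alpha=1.0):
    """Ridge regression from context-stacked visual features (e.g. hypothetical
    mouth-region descriptors) to spectral-envelope parameters."""
    X = stack_context(visual_feats, width)
    return Ridge(alpha=alpha).fit(X, spectral_envs)

def predict_envelopes(model, visual_feats, width=5):
    return model.predict(stack_context(visual_feats, width))
```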
Statistical parametric speech synthesis based on sinusoidal models
This study focuses on improving the quality of statistical speech synthesis based on sinusoidal
models. Vocoders play a crucial role during the parametrisation and reconstruction process,
so we first conduct an experimental comparison of a broad range of the leading vocoder types.
Although our study shows that for analysis/synthesis, sinusoidal models with complex amplitudes
can generate higher-quality speech than source-filter ones, the component
sinusoids are correlated with each other, and the number of parameters is high and varies
from frame to frame, which constrains their application to statistical speech synthesis.
Therefore, we first propose a perceptually based dynamic sinusoidal model (PDM) to decrease
and fix the number of components typically used in the standard sinusoidal model.
Then, in order to apply the proposed vocoder with an HMM-based speech synthesis system
(HTS), two strategies for modelling sinusoidal parameters have been compared. In the first
method (DIR parameterisation), features extracted from the fixed- and low-dimensional PDM
are statistically modelled directly. In the second method (INT parameterisation), we convert
both static amplitude and dynamic slope from all the harmonics of a signal, which we term
the Harmonic Dynamic Model (HDM), to intermediate parameters (regularised cepstral
coefficients, RDC) for modelling. Our results show that HDM with intermediate parameters can
generate comparable quality to STRAIGHT.
As correlations between features in the dynamic model cannot be modelled satisfactorily
by a typical HMM-based system with diagonal covariance, we have applied and tested a deep
neural network (DNN) for modelling features from these two methods. To fully exploit DNN
capabilities, we investigate ways to combine INT and DIR at the level of both DNN modelling
and waveform generation. For DNN training, we propose to use multi-task learning to
model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. We
conclude from our results that sinusoidal models are indeed highly suited for statistical parametric
synthesis. The proposed method outperforms the state-of-the-art STRAIGHT-based
equivalent when used in conjunction with DNNs.
To further improve the voice quality, phase features generated from the proposed vocoder
also need to be parameterised and integrated into statistical modelling. Here, an alternative
statistical model referred to as the complex-valued neural network (CVNN), which treats complex coefficients as a whole, is proposed to model complex amplitude explicitly. A complex-valued
back-propagation algorithm using a logarithmic minimisation criterion which includes
both amplitude and phase errors is used as a learning rule. Three parameterisation methods
are studied for mapping text to acoustic features: RDC / real-valued log amplitude, complex-valued
amplitude with minimum phase and complex-valued amplitude with mixed phase. Our
results show the potential of using CVNNs for modelling both real and complex-valued acoustic
features. Overall, this thesis has established competitive alternative vocoders for speech
parametrisation and reconstruction. The utilisation of proposed vocoders on various acoustic
models (HMM / DNN / CVNN) clearly demonstrates that it is compelling to apply them for
parametric statistical speech synthesis.
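A hedged sketch of a logarithmic error criterion of the kind described for the CVNN, combining log-amplitude and wrapped-phase error; the exact criterion and learning rule used in the thesis may differ.

```python
import numpy as np

def log_complex_error(y_true, y_pred, eps=1e-10):
    """Logarithmic error between complex amplitudes: the squared magnitude of
    log(y_pred / y_true) splits into a log-amplitude term and a phase term,

        |log(y_pred/y_true)|^2 = (log|y_pred| - log|y_true|)^2 + dphi^2,

    where dphi is the phase difference wrapped to (-pi, pi]."""
    amp_err = np.log(np.abs(y_pred) + eps) - np.log(np.abs(y_true) + eps)
    dphi = np.angle(y_pred * np.conj(y_true))   # wrapped phase difference
    return np.mean(amp_err ** 2 + dphi ** 2)
```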