Statistical properties of linear prediction analysis underlying the challenge of formant bandwidth estimation
Formant bandwidth estimation is often observed to be more challenging than the estimation of formant center frequencies due to the presence of multiple glottal pulses within a period and short closed-phase durations. This study explores inherently different statistical properties between linear prediction (LP)–based estimates of formant frequencies and their corresponding bandwidths that may be explained in part by the statistical bounds on the variances of estimated LP coefficients. A theoretical analysis of the Cramér-Rao bounds on LP estimator variance indicates that the accuracy of bandwidth estimation is approximately half that of center frequency estimation. Monte Carlo simulations of all-pole vowels with stochastic and mixed-source excitation demonstrate that, as expected, the distributions of estimated LP coefficients exhibit different variances for each coefficient. Transforming the LP coefficients to formant parameters results in variances of bandwidth estimates that are typically larger than the variances of the respective center frequency estimates, depending on vowel type and fundamental frequency. These results provide additional evidence underlying the challenge of formant bandwidth estimation due to inherent statistical properties of LP-based speech analysis.
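The frequency/bandwidth asymmetry described above can be reproduced in a toy Monte Carlo experiment. The sketch below is illustrative and not the study's code: it assumes a single resonance (an AR(2) pole at 500 Hz with an 80 Hz bandwidth), purely stochastic excitation, and the autocorrelation LP method, then converts the estimated pole to a centre frequency (from the pole angle) and a 3 dB bandwidth (from the pole radius).

```python
import numpy as np

FS = 8000.0                       # sampling rate (Hz)
F0_TRUE, BW_TRUE = 500.0, 80.0    # true formant centre frequency / bandwidth

def synth_ar2(n, rng):
    # All-pole (AR(2)) resonator excited by white noise (stochastic source)
    r = np.exp(-np.pi * BW_TRUE / FS)
    theta = 2 * np.pi * F0_TRUE / FS
    a1, a2 = 2 * r * np.cos(theta), -r * r
    x = np.zeros(n)
    e = rng.standard_normal(n)
    for i in range(2, n):
        x[i] = e[i] + a1 * x[i - 1] + a2 * x[i - 2]
    return x

def lp_formant(x, order=2):
    # Autocorrelation-method LP fit, then pole -> (frequency, bandwidth)
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    poles = np.roots(np.concatenate(([1.0], -a)))
    p = poles[np.imag(poles) > 0][0]           # upper-half-plane pole
    freq = np.angle(p) * FS / (2 * np.pi)      # centre frequency (Hz)
    bw = -np.log(np.abs(p)) * FS / np.pi       # 3 dB bandwidth (Hz)
    return freq, bw

rng = np.random.default_rng(0)
est = np.array([lp_formant(synth_ar2(2000, rng)) for _ in range(200)])
print("freq mean/std:", est[:, 0].mean(), est[:, 0].std())
print("bw   mean/std:", est[:, 1].mean(), est[:, 1].std())
```

Under these assumptions the spread of the bandwidth estimates comes out noticeably larger than that of the centre frequency estimates, consistent with the abstract's claim.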
Object-based modelling for representing and processing speech corpora
This thesis deals with modelling data existing in large speech corpora using an object-oriented paradigm which captures important linguistic structures. Information from corpora is transformed into objects, which are assigned properties describing their behaviour. These objects, called speech units, are placed onto a multi-dimensional framework and have their relationships to other units explicitly defined through the use of links. Frameworks that model temporal utterances or atemporal information such as speaker characteristics and recording conditions can be searched efficiently for contextual matches. Speech units that match desired contexts are the result of successful linguistically motivated queries and can be used in further speech processing tasks in the same computational environment. This allows empirical studies of speech and its relation to linguistic structures to be carried out, and supports the training and testing of applications such as speech recognition and synthesis.
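The unit-and-link idea can be sketched in a few lines of Python. All names here are hypothetical and much simpler than the thesis's actual framework; the point is only that units carry properties, relationships are explicit links, and a contextual query walks those links.

```python
from dataclasses import dataclass, field

@dataclass
class SpeechUnit:
    # Illustrative structure, not the thesis's API
    kind: str        # e.g. "word" or "phone"
    label: str
    start: float     # seconds
    end: float
    links: dict = field(default_factory=dict)  # relation name -> list of units

def link(parent, child, relation="contains"):
    # Explicit bidirectional link between units
    parent.links.setdefault(relation, []).append(child)
    child.links.setdefault("within", []).append(parent)

# Tiny utterance: the word "see" containing two phones
word = SpeechUnit("word", "see", 0.00, 0.40)
phones = [SpeechUnit("phone", "s", 0.00, 0.15),
          SpeechUnit("phone", "iy", 0.15, 0.40)]
for ph in phones:
    link(word, ph)

def phones_in_word(units, word_label):
    # Contextual query: phones whose enclosing word matches word_label
    return [u.label for u in units
            if u.kind == "phone"
            and any(w.label == word_label for w in u.links.get("within", []))]

print(phones_in_word(phones, "see"))   # -> ['s', 'iy']
```

A relational model would express the same query as a join over unit and link tables; here the link traversal is direct, which is the intuition behind the search-time advantage reported below.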
Information residing in typical speech corpora is discussed first, followed by an overview of object-orientation which sets the tone for this thesis. Then the representation framework is introduced which is generated by a compiler and linker that rely on a set of domain-specific resources that transform corpus data into speech units. Operations on this framework are then presented along with a comparison between a relational and object-oriented model of identical speech data.
The models described in this work are directly applicable to existing large speech corpora, and the methods developed here are tested against relational database methods. The object-oriented methods outperform the relational methods for typical linguistically relevant queries by about three orders of magnitude as measured by database search times. This improvement in simplicity of representation and search speed is crucial for the utilisation of large multi-lingual corpora in basic research on the detailed properties of speech, especially in relation to contextual variation.
The application of continuous state HMMs to an automatic speech recognition task
Hidden Markov Models (HMMs) have been a popular choice for automatic speech recognition (ASR) for several decades due to their mathematical formulation and computational efficiency, which have consistently resulted in better performance than other methods during this period. However, HMMs are based on the assumption of statistical independence among speech frames, which conflicts with the physiological basis of speech production. Consequently, researchers have produced a substantial amount of literature extending the HMM model assumptions to incorporate dynamic properties of speech into the underlying model. One such approach involves segmental models, which address the frame-wise independence assumption. However, the computational inefficiencies associated with segmental models have limited their practical application. In recent years, there has been a shift from HMM-based systems to neural networks (NNs) and deep learning approaches, which offer superior performance compared to conventional statistical models. However, as the complexity of neural models increases, so does the number of parameters involved, requiring a greater dependency on training data to optimise model parameters.
This study extends prior research on segmental HMMs by introducing a Segmental Continuous-State Hidden Markov Model (CSHMM) that addresses the issue of inter-segmental continuity. This is an alternative to contemporary speech modelling methods that rely on data-centric NN techniques, with the goal of establishing a statistical model that more accurately reflects the speech production process. The continuous-state segmental model offers a flexible mathematical framework which can impose a continuity constraint between adjoining segments, addressing a fundamental drawback of conventional HMMs, namely the independence assumption. Additionally, the CSHMM benefits from practical training and decoding algorithms which overcome the computational inefficiency inherent in conventional decoding algorithms for traditional segmental HMMs.
This study has formulated four trajectory-based segmental models using a CSHMM framework. CSHMMs have not been extensively studied for ASR tasks due to the absence of open-source standardised speech tool-kits that enable convenient exploration of CSHMMs. As a result, to perform sufficient experiments in this study, training and decoding software has been developed, which can be accessed in (Seivwright, 2015).
The experiments in this study report baseline phone recognition results for the four distinct Segmental CSHMM systems using the TIMIT database. These baseline results are compared against a simple Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) system. In all experiments, a compact acoustic feature representation in the form of bottleneck features (BNFs) is employed, motivated by an investigation into BNFs and their relationship to articulatory properties. Although the proposed CSHMM systems do not surpass discrete-state HMMs in performance, this research has demonstrated a strong association between inter-segmental continuity and the corresponding phonetic categories being modelled. Furthermore, this thesis presents a method for achieving finer control over continuity between segments, which can be expanded to investigate co-articulation in the context of CSHMMs.
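The inter-segmental continuity constraint can be illustrated with a toy fit (my sketch, not the thesis's model): trajectories over adjoining segments are piecewise linear, but the value at each segment boundary is a single shared parameter, so neighbouring segments are forced to meet exactly. Fitting all boundary values jointly by least squares contrasts with per-segment fits, which would be free to disagree at the boundary.

```python
import numpy as np

def fit_continuous_trajectory(frames, knots):
    # Jointly fit piecewise-linear segment trajectories whose boundary
    # (knot) values are shared, so adjoining segments meet exactly at
    # the segment boundary: a continuity constraint across segments.
    t = np.arange(len(frames), dtype=float)
    knots = np.asarray(knots, dtype=float)
    B = np.zeros((len(t), len(knots)))          # tent-function basis
    for k, c in enumerate(knots):
        left = knots[k - 1] if k > 0 else c - 1.0
        right = knots[k + 1] if k < len(knots) - 1 else c + 1.0
        B[:, k] = np.clip(np.minimum((t - left) / (c - left),
                                     (right - t) / (right - c)), 0.0, None)
    vals, *_ = np.linalg.lstsq(B, frames, rcond=None)
    return vals                                  # trajectory value at each knot

# Frames drawn from a continuous piecewise-linear "formant track" + noise
rng = np.random.default_rng(1)
knots = [0, 10, 25, 40]                          # segment boundaries (frames)
true_vals = np.array([500.0, 900.0, 650.0, 700.0])
t = np.arange(41.0)
truth = np.interp(t, knots, true_vals)
est = fit_continuous_trajectory(truth + rng.normal(0, 5.0, t.size), knots)
print(np.round(est, 1))
```

With noisy observations of a genuinely continuous track, the jointly fitted knot values land close to the true boundary values, which is the behaviour the continuity constraint is designed to encourage.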
An acoustic-phonetic approach in automatic Arabic speech recognition
In a large vocabulary speech recognition system the broad phonetic classification
technique is used instead of detailed phonetic analysis to overcome the variability in the
acoustic realisation of utterances. The broad phonetic description of a word is used as a
means of lexical access, where the lexicon is structured into sets of words sharing the
same broad phonetic labelling.
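The lexical-access scheme can be sketched as follows. The class inventory and toy lexicon below are made up for illustration (the thesis defines eight broad classes for Arabic and keeps vowels at the phonemic level); the mechanism is the same: map each word's phonemes to a broad labelling and bucket the lexicon by that labelling, so a broad segmentation retrieves a small cohort of candidates.

```python
from collections import defaultdict

# Toy broad-class inventory (illustrative, not the thesis's class set)
BROAD = {p: "STOP" for p in "p t k b d g".split()}
BROAD.update({p: "FRIC" for p in "f s sh th h".split()})
BROAD.update({p: "NAS" for p in "m n".split()})
BROAD.update({p: "LIQ" for p in "l r w y".split()})
VOWELS = {"a", "i", "u", "aa", "ii", "uu"}

def broad_label(phonemes):
    # Consonants collapse to broad classes; vowels keep their identity
    return tuple(p if p in VOWELS else BROAD[p] for p in phonemes)

def build_lexicon(words):
    # Lexicon structured into cohorts sharing the same broad labelling
    lex = defaultdict(list)
    for word, phones in words:
        lex[broad_label(phones)].append(word)
    return lex

lex = build_lexicon([
    ("kataba", ["k", "a", "t", "a", "b", "a"]),
    ("sakana", ["s", "a", "k", "a", "n", "a"]),
    ("darasa", ["d", "a", "r", "a", "s", "a"]),
    ("taraka", ["t", "a", "r", "a", "k", "a"]),
])
# Lexical access: a broad segmentation retrieves its cohort directly
print(lex[("STOP", "a", "STOP", "a", "STOP", "a")])   # -> ['kataba']
```

When a cohort contains more than one word, a verification step (as the abstract describes later) would pick among the remaining candidates.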
This approach has been applied to a large vocabulary isolated word Arabic speech
recognition system. Statistical studies have been carried out on 10,000 Arabic words
(converted to phonemic form) involving different combinations of broad phonetic
classes. Some particular features of the Arabic language have been exploited. The results
show that vowels represent about 43% of the total number of phonemes. They also show
that about 38% of the words can be uniquely represented at this level using eight
broad phonetic classes. When detailed vowel identification is introduced, the percentage of
uniquely specified words rises to 83%. These results suggest that a fully detailed
phonetic analysis of the speech signal is perhaps unnecessary.
In the adopted word recognition model, the consonants are classified into four broad
phonetic classes, while the vowels are described by their phonemic form. A set of 100
words uttered by several speakers has been used to test the performance of the
implemented approach.
In the implemented recognition model, three procedures have been developed, namely
voiced-unvoiced-silence segmentation, vowel detection and identification, and automatic
spectral transition detection between phonemes within a word. The accuracy of both the
V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic
segmentation procedure has been implemented, which exploits information from the
above mentioned three procedures. Simple phonological constraints have been used to
improve the accuracy of the segmentation process. The resultant sequence of labels is
used for lexical access to retrieve the word, or a small set of words sharing the same broad
phonetic labelling. When more than one word candidate is retrieved, a verification
procedure is used to choose the most likely one.
Acoustic Approaches to Gender and Accent Identification
There has been considerable research on the problems of speaker and language recognition
from samples of speech. A less researched problem is that of accent recognition. Although this
is a similar problem to language identification, different accents of a language exhibit more
fine-grained differences between classes than languages. This presents a tougher problem
for traditional classification techniques. In this thesis, we propose and evaluate a number of
techniques for gender and accent classification. These techniques are novel modifications and
extensions to state of the art algorithms, and they result in enhanced performance on gender
and accent recognition.
The first part of the thesis focuses on the problem of gender identification, and presents a
technique that gives improved performance in situations where training and test conditions are
mismatched.
The bulk of this thesis is concerned with the application of the i-Vector technique to accent
identification, which is the most successful approach to acoustic classification to have emerged
in recent years. We show that it is possible to achieve high accuracy accent identification without
reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis
describes various stages in the development of i-Vector based accent classification that improve
the standard approaches usually applied for speaker or language identification, which are
insufficient. We demonstrate that very good accent identification performance is possible with
acoustic methods by considering different i-Vector projections, frontend parameters, i-Vector
configuration parameters, and an optimised fusion of the resulting i-Vector classifiers obtained
from the same data.
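The back-end of such a system is commonly cosine scoring over length-normalised i-Vectors. The sketch below is generic and not the thesis's pipeline: i-Vector extraction itself (total variability modelling) is omitted, and the tiny 3-D vectors and accent labels are synthetic stand-ins.

```python
import numpy as np

def length_norm(x):
    # Project (i-)vectors onto the unit sphere, standard i-Vector practice
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def train_class_means(ivectors, labels):
    # One length-normalised mean vector per accent class
    X = length_norm(ivectors)
    y = np.asarray(labels)
    return {c: length_norm(X[y == c].mean(axis=0)) for c in set(labels)}

def classify(ivector, means):
    # Cosine scoring: nearest class mean on the unit sphere
    v = length_norm(ivector)
    return max(means, key=lambda c: float(v @ means[c]))

# Synthetic 3-D "i-Vectors" for two illustrative accent classes
rng = np.random.default_rng(2)
A = rng.normal([1, 0, 0], 0.2, size=(20, 3))
B = rng.normal([0, 1, 0], 0.2, size=(20, 3))
X = np.vstack([A, B])
y = ["north"] * 20 + ["south"] * 20
means = train_class_means(X, y)
print(classify([0.9, 0.1, 0.0], means))   # -> north
```

Real systems would insert a projection (e.g. LDA) before scoring and fuse several such classifiers, which is the direction the thesis's improvements take.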
We claim to have achieved the best accent identification performance on the test corpus
for acoustic methods, with up to 90% identification rate. This performance is even better than
previously reported acoustic-phonotactic based systems on the same corpus, and is very close
to performance obtained via transcription-based accent identification. Finally, we demonstrate
that the utilisation of our techniques for speech recognition purposes leads to considerably
lower word error rates.
Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian
Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British
English, Prosody, Speech Recognition
Altering speech synthesis prosody through real time natural gestural control
A significant amount of research has been and continues to be undertaken into generating
expressive prosody within speech synthesis. Separately, recent developments in
HMM-based synthesis (specifically pHTS, developed at University of Mons) provide
a platform for reactive speech synthesis, able to react in real time to surroundings or
user interaction.
Considering both of these elements, this project explores whether it is possible to
generate superior prosody in a speech synthesis system, using natural gestural controls,
in real time. Building on a previous piece of work undertaken at The University of Edinburgh,
a system is constructed in which a user may apply a variety of prosodic effects
in real time through natural gestures, recognised by a Microsoft Kinect sensor. Gestures
are recognised and prosodic adjustments made through a series of hand-crafted
rules (based on data gathered from preliminary experiments), though machine learning
techniques are also considered within this project and recommended for future iterations
of the work.
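A hand-crafted rule of the kind described above amounts to a simple mapping from gesture features to prosodic scaling factors. The sketch below is purely illustrative: the feature names and thresholds are invented, not those derived from the project's preliminary experiments.

```python
def prosody_rule(hand_height, hand_speed):
    # Toy hand-crafted rule (hypothetical thresholds): a raised hand
    # scales pitch up, faster movement scales speaking rate up.
    # Inputs are assumed normalised to [0, 1] from sensor coordinates.
    h = max(0.0, min(hand_height, 1.0))
    s = max(0.0, min(hand_speed, 1.0))
    return {"pitch_scale": 1.0 + 0.4 * h,   # up to +40% F0
            "rate_scale": 1.0 + 0.5 * s}    # up to +50% speaking rate

print(prosody_rule(0.5, 0.0))   # mid-height, stationary hand
```

In a reactive synthesiser such as pHTS, these factors would be applied to the upcoming synthesis chunk, which is what makes real-time gestural control feasible.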
Two sets of formal experiments are implemented, both of which suggest that, with
further development, the system may work successfully in a real-world
environment. Firstly, user tests show that subjects can learn to control the device successfully,
adding prosodic effects to the intended words in the majority of cases with
practice. Results are likely to improve further as buffering issues are resolved. Secondly,
listening tests show that the prosodic effects currently implemented significantly
increase perceived naturalness, and in some cases are able to alter the semantic perception
of a sentence in an intended way.
Alongside this paper, a demonstration video of the project may be found on the accompanying
CD, or online at http://tinyurl.com/msc-synthesis. The reader is advised
to view this demonstration, as a way of understanding how the system functions and
sounds in action.
Optimizing acoustic and perceptual assessment of voice quality in children with vocal nodules
Thesis (Ph.D.), Harvard-MIT Division of Health Sciences and Technology, 2009. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 105-109). Few empirically derived guidelines exist for optimizing the assessment of vocal function in children with voice disorders. The goal of this investigation was to identify a minimal set of speech tasks and associated acoustic analysis methods that are most salient in characterizing the impact of vocal nodules on vocal function in children. Hence, a pediatric assessment protocol was developed based on the standardized Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) used to evaluate adult voices. Adult and pediatric versions of the CAPE-V protocols were used to gather recordings of vowels and sentences from adult females and children (4-6 and 8-10 year olds) with normal voices and vocal nodules, and these recordings were subjected to perceptual and acoustic analyses. Results showed that perceptual ratings for breathiness best characterized the presence of nodules in children's voices, and ratings for the production of sentences best differentiated normal voices and voices with nodules for both children and adults. Selected voice quality-related acoustic algorithms, designed to quantitatively evaluate acoustic measures of vowels and sentences, were modified to be pitch-independent for use in analyzing children's voices. Synthesized vowels for children and adults were used to validate the modified algorithms by systematically assessing the effects of manipulating the periodicity and spectral characteristics of the synthesizer's voicing source. In applying the validated algorithms to the recordings of subjects with normal voices and vocal nodules, the acoustic measures tended to differentiate normal voices and voices with nodules in children and adults, and some displayed significant correlations with the perceptual attributes of overall severity of dysphonia, roughness, and/or breathiness.
None of the acoustic measures correlated significantly with the perceptual attribute of strain. Limitations in the strength of the correlations between acoustic measures and perceptual attributes were attributed to factors that can be addressed in future investigations, which can now utilize the algorithms developed in this investigation for children's voices. Preliminary recommendations are made for the clinical assessment of pediatric voice disorders. By Asako Masaki, Ph.D.
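A representative pitch-sensitive acoustic measure in this area is cepstral peak prominence (CPP), a standard correlate of breathiness. The sketch below is a generic single-frame CPP, not one of the thesis's modified, pitch-independent algorithms: it assumes one analysis frame and a fixed pitch search range.

```python
import numpy as np

def cpp(x, fs, f_min=60.0, f_max=880.0):
    # Cepstral peak prominence: height of the cepstral peak within the
    # expected pitch-period (quefrency) range, measured above a linear
    # regression line fitted over that range. Lower CPP ~ breathier voice.
    w = np.hanning(len(x))
    log_mag = np.log(np.abs(np.fft.rfft(x * w)) + 1e-12)
    ceps = np.fft.irfft(log_mag)
    lo, hi = int(fs / f_max), int(fs / f_min)   # quefrency range (samples)
    seg = ceps[lo:hi]
    q = np.arange(lo, hi, dtype=float)
    slope, intercept = np.polyfit(q, seg, 1)    # regression baseline
    k = np.argmax(seg)
    return seg[k] - (slope * q[k] + intercept)

fs = 16000
t = np.arange(4096)
voiced = np.where(t % (fs // 200) == 0, 1.0, 0.0)   # 200 Hz pulse train
rng = np.random.default_rng(3)
noise = rng.standard_normal(4096)                    # aperiodic "breathy" frame
print(cpp(voiced, fs) > cpp(noise, fs))   # -> True
```

The pitch dependence visible in the fixed `f_min`/`f_max` search range is exactly the kind of issue the thesis's modifications address for children's higher-pitched voices.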