New measurement techniques for the assessment of velopharyngeal function in cleft palate patients
The present-day treatment of cleft palate is very much a multi-disciplinary team approach, calling upon the skills of plastic surgeon, orthodontist, maxillo-facial surgeon and speech therapist. The aesthetic result of modern plastic surgery on the lip and face is unquestionably successful; however, the improvement to speech due to changes in velopharyngeal function as a result of surgery is not so readily agreed upon. It is generally acknowledged that cleft repair should be carried out as early as possible after birth, followed by several years of developmental monitoring; however, considerable debate still abounds over the surgical technique employed and the long-term effect of surgery on speech development.
This thesis undertakes to contribute to the debate on the efficacy of cleft repair in relation to speech function in the following manner. Firstly, a new instrument called a Nasal Resonometer has been specifically designed for use by speech therapists for the pre- and post-operative assessment of hyper- and hypo-nasal speech. Secondly, a new measurement technique involving the computer-assisted analysis of x-ray videofluoroscopy images of clinically significant aspects of velar function has been introduced.
Several studies of patients attending the cleft repair clinics over a three-year period are presented. The correlations between objective Resonometer measurement, subjective speech therapist analysis, velopharyngeal function and surgical technique are examined.
The extensive clinical use of the Nasal Resonometer and image analysis technique has proven to be a successful addition to routine cleft palate measurements. Further, the application of these measurements in specific studies has led to a clearer understanding of the effect of cleft palate surgery and has highlighted future areas of research.
Determination of articulatory parameters from speech waveforms
Glottal-synchronous speech processing
Glottal-synchronous speech processing is a field of speech science where the pseudoperiodicity
of voiced speech is exploited. Traditionally, speech processing involves segmenting
and processing short speech frames of predefined length; this may fail to exploit the inherent
periodic structure of voiced speech which glottal-synchronous speech frames have
the potential to harness. Glottal-synchronous frames are often derived from the glottal
closure instants (GCIs) and glottal opening instants (GOIs).
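As a rough illustration of the framing idea described above, the sketch below cuts one frame per GCI, spanning from the preceding GCI to the following one, so that each frame covers two glottal cycles. The function name `glottal_synchronous_frames`, the two-cycle frame span and the synthetic pulse train are assumptions made for this example; they are not taken from the thesis.

```python
import numpy as np

def glottal_synchronous_frames(signal, gcis):
    """Return one frame per interior GCI, spanning the previous to the next GCI.

    `gcis` is a sorted sequence of sample indices (assumed already detected).
    Frame lengths follow the local pitch period instead of a fixed size.
    """
    frames = []
    for prev_gci, next_gci in zip(gcis[:-2], gcis[2:]):
        frames.append(signal[prev_gci:next_gci])
    return frames

# Synthetic stand-in for voiced speech: a pulse every 80 samples
# (100 Hz at an assumed 8 kHz sampling rate).
gcis = np.arange(40, 800, 80)
signal = np.zeros(800)
signal[gcis] = 1.0

frames = glottal_synchronous_frames(signal, gcis)
print(len(frames), [len(f) for f in frames[:3]])
```

With a real signal the GCIs would be irregularly spaced, so the frames would vary in length from cycle to cycle, which is the property a fixed-length segmentation cannot provide.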
The SIGMA algorithm was developed for the detection of GCIs and GOIs from
the Electroglottograph signal with a measured accuracy of up to 99.59%. For GCI and
GOI detection from speech signals, the YAGA algorithm provides a measured accuracy
of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to
reverberation than single-channel algorithms.
The GCIs are applied to real-world applications including speech dereverberation,
where SNR is improved by up to 5 dB, and to prosodic manipulation where the importance
of voicing detection in glottal-synchronous algorithms is demonstrated by subjective
testing. The GCIs are further exploited in a new area of data-driven speech modelling,
providing new insights into speech production and a set of tools to aid deployment into
real-world applications. The technique is shown to be applicable in areas of speech coding,
identification and artificial bandwidth extension of telephone speech.
An acoustic-phonetic approach in automatic Arabic speech recognition
In a large vocabulary speech recognition system the broad phonetic classification
technique is used instead of detailed phonetic analysis to overcome the variability in the
acoustic realisation of utterances. The broad phonetic description of a word is used as a
means of lexical access, where the lexicon is structured into sets of words sharing the
same broad phonetic labelling.
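The lexical-access idea described above can be sketched as follows. The class inventory, the phoneme-to-class mapping `BROAD_CLASS` and the toy entries are all hypothetical, chosen only to show how words that share a broad phonetic labelling fall into the same lexicon set; they are not the classes or the lexicon used in the thesis.

```python
from collections import defaultdict

# Hypothetical broad classes: V = vowel, P = plosive, F = fricative,
# N = nasal, L = liquid.
BROAD_CLASS = {
    "a": "V", "i": "V", "u": "V",
    "b": "P", "t": "P", "d": "P", "k": "P",
    "s": "F", "f": "F", "z": "F",
    "m": "N", "n": "N",
    "l": "L", "r": "L",
}

def broad_label(phonemes):
    """Map a phoneme sequence to its broad phonetic label string."""
    return "".join(BROAD_CLASS[p] for p in phonemes)

def build_lexicon(words):
    """Group words by broad label so a label retrieves its candidate set."""
    index = defaultdict(list)
    for word, phonemes in words.items():
        index[broad_label(phonemes)].append(word)
    return index

# Toy phoneme strings, for illustration only.
toy_words = {
    "kataba": list("kataba"),
    "bataka": list("bataka"),
    "darasa": list("darasa"),
    "nasala": list("nasala"),
}
lexicon = build_lexicon(toy_words)
print(lexicon[broad_label(list("kataba"))])
```

A label that retrieves a single word identifies it uniquely at the broad level; a label that retrieves several words leaves a small candidate set that needs further discrimination.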
This approach has been applied to a large vocabulary isolated word Arabic speech
recognition system. Statistical studies have been carried out on 10,000 Arabic words
(converted to phonemic form) involving different combinations of broad phonetic
classes. Some particular features of the Arabic language have been exploited. The results
show that vowels represent about 43% of the total number of phonemes. They also show
that about 38% of the words can uniquely be represented at this level by using eight
broad phonetic classes. When introducing detailed vowel identification the percentage of
uniquely specified words rises to 83%. These results suggest that a fully detailed
phonetic analysis of the speech signal is perhaps unnecessary.
In the adopted word recognition model, the consonants are classified into four broad
phonetic classes, while the vowels are described by their phonemic form. A set of 100
words uttered by several speakers has been used to test the performance of the
implemented approach.
In the implemented recognition model, three procedures have been developed, namely
voiced-unvoiced-silence segmentation, vowel detection and identification, and automatic
spectral transition detection between phonemes within a word. The accuracy of both the
V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic
segmentation procedure has been implemented, which exploits information from the
above mentioned three procedures. Simple phonological constraints have been used to
improve the accuracy of the segmentation process. The resultant sequence of labels is
used for lexical access to retrieve the word or a small set of words sharing the same broad
phonetic labelling. When there is more than one word candidate, a verification
procedure is used to choose the most likely one.
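The abstract does not say which features drive the voiced-unvoiced-silence segmentation. As a minimal sketch of one common textbook approach, and not the thesis procedure, the example below labels fixed-length frames using short-time energy and zero-crossing rate. The function name `classify_frames`, the frame length and both thresholds are assumptions picked for the synthetic test signal.

```python
import numpy as np

def classify_frames(x, frame_len=200, energy_thresh=1e-3, zcr_thresh=0.25):
    """Label each non-overlapping frame as 'S', 'UV' or 'V'.

    Low energy -> silence; otherwise a high zero-crossing rate -> unvoiced,
    and a low one -> voiced. Thresholds are illustrative, not tuned values.
    """
    labels = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        energy = np.mean(frame ** 2)
        # Fraction of adjacent sample pairs whose sign differs.
        zcr = np.mean(np.diff(np.sign(frame)) != 0)
        if energy < energy_thresh:
            labels.append("S")
        elif zcr > zcr_thresh:
            labels.append("UV")
        else:
            labels.append("V")
    return labels

# Synthetic signal at an assumed 8 kHz: silence, a 100 Hz tone standing in
# for voiced speech, then white noise standing in for unvoiced speech.
rng = np.random.default_rng(0)
n = np.arange(400)
silence = np.zeros(400)
voiced = 0.5 * np.sin(2 * np.pi * 100 * n / 8000)
unvoiced = 0.1 * rng.standard_normal(400)
labels = classify_frames(np.concatenate([silence, voiced, unvoiced]))
print(labels)
```

Real speech would need thresholds adapted to the recording level and noise floor, and usually some smoothing of the label sequence.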
Phase-Distortion-Robust Voice-Source Analysis
This work concerns itself with the analysis of voiced speech signals, in particular the analysis of the glottal source signal. Following the source-filter theory of speech, the glottal signal is produced by the vibratory behaviour of the vocal folds and is modulated by the resonances of the vocal tract and radiation characteristic of the lips to form the speech signal. As it is thought that the glottal source signal contributes much of the non-linguistic and prosodical information to speech, it is useful to develop techniques which can estimate and parameterise this signal accurately. Because of vocal tract modulation, estimating the glottal source waveform from the speech signal is a blind deconvolution problem which necessarily makes assumptions about the characteristics of both the glottal source and vocal tract. A common assumption is that the glottal signal and/or vocal tract can be approximated by a parametric model. Other assumptions include the causality of the speech signal: the vocal tract is assumed to be a minimum phase system while the glottal source is assumed to exhibit mixed phase characteristics. However, as the literature review within this thesis will show, the error criteria utilised to determine the parameters are not robust to the conditions under which the speech signal is recorded, and are particularly degraded in the common scenario where low frequency phase distortion is introduced. Those that are robust to this type of distortion are not well suited to the analysis of real-world signals. This research proposes a voice-source estimation and parameterisation technique, called the Power-spectrum-based determination of the Rd parameter (PowRd) method. Illustrated by theory and demonstrated by experiment, the new technique is robust to the time placement of the analysis frame and phase issues that are generally encountered during recording. 
The method assumes that the derivative glottal flow signal is approximated by the transformed Liljencrants-Fant model and that the vocal tract can be represented by an all-pole filter. Unlike many existing glottal source estimation methods, the PowRd method employs a new error criterion to optimise the parameters, which is also suitable for determining the optimal vocal-tract filter order. In addition to the issue of glottal source parameterisation, nonlinear phase recording conditions can also adversely affect the results of other speech processing tasks such as the estimation of the instant of glottal closure. In this thesis, a new glottal closing instant estimation algorithm is proposed which incorporates elements from the state-of-the-art techniques and is specifically designed for operation upon speech recorded under nonlinear phase conditions. The new method, called the Fundamental RESidual Search or FRESS algorithm, is shown to estimate the glottal closing instant of voiced speech with superior precision and with accuracy comparable to that of other existing methods over a large database of real speech signals under real and simulated recording conditions. An application of the proposed glottal source parameterisation method and glottal closing instant detection algorithm is a system which can analyse and re-synthesise voiced speech signals. This thesis describes perceptual experiments which show that, under linear and nonlinear recording conditions, the system produces synthetic speech which is generally preferred to speech synthesised using a state-of-the-art time-domain-based parameterisation technique. In sum, this work represents a movement towards flexible and robust voice-source analysis, with potential for a wide range of applications including speech analysis, modification and synthesis.
Vowel normalisation : an interface between acoustic and linguistic descriptions of speaker characteristics in Australian English
This thesis examines existing normalisation procedures against the background
of a theoretical model of inter-speaker formant variability, which
describes observed formant differences in three major categories: phonetic
variation, non-uniform variation, and uniform variation. A new
normalisation strategy based on this model is proposed which involves
the removal of uniform and non-uniform components of inter-speaker
variation in order to isolate phonetic variation. The nature of this non-uniformity
is subject to empirical investigation. Working along the above
strategy, the method adopted in this thesis is to initially acquire a phonetically
stable vowel database, which is then screened for phonetic variations
through a rigorous phonetic control procedure. The resulting
data, now considered to be phonetically homogeneous, are used for exploring
two essential domains of inter-speaker variability that contribute
to the designing of a future normalisation procedure: (1) By applying
uniform transformations using a variety of published scaling parameters,
the most effective uniform scaling parameters are identified. (2)
Non-uniform inter-speaker variation patterns are analysed and compared
with the published results of Fant (1975). A major discovery is that
non-uniform inter-speaker variation patterns obtained from phonetically
controlled data are grossly different from those observed by Fant.
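To make the notion of a uniform transformation concrete, the sketch below rescales all of a speaker's formant values by a single factor. The choice of factor here (a reference mean divided by the speaker's own mean formant value) is an assumption for illustration only; the thesis compares a variety of published scaling parameters that are not listed in this abstract, and the formant values are invented.

```python
import numpy as np

def uniform_normalise(formants, reference_mean):
    """Scale every formant of one speaker by a single factor k.

    `formants` is an array of shape (n_vowels, n_formants) in Hz.
    k is chosen, hypothetically, so the speaker's mean matches the reference.
    """
    k = reference_mean / formants.mean()
    return formants * k

# Invented F1/F2 values (Hz) for two vowels; speaker B is speaker A with
# every formant raised by 20%, i.e. a purely uniform difference.
speaker_a = np.array([[300.0, 2300.0], [700.0, 1200.0]])
speaker_b = 1.2 * speaker_a

ref = speaker_a.mean()
norm_a = uniform_normalise(speaker_a, ref)
norm_b = uniform_normalise(speaker_b, ref)
print(np.allclose(norm_a, norm_b))
```

A single factor removes only the uniform component; any difference that remains between speakers after such scaling is what the abstract calls non-uniform variation.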
The present database comprises 594 vowels in the /h_d/ word context
(11 phonemic monophthongs x 9 speakers x 6 repetitions), and the speakers
include 4 adult females, 3 adult males and 2 children (male).