12 research outputs found

    Statistical properties of linear prediction analysis underlying the challenge of formant bandwidth estimation

    Formant bandwidth estimation is often observed to be more challenging than the estimation of formant center frequencies, due to the presence of multiple glottal pulses within a period and short closed-phase durations. This study explores inherently different statistical properties between linear prediction (LP)-based estimates of formant frequencies and their corresponding bandwidths, which may be explained in part by the statistical bounds on the variances of estimated LP coefficients. A theoretical analysis of the Cramér-Rao bounds on LP estimator variance indicates that bandwidth estimates are approximately half as accurate as center frequency estimates. Monte Carlo simulations of all-pole vowels with stochastic and mixed-source excitation demonstrate that the distributions of estimated LP coefficients exhibit, as expected, different variances for each coefficient. Transforming the LP coefficients to formant parameters yields variances of bandwidth estimates that are typically larger than the variances of the respective center frequency estimates, depending on vowel type and fundamental frequency. These results provide additional evidence for the challenge of formant bandwidth estimation arising from inherent statistical properties of LP-based speech analysis.
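    To make the frequency/bandwidth asymmetry concrete, the sketch below shows the standard conversion from LP coefficients to formant center frequencies and bandwidths: each complex pole's angle sets the center frequency, while its radius sets the bandwidth. This is a minimal illustration of the generic LP pipeline the abstract refers to, assuming the autocorrelation method, a Hamming window, and an illustrative model order; it is not the study's exact configuration.

```python
import numpy as np

def lp_formants(signal, fs, order=12):
    """Estimate formant center frequencies and bandwidths (Hz) via LP analysis."""
    # Autocorrelation method: windowed autocorrelation, then the normal equations.
    x = signal * np.hamming(len(signal))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])  # predictor coefficients a_1..a_p
    # Roots of A(z) = 1 - sum_k a_k z^{-k}; keep one of each conjugate pair.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]
    freqs = np.angle(roots) * fs / (2 * np.pi)   # pole angle  -> center frequency
    bands = -np.log(np.abs(roots)) * fs / np.pi  # pole radius -> bandwidth
    idx = np.argsort(freqs)
    return freqs[idx], bands[idx]
```

    For example, at fs = 10 kHz a pole at radius 0.98 and angle 0.31 rad maps to a center frequency near 493 Hz and a bandwidth near 64 Hz; a small estimation error in the pole radius perturbs the bandwidth far more, in relative terms, than the same-sized error in the angle perturbs the frequency, which is one intuition behind the variance asymmetry the abstract describes.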

    Object-based modelling for representing and processing speech corpora

    This thesis deals with modelling data existing in large speech corpora using an object-oriented paradigm which captures important linguistic structures. Information from corpora is transformed into objects, which are assigned properties regarding their behaviour. These objects, called speech units, are placed onto a multi-dimensional framework and have their relationships to other units explicitly defined through the use of links. Frameworks that model temporal utterances or atemporal information like speaker characteristics and recording conditions can be searched efficiently for contextual matches. Speech units that match desired contexts are the result of successful linguistically motivated queries and can be used in further speech processing tasks in the same computational environment. This allows for empirical studies of speech and its relation to linguistic structures to be carried out, and for the training and testing of applications like speech recognition and synthesis. Information residing in typical speech corpora is discussed first, followed by an overview of object-orientation which sets the tone for this thesis. Then the representation framework is introduced, which is generated by a compiler and linker that rely on a set of domain-specific resources to transform corpus data into speech units. Operations on this framework are then presented, along with a comparison between a relational and an object-oriented model of identical speech data. The models described in this work are directly applicable to existing large speech corpora, and the methods developed here are tested against relational database methods. The object-oriented methods outperform the relational methods for typical linguistically relevant queries by about three orders of magnitude, as measured by database search times. This improvement in simplicity of representation and search speed is crucial for the utilisation of large multi-lingual corpora in basic research on the detailed properties of speech, especially in relation to contextual variation.
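    As a rough illustration of the speech-unit idea, the sketch below models units as objects carrying properties and explicit links to other units, with a contextual query that walks those links. All class and attribute names are hypothetical stand-ins; the thesis's actual representation framework, compiler, and linker are far richer.

```python
# Hypothetical miniature of the "speech unit" idea: units carry properties
# and explicit links to other units, and contextual queries walk those links.
from dataclasses import dataclass, field

@dataclass
class SpeechUnit:
    label: str                      # e.g. a phone, syllable, or word label
    tier: str                       # e.g. "phone", "word", "speaker"
    start: float = 0.0              # seconds; atemporal units can ignore these
    end: float = 0.0
    links: dict = field(default_factory=dict)  # relation name -> list of units

    def link(self, relation, unit):
        self.links.setdefault(relation, []).append(unit)

def in_context(units, label, left=None, right=None):
    """Return units with the given label whose linked neighbours match."""
    def ok(u, rel, want):
        return want is None or any(n.label == want for n in u.links.get(rel, []))
    return [u for u in units if u.label == label
            and ok(u, "left", left) and ok(u, "right", right)]
```

    A call like in_context(units, "n", right="a"), for instance, would return every /n/ unit explicitly linked to a following /a/, which is the kind of linguistically motivated contextual query the abstract describes.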

    Two uses for syllables in a speech recognition system


    The application of continuous state HMMs to an automatic speech recognition task

    Hidden Markov Models (HMMs) have been a popular choice for automatic speech recognition (ASR) for several decades due to their mathematical formulation and computational efficiency, which consistently resulted in better performance compared to other methods during this period. However, HMMs are based on the assumption of statistical independence among speech frames, which conflicts with the physiological basis of speech production. Consequently, researchers have produced a substantial amount of literature extending the HMM model assumptions to incorporate dynamic properties of speech into the underlying model. One such approach involves segmental models, which address the frame-wise independence assumption. However, the computational inefficiencies associated with segmental models have limited their practical application. In recent years, there has been a shift from HMM-based systems to neural networks (NN) and deep learning approaches, which offer superior performance compared to conventional statistical models. However, as the complexity of neural models increases, so does the number of parameters involved, requiring a greater dependency on training data to optimise model parameters. The present study extends prior research on segmental HMMs by introducing a Segmental Continuous-State Hidden Markov Model (CSHMM) that examines a resolution to the issue of inter-segmental continuity. This is an alternative to contemporary speech modelling methods that rely on data-centric NN techniques, with the goal of establishing a statistical model that more accurately reflects the speech production process. The continuous-state segmental model offers a flexible mathematical framework which can impose a continuity constraint between adjoining segments, addressing a fundamental drawback of conventional HMMs, namely the independence assumption. Additionally, the CSHMM benefits from practical training and decoding algorithms which overcome the computational inefficiency inherent in conventional decoding algorithms for traditional segmental HMMs. This study formulates four trajectory-based segmental models using a CSHMM framework. CSHMMs have not been extensively studied for ASR tasks due to the absence of open-source standardised speech toolkits that enable convenient exploration of CSHMMs. As a result, to perform sufficient experiments in this study, training and decoding software has been developed, which can be accessed in (Seivwright, 2015). The experiments in this study report baseline phone recognition results for the four distinct segmental CSHMM systems using the TIMIT database. These baseline results are compared against a simple Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) system. In all experiments, a compact acoustic feature representation in the form of bottleneck features (BNFs) is employed, motivated by an investigation into BNFs and their relationship to articulatory properties. Although the proposed CSHMM systems do not surpass discrete-state HMMs in performance, this research demonstrates a strong association between inter-segmental continuity and the corresponding phonetic categories being modelled. Furthermore, this thesis presents a method for achieving finer control over continuity between segments, which can be expanded to investigate co-articulation in the context of CSHMMs.
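    To give a feel for the continuity constraint at the heart of such models, the toy sketch below scores one-dimensional segments under linear trajectories, forcing each segment's trajectory to begin at the previous segment's endpoint. This is a didactic illustration under invented simplifications (scalar features, fixed noise variance, known segmentation), not the thesis's actual CSHMM training or decoding algorithm.

```python
# Toy illustration of inter-segmental continuity in a trajectory-based
# segmental model: each segment fits a linear trajectory to its frames,
# with the intercept pinned to the previous segment's endpoint.
import numpy as np

def segment_loglik(frames, start_value, noise_var=1.0):
    """Fit a linear trajectory anchored at start_value; return (loglik, endpoint)."""
    t = np.arange(1, len(frames) + 1, dtype=float)
    # Least-squares slope with the intercept fixed at start_value;
    # fixing the intercept is where the continuity constraint enters.
    slope = np.dot(t, frames - start_value) / np.dot(t, t)
    traj = start_value + slope * t
    resid = frames - traj
    loglik = -0.5 * np.sum(resid**2 / noise_var + np.log(2 * np.pi * noise_var))
    return loglik, traj[-1]

def utterance_loglik(segments, initial_value=0.0):
    """Score a sequence of segments, threading each endpoint into the next."""
    total, state = 0.0, initial_value
    for seg in segments:
        ll, state = segment_loglik(np.asarray(seg, dtype=float), state)
        total += ll
    return total
```

    Relaxing or weighting the anchoring of the intercept is one way to achieve the finer control over continuity between segments that the abstract mentions.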

    An acoustic-phonetic approach in automatic Arabic speech recognition

    In a large-vocabulary speech recognition system, the broad phonetic classification technique is used instead of detailed phonetic analysis to overcome the variability in the acoustic realisation of utterances. The broad phonetic description of a word is used as a means of lexical access, where the lexicon is structured into sets of words sharing the same broad phonetic labelling. This approach has been applied to a large-vocabulary isolated-word Arabic speech recognition system. Statistical studies have been carried out on 10,000 Arabic words (converted to phonemic form) involving different combinations of broad phonetic classes, exploiting some particular features of the Arabic language. The results show that vowels represent about 43% of the total number of phonemes. They also show that about 38% of the words can be uniquely represented at this level using eight broad phonetic classes; when detailed vowel identification is introduced, the percentage of uniquely specified words rises to 83%. These results suggest that a fully detailed phonetic analysis of the speech signal is perhaps unnecessary. In the adopted word recognition model, the consonants are classified into four broad phonetic classes, while the vowels are described by their phonemic form. A set of 100 words uttered by several speakers has been used to test the performance of the implemented approach. In the implemented recognition model, three procedures have been developed, namely voiced-unvoiced-silence (V-UV-S) segmentation, vowel detection and identification, and automatic spectral transition detection between phonemes within a word. The accuracy of both the V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic segmentation procedure has been implemented, which exploits information from the three procedures above. Simple phonological constraints have been used to improve the accuracy of the segmentation process. The resultant sequence of labels is used for lexical access to retrieve the word, or a small set of words sharing the same broad phonetic labelling. When more than one word candidate is retrieved, a verification procedure is used to choose the most likely one.
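    The lexical-access statistics quoted above (e.g. 38% of words uniquely specified with eight broad classes) can be computed mechanically: map each word's phoneme string to a coarse label string, bucket the lexicon by label, and count singleton buckets. The sketch below shows that computation; the consonant-to-class mapping is invented for illustration and is not the thesis's actual Arabic class inventory.

```python
# Sketch of broad-phonetic lexical access: bucket the lexicon by coarse
# label strings and measure how many words are uniquely specified.
from collections import defaultdict

BROAD = {  # hypothetical consonant classes; vowels pass through in phonemic form
    "b": "STOP", "t": "STOP", "d": "STOP", "k": "STOP", "q": "STOP",
    "s": "FRIC", "z": "FRIC", "f": "FRIC", "x": "FRIC", "h": "FRIC",
    "m": "NAS",  "n": "NAS",
    "l": "LIQ",  "r": "LIQ",  "w": "LIQ",  "j": "LIQ",
}

def broad_label(phonemes):
    return tuple(BROAD.get(p, p) for p in phonemes)  # unknown symbols (vowels) kept

def cohort_stats(lexicon):
    """lexicon: dict word -> phoneme list. Returns fraction of uniquely labelled words."""
    cohorts = defaultdict(list)
    for word, phones in lexicon.items():
        cohorts[broad_label(phones)].append(word)
    unique = sum(1 for words in cohorts.values() if len(words) == 1)
    return unique / len(lexicon)
```

    Running cohort_stats over a phonemicised lexicon returns the proportion of words that sit alone in their cohort, which is the quantity the thesis reports for different class inventories.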

    Framework for proximal personified interfaces


    Acoustic Approaches to Gender and Accent Identification

    There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, different accents of a language exhibit more fine-grained differences between classes than languages do. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state-of-the-art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification, and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of this thesis is concerned with the application of the i-Vector technique to accent identification; this is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high-accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector-based accent classification that improve on the standard approaches usually applied for speaker or language identification, which are insufficient. We demonstrate that very good accent identification performance is possible with acoustic methods by considering different i-Vector projections, front-end parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers obtainable from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to a 90% identification rate. This performance is even better than previously reported acoustic-phonotactic systems on the same corpus, and is very close to the performance obtained via transcription-based accent identification. Finally, we demonstrate that the utilisation of our techniques for speech recognition purposes leads to considerably lower word error rates.
    Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition
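    As a hedged sketch of what an acoustic i-Vector back end can look like, the code below length-normalises precomputed i-vectors and scores test vectors against per-accent mean directions by cosine similarity. The i-vectors themselves (UBM training and total-variability estimation) are assumed to come from an upstream extractor, and this simple cosine back end stands in for the projections and classifier fusion the thesis actually optimises.

```python
# Minimal i-vector back end: length normalisation + per-class cosine scoring.
import numpy as np

def length_norm(ivecs):
    return ivecs / np.linalg.norm(ivecs, axis=1, keepdims=True)

def train_cosine_backend(ivecs, labels):
    """Return the class list and one unit-length mean i-vector per accent class."""
    ivecs = length_norm(np.asarray(ivecs, dtype=float))
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    means = np.stack([ivecs[labels == c].mean(axis=0) for c in classes])
    return classes, length_norm(means)

def classify(ivec, classes, means):
    """Assign the accent whose mean direction has the highest cosine similarity."""
    scores = means @ (ivec / np.linalg.norm(ivec))
    return classes[int(np.argmax(scores))]
```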

    Altering speech synthesis prosody through real time natural gestural control

    A significant amount of research has been and continues to be undertaken into generating expressive prosody within speech synthesis. Separately, recent developments in HMM-based synthesis (specifically pHTS, developed at the University of Mons) provide a platform for reactive speech synthesis, able to react in real time to surroundings or user interaction. Considering both of these elements, this project explores whether it is possible to generate superior prosody in a speech synthesis system, using natural gestural controls, in real time. Building on a previous piece of work undertaken at The University of Edinburgh, a system is constructed in which a user may apply a variety of prosodic effects in real time through natural gestures, recognised by a Microsoft Kinect sensor. Gestures are recognised and prosodic adjustments made through a series of hand-crafted rules (based on data gathered from preliminary experiments), though machine learning techniques are also considered within this project and recommended for future iterations of the work. Two sets of formal experiments are implemented, both of which suggest that, under further development, the system may work successfully in a real-world environment. Firstly, user tests show that subjects can learn to control the device successfully, adding prosodic effects to the intended words in the majority of cases with practice; results are likely to improve further as buffering issues are resolved. Secondly, listening tests show that the prosodic effects currently implemented significantly increase perceived naturalness, and in some cases are able to alter the semantic perception of a sentence in an intended way. Alongside this paper, a demonstration video of the project may be found on the accompanying CD, or online at http://tinyurl.com/msc-synthesis. The reader is advised to view this demonstration as a way of understanding how the system functions and sounds in action.
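    To illustrate what a hand-crafted gesture-to-prosody rule can look like, the sketch below maps a tracked hand position and velocity (as a Kinect skeleton stream would supply each frame) to pitch and duration scaling factors for the synthesiser. The thresholds, ranges, and function names are invented for illustration; the project's actual rules were tuned on its preliminary experimental data.

```python
# Hypothetical per-frame rule: hand height relative to the shoulder scales
# pitch, and a fast vertical stroke triggers emphasis (lengthening).
def gesture_to_prosody(hand_y, hand_dy, shoulder_y):
    """Return (pitch_scale, duration_scale) for the current frame."""
    rel = hand_y - shoulder_y                 # metres above (+) or below (-) shoulder
    pitch_scale = 1.0 + max(-0.3, min(0.3, rel * 0.5))  # raise hand -> raise F0
    emphasis = abs(hand_dy) > 0.8             # fast vertical movement (m/s)
    duration_scale = 1.3 if emphasis else 1.0 # lengthen the emphasised word
    return pitch_scale, duration_scale
```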

    Optimizing acoustic and perceptual assessment of voice quality in children with vocal nodules

    Thesis (Ph.D.), Harvard-MIT Division of Health Sciences and Technology, 2009. By Asako Masaki. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 105-109).
    Few empirically derived guidelines exist for optimizing the assessment of vocal function in children with voice disorders. The goal of this investigation was to identify a minimal set of speech tasks and associated acoustic analysis methods that are most salient in characterizing the impact of vocal nodules on vocal function in children. Hence, a pediatric assessment protocol was developed based on the standardized Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) used to evaluate adult voices. Adult and pediatric versions of the CAPE-V protocols were used to gather recordings of vowels and sentences from adult females and children (4-6 and 8-10 year olds) with normal voices and vocal nodules, and these recordings were subjected to perceptual and acoustic analyses. Results showed that perceptual ratings for breathiness best characterized the presence of nodules in children's voices, and ratings for the production of sentences best differentiated normal voices and voices with nodules for both children and adults. Selected voice quality-related acoustic algorithms, designed to quantitatively evaluate acoustic measures of vowels and sentences, were modified to be pitch-independent for use in analyzing children's voices. Synthesized vowels for children and adults were used to validate the modified algorithms by systematically assessing the effects of manipulating the periodicity and spectral characteristics of the synthesizer's voicing source. In applying the validated algorithms to the recordings of subjects with normal voices and vocal nodules, the acoustic measures tended to differentiate normal voices and voices with nodules in children and adults, and some displayed significant correlations with the perceptual attributes of overall severity of dysphonia, roughness, and/or breathiness. None of the acoustic measures correlated significantly with the perceptual attribute of strain. Limitations in the strength of the correlations between acoustic measures and perceptual attributes were attributed to factors that can be addressed in future investigations, which can now utilize the algorithms developed in this investigation for children's voices. Preliminary recommendations are made for the clinical assessment of pediatric voice disorders.