746 research outputs found
Robust speaker identification using artificial neural networks
This research mainly focuses on recognizing the speakers through their speech samples. Numerous Text-Dependent or Text-Independent algorithms have been developed by people so far, to recognize the speaker from his/her speech. In this thesis, we concentrate on the recognition of the speaker from the fixed text i.e. Text-Dependent . Possibility of extending this method to variable text i.e. Text-Independent is also analyzed. Different feature extraction algorithms are employed and their performance with Artificial Neural Networks as a Data Classifier on a fixed training set is analyzed. We find a way to combine all these individual feature extraction algorithms by incorporating their interdependence. The efficiency of these algorithms is determined after the input speech is classified using Back Propagation Algorithm of Artificial Neural Networks. A special case of Back Propagation Algorithm which improves the efficiency of the classification is also discussed
Virtual Exploration of the Human Vocal Tract
Hypothesis: By simulating the acoustic field throughout the entire vocal tract the evolution of speech sounds within the tract can be directly and quantitatively related to physical variations in the tract geometry. This insight into speech production could then be applied to a variety of fields where the ability to alter or investigate speech characteristics in a targeted way could be useful for example in the teaching of speech science, in speech coaching, or as part of the planning of medical procedures. In this research, a bespoke acoustic simulation package has been produced using a continuous 3-dimensional Digital Waveguide Mesh (DWM) which can produce acoustic output throughout the entire simulation domain containing the tract at every time step. This package has been shown to reproduce formant frequencies for a variety of vocal tract shapes with an average mean absolute error of 10.12% at the lips, which is comparable to other research. These results have been investigated by comparing simulation output to recorded output from physical models. This simulation package has also been used to perform studies into the shifting of formant frequencies during speech sound production along the length of the tract, and into the effect on formant frequencies of the removal of geometric features of the tract such as the piriform fossae. These studies have been compared to physical internal measurements of vocal tract models from living subjects, showing preliminary agreement with further development required. A large emphasis has been placed on the accessibility of this research, with the production of several tools for visualisation of the data contained within, and with decisions made during the production of the simulation package itself
Recommended from our members
Modelling and extraction of fundamental frequency in speech signals
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.One of the most important parameters of speech is the fundamental frequency of vibration of voiced sounds. The audio sensation of the fundamental frequency is known as the pitch. Depending on the tonal/non-tonal category of language, the fundamental frequency conveys intonation, pragmatics and meaning. In addition the fundamental frequency and intonation carry speaker gender, age, identity, speaking style and emotional state. Accurate estimation of the fundamental frequency is critically important for functioning of speech processing applications such as speech coding, speech recognition, speech synthesis and voice morphing. This thesis makes contributions to the development of accurate pitch estimation research in three distinct ways: (1) an investigation of the impact of the window length on pitch estimation error, (2) an investigation of the use of the higher order moments and (3) an investigation of an analysis-synthesis method for selection of the best pitch value among N proposed candidates. Experimental evaluations show that the length of the speech window has a major impact on the accuracy of pitch estimation. Depending on the similarity criteria and the order of the statistical moment a window length of 37 to 80 ms gives the least error. In order to avoid excessive delay as a consequence of using a longer window, a method is proposed
ii where the current short window is concatenated with the previous frames to form a longer signal window for pitch extraction. The use of second order and higher order moments, and the magnitude difference function, as the similarity criteria were explored and compared. A novel method of calculation of moments is introduced where the signal is split, i.e. rectified, into positive and negative valued samples. The moments for the positive and negative parts of the signal are computed separately and combined. The new method of calculation of moments from positive and negative parts and the higher order criteria provide competitive results. A challenging issue in pitch estimation is the determination of the best candidate from N extrema of the similarity criteria. The analysis-synthesis method proposed in this thesis selects the pitch candidate that provides the best reproduction (synthesis) of the harmonic spectrum of the original speech. The synthesis method must be such that the distortion increases with the increasing error in the estimate of the fundamental frequency. To this end a new method of spectral synthesis is proposed using an estimate of the spectral envelop and harmonically spaced asymmetric Gaussian pulses as excitation. The N-best method provides consistent reduction in pitch estimation error. The methods described in this thesis result in a significant improvement in the pitch accuracy and outperform the benchmark YIN method
Estimating Housing Demand with an Application to Explaining Racial Segregation in Cities
We present a three-stage estimation procedure to recover willingness to pay for housing attributes. In the first stage, we estimate a non-parametric hedonic home price function. Second, we recover each consumer's taste parameters for product characteristics using first order conditions for utility maximization. Finally, we estimate the distribution of household tastes as a function of household demographics. As an application of our methods, we compare alternative explanations for why blacks choose to live in center cities while whites suburbanize.
Recommended from our members
The role of HG in the analysis of temporal iteration and interaural correlation
Modelling the effects of speech rate variation for automatic speech recognition
Wrede B. Modelling the effects of speech rate variation for automatic speech recognition. Bielefeld (Germany): Bielefeld University; 2002.In automatic speech recognition it is a widely observed phenomenon that variations in speech rate cause severe degradations of the speech recognition performance. This is due to the fact that standard stochastic based speech recognition systems specialise on average speech rate. Although many approaches to modelling speech rate variation have been made, an integrated approach in a substantial system still has be to developed. General approaches to rate modelling are based on rate dependent models which are trained with rate specific subsets of the training data. During decoding a signal based rate estimation is performed according to which the set of rate dependent models is selected. While such approaches are able to reduce the word error rate significantly, they suffer from shortcomings such as the reduction of training data and the expensive training and decoding procedure.
However, phonetic investigations show that there is a systematic relationship between speech rate and the acoustic characteristics of speech. In fast speech a tendency of reduction can be observed which can be described in more detail as a centralisation effect and an increase in coarticulation. Centralisation means that the formant frequencies of vowels tend to shift towards the vowel space center while increased coarticulation denotes the tendency of the spectral features of a vowel to shift towards those of its phonemic neighbour.
The goal of this work is to investigate the possibility to incorporate the knowledge of the systematic nature of the influence of speech rate variation on the acoustic features in speech rate modelling.
In an acoustic-phonetic analysis of a large corpus of spontaneous speech it was shown that an increased degree of the two effects of centralisation and coarticulation can be found in fast speech. Several measures for these effects were developed and used in speech recognition experiments with rate dependent models.
A thorough investigation of rate dependent models showed that with duration and coarticulation based measures significant increases of the performance could be achieved. It was shown that by the use of different measures the models were adapted either to centralisation or coarticulation. Further experiments showed that by a more detailed modelling with more rate classes a further improvement can be achieved. It was also observed that a general basis for the models is needed before rate adaptation can be performed. In a comparison to other sources of acoustic variation it was shown that the effects of speech rate are as severe as those of speaker variation and environmental noise.
All these results show that for a more substantial system that models rate variations accurately it is necessary to focus on both, durational and spectral effects. The systematic nature of the effects indicates that a continuous modelling is possible
Recommended from our members
Speech coding
Speech is the predominant means of communication between human beings and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained to be the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage of speech signal getting corrupted by noise, cross-talk and distortion Long haul transmissions which use repeaters to compensate for the loss in signal strength on transmission links also increase the associated noise and distortion. On the other hand digital transmission is relatively immune to noise, cross-talk and distortion primarily because of the capability to faithfully regenerate digital signal at each repeater purely based on a binary decision. Hence end-to-end performance of the digital link essentially becomes independent of the length and operating frequency bands of the link Hence from a transmission point of view digital transmission has been the preferred approach due to its higher immunity to noise. The need to carry digital speech became extremely important from a service provision point of view as well. Modem requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term Speech Coding is often referred to techniques that represent or code speech signals either directly as a waveform or as a set of parameters by analyzing the speech signal. In either case, the codes are transmitted to the distant end where speech is reconstructed or synthesized using the received set of codes. A more generic term that is applicable to these techniques that is often interchangeably used with speech coding is the term voice coding. This term is more generic in the sense that the coding techniques are equally applicable to any voice signal whether or not it carries any intelligible information, as the term speech implies. Other terms that are commonly used are speech compression and voice compression since the fundamental idea behind speech coding is to reduce (compress) the transmission rate (or equivalently the bandwidth) And/or reduce storage requirements In this document the terms speech and voice shall be used interchangeably
- …