Analysis and correction of the helium speech effect by autoregressive signal processing
SIGLE LD:D48902/84 / BLDSC - British Library Document Supply Centre, GB (United Kingdom)
Development of a Real-time Embedded System for Speech Emotion Recognition
Speech emotion recognition is one of the latest challenges in speech processing and human-computer interaction (HCI), addressing the operational needs of real-world applications. Besides facial expressions, speech has proven to be one of the most promising modalities for automatic human emotion recognition: it is a spontaneous medium for conveying emotion and provides in-depth information about a speaker's cognitive state. In this context, we introduce a novel approach using a combination of prosody features (pitch, energy, zero-crossing rate), quality features (formant frequencies, spectral features, etc.), derived features (Mel-frequency cepstral coefficients (MFCC), linear predictive coding coefficients (LPCC)) and a dynamic feature (Mel-energy spectrum dynamic coefficients (MEDC)) for robust automatic recognition of a speaker's emotional state. A multilevel SVM classifier is used to identify seven discrete emotional states, namely angry, disgust, fear, happy, neutral, sad and surprise, in five native Assamese languages. The overall experimental results using MATLAB simulation show that the combined-feature approach achieves an average accuracy of 82.26% in speaker-independent cases. A real-time implementation of the algorithm has been prepared on an ARM Cortex-M3 board.
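As a rough illustration of the kind of pipeline described above, the sketch below extracts per-utterance prosody and MFCC statistics and feeds them to a single SVM. It is only a hedged Python approximation, not the paper's MATLAB/ARM Cortex-M3 implementation: the exact feature set, the extract_features helper and the classifier settings are illustrative assumptions.

    # Illustrative sketch only: prosody (pitch, energy, ZCR) and MFCC statistics
    # summarised per utterance and classified with a single SVM. The paper uses
    # a richer feature set and a multilevel SVM in MATLAB; this is not that code.
    import numpy as np
    import librosa
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

    def extract_features(path, sr=16000):
        """Summarise one utterance as a fixed-length feature vector."""
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # derived features
        zcr = librosa.feature.zero_crossing_rate(y)              # prosody: ZCR
        rms = librosa.feature.rms(y=y)                           # prosody: energy
        f0, _, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)     # prosody: pitch
        f0 = f0[~np.isnan(f0)]
        if f0.size == 0:
            f0 = np.zeros(1)
        stats = lambda m: np.concatenate([m.mean(axis=1), m.std(axis=1)])
        return np.concatenate([stats(mfcc), stats(zcr), stats(rms),
                               [f0.mean(), f0.std()]])

    def train_classifier(paths, labels):
        """paths: wav files; labels: emotion names drawn from EMOTIONS."""
        X = np.vstack([extract_features(p) for p in paths])
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
        clf.fit(X, labels)
        return clf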
Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy
The proceedings of the MAVEBA Workshop, held on a biannual basis, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, in support of clinical diagnosis and the classification of vocal pathologies. The Workshop is sponsored by Ente Cassa Risparmio di Firenze, COST Action 2103, the Biomedical Signal Processing and Control journal (Elsevier), and the IEEE Biomedical Engineering Society. Special issues of international journals have been, and will be, published collecting selected papers from the conference.
An acoustic-phonetic approach in automatic Arabic speech recognition
In a large-vocabulary speech recognition system, broad phonetic classification is used instead of detailed phonetic analysis to overcome the variability in the acoustic realisation of utterances. The broad phonetic description of a word is used as a means of lexical access, where the lexicon is structured into sets of words sharing the same broad phonetic labelling.

This approach has been applied to a large-vocabulary isolated-word Arabic speech recognition system. Statistical studies have been carried out on 10,000 Arabic words (converted to phonemic form) involving different combinations of broad phonetic classes, and some particular features of the Arabic language have been exploited. The results show that vowels represent about 43% of the total number of phonemes. They also show that about 38% of the words can be uniquely represented at this level using eight broad phonetic classes; when detailed vowel identification is introduced, the percentage of uniquely specified words rises to 83%. These results suggest that a fully detailed phonetic analysis of the speech signal is perhaps unnecessary.

In the adopted word recognition model, the consonants are classified into four broad phonetic classes, while the vowels are described by their phonemic form. A set of 100 words uttered by several speakers has been used to test the performance of the implemented approach.

In the implemented recognition model, three procedures have been developed: voiced-unvoiced-silence segmentation, vowel detection and identification, and automatic detection of spectral transitions between phonemes within a word. The accuracy of both the V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic segmentation procedure has been implemented which exploits information from the three procedures above, and simple phonological constraints have been used to improve the accuracy of the segmentation process. The resulting sequence of labels is used for lexical access to retrieve the word, or a small set of words sharing the same broad phonetic labelling. When more than one word candidate is retrieved, a verification procedure is used to choose the most likely one.
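The lexical-access idea can be sketched as follows. This is a hedged toy example, not the thesis's system: the broad-class inventory, the phoneme map and the example words are illustrative, whereas the thesis's adopted model classifies consonants into four broad classes for Arabic and keeps vowels in phonemic form.

    # Toy sketch of lexical access via broad phonetic labelling: each lexicon
    # word is indexed by the broad-class sequence of its phonemes, so a
    # recognised broad-class string retrieves a small cohort of candidates.
    # The class inventory and the example words are illustrative only.
    from collections import defaultdict

    BROAD_CLASS = {
        # consonants collapsed into broad classes; vowels kept in phonemic form
        "b": "STOP", "t": "STOP", "d": "STOP", "k": "STOP", "q": "STOP",
        "s": "FRIC", "z": "FRIC", "f": "FRIC",
        "m": "NASAL", "n": "NASAL",
        "l": "LIQUID", "r": "LIQUID",
        "a": "a", "i": "i", "u": "u",
    }

    def broad_label(phonemes):
        return tuple(BROAD_CLASS.get(p, "OTHER") for p in phonemes)

    def build_index(lexicon):
        """lexicon: word -> phoneme list (hypothetical phonemic forms)."""
        index = defaultdict(list)
        for word, phones in lexicon.items():
            index[broad_label(phones)].append(word)
        return index

    lexicon = {"kataba": list("kataba"), "kutiba": list("kutiba"),
               "darasa": list("darasa")}
    index = build_index(lexicon)
    # A recognised broad-class sequence retrieves its cohort; when the cohort
    # holds more than one word, a verification stage picks the most likely one.
    candidates = index[broad_label(list("kataba"))]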
Evaluation of glottal characteristics for speaker identification.
Based on the assumption that the physical characteristics of people's vocal apparatus cause their voices to have distinctive characteristics, this thesis reports on investigations into the use of the long-term average glottal response for speaker identification. The long-term average glottal response is a new feature that is obtained by overlaying successive vocal tract responses within an utterance.
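One plausible way to read this construction is as an overlay-and-average of fixed-length segments anchored at successive excitation instants, as in the hedged sketch below; the epoch input and window length are assumptions, and the thesis describes the exact procedure.

    # Hedged sketch of an overlay-and-average construction: fixed-length windows
    # anchored at successive excitation epochs are summed and averaged. The
    # `epochs` input (e.g. glottal closure instants) and the window length are
    # illustrative assumptions, not the thesis's exact procedure.
    import numpy as np

    def long_term_average_response(speech, epochs, length=400):
        """Average `length`-sample windows starting at each epoch index."""
        acc = np.zeros(length)
        count = 0
        for e in epochs:
            if e + length <= len(speech):
                acc += speech[e:e + length]
                count += 1
        return acc / max(count, 1)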
The way in which the long-term average glottal response varies with accent and gender is examined using a population of 352 American English speakers from eight different accent regions. Descriptors are defined that characterize the shape of the long-term average glottal response. Factor analysis of the descriptors of the long-term average glottal responses shows that the most important factor contains significant contributions from descriptors comprised of the coefficients of cubics fitted to the long-term average glottal response. Discriminant analysis demonstrates that the long-term average glottal response is potentially useful for classifying speakers according to their gender, but is not useful for distinguishing American accents.
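As an illustration of the cubic-fit descriptors, a minimal sketch follows; the thesis's full descriptor set and the subsequent factor and discriminant analyses are richer than this.

    # Minimal sketch: fit a cubic to the long-term average glottal response and
    # keep its coefficients as shape descriptors (one ingredient of the factor
    # found most important above). Time normalisation to [0, 1] is an assumption.
    import numpy as np

    def cubic_descriptors(response):
        t = np.linspace(0.0, 1.0, len(response))
        return np.polyfit(t, response, deg=3)   # four cubic coefficients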
The identification accuracy of the long-term average glottal response is compared with that obtained from vocal tract features. Identification experiments are performed using a speaker database containing utterances from twenty speakers of the digits zero to nine. Vocal tract features, which consist of cepstral coefficients, partial correlation coefficients and linear prediction coefficients, are shown to be more accurate than the long-term average glottal response. Although analysis of the training data indicated that the long-term average glottal response was uncorrelated with the vocal tract features, the various feature combinations gave only insignificant improvements in identification accuracy.
The effect of noise and distortion on speaker identification is examined for each of the features. It is found that the identification performance of the long-term average glottal response is insensitive to noise compared with cepstral coefficients, partial correlation coefficients and the long-term average spectrum, but that it is highly sensitive to variations in the phase response of the speech transmission channel.
Before reporting on the identification experiments, the thesis introduces speech production, speech models and background to the various features used in the experiments. Investigations into the long-term average glottal response demonstrate that it approximates the glottal pulse convolved with the long-term average impulse response, and this relationship is verified using synthetic speech. Furthermore, the spectrum of the long-term average glottal response extracted from pre-emphasized speech is shown to be similar to the long-term average spectrum of pre-emphasized speech, while being computationally much simpler to obtain.
Models and Analysis of Vocal Emissions for Biomedical Applications
The proceedings of the MAVEBA Workshop, held on a biannual basis, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, in support of clinical diagnosis and the classification of vocal pathologies.
Speaker independent isolated word recognition
The work presented in this thesis concerns the recognition of isolated words using a pattern matching approach. In such a system, an unknown speech utterance, which is to be identified, is transformed into a pattern of characteristic features. These features are then compared with a set of pre-stored reference patterns that were generated from the vocabulary words. The unknown word is identified as the vocabulary word whose reference pattern gives the best match.
One of the major difficulties in the pattern comparison process is that speech patterns obtained from the same word exhibit non-linear temporal fluctuations and thus a high degree of redundancy. The initial part of this thesis considers various dynamic time warping techniques used for normalizing the temporal differences between speech patterns. Redundancy removal methods are also considered, and their effect on the recognition accuracy is assessed.
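A minimal dynamic time warping sketch is given below for orientation; the thesis evaluates several DTW variants and local constraints, which this simple symmetric form does not reproduce.

    # Basic dynamic time warping between two feature sequences (frames x dims),
    # using a simple symmetric step pattern and Euclidean local distance. Only
    # an orientation sketch; the thesis studies several warping variants.
    import numpy as np

    def dtw_distance(a, b):
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j],      # insertion
                                     D[i, j - 1],      # deletion
                                     D[i - 1, j - 1])  # match
        return D[n, m] / (n + m)   # length-normalised alignment cost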
Although the use of dynamic time warping algorithms provides a considerable improvement in the accuracy of isolated word recognition schemes, the performance is ultimately limited by their poor ability to discriminate between acoustically similar words. Methods for enhancing the identification rate among acoustically similar words, by using common pattern features for similar-sounding regions, are investigated.
Pattern-matching-based, speaker-independent systems can only operate with a high recognition rate by using multiple reference patterns for each of the words included in the vocabulary. These patterns are obtained from the utterances of a group of speakers. The use of multiple reference patterns not only leads to a large increase in the memory requirements of the recognizer, but also increases the computational load. A recognition system is proposed in this thesis which overcomes these difficulties by (i) employing vector quantization techniques to reduce the storage of reference patterns, and (ii) eliminating the need for dynamic time warping, which reduces the computational complexity of the system.
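The vector quantization idea can be sketched as follows; the codebook size, the use of k-means training and the distortion-based scoring are assumptions about a typical VQ recognizer, not the thesis's exact design.

    # Sketch of a VQ-based recogniser in the spirit described above: one small
    # codebook per vocabulary word replaces the multiple reference templates,
    # and an unknown utterance is scored by its mean quantisation distortion
    # against each codebook, with no time alignment. Codebook size and k-means
    # training are assumptions, not the thesis's exact design.
    import numpy as np
    from sklearn.cluster import KMeans

    def train_codebooks(training_frames, codebook_size=32):
        """training_frames: word -> (frames, dims) array pooled over speakers."""
        return {w: KMeans(n_clusters=codebook_size, n_init=10).fit(f).cluster_centers_
                for w, f in training_frames.items()}

    def recognise(frames, codebooks):
        """Return the word whose codebook quantises the utterance best."""
        def distortion(cb):
            d = np.linalg.norm(frames[:, None, :] - cb[None, :, :], axis=2)
            return d.min(axis=1).mean()
        return min(codebooks, key=lambda w: distortion(codebooks[w]))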
Finally, a method of identifying the acoustic structure of an utterance in terms of voiced, unvoiced, and silence segments by using fuzzy set theory is proposed. The acoustic structure is then employed to enhance the recognition accuracy of a conventional isolated word recognizer.
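A toy fuzzy voiced/unvoiced/silence labeller is sketched below; the membership shapes and thresholds are illustrative assumptions rather than the rules developed in the thesis.

    # Toy sketch of fuzzy V/UV/S labelling: frame energy and zero-crossing rate
    # receive fuzzy memberships and the label with the largest combined
    # membership wins. Membership shapes and thresholds are illustrative only.
    import numpy as np

    def ramp(x, lo, hi):
        """Piecewise-linear membership rising from 0 at `lo` to 1 at `hi`."""
        return float(np.clip((x - lo) / (hi - lo), 0.0, 1.0))

    def classify_frame(energy, zcr):
        low_energy = 1.0 - ramp(energy, 0.01, 0.10)
        high_energy = ramp(energy, 0.05, 0.30)
        high_zcr = ramp(zcr, 0.10, 0.40)
        memberships = {
            "silence": low_energy,
            "unvoiced": min(1.0 - low_energy, high_zcr),
            "voiced": min(high_energy, 1.0 - high_zcr),
        }
        return max(memberships, key=memberships.get)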
An investigation into glottal waveform based speech coding
Coding of voiced speech by extraction of the glottal waveform has shown promise in improving the efficiency of speech coding systems. This thesis describes an investigation into the performance of such a system.
The effect of reverberation on the radiation impedance at the lips is shown to be negligible under normal conditions. Also, the accuracy of the Image Method for adding artificial reverberation to anechoic speech recordings is established.
A new algorithm, Pre-emphasised Maximum Likelihood Epoch Detection (PMLED), for Glottal Closure Instant detection is proposed. The algorithm is tested on natural speech and is shown to be both accurate and robust.
Two techniques for glottal waveform estimation, Closed Phase Inverse Filtering (CPIF) and Iterative Adaptive Inverse Filtering (IAIF), are compared. In tandem with an LF model fitting procedure, both techniques display a high degree of accuracy; however, IAIF is found to be slightly more robust.
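For orientation, a greatly simplified single-pass inverse-filtering sketch is given below: a linear-prediction estimate of the vocal tract is inverted to approximate the glottal source. It is neither CPIF (which restricts analysis to the closed phase) nor IAIF (which iterates the estimate); the frame handling, LPC order and windowing are assumptions.

    # Greatly simplified inverse-filtering sketch: estimate an all-pole vocal
    # tract model by linear prediction and apply its inverse A(z) to the frame,
    # leaving a residual that roughly approximates the glottal source. Real
    # CPIF/IAIF analyses are considerably more careful; this is orientation only.
    import numpy as np
    from scipy.signal import lfilter

    def lpc(frame, order):
        """Autocorrelation-method LPC coefficients via the Levinson-Durbin recursion."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
            k = -acc / err
            new_a = a.copy()
            for j in range(1, i):
                new_a[j] = a[j] + k * a[i - j]
            new_a[i] = k
            a = new_a
            err *= (1.0 - k * k)
        return a

    def glottal_estimate(frame, order=12):
        """Inverse-filter one voiced frame with its LPC polynomial."""
        a = lpc(frame * np.hamming(len(frame)), order)
        return lfilter(a, [1.0], frame)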
Based on these results, a Glottal Excited Linear Predictive (GELP) coding system for voiced speech is proposed and tested. Using a differential LF parameter quantisation scheme, the system achieves speech quality similar to that of US Federal Standard 1016 CELP at a lower mean bit rate while incurring no extra delay.