    Exploitation of Phase-Based Features for Whispered Speech Emotion Recognition

    Features for speech emotion recognition are usually dominated by spectral magnitude information, while the phase spectrum is ignored because it is difficult to interpret properly. Motivated by recent successes of phase-based features in speech processing, this paper investigates the effectiveness of phase information for whispered speech emotion recognition. We select two types of phase-based features (i.e., modified group delay features and all-pole group delay features), both of which have shown wide applicability across speech analysis tasks and are studied here for whispered speech emotion recognition. Building on these features, we propose a new speech emotion recognition framework that employs the outer product in combination with power and L2 normalization. This technique encodes a variable-length sequence of phase-based features into a vector of fixed dimension, regardless of the length of the input sequence. The resulting representation is used to train a classifier with a linear kernel. Experimental results on the Geneva Whispered Emotion Corpus, which includes normal and whispered phonation, demonstrate the effectiveness of the proposed method compared with other modern systems. It is also shown that combining phase information with magnitude information can significantly improve performance over common systems that adopt magnitude information alone.
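
    The pooling step can be sketched compactly. The following is a minimal NumPy sketch under our reading of the abstract: frames are aggregated by an averaged outer product, then signed power normalization and L2 normalization yield a fixed-dimensional vector. The exponent `alpha` and the averaging are our assumptions, not values given by the paper.

```python
import numpy as np

def encode_outer_product(frames, alpha=0.5):
    """Encode a variable-length sequence of feature frames (T x d) into a
    fixed-length vector via outer-product pooling followed by power and
    L2 normalization. alpha = 0.5 is an illustrative assumption."""
    X = np.asarray(frames)                  # shape (T, d)
    R = X.T @ X / X.shape[0]                # averaged outer product, (d, d)
    v = R.flatten()                         # fixed d*d-dimensional vector
    v = np.sign(v) * np.abs(v) ** alpha     # signed power normalization
    v /= np.linalg.norm(v) + 1e-12          # L2 normalization
    return v

# Sequences of any length map to the same output dimensionality,
# ready for a linear-kernel classifier:
short = encode_outer_product(np.random.randn(50, 24))
long_ = encode_outer_product(np.random.randn(900, 24))
assert short.shape == long_.shape == (24 * 24,)
```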

    Nonlinear Dynamic Invariants for Continuous Speech Recognition

    In this work, nonlinear acoustic information is combined with traditional linear acoustic information in order to produce a noise-robust set of features for speech recognition. Classical acoustic modeling techniques for speech recognition have relied on a standard assumption of linear acoustics, where signal processing is primarily performed in the signal's frequency domain. While these conventional techniques have demonstrated good performance under controlled conditions, their performance suffers significant degradation when the acoustic data is contaminated with previously unseen noise. The objective of this thesis was to determine whether nonlinear dynamic invariants are able to boost speech recognition performance when combined with traditional acoustic features. Several sets of experiments are used to evaluate both clean and noisy speech data. The invariants resulted in a maximum relative increase of 11.1% for the clean evaluation set. However, an average relative decrease of 7.6% was observed for the noise-contaminated evaluation sets. The fact that recognition performance decreased with the use of dynamic invariants suggests that additional research is required for robust filtering of phase spaces constructed from noisy time series.
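
    The phase spaces mentioned in the conclusion are conventionally reconstructed from a scalar time series by time-delay embedding; dynamic invariants such as the correlation dimension or Lyapunov exponents are then estimated from the embedded points. A minimal sketch follows; the embedding dimension and lag are illustrative choices, not those used in the thesis.

```python
import numpy as np

def delay_embed(x, dim=3, tau=5):
    """Reconstruct a phase space from a scalar time series via time-delay
    embedding: each row is [x(t), x(t+tau), ..., x(t+(dim-1)*tau)].
    dim and tau here are illustrative, not the thesis's settings."""
    x = np.asarray(x)
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

# A noisy sinusoid: invariants would be estimated from these points,
# and noise in x directly distorts the reconstructed geometry.
t = np.linspace(0, 20 * np.pi, 4000)
points = delay_embed(np.sin(t) + 0.05 * np.random.randn(t.size))
print(points.shape)    # (3990, 3)
```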

    Impaired extraction of speech rhythm from temporal modulation patterns in speech in developmental dyslexia

    Dyslexia is associated with impaired neural representation of the sound structure of words (phonology). The “phonological deficit” in dyslexia may arise in part from impaired speech rhythm perception, thought to depend on neural oscillatory phase-locking to slow amplitude modulation (AM) patterns in the speech envelope. Speech contains AM patterns at multiple temporal rates, and these different AM rates are associated with phonological units of different grain sizes, e.g., related to stress, syllables or phonemes. Here, we assess the ability of adults with dyslexia to use speech AMs to identify rhythm patterns (RPs). We study three important temporal rates: “Stress” (~2 Hz), “Syllable” (~4 Hz) and “Sub-beat” (reduced syllables, ~14 Hz). Twenty-one dyslexics and 21 controls listened to nursery rhyme sentences that had been tone-vocoded using either single AM rates from the speech envelope (Stress only, Syllable only, Sub-beat only) or pairs of AM rates (Stress + Syllable, Syllable + Sub-beat). They were asked to use the acoustic rhythm of the stimulus to identify the original nursery rhyme sentence. The data showed that dyslexics were significantly poorer at detecting rhythm compared to controls when they had to utilize multi-rate temporal information from pairs of AMs (Stress + Syllable or Syllable + Sub-beat). These data suggest that dyslexia is associated with a reduced ability to utilize AM rates below 20 Hz for rhythm recognition. This perceptual deficit in utilizing AM patterns in speech could be underpinned by less efficient neuronal phase alignment and cross-frequency neuronal oscillatory synchronization in dyslexia. Dyslexics’ perceptual difficulties in capturing the full spectro-temporal complexity of speech over multiple timescales could contribute to the development of impaired phonological representations for words, the cognitive hallmark of dyslexia across languages.
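
    The AM patterns at different temporal rates can be illustrated with a simple envelope-extraction sketch: take the broadband amplitude envelope and band-limit it to a chosen modulation-rate band. This is only a schematic illustration; the study built its stimuli with its own tone-vocoder filterbank, and the filter design and band edges below are assumptions.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def am_pattern(signal, fs, lo_hz, hi_hz):
    """Extract the amplitude-modulation pattern of a signal within one
    modulation-rate band, e.g. ~2 Hz 'Stress', ~4 Hz 'Syllable' or
    ~14 Hz 'Sub-beat'. Filter choices are illustrative assumptions."""
    envelope = np.abs(hilbert(signal))      # broadband amplitude envelope
    sos = butter(2, [lo_hz, hi_hz], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, envelope)       # rate-limited AM pattern

fs = 16000
t = np.arange(2 * fs) / fs
# Noise carrier modulated at a syllable-like 4 Hz rate:
speechlike = np.random.randn(t.size) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
syllable_am = am_pattern(speechlike, fs, 2.5, 7.0)
```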

    Demonstration of a prototype for a conversational companion for reminiscing about images

    This work was funded by the Companions project (2006-2009), sponsored by the European Commission as part of the Information Society Technologies (IST) programme under EC grant number IST-FP6-034434. This paper describes an initial prototype demonstrator of a Companion, designed as a platform for novel approaches to the following: 1) the use of Information Extraction (IE) techniques to extract the content of incoming dialogue utterances after an Automatic Speech Recognition (ASR) phase; 2) the conversion of the input to Resource Descriptor Format (RDF) to allow the generation of new facts from existing ones, under the control of a Dialogue Manager (DM) that also has access to stored knowledge and to open knowledge accessed in real time from the web, all in RDF form; 3) a DM implemented as a stack-and-network virtual machine that models mixed initiative in dialogue control; and 4) a tuned dialogue act detector based on corpus evidence. The prototype platform was evaluated, and we describe this briefly; it is also designed to support more extensive forms of emotion detection carried by both speech and lexical content, as well as extended forms of machine learning.
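
    To make item 2 concrete, the sketch below shows how a fact extracted from an utterance might be stored as RDF triples and extended with a derived fact. It uses the rdflib Python library; the namespace, predicates, and the toy inference rule are hypothetical, since the abstract does not specify the Companions vocabulary.

```python
from rdflib import Graph, Literal, Namespace

# Hypothetical vocabulary; the project's actual RDF schema is not given.
EX = Namespace("http://example.org/companion/")

g = Graph()
# Fact extracted by IE from an ASR-transcribed utterance such as
# "This is my sister Anna at the beach":
photo, anna = EX["photo42"], EX["Anna"]
g.add((photo, EX.depicts, anna))
g.add((anna, EX.relationTo, Literal("sister")))

# The DM can derive new facts from stored ones, e.g. a toy rule that
# anyone depicted in a photo is a known acquaintance:
for person in list(g.objects(photo, EX.depicts)):
    g.add((person, EX.status, Literal("acquaintance")))

print(g.serialize(format="turtle"))
```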

    Development of an Afrikaans test for sentence recognition in noise

    Speech audiometry is considered an essential tool in the assessment of hearing, not only to validate pure tone measurements, but also to indicate speech perception as a critical communicative function. The use of sentence material in the assessment of speech perception has great value, as it simulates, more closely than single words, the type of speech stimuli that listeners are confronted with on a daily basis. In South Africa, speech recognition (reception and discrimination) abilities are most commonly assessed through the use of single words, presented by monitored live voice, a practice sternly criticized in the literature. Furthermore, speech recognition is commonly evaluated in an ideal (quiet) listening environment. This method gives an incomplete impression of a patient’s auditory performance, since everyday listening situations are often characterised by the presence of background noise that influences comprehension of speech. The present study was therefore launched with the aim of developing a reliable measure of speech recognition in noise using Afrikaans sentence material. The development of the test was conducted in three phases. The first phase entailed the compilation of culturally valid, pre-recorded Afrikaans sentence material. During the second phase the uniformity of the recorded sentence collection was improved by determining the intelligibility of each sentence in the presence of noise and eliminating sentences that were not of equivalent difficulty in this regard. The objective of the third phase was to arrange the sentence material into lists using two different methods of list compilation. The first method involved grouping sentences together based solely on their intelligibility in noise (as assessed in the previous phase). The second method was the well-documented method of compiling phonetically balanced lists. The inter-list reliability of both sets of lists was evaluated in both normal-hearing listeners and listeners with a simulated high frequency hearing loss. The results provided valuable information on the process of developing a test of speech recognition in noise, especially in terms of options for list compilation. Findings indicated that lists compiled according to intelligibility in noise showed a higher degree of equivalence than phonetically balanced lists when applied to normal-hearing listeners. However, when applied to listeners with a simulated loss, phonetically balanced lists displayed greater equivalence. The developed test provides a means of assessing speech recognition in noise in Afrikaans, and shows potential for application in the assessment of hearing-impaired populations, individuals with auditory processing difficulties, and the paediatric population. In addition, the methodology described for the development of the test could provide a valuable guideline for future researchers looking to develop similar tests in other languages.
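
    The first list-compilation method, grouping sentences by their measured intelligibility in noise, can be illustrated with a generic balancing heuristic. The dissertation does not give its exact procedure, so the greedy assignment below is purely an assumed illustration: sentences are sorted by score and each is placed in the not-yet-full list with the lowest running total, pushing list means toward equivalence.

```python
import numpy as np

def compile_lists(scores, n_lists):
    """Greedy balancing: assign sentences, hardest-to-easiest, to the
    open list with the lowest running score total. An illustrative
    heuristic, not the dissertation's actual procedure."""
    scores = np.asarray(scores)
    size = len(scores) // n_lists
    lists, totals = [[] for _ in range(n_lists)], np.zeros(n_lists)
    for idx in np.argsort(scores)[::-1]:
        open_lists = [k for k in range(n_lists) if len(lists[k]) < size]
        if not open_lists:                  # leftovers beyond full lists
            break
        k = min(open_lists, key=lambda j: totals[j])
        lists[k].append(int(idx))
        totals[k] += scores[idx]
    return lists

scores = np.random.uniform(0.4, 0.9, size=120)   # per-sentence % correct
lists = compile_lists(scores, n_lists=6)
print([round(float(np.mean(scores[l])), 3) for l in lists])  # near-equal means
```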

    Improving Phoneme Sequence Recognition using Phoneme Duration Information in DNN-HSMM

    Improving phoneme recognition has attracted the attention of many researchers due to its applications in various fields of speech processing. Recent research shows that using deep neural networks (DNNs) in speech recognition systems significantly improves their performance. DNN-based phoneme recognition systems have two phases, training and testing. Most previous research attempted to improve the training phase, e.g., training algorithms, network types, network architecture, and feature types. In this study, we instead focus on the test phase, i.e., generating the phoneme sequence, which is also essential for good phoneme recognition accuracy. Past research used the Viterbi algorithm on hidden Markov models (HMMs) to generate phoneme sequences, and we address an important problem associated with this method: the HMM implicitly assumes a geometric distribution for state durations. To overcome this, we use a real duration probability distribution for each phoneme with the aid of a hidden semi-Markov model (HSMM). We also represent each phoneme with only one state to simplify the use of phoneme duration information in the HSMM. Furthermore, we investigate the performance of a post-processing method that corrects the phoneme sequence obtained from the neural network based on our knowledge about phonemes. Experimental results on the Persian FarsDat corpus show that using the extended Viterbi algorithm on the HSMM achieves phoneme recognition accuracy improvements of 2.68% and 0.56% over conventional methods using Gaussian mixture model-hidden Markov models (GMM-HMMs) and Viterbi on HMMs, respectively. The post-processing method also increases accuracy relative to the unprocessed output.
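
    The core decoding idea, an extended Viterbi search over an explicit-duration HSMM in which each phoneme is one state, can be sketched as follows. This is a simplified NumPy illustration of the general algorithm, not the paper's implementation: frame log-likelihoods stand in for scaled DNN posteriors, and the duration table plays the role of the learned per-phoneme duration distributions.

```python
import numpy as np

def hsmm_viterbi(ll, logA, logdur, max_dur):
    """Explicit-duration Viterbi. ll[t, s]: frame log-likelihoods;
    logA[s', s]: transition log-probs (no self-transitions);
    logdur[s, d-1]: log-prob that state s lasts d frames."""
    T, S = ll.shape
    cum = np.vstack([np.zeros(S), np.cumsum(ll, axis=0)])   # prefix sums
    delta = np.full((T + 1, S), -np.inf)
    delta[0] = 0.0                                          # any start state
    back = np.zeros((T + 1, S, 2), dtype=int)               # (prev state, dur)
    for t in range(1, T + 1):
        for d in range(1, min(max_dur, t) + 1):
            seg = cum[t] - cum[t - d]                       # segment score
            prev = delta[t - d][:, None] + logA             # indexed [s', s]
            cand = prev.max(axis=0) + logdur[:, d - 1] + seg
            upd = cand > delta[t]
            delta[t, upd] = cand[upd]
            back[t, upd] = np.stack([prev.argmax(axis=0)[upd],
                                     np.full(upd.sum(), d)], axis=1)
    segs, t, s = [], T, int(delta[T].argmax())              # trace back
    while t > 0:
        s_prev, d = back[t, s]
        segs.append((s, t - d, t))                          # (phone, start, end)
        t, s = t - d, int(s_prev)
    return segs[::-1]

rng = np.random.default_rng(0)
T, S, D = 200, 40, 30
ll = np.log(rng.dirichlet(np.ones(S), size=T))
logA = np.log(np.full((S, S), 1.0 / S))
np.fill_diagonal(logA, -np.inf)                             # HSMM: no self-loops
logdur = np.log(rng.dirichlet(np.ones(D), size=S))
print(hsmm_viterbi(ll, logA, logdur, D)[:5])
```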