
    Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema

    In this paper, a psychologically-inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions become distinguishable from one another. Extracted features are related to statistics of pitch, formants, and energy contours, as well as spectrum, cepstrum, perceptual and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. Selected features are fed as input to a K-nearest-neighbor classifier and to support vector machines. Two kernels are tested for the latter: linear and Gaussian radial basis function. The recently proposed speaker-independent experimental protocol is tested on the Berlin emotional speech database for each gender separately. The best emotion recognition accuracy, achieved by support vector machines with the linear kernel, equals 87.7%, outperforming state-of-the-art approaches. Statistical analysis is carried out first with respect to the classifiers' error rates and then to evaluate the information expressed by the classifiers' confusion matrices. © Springer Science+Business Media, LLC 2011
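    To make the cascade idea concrete, here is a minimal sketch of a two-stage binary cascade with scikit-learn: a first linear SVM separates two coarse emotion groups, and second-stage SVMs resolve the emotions within each group. The arousal-based grouping and the random data are illustrative assumptions, not the paper's exact hierarchy or feature set.

```python
# Minimal binary-cascade sketch; the high/low-arousal split is an assumed
# illustration, not the paper's exact hierarchy.
import numpy as np
from sklearn.svm import SVC

HIGH = {"anger", "happiness", "fear"}    # coarse group A (assumed)
LOW = {"sadness", "boredom", "neutral"}  # coarse group B (assumed)

def fit_cascade(X, y):
    y = np.asarray(y)
    stage1 = SVC(kernel="linear").fit(X, np.isin(y, list(HIGH)))  # group split
    hi, lo = np.isin(y, list(HIGH)), np.isin(y, list(LOW))
    stage2_hi = SVC(kernel="linear").fit(X[hi], y[hi])            # within group A
    stage2_lo = SVC(kernel="linear").fit(X[lo], y[lo])            # within group B
    return stage1, stage2_hi, stage2_lo

def predict_cascade(model, X):
    stage1, stage2_hi, stage2_lo = model
    out = np.empty(len(X), dtype=object)
    is_hi = stage1.predict(X).astype(bool)
    if is_hi.any():
        out[is_hi] = stage2_hi.predict(X[is_hi])
    if (~is_hi).any():
        out[~is_hi] = stage2_lo.predict(X[~is_hi])
    return out

# Toy demo with random "features" standing in for the extracted acoustic set
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
y = rng.choice(sorted(HIGH | LOW), size=120)
print(predict_cascade(fit_cascade(X, y), X[:5]))
```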

    Proposing a hybrid approach for emotion classification using audio and video data

    Emotion recognition has been a research topic in the field of Human-Computer Interaction (HCI) during recent years. Computers have become an inseparable part of human life, and users need human-like interaction to communicate better with them. Many researchers have become interested in emotion recognition and classification using different sources; a hybrid approach of audio and text has recently been introduced. All such approaches aim to raise the accuracy and appropriateness of emotion classification. In this study, a hybrid approach of audio and video is applied for emotion recognition. The innovation of this approach is the selection of audio and video characteristics and their features as a single specification for classification. In this research, the SVM method is used for classifying the data in the SAVEE database. The experimental results show that the maximum classification accuracy for audio data is 91.63%, while with the hybrid approach the accuracy achieved is 99.26%.
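    As an illustration of the kind of feature-level fusion described here, the sketch below concatenates per-clip audio and video feature vectors into one representation and trains an SVM on it. The feature dimensions and random data are placeholders; only the fuse-then-classify structure follows the abstract.

```python
# Feature-level audio+video fusion sketch; dimensions and data are placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_clips = 120
audio_feats = rng.normal(size=(n_clips, 40))  # e.g. prosodic/cepstral stats (placeholder)
video_feats = rng.normal(size=(n_clips, 60))  # e.g. facial landmark stats (placeholder)
labels = rng.integers(0, 7, size=n_clips)     # SAVEE has 7 emotion classes

X = np.hstack([audio_feats, video_feats])     # the fused per-clip representation
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, labels)
print(clf.predict(X[:5]))
```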

    Consonant gemination in Italian: the affricate and fricative case

    Consonant gemination in Italian affricates and fricatives was investigated, completing the overall study of gemination of Italian consonants. Results of the analysis of the other consonant categories, i.e. stops, nasals, and liquids, showed that closure duration for stops, and consonant duration for nasals and liquids, form the most salient acoustic cues to gemination. Frequency- and energy-domain parameters were not significantly affected by gemination in a systematic way across all consonant classes. Results on fricatives and affricates confirmed the above findings, i.e., that the primary acoustic correlate of gemination is durational in nature and corresponds to a lengthened consonant duration for fricative geminates and a lengthened closure duration for affricate geminates. An inverse correlation between consonant and pre-consonant vowel durations was present for both consonant categories, and also for both singleton and geminate word sets when considered separately. This effect was reinforced for the combined sets, confirming the hypothesis that a durational compensation between different phonemes may serve to preserve rhythmical structures. Classification tests of singleton vs. geminate consonants using the durational acoustic cues as classification parameters confirmed their validity, and highlighted peculiarities of the two consonant classes. In particular, a relatively poor classification performance was observed for affricates, which led to refining the analysis by considering dental vs. non-dental affricates as two different sets. Results support the hypothesis that dental affricates, in Italian, may not appear in intervocalic position as singletons but only in their geminate form. (Comment: Submitted to Speech Communication. arXiv admin note: substantial text overlap with arXiv:2005.0696)
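    As a small illustration of duration-based classification of singletons vs. geminates, the sketch below trains a linear discriminant on two durational cues: consonant (or closure) duration and pre-consonant vowel duration. The durations are invented toy values, not the paper's measurements.

```python
# Singleton vs. geminate decision from durational cues; values are toy numbers.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# columns: consonant duration (ms), preceding vowel duration (ms)
X = np.array([[90, 110], [100, 105], [95, 120],  # singletons: shorter C, longer V
              [180, 70], [200, 65], [170, 80]])  # geminates: longer C, shorter V
y = np.array([0, 0, 0, 1, 1, 1])                 # 0 = singleton, 1 = geminate

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict([[160, 75]]))                  # likely [1]: classified as geminate
```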

    Automatic Detection of Laryngeal Pathology on Sustained Vowels Using Short-Term Cepstral Parameters: Analysis of Performance and Theoretical Justification

    The majority of speech signal analysis procedures for automatic detection of laryngeal pathologies rely mainly on parameters extracted from time-domain processing. Moreover, calculation of these parameters often requires prior pitch period estimation; therefore, their validity heavily depends on the robustness of pitch detection. Within this paper, an alternative approach based on cepstral-domain processing is presented which has the advantage of not requiring pitch estimation, thus providing a gain in both simplicity and robustness. While the proposed scheme is similar to solutions based on Mel-frequency cepstral parameters already present in the literature, it has an easier physical interpretation while achieving similar performance standards.
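    The sketch below shows the kind of short-term cepstral analysis such an approach relies on, with no pitch estimation involved: per frame, the real cepstrum is the inverse FFT of the log magnitude spectrum, and the low-quefrency coefficients summarize gross spectral (vocal tract) shape. The frame sizes and the synthetic signal are illustrative choices, not the paper's configuration.

```python
# Short-term real-cepstrum features per frame; no pitch estimate required.
import numpy as np

fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 150 * t) + 0.3 * np.random.default_rng(0).normal(size=fs)

frame_len, hop = 512, 256
window = np.hamming(frame_len)
ceps = []
for start in range(0, len(signal) - frame_len, hop):
    frame = signal[start:start + frame_len] * window
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # avoid log(0)
    c = np.fft.irfft(log_mag)                             # real cepstrum
    ceps.append(c[:20])                                   # keep low-quefrency coefficients
ceps = np.array(ceps)                                     # frames x 20 cepstral features
print(ceps.shape)
```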

    Do red deer stags (Cervus elaphus) use roar fundamental frequency (F0) to assess rivals?

    It is well established that in humans, male voices are disproportionately lower pitched than female voices, and recent studies suggest that this dimorphism in fundamental frequency (F0) results from both intrasexual (male competition) and intersexual (female mate choice) selection for lower pitched voices in men. However, comparative investigations indicate that sexual dimorphism in F0 is not universal in terrestrial mammals. In the highly polygynous and sexually dimorphic Scottish red deer Cervus elaphus scoticus, more successful males give sexually-selected calls (roars) with higher minimum F0s, suggesting that high, rather than low, F0s advertise quality in this subspecies. While playback experiments demonstrated that oestrous females prefer higher pitched roars, the potential role of roar F0 in male competition remains untested. Here we examined the response of rutting red deer stags to playbacks of re-synthesized male roars with different median F0s. Our results show that stags’ responses (latencies and durations of attention, vocal and approach responses) were not affected by the F0 of the roar. This suggests that intrasexual selection is unlikely to strongly influence the evolution of roar F0 in Scottish red deer stags, and illustrates how the F0 of terrestrial mammal vocal sexual signals may be subject to different selection pressures across species. Further investigations on species characterized by different F0 profiles are needed to provide a comparative background for evolutionary interpretations of sex differences in mammalian vocalizations.

    Speaker-independent negative emotion recognition

    This work aims to provide a method able to distinguish between negative and non-negative emotions in vocal interaction. A large pool of 1418 features is extracted for that purpose. Several of those features are tested in emotion recognition for the first time. Next, feature selection is applied separately to male and female utterances. In particular, a bidirectional Best First search with backtracking is applied. The first contribution is the demonstration that a significant number of features, first tested here, are retained after feature selection. The selected features are then fed as input to support vector machines with various kernel functions as well as to the K nearest neighbors classifier. The second contribution is in the speaker-independent experiments conducted in order to cope with the limited number of speakers present in the commonly used emotion speech corpora. Speaker-independent systems are known to be more robust and to generalize better than speaker-dependent ones. Experimental results are reported for the Berlin emotional speech database. The best performing classifier is found to be the support vector machine with the Gaussian radial basis function kernel. Correctly classified utterances are 86.73%±3.95% for male subjects and 91.73%±4.18% for female subjects. The last contribution is the statistical analysis of the performance of the support vector machine classifier against the K nearest neighbors classifier, as well as the statistical analysis of the impact of the various support vector machine kernels. © 2010 IEEE
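    A minimal sketch of the overall pipeline, feature selection followed by an RBF-kernel SVM, is given below. scikit-learn's greedy SequentialFeatureSelector stands in for the bidirectional Best First search with backtracking used in the paper, and the random data is a small placeholder for the 1418-feature pool.

```python
# Feature selection + RBF-SVM pipeline sketch; greedy forward selection is an
# assumed stand-in for the paper's bidirectional Best First search.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))        # placeholder for the 1418-feature pool
y = rng.integers(0, 2, size=200)      # negative vs. non-negative emotion

selector = SequentialFeatureSelector(
    SVC(kernel="rbf"), n_features_to_select=5, cv=5
).fit(X, y)
clf = SVC(kernel="rbf").fit(selector.transform(X), y)  # train on retained features
print(selector.get_support().sum(), "features retained")
```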

    Determination of Formant Features in Czech and Slovak for GMM Emotional Speech Classifier

    The paper is aimed at the determination of formant features (FF), which describe vocal tract characteristics. It comprises analysis of the first three formant positions together with their bandwidths and the formant tilts. Subsequently, statistical evaluation and comparison of the FF were performed. The experiment used speech material in the form of sentences by male and female speakers expressing four emotional states (joy, sadness, anger, and a neutral state) in the Czech and Slovak languages. The statistical distribution of the analyzed formant frequencies and formant tilts shows good differentiation between neutral and emotional styles for both voices. In contrast, the values of the formant 3-dB bandwidths show no correlation with the type of speaking style or the type of voice. These spectral parameters, together with the values of other speech characteristics, were used in the feature vector for the Gaussian mixture model (GMM) emotional speech style classifier currently under development. The overall mean classification error rate is about 18%, and the best obtained error rate is 5%, for the sadness style of the female voice. These values are acceptable at this first stage of development of the GMM classifier, which is intended for evaluating synthetic speech quality after voice conversion and emotional speech style transformation.
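    To illustrate the classifier side, the sketch below fits one Gaussian mixture per emotional style and assigns a test feature vector to the style with the highest log-likelihood, the standard GMM classification scheme. The nine-dimensional "formant feature" vectors are synthetic placeholders for formant positions, bandwidths, and tilts.

```python
# One GMM per emotion; classify by maximum log-likelihood. Data is synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
emotions = ["neutral", "joy", "sadness", "anger"]
# toy "formant feature" vectors: 9 dims per utterance, shifted per class
train = {e: rng.normal(loc=i, size=(100, 9)) for i, e in enumerate(emotions)}

models = {e: GaussianMixture(n_components=4, random_state=0).fit(feats)
          for e, feats in train.items()}

def classify(x):
    scores = {e: m.score(x.reshape(1, -1)) for e, m in models.items()}  # log-likelihoods
    return max(scores, key=scores.get)

print(classify(rng.normal(loc=3, size=9)))  # likely "anger" under this toy data
```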

    Finding the Most Uniform Changes in Vowel Polygon Caused by Psychological Stress

    Vowel polygons, specifically their parameters, are chosen as the criterion for capturing differences between a speaker's normal state and speech under real psychological stress. All results were obtained experimentally with purpose-built software for vowel polygon analysis applied to the ExamStress database. Six selected methods based on cross-correlation of different features were ranked by the coefficient of variation, and for each individual vowel polygon an efficiency coefficient marking the most significant and uniform differences between stressed and normal speech was calculated. The best method for observing the generated differences turned out to be the one based on the mean of cross-correlation values obtained for the difference-area value paired with the vector length and angle parameters. Overall, the best results for stress detection are achieved by the /i/-/o/-/u/ and /a/-/i/-/o/ vowel triangles in formant planes that combine the fifth formant F5 with the other formants.
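    Two of the basic quantities this analysis builds on can be sketched directly: the area of a vowel triangle in a formant plane (here /i/-/o/-/u/ in an illustrative F1-F2 plane rather than the F5-based planes the study favors) and the coefficient of variation used to rank the methods. The formant values are toy numbers.

```python
# Vowel-triangle area (shoelace formula) and coefficient of variation.
import numpy as np

def triangle_area(p1, p2, p3):
    """Area of a triangle given (Fx, Fy) vertices in a formant plane."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

def coeff_of_variation(values):
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

# /i/-/o/-/u/ vertices as (F1, F2) in Hz; normal vs. stressed speech (toy numbers)
normal = [(300, 2300), (500, 900), (350, 800)]
stressed = [(340, 2150), (540, 950), (390, 870)]
areas = [triangle_area(*normal), triangle_area(*stressed)]
print(areas, coeff_of_variation(areas))
```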

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Identification

    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. Such a transformation enables speech to be understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models. National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624)
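    The model itself is neural (strip maps plus ART categorization), but the goal of a pitch-independent vowel representation that still carries pitch information can be illustrated at the signal level with plain cepstral separation: low quefrencies capture the vocal tract envelope, while pitch appears as a distinct higher-quefrency peak. This is an assumed stand-in illustration, not the paper's mechanism, and the synthetic vowel and all constants are toy choices.

```python
# Cepstral separation of vocal-tract envelope (low quefrency) from pitch
# (higher-quefrency peak); a signal-level stand-in, not the neural model.
import numpy as np

fs, f0 = 16000, 120                        # sampling rate; 120 Hz toy "male" pitch
t = np.arange(2048) / fs
formants = [700, 1200]                     # crude /a/-like resonances
# harmonics of f0 with amplitudes shaped by the formant resonances
signal = sum(np.sin(2 * np.pi * k * f0 * t)
             * sum(np.exp(-((k * f0 - F) / 150) ** 2) for F in formants)
             for k in range(1, 60))

log_mag = np.log(np.abs(np.fft.rfft(signal * np.hamming(len(signal)))) + 1e-10)
ceps = np.fft.irfft(log_mag)
envelope = ceps[:30]                       # pitch-independent vocal-tract part
pitch_q = np.argmax(ceps[40:400]) + 40     # quefrency of the pitch peak
print("estimated F0:", fs / pitch_q)       # ~120 Hz for this synthetic vowel
```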
