3,897 research outputs found

    Syllable classification using static matrices and prosodic features

    Get PDF
    In this paper we explore the usefulness of prosodic features for syllable classification. In order to do this, we represent the syllable as a static analysis unit such that its acoustic-temporal dynamics could be merged into a set of features that the SVM classifier will consider as a whole. In the first part of our experiment we used MFCC as features for classification, obtaining a maximum accuracy of 86.66%. The second part of our study tests whether the prosodic information is complementary to the cepstral information for syllable classification. The results obtained show that combining the two types of information does improve the classification, but further analysis is necessary for a more successful combination of the two types of features

    Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification

    Full text link
    In this paper a novel cross-device text-independent speaker verification architecture is proposed. Majority of the state-of-the-art deep architectures that are used for speaker verification tasks consider Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependency of the adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved to be highly reliable in speaker verification models, they only represent some aspects of short-term acoustic level traits of the speaker's voice. However, the human voice consists of several linguistic levels such as acoustic, lexicon, prosody, and phonetics, that can be utilized in speaker verification models. To compensate for these inherited shortcomings in spectro-temporal features, we propose to enhance the proposed Siamese convolutional neural network architecture by deploying a multilayer perceptron network to incorporate the prosodic, jitter, and shimmer features. The proposed end-to-end verification architecture performs feature extraction and verification simultaneously. This proposed architecture displays significant improvement over classical signal processing approaches and deep algorithms for forensic cross-device speaker verification.Comment: Accepted in 9th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2018

    Feature extraction based on bio-inspired model for robust emotion recognition

    Get PDF
    Emotional state identification is an important issue to achieve more natural speech interactive systems. Ideally, these systems should also be able to work in real environments in which generally exist some kind of noise. Several bio-inspired representations have been applied to artificial systems for speech processing under noise conditions. In this work, an auditory signal representation is used to obtain a novel bio-inspired set of features for emotional speech signals. These characteristics, together with other spectral and prosodic features, are used for emotion recognition under noise conditions. Neural models were trained as classifiers and results were compared to the well-known mel-frequency cepstral coefficients. Results show that using the proposed representations, it is possible to significantly improve the robustness of an emotion recognition system. The results were also validated in a speaker independent scheme and with two emotional speech corpora.Fil: Albornoz, Enrique Marcelo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Rufiner, Hugo Leonardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentin

    Emotion Recognition from Acted and Spontaneous Speech

    Get PDF
    Dizertační práce se zabývá rozpoznáním emočního stavu mluvčích z řečového signálu. Práce je rozdělena do dvou hlavních častí, první část popisuju navržené metody pro rozpoznání emočního stavu z hraných databází. V rámci této části jsou představeny výsledky rozpoznání použitím dvou různých databází s různými jazyky. Hlavními přínosy této části je detailní analýza rozsáhlé škály různých příznaků získaných z řečového signálu, návrh nových klasifikačních architektur jako je například „emoční párování“ a návrh nové metody pro mapování diskrétních emočních stavů do dvou dimenzionálního prostoru. Druhá část se zabývá rozpoznáním emočních stavů z databáze spontánní řeči, která byla získána ze záznamů hovorů z reálných call center. Poznatky z analýzy a návrhu metod rozpoznání z hrané řeči byly využity pro návrh nového systému pro rozpoznání sedmi spontánních emočních stavů. Jádrem navrženého přístupu je komplexní klasifikační architektura založena na fúzi různých systémů. Práce se dále zabývá vlivem emočního stavu mluvčího na úspěšnosti rozpoznání pohlaví a návrhem systému pro automatickou detekci úspěšných hovorů v call centrech na základě analýzy parametrů dialogu mezi účastníky telefonních hovorů.Doctoral thesis deals with emotion recognition from speech signals. The thesis is divided into two main parts; the first part describes proposed approaches for emotion recognition using two different multilingual databases of acted emotional speech. The main contributions of this part are detailed analysis of a big set of acoustic features, new classification schemes for vocal emotion recognition such as “emotion coupling” and new method for mapping discrete emotions into two-dimensional space. The second part of this thesis is devoted to emotion recognition using multilingual databases of spontaneous emotional speech, which is based on telephone records obtained from real call centers. The knowledge gained from experiments with emotion recognition from acted speech was exploited to design a new approach for classifying seven emotional states. The core of the proposed approach is a complex classification architecture based on the fusion of different systems. The thesis also examines the influence of speaker’s emotional state on gender recognition performance and proposes system for automatic identification of successful phone calls in call center by means of dialogue features.

    Jitter and Shimmer measurements for speaker diarization

    Get PDF
    Jitter and shimmer voice quality features have been successfully used to characterize speaker voice traits and detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed in the framework of Augmented Multiparty Interaction (AMI) corpus, a multi-party and spontaneous speech set of recordings. Both sets of features are independently modeled using mixture of Gaussians and fused together at the score likelihood level. The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements to the baseline spectral features decreases the diarization error rate in most of the recordings.Peer ReviewedPostprint (published version

    Employing Emotion Cues to Verify Speakers in Emotional Talking Environments

    Full text link
    Usually, people talk neutrally in environments where there are no abnormal talking conditions such as stress and emotion. Other emotional conditions that might affect people talking tone like happiness, anger, and sadness. Such emotions are directly affected by the patient health status. In neutral talking environments, speakers can be easily verified, however, in emotional talking environments, speakers cannot be easily verified as in neutral talking ones. Consequently, speaker verification systems do not perform well in emotional talking environments as they do in neutral talking environments. In this work, a two-stage approach has been employed and evaluated to improve speaker verification performance in emotional talking environments. This approach employs speaker emotion cues (text-independent and emotion-dependent speaker verification problem) based on both Hidden Markov Models (HMMs) and Suprasegmental Hidden Markov Models (SPHMMs) as classifiers. The approach is comprised of two cascaded stages that combines and integrates emotion recognizer and speaker recognizer into one recognizer. The architecture has been tested on two different and separate emotional speech databases: our collected database and Emotional Prosody Speech and Transcripts database. The results of this work show that the proposed approach gives promising results with a significant improvement over previous studies and other approaches such as emotion-independent speaker verification approach and emotion-dependent speaker verification approach based completely on HMMs.Comment: Journal of Intelligent Systems, Special Issue on Intelligent Healthcare Systems, De Gruyter, 201

    Prosodic Event Recognition using Convolutional Neural Networks with Context Information

    Full text link
    This paper demonstrates the potential of convolutional neural networks (CNN) for detecting and classifying prosodic events on words, specifically pitch accents and phrase boundary tones, from frame-based acoustic features. Typical approaches use not only feature representations of the word in question but also its surrounding context. We show that adding position features indicating the current word benefits the CNN. In addition, this paper discusses the generalization from a speaker-dependent modelling approach to a speaker-independent setup. The proposed method is simple and efficient and yields strong results not only in speaker-dependent but also speaker-independent cases.Comment: Interspeech 2017 4 pages, 1 figur
    corecore