
    Signal Enhancement by Single Channel Source Separation

    Most gadgets and electronic devices are equipped with only a single microphone. This poses a difficult problem for source separation, which traditionally requires more sensors than sources to perform well. In this paper, we evaluate single-channel source separation to enhance a target signal corrupted by interfering noise. The method we use is non-negative matrix factorization (NMF), which decomposes the signal into its components and finds those matching the target speaker. As an objective evaluation, a coherence score is used to measure the perceptual similarity between the enhanced signal and the original one. The extracted signal achieves an average coherence of 0.5, indicating a medium correlation between the two signals.
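    As a rough illustration of the approach described above, the sketch below decomposes a mixture spectrogram with NMF, keeps the components matched to a pre-learned target-speaker dictionary, and scores the result with an average coherence. The sampling rate, STFT settings, component counts, and update scheme are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch only: supervised NMF enhancement with a fixed target-speaker dictionary
# and free noise bases, evaluated by mean magnitude-squared coherence.
import numpy as np
from scipy.signal import stft, istft, coherence
from sklearn.decomposition import NMF

FS = 16000        # assumed sampling rate
NPERSEG = 512     # assumed STFT window length


def learn_speaker_bases(clean_speech, n_bases=40):
    """Learn a spectral dictionary from clean target-speaker speech."""
    _, _, Z = stft(clean_speech, fs=FS, nperseg=NPERSEG)
    mag = np.abs(Z)
    model = NMF(n_components=n_bases, init="random", max_iter=400, random_state=0)
    model.fit(mag.T)                 # rows = frames, columns = frequency bins
    return model.components_.T       # shape: (freq_bins, n_bases)


def enhance(mixture, W_speaker, n_noise_bases=20, n_iter=200):
    """Extract the target speaker from a single-channel mixture."""
    _, _, Z = stft(mixture, fs=FS, nperseg=NPERSEG)
    mag, phase = np.abs(Z), np.angle(Z)
    rng = np.random.default_rng(0)
    n_spk = W_speaker.shape[1]

    # Speaker bases stay fixed; noise bases and all activations are learned.
    W = np.hstack([W_speaker, rng.random((mag.shape[0], n_noise_bases)) + 1e-3])
    H = rng.random((W.shape[1], mag.shape[1])) + 1e-3

    # Multiplicative updates under a Frobenius (Euclidean) cost.
    for _ in range(n_iter):
        H *= (W.T @ mag) / (W.T @ (W @ H) + 1e-12)
        W[:, n_spk:] *= (mag @ H[n_spk:].T) / ((W @ H) @ H[n_spk:].T + 1e-12)

    # Wiener-style mask built from the speaker components only.
    mask = (W[:, :n_spk] @ H[:n_spk]) / (W @ H + 1e-12)
    _, enhanced = istft(mag * mask * np.exp(1j * phase), fs=FS, nperseg=NPERSEG)
    return enhanced


def average_coherence(enhanced, reference):
    """Objective score used above: mean magnitude-squared coherence."""
    n = min(len(enhanced), len(reference))
    _, cxy = coherence(enhanced[:n], reference[:n], fs=FS, nperseg=NPERSEG)
    return float(np.mean(cxy))
```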

    Deep Multilayer Perceptrons for Dimensional Speech Emotion Recognition

    Modern deep learning architectures are ordinarily run on high-performance computing facilities due to the large size of the input features and the complexity of the models. This paper proposes traditional multilayer perceptrons (MLPs) with deep layers and a small input size to address that computational requirement. The results show that our proposed deep MLP outperforms modern deep learning architectures, i.e., LSTM and CNN, with the same number of layers and parameters. The deep MLP exhibits the highest performance in both speaker-dependent and speaker-independent scenarios on the IEMOCAP and MSP-IMPROV corpora. Comment: 2 figures, 4 tables, submitted to EUSIPCO 202
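    A minimal sketch of the kind of architecture the abstract describes, a deep but narrow MLP operating on a small utterance-level feature vector, is given below. The feature dimension, layer widths, number of layers, and the plain MSE loss are assumptions for illustration; the paper's exact setup is not reproduced here.

```python
# Sketch only: a deep, narrow MLP regressing dimensional emotion attributes
# from a small utterance-level feature vector (sizes are assumed).
import tensorflow as tf

N_FEATURES = 31   # assumed compact acoustic feature vector
N_TARGETS = 3     # e.g., valence, arousal, dominance

model = tf.keras.Sequential(
    [tf.keras.Input(shape=(N_FEATURES,))]
    + [tf.keras.layers.Dense(256, activation="relu") for _ in range(4)]
    + [tf.keras.layers.Dense(N_TARGETS)]
)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(x_train, y_train, epochs=50, batch_size=32)
#   with x_train of shape (N, 31) and y_train of shape (N, 3)
```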

    Ensembling Multilingual Pre-Trained Models for Predicting Multi-Label Regression Emotion Share from Speech

    Speech emotion recognition has evolved from research to practical applications. Previous studies of emotion recognition from speech have focused on developing models on particular datasets such as IEMOCAP. The lack of data in the domain of emotion modeling makes it challenging to evaluate models on other datasets, as well as to evaluate speech emotion recognition models in a multilingual setting. This paper proposes ensemble learning to fuse the results of pre-trained models for recognizing emotion share from speech. The models were chosen to accommodate multilingual data in English and Spanish. The results show that ensemble learning can improve on the performance of the single-model baseline and the previous best model based on late fusion. Performance is measured with the Spearman rank correlation coefficient, since the task is a regression problem with ranked values. A Spearman rank correlation coefficient of 0.537 is reported for the test set and 0.524 for the development set. These scores are higher than those of the previous study using a fusion of monolingual data, which achieved 0.476 for the test set and 0.470 for the development set. Comment: 4 pages, 6 tables, accepted at APSIPA-ASC 202
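    The ensembling step itself is simple to sketch: average (or weight) the per-model predictions for the multi-label emotion-share regression and score them with the Spearman rank correlation, the metric reported above. The prediction arrays and weights below are placeholders; the underlying multilingual pre-trained models are not reproduced.

```python
# Sketch only: fuse per-model predictions by weighted averaging and score the
# multi-label regression with the mean Spearman rank correlation.
import numpy as np
from scipy.stats import spearmanr


def ensemble_predictions(pred_list, weights=None):
    """Weighted average of model outputs, each of shape (n_samples, n_emotions)."""
    preds = np.stack(pred_list)              # (n_models, n_samples, n_emotions)
    w = np.ones(len(pred_list)) if weights is None else np.asarray(weights, float)
    return np.tensordot(w / w.sum(), preds, axes=1)


def mean_spearman(y_true, y_pred):
    """Average Spearman rho across emotion dimensions."""
    rhos = [spearmanr(y_true[:, k], y_pred[:, k]).correlation
            for k in range(y_true.shape[1])]
    return float(np.mean(rhos))


# Example: fused = ensemble_predictions([pred_model_a, pred_model_b, pred_model_c])
#          score = mean_spearman(y_dev, fused)
```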

    Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks

    Emotion can be inferred from tonal and verbal information, and both kinds of features can be extracted from speech. While most researchers have studied categorical emotion recognition from a single modality, this research presents dimensional emotion recognition combining acoustic and text features. Thirty-one acoustic features are extracted from speech, while word vectors are used as text features. The initial results on single-modality emotion recognition serve as a cue for combining both features to improve recognition. The combined results show that fusing acoustic and text features reduces the error of dimensional emotion score prediction by about 5% relative to the acoustic system and 1% relative to the text system. The smallest error is achieved by modeling the text features with Long Short-Term Memory (LSTM) networks, the acoustic features with bidirectional LSTM networks, and concatenating both systems with dense networks.
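    A minimal sketch of the bimodal architecture summarized above is shown below: a text branch built from LSTM layers, an acoustic branch built from bidirectional LSTMs, and dense layers after concatenation predicting three emotion dimensions. Sequence lengths, embedding size, layer widths, and the MSE loss are illustrative assumptions.

```python
# Sketch only: LSTM text branch + bidirectional-LSTM acoustic branch, fused
# with dense layers to predict three emotion dimensions (shapes are assumed).
import tensorflow as tf

MAX_WORDS, EMB_DIM = 50, 300        # assumed word-vector input per utterance
MAX_FRAMES, N_ACOUSTIC = 500, 31    # assumed frame-level acoustic features

# Text branch (LSTM).
text_in = tf.keras.Input(shape=(MAX_WORDS, EMB_DIM))
t = tf.keras.layers.LSTM(128, return_sequences=True)(text_in)
t = tf.keras.layers.LSTM(128)(t)

# Acoustic branch (bidirectional LSTM).
ac_in = tf.keras.Input(shape=(MAX_FRAMES, N_ACOUSTIC))
a = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(ac_in)
a = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(a)

# Late concatenation followed by dense layers; three continuous outputs.
x = tf.keras.layers.Concatenate()([t, a])
x = tf.keras.layers.Dense(128, activation="relu")(x)
out = tf.keras.layers.Dense(3)(x)

model = tf.keras.Model(inputs=[text_in, ac_in], outputs=out)
model.compile(optimizer="adam", loss="mse")
```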

    音響情報および言語情報の統合による次元的音声感情認識 (Dimensional Speech Emotion Recognition by Integrating Acoustic and Linguistic Information)

    Supervisor: 赤木 正人 (Masato Akagi), Graduate School of Advanced Science and Technology, doctoral dissertation. Identifier: https://dspace.jaist.ac.jp/dspace/handle/10119/1747

    Multitask Learning and Multistage Fusion for Dimensional Audiovisual Emotion Recognition

    Due to its ability to accurately predict emotional states from multimodal features, audiovisual emotion recognition has recently gained increased interest from researchers. This paper proposes two methods to predict emotional attributes from audio and visual data: multitask learning and a fusion strategy. First, multitask learning is employed by adjusting three weighting parameters, one per attribute, to improve the recognition rate. Second, a multistage fusion is proposed to combine the results from the various modalities into a final prediction. Our multitask learning approach, applied to the unimodal and early-fusion methods, shows an improvement over single-task learning, with an average CCC score of 0.431 compared to 0.297. The multistage method, applied to the late-fusion approach, significantly improves the agreement between true and predicted values on the development set, from [0.537, 0.565, 0.083] to [0.68, 0.656, 0.443] for arousal, valence, and liking, respectively.
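    The multitask element can be sketched as a weighted sum of per-attribute losses, where the three weights play the role of the per-attribute parameters mentioned above. The CCC-based loss and the particular weight values below are assumptions for illustration, written in Keras-style TensorFlow.

```python
# Sketch only: a weighted multitask loss where each attribute (arousal, valence,
# liking) contributes a CCC-based term scaled by its own weight.
import tensorflow as tf


def ccc(y_true, y_pred):
    """Concordance correlation coefficient for one attribute."""
    mu_t, mu_p = tf.reduce_mean(y_true), tf.reduce_mean(y_pred)
    cov = tf.reduce_mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (tf.math.reduce_variance(y_true)
                        + tf.math.reduce_variance(y_pred)
                        + tf.square(mu_t - mu_p) + 1e-8)


def multitask_ccc_loss(alpha=0.5, beta=0.3, gamma=0.2):
    """Weighted sum of per-attribute losses; columns: arousal, valence, liking."""
    def loss(y_true, y_pred):
        l_arousal = 1.0 - ccc(y_true[:, 0], y_pred[:, 0])
        l_valence = 1.0 - ccc(y_true[:, 1], y_pred[:, 1])
        l_liking = 1.0 - ccc(y_true[:, 2], y_pred[:, 2])
        return alpha * l_arousal + beta * l_valence + gamma * l_liking
    return loss


# Example: model.compile(optimizer="adam", loss=multitask_ccc_loss(0.5, 0.3, 0.2))
```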

    Sentiment Analysis and Emotion Recognition from Speech Using Universal Speech Representations

    Understanding sentiment and emotion in speech is a challenging task in human multimodal language research. In certain cases, however, such as telephone calls, only audio data are available. In this study, we independently evaluated sentiment analysis and emotion recognition from speech using recent self-supervised learning models, specifically universal speech representations with speaker-aware pre-training. Three sizes of universal models were evaluated on three sentiment tasks and one emotion task. The evaluation showed that the best results were obtained for two-class sentiment analysis, in terms of both weighted and unweighted accuracy (81% and 73%). This binary classification with unimodal acoustic analysis was also competitive with previous methods that used multimodal fusion. The models failed to make accurate predictions on the emotion recognition task and on the sentiment analysis tasks with larger numbers of classes. The imbalanced nature of the datasets may also have contributed to the performance degradation observed on the six-class emotion, three-class sentiment, and seven-class sentiment tasks.
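    A minimal sketch of the pipeline suggested above: pool frame-level representations from a speaker-aware self-supervised model into utterance embeddings and train a light classifier on top. The checkpoint name "microsoft/unispeech-sat-base" and the logistic-regression head are assumptions; the study's exact model sizes and classification heads are not given in this abstract.

```python
# Sketch only: utterance embeddings from a speaker-aware self-supervised model,
# mean-pooled and fed to a light classifier (checkpoint and head are assumed).
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.linear_model import LogisticRegression

CKPT = "microsoft/unispeech-sat-base"   # assumed universal speech representation
extractor = AutoFeatureExtractor.from_pretrained(CKPT)
encoder = AutoModel.from_pretrained(CKPT).eval()


def embed(waveform_16k):
    """Mean-pool the last hidden layer into one utterance-level embedding."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()


# x_train: list of 16 kHz waveforms, y_train: sentiment labels (e.g., 0/1)
# clf = LogisticRegression(max_iter=1000).fit(
#     np.stack([embed(w) for w in x_train]), y_train)
```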