Speech Emotion Recognition Using Multi-hop Attention Mechanism
In this paper, we are interested in exploiting textual and acoustic data of
an utterance for the speech emotion classification task. The baseline approach
models the information from audio and text independently using two deep neural
networks (DNNs). The outputs from both the DNNs are then fused for
classification. As opposed to using knowledge from both the modalities
separately, we propose a framework to exploit acoustic information in tandem
with lexical data. The proposed framework uses two bi-directional long
short-term memory (BLSTM) for obtaining hidden representations of the
utterance. Furthermore, we propose an attention mechanism, referred to as the
multi-hop, which is trained to automatically infer the correlation between the
modalities. The multi-hop attention first computes the relevant segments of the
textual data corresponding to the audio signal. The relevant textual data is
then applied to attend parts of the audio signal. To evaluate the performance
of the proposed system, experiments are performed in the IEMOCAP dataset.
Experimental results show that the proposed technique outperforms the
state-of-the-art system by 6.5% relative improvement in terms of weighted
accuracy.Comment: 5 pages, Accepted as a conference paper at ICASSP 2019 (oral
presentation
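As a rough illustration of the two-hop idea (a minimal sketch, not the authors' implementation), the fusion step over pre-computed BLSTM states can be written as plain dot-product attention; the use of the final audio state as the first query, the state shapes, and plain concatenation at the end are all assumptions:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, states):
    # Dot-product attention: weighted sum of states, with weights
    # given by the similarity between the query and each state.
    weights = softmax([dot(query, h) for h in states])
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]

def multi_hop_fuse(audio_states, text_states):
    # Hop 1: the final audio state selects the relevant text segments.
    text_ctx = attend(audio_states[-1], text_states)
    # Hop 2: the resulting text context attends back over the audio states.
    audio_ctx = attend(text_ctx, audio_states)
    # Concatenate both contexts; a classifier head would consume this.
    return text_ctx + audio_ctx
```

In the paper the queries and states come from two learned BLSTMs and further hops can be stacked; here plain lists of floats stand in for hidden representations.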
준언어적 신호 압축 및 분류를 위한 심층 신경망 (Deep Neural Networks for Paralinguistic Signal Compression and Classification)
Thesis (M.S.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: Kyomin Jung (정교민).

Recognizing and classifying paralinguistic signals, with its various applications, is an important problem. In general, this task is considered challenging because the sound information in the signals is difficult to distinguish even for humans. Thus, analyzing the signals with machine learning techniques is a reasonable approach to understanding them. Audio features extracted from paralinguistic signals usually consist of high-dimensional vectors covering prosody, energy, cepstrum, and other speech-related information. Therefore, when the training corpus is not sufficiently large, it is extremely difficult to apply machine learning methods to analyze these signals due to their high feature dimensionality. This thesis addresses these limitations by exploiting the feature-learning abilities of neural networks. First, we use a neural-network-based autoencoder to compress the signal and eliminate redundancy within the signal features, and we show that the compressed signal features remain competitive in distinguishing the signals, compared with the original features, under methods such as logistic regression, support vector machines, decision trees, and boosted trees.

Abstract (Korean, translated): Recognizing and classifying paralinguistic signals is a very important problem given its wide range of applications. It is generally difficult because the sound information is ambiguous and hard to distinguish even for humans. Machine learning techniques have therefore been devised to better understand such signals; the feature vectors used in this analysis are high-dimensional vectors of prosody, energy, frequency, and other signal-related information. Consequently, when the training data is small, the high dimensionality of the feature vectors makes it hard to train a suitable machine learning model. This thesis uses deep neural network models to address this problem. First, various compression techniques are applied to remove unnecessary information from the feature vectors, and we show experimentally that the compressed features are classified better by deep neural networks than by traditional machine learning classifiers.

Chapter 1. Introduction ...........................................................1
Chapter 2. Related Work .........................................................3
Chapter 3. Task Description .................................................... 4
Chapter 4. Proposed Framework .............................................5
Chapter 5. Performance Evaluation ......................................... 9
Chapter 6. Discussion ............................................................ 11
Chapter 7. Conclusion ............................................................13
Bibliography ...........................................................................14
Abstract in Korean ................................................................ 17
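To make the compression idea from the abstract above concrete, here is a minimal sketch (not the thesis code): a tiny linear autoencoder with tied weights, trained by stochastic gradient descent, that maps 2-D features to a 1-D code. The dimensions, learning rate, and toy data are all illustrative assumptions.

```python
import random

def encode(W, x):
    # Code h = W x, with W of shape (hidden_dim, input_dim).
    return [sum(Wk[d] * x[d] for d in range(len(x))) for Wk in W]

def decode(W, h):
    # Reconstruction x' = W^T h (tied weights).
    in_dim = len(W[0])
    return [sum(W[k][d] * h[k] for k in range(len(W))) for d in range(in_dim)]

def recon_loss(W, data):
    # Mean squared reconstruction error over the dataset.
    total = 0.0
    for x in data:
        xr = decode(W, encode(W, x))
        total += sum((a - b) ** 2 for a, b in zip(xr, x))
    return total / len(data)

def train(data, hidden_dim=1, lr=0.01, epochs=300, seed=0):
    rng = random.Random(seed)
    in_dim = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
         for _ in range(hidden_dim)]
    for _ in range(epochs):
        for x in data:
            h = encode(W, x)
            err = [a - b for a, b in zip(decode(W, h), x)]
            for k in range(hidden_dim):
                back = sum(err[d] * W[k][d] for d in range(in_dim))
                for d in range(in_dim):
                    # Gradient of 0.5 * ||x' - x||^2 w.r.t. W[k][d],
                    # accounting for W appearing in both encoder and decoder.
                    W[k][d] -= lr * (err[d] * h[k] + back * x[d])
    return W

# Toy "high-dimensional" features that are really one-dimensional:
data = [[t, 2.0 * t] for t in [-1.0, -0.5, 0.2, 0.7, 1.0]]
W = train(data)
```

The redundancy here is exact (the second coordinate is always twice the first), so a single code unit suffices; real paralinguistic feature vectors are only approximately redundant, which is what the autoencoder exploits.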
Semi-Supervised Speech Emotion Recognition with Ladder Networks
Speech emotion recognition (SER) systems find applications in various fields
such as healthcare, education, and security and defense. A major drawback of
these systems is their lack of generalization across different conditions. This
problem can be solved by training models on large amounts of labeled data from
the target domain, which is expensive and time-consuming. Another approach is
to increase the generalization of the models. An effective way to achieve this
goal is by regularizing the models through multitask learning (MTL), where
auxiliary tasks are learned along with the primary task. However, these
methods often require additional labels (gender, speaker identity, age, or
other emotional descriptors), which are expensive to collect for emotion
recognition. This study proposes the use of ladder networks for emotion
recognition, which utilizes an unsupervised auxiliary task. The primary task is
a regression problem to predict emotional attributes. The auxiliary task is the
reconstruction of intermediate feature representations using a denoising
autoencoder. This auxiliary task does not require labels so it is possible to
train the framework in a semi-supervised fashion with abundant unlabeled data
from the target domain. This study shows that the proposed approach creates a
powerful framework for SER, achieving superior performance to fully
supervised single-task learning (STL) and MTL baselines. The approach is
implemented with several acoustic features, showing that ladder networks
generalize significantly better in cross-corpus settings. Compared to the STL
baselines, the proposed approach achieves relative gains in concordance
correlation coefficient (CCC) between 3.0% and 3.5% for within-corpus
evaluations, and between 16.1% and 74.1% for cross-corpus evaluations,
highlighting the power of the architecture.
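Two pieces of this setup are easy to make concrete. The concordance correlation coefficient has a closed form, and the semi-supervised objective can be sketched (in simplified form, not the authors' implementation) as a supervised regression loss on labeled utterances plus a reconstruction loss over all utterances; the weight `lam` and the plain (non-ladder) reconstruction term are assumptions:

```python
def ccc(y_true, y_pred):
    # Concordance correlation coefficient:
    # 2*cov / (var_t + var_p + (mean_t - mean_p)^2)
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    vt = sum((t - mt) ** 2 for t in y_true) / n
    vp = sum((p - mp) ** 2 for p in y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (vt + vp + (mt - mp) ** 2)

def semi_supervised_loss(labeled, unlabeled, predict, reconstruct, lam=0.1):
    # Primary task: squared error on emotional-attribute predictions,
    # available only for labeled data.
    sup = sum((predict(x) - y) ** 2 for x, y in labeled) / len(labeled)
    # Auxiliary task: reconstruction error, computable for every
    # utterance, labeled or not -- this is what lets unlabeled
    # target-domain data regularize the encoder.
    everything = [x for x, _ in labeled] + unlabeled
    rec = sum(sum((a - b) ** 2 for a, b in zip(reconstruct(x), x))
              for x in everything) / len(everything)
    return sup + lam * rec
```

In the actual ladder network the reconstruction is done layer-wise from noise-corrupted intermediate representations; here a single input-level reconstruction stands in for that mechanism.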
Intonation Template Matching for Syllable-Level Prosody Encoding
We address the challenge of machine interpretation of subtle speech intonations that convey complex meanings. We assume that emotions and interrogative statements follow regular prosodic patterns, allowing us to create an unsupervised intonation template dictionary. These templates can then serve as encoding mechanisms for higher-level labels. We use piecewise interpolation of syllable-level formant features to create intonation templates and evaluate their effectiveness on three speech emotion recognition datasets and on declarative-interrogative utterances. The results indicate that, from individual syllables, basic emotions can be detected with nearly double the accuracy of chance. Additionally, certain intonation templates exhibit a correlation with interrogative implications.
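The template-matching step can be sketched as follows (assumed details: linear interpolation onto a fixed number of points, mean-normalized contours, and squared-distance matching; the features in the paper are syllable-level formants, for which plain float lists stand in here):

```python
def resample(contour, n=10):
    # Piecewise-linear interpolation of a variable-length contour
    # onto n evenly spaced points.
    if len(contour) == 1:
        return [float(contour[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def normalize(v):
    # Remove the mean so templates capture shape, not absolute level.
    m = sum(v) / len(v)
    return [x - m for x in v]

def nearest_template(contour, templates, n=10):
    # Index of the template with the smallest squared distance to
    # the resampled, normalized input contour.
    v = normalize(resample(contour, n))
    def dist(t):
        return sum((a - b) ** 2 for a, b in zip(v, t))
    return min(range(len(templates)), key=lambda i: dist(templates[i]))

# Two hand-made templates standing in for a learned dictionary:
rising = normalize(resample([0.0, 1.0]))
falling = normalize(resample([1.0, 0.0]))
```

In the paper the dictionary itself is learned without supervision; clustering resampled syllable contours and using the cluster centroids as templates would be one way to obtain it.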
A survey on the semi-supervised learning paradigm in the context of speech emotion recognition
Automatic speech emotion recognition has been a hot topic for researchers for quite some time. Recent technological breakthroughs in machine learning open the door to approaches of many kinds. However, some concerns have persisted over the years, chief among them the design and collection of data. Proper annotation of data can be quite expensive and sometimes not even viable, as specialists are often needed for a task as complex as emotion recognition. The semi-supervised learning paradigm aims to reduce this heavy dependency on labelled data, potentially facilitating the design of a proper pipeline of tasks, single- or multi-modal, towards the final objective of recognizing the human emotional state. In this paper, the current single-modal (audio) semi-supervised learning state of the art is reviewed as a possible solution to the bottlenecks mentioned above, as a way of helping and guiding future researchers in the planning phase of such a task, where many positive aspects of each piece of work can be drawn upon and combined.

This work has been supported by FCT - Fundação para a Ciência e a Tecnologia within the R&D Units Project Scope: UIDB/00319/202