Speech Emotion Recognition Using Multi-hop Attention Mechanism
In this paper, we are interested in exploiting textual and acoustic data of
an utterance for the speech emotion classification task. The baseline approach
models the information from audio and text independently using two deep neural
networks (DNNs). The outputs from both the DNNs are then fused for
classification. As opposed to using knowledge from both the modalities
separately, we propose a framework to exploit acoustic information in tandem
with lexical data. The proposed framework uses two bi-directional long
short-term memory (BLSTM) for obtaining hidden representations of the
utterance. Furthermore, we propose an attention mechanism, referred to as the
multi-hop, which is trained to automatically infer the correlation between the
modalities. The multi-hop attention first computes the relevant segments of the
textual data corresponding to the audio signal. The relevant textual data is
then applied to attend parts of the audio signal. To evaluate the performance
of the proposed system, experiments are performed in the IEMOCAP dataset.
Experimental results show that the proposed technique outperforms the
state-of-the-art system by 6.5% relative improvement in terms of weighted
accuracy.Comment: 5 pages, Accepted as a conference paper at ICASSP 2019 (oral
presentation
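As a rough illustration of the two-hop idea (a minimal sketch, not the authors' implementation), the fusion step over pre-computed BLSTM states can be written as plain dot-product attention; the use of the final audio state as the first query, the state shapes, and plain concatenation at the end are all assumptions:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, states):
    # Dot-product attention: weighted sum of states, with weights
    # given by the similarity between the query and each state.
    weights = softmax([dot(query, h) for h in states])
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]

def multi_hop_fuse(audio_states, text_states):
    # Hop 1: the final audio state selects the relevant text segments.
    text_ctx = attend(audio_states[-1], text_states)
    # Hop 2: the resulting text context attends back over the audio states.
    audio_ctx = attend(text_ctx, audio_states)
    # Concatenate both contexts; a classifier head would consume this.
    return text_ctx + audio_ctx
```

In the paper the queries and states come from two learned BLSTMs and further hops can be stacked; here plain lists of floats stand in for hidden representations.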
준언어적 신호 압축 및 분류를 위한 심층 신경망 (Deep Neural Networks for Paralinguistic Signal Compression and Classification)
Thesis (M.S.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: Kyomin Jung (정교민).

Recognizing and classifying paralinguistic signals, with its various applications, is an important problem. In general, this task is considered challenging because the sound information in the signals is difficult to distinguish even for humans. Thus, analyzing the signals with machine learning techniques is a reasonable approach to understanding them. Audio features extracted from paralinguistic signals usually consist of high-dimensional vectors covering prosody, energy, cepstrum, and other speech-related information. Therefore, when the training corpus is not sufficiently large, it is extremely difficult to apply machine learning methods to analyze these signals due to their high feature dimensionality. This thesis addresses these limitations by exploiting the feature-learning abilities of neural networks. First, we use a neural-network-based autoencoder to compress the signal and eliminate redundancy within the signal features, and we show that the compressed signal features remain competitive in distinguishing the signals, compared with the original features, under methods such as logistic regression, support vector machines, decision trees, and boosted trees.

Abstract (Korean, translated): Recognizing and classifying paralinguistic signals is a very important problem given its wide range of applications. It is generally difficult because the sound information is ambiguous and hard to distinguish even for humans. Machine learning techniques have therefore been devised to better understand such signals; the feature vectors used in this analysis are high-dimensional vectors of prosody, energy, frequency, and other signal-related information. Consequently, when the training data is small, the high dimensionality of the feature vectors makes it hard to train a suitable machine learning model. This thesis uses deep neural network models to address this problem. First, various compression techniques are applied to remove unnecessary information from the feature vectors, and we show experimentally that the compressed features are classified better by deep neural networks than by traditional machine learning classifiers.

Chapter 1. Introduction ...........................................................1
Chapter 2. Related Work .........................................................3
Chapter 3. Task Description .................................................... 4
Chapter 4. Proposed Framework .............................................5
Chapter 5. Performance Evaluation ......................................... 9
Chapter 6. Discussion ............................................................ 11
Chapter 7. Conclusion ............................................................13
Bibliography ...........................................................................14
Abstract in Korean ................................................................ 17
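To make the compression idea from the abstract above concrete, here is a minimal sketch (not the thesis code): a tiny linear autoencoder with tied weights, trained by stochastic gradient descent, that maps 2-D features to a 1-D code. The dimensions, learning rate, and toy data are all illustrative assumptions.

```python
import random

def encode(W, x):
    # Code h = W x, with W of shape (hidden_dim, input_dim).
    return [sum(Wk[d] * x[d] for d in range(len(x))) for Wk in W]

def decode(W, h):
    # Reconstruction x' = W^T h (tied weights).
    in_dim = len(W[0])
    return [sum(W[k][d] * h[k] for k in range(len(W))) for d in range(in_dim)]

def recon_loss(W, data):
    # Mean squared reconstruction error over the dataset.
    total = 0.0
    for x in data:
        xr = decode(W, encode(W, x))
        total += sum((a - b) ** 2 for a, b in zip(xr, x))
    return total / len(data)

def train(data, hidden_dim=1, lr=0.01, epochs=300, seed=0):
    rng = random.Random(seed)
    in_dim = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
         for _ in range(hidden_dim)]
    for _ in range(epochs):
        for x in data:
            h = encode(W, x)
            err = [a - b for a, b in zip(decode(W, h), x)]
            for k in range(hidden_dim):
                back = sum(err[d] * W[k][d] for d in range(in_dim))
                for d in range(in_dim):
                    # Gradient of 0.5 * ||x' - x||^2 w.r.t. W[k][d],
                    # accounting for W appearing in both encoder and decoder.
                    W[k][d] -= lr * (err[d] * h[k] + back * x[d])
    return W

# Toy "high-dimensional" features that are really one-dimensional:
data = [[t, 2.0 * t] for t in [-1.0, -0.5, 0.2, 0.7, 1.0]]
W = train(data)
```

The redundancy here is exact (the second coordinate is always twice the first), so a single code unit suffices; real paralinguistic feature vectors are only approximately redundant, which is what the autoencoder exploits.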
Semi-Supervised Speech Emotion Recognition with Ladder Networks
Speech emotion recognition (SER) systems find applications in various fields
such as healthcare, education, and security and defense. A major drawback of
these systems is their lack of generalization across different conditions. This
problem can be solved by training models on large amounts of labeled data from
the target domain, which is expensive and time-consuming. Another approach is
to increase the generalization of the models. An effective way to achieve this
goal is by regularizing the models through multitask learning (MTL), where
auxiliary tasks are learned along with the primary task. However, these
methods often require additional labels (gender, speaker identity, age, or
other emotional descriptors), which are expensive to collect for emotion
recognition. This study proposes the use of ladder networks for emotion
recognition, which utilizes an unsupervised auxiliary task. The primary task is
a regression problem to predict emotional attributes. The auxiliary task is the
reconstruction of intermediate feature representations using a denoising
autoencoder. This auxiliary task does not require labels so it is possible to
train the framework in a semi-supervised fashion with abundant unlabeled data
from the target domain. This study shows that the proposed approach creates a
powerful framework for SER, achieving superior performance to fully
supervised single-task learning (STL) and MTL baselines. The approach is
implemented with several acoustic features, showing that ladder networks
generalize significantly better in cross-corpus settings. Compared to the STL
baselines, the proposed approach achieves relative gains in concordance
correlation coefficient (CCC) between 3.0% and 3.5% for within-corpus
evaluations, and between 16.1% and 74.1% for cross-corpus evaluations,
highlighting the power of the architecture.
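Two pieces of this setup are easy to make concrete. The concordance correlation coefficient has a closed form, and the semi-supervised objective can be sketched (in simplified form, not the authors' implementation) as a supervised regression loss on labeled utterances plus a reconstruction loss over all utterances; the weight `lam` and the plain (non-ladder) reconstruction term are assumptions:

```python
def ccc(y_true, y_pred):
    # Concordance correlation coefficient:
    # 2*cov / (var_t + var_p + (mean_t - mean_p)^2)
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    vt = sum((t - mt) ** 2 for t in y_true) / n
    vp = sum((p - mp) ** 2 for p in y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (vt + vp + (mt - mp) ** 2)

def semi_supervised_loss(labeled, unlabeled, predict, reconstruct, lam=0.1):
    # Primary task: squared error on emotional-attribute predictions,
    # available only for labeled data.
    sup = sum((predict(x) - y) ** 2 for x, y in labeled) / len(labeled)
    # Auxiliary task: reconstruction error, computable for every
    # utterance, labeled or not -- this is what lets unlabeled
    # target-domain data regularize the encoder.
    everything = [x for x, _ in labeled] + unlabeled
    rec = sum(sum((a - b) ** 2 for a, b in zip(reconstruct(x), x))
              for x in everything) / len(everything)
    return sup + lam * rec
```

In the actual ladder network the reconstruction is done layer-wise from noise-corrupted intermediate representations; here a single input-level reconstruction stands in for that mechanism.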
Intonation Template Matching for Syllable-Level Prosody Encoding
We address the challenge of machine interpretation of subtle speech intonations that convey complex meanings. We assume that emotions and interrogative statements follow regular prosodic patterns, allowing us to create an unsupervised intonation template dictionary. These templates can then serve as encoding mechanisms for higher-level labels. We use piecewise interpolation of syllable-level formant features to create intonation templates and evaluate their effectiveness on three speech emotion recognition datasets and on declarative-interrogative utterances. The results indicate that, from individual syllables, basic emotions can be detected with nearly double the accuracy of chance. Additionally, certain intonation templates exhibit a correlation with interrogative implications.
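The template-matching step can be sketched as follows (assumed details: linear interpolation onto a fixed number of points, mean-normalized contours, and squared-distance matching; the features in the paper are syllable-level formants, for which plain float lists stand in here):

```python
def resample(contour, n=10):
    # Piecewise-linear interpolation of a variable-length contour
    # onto n evenly spaced points.
    if len(contour) == 1:
        return [float(contour[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def normalize(v):
    # Remove the mean so templates capture shape, not absolute level.
    m = sum(v) / len(v)
    return [x - m for x in v]

def nearest_template(contour, templates, n=10):
    # Index of the template with the smallest squared distance to
    # the resampled, normalized input contour.
    v = normalize(resample(contour, n))
    def dist(t):
        return sum((a - b) ** 2 for a, b in zip(v, t))
    return min(range(len(templates)), key=lambda i: dist(templates[i]))

# Two hand-made templates standing in for a learned dictionary:
rising = normalize(resample([0.0, 1.0]))
falling = normalize(resample([1.0, 0.0]))
```

In the paper the dictionary itself is learned without supervision; clustering resampled syllable contours and using the cluster centroids as templates would be one way to obtain it.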
A survey on the semi-supervised learning paradigm in the context of speech emotion recognition
Automatic speech emotion recognition has been a hot topic for researchers for quite some time. Recent technological breakthroughs in machine learning open the door to approaches of many kinds. However, some concerns have persisted over the years, chief among them the design and collection of data. Proper annotation of data can be quite expensive and sometimes not even viable, as specialists are often needed for a task as complex as emotion recognition. The semi-supervised learning paradigm aims to reduce this heavy dependency on labelled data, potentially facilitating the design of a proper pipeline of tasks, single- or multi-modal, towards the final objective of recognizing the human emotional state. In this paper, the current single-modal (audio) semi-supervised learning state of the art is reviewed as a possible solution to the bottlenecks mentioned above, as a way of helping and guiding future researchers in the planning phase of such a task, where many positive aspects of each piece of work can be drawn upon and combined.

This work has been supported by FCT - Fundação para a Ciência e a Tecnologia within the R&D Units Project Scope: UIDB/00319/202