19 research outputs found

    Speech Emotion Recognition Using Multi-hop Attention Mechanism

    In this paper, we exploit both the textual and acoustic data of an utterance for speech emotion classification. The baseline approach models the information from audio and text independently using two deep neural networks (DNNs), whose outputs are then fused for classification. Rather than using knowledge from the two modalities separately, we propose a framework that exploits acoustic information in tandem with lexical data. The proposed framework uses two bi-directional long short-term memory (BLSTM) networks to obtain hidden representations of the utterance. Furthermore, we propose an attention mechanism, referred to as multi-hop attention, which is trained to automatically infer the correlation between the modalities. Multi-hop attention first computes the segments of the textual data relevant to the audio signal; the relevant textual data is then used to attend to parts of the audio signal. To evaluate the proposed system, experiments are performed on the IEMOCAP dataset. Experimental results show that the proposed technique outperforms the state-of-the-art system by a 6.5% relative improvement in weighted accuracy. Comment: 5 pages; accepted as a conference paper at ICASSP 2019 (oral presentation).
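    As a rough illustration of the two-hop idea (not the authors' trained model), the sketch below uses plain dot-product attention in NumPy: the final audio BLSTM state first attends over text frames, and the resulting text context then attends back over audio frames. The function names and the unparameterized dot-product scoring are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight each key frame by similarity to the query."""
    weights = softmax(keys @ query)   # (T,) attention distribution over frames
    context = weights @ keys          # (d,) weighted summary of the frames
    return context, weights

def multi_hop_attention(text_h, audio_h):
    """Two-hop cross-modal attention sketch.
    Hop 1: an audio summary selects the relevant text frames.
    Hop 2: the resulting text context selects the relevant audio frames."""
    audio_summary = audio_h[-1]                        # last hidden state as query
    text_ctx, _ = attend(audio_summary, text_h)        # hop 1: audio -> text
    audio_ctx, _ = attend(text_ctx, audio_h)           # hop 2: text -> audio
    return np.concatenate([text_ctx, audio_ctx])       # fused utterance vector
```

    The fused vector would then feed a small classification head; in the paper the attention scoring is learned jointly with the BLSTMs rather than fixed as here.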

    ์ค€์–ธ์–ด์  ์‹ ํ˜ธ ์••์ถ• ๋ฐ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2019. 2. ์ •๊ต๋ฏผ.Recognizing and classifying paralinguistic signals, with its various applications, is an important problem. In general, this task is considered challenging because the sound information from the signals is difficult to distinguish even by humans. Thus, analyzing signals with machine learning techniques is a reasonable approach to understanding signals. Audio features extracted from paralinguistic signals usually consist of highdimensional vectors such as prosody, energy, cepstrum, and other speech-related information. Therefore, when the size of a training corpus is not sufficiently large, it is extremely difficult to apply machine learning methods to analyze these signals due to their high feature dimensions. This paper addresses these limitations by using neural networks' feature learning abilities. First, we use a neural network-based autoencoder to compress the signal to eliminate redundancy within the signal feature, and we show than the compressed signal features are competitive in distinguishing the signal compared to the original methods such as logistic regression, support vector machine, decision trees, and boosted trees.์ค€์–ธ์–ด์  ์‹ ํ˜ธ๋ฅผ ๋ถ„๋ฅ˜๋ฅผ ์ธ์‹ํ•˜๊ณ  ๋ถ„๋ฅ˜ํ•˜๋Š” ์ผ์€ ๊ทธ์˜ ๋‹ค์–‘ํ•œ ์‘์šฉ์„ฑ ์ธก๋ฉด์—์„œ ๋งค์šฐ ์ค‘์š”ํ•œ ๋ฌธ์ œ์ด๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ์ด ๋ฌธ์ œ๊ฐ€ ์–ด๋ ค์šด ์ด์œ ๋Š” ์†Œ๋ฆฌ ์ •๋ณด๊ฐ€ ์ธ๊ฐ„์—๊ฒŒ๋„ ๊ตฌ๋ณ„๋˜๊ธฐ ํž˜๋“ค๋‹ค๋Š” ์• ๋งค๋ชจํ˜ธํ•œ ํŠน์„ฑ ๋•Œ๋ฌธ์ด๋‹ค. ์ด์— ์‹ ํ˜ธ๋ฅผ ๋” ์ž˜ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด ๊ธฐ๊ณ„ํ•™์Šต ๊ธฐ๋ฒ•์ด ๊ณ ์•ˆ๋˜๋Š”๋ฐ, ์ด ๋•Œ ๋ถ„์„์— ์‚ฌ์šฉ๋˜๋Š” ์‹ ํ˜ธ ํŠน์ง• ๋ฒกํ„ฐ๋Š” ์šด์œจ, ์—๋„ˆ์ง€, ์ฃผํŒŒ์ˆ˜ ๋“ฑ ์‹ ํ˜ธ์— ๊ด€๋ จ๋œ ์ •๋ณด๋กœ ์ด๋ฃจ์–ด์ง„ ๊ณ ์ฐจ์› ๋ฒกํ„ฐ์ด๋‹ค. ์ฆ‰ ํ›ˆ๋ จ์— ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์€ ๊ฒฝ์šฐ์—๋Š”, ํŠน์ง• ๋ฒกํ„ฐ์˜ ๋†’์€ ์ฐจ์› ๋•Œ๋ฌธ์— ์ ์ ˆํžˆ ๊ธฐ๊ณ„ํ•™์Šต ๋ชจ๋ธ์„ ์ž˜ ํ›ˆ๋ จ์‹œํ‚ค๊ธฐ๊ฐ€ ์–ด๋ ต๊ฒŒ ๋œ๋‹ค. 
์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ์ด์™€ ๊ฐ™์€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•œ๋‹ค. ์šฐ์„  ๋‹ค์–‘ํ•œ ์••์ถ• ๊ธฐ๋ฒ•์„ ์ด์šฉํ•˜์—ฌ ํŠน์ง• ๋ฒกํ„ฐ ๋‚ด์˜ ๋ถˆํ•„์š”ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ฑฐํ•˜๊ณ , ์ด ์••์ถ•๋œ ํŠน์ง•๋“ค์ด ์ „ํ†ต์  ๊ธฐ๊ณ„ํ•™์Šต ๋ถ„๋ฅ˜๋ฐฉ๋ฒ•๋“ค๋ณด๋‹ค ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์— ์˜ํ•ด ๋” ์ž˜ ๋ถ„๋ฅ˜๋จ์„ ์‹คํ—˜์ ์œผ๋กœ ๋ณด์ธ๋‹ค.Chapter 1. Introduction ...........................................................1 Chapter 2. Related Work .........................................................3 Chapter 3. Task Description .................................................... 4 Chapter 4. Proposed Framework .............................................5 Chapter 5. Performance Evaluation ......................................... 9 Chapter 6. Discussion ............................................................ 11 Chapter 7. Conclusion ............................................................13 Bibliography ...........................................................................14 Abstract in Korean ................................................................ 17Maste

    Semi-Supervised Speech Emotion Recognition with Ladder Networks

    Speech emotion recognition (SER) systems find applications in fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. This problem can be addressed by training models on large amounts of labeled data from the target domain, which is expensive and time-consuming, or by increasing the generalization of the models themselves. An effective way to achieve the latter is to regularize the models through multitask learning (MTL), where auxiliary tasks are learned along with the primary task. These methods, however, typically require labels for the auxiliary tasks (gender, speaker identity, age, or other emotional descriptors), which are expensive to collect. This study proposes the use of ladder networks for emotion recognition, which rely on an unsupervised auxiliary task. The primary task is a regression problem predicting emotional attributes; the auxiliary task is the reconstruction of intermediate feature representations using a denoising autoencoder. Because the auxiliary task requires no labels, the framework can be trained in a semi-supervised fashion with abundant unlabeled data from the target domain. This study shows that the proposed approach creates a powerful framework for SER, outperforming fully supervised single-task learning (STL) and MTL baselines. The approach is implemented with several acoustic features, showing that ladder networks generalize significantly better in cross-corpus settings. Compared to the STL baselines, the proposed approach achieves relative gains in concordance correlation coefficient (CCC) between 3.0% and 3.5% for within-corpus evaluations, and between 16.1% and 74.1% for cross-corpus evaluations, highlighting the power of the architecture.
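    The semi-supervised objective described above can be sketched as a combined loss in which unlabeled batches contribute only the denoising-reconstruction term. This is a simplified single-layer stand-in for a real ladder network; all names, the noise level `sigma`, and the additive weighting `alpha` are assumptions.

```python
import numpy as np

def ladder_style_loss(x, y, W_enc, w_reg, W_dec, sigma=0.1, alpha=1.0, seed=0):
    """Supervised regression loss plus a denoising-reconstruction loss.
    Unlabeled batches (y is None) contribute only the reconstruction term,
    which is what makes semi-supervised training possible."""
    rng = np.random.default_rng(seed)
    x_noisy = x + sigma * rng.normal(size=x.shape)   # corrupt the input
    z = np.tanh(x_noisy @ W_enc)                     # shared encoder
    recon = z @ W_dec                                # denoising decoder
    loss = alpha * np.mean((recon - x) ** 2)         # unsupervised auxiliary task
    if y is not None:
        pred = z @ w_reg                             # attribute regression head
        loss += np.mean((pred - y) ** 2)             # supervised primary task
    return loss
```

    In training, labeled and unlabeled batches would be interleaved and both loss variants minimized over the shared encoder weights.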

    Intonation Template Matching for Syllable-Level Prosody Encoding

    We address the challenge of machine interpretation of subtle speech intonations that convey complex meanings. We assume that emotions and interrogative statements follow regular prosodic patterns, which allows us to build an unsupervised dictionary of intonation templates; these templates can then serve as an encoding for higher-level labels. We create the templates by piecewise interpolation of syllable-level formant features and evaluate their effectiveness on three speech emotion recognition datasets and on declarative versus interrogative utterances. The results indicate that basic emotions can be detected from individual syllables with nearly double chance accuracy, and that certain intonation templates correlate with interrogative intent.
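    A minimal sketch of the template idea, assuming per-syllable contours are resampled to a fixed length by piecewise-linear interpolation and then matched to the nearest template by Euclidean distance (function names and the mean-normalization step are illustrative assumptions):

```python
import numpy as np

def to_template(contour, n_points=10):
    """Resample a variable-length syllable contour to a fixed length by
    piecewise-linear interpolation, then mean-normalize its level."""
    xs = np.linspace(0, len(contour) - 1, n_points)
    t = np.interp(xs, np.arange(len(contour)), np.asarray(contour, float))
    return t - t.mean()

def nearest_template(contour, templates):
    """Encode a contour as the index of its closest intonation template."""
    t = to_template(contour, len(templates[0]))
    dists = [np.linalg.norm(t - tmpl) for tmpl in templates]
    return int(np.argmin(dists))
```

    In the paper the dictionary itself is learned without supervision; here a rising and a falling template could simply be supplied by hand, and each syllable is encoded by its nearest-template index.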

    A survey on the semi-supervised learning paradigm in the context of speech emotion recognition

    Automatic speech emotion recognition has been a hot topic for researchers for quite some time. Recent technological breakthroughs in machine learning open the door to approaches of many kinds, yet some concerns have persisted over the years, chief among them the design and collection of data. Proper annotation of data can be quite expensive and sometimes not even viable, as specialists are often needed for a task as complex as emotion recognition. The semi-supervised learning paradigm tries to reduce this heavy dependency on labelled data, potentially facilitating the design of a proper pipeline of tasks, single- or multi-modal, towards the final objective of recognizing the human emotional state. This paper reviews the current state of the art in single-modal (audio) semi-supervised learning as a possible solution to these bottlenecks, with the aim of guiding future researchers in the planning phase of such a task, where positive aspects from each piece of work can be drawn and combined. This work has been supported by FCT - Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/202