32,184 research outputs found
EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks
In the era of advanced artificial intelligence and human-computer
interaction, identifying emotions in spoken language is paramount. This
research explores the integration of deep learning techniques in speech emotion
recognition, offering a comprehensive solution to the challenges associated
with speaker diarization and emotion identification. It introduces a framework
that combines a pre-existing speaker diarization pipeline and an emotion
identification model built on a Convolutional Neural Network (CNN) to achieve
higher precision. The proposed model was trained on data from five speech
emotion datasets, namely, RAVDESS, CREMA-D, SAVEE, TESS, and Movie Clips, out
of which the latter is a speech emotion dataset created specifically for this
research. The features extracted from each sample include Mel Frequency
Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), Root Mean Square (RMS),
and various data augmentation algorithms like pitch, noise, stretch, and shift.
This feature extraction approach aims to enhance prediction accuracy while
reducing computational complexity. The proposed model yields an unweighted
accuracy of 63%, demonstrating remarkable efficiency in accurately identifying
emotional states within speech signals
Emotion sensing from head motion capture
Computational analysis of emotion from verbal and non-verbal behavioral cues is critical for human-centric intelligent systems. Among the non-verbal cues, head motion has received relatively less attention, although its importance has been noted in several research. We propose a new approach for emotion recognition using head motion captured using Motion Capture (MoCap). Our approach is motivated by the well known kinesics-phonetic analogy, which advocates that, analogous to human speech being composed of phonemes, head motion is composed of kinemes i.e., elementary motion units. We discover a set of kinemes from head motion in an unsupervised manner by projecting them onto a learned basis domain and subsequently clustering them. This transforms any head motion to a sequence of kinemes. Next, we learn the temporal latent structures within the kineme sequence pertaining to each emotion. For this purpose, we explore two separate approaches – one using Hidden Markov Model and another using artificial neural network. This class-specific, kineme-based representation of head motion is used to perform emotion recognition on the popular IEMOCAP database. We achieve high recognition accuracy (61.8% for three class) for various emotion recognition tasks using head motion alone. This work adds to our understanding of head motion dynamics, and has applications in emotion analysis and head motion animation and synthesis
Representation Analysis Methods to Model Context for Speech Technology
Speech technology has developed to levels equivalent with human parity through the use of deep neural networks. However, it is unclear how the learned dependencies within these networks can be attributed to metrics such as recognition performance. This research focuses on strategies to interpret and exploit these learned context dependencies to improve speech recognition models. Context dependency analysis had not yet been explored for speech recognition networks.
In order to highlight and observe dependent representations within speech recognition models, a novel analysis framework is proposed. This analysis framework uses statistical correlation indexes to compute the coefficiency between neural representations. By comparing the coefficiency of neural representations between models using different approaches, it is possible to observe specific context dependencies within network layers. By providing insights on context dependencies it is then possible to adapt modelling approaches to become more computationally efficient and improve recognition performance. Here the performance of End-to-End speech recognition models are analysed, providing insights on the acoustic and language modelling context dependencies. The modelling approach for a speaker recognition task is adapted to exploit acoustic context dependencies and reach comparable performance with the state-of-the-art methods, reaching 2.89% equal error rate using the Voxceleb1 training and test sets with 50% of the parameters. Furthermore, empirical analysis of the
role of acoustic context for speech emotion recognition modelling revealed that emotion cues are presented as a distributed event. These analyses and results for speech recognition applications aim to provide objective direction for future development of automatic speech recognition systems
Hi,KIA: A Speech Emotion Recognition Dataset for Wake-Up Words
Wake-up words (WUW) is a short sentence used to activate a speech recognition
system to receive the user's speech input. WUW utterances include not only the
lexical information for waking up the system but also non-lexical information
such as speaker identity or emotion. In particular, recognizing the user's
emotional state may elaborate the voice communication. However, there is few
dataset where the emotional state of the WUW utterances is labeled. In this
paper, we introduce Hi, KIA, a new WUW dataset which consists of 488 Korean
accent emotional utterances collected from four male and four female speakers
and each of utterances is labeled with four emotional states including anger,
happy, sad, or neutral. We present the step-by-step procedure to build the
dataset, covering scenario selection, post-processing, and human validation for
label agreement. Also, we provide two classification models for WUW speech
emotion recognition using the dataset. One is based on traditional hand-craft
features and the other is a transfer-learning approach using a pre-trained
neural network. These classification models could be used as benchmarks in
further research.Comment: Asia Pacific Signal and Information Processing Association Annual
Summit and Conference (APSIPA), 202
Real-time speech emotion analysis for smart home assistants
Artificial Intelligence (AI) based Speech Emotion Recognition (SER) has been widely used in the consumer field for control of smart home personal assistants, with many such devices on the market. However, with the increase in computational power, connectivity and the need to enable people to live in the home for longer though the use of technology, then smart home assistants that could detect human emotion will improve the communication between a user and the assistant enabling the assistant of offer more productive feedback. Thus, the aim of this work is to analyze emotional states in speech and propose a suitable method considering performance verses complexity for deployment in Consumer Electronics home products, and to present a practical live demonstration of the research. In this paper, a comprehensive approach has been introduced for the human speech-based emotion analysis. The 1-D convolutional neural network (CNN) has been implemented to learn and classify the emotions associated with human speech. The paper has been implemented on the standard datasets (emotion classification) Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and Toronto Emotional Speech Set database (TESS) (Young and Old). The proposed approach gives 90.48%, 95.79% and94.47% classification accuracies in the aforementioned datasets. We conclude that the 1-D CNN classification models used in speaker-independent experiments are highly effective in the automatic prediction of emotion and are ideal for deployment in smart home assistants to detect emotion
Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition
Speech Emotion Recognition (SER) plays a pivotal role in enhancing
human-computer interaction by enabling a deeper understanding of emotional
states across a wide range of applications, contributing to more empathetic and
effective communication. This study proposes an innovative approach that
integrates self-supervised feature extraction with supervised classification
for emotion recognition from small audio segments. In the preprocessing step,
to eliminate the need of crafting audio features, we employed a self-supervised
feature extractor, based on the Wav2Vec model, to capture acoustic features
from audio data. Then, the output featuremaps of the preprocessing step are fed
to a custom designed Convolutional Neural Network (CNN)-based model to perform
emotion classification. Utilizing the ShEMO dataset as our testing ground, the
proposed method surpasses two baseline methods, i.e. support vector machine
classifier and transfer learning of a pretrained CNN. comparing the propose
method to the state-of-the-art methods in SER task indicates the superiority of
the proposed method. Our findings underscore the pivotal role of deep
unsupervised feature learning in elevating the landscape of SER, offering
enhanced emotional comprehension in the realm of human-computer interactions
- …