    Enhancing Emotion Classification in Malayalam Accented Speech: An In-Depth Clustering Approach

    Accurate emotion classification in accented speech for the Malayalam language poses a unique challenge in the realm of speech recognition. In this study, we explore the application of various clustering algorithms to this specific dataset, evaluating their effectiveness using the Silhouette Score as a measure of cluster quality. Among the clustering methods, Affinity Propagation emerged as the frontrunner, achieving the highest Silhouette Score of 0.5255. This result indicates superior cluster quality characterized by well-defined and distinct groups. OPTICS and Mean Shift Clustering followed with scores of 0.4029 and 0.2511, respectively, indicating the presence of relatively distinct and well-formed clusters. In addition, we introduced Ensemble Clustering (Majority Voting), which achieved a score of 0.2399, indicating moderate cluster distinction. These findings provide a valuable perspective on the potential advantages of ensemble methods in this context. Our experimental results shed light on the effectiveness of various clustering methods for emotion classification in accented Malayalam speech. This study contributes to the advancement of speech recognition technology and lays the groundwork for further research in this area.
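
    The Silhouette Score used above as the cluster-quality measure can be sketched as follows. This is a minimal illustration, not the study's code: it assumes Euclidean distance and small in-memory feature vectors, and the function name and toy data are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to every other point in `members`.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance inside the own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Toy data (hypothetical): two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

    Scores close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned points, which is how the reported scores (0.5255 down to 0.2399) are ranked.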

    Emotion-aware cross-modal domain adaptation in video sequences

    Automatic Detection of Depression in Speech Using Ensemble Convolutional Neural Networks

    This paper proposes a speech-based method for automatic depression classification. The system is based on ensemble learning for Convolutional Neural Networks (CNNs) and is evaluated using the data and the experimental protocol provided in the Depression Classification Sub-Challenge (DCC) at the 2016 Audio-Visual Emotion Challenge (AVEC-2016). In the pre-processing phase, speech files are represented as a sequence of log-spectrograms and randomly sampled to balance positive and negative samples. For the classification task itself, first, a more suitable architecture for this task, based on One-Dimensional Convolutional Neural Networks, is built. Secondly, several of these CNN-based models are trained with different initializations and then the corresponding individual predictions are fused by using an Ensemble Averaging algorithm and combined per speaker to get an appropriate final decision. The proposed ensemble system achieves satisfactory results on the DCC at the AVEC-2016 in comparison with a reference system based on Support Vector Machines and hand-crafted features, with a CNN+LSTM-based system called DepAudionet, and with the case of a single CNN-based classifier.
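
    The fusion step described above (averaging the predictions of several models, then combining them per speaker) might look roughly like the sketch below. The abstract does not say how segment scores are combined per speaker, so the per-speaker mean and the 0.5 threshold are assumptions, and all names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def ensemble_average(model_probs):
    """Average per-segment probabilities across models.

    model_probs: one list per model, each holding P(depressed) per segment,
    with segments in the same order for every model.
    """
    return [mean(segment) for segment in zip(*model_probs)]

def per_speaker_decision(segment_probs, speaker_ids, threshold=0.5):
    """Combine segment scores into one decision per speaker.

    Assumption: a speaker is labelled positive when the mean of their
    segment scores reaches `threshold`.
    """
    by_speaker = defaultdict(list)
    for prob, speaker in zip(segment_probs, speaker_ids):
        by_speaker[speaker].append(prob)
    return {spk: mean(probs) >= threshold for spk, probs in by_speaker.items()}

# Two hypothetical models scoring four segments from two speakers.
model_probs = [
    [0.9, 0.8, 0.2, 0.1],
    [0.7, 0.6, 0.4, 0.3],
]
fused = ensemble_average(model_probs)
decisions = per_speaker_decision(fused, ["a", "a", "b", "b"])
```

    Averaging over models trained from different initializations reduces the variance of any single network's prediction, which is the usual motivation for this kind of ensemble.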

    A Comparison Between Convolutional and Transformer Architectures for Speech Emotion Recognition

    Creating speech emotion recognition models comparable to the capability of how humans recognise emotions is a long-standing challenge in the field of speech technology with many potential commercial applications. As transformer-based architectures have recently become the state-of-the-art for many natural language processing related applications, this paper investigates their suitability for acoustic emotion recognition and compares them to the well-known AlexNet convolutional approach. This comparison is made using several publicly available speech emotion corpora. Experimental results demonstrate the efficacy of the different architectural approaches for particular emotions. The results show that the transformer-based models outperform their convolutional counterparts yielding F1-scores in the range [70.33%, 75.76%]. This paper further provides insights via dimensionality reduction analysis of output layer activations in both architectures and reveals significantly improved clustering in transformer-based models whilst highlighting the nuances with regard to the separability of different emotion classes.

    This paper proposes a speech-based method for automatic depression classification. The system is based on ensemble learning for Convolutional Neural Networks (CNNs) and is evaluated using the data and the experimental protocol provided in the Depression Classification Sub-Challenge (DCC) at the 2016 Audio–Visual Emotion Challenge (AVEC-2016). In the pre-processing phase, speech files are represented as a sequence of log-spectrograms and randomly sampled to balance positive and negative samples. For the classification task itself, first, a more suitable architecture for this task, based on One-Dimensional Convolutional Neural Networks, is built. Secondly, several of these CNN-based models are trained with different initializations and then the corresponding individual predictions are fused by using an Ensemble Averaging algorithm and combined per speaker to get an appropriate final decision. The proposed ensemble system achieves satisfactory results on the DCC at the AVEC-2016 in comparison with a reference system based on Support Vector Machines and hand-crafted features, with a CNN+LSTM-based system called DepAudionet, and with the case of a single CNN-based classifier. This research was partly funded by Spanish Government grant TEC2017-84395-P.

    An ongoing review of speech emotion recognition

    User emotional status recognition is becoming a key feature in advanced Human Computer Interfaces (HCI). A key source of emotional information is the spoken expression, which may be part of the interaction between the human and the machine. Speech emotion recognition (SER) is a very active area of research that involves the application of current machine learning and neural networks tools. This ongoing review covers recent and classical approaches to SER reported in the literature. This work has been carried out with the support of project PID2020-116346GB-I00 funded by the Spanish MICIN.

    GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition

    In human-computer interaction, Speech Emotion Recognition (SER) plays an essential role in understanding the user's intent and improving the interactive experience. Because utterances expressing similar emotions differ in speaker characteristics but share common antecedents and consequences, an essential challenge for SER is how to produce robust and discriminative representations through causality between speech emotions. In this paper, we propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet) to construct a novel emotional causality representation learning component with a multi-scale receptive field. GM-TCNet deploys this component, built from a dilated causal convolution layer and a gating mechanism, to capture the dynamics of emotion across the time domain. In addition, it uses skip connections to fuse high-level features from different gated convolution blocks and so capture abundant and subtle emotion changes in human speech. GM-TCNet first uses a single type of feature, mel-frequency cepstral coefficients, as inputs and then passes them through the gated temporal convolutional module to generate the high-level features. Finally, the features are fed to the emotion classifier to accomplish the SER task. The experimental results show that our model maintains the highest performance in most cases compared to state-of-the-art techniques. Comment: The source code is available at: https://github.com/Jiaxin-Ye/GM-TCNe
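
    To make the "dilated causal convolution layer and gating mechanism" concrete, here is a single-channel sketch. It is not taken from the GM-TCNet source: the tanh/sigmoid gate is the WaveNet-style formulation and is only an assumption about how the gate could be built, and the function names and weights are hypothetical.

```python
import math

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with zeros before t = 0.

    Only current and past samples are read, so the output at time t
    never depends on future inputs (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

def gated_block(x, filter_w, gate_w, dilation):
    """Assumed gate: tanh(filter branch) * sigmoid(gate branch), element-wise."""
    f = causal_dilated_conv(x, filter_w, dilation)
    g = causal_dilated_conv(x, gate_w, dilation)
    return [math.tanh(a) * (1.0 / (1.0 + math.exp(-b))) for a, b in zip(f, g)]

# An impulse shows the receptive field: with dilation 2 and two taps,
# the output responds at t = 0 and again at t = 2.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = causal_dilated_conv(impulse, [1.0, 1.0], dilation=2)
gated = gated_block(impulse, [1.0], [0.0], dilation=1)
```

    Stacking such blocks with growing dilation widens the receptive field, which is one common way to obtain the multi-scale temporal context the abstract refers to.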

    Speech Mode Classification using the Fusion of CNNs and LSTM Networks

    Speech mode classification is an area that has not been as widely explored in the field of sound classification as others such as environmental sounds, music genre, and speaker identification. But what is speech mode? While mode is defined as the way or the manner in which something occurs or is expressed or done, speech mode is defined as the style in which the speech is delivered by a person. There are some reports on speech mode classification using conventional methods, such as whispering and talking using a normal phonetic sound. However, to the best of our knowledge, deep learning-based methods have not been reported in the open literature for the aforementioned classification scenario. Specifically, in this work we assess the performance of image-based classification algorithms on this challenging speech mode classification problem, including the usage of pre-trained deep neural networks, namely AlexNet, ResNet18 and SqueezeNet. Thus, we compare the classification efficiency of a set of deep learning-based classifiers, while we also assess the impact of different 2D image representations (spectrograms, mel-spectrograms, and their image-based fusion) on classification accuracy. These representations are used as input to the networks after being generated from the original audio signals. Next, we compare the accuracy of the DL-based classifiers to a set of machine learning (ML) ones that use as their inputs Mel-Frequency Cepstral Coefficients (MFCCs) features. Then, after determining the most efficient sampling rate for our classification problem (i.e. 32kHz), we study the performance of our proposed method of combining CNN with LSTM (Long Short-Term Memory) networks. For this purpose, we use the features extracted from the deep networks of the previous step. We conclude our study by evaluating the role of sampling rates on classification accuracy by generating two sets of 2D image representations – one with 32kHz and the other with 16kHz sampling.
    Experimental results show that after cross validation the accuracy of DL-based approaches is 15% higher than ML ones, with SqueezeNet yielding an accuracy of more than 91% at 32kHz, whether we use transfer learning, feature-level fusion or score-level fusion (92.5%). Our proposed method using LSTMs further increased that accuracy by more than 3%, resulting in an average accuracy of 95.7%.

    Statistical Machine Learning for Human Behaviour Analysis

    Human behaviour analysis has introduced several challenges in various fields, such as applied information theory, affective computing, robotics, biometrics and pattern recognition. This Special Issue focused on novel vision-based approaches, mainly related to computer vision and machine learning, for the automatic analysis of human behaviour. We solicited submissions on the following topics: information theory-based pattern classification, biometric recognition, multimodal human analysis, low resolution human activity analysis, face analysis, abnormal behaviour analysis, unsupervised human analysis scenarios, 3D/4D human pose and shape estimation, human analysis in virtual/augmented reality, affective computing, social signal processing, personality computing, activity recognition, human tracking in the wild, and application of information-theoretic concepts for human behaviour analysis. In the end, 15 papers were accepted for this special issue [1-15]. These papers, which are reviewed in this editorial, analyse human behaviour from the aforementioned perspectives, in most cases defining the state of the art in their corresponding field. Most of the included papers are application-based systems, while [15] focuses on the understanding and interpretation of a classification model, which is an important factor for the classifier's credibility. Given a set of categorical data, [15] utilizes multi-objective optimization algorithms, like ENORA and NSGA-II, to produce rule-based classification models that are easy to interpret. The performance of the classifier and its number of rules are optimized during learning, where the first is expected to be maximized while the second is expected to be minimized. Testing on public databases, using 10-fold cross-validation, shows the superiority of the proposed method against classifiers that are generated using other previously published methods like PART, JRip, OneR and ZeroR.
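
    The two competing objectives described for [15] (maximize classifier performance, minimize the number of rules) can be illustrated with a small Pareto-dominance filter. This is only a sketch of the underlying idea, not ENORA or NSGA-II themselves, and the candidate values are made up for illustration.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives
    and strictly better on one. Candidates are (accuracy, n_rules) pairs:
    higher accuracy is better, fewer rules is better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Hypothetical rule-based classifiers as (accuracy, number of rules).
candidates = [(0.90, 12), (0.88, 5), (0.85, 8), (0.90, 15), (0.70, 2)]
front = pareto_front(candidates)
```

    Multi-objective evolutionary algorithms search for such a front, leaving the final trade-off between accuracy and interpretability to the user.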
    Two published papers ([1,9]) have privacy as their main concern, while they develop their respective systems for biometrics recognition and action recognition. Reference [1] has considered a privacy-aware biometrics system. The idea is that the identity of the users should not be readily revealed from their biometrics, like facial images. Therefore, they have collected a database of foot and hand traits of users while opening a door to grant or deny access, while [9] develops a privacy-aware method for action recognition using recurrent neural networks. The system accumulates reflections of light pulses emitted by a laser, using a single-pixel hybrid photodetector. This includes information about the distance of the objects to the capturing device and their shapes.