Emotion recognition based on the energy distribution of plosive syllables
Speech emotion recognition (SER) usually faces two problems, expression and perception, both of which vary considerably between speakers, languages, and sentence pronunciation. Finding a system that characterizes emotions while overcoming all these differences is therefore a promising prospect. With this aim, we considered two emotional databases: the Moroccan Arabic dialect emotional database (MADED) and the Ryerson audio-visual database of emotional speech and song (RAVDESS), which differ notably in type (natural/acted) and language (Arabic/English). We proposed a detection process based on 27 acoustic features extracted from consonant-vowel (CV) syllabic units common to both databases: /ba/, /du/, /ki/, /ta/. We tested two classification strategies: multiclass (all emotions combined: joy, sadness, neutral, anger) and binary (neutral vs. others, positive emotions (joy) vs. negative emotions (sadness, anger), sadness vs. anger). These strategies were tested three times: i) on MADED, ii) on RAVDESS, iii) on MADED and RAVDESS combined. The proposed method gave better recognition accuracy in the case of binary classification. The rates reach an average of 78% for the multiclass classification, 100% for neutral vs. others, 100% within the negative emotions (i.e. anger vs. sadness), and 96% for positive vs. negative emotions.
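The four classification strategies named in this abstract can be sketched as label mappings. This is only an illustrative sketch, not the authors' code: the function name `relabel`, the task keys, and the choice to drop samples whose emotion is not part of a binary task (for example, neutral in positive vs. negative) are assumptions.

```python
# Emotions listed in the abstract.
EMOTIONS = ("joy", "sadness", "neutral", "anger")

# Hypothetical task names; each maps an emotion to its class for that task.
TASKS = {
    "multiclass": {e: e for e in EMOTIONS},
    "neutral_vs_others": {
        "neutral": "neutral", "joy": "other", "sadness": "other", "anger": "other",
    },
    # Assumption: neutral samples are left out of this task.
    "positive_vs_negative": {"joy": "positive", "sadness": "negative", "anger": "negative"},
    "sadness_vs_anger": {"sadness": "sadness", "anger": "anger"},
}


def relabel(labels, task):
    """Return (sample_index, class) pairs for one task, skipping unused emotions."""
    mapping = TASKS[task]
    return [(i, mapping[lab]) for i, lab in enumerate(labels) if lab in mapping]


labels = ["joy", "neutral", "anger", "sadness"]
print(relabel(labels, "positive_vs_negative"))
```

The returned indices can be used to select the matching rows of a feature matrix before training a classifier for that task.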
Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition
In this paper, we propose a new unsupervised domain adaptation (DA) method
called layer-adapted implicit distribution alignment networks (LIDAN) to
address the challenge of cross-corpus speech emotion recognition (SER). LIDAN
extends our previous ICASSP work, deep implicit distribution alignment networks
(DIDAN), whose key contribution lies in the introduction of a novel
regularization term called implicit distribution alignment (IDA). This term
allows DIDAN trained on source (training) speech samples to remain applicable
to predicting emotion labels for target (testing) speech samples, regardless of
corpus variance in cross-corpus SER. To further enhance this method, we extend
IDA to layer-adapted IDA (LIDA), resulting in LIDAN. This layer-adapted
extension consists of three modified IDA terms that consider emotion labels at
different levels of granularity. These terms are strategically arranged within
different fully connected layers in LIDAN, aligning with the increasing
emotion-discriminative abilities with respect to the layer depth. This
arrangement enables LIDAN to more effectively learn emotion-discriminative and
corpus-invariant features for SER across various corpora compared to DIDAN. It
is also worth mentioning that, unlike most existing methods that rely on
estimating statistical moments to describe pre-assumed explicit distributions,
both IDA and LIDA take a different approach. They utilize an idea of target
sample reconstruction to directly bridge the feature distribution gap without
making assumptions about their distribution type. As a result, DIDAN and LIDAN
can be viewed as implicit cross-corpus SER methods. To evaluate LIDAN, we
conducted extensive cross-corpus SER experiments on EmoDB, eNTERFACE, and CASIA
corpora. The experimental results demonstrate that LIDAN surpasses recent
state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER
tasks.
GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition
In human-computer interaction, Speech Emotion Recognition (SER) plays an
essential role in understanding the user's intent and improving the interactive
experience. Although utterances expressing similar sentiment differ in speaker
characteristics, they share common antecedents and consequences, so an essential
challenge for SER is how to produce robust and discriminative representations
through causality between speech emotions. In this paper, we propose a Gated
Multi-scale Temporal Convolutional Network (GM-TCNet) to construct a novel
emotional causality representation learning component with a multi-scale
receptive field. GM-TCNet deploys a novel emotional causality representation
learning component to capture the dynamics of emotion across the time domain,
constructed with dilated causal convolution layers and a gating mechanism.
Besides, it utilizes skip connections to fuse high-level features from different
gated convolution blocks to capture abundant and subtle emotion changes in
human speech. GM-TCNet first uses a single type of feature, mel-frequency
cepstral coefficients, as inputs and then passes them through the gated
temporal convolutional module to generate the high-level features. Finally, the
features are fed to the emotion classifier to accomplish the SER task. The
experimental results show that our model maintains the highest performance in
most cases compared to state-of-the-art techniques. Comment: The source code is available at:
https://github.com/Jiaxin-Ye/GM-TCNe
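The abstract's core building block, a dilated causal convolution combined with a gating mechanism, can be illustrated with a minimal NumPy sketch. This is not the GM-TCNet implementation: the function names, tensor shapes, and the tanh-times-sigmoid form of the gate are assumptions chosen for illustration.

```python
import numpy as np


def causal_dilated_conv(x, w, dilation):
    """Causal 1-D convolution over time.

    x: (T, C_in) sequence of frames (e.g. MFCC vectors).
    w: (K, C_in, C_out) kernel; tap k looks back (K - 1 - k) * dilation frames.
    Returns (T, C_out); output[t] depends only on x[0..t].
    """
    K, _, c_out = w.shape
    T = x.shape[0]
    pad = (K - 1) * dilation
    # Left-pad with zeros so no output frame sees future input.
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    out = np.zeros((T, c_out))
    for k in range(K):
        out += xp[k * dilation : k * dilation + T] @ w[k]
    return out


def gated_block(x, w_filter, w_gate, dilation):
    """Assumed gate form: tanh(filter branch) * sigmoid(gate branch)."""
    f = np.tanh(causal_dilated_conv(x, w_filter, dilation))
    g = 1.0 / (1.0 + np.exp(-causal_dilated_conv(x, w_gate, dilation)))
    return f * g


rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))          # 20 frames, 3 features (toy sizes)
w_f = rng.normal(size=(2, 3, 4))
w_g = rng.normal(size=(2, 3, 4))
y = gated_block(x, w_f, w_g, dilation=4)
print(y.shape)
```

Stacking such blocks with increasing dilation widens the receptive field, which is one common way to obtain a multi-scale view of the sequence.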
Speech emotion recognition based on bi-directional acoustic–articulatory conversion
Acoustic and articulatory signals are naturally coupled and complementary. The challenge of acquiring articulatory data and the nonlinear ill-posedness of acoustic–articulatory conversions have resulted in previous studies on speech emotion recognition (SER) primarily relying on unidirectional acoustic–articulatory conversions. However, these studies have ignored the potential benefits of bi-directional acoustic–articulatory conversion. Addressing the problem of nonlinear ill-posedness and effectively extracting and utilizing these two modal features in SER remain open research questions. To bridge this gap, this study proposes a Bi-A2CEmo framework that simultaneously addresses the bi-directional acoustic-articulatory conversion for SER. This framework comprises three components: a Bi-MGAN that addresses the nonlinear ill-posedness problem, KCLNet that enhances the emotional attributes of the mapped features, and ResTCN-FDA that fully exploits the emotional attributes of the features. Another challenge is the absence of a parallel acoustic-articulatory emotion database. To overcome this issue, this study utilizes electromagnetic articulography (EMA) to create a multi-modal acoustic-articulatory emotion database for Mandarin Chinese called STEM-EVA. A comparative analysis is then conducted between the proposed method and state-of-the-art models to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an accuracy of 89.04% in SER, which is an improvement of 5.27% compared with the actual acoustic and articulatory features recorded by the EMA. The results for the STEM-EVA dataset show that Bi-MGAN achieves a higher accuracy in mapping and inversion than conventional conversion networks. Visualization of the mapped features before and after enhancement reveals that KCLNet reduces the intra-class spacing while increasing the inter-class spacing of the features. ResTCN-FDA demonstrates high recognition accuracy on three publicly available datasets.
The experimental results show that the proposed bi-directional acoustic-articulatory conversion framework can significantly improve the SER performance.