Towards Speech Emotion Recognition "in the wild" using Aggregated Corpora and Deep Multi-Task Learning
One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). To improve the generalisation capabilities of emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions "in the wild". Compared to state-of-the-art Single-Task Learning (STL) methods, our proposed MTL method significantly improved performance. In particular, models using both gender and naturalness achieved larger gains than those using either auxiliary task alone. This benefit was also found in the high-level feature-space representations obtained with our proposed method, where discriminative emotional clusters could be observed.
Comment: Published in the proceedings of INTERSPEECH, Stockholm, September 2017
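The auxiliary-task idea above amounts to training shared layers against a weighted sum of per-task losses. The following is a minimal sketch of that loss combination; the hidden size, class counts, and auxiliary weights (0.3) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    # Negative log-likelihood of the true class.
    return -np.log(softmax(logits)[label])

rng = np.random.default_rng(0)
h = rng.standard_normal(16)                      # shared hidden representation
# One linear head per task: emotion (4 classes), gender (2), naturalness (2).
W_emo, W_gen, W_nat = (rng.standard_normal((16, k)) for k in (4, 2, 2))

# MTL objective: main emotion loss plus down-weighted auxiliary losses.
loss = (cross_entropy(h @ W_emo, 1)
        + 0.3 * cross_entropy(h @ W_gen, 0)      # auxiliary task: gender
        + 0.3 * cross_entropy(h @ W_nat, 1))     # auxiliary task: naturalness
```

Because all three heads backpropagate through the same shared representation, the auxiliary tasks regularize the features the emotion classifier relies on.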
Comprehensive Study of Automatic Speech Emotion Recognition Systems
Speech emotion recognition (SER) is the technology that recognizes psychological states and feelings from speech signals. SER is challenging because of the considerable variation in arousal and valence levels across languages. Advances in artificial intelligence and signal-processing methods have made it possible to interpret emotions automatically, and SER plays a vital role in remote communication. This paper offers a recent survey of SER using machine learning (ML) and deep learning (DL)-based techniques. It focuses on the feature representation and classification techniques used for SER, and further describes the databases and evaluation metrics used for speech emotion recognition.
Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization
Automatic speech recognition (ASR) remains challenging even with deep learning (DL): it requires large-scale training datasets and substantial computational and storage resources. Moreover, DL techniques, and machine learning (ML) approaches in general, assume that training and testing data come from the same domain, with the same input feature space and data distribution characteristics. This assumption, however, does not hold in some real-world artificial intelligence (AI) applications. There are also situations where gathering real data is challenging, expensive, or rare, so the data requirements of DL models cannot be met. Deep transfer learning (DTL) has been introduced to overcome these issues: it helps develop high-performing models from real datasets that are small or slightly different from, but related to, the training data. This paper presents a comprehensive survey of DTL-based ASR frameworks to shed light on the latest developments and help academics and professionals understand current challenges. Specifically, after presenting the DTL background, a well-designed taxonomy is adopted to organize the state of the art. A critical analysis is then conducted to identify the limitations and advantages of each framework. Finally, a comparative study highlights the current challenges before deriving opportunities for future research.
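One common DTL recipe the survey's scope covers is freezing a pretrained model's generic lower layers and fine-tuning only the task-specific head on the small target dataset. A minimal sketch of that parameter selection, with purely hypothetical layer names and shapes (not tied to any specific ASR framework):

```python
import numpy as np

# Hypothetical parameter store for a small pretrained acoustic model.
params = {
    "conv1.w": np.zeros((3, 3)),   # low-level feature extractor (keep pretrained)
    "conv2.w": np.zeros((3, 3)),   # mid-level feature extractor (keep pretrained)
    "head.w":  np.zeros((3, 5)),   # task-specific output layer (fine-tune)
}

# Freeze the generic feature extractors; only the head receives gradient updates.
frozen = {name for name in params if name.startswith("conv")}
trainable = [name for name in params if name not in frozen]
```

Restricting updates to the head reduces both the data and compute needed on the target domain, which is precisely the resource problem DTL is meant to address.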
Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition
In this paper, we propose a new unsupervised domain adaptation (DA) method called layer-adapted implicit distribution alignment networks (LIDAN) to address the challenge of cross-corpus speech emotion recognition (SER). LIDAN extends our previous ICASSP work, deep implicit distribution alignment networks (DIDAN), whose key contribution lies in the introduction of a novel regularization term called implicit distribution alignment (IDA). This term allows DIDAN trained on source (training) speech samples to remain applicable to predicting emotion labels for target (testing) speech samples, regardless of corpus variance in cross-corpus SER. To further enhance this method, we extend IDA to layer-adapted IDA (LIDA), resulting in LIDAN. This layer-adapted extension consists of three modified IDA terms that consider emotion labels at different levels of granularity. These terms are strategically arranged within different fully connected layers in LIDAN, aligning with the increasing emotion-discriminative abilities with respect to layer depth. This arrangement enables LIDAN to learn emotion-discriminative and corpus-invariant features for SER across various corpora more effectively than DIDAN. It is also worth mentioning that, unlike most existing methods that rely on estimating statistical moments to describe pre-assumed explicit distributions, both IDA and LIDA take a different approach: they use the idea of target sample reconstruction to directly bridge the feature distribution gap without making assumptions about the distribution type. As a result, DIDAN and LIDAN can be viewed as implicit cross-corpus SER methods. To evaluate LIDAN, we conducted extensive cross-corpus SER experiments on the EmoDB, eNTERFACE, and CASIA corpora. The experimental results demonstrate that LIDAN surpasses recent state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER tasks.
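The target-sample-reconstruction intuition can be illustrated loosely: approximate each target-corpus feature vector as a combination of source-corpus features, and treat the reconstruction residual as an alignment penalty added to the emotion loss. This is only a sketch of the intuition, not the IDA/LIDA terms from the paper; a closed-form least-squares solve stands in here for coefficients that would be learned, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
F_src = rng.standard_normal((50, 8))   # source-corpus features (n_src x d)
F_tgt = rng.standard_normal((20, 8))   # target-corpus features (n_tgt x d)

# Find coefficients C (n_tgt x n_src) such that C @ F_src approximates F_tgt,
# i.e. each target sample is reconstructed from the source samples.
C = np.linalg.lstsq(F_src.T, F_tgt.T, rcond=None)[0].T

# The mean squared reconstruction residual acts as the alignment penalty:
# no parametric form is assumed for either corpus's feature distribution.
ida_penalty = np.mean((C @ F_src - F_tgt) ** 2)
```

The point of the sketch is that the penalty depends only on how well target samples can be rebuilt from source samples, which is what makes the alignment "implicit" rather than moment-matching against an assumed distribution.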