Seeking Salient Facial Regions for Cross-Database Micro-Expression Recognition
Cross-Database Micro-Expression Recognition (CDMER) aims to develop
Micro-Expression Recognition (MER) methods with strong domain adaptability,
i.e., the ability to recognize the Micro-Expressions (MEs) of different
subjects captured by different imaging devices in different scenes. The
development of CDMER faces two key problems: 1) the severe feature
distribution gap between the source and target databases; and 2) the feature
representation bottleneck of MEs, which are local and subtle facial expressions. To
solve these problems, this paper proposes a novel Transfer Group Sparse
Regression method, namely TGSR, which aims to 1) optimize the measurement of,
and thereby better alleviate, the difference between the source and target
databases, and 2) highlight the valid facial regions to enhance the extracted
features. Both goals are achieved by selecting group features from the raw face
features, where each facial region corresponds to one group of raw features,
i.e., salient facial region selection. Compared with previous transfer group
sparse methods, our
proposed TGSR can select the salient facial regions, which effectively
alleviates the aforementioned problems for better performance while also
reducing the computational cost. We use two public ME
databases, i.e., CASME II and SMIC, to evaluate our proposed TGSR method.
Experimental results show that the proposed TGSR learns discriminative and
explicable regions, and outperforms most state-of-the-art
subspace-learning-based domain-adaptive methods for CDMER.
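The abstract does not give TGSR's optimization in detail, but the region-selection idea it describes corresponds to group-sparse regression: regression weights are grouped per facial region, and a group-sparsity penalty drives the weights of non-salient regions to zero. The following is a minimal sketch of that idea (plain NumPy, not the authors' code; the group layout, regularization weight, step size, and toy data are illustrative assumptions):

```python
# Minimal sketch of group-sparse region selection (not the authors' TGSR):
# group-lasso regression solved by proximal gradient descent, where each
# coefficient block corresponds to the features of one facial region and
# blocks that shrink to exactly zero mark non-salient regions.
import numpy as np

def group_lasso_regression(X, Y, groups, lam=0.5, lr=1e-2, n_iter=1000):
    """X: (n_samples, n_features), Y: (n_samples, n_classes) one-hot targets,
    groups: list of index arrays, one per facial region."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y) / X.shape[0]        # least-squares gradient
        W -= lr * grad
        for g in groups:                             # block soft-thresholding
            norm = np.linalg.norm(W[g])
            W[g] = 0.0 if norm <= lr * lam else W[g] * (1 - lr * lam / norm)
    return W

# Toy usage: 4 "regions" of 8 features each; the labels depend only on
# features in region 0, so that region is expected to be the one selected.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))
labels = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
Y = np.eye(2)[labels]
groups = [np.arange(i * 8, (i + 1) * 8) for i in range(4)]
W = group_lasso_regression(X, Y, groups)
print("salient regions:", [i for i, g in enumerate(groups)
                           if np.linalg.norm(W[g]) > 1e-8])
```

In the paper's setting, the regression would additionally carry the transfer term that aligns the source and target distributions; that part is omitted here.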
Improving Speaker-independent Speech Emotion Recognition Using Dynamic Joint Distribution Adaptation
In speaker-independent speech emotion recognition, the training and testing
samples are collected from diverse speakers, leading to a multi-domain shift
challenge across the feature distributions of data from different speakers.
Consequently, when the trained model is confronted with data from new speakers,
its performance tends to degrade. To address the issue, we propose a Dynamic
Joint Distribution Adaptation (DJDA) method under the framework of multi-source
domain adaptation. DJDA first utilizes joint distribution adaptation (JDA),
involving marginal distribution adaptation (MDA) and conditional distribution
adaptation (CDA), to more precisely measure the multi-domain distribution
shifts caused by different speakers. This helps eliminate speaker bias in
emotion features, allowing discriminative and speaker-invariant speech emotion
features to be learned from the coarse level to the fine level. Furthermore, we
quantify the adaptation contributions of MDA and CDA within JDA by using a
dynamic balance factor based on the A-Distance, which helps to
effectively handle the unknown distributions encountered in data from new
speakers. Experimental results demonstrate the superior performance of our DJDA
as compared to other state-of-the-art (SOTA) methods.
Comment: Accepted by ICASSP 202
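The abstract does not spell out the DJDA losses. One common way to instantiate marginal and conditional distribution adaptation is a Maximum Mean Discrepancy (MMD) penalty between source and target features (marginal) and between their per-class subsets (conditional), mixed by a balance factor. The sketch below (PyTorch; hypothetical names) uses a linear-kernel MMD and a fixed mu in place of the paper's dynamic A-Distance-based factor and multi-source handling:

```python
# Hedged sketch (not the authors' DJDA): joint distribution adaptation as a
# weighted sum of a marginal MMD term and a class-conditional MMD term.
import torch

def mmd(x, y):
    """Linear-kernel MMD^2 between two feature batches."""
    return (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()

def jda_loss(src_feat, src_lab, tgt_feat, tgt_pseudo, n_classes, mu=0.5):
    marginal = mmd(src_feat, tgt_feat)                   # MDA: whole-domain shift
    conditional = torch.zeros(())
    for c in range(n_classes):                           # CDA: per-class shift
        s, t = src_feat[src_lab == c], tgt_feat[tgt_pseudo == c]
        if len(s) > 0 and len(t) > 0:
            conditional = conditional + mmd(s, t)
    # mu stands in for the paper's dynamic A-Distance-based balance factor.
    return mu * marginal + (1.0 - mu) * conditional

# Toy usage with random "emotion features"; the target labels would in
# practice be pseudo-labels predicted by the current model.
src, tgt = torch.randn(32, 128), torch.randn(32, 128)
src_y, tgt_y = torch.randint(0, 4, (32,)), torch.randint(0, 4, (32,))
print(jda_loss(src, src_y, tgt, tgt_y, n_classes=4).item())
```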
Learning Local to Global Feature Aggregation for Speech Emotion Recognition
The Transformer has recently emerged in speech emotion recognition (SER).
However, its equal-sized patch division not only damages frequency information but
also ignores local emotion correlations across frames, which are key cues for
representing emotion. To handle this issue, we propose Local to Global Feature
Aggregation learning (LGFA) for SER, which aggregates long-term emotion
correlations at different scales, both within frames and within segments, together
with entire frequency information to enhance the emotion discrimination of utterance-level
speech features. For this purpose, we nest a Frame Transformer inside a Segment
Transformer. First, the Frame Transformer is designed to excavate local emotion
correlations between frames to produce frame embeddings. Then, the frame embeddings
and their corresponding segment features are aggregated as different-level
complements and fed into the Segment Transformer to learn utterance-level
global emotion features. Experimental results show that the performance of LGFA
is superior to that of state-of-the-art methods.
Comment: This paper has been accepted at INTERSPEECH 202
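As a rough illustration of nesting a Frame Transformer inside a Segment Transformer, the sketch below runs a frame-level encoder over the frames of each segment, pools its outputs into segment tokens, and feeds those to a segment-level encoder to obtain an utterance-level prediction. The layer counts, dimensions, pooling, and the way frame and segment features are combined are assumptions, not the published LGFA architecture:

```python
# Hedged sketch of the "Frame Transformer inside a Segment Transformer" idea;
# local correlations are modeled within each segment, then aggregated globally.
import torch
import torch.nn as nn

class LocalToGlobal(nn.Module):
    def __init__(self, dim=128, n_classes=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        self.frame_tf = nn.TransformerEncoder(layer(), num_layers=2)
        self.segment_tf = nn.TransformerEncoder(layer(), num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        """x: (batch, n_segments, n_frames, dim) frame-level features."""
        b, s, f, d = x.shape
        frames = self.frame_tf(x.reshape(b * s, f, d))   # local correlations
        seg_emb = frames.mean(dim=1).reshape(b, s, d)    # one token per segment
        utter = self.segment_tf(seg_emb).mean(dim=1)     # global aggregation
        return self.head(utter)                          # utterance-level logits

# Toy usage: a batch of 2 utterances, 6 segments of 10 frames each.
model = LocalToGlobal()
print(model(torch.randn(2, 6, 10, 128)).shape)           # -> torch.Size([2, 4])
```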
Emotion-Aware Contrastive Adaptation Network for Source-Free Cross-Corpus Speech Emotion Recognition
Cross-corpus speech emotion recognition (SER) aims to transfer emotional
knowledge from a labeled source corpus to an unlabeled target corpus. However, prior
methods require access to source data during adaptation, which is unattainable
in real-life scenarios due to data privacy protection concerns. This paper
tackles a more practical task, namely source-free cross-corpus SER, where a
pre-trained source model is adapted to the target domain without access to
source data. To address the problem, we propose a novel method called
emotion-aware contrastive adaptation network (ECAN). The core idea is to
capture local neighborhood information between samples while considering the
global class-level adaptation. Specifically, we propose a nearest neighbor
contrastive learning strategy to promote local emotion consistency among features of
highly similar samples. Furthermore, relying solely on nearest neighbors
may lead to ambiguous boundaries between clusters. Thus, we incorporate
supervised contrastive learning to encourage greater separation between
clusters representing different emotions, thereby facilitating improved
class-level adaptation. Extensive experiments indicate that our proposed ECAN
significantly outperforms state-of-the-art methods under the source-free
cross-corpus SER setting on several speech emotion corpora.
Comment: Accepted by ICASSP 202
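The two contrastive terms described above can be sketched as a nearest-neighbor contrastive loss, which treats each sample's most similar batch-mates as positives (local neighborhood consistency), plus a supervised contrastive loss over pseudo-labels, which separates emotion clusters (class-level adaptation). The temperature, the in-batch neighborhood (rather than a memory bank), and the pseudo-labeling step are simplifying assumptions; this is not the authors' ECAN implementation:

```python
# Hedged sketch (not the authors' ECAN): local nearest-neighbour contrastive
# term plus a class-level supervised contrastive term over pseudo-labels.
import torch
import torch.nn.functional as F

def nn_contrastive(feat, k=3, tau=0.1):
    """Treat each sample's k nearest neighbours (cosine) as positives."""
    z = F.normalize(feat, dim=1)
    sim = (z @ z.t() / tau).fill_diagonal_(-1e9)   # mask self-similarity
    log_p = F.log_softmax(sim, dim=1)
    topk = sim.topk(k, dim=1).indices              # nearest-neighbour indices
    return -log_p.gather(1, topk).mean()

def sup_contrastive(feat, labels, tau=0.1):
    """Pull together samples sharing a (pseudo-)label, push the rest apart."""
    z = F.normalize(feat, dim=1)
    sim = (z @ z.t() / tau).fill_diagonal_(-1e9)
    pos = (labels[:, None] == labels[None, :]).float().fill_diagonal_(0)
    log_p = F.log_softmax(sim, dim=1)
    return -(log_p * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

# Toy usage: target-domain features with pseudo-labels from a source model.
feat = torch.randn(16, 64)
pseudo = torch.randint(0, 4, (16,))
print((nn_contrastive(feat) + sup_contrastive(feat, pseudo)).item())
```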
SDFE-LV: A Large-Scale, Multi-Source, and Unconstrained Database for Spotting Dynamic Facial Expressions in Long Videos
In this paper, we present a large-scale, multi-source, and unconstrained
database called SDFE-LV for spotting the onset and offset frames of a complete
dynamic facial expression from long videos, which is known as the topic of
dynamic facial expression spotting (DFES) and is a vital prior step for many
facial expression analysis tasks. Specifically, SDFE-LV consists of 1,191 long
videos, each of which contains one or more complete dynamic facial expressions.
Moreover, each complete dynamic facial expression in its corresponding long
video was independently labeled five times by 10 well-trained annotators.
To the best of our knowledge, SDFE-LV is the first unconstrained large-scale
database for the DFES task whose long videos are collected from multiple
real-world/closely real-world media sources, e.g., TV interviews,
documentaries, movies, and we-media short videos. Therefore, DFES tasks on
SDFE-LV database will encounter numerous practical difficulties, such as head
posture changes, occlusions, and illumination variations. We also provide a comprehensive
benchmark evaluation from different angles by using many recent
state-of-the-art deep spotting methods, so that researchers interested in DFES
can quickly and easily get started. Finally, through in-depth discussion of the
experimental evaluation results, we point out several meaningful directions for
dealing with DFES tasks and hope that DFES research can be further advanced in
the future. In addition, SDFE-LV will be freely released for academic use only
as soon as possible.