12 research outputs found
SDFE-LV: A Large-Scale, Multi-Source, and Unconstrained Database for Spotting Dynamic Facial Expressions in Long Videos
In this paper, we present a large-scale, multi-source, and unconstrained
database called SDFE-LV for spotting the onset and offset frames of a complete
dynamic facial expression from long videos, which is known as the topic of
dynamic facial expression spotting (DFES) and a vital prior step for lots of
facial expression analysis tasks. Specifically, SDFE-LV consists of 1,191 long
videos, each of which contains one or more complete dynamic facial expressions.
Moreover, each complete dynamic facial expression in its corresponding long
video was independently labeled for five times by 10 well-trained annotators.
To the best of our knowledge, SDFE-LV is the first unconstrained large-scale
database for the DFES task whose long videos are collected from multiple
real-world/closely real-world media sources, e.g., TV interviews,
documentaries, movies, and we-media short videos. Therefore, DFES tasks on
SDFE-LV database will encounter numerous difficulties in practice such as head
posture changes, occlusions, and illumination. We also provided a comprehensive
benchmark evaluation from different angles by using lots of recent
state-of-the-art deep spotting methods and hence researchers interested in DFES
can quickly and easily get started. Finally, with the deep discussions on the
experimental evaluation results, we attempt to point out several meaningful
directions to deal with DFES tasks and hope that DFES can be better advanced in
the future. In addition, SDFE-LV will be freely released for academic use only
as soon as possible
Visual-to-EEG cross-modal knowledge distillation for continuous emotion recognition
Visual modality is one of the most dominant modalities for current continuous emotion recognition methods. Compared to which the EEG modality is relatively less sound due to its intrinsic limitation such as subject bias and low spatial resolution. This work attempts to improve the continuous prediction of the EEG modality by using the dark knowledge from the visual modality. The teacher model is built by a cascade convolutional neural network - temporal convolutional network (CNN-TCN) architecture, and the student model is built by TCNs. They are fed by video frames and EEG average band power features, respectively. Two data partitioning schemes are employed, i.e., the trial-level random shuffling (TRS) and the leave-one-subject-out (LOSO). The standalone teacher and student can produce continuous prediction superior to the baseline method, and the employment of the visual-to-EEG cross-modal KD further improves the prediction with statistical significance, i.e., p-value <0.01 for TRS and p-value <0.05 for LOSO partitioning. The saliency maps of the trained student model show that the brain areas associated with the active valence state are not located in precise brain areas. Instead, it results from synchronized activity among various brain areas. And the fast beta and gamma waves, with the frequency of 18−30Hz and 30−45Hz, contribute the most to the human emotion process compared to other bands. The code is available at https://github.com/sucv/Visual_to_EEG_Cross_Modal_KD_for_CER.Agency for Science, Technology and Research (A*STAR)Published versionThis work was supported by the RIE2020 AME Programmatic Fund, Singapore (No. A20G8b0102)
Adapting Multiple Distributions for Bridging Emotions from Different Speech Corpora
In this paper, we focus on a challenging, but interesting, task in speech emotion recognition (SER), i.e., cross-corpus SER. Unlike conventional SER, a feature distribution mismatch may exist between the labeled source (training) and target (testing) speech samples in cross-corpus SER because they come from different speech emotion corpora, which degrades the performance of most well-performing SER methods. To address this issue, we propose a novel transfer subspace learning method called multiple distribution-adapted regression (MDAR) to bridge the gap between speech samples from different corpora. Specifically, MDAR aims to learn a projection matrix to build the relationship between the source speech features and emotion labels. A novel regularization term called multiple distribution adaption (MDA), consisting of a marginal and two conditional distribution-adapted operations, is designed to collaboratively enable such a discriminative projection matrix to be applicable to the target speech samples, regardless of speech corpus variance. Consequently, by resorting to the learned projection matrix, we are able to predict the emotion labels of target speech samples when only the source label information is given. To evaluate the proposed MDAR method, extensive cross-corpus SER tasks based on three different speech emotion corpora, i.e., EmoDB, eNTERFACE, and CASIA, were designed. Experimental results showed that the proposed MDAR outperformed most recent state-of-the-art transfer subspace learning methods and even performed better than several well-performing deep transfer learning methods in dealing with cross-corpus SER tasks
Implicitly Aligning Joint Distributions for Cross-Corpus Speech Emotion Recognition
In this paper, we investigate the problem of cross-corpus speech emotion recognition (SER), in which the training (source) and testing (target) speech samples belong to different corpora. This case thus leads to a feature distribution mismatch between the source and target speech samples. Hence, the performance of most existing SER methods drops sharply. To solve this problem, we propose a simple yet effective transfer subspace learning method called joint distribution implicitly aligned subspace learning (JIASL). The basic idea of JIASL is very straightforward, i.e., building an emotion discriminative and corpus invariant linear regression model under an implicit distribution alignment strategy. Following this idea, we first make use of the source speech features and emotion labels to endow such a regression model with emotion-discriminative ability. Then, a well-designed reconstruction regularization term, jointly considering the marginal and conditional distribution alignments between the speech samples in both corpora, is adopted to implicitly enable the regression model to predict the emotion labels of target speech samples. To evaluate the performance of our proposed JIASL, extensive cross-corpus SER experiments are carried out, and the results demonstrate the promising performance of the proposed JIASL in coping with the tasks of cross-corpus SER
Implicitly Aligning Joint Distributions for Cross-Corpus Speech Emotion Recognition
In this paper, we investigate the problem of cross-corpus speech emotion recognition (SER), in which the training (source) and testing (target) speech samples belong to different corpora. This case thus leads to a feature distribution mismatch between the source and target speech samples. Hence, the performance of most existing SER methods drops sharply. To solve this problem, we propose a simple yet effective transfer subspace learning method called joint distribution implicitly aligned subspace learning (JIASL). The basic idea of JIASL is very straightforward, i.e., building an emotion discriminative and corpus invariant linear regression model under an implicit distribution alignment strategy. Following this idea, we first make use of the source speech features and emotion labels to endow such a regression model with emotion-discriminative ability. Then, a well-designed reconstruction regularization term, jointly considering the marginal and conditional distribution alignments between the speech samples in both corpora, is adopted to implicitly enable the regression model to predict the emotion labels of target speech samples. To evaluate the performance of our proposed JIASL, extensive cross-corpus SER experiments are carried out, and the results demonstrate the promising performance of the proposed JIASL in coping with the tasks of cross-corpus SER
Cross-database micro-expression recognition:a benchmark
Abstract
Cross-database micro-expression recognition (CDMER) is one of recently emerging and interesting problems in micro-expression analysis. CDMER is more challenging than the conventional micro-expression recognition (MER), because the training and testing samples in CDMER come from different micro-expression databases, resulting in inconsistency of the feature distributions between the training and testing sets. In this paper, we contribute to this topic from two aspects. First, we establish a CDMER experimental evaluation protocol and provide a standard platform for evaluating their proposed methods. Second, we conduct extensive benchmark experiments by using NINE state-of-the-art domain adaptation (DA) methods and SIX popular spatiotemporal descriptors for investigating the CDMER problem from two different perspectives and deeply analyze and discuss the experimental results. In addition, all the data and codes involving CDMER in this paper are released on our project website: http://aip.seu.edu.cn/cdmer