Recently, wearable emotion recognition based on peripheral physiological
signals has drawn considerable attention due to its minimally invasive nature
and its applicability in real-life scenarios. However, how to effectively fuse
multimodal data remains a challenging problem. Moreover, traditional
fully-supervised approaches suffer from overfitting given limited labeled
data. To address the above issues, we propose a novel self-supervised learning
(SSL) framework for wearable emotion recognition, where efficient multimodal
fusion is realized with temporal convolution-based modality-specific encoders
and a transformer-based shared encoder, capturing both intra-modal and
inter-modal correlations. Large amounts of unlabeled data are automatically
assigned labels via five signal transformations, and the proposed SSL model is pre-trained
with signal transformation recognition as a pretext task, allowing the
extraction of generalized multimodal representations for emotion-related
downstream tasks. For evaluation, the proposed SSL model was first pre-trained
on a large-scale self-collected physiological dataset and the resulting encoder
was subsequently frozen or fine-tuned on three public supervised emotion
recognition datasets. Ultimately, our SSL-based method achieved
state-of-the-art results in various emotion classification tasks. Moreover,
the proposed model proved more accurate and robust than fully-supervised
methods in low-data regimes.

Comment: Accepted to IEEE Transactions on Affective Computing