2,406 research outputs found
Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, which is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 2018
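As a rough illustration of the cross-modal self-supervision plus hard-negative curriculum described above, the following PyTorch sketch trains a joint face/voice embedding from co-occurring pairs and mines progressively harder negatives. The margin loss, the hardness schedule, and all names here are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: cross-modal embedding with a curriculum over hard negatives.
# Positive pairs come from face/voice co-occurrence in video (no identity labels).
import torch
import torch.nn.functional as F

def contrastive_loss(face_emb, voice_emb, hardness, margin=0.6):
    """face_emb, voice_emb: (B, D) L2-normalised embeddings; row i is a
    positive pair. hardness in [0, 1] controls how hard mined negatives are."""
    sim = face_emb @ voice_emb.t()                   # (B, B) cosine similarities
    pos = sim.diag()                                  # positives on the diagonal
    neg = sim - 2.0 * torch.eye(len(sim), device=sim.device)  # mask positives
    ranked, _ = neg.sort(dim=1, descending=True)      # hardest negatives first
    # Curriculum: interpolate from the easiest negatives toward the hardest
    # as `hardness` grows over training (e.g. hardness = min(1.0, epoch / 10)).
    idx = int((1.0 - hardness) * (neg.size(1) - 2))
    mined = ranked[:, idx]
    return F.relu(margin - pos + mined).mean()
```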
Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling
This paper aims to synthesize a target speaker's speech with a desired speaking
style and emotion by transferring the style and emotion from reference speech
recorded by other speakers. Specifically, we address this challenging problem
with a two-stage framework composed of a text-to-style-and-emotion (Text2SE)
module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural
bottleneck (BN) features. To further solve the multi-factor (speaker timbre,
speaking style and emotion) decoupling problem, we adopt the multi-label binary
vector (MBV) and mutual information (MI) minimization to respectively
discretize the extracted embeddings and disentangle these highly entangled
factors in both Text2SE and SE2Wave modules. Moreover, we introduce a
semi-supervised training strategy to leverage data from multiple speakers,
including emotion-labelled data, style-labelled data, and unlabeled data. To
better transfer the fine-grained expressiveness from references to the target
speaker in the non-parallel transfer, we introduce a reference-candidate pool
and propose an attention-based reference selection approach. Extensive
experiments demonstrate the effectiveness of our model design.
Comment: Submitted to ICASSP 2023
Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
Automatic emotion recognition is an active research topic with a wide range of
applications. Due to the high manual annotation cost and inevitable label
ambiguity, emotion recognition datasets are limited in both scale and quality.
Therefore, one of the key challenges is how to build effective models with
limited data resources. Previous works have explored different approaches to
tackle this challenge, including data augmentation, transfer learning, and
semi-supervised learning. However, these existing approaches suffer from
weaknesses such as training instability, large performance loss during
transfer, or marginal improvement.
In this work, we propose a novel semi-supervised multi-modal emotion
recognition model based on cross-modality distribution matching, which
leverages abundant unlabeled data to enhance the model training under the
assumption that the inner emotional status is consistent at the utterance level
across modalities.
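A hedged sketch of the consistency assumption above: on unlabeled utterances, the emotion distributions predicted from two modalities can be pulled toward each other. The symmetric KL objective below is an assumption for illustration; the paper's exact matching loss may differ.

```python
# Minimal sketch: match per-modality emotion distributions on unlabeled data.
import torch.nn.functional as F

def distribution_matching_loss(logits_audio, logits_text):
    """logits_*: (B, C) per-modality emotion logits for the same utterances."""
    p = F.log_softmax(logits_audio, dim=-1)
    q = F.log_softmax(logits_text, dim=-1)
    # Symmetric KL between the two modality-level predicted distributions.
    kl_pq = F.kl_div(p, q, reduction="batchmean", log_target=True)  # KL(q || p)
    kl_qp = F.kl_div(q, p, reduction="batchmean", log_target=True)  # KL(p || q)
    return 0.5 * (kl_pq + kl_qp)
```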
We conduct extensive experiments to evaluate the proposed model on two
benchmark datasets, IEMOCAP and MELD. The experimental results show that the
proposed semi-supervised learning model can effectively utilize unlabeled data
and combine multi-modalities to boost the emotion recognition performance,
which outperforms other state-of-the-art approaches under the same condition.
The proposed model also achieves competitive performance compared with existing
approaches that take advantage of additional auxiliary information such as
speaker and interaction context.
Comment: 10 pages, 5 figures, to be published in ACM Multimedia 2020
Simple Model Also Works: A Novel Emotion Recognition Network in Textual Conversation Based on Curriculum Learning Strategy
Emotion Recognition in Conversation (ERC) has emerged as a research hotspot
in domains such as conversational robots and question-answering systems. How to
efficiently and adequately retrieve contextual emotional cues has been one of
the key challenges in the ERC task. Existing efforts do not fully model the
context, and they employ complex network structures, resulting in excessive
computational overhead without substantial performance improvement. In
this paper, we propose a novel Emotion Recognition Network based on Curriculum
Learning strategy (ERNetCL). The proposed ERNetCL primarily consists of
Temporal Encoder (TE), Spatial Encoder (SE), and Curriculum Learning (CL) loss.
We utilize TE and SE to combine the strengths of previous methods in a
simple manner to efficiently capture temporal and spatial contextual
information in the conversation. To mimic the way humans learn a curriculum
from easy to hard, we apply the idea of CL to the ERC task to progressively
optimize the network parameters of ERNetCL. At the beginning of training, we
assign lower learning weights to difficult samples. As the epoch increases, the
learning weights for these samples are gradually raised. Extensive experiments
on four datasets show that our proposed method is effective and substantially
outperforms other baseline models.
Comment: 12 pages, 9 figures
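As a rough illustration of the CL loss described above, the following PyTorch sketch down-weights difficult samples at the beginning of training and ramps their weights up as the epoch increases. The per-sample-loss difficulty proxy and the linear ramp are illustrative assumptions, not ERNetCL's exact weighting scheme.

```python
# Minimal sketch: curriculum-weighted cross-entropy where hard samples start
# with low weight and approach full weight as training progresses.
import torch
import torch.nn.functional as F

def curriculum_ce_loss(logits, labels, epoch, total_epochs):
    """logits: (B, C), labels: (B,)."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")   # (B,)
    # Difficulty proxy: normalised per-sample loss (detached from the graph).
    difficulty = per_sample.detach() / (per_sample.detach().max() + 1e-8)
    ramp = min(epoch / total_epochs, 1.0)            # 0 -> 1 over training
    weights = 1.0 - difficulty * (1.0 - ramp)        # hard samples gain weight
    return (weights * per_sample).mean()
```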