879 research outputs found
RBA-GCN: Relational Bilevel Aggregation Graph Convolutional Network for Emotion Recognition
Emotion recognition in conversation (ERC) has received increasing attention
from researchers due to its wide range of applications. As conversation has a
natural graph structure, numerous approaches used to model ERC based on graph
convolutional networks (GCNs) have yielded significant results. However, the
aggregation approach of traditional GCNs suffers from the node information
redundancy problem, leading to node discriminant information loss.
Additionally, single-layer GCNs lack the capacity to capture long-range
contextual information from the graph. Furthermore, the majority of approaches
are based on textual modality or stitching together different modalities,
resulting in a weak ability to capture interactions between modalities. To
address these problems, we present the relational bilevel aggregation graph
convolutional network (RBA-GCN), which consists of three modules: the graph
generation module (GGM), similarity-based cluster building module (SCBM) and
bilevel aggregation module (BiAM). First, GGM constructs a novel graph to
reduce the redundancy of target node information. Then, SCBM calculates the
node similarity in the target node and its structural neighborhood, where noisy
information with low similarity is filtered out to preserve the discriminant
information of the node. Meanwhile, BiAM is a novel aggregation method that can
preserve the information of nodes during the aggregation process. This module
can construct the interaction between different modalities and capture
long-range contextual information based on similarity clusters. On both the
IEMOCAP and MELD datasets, the weighted average F1 score of RBA-GCN has a
2.17-5.21% improvement over that of the most advanced method.
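For readers unfamiliar with the metric quoted above, the weighted average F1 score is the mean of the per-class F1 scores, each weighted by how often that class occurs in the ground truth. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `weighted_f1` and the toy emotion labels are illustrative assumptions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (hypothetical helper)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of samples
    return score


# Toy labels, invented for illustration only.
y_true = ["happy", "happy", "sad", "angry"]
y_pred = ["happy", "sad", "sad", "angry"]
print(weighted_f1(y_true, y_pred))
```

Weighting by class support matters for ERC benchmarks because emotion classes are typically imbalanced, so an unweighted mean would overstate the influence of rare classes.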
Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
Deep learning techniques have considerably improved speech processing in
recent years. Speech representations extracted by deep learning models are
being used in a wide range of tasks such as speech recognition, speaker
recognition, and speech emotion recognition. Attention models play an important
role in improving deep learning models. However, current attention mechanisms
are unable to attend to fine-grained information items. In this paper, we
propose the novel Fine-grained Early Frequency Attention (FEFA) for speech
signals. This model is capable of focusing on information items as small as
frequency bins. We evaluate the proposed model on two popular tasks: speaker
recognition and speech emotion recognition. Two widely used public datasets,
VoxCeleb and IEMOCAP, are used for our experiments. The model is implemented on
top of several prominent deep models as backbone networks to evaluate its
impact on performance compared to the original networks and other related work.
Our experiments show that by adding FEFA to different CNN architectures,
performance is consistently improved by substantial margins, even setting a new
state-of-the-art for the speaker recognition task. We also tested our model
against different levels of added noise, showing improved robustness and
lower sensitivity compared to the backbone networks.
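A robustness test of this kind usually mixes noise into clean audio at a controlled signal-to-noise ratio (SNR). The sketch below shows one common way to do that with white Gaussian noise. It is an assumption for illustration only: the abstract does not say which noise type or SNR levels were used, and the names `add_white_noise` and `measured_snr_db` are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the target SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    raw_power = sum(n * n for n in noise) / len(noise)
    # Rescale so the empirical noise power matches the target.
    scale = math.sqrt(noise_power / raw_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    sig = sum(x * x for x in clean)
    err = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10 * math.log10(sig / err)


# A 440 Hz tone sampled at 16 kHz stands in for a speech waveform.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
for snr in (20, 10, 0):
    noisy = add_white_noise(clean, snr)
    print(snr, round(measured_snr_db(clean, noisy), 3))
```

Feeding the same utterances to a model at decreasing SNR levels and tracking the drop in accuracy gives a simple sensitivity curve for comparing a backbone with and without an added module.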
Multimodal Language Analysis with Recurrent Multistage Fusion
Computational modeling of human multimodal language is an emerging research
area in natural language processing spanning the language, visual and acoustic
modalities. Comprehending multimodal language requires modeling not only the
interactions within each modality (intra-modal interactions) but more
importantly the interactions between modalities (cross-modal interactions). In
this paper, we propose the Recurrent Multistage Fusion Network (RMFN) which
decomposes the fusion problem into multiple stages, each of them focused on a
subset of multimodal signals for specialized, effective fusion. Cross-modal
interactions are modeled using this multistage fusion approach which builds
upon intermediate representations of previous stages. Temporal and intra-modal
interactions are modeled by integrating our proposed fusion approach with a
system of recurrent neural networks. The RMFN displays state-of-the-art
performance in modeling human multimodal language across three public datasets
relating to multimodal sentiment analysis, emotion recognition, and speaker
traits recognition. We provide visualizations to show that each stage of fusion
focuses on a different subset of multimodal signals, learning increasingly
discriminative multimodal representations. Comment: EMNLP 201
GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition
Emotion Recognition in Conversation (ERC) plays a significant part in
Human-Computer Interaction (HCI) systems since it can provide empathetic
services. Multimodal ERC can mitigate the drawbacks of uni-modal approaches.
Recently, Graph Neural Networks (GNNs) have been widely used in a variety of
fields due to their superior performance in relation modeling. In multimodal
ERC, GNNs are capable of extracting both long-distance contextual information
and inter-modal interactive information. Unfortunately, since existing methods
such as MMGCN directly fuse multiple modalities, redundant information may be
generated and diverse information may be lost. In this work, we present a
directed Graph based Cross-modal Feature Complementation (GraphCFC) module that
can efficiently model contextual and interactive information. GraphCFC
alleviates the problem of heterogeneity gap in multimodal fusion by utilizing
multiple subspace extractors and Pair-wise Cross-modal Complementary (PairCC)
strategy. We extract various types of edges from the constructed graph for
encoding, thus enabling GNNs to extract crucial contextual and interactive
information more accurately when performing message passing. Furthermore, we
design a GNN structure called GAT-MLP, which can provide a new unified network
framework for multimodal learning. The experimental results on two benchmark
datasets show that our GraphCFC outperforms the state-of-the-art (SOTA)
approaches. Comment: 13 page
MASR: Metadata Aware Speech Representation
In recent years, speech representation learning has been framed primarily
as a self-supervised learning (SSL) task, using the raw audio signal alone,
while ignoring the side-information that is often available for a given speech
recording. In this paper, we propose MASR, a Metadata Aware Speech
Representation learning framework, which addresses this limitation.
MASR enables the inclusion of multiple external knowledge sources
to enhance the utilization of metadata information. The external knowledge
sources are incorporated in the form of sample-level pair-wise similarity
matrices that are useful in a hard-mining loss. A key advantage of the MASR
framework is that it can be combined with any choice of SSL method. Using MASR
representations, we perform evaluations on several downstream tasks such as
language identification, speech recognition and other non-semantic tasks such
as speaker and emotion recognition. In these experiments, we illustrate
significant performance improvements for MASR over other established
benchmarks. We perform a detailed analysis on the language identification task
to provide insights on how the proposed loss function enables the
representations to separate closely related languages.
USING DEEP LEARNING-BASED FRAMEWORK FOR CHILD SPEECH EMOTION RECOGNITION
Bodily signals through which human emotion can be detected abound, including heart rate, facial expressions, movement of the eyelids and dilation of the eyes, body postures, skin conductance, and even speech. Speech emotion recognition research started some three decades ago, and the popular Interspeech Emotion Challenge has helped to propagate this research area. However, most speech recognition research is focused on adults, and there is very little research on child speech. This dissertation describes the development and evaluation of a child speech emotion recognition framework. The higher-level components of the framework are designed to sort and separate speech based on the speaker’s age, ensuring that the focus is only on speech produced by children. The framework uses Baddeley’s Theory of Working Memory to model a Working Memory Recurrent Network that can process and recognize emotions from speech. Baddeley’s Theory of Working Memory offers one of the best explanations of how the human brain holds and manipulates temporary information, which is crucial in the development of neural networks that learn effectively. Experiments were designed and performed to answer the research questions, evaluate the proposed framework, and benchmark its performance against other methods. Satisfactory results were obtained from the experiments, and in many cases our framework outperformed other popular approaches. This study has implications for various applications of child speech emotion recognition, such as child abuse detection and child learning robots.