A Contextualized Real-Time Multimodal Emotion Recognition for Conversational Agents using Graph Convolutional Networks in Reinforcement Learning
Owing to the recent developments in Generative Artificial Intelligence
(GenAI) and Large Language Models (LLMs), conversational agents are becoming
increasingly popular and accepted. They provide a human touch by interacting in
ways familiar to us and by providing support as virtual companions. Therefore,
it is important to understand the user's emotions in order to respond
considerately. Compared to the standard problem of emotion recognition,
conversational agents face an additional constraint in that recognition must be
real-time. Studies on model architectures using audio, visual, and textual
modalities have mainly focused on emotion classification using full video
sequences that do not provide online features. In this work, we present a novel
paradigm for contextualized Emotion Recognition using Graph Convolutional
Network with Reinforcement Learning (conER-GRL). Conversations are partitioned
into smaller groups of utterances for effective extraction of contextual
information. The system uses Gated Recurrent Units (GRU) to extract multimodal
features from these groups of utterances. More importantly, Graph Convolutional
Networks (GCN) and Reinforcement Learning (RL) agents are cascade trained to
capture the complex dependencies of emotion features in interactive scenarios.
Comparing the results of the conER-GRL model with other state-of-the-art models
on the benchmark dataset IEMOCAP demonstrates the advantageous capabilities of
the conER-GRL architecture in recognizing emotions in real-time from multimodal
conversational signals.
Comment: 5 pages (4 main + 1 reference), 2 figures. Submitted to IEEE FG202
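The abstract above describes propagating multimodal utterance features through a Graph Convolutional Network. The paper's actual architecture is not given here, but the core GCN propagation rule it relies on can be sketched as a minimal, hypothetical numpy example, assuming utterances in a group form graph nodes whose features (e.g. GRU outputs) sit in the rows of `H`:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalisation
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# 4 utterances in a group, 8-dim multimodal features (stand-ins for GRU output)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 8))
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # chain of consecutive utterances
W = rng.standard_normal((8, 6))
H_out = gcn_layer(H, A, W)
print(H_out.shape)  # (4, 6)
```

The chain adjacency and dimensions here are illustrative only; the paper's graph construction and the RL cascade on top of it are not reproduced.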
A Dynamic Approach to Pose Invariant Face Identification Using Cellular Simultaneous Recurrent Networks
Face recognition is a widely studied and active research field that has produced multiple techniques and approaches. Most of them have severe limitations with pose variation or face rotation. The immediate goal of this thesis is to deal with pose variations by implementing a face recognition system using a Cellular Simultaneous Recurrent Network (CSRN). The CSRN is a novel bio-inspired recurrent neural network that mimics reinforcement learning in the brain. The recognition task is defined as an identification problem on image sequences: the goal is to correctly match a set of unknown pose-distorted probe face sequences against a set of known gallery sequences. The system comprises a pre-processing stage for face and feature extraction and a recognition stage that performs the identification. The face detection algorithm is based on the scale-space method combined with facial structural knowledge. These steps include extraction of key landmark points and motion unit vectors that describe the movement of face sequences. The identification process applies Eigenface and PCA to reduce each image to a pattern vector used as input to the CSRN. In the training phase the CSRN learns the temporal information contained in image sequences; in the testing phase the network predicts the output pattern and measures its similarity with a test input pattern, indicating a match or mismatch.
Previous applications of a CSRN system to face recognition have shown promise. The first objective of this research is to evaluate those prior implementations of CSRN-based pose-invariant face recognition in video images with large-scale databases. The publicly available VidTIMIT Audio-Video face dataset provides all the sequences needed for this study. The second objective is to modify a few well-known standard face recognition algorithms to handle pose-invariant face recognition for appropriate benchmarking against the CSRN.
The final objective is to further improve CSRN face recognition by introducing motion units, which can be used to capture the direction and intensity of movement of feature points in a rotating face.
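The Eigenface/PCA step in the pipeline above (reducing each image to a pattern vector for the CSRN) is standard and can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the image sizes, component count, and function names are assumptions:

```python
import numpy as np

def eigenface_projection(train_imgs, k):
    """PCA on flattened face images: returns the mean face and top-k components."""
    X = train_imgs.reshape(len(train_imgs), -1).astype(float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centred data matrix yields the principal directions in Vt
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mean, Vt[:k]

def to_pattern_vector(img, mean, components):
    """Reduce one image to a k-dim pattern vector (CSRN input)."""
    return components @ (img.ravel().astype(float) - mean)

rng = np.random.default_rng(1)
faces = rng.random((20, 16, 16))          # 20 tiny stand-in 'face' images
mean, comps = eigenface_projection(faces, k=5)
vec = to_pattern_vector(faces[0], mean, comps)
print(vec.shape)  # (5,)
```

In the described system, one such pattern vector per frame forms the temporal sequence the CSRN is trained on.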
Dynamic Face Video Segmentation via Reinforcement Learning
For real-time semantic video segmentation, most recent works utilised a
dynamic framework with a key scheduler to make online key/non-key decisions.
Some works used a fixed key scheduling policy, while others proposed adaptive
key scheduling methods based on heuristic strategies, both of which may lead to
suboptimal global performance. To overcome this limitation, we model the online
key decision process in dynamic video segmentation as a deep reinforcement
learning problem and learn an efficient and effective scheduling policy from
expert information about decision history and from the process of maximising
global return. Moreover, we study the application of dynamic video segmentation
on face videos, a field that has not been investigated before. By evaluating on
the 300VW dataset, we show that the performance of our reinforcement key
scheduler outperforms that of various baselines in terms of both effective key
selections and running speed. Further results on the Cityscapes dataset
demonstrate that our proposed method can also generalise to other scenarios. To
the best of our knowledge, this is the first work to use reinforcement learning
for online key-frame decision in dynamic video segmentation, and also the first
work on its application on face videos.
Comment: CVPR 2020. 300VW with segmentation labels is available at:
https://github.com/mapleandfire/300VW-Mas
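The online key/non-key decision problem described above can be illustrated with a toy tabular Q-learning agent. This is a hypothetical sketch, not the paper's method (which uses deep RL with expert decision history): the state is simply the number of frames since the last key frame, and the reward trades segmentation accuracy (assumed to decay as the scene drifts) against the compute cost of running the full segmenter:

```python
import random

# Toy MDP: state = frames since last key frame;
# action 0 = reuse features (non-key), action 1 = run the full segmenter (key).
MAX_GAP, KEY_COST = 5, 0.4

def step(gap, action):
    if action == 1:                       # key frame: pay compute, reset drift
        return 0, 1.0 - KEY_COST
    new_gap = min(gap + 1, MAX_GAP)       # non-key: accuracy decays with drift
    return new_gap, 1.0 - 0.25 * new_gap

def train(episodes=2000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(MAX_GAP + 1)]
    for _ in range(episodes):
        gap = 0
        for _ in range(30):               # one short video clip
            greedy = max((0, 1), key=lambda a: Q[gap][a])
            a = rng.randrange(2) if rng.random() < eps else greedy
            nxt, r = step(gap, a)
            Q[gap][a] += alpha * (r + gamma * max(Q[nxt]) - Q[gap][a])
            gap = nxt
    return Q

Q = train()
policy = [max((0, 1), key=lambda a: Q[g][a]) for g in range(MAX_GAP + 1)]
print(policy)  # larger drift should push the policy toward choosing a key frame
```

Under these assumed rewards the learned policy re-runs the segmenter once drift accumulates, which is the qualitative behaviour a learned key scheduler aims for.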
Attention-Aware Face Hallucination via Deep Reinforcement Learning
Face hallucination is a domain-specific super-resolution problem with the
goal to generate high-resolution (HR) faces from low-resolution (LR) input
images. In contrast to existing methods that often learn a single
patch-to-patch mapping from LR to HR images and are regardless of the
contextual interdependency between patches, we propose a novel Attention-aware
Face Hallucination (Attention-FH) framework which resorts to deep reinforcement
learning for sequentially discovering attended patches and then performing the
facial part enhancement by fully exploiting the global interdependency of the
image. Specifically, in each time step, the recurrent policy network is
proposed to dynamically specify a new attended region by incorporating what
happened in the past. The state (i.e., face hallucination result for the whole
image) can thus be exploited and updated by the local enhancement network on
the selected region. The Attention-FH approach jointly learns the recurrent
policy network and local enhancement network through maximizing the long-term
reward that reflects the hallucination performance over the whole image.
Therefore, our proposed Attention-FH is capable of adaptively personalizing an
optimal searching path for each face image according to its own characteristic.
Extensive experiments show our approach significantly surpasses
state-of-the-art methods on in-the-wild faces with large pose and illumination
variations.
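The sequential attend-then-enhance loop described above can be sketched as follows. This is a hypothetical stand-in, not Attention-FH itself: the learned recurrent policy network is replaced by a heuristic that attends to the patch deviating most from a crude global reference, and the local enhancement network by a simple averaging update:

```python
import numpy as np

def enhance_sequentially(lr_img, steps=3, patch=4):
    """Stand-in for Attention-FH's loop: repeatedly pick a patch and enhance it.

    The learned recurrent policy is replaced by a heuristic that attends to
    the patch with the largest mean deviation from the global mean intensity.
    """
    state = lr_img.copy()
    ref = lr_img.mean()                     # crude global-context stand-in
    visited = []
    for _ in range(steps):
        h, w = state.shape
        best, best_score = None, -1.0
        for i in range(0, h, patch):        # score every candidate patch
            for j in range(0, w, patch):
                score = np.abs(state[i:i+patch, j:j+patch] - ref).mean()
                if score > best_score:
                    best, best_score = (i, j), score
        i, j = best
        # 'local enhancement network' stand-in: pull the patch toward the ref
        state[i:i+patch, j:j+patch] = 0.5 * (state[i:i+patch, j:j+patch] + ref)
        visited.append(best)
    return state, visited

img = np.zeros((8, 8)); img[:4, :4] = 1.0   # one high-contrast quadrant
out, visited = enhance_sequentially(img)
print(visited[0])  # the high-contrast patch is attended first: (0, 0)
```

The key structural idea retained here is that the attended region at each step depends on the current state of the whole image, so the search path adapts to each input face.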