Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives
Over the past few years, adversarial training has become an extremely active
research topic and has been successfully applied to various Artificial
Intelligence (AI) domains. Because adversarial training is a potentially
crucial technique for the development of the next generation of emotional AI
systems, we herein provide a comprehensive overview of its application to affective
computing and sentiment analysis. Various representative adversarial training
algorithms are explained and discussed accordingly, aimed at tackling diverse
challenges associated with emotional AI systems. Further, we highlight a range
of potential future research directions. We expect that this overview will help
facilitate the development of adversarial training for affective computing and
sentiment analysis in both the academic and industrial communities.
KNOT: Knowledge Distillation using Optimal Transport for Solving NLP Tasks
We propose a new approach, Knowledge Distillation using Optimal Transport
(KNOT), to distill the natural language semantic knowledge from multiple
teacher networks to a student network. KNOT aims to train a (global) student
model by learning to minimize the optimal transport cost of its assigned
probability distribution over the labels to the weighted sum of probabilities
predicted by the (local) teacher models, under the constraint that the
student model does not have access to the teacher models' parameters or training
data. To evaluate the quality of knowledge transfer, we introduce a new metric,
Semantic Distance (SD), that measures semantic closeness between the predicted
and ground truth label distributions. The proposed method shows improvements in
the global model's SD performance over the baseline across three NLP tasks
while performing on par with Entropy-based distillation on standard accuracy
and F1 metrics. The implementation pertaining to this work is publicly
available at: https://github.com/declare-lab/KNOT.
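The core idea can be illustrated as an optimal-transport loss between the student's label distribution and a weighted mixture of the teachers' predictions. Below is a minimal sketch using an entropic (Sinkhorn) OT solver; the cost matrix, hyper-parameters, and function names are illustrative assumptions, not taken from the KNOT implementation.

```python
import torch

def sinkhorn_ot_cost(p, q, cost, eps=0.1, n_iters=50):
    """Entropic-regularized OT cost between two batches of label distributions.

    p, q : (batch, n_labels) probability vectors
    cost : (n_labels, n_labels) ground cost between labels
    """
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(n_iters):                         # Sinkhorn iterations
        v = q / (u @ K + 1e-9)
        u = p / (v @ K.T + 1e-9)
    transport = u.unsqueeze(2) * K.unsqueeze(0) * v.unsqueeze(1)   # (B, n, n) plan
    return (transport * cost.unsqueeze(0)).sum(dim=(1, 2)).mean()

def knot_style_loss(student_logits, teacher_probs_list, teacher_weights, cost):
    """OT distillation loss: student distribution vs. weighted teacher mixture."""
    p_student = torch.softmax(student_logits, dim=-1)
    q_teachers = sum(w * p for w, p in zip(teacher_weights, teacher_probs_list))
    return sinkhorn_ot_cost(p_student, q_teachers, cost)
```

In practice the ground cost between labels could come from, for example, distances between label embeddings; that design choice is left open here.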
Leveraging audio-visual speech effectively via deep learning
The rising popularity of neural networks, combined with the recent proliferation of online audio-visual media, has led to a revolution in the way machines encode, recognize, and generate acoustic and visual speech. Despite the ubiquity of naturally paired audio-visual data, only a limited number of works have applied recent advances in deep learning to leverage the duality between audio and video within this domain. This thesis considers the use of neural networks to learn from large unlabelled datasets of audio-visual speech to enable new practical applications.
We begin by training a visual speech encoder that predicts latent features extracted from the corresponding audio on a large unlabelled audio-visual corpus. We apply the trained visual encoder to improve performance on lip reading in real-world scenarios. Following this, we extend the idea of learning video representations from audio by training a model to synthesize raw speech directly from raw video, without the need for text transcriptions. Remarkably, we find that this framework is capable of reconstructing intelligible audio from videos of new, previously unseen speakers. We also experiment with a separate speech reconstruction framework, which leverages recent advances in sequence modeling and spectrogram inversion to improve the realism of the generated speech.
We then apply our research in video-to-speech synthesis to advance the state of the art in audio-visual speech enhancement, proposing a new vocoder-based model that performs particularly well under extremely noisy scenarios. Lastly, we aim to fully realize the potential of paired audio-visual data by proposing two novel frameworks that leverage acoustic and visual speech to train two encoders that learn from each other simultaneously. We leverage these pre-trained encoders for deepfake detection, speech recognition, and lip reading, and find that they consistently yield improvements over training from scratch.
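The cross-modal pretraining objective described above (a visual encoder regressing audio-derived features) can be sketched roughly as follows; the architecture, feature dimension, and training step are hypothetical stand-ins rather than the models used in the thesis.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Hypothetical 3D-CNN stand-in for the visual speech encoder."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Conv3d(3, 64, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))      # keep the time axis
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, video):                                # video: (B, 3, T, H, W)
        x = torch.relu(self.conv(video))
        x = self.pool(x).squeeze(-1).squeeze(-1)             # (B, 64, T)
        return self.proj(x.transpose(1, 2))                  # (B, T, feat_dim)

def pretrain_step(video, audio_feats, visual_enc, optimizer):
    """One step of predicting (frozen) audio-derived features from video."""
    pred = visual_enc(video)                                  # (B, T, D)
    loss = nn.functional.mse_loss(pred, audio_feats)          # regress audio latents
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```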
Human Motion Anticipation and Recognition from RGB-D
Predicting and understanding the dynamics of human motion has many applications, such as motion synthesis, augmented reality, security, education, reinforcement learning, autonomous vehicles, and many others. In this thesis, we create a novel end-to-end pipeline that can predict multiple future poses from the same input and, in addition, can classify the entire sequence. Our focus is on the following two aspects of human motion understanding:
Probabilistic human action prediction: Given a sequence of human poses as input, we sample multiple possible future poses from the same input sequence using a new GAN-based network.
Human motion understanding: Given a sequence of human poses as input, we classify the actual action performed in the sequence and improve classification performance using the representation learned by the prediction network.
We also demonstrate how to improve model training from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers label each input image and compare four different approaches: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss over the label distribution. We show that the traditional majority voting scheme does not perform as well as the last two approaches, which fully leverage the label distribution. We have shared the enhanced FER+ data set, with multiple labels for each face image, with the research community (https://github.com/Microsoft/FERPlus).
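To illustrate why the label-distribution approaches can outperform majority voting, the sketch below contrasts a hard majority-vote cross-entropy with a cross-entropy over the full distribution of tagger votes; the variable names and example counts are assumptions for illustration, not code from the FER+ release.

```python
import torch
import torch.nn.functional as F

def majority_vote_loss(logits, tagger_counts):
    """Standard cross-entropy against the single majority label."""
    hard_target = tagger_counts.argmax(dim=-1)                 # (B,)
    return F.cross_entropy(logits, hard_target)

def label_distribution_loss(logits, tagger_counts):
    """Cross-entropy against the full distribution of tagger votes."""
    soft_target = tagger_counts / tagger_counts.sum(dim=-1, keepdim=True)  # (B, C)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_target * log_probs).sum(dim=-1).mean()

# Example: 10 taggers voting over 8 emotion classes for a batch of 4 faces.
tagger_counts = torch.tensor([[6, 2, 1, 1, 0, 0, 0, 0],
                              [0, 9, 0, 0, 1, 0, 0, 0],
                              [3, 3, 2, 0, 0, 1, 1, 0],
                              [0, 0, 0, 0, 0, 0, 0, 10]]).float()
logits = torch.randn(4, 8, requires_grad=True)
print(majority_vote_loss(logits, tagger_counts),
      label_distribution_loss(logits, tagger_counts))
```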
To predict and understand human motion, we propose a novel sequence-to-sequence model trained with an improved version of generative adversarial networks (GAN). Our model, which we call HP-GAN2, learns a probability density function of future human poses conditioned on previous poses. It predicts multiple sequences of possible future human poses, each from the same input sequence but seeded with a different vector z drawn from a random distribution. Moreover, to quantify the quality of the non-deterministic predictions, we simultaneously train a motion-quality-assessment model that learns the probability that a given skeleton pose sequence is real or fake human motion.
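The multi-future sampling idea can be sketched as follows, assuming a hypothetical conditional generator interface generator(past_poses, z); this illustrates the sampling scheme only and is not the HP-GAN2 implementation.

```python
import torch

def sample_future_poses(generator, past_poses, n_futures=5, z_dim=128):
    """Draw several plausible continuations of the same observed pose sequence.

    past_poses: (1, T_past, n_joints * 3) observed skeleton sequence
    generator : conditional model mapping (past_poses, z) -> (1, T_future, n_joints * 3)
    """
    futures = []
    for _ in range(n_futures):
        z = torch.randn(1, z_dim)                  # a different random seed per sample
        with torch.no_grad():
            futures.append(generator(past_poses, z))
    return torch.cat(futures, dim=0)               # (n_futures, T_future, n_joints * 3)
```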
In order to classify the action performed in a video clip, we take two approaches. In the first approach, we train on a sequence of skeleton poses from scratch, with random parameter initialization, using the same network architecture as the discriminator of the HP-GAN2 model. In the second approach, we take the discriminator of the HP-GAN2 network, extend it with an action classification branch, and fine-tune the end-to-end model on the classification task, since the discriminator in HP-GAN2 has learned to differentiate between fake and real human motion. Our hypothesis is that if the discriminator network can differentiate between synthetic and real skeleton poses, then it has also learned some of the dynamics of real human motion, and that those dynamics are useful for classification as well. We show through multiple experiments that this is indeed the case.
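A rough sketch of the second approach, reusing a pre-trained discriminator as a feature extractor with an added classification branch; the layer sizes, class count, and freezing option are illustrative assumptions.

```python
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Reuse a pre-trained GAN discriminator backbone and add a classification branch."""
    def __init__(self, discriminator_backbone, feat_dim=1024, n_actions=60,
                 freeze_backbone=False):
        super().__init__()
        self.backbone = discriminator_backbone            # pre-trained on real/fake motion
        if freeze_backbone:
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, pose_sequence):
        features = self.backbone(pose_sequence)            # motion features learned adversarially
        return self.classifier(features)                   # action logits
```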
In summary, our model learns to predict multiple future sequences of human poses from the same input sequence. We also show that the discriminator learns a general representation of human motion, by using its learned features in an action recognition task, and we train a motion-quality-assessment network that estimates the probability that a given sequence of poses is valid human motion.
We test our model on two of the largest human pose datasets: NTU RGB+D and Human3.6M. We train on both single and multiple action types. The predictive power of our model for motion estimation is demonstrated by generating multiple plausible futures from the same input and by showing the effect of each of the several loss functions in an ablation study. We also show the advantage of switching from the WGAN-GP formulation used in our previous work to a GAN formulation. Furthermore, we show that, using the features learned by the discriminator, an activity recognition network can be trained in less than half the number of epochs.
Accountable, Explainable Artificial Intelligence Incorporation Framework for a Real-Time Affective State Assessment Module
The rapid growth of artificial intelligence (AI) and machine learning (ML) solutions has seen them adopted across various industries. However, concern over ‘black-box’ approaches has led to increasing demand for high accuracy, transparency, accountability, and explainability in AI/ML approaches. This work contributes an accountable, explainable AI (AXAI) framework for delineating and assessing AI systems. The framework has been incorporated into the development of a real-time, multimodal affective state assessment system.
Generation of realistic human behaviour
As the use of computers and robots in our everyday lives increases, so does the need for better interaction with these devices. Human-computer interaction relies on the ability to understand and generate human behavioural signals such as speech, facial expressions, and motion. This thesis deals with the synthesis and evaluation of such signals, focusing not only on their intelligibility but also on their realism. Since these signals are often correlated, it is common for methods to drive the generation of one signal using another.
The thesis begins by tackling the problem of speech-driven facial animation and proposing models capable of producing realistic animations from a single image and an audio clip. The goal of these models is to produce a video of a target person whose lips move in accordance with the driving audio. Particular focus is also placed on a) generating spontaneous expressions such as blinks, b) achieving audio-visual synchrony, and c) transferring or producing natural head motion.
The second problem addressed in this thesis is that of video-driven speech reconstruction, which aims at converting a silent video into waveforms containing speech. The method proposed for solving this problem is capable of generating intelligible and accurate speech for both seen and unseen speakers. The spoken content is correctly captured thanks to a perceptual loss, which uses features from pre-trained speech-driven animation models. The ability of the video-to-speech model to run in real time allows its use in hearing assistive devices and telecommunications.
The final work proposed in this thesis is a generic domain translation system that can be used for any translation problem, including those mapping across different modalities. The framework is made up of two networks performing translations in opposite directions and can be successfully applied to diverse sets of translation problems, including speech-driven animation and video-driven speech reconstruction.
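The perceptual loss mentioned above can be sketched as a feature-matching objective computed in the space of a frozen pre-trained network; the feature extractor interface and layer weights below are generic assumptions, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def perceptual_loss(generated_audio, target_audio, feature_net,
                    layer_weights=(1.0, 1.0, 1.0)):
    """Compare generated and real speech in the feature space of a frozen pre-trained model.

    feature_net(audio) is assumed to return a list of intermediate activations.
    """
    with torch.no_grad():
        target_feats = feature_net(target_audio)
    gen_feats = feature_net(generated_audio)
    loss = 0.0
    for w, g, t in zip(layer_weights, gen_feats, target_feats):
        loss = loss + w * F.l1_loss(g, t)           # match activations layer by layer
    return loss
```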
Reinforcement Learning for Generative AI: A Survey
Deep generative AI has long been an essential topic in the machine learning
community, with impact on a number of application areas such as text generation
and computer vision. The major paradigm for training a generative model is
maximum likelihood estimation, which pushes the learner to capture and
approximate the target data distribution by decreasing the divergence between
the model distribution and the target distribution. This formulation
successfully establishes the objective of generative tasks, but it cannot
satisfy all the requirements a user might expect from a generative model.
Reinforcement learning offers a competitive way to inject new training signals
by creating objectives beyond likelihood, and it has demonstrated the power and
flexibility to incorporate human inductive bias from multiple angles, such as
adversarial learning, hand-designed rules, and learned reward models. As a
result, reinforcement learning has become a trending research field and has
stretched the limits of generative AI in both model design and application,
making a comprehensive review of recent advances timely. Although surveys of
individual application areas have appeared recently, this survey provides a
high-level review spanning a range of application areas. We present a rigorous
taxonomy of the area and broad coverage of models and applications, notably
including the fast-developing large language model area. We conclude by
outlining potential directions that might address the limitations of current
models and expand the frontiers of generative AI.
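As a concrete illustration of injecting a learned reward signal into generative training, the sketch below shows a REINFORCE-style policy-gradient update for a sequence generator; the generator.sample interface, reward model, and batch-mean baseline are generic assumptions rather than any specific method covered by the survey.

```python
import torch

def policy_gradient_step(generator, reward_model, prompts, optimizer):
    """One REINFORCE-style update: sample sequences, score them, reinforce high-reward ones.

    generator.sample(prompts) is assumed to return (sequences, log_probs), where
    log_probs is the per-sample sum of token log-probabilities (requires grad).
    """
    sequences, log_probs = generator.sample(prompts)       # (B, T), (B,)
    with torch.no_grad():
        rewards = reward_model(sequences)                   # (B,) scalar reward per sample
    advantage = rewards - rewards.mean()                    # simple batch baseline
    loss = -(advantage * log_probs).mean()                  # maximize expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```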
Multi-Sensory Emotion Recognition with Speech and Facial Expression
Emotion plays an important role in people’s daily lives. Understanding emotions and knowing how to react to others’ feelings are fundamental to engaging in successful social interactions. Emotion recognition is not only significant in daily life but also a hot topic in academic research, as new techniques such as emotion recognition from speech give insight into how emotions relate to the content we utter.
The demand for and importance of emotion recognition have increased greatly in recent years across many applications, such as video games, human-computer interaction, cognitive computing, and affective computing. Emotion can be recognized from many sources, including text, speech, hand and body gestures, and facial expressions. At present, most emotion recognition methods use only one of these sources. Human emotion changes from moment to moment, and relying on a single source may not reflect the emotion correctly. This research is motivated by the desire to understand and evaluate human emotion from multiple modalities, such as speech and facial expressions.
This dissertation explores multi-sensory emotion recognition. The proposed framework can recognize emotion from speech, from facial expressions, or from both. The design of the system has three important parts: the facial emotion recognizer, the speech emotion recognizer, and the information fusion. The information fusion part takes the results from the speech emotion recognition and the facial emotion recognition, integrates them using a novel weighted method, and gives a final decision on the emotion after the fusion.
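A minimal sketch of decision-level weighted fusion of the two recognizers' outputs follows; the class set, weights, and normalization are illustrative assumptions and do not reproduce the dissertation's actual weighting scheme.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]   # assumed class set for the example

def weighted_fusion(speech_probs, face_probs, w_speech=0.4, w_face=0.6):
    """Combine per-modality emotion probabilities with modality weights."""
    fused = w_speech * np.asarray(speech_probs) + w_face * np.asarray(face_probs)
    fused /= fused.sum()                          # renormalize to a distribution
    return EMOTIONS[int(fused.argmax())], fused

# Example: speech leans towards "sad", the face towards "neutral".
label, fused = weighted_fusion([0.1, 0.1, 0.3, 0.5], [0.05, 0.15, 0.6, 0.2])
print(label, fused)
```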
The experiments show that with the weighted fusion method, accuracy improves by an average of 3.66% compared to fusion without weights. The improvement in recognition rate reaches 18.27% and 5.66% compared to speech emotion recognition alone and facial expression recognition alone, respectively. By improving emotion recognition accuracy, the proposed multi-sensory emotion recognition system can help improve the naturalness of human-computer interaction.