82 research outputs found

    Emotion-aware cross-modal domain adaptation in video sequences


    Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives

    Over the past few years, adversarial training has become an extremely active research topic and has been successfully applied to various Artificial Intelligence (AI) domains. Because it is a potentially crucial technique for developing the next generation of emotional AI systems, we herein provide a comprehensive overview of the application of adversarial training to affective computing and sentiment analysis. Representative adversarial training algorithms are explained and discussed, each aimed at tackling challenges associated with emotional AI systems. Further, we highlight a range of potential future research directions. We expect this overview to help facilitate the development of adversarial training for affective computing and sentiment analysis in both the academic and industrial communities.
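    As a concrete reference point for the family of methods the survey covers, the sketch below shows one of the simplest adversarial training variants: an FGSM-style step that perturbs the input features in the direction that increases the loss and then trains on both clean and perturbed inputs. It is a minimal illustration in PyTorch; the model, optimizer, and epsilon value are assumptions, not taken from the survey.

    # Minimal FGSM-style adversarial training step (illustrative sketch, PyTorch).
    # The model, optimizer, and epsilon are assumptions, not from the survey.
    import torch
    import torch.nn.functional as F

    def adversarial_training_step(model, optimizer, features, labels, epsilon=0.01):
        """One step on clean inputs plus adversarially perturbed copies of them."""
        features = features.clone().detach().requires_grad_(True)

        # Forward/backward on clean inputs to obtain gradients w.r.t. the inputs.
        clean_loss = F.cross_entropy(model(features), labels)
        input_grad = torch.autograd.grad(clean_loss, features)[0]

        # FGSM: move each input a small step in the sign of its gradient.
        adv_features = (features + epsilon * input_grad.sign()).detach()

        # Optimize on the combined clean and adversarial loss.
        optimizer.zero_grad()
        loss = (F.cross_entropy(model(features.detach()), labels)
                + F.cross_entropy(model(adv_features), labels))
        loss.backward()
        optimizer.step()
        return loss.item()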

    KNOT: Knowledge Distillation using Optimal Transport for Solving NLP Tasks

    We propose a new approach, Knowledge Distillation using Optimal Transport (KNOT), to distill natural language semantic knowledge from multiple teacher networks to a student network. KNOT trains a (global) student model to minimize the optimal transport cost between its predicted probability distribution over the labels and the weighted sum of the probabilities predicted by the (local) teacher models, under the constraint that the student model does not have access to the teacher models' parameters or training data. To evaluate the quality of knowledge transfer, we introduce a new metric, Semantic Distance (SD), which measures the semantic closeness between the predicted and ground-truth label distributions. The proposed method improves the global model's SD performance over the baseline across three NLP tasks, while performing on par with Entropy-based distillation on standard accuracy and F1 metrics. The implementation pertaining to this work is publicly available at: https://github.com/declare-lab/KNOT. (COLING 2022)
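    The objective described above can be illustrated with a short sketch: the student's softmax distribution over labels is compared with a weighted mixture of the teachers' predicted distributions using an entropic-regularized (Sinkhorn) approximation of the optimal transport cost. This is not the authors' implementation (see the linked repository); the Sinkhorn routine, the label-to-label cost matrix, and all names below are assumptions for illustration.

    # Sketch of an optimal-transport distillation loss in the spirit of KNOT.
    # Not the authors' code; cost matrix, Sinkhorn settings, and names are assumed.
    import torch

    def sinkhorn_ot_cost(p, q, cost, epsilon=0.1, n_iters=50):
        """Entropic-regularized OT cost between two label distributions of shape [C]."""
        K = torch.exp(-cost / epsilon)              # Gibbs kernel from the label cost matrix
        u = torch.ones_like(p)
        for _ in range(n_iters):                    # Sinkhorn fixed-point iterations
            v = q / (K.t() @ u + 1e-9)
            u = p / (K @ v + 1e-9)
        transport_plan = torch.diag(u) @ K @ torch.diag(v)
        return (transport_plan * cost).sum()

    def knot_style_loss(student_logits, teacher_probs, teacher_weights, cost):
        """OT distance from the student's distribution to the weighted teacher mixture."""
        student_p = torch.softmax(student_logits, dim=-1)
        mixture_q = sum(w * p for w, p in zip(teacher_weights, teacher_probs))
        mixture_q = mixture_q / mixture_q.sum()     # renormalize the weighted mixture
        return sinkhorn_ot_cost(student_p, mixture_q, cost)

    A natural (assumed) choice for the cost matrix is a table of semantic distances between label embeddings, which matches the intuition behind the Semantic Distance metric mentioned in the abstract.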

    Leveraging audio-visual speech effectively via deep learning

    The rising popularity of neural networks, combined with the recent proliferation of online audio-visual media, has led to a revolution in the way machines encode, recognize, and generate acoustic and visual speech. Despite the ubiquity of naturally paired audio-visual data, only a limited number of works have applied recent advances in deep learning to leverage the duality between audio and video within this domain. This thesis considers the use of neural networks to learn from large unlabelled datasets of audio-visual speech to enable new practical applications. We begin by training a visual speech encoder that predicts latent features extracted from the corresponding audio on a large unlabelled audio-visual corpus. We apply the trained visual encoder to improve performance on lip reading in real-world scenarios. Following this, we extend the idea of learning video representations from audio by training a model to synthesize raw speech directly from raw video, without the need for text transcriptions. Remarkably, we find that this framework is capable of reconstructing intelligible audio from videos of new, previously unseen speakers. We also experiment with a separate speech reconstruction framework, which leverages recent advances in sequence modeling and spectrogram inversion to improve the realism of the generated speech. We then apply our research in video-to-speech synthesis to advance the state of the art in audio-visual speech enhancement, proposing a new vocoder-based model that performs particularly well in extremely noisy scenarios. Lastly, we aim to fully realize the potential of paired audio-visual data by proposing two novel frameworks that leverage acoustic and visual speech to train two encoders that learn from each other simultaneously. We leverage these pre-trained encoders for deepfake detection, speech recognition, and lip reading, and find that they consistently yield improvements over training from scratch.
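    The first contribution described above, a visual speech encoder trained to predict audio-derived latent features from unlabelled paired data, follows a cross-modal self-supervision pattern that can be sketched roughly as follows. The encoder interfaces, the frozen audio encoder, and the choice of an L1 regression loss are assumptions for illustration rather than the thesis' exact setup.

    # Rough sketch of cross-modal pretraining: a visual encoder regresses the latent
    # features of a frozen audio encoder on paired, unlabelled audio-visual data.
    # Module names and the loss choice are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def visual_pretraining_step(visual_encoder, audio_encoder, optimizer, video, audio):
        """One self-supervised step: video features should match the audio features."""
        with torch.no_grad():
            target = audio_encoder(audio)      # frozen audio encoder provides the targets
        pred = visual_encoder(video)           # visual encoder learns to predict them
        loss = F.l1_loss(pred, target)         # regression loss on the latent features
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()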

    Accountable, Explainable Artificial Intelligence Incorporation Framework for a Real-Time Affective State Assessment Module

    The rapid growth of artificial intelligence (AI) and machine learning (ML) solutions has seen them adopted across various industries. However, concern over ‘black-box’ approaches has led to increasing demand for high accuracy, transparency, accountability, and explainability in AI/ML approaches. This work contributes an accountable, explainable AI (AXAI) framework for delineating and assessing AI systems. The framework has been incorporated into the development of a real-time, multimodal affective state assessment system.

    Generation of realistic human behaviour

    As the use of computers and robots in our everyday lives increases, so does the need for better interaction with these devices. Human-computer interaction relies on the ability to understand and generate human behavioural signals such as speech, facial expressions, and motion. This thesis deals with the synthesis and evaluation of such signals, focusing not only on their intelligibility but also on their realism. Since these signals are often correlated, it is common for methods to drive the generation of one signal using another. The thesis begins by tackling the problem of speech-driven facial animation and proposing models capable of producing realistic animations from a single image and an audio clip. The goal of these models is to produce a video of a target person whose lips move in accordance with the driving audio. Particular focus is also placed on a) generating spontaneous expressions such as blinks, b) achieving audio-visual synchrony, and c) transferring or producing natural head motion. The second problem addressed in this thesis is that of video-driven speech reconstruction, which aims at converting a silent video into waveforms containing speech. The method proposed for solving this problem is capable of generating intelligible and accurate speech for both seen and unseen speakers. The spoken content is correctly captured thanks to a perceptual loss, which uses features from pre-trained speech-driven animation models. The ability of the video-to-speech model to run in real time allows its use in hearing assistive devices and telecommunications. The final work proposed in this thesis is a generic domain translation system that can be used for any translation problem, including those mapping across different modalities. The framework is made up of two networks performing translations in opposite directions and can be successfully applied to diverse sets of translation problems, including speech-driven animation and video-driven speech reconstruction.
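    The perceptual loss mentioned above, which compares generated and reference speech in the feature space of a pre-trained network rather than sample by sample, can be sketched as follows. The feature extractor interface and the use of an L1 distance are assumptions for illustration, not the thesis' exact formulation.

    # Sketch of a perceptual (feature-space) loss for video-to-speech synthesis.
    # The frozen feature_extractor stands in for a pre-trained speech model; its
    # interface and the L1 distance are assumptions.
    import torch
    import torch.nn.functional as F

    def perceptual_loss(feature_extractor, generated_audio, reference_audio):
        """L1 distance between deep features of generated and reference waveforms."""
        with torch.no_grad():
            ref_feats = feature_extractor(reference_audio)   # targets from a frozen network
        gen_feats = feature_extractor(generated_audio)       # gradients flow to the generator
        return F.l1_loss(gen_feats, ref_feats)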

    Reinforcement Learning for Generative AI: A Survey

    Deep Generative AI has been a long-standing essential topic in the machine learning community, with impact on a number of application areas such as text generation and computer vision. The major paradigm for training a generative model is maximum likelihood estimation, which pushes the learner to capture and approximate the target data distribution by decreasing the divergence between the model distribution and the target distribution. This formulation successfully establishes the objective of generative tasks, but it cannot satisfy all the requirements that a user might expect from a generative model. Reinforcement learning, which injects new training signals by creating new objectives, has demonstrated its power and flexibility to incorporate human inductive bias from multiple angles, such as adversarial learning, hand-designed rules, and learned reward models, to build performant models. As a result, reinforcement learning has become a trending research field and has stretched the limits of generative AI in both model design and application. It is therefore timely to summarize recent advances with a comprehensive review. Although there have been recent surveys of individual application areas, this survey aims to provide a high-level review that spans a range of application areas. We provide a rigorous taxonomy of the area and broad coverage of various models and applications. Notably, we also survey the fast-developing large language model area. We conclude this survey by outlining potential directions that might tackle the limitations of current models and expand the frontiers of generative AI.
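    A minimal instance of the reinforcement learning paradigm described above is a REINFORCE-style fine-tuning step in which a generative model is updated against a learned reward instead of a pure maximum likelihood objective. The sketch below assumes a sample_with_log_probs helper on the model, a separate reward model, and a simple mean baseline; these are illustrative assumptions, not any particular method from the survey.

    # REINFORCE-style sketch: fine-tune a generator against a learned reward model.
    # The sample_with_log_probs helper, reward_model, and baseline are assumptions.
    import torch

    def rl_finetune_step(model, reward_model, optimizer, prompts):
        """Sample outputs, score them with a reward model, take a policy-gradient step."""
        # Assumed helper: returns samples and per-token log-probs of shape [batch, seq_len].
        samples, log_probs = model.sample_with_log_probs(prompts)
        with torch.no_grad():
            rewards = reward_model(prompts, samples)   # learned reward signal, shape [batch]
        baseline = rewards.mean()                      # simple variance-reduction baseline
        loss = -((rewards - baseline) * log_probs.sum(dim=-1)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return rewards.mean().item()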

    Multi-Sensory Emotion Recognition with Speech and Facial Expression

    Emotion plays an important role in human beings’ daily lives. Understanding emotions and recognizing how to react to others’ feelings are fundamental to engaging in successful social interactions. Emotion recognition is not only significant in daily life but also a hot topic in academic research, as new techniques such as emotion recognition from speech context reveal how emotions are related to the content we are uttering. The demand for and importance of emotion recognition have increased greatly in recent years in many applications, such as video games, human-computer interaction, cognitive computing, and affective computing. Emotion recognition can be performed from many sources, including text, speech, hand and body gestures, as well as facial expressions. Presently, most emotion recognition methods use only one of these sources. Human emotions change every second, and relying on a single source may not reflect them correctly. This research is motivated by the desire to understand and evaluate human emotions from multiple modalities, such as speech and facial expressions. In this dissertation, multi-sensory emotion recognition is explored. The proposed framework can recognize emotion from speech, from facial expression, or from both. There are three important parts in the design of the system: the facial emotion recognizer, the speech emotion recognizer, and the information fusion. The information fusion part takes the results from the speech and facial emotion recognizers, integrates them with a novel weighted method, and produces the final emotion decision. The experiments show that the weighted fusion method improves accuracy by an average of 3.66% compared with fusion without weighting. The improvement in recognition rate reaches 18.27% and 5.66% compared with speech emotion recognition and facial expression recognition alone, respectively. By improving emotion recognition accuracy, the proposed multi-sensory emotion recognition system can help to improve the naturalness of human-computer interaction.
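    The weighted decision-level fusion described above can be sketched as follows: per-class emotion probabilities from the speech and facial recognizers are combined with modality weights before the final decision is taken. The specific weights and the assumption that both recognizers output probabilities over the same set of emotion classes are illustrative, not the dissertation's tuned values.

    # Sketch of weighted decision-level fusion of two emotion recognizers.
    # The weights and the shared class ordering are illustrative assumptions.
    import numpy as np

    def weighted_fusion(speech_probs, face_probs, w_speech=0.4, w_face=0.6):
        """Combine per-class emotion probabilities from two modalities and decide."""
        fused = w_speech * np.asarray(speech_probs) + w_face * np.asarray(face_probs)
        fused = fused / fused.sum()           # renormalize the fused distribution
        return int(np.argmax(fused)), fused   # predicted class index and fused scores

    # Example: speech leans towards class 0, the face towards class 2; fusion decides.
    label, scores = weighted_fusion([0.6, 0.3, 0.1], [0.2, 0.2, 0.6])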
