Quantum Cognitively Motivated Context-Aware Multimodal Representation Learning for Human Language Analysis

Abstract

A long-standing goal in the field of Artificial Intelligence (AI) is to develop systems that can perceive and understand human multimodal language. This requires both the consideration of context in the form of surrounding utterances in a conversation, i.e., context modelling, and the impact of different modalities (e.g., linguistic, visual, acoustic), i.e., multimodal fusion. In the last few years, significant strides have been made towards the interpretation of human language due to simultaneous advances in deep learning, data gathering and computing infrastructure. AI models have been investigated to either model interactions across distinct modalities, i.e., linguistic, visual and acoustic, or model interactions across parties in a conversation, achieving unprecedented levels of performance. However, AI models are often designed with performance as their sole target, leaving aside other essential factors such as transparency, interpretability, and how humans understand and reason about cognitive states. In line with this observation, in this dissertation we develop quantum probabilistic neural models and techniques that allow us to capture rational and irrational cognitive biases without requiring their a priori understanding and identification. First, we present a comprehensive empirical comparison of state-of-the-art (SOTA) modality fusion strategies for video sentiment analysis. The findings provide helpful insights into the development of more effective modality fusion models incorporating quantum-inspired components. Second, we introduce an end-to-end complex-valued neural model for video sentiment analysis, incorporating quantum procedural steps, outside of physics, into the neural network modelling paradigm. Third, we investigate non-classical correlations across different modalities. In particular, we describe a methodology to model interactions between image and text in an information retrieval scenario. The results provide us with theoretical and empirical insights to develop a transparent end-to-end probabilistic neural model for video emotion detection in conversations, capturing non-classical correlations across distinct modalities. Fourth, we introduce a theoretical framework to model users' cognitive states underlying their multimodal decision perspectives, and propose a methodology to capture the interference of modalities in decision making. Overall, we show that our models advance the SOTA on various affective analysis tasks, achieve high transparency owing to their mapping to quantum-physical meanings, and improve post-hoc interpretability, unearthing useful and explainable knowledge about cross-modal interactions.
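
For orientation, the notion of "interference of modalities" invoked above can be illustrated by the standard quantum generalisation of the law of total probability; the following equation is a textbook form given here only as a sketch, not a formula taken from the dissertation. If two modality-conditioned decision paths are assigned complex amplitudes \psi_1 and \psi_2, the probability of the corresponding outcome is

P = |\psi_1 + \psi_2|^2 = |\psi_1|^2 + |\psi_2|^2 + 2\,|\psi_1|\,|\psi_2|\cos\theta,

where \theta is the relative phase between the amplitudes. The classical law of total probability is recovered when the cosine term vanishes, and a non-zero interference term 2|\psi_1||\psi_2|\cos\theta is what allows such models to account for judgements that deviate from classical additivity when modalities are combined.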
