6 research outputs found
Hierarchical denoising representation disentanglement and dual-channel cross-modal-context interaction for multimodal sentiment analysis
Multimodal sentiment analysis aims to extract sentiment cues from various modalities, such as textual, acoustic, and visual data, and integrate them to determine the inherent sentiment polarity of the data. Despite significant achievements in multimodal sentiment analysis, challenges persist in addressing noise features in modal representations, eliminating substantial gaps in sentiment information among modal representations, and exploring contextual information that expresses different sentiments between modalities. To tackle these challenges, our paper proposes a new Multimodal Sentiment Analysis (MSA) framework. Firstly, we introduce the Hierarchical Denoising Representation Disentanglement module (HDRD), which employs hierarchical disentanglement techniques. This ensures the extraction of both common and private sentiment information while eliminating interference noise from modal representations. Furthermore, to address the uneven distribution of sentiment information among modalities, our Inter-Modal Representation Enhancement module (IMRE) enhances non-textual representations by extracting sentiment information related to non-textual representations from textual representations. Next, we introduce a new interaction mechanism, the Dual-Channel Cross-Modal Context Interaction module (DCCMCI). This module not only mines correlated contextual sentiment information within modalities but also explores positively and negatively correlated contextual sentiment information between modalities. We conducted extensive experiments on two benchmark datasets, MOSI and MOSEI, and the results indicate that our proposed method achieves state-of-the-art performance.
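As a rough illustration of the disentanglement idea described above, the sketch below splits one modality's features into a common and a private representation and adds a simple orthogonality penalty; the encoders, penalty, and dimensions are assumptions for illustration, not the paper's HDRD implementation.

```python
# Hypothetical sketch: split a modality's features into common (shared) and
# private representations, with an orthogonality penalty encouraging the two
# subspaces to carry different information. Not the authors' implementation.
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.common = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU(),
                                    nn.Linear(latent_dim, latent_dim))
        self.private = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))

    def forward(self, x: torch.Tensor):
        c, p = self.common(x), self.private(x)
        # Squared dot product between common and private codes, averaged over
        # the batch; one of several possible disentanglement penalties.
        ortho = (c * p).sum(dim=-1).pow(2).mean()
        return c, p, ortho

# Usage: disentangle pooled text features (batch of 8, 768-dim) into 128-dim codes.
text_enc = Disentangler(768, 128)
c_t, p_t, ortho_loss = text_enc(torch.randn(8, 768))
```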
A systematic literature review on incomplete multimodal learning: techniques and challenges
Data availability:
The data that support the findings of this study are available from the corresponding author, R.Y., upon reasonable request.
Recently, machine learning technologies have been successfully applied across various fields. However, most existing machine learning models rely on unimodal data for inference, which hinders their ability to generalize to complex application scenarios. This limitation has driven the development of multimodal learning, a field that integrates information from different modalities to enhance models' capabilities. In practical applications, however, data often suffer from missing or incomplete modalities. Models must therefore remain robust and effectively infer complete information in the presence of missing modalities. The emerging research direction of incomplete multimodal learning (IML) aims to facilitate effective learning from incomplete multimodal training sets, ensuring that models can dynamically and robustly handle new instances with arbitrarily missing modalities during the testing phase. This paper offers a comprehensive review of IML methods. It categorizes existing approaches by their information sources into two main types: methods based on internal information and methods based on external information. These categories are further subdivided into data-based, feature-based, knowledge transfer-based, graph knowledge enhancement-based, and human-in-the-loop-based methods. The paper conducts comparative analyses from two perspectives: comparisons among similar methods and comparisons among different types of methods. Finally, it offers insights into research trends in IML.
This work is supported by the National Natural Science Foundation of China (72401233), the Jiangsu Provincial Qinglan Project, the Natural Science Foundation of Jiangsu Higher Education Institutions of China (23KJB520038), the Suzhou Science and Technology Programme (SYG202106), and the Research Enhancement Fund of XJTLU (REF-23-01-008).
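To make the feature-based category concrete, the following is a minimal, hypothetical sketch in which a missing modality is replaced by a learned placeholder vector before fusion; the class, modality names, and dimensions are illustrative assumptions, not a specific method from the surveyed literature.

```python
# Hypothetical sketch of one feature-based strategy for incomplete multimodal
# learning: substitute a learned per-modality placeholder when that modality
# is absent, then fuse by concatenation. Names and sizes are illustrative.
from typing import Dict, Optional
import torch
import torch.nn as nn

class MissingAwareFusion(nn.Module):
    def __init__(self, dims: Dict[str, int], hidden: int = 64):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        # One learned placeholder per modality, used when that modality is missing.
        self.placeholder = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(hidden)) for m in dims}
        )
        self.head = nn.Linear(hidden * len(dims), 1)

    def forward(self, inputs: Dict[str, Optional[torch.Tensor]]) -> torch.Tensor:
        feats = []
        batch = next(v.shape[0] for v in inputs.values() if v is not None)
        for m, proj in self.proj.items():
            x = inputs.get(m)
            if x is None:
                # Missing modality: broadcast the learned placeholder over the batch.
                feats.append(self.placeholder[m].expand(batch, -1))
            else:
                feats.append(proj(x))
        return self.head(torch.cat(feats, dim=-1))

# Usage: the audio modality is missing for this batch.
model = MissingAwareFusion({"text": 128, "audio": 74, "video": 35})
out = model({"text": torch.randn(4, 128), "audio": None, "video": torch.randn(4, 35)})
```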
Foundations of Multisensory Artificial Intelligence
Building multisensory AI systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas with practical benefits, such as supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. By synthesizing a range of theoretical frameworks and application domains, this thesis aims to advance the machine learning foundations of multisensory AI. In the first part, we present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task. These interactions are the basic building blocks in all multimodal problems, and their quantification enables users to understand their multimodal datasets, design principled approaches to learn these interactions, and analyze whether their model has succeeded in learning. In the second part, we study the design of practical multimodal foundation models that generalize over many modalities and tasks, which presents a step toward grounding large language models in real-world sensory modalities. We introduce MultiBench, a unified large-scale benchmark across a wide range of modalities, tasks, and research areas, followed by the cross-modal attention and multimodal transformer architectures that now underpin many of today's multimodal foundation models. Scaling these architectures on MultiBench enables the creation of general-purpose multisensory AI systems, and we discuss our collaborative efforts in applying these models for real-world impact in affective computing, mental health, cancer prognosis, and robotics. Finally, we conclude this thesis by discussing how future work can leverage these ideas toward more general, interactive, and safe multisensory AI.
Comment: CMU Machine Learning Department PhD Thesis
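The cross-modal attention building block mentioned above can be sketched roughly as follows; this is an illustrative layer under assumed dimensions, not the thesis's actual architecture.

```python
# Schematic cross-modal transformer layer: a target modality's tokens attend
# to a source modality's tokens, followed by a position-wise feed-forward
# block. Dimensions and design details are assumptions for illustration.
import torch
import torch.nn as nn

class CrossModalTransformerLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, ff_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_mult * dim), nn.GELU(), nn.Linear(ff_mult * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # Target tokens (e.g., vision) query source tokens (e.g., language),
        # with residual connections around attention and feed-forward blocks.
        h, _ = self.attn(self.norm1(target), source, source)
        target = target + h
        return target + self.ff(self.norm2(target))

# Usage: a vision sequence attends to a language sequence.
layer = CrossModalTransformerLayer(dim=256)
vision = torch.randn(2, 100, 256)
language = torch.randn(2, 40, 256)
fused = layer(vision, language)  # shape: (2, 100, 256)
```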
Multimodal Variational Autoencoder for Instruction-Based Action Generation
Multimodal Variational Autoencoders (VAEs) are powerful generative models, enabling the fusion of various inputs into a joint latent representation or the generation of one modality from another. Although several variants of multimodal VAEs have been proposed, their evaluation in practical real-world applications with heterogeneous or sequential data has been limited. This dissertation first systematically evaluates and compares the existing methods on a new benchmarking toolkit and dataset to define their strengths and limitations. Next, several adjustments to the state-of-the-art VAEs are explored, such as automatic hyperparameter tuning, incremental few-shot learning, and training with a newly proposed transformer-based modality fusion approach. Finally, a collection of simulated robotic datasets comprising natural language instructions, images, and robotic actions is created and used to evaluate both the current methods and the proposed adjustments. Overall, the work contributes to the advancement of multimodal VAEs in handling complex sequential data and opens up new possibilities for their use in practical real-world applications.
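As a rough sketch of how two modalities can be fused into a joint latent representation in a multimodal VAE, the snippet below uses product-of-experts (PoE) fusion of unimodal Gaussian posteriors; the encoders, sizes, and the PoE choice are assumptions for illustration, not the dissertation's models.

```python
# Hypothetical two-modality VAE encoder with product-of-experts fusion of the
# unimodal Gaussian posteriors (plus a standard-normal prior expert).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim: int, z_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)

    def forward(self, x: torch.Tensor):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def poe(mus, logvars):
    # Product of Gaussian experts: precisions add, means are precision-weighted.
    prec = [torch.ones_like(mus[0])] + [torch.exp(-lv) for lv in logvars]
    mu = [torch.zeros_like(mus[0])] + list(mus)
    joint_prec = sum(prec)
    joint_mu = sum(m * p for m, p in zip(mu, prec)) / joint_prec
    return joint_mu, torch.log(1.0 / joint_prec)

# Usage: fuse an instruction embedding and an image embedding into one latent.
enc_text, enc_img = Encoder(300, 32), Encoder(512, 32)
mu_t, lv_t = enc_text(torch.randn(4, 300))
mu_i, lv_i = enc_img(torch.randn(4, 512))
z_mu, z_logvar = poe([mu_t, mu_i], [lv_t, lv_i])
z = z_mu + torch.randn_like(z_mu) * torch.exp(0.5 * z_logvar)  # reparameterization
```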