What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?
With the growing success of multi-modal learning, research on the robustness
of multi-modal models, especially when facing situations with missing
modalities, is receiving increased attention. Nevertheless, previous studies in
this domain exhibit certain limitations, as they often lack theoretical
insights or their methodologies are tied to specific network architectures or
modalities. We model the scenarios of multi-modal models encountering missing
modalities from an information-theoretic perspective and illustrate that the
performance ceiling in such scenarios can be approached by efficiently
utilizing the information inherent in non-missing modalities. In practice,
there are two key aspects: (1) The encoder should be able to extract
sufficiently good features from the non-missing modality; (2) The extracted
features should be robust enough not to be influenced by noise during the
fusion process across modalities. To this end, we introduce Uni-Modal Ensemble
with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal
pre-trained weights for the multi-modal model to enhance feature extraction and
utilizes missing modality data augmentation techniques to better adapt to
situations with missing modalities. Apart from that, UME-MMA, built on a
late-fusion learning framework, allows for the plug-and-play use of various
encoders, making it suitable for a wide range of modalities and enabling
seamless integration of large-scale pre-trained encoders to further enhance
performance. We demonstrate UME-MMA's effectiveness on audio-visual
datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language
datasets (e.g., MM-IMDB, UPMC Food101).
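As a hedged illustration of the two ingredients above, the sketch below shows one plausible form of missing-modality data augmentation in a late-fusion setup: during training, one modality's features are occasionally zeroed out so the fusion head learns not to depend on any single modality. The function names, the zeroing strategy, and the drop probability are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def drop_modality(audio_feat, visual_feat, p_drop=0.3, rng=None):
    """Missing-modality augmentation (sketch): with probability p_drop,
    zero out one randomly chosen modality so the fusion head learns to
    rely on whichever modality is still present."""
    if rng is None:
        rng = np.random.default_rng()
    audio_feat, visual_feat = audio_feat.copy(), visual_feat.copy()
    if rng.random() < p_drop:
        if rng.random() < 0.5:
            audio_feat[:] = 0.0   # simulate missing audio
        else:
            visual_feat[:] = 0.0  # simulate missing video
    return audio_feat, visual_feat

def late_fusion_logits(audio_feat, visual_feat, W_audio, W_visual):
    """Late fusion (sketch): independent per-modality linear heads with
    summed logits, so a zeroed modality contributes nothing instead of
    injecting noise into the fused prediction."""
    return audio_feat @ W_audio + visual_feat @ W_visual
```

Because the heads are modality-specific, any pre-trained encoder can feed either branch, which is the plug-and-play property the abstract describes.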
ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory
Large language models (LLMs) with memory are computationally universal.
However, mainstream LLMs do not take full advantage of memory, and existing
memory designs are heavily influenced by biological brains. Because of their
approximate nature and proneness to error accumulation, conventional neural
memory mechanisms cannot support LLMs in simulating complex reasoning. In this paper, we
seek inspiration from modern computer architectures to augment LLMs with
symbolic memory for complex multi-hop reasoning. Such a symbolic memory
framework is instantiated as an LLM and a set of SQL databases, where the LLM
generates SQL instructions to manipulate the SQL databases. We validate the
effectiveness of the proposed memory framework on a synthetic dataset requiring
complex reasoning. The project website is available at
https://chatdatabase.github.io/
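A minimal sketch of the symbolic-memory idea, using Python's built-in sqlite3: facts live in a SQL database and are manipulated only through SQL statements, so state stays exact across many updates. In the real system the SQL strings would be generated by the LLM; here they are hard-coded, and the table schema and values are invented for illustration.

```python
import sqlite3

class SymbolicMemory:
    """Toy symbolic memory (sketch): an in-memory SQL database that is
    read and written exclusively via SQL, mirroring how ChatDB has the
    LLM emit SQL instead of relying on approximate neural memory."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE orders (customer TEXT, item TEXT, qty INTEGER)")

    def execute(self, sql, params=()):
        cur = self.conn.execute(sql, params)
        self.conn.commit()
        return cur.fetchall()

mem = SymbolicMemory()
# In ChatDB, each of these SQL instructions would come from the LLM.
mem.execute("INSERT INTO orders VALUES (?, ?, ?)", ("alice", "apple", 3))
mem.execute("INSERT INTO orders VALUES (?, ?, ?)", ("alice", "pear", 2))
mem.execute("UPDATE orders SET qty = qty - 1 "
            "WHERE customer = 'alice' AND item = 'apple'")
# Multi-hop query over the stored facts: exact, with no error accumulation.
total = mem.execute(
    "SELECT SUM(qty) FROM orders WHERE customer = 'alice'")[0][0]  # → 4
```

The point of the sketch is the contrast with neural memory: every intermediate update is stored symbolically, so a long chain of reads and writes stays exact.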
On Uni-Modal Feature Learning in Supervised Multi-Modal Learning
We abstract the features (i.e., learned representations) of multi-modal data
into 1) uni-modal features, which can be learned from uni-modal training, and
2) paired features, which can only be learned from cross-modal interactions.
Multi-modal models are expected to benefit from cross-modal interactions on the
basis of ensuring uni-modal feature learning. However, recent supervised
multi-modal late-fusion training approaches still suffer from insufficient
learning of uni-modal features on each modality. We prove that this phenomenon
hurts the model's generalization ability. To address this, we propose choosing
a targeted late-fusion learning method for a given supervised multi-modal
task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT),
according to the distribution of uni-modal and paired features. We demonstrate
that, under a simple guiding strategy, we can achieve comparable results to
other complex late-fusion or intermediate-fusion methods on various multi-modal
datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
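A minimal sketch of the Uni-Modal Ensemble side of this choice, assuming late fusion by averaging the logits of independently trained uni-modal classifiers; the logit values below are made up for illustration, not taken from any of the cited datasets.

```python
import numpy as np

def uni_modal_ensemble(logits_per_modality):
    """UME (sketch): each modality's classifier is trained entirely on
    its own, then their logits are averaged at inference, so no joint
    training step can under-train either modality's features."""
    return np.mean(logits_per_modality, axis=0)

# Toy logits for one sample over three classes (illustrative values).
audio_logits = np.array([[2.0, 0.5, -1.0]])
video_logits = np.array([[0.5, 1.5, -0.5]])
fused = uni_modal_ensemble([audio_logits, video_logits])
pred = fused.argmax(axis=1)  # → class 0
```

UMT, by contrast, keeps a jointly trained fusion model but distills each uni-modal teacher's features into it; the abstract's guiding strategy picks between the two based on how the uni-modal and paired features are distributed.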
Intrinsically Motivated Self-supervised Learning in Reinforcement Learning
In vision-based reinforcement learning (RL) tasks, it is common to assign
auxiliary tasks with a surrogate self-supervised loss so as to obtain more
semantic representations and improve sample efficiency. However, abundant
information in self-supervised auxiliary tasks has been disregarded, since the
representation learning part and the decision-making part are separated. To
sufficiently utilize information in auxiliary tasks, we present a simple yet
effective idea to employ self-supervised loss as an intrinsic reward, called
Intrinsically Motivated Self-Supervised learning in Reinforcement learning
(IM-SSR). We formally show that the self-supervised loss can be decomposed
into exploration of novel states and robustness improvement via nuisance
elimination. IM-SSR can be effortlessly plugged into any reinforcement
learning algorithm with self-supervised auxiliary objectives at nearly no
additional cost.
Combined with IM-SSR, the underlying base algorithms achieve salient
improvements in both sample efficiency and generalization on various
vision-based robotics tasks from the DeepMind Control Suite, especially when
the reward signal is sparse.
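The core mechanism admits a very short sketch: the self-supervised auxiliary loss is reused as an intrinsic reward added to the environment's reward. The function name and the scaling coefficient `beta` are illustrative assumptions, not values from the paper.

```python
def augmented_reward(extrinsic_reward, ssl_loss, beta=0.1):
    """IM-SSR (sketch): reuse the self-supervised auxiliary loss as an
    intrinsic reward. A high SSL loss flags a novel or poorly modeled
    observation, so adding it as a bonus encourages exploration; beta
    (hypothetical here) trades off the intrinsic and extrinsic terms."""
    return extrinsic_reward + beta * ssl_loss

# With a sparse extrinsic signal (reward 0 almost everywhere), the SSL
# term still gives the agent a dense learning signal.
r = augmented_reward(extrinsic_reward=0.0, ssl_loss=2.5)  # → 0.25
```

Because the SSL loss is already computed for the auxiliary task, this adds essentially no extra cost, which matches the abstract's plug-and-play claim.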