Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification, covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.
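As a concrete illustration of the representation challenge named in this taxonomy (summarizing heterogeneous modality data in a joint space), here is a minimal PyTorch sketch; it is not drawn from the paper itself, and the encoder dimensions, the projection-plus-concatenation fusion strategy, and the class name LateFusion are all illustrative assumptions.

```python
# Illustrative sketch (not from the paper): fusing heterogeneous modality
# embeddings into a single joint representation. Dimensions are assumptions.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, joint_dim=512):
        super().__init__()
        # Project each modality into a shared space, then fuse by concatenation.
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.image_proj = nn.Linear(image_dim, joint_dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * joint_dim, joint_dim),
            nn.ReLU(),
        )

    def forward(self, text_emb, image_emb):
        t = self.text_proj(text_emb)    # (batch, joint_dim)
        v = self.image_proj(image_emb)  # (batch, joint_dim)
        return self.fusion(torch.cat([t, v], dim=-1))

# Random features stand in for real encoder outputs.
model = LateFusion()
joint = model(torch.randn(4, 768), torch.randn(4, 2048))
print(joint.shape)  # torch.Size([4, 512])
```

Projecting each modality to a shared width before concatenation keeps the fusion layer's input size independent of the raw encoder widths, which is one simple way to cope with modality heterogeneity.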
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
With the urgent demand for generalized deep models, many large pre-trained
models have been proposed, such as BERT, ViT, and GPT. Inspired by the success
of these models in single domains (such as computer vision and natural language
processing), multi-modal pre-trained big models have drawn increasing attention
in recent years. In this work, we give a comprehensive survey of these models,
hoping to provide new insights and help new researchers track the most
cutting-edge work. Specifically, we first introduce the background of
multi-modal pre-training by reviewing conventional deep learning and
pre-training work in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key
challenges, and advantages of multi-modal pre-training models (MM-PTMs), and
discuss the MM-PTMs with a focus on data, objectives, network architectures,
and knowledge enhanced pre-training. After that, we introduce the downstream
tasks used for the validation of large-scale MM-PTMs, including generative,
classification, and regression tasks. We also give visualization and analysis
of the model parameters and results on representative downstream tasks.
Finally, we point out possible research directions for this topic that may
benefit future work. In addition, we maintain a continuously updated paper
list for large-scale pre-trained multi-modal big models:
https://github.com/wangxiao5791509/MultiModal_BigModels_Survey
Comment: Accepted by Machine Intelligence Research
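To make the "objectives" dimension of MM-PTM pre-training concrete, below is a minimal sketch of one widely used multi-modal objective: a symmetric contrastive (CLIP-style) loss over paired image and text embeddings. It is an illustrative assumption rather than the objective of any particular surveyed model; the function name and temperature value are hypothetical.

```python
# Illustrative sketch of a symmetric contrastive (CLIP-style) pre-training
# objective over paired image/text embeddings. Names and values are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarities; matched image-text pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Random embeddings stand in for real encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Because matched pairs sit on the diagonal of the similarity matrix, the objective reduces to cross-entropy classification in both retrieval directions, pulling paired embeddings together and pushing mismatched ones apart.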