What Makes Multi-modal Learning Better than Single (Provably)
The world provides us with data of multiple modalities. Intuitively, models
that fuse data from different modalities should outperform their uni-modal
counterparts, since more information is aggregated. Recently, building on the
success of deep learning, an influential line of work on deep multi-modal
learning has achieved remarkable empirical results on various applications.
However, theoretical justifications in this field are notably lacking.
Can multi-modal learning provably perform better than uni-modal learning?
In this paper, we answer this question under one of the most popular
multi-modal fusion frameworks, which first encodes features from different
modalities into a common latent space and then maps the latent representations
into the task space. We prove that learning with multiple modalities achieves a
smaller population risk than learning with only a subset of those modalities.
The main intuition is that the former yields a more accurate estimate of the
latent space representation. To the best of our knowledge, this is the first
theoretical treatment to capture important qualitative phenomena observed in
real multi-modal applications from the generalization perspective.
Combined with experimental results, we show that multi-modal learning does
possess an appealing formal guarantee.

Comment: Accepted to NeurIPS 2021
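To make the fusion framework concrete, the following is a minimal sketch in Python (PyTorch). The linear encoders, fusion by averaging, and all dimensions are illustrative assumptions, not the paper's exact construction: each modality is encoded into a common latent space, and a shared head maps the fused latent representation into the task space.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Late-fusion sketch: per-modality encoders into a shared latent space,
    followed by a task head mapping the latent representation to the output.
    Illustrative only; not the paper's exact construction."""

    def __init__(self, input_dims, latent_dim, output_dim):
        super().__init__()
        # One encoder per modality, all mapping into the common latent space.
        self.encoders = nn.ModuleList(
            nn.Linear(d, latent_dim) for d in input_dims
        )
        # Task mapping from the latent space into the task (output) space.
        self.task_head = nn.Linear(latent_dim, output_dim)

    def forward(self, inputs):
        # Encode each modality, then fuse by averaging (an assumed fusion rule).
        latents = [enc(x) for enc, x in zip(self.encoders, inputs)]
        fused = torch.stack(latents, dim=0).mean(dim=0)
        return self.task_head(fused)

# Example: two modalities, e.g. 512-dim image features and 300-dim text features.
model = MultiModalFusion(input_dims=[512, 300], latent_dim=64, output_dim=10)
x_img = torch.randn(8, 512)
x_txt = torch.randn(8, 300)
logits = model([x_img, x_txt])  # shape: (8, 10)
```

Restricting the model to a subset of the encoders corresponds to the uni-modal (or fewer-modality) baseline that the paper's population-risk comparison is about.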