Multimodal learning has achieved great success in mining features from multiple data modalities, yielding remarkable improvements in model performance. Meanwhile, federated learning (FL) addresses the data-sharing problem, enabling privacy-preserving collaborative training that provides access to sufficient valuable data. Their confluence, known as multimodal federated learning, therefore holds great potential. However, predominant approaches are limited in that they often assume each local dataset records samples from all modalities.
In this paper, we aim to bridge this gap by proposing a Unimodal Training - Multimodal Prediction (UTMP) framework in the context of multimodal federated learning. We design HA-Fedformer, a novel transformer-based model that enables unimodal training with only a unimodal dataset at each client and multimodal testing by aggregating multiple clients' knowledge for better accuracy. The key advantages are twofold.
First, to alleviate the impact of non-IID data, we develop an uncertainty-aware aggregation method for the local encoders with layer-wise Markov Chain Monte Carlo sampling. Second, to overcome the challenge of unaligned language sequences, we implement cross-modal decoder aggregation to capture the hidden signal correlations between decoders trained on data from different modalities. Our experiments on the popular sentiment analysis benchmarks CMU-MOSI and CMU-MOSEI demonstrate that HA-Fedformer significantly outperforms state-of-the-art multimodal models under the UTMP federated learning framework, with 15%-20% improvements on most attributes.
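
As a minimal, hedged sketch of the layer-wise, uncertainty-aware aggregation idea described above (not the paper's exact algorithm), the Python snippet below averages each encoder layer across clients with weights inversely proportional to that layer's posterior variance; the posterior samples are assumed to be produced elsewhere (e.g., by a Markov Chain Monte Carlo procedure), and all names such as aggregate_layer and client_layer_samples are illustrative assumptions.

```python
# Illustrative sketch only: layer-wise, uncertainty-weighted aggregation of
# client encoder parameters. Posterior samples per layer are assumed to be
# produced elsewhere (e.g., by MCMC sampling); here they are plain arrays.
import numpy as np

def aggregate_layer(client_layer_samples):
    """client_layer_samples: one entry per client, each a list of parameter
    samples (np.ndarray of identical shape) for a single encoder layer."""
    means, weights = [], []
    for samples in client_layer_samples:
        stacked = np.stack(samples)            # (num_samples, *layer_shape)
        means.append(stacked.mean(axis=0))     # posterior mean of the layer
        # Higher posterior variance -> less trust in this client's estimate.
        weights.append(1.0 / (stacked.var(axis=0).mean() + 1e-8))
    weights = np.asarray(weights)
    weights /= weights.sum()                   # normalize across clients
    return sum(w * m for w, m in zip(weights, means))

def aggregate_encoders(per_client_layer_samples):
    """per_client_layer_samples: dict mapping layer name -> per-client sample
    lists; returns one aggregated tensor per layer for the global encoder."""
    return {name: aggregate_layer(samples)
            for name, samples in per_client_layer_samples.items()}
```

The sketch reflects the stated intuition that clients whose layer estimates are more uncertain (for example, due to non-IID local data) should contribute less to the shared encoder.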