Federated learning (FL) is a decentralized machine learning paradigm that enables multiple clients to collaboratively train a generalized global model without sharing their private data. However, most existing works propose FL systems for single-modal data only, limiting their potential to exploit valuable multimodal data in future personalized applications. Furthermore, the majority of FL approaches still rely on labeled data at the client side, which is often unavailable in practice because users rarely annotate their own data. In light of these limitations, we propose a novel
multimodal FL framework that employs a semi-supervised learning approach to
leverage the representations from different modalities. Bringing this concept into a system, we develop a distillation-based multimodal embedding knowledge transfer mechanism, termed FedMEKT, which allows the server and clients to exchange the joint knowledge of their learning models extracted from a small
multimodal proxy dataset. FedMEKT iteratively updates the generalized global encoders with the joint embedding knowledge contributed by the participating clients.
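To make this exchange concrete, below is a minimal sketch of one such round, assuming PyTorch-style encoder modules; simple averaging stands in for the aggregation of joint embedding knowledge, and all names are illustrative rather than the paper's reference implementation.

```python
# Minimal sketch of one FedMEKT-style round (illustrative, not the
# paper's reference implementation). Each encoder is assumed to be a
# torch.nn.Module mapping proxy inputs to embeddings of the same shape.
import torch

def client_embedding_knowledge(client_encoders, proxy_batch):
    """Each client extracts embedding knowledge from the shared proxy data."""
    with torch.no_grad():
        return [enc(proxy_batch) for enc in client_encoders]

def server_distillation_step(global_encoder, proxy_batch, client_embeddings, lr=1e-3):
    """Distill the aggregated client embeddings into the global encoder."""
    # Averaging is an assumption here; it stands in for the aggregation
    # of joint embedding knowledge across participating clients.
    target = torch.stack(client_embeddings).mean(dim=0)
    optimizer = torch.optim.SGD(global_encoder.parameters(), lr=lr)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(global_encoder(proxy_batch), target)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only proxy-data embeddings travel between server and clients, raw personal data and full model parameters stay local, which is what underpins the privacy and communication-cost claims below.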
To address the modality discrepancy and the labeled-data constraint in existing FL systems, FedMEKT comprises three components: local multimodal autoencoder learning, generalized multimodal autoencoder construction, and generalized classifier learning.
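As an illustration of the local multimodal autoencoder component, the sketch below pairs modality-specific encoders with a shared embedding dimension; the architecture and dimensions are assumptions for exposition, not the paper's exact design.

```python
# Illustrative split multimodal autoencoder: modality-specific encoders map
# each input to a shared embedding space, and decoders reconstruct each
# modality. Layer sizes here are placeholders, not the paper's design.
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    def __init__(self, dim_a=32, dim_b=48, dim_z=16):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, 64), nn.ReLU(), nn.Linear(64, dim_z))
        self.enc_b = nn.Sequential(nn.Linear(dim_b, 64), nn.ReLU(), nn.Linear(64, dim_z))
        self.dec_a = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_a))
        self.dec_b = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_b))

    def forward(self, x_a, x_b):
        z_a, z_b = self.enc_a(x_a), self.enc_b(x_b)
        # Reconstruction from each modality's embedding; an additional
        # alignment loss on (z_a, z_b) would tie the two spaces together.
        return self.dec_a(z_a), self.dec_b(z_b), z_a, z_b
```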
Through extensive experiments on three multimodal human activity recognition datasets, we demonstrate that FedMEKT achieves superior global encoder performance under linear evaluation and preserves the privacy of personal data and model parameters while incurring lower communication cost than the baselines.
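For reference, linear evaluation freezes the learned encoder and fits a linear probe on its embeddings; the sketch below shows this standard protocol, assuming a PyTorch encoder and scikit-learn, with hypothetical data arrays X_train, y_train, X_test, y_test.

```python
# Standard linear-evaluation protocol: freeze the encoder, fit a linear
# classifier on its embeddings, and report test accuracy. Data arrays
# are assumed to be provided by the caller.
import torch
from sklearn.linear_model import LogisticRegression

def linear_evaluation(encoder, X_train, y_train, X_test, y_test):
    encoder.eval()
    with torch.no_grad():
        z_train = encoder(torch.as_tensor(X_train, dtype=torch.float32)).numpy()
        z_test = encoder(torch.as_tensor(X_test, dtype=torch.float32)).numpy()
    probe = LogisticRegression(max_iter=1000).fit(z_train, y_train)
    return probe.score(z_test, y_test)  # accuracy of the frozen-encoder probe
```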