Multimodal data, which enables comprehensive perception and recognition of the
physical world, has become an essential path towards general artificial
intelligence. However, multimodal large models trained on public datasets often
underperform in specific industrial domains. This paper proposes a multimodal
federated learning framework that enables multiple enterprises to utilize
private domain data to collaboratively train large models for vertical domains,
achieving intelligent services across scenarios. The authors discuss in depth
the strategic transformation of federated learning in terms of its intelligence
foundation and objectives in the era of large models, as well as the new
challenges it faces in heterogeneous data, model aggregation, performance-cost
trade-offs, data privacy, and incentive mechanisms. The paper elaborates on a case
study of leading enterprises contributing multimodal data and expert knowledge
to city safety operation management, including the distributed deployment and
efficient coordination of the federated learning platform, technical
innovations in data quality improvement based on large model capabilities, and
efficient joint fine-tuning approaches. Preliminary experiments show that
enterprises can enhance and accumulate intelligent capabilities through
multimodal model federated learning, thereby jointly creating a smart city
model that provides high-quality intelligent services covering energy
infrastructure safety, residential community security, and urban operation
management. The established federated learning cooperation ecosystem is
expected to further aggregate industry, academia, and research resources,
realize large models in multiple vertical domains, and promote the large-scale
industrial application of artificial intelligence and cutting-edge research on
multimodal federated learning.