Fusion is a technique for merging multiple independently-trained neural
networks in order to combine their capabilities. Past attempts have been
restricted to the case of fully-connected, convolutional, and residual
networks. This paper presents a systematic approach for fusing two or more
transformer-based networks that exploits Optimal Transport to (soft-)align the
various architectural components. We flesh out an abstraction for layer
alignment that can, in principle, generalize to arbitrary architectures, and
we apply it to the key ingredients of Transformers, such as multi-head
self-attention, layer normalization, and residual connections, discussing how
to handle each of them via ablation studies. Furthermore, our method allows
the fusion of models of different sizes (heterogeneous fusion), providing a new
and efficient way to compress Transformers. The proposed approach is evaluated
on both image classification tasks via Vision Transformer and natural language
modeling tasks using BERT. Our approach consistently outperforms vanilla
fusion and, after surprisingly short finetuning, also outperforms the
individual converged parent models. In our analysis, we uncover intriguing
insights about the significant role of soft alignment in the case of
Transformers. Our results showcase the potential of fusing multiple
Transformers, thus compounding their expertise, in the budding paradigm of
model fusion and recombination. Code is available at
https://github.com/graldij/transformer-fusion.

Comment: Appears at International Conference on Learning Representations
(ICLR), 2024. M. Imfeld, J. Graldi, and M. Giordano are the first authors and
contributed equally to this work.
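To make the core idea concrete, here is a minimal NumPy sketch of OT-based soft alignment for a single fully-connected layer: the output neurons of one model's weight matrix are softly matched to the other's via an entropy-regularized transport plan (a plain Sinkhorn loop), then the aligned weights are averaged. This is an illustration of the general technique, not the paper's exact procedure; the function names, the cost choice, and the regularization value are assumptions made for this sketch.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iter=200):
    """Entropy-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass over source neurons
    b = np.full(m, 1.0 / m)   # uniform mass over target neurons
    K = np.exp(-cost / reg)   # Gibbs kernel; entries in (0, 1]
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan; total mass sums to 1

def fuse_layers(w_a, w_b, reg=0.05):
    """Soft-align the rows (output neurons) of w_b to w_a, then average."""
    # Cost: squared Euclidean distance between the neurons' incoming weights.
    cost = ((w_a[:, None, :] - w_b[None, :, :]) ** 2).sum(-1)
    plan = sinkhorn(cost, reg)
    # Barycentric projection: each neuron of w_a receives a soft mix of w_b's neurons.
    w_b_aligned = (plan / plan.sum(axis=1, keepdims=True)) @ w_b
    return 0.5 * (w_a + w_b_aligned)
```

In the extreme case where one layer is just a row permutation of the other, the plan concentrates on the permutation and the fusion recovers the original weights; the full method must additionally propagate alignments across consecutive layers and handle Transformer-specific components, which this single-layer sketch omits.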