Modular and Parameter-efficient Fine-tuning of Language Models

Abstract

Transfer learning has recently become the dominant paradigm of natural language processing. Models pre-trained on unlabeled data can be fine-tuned for downstream tasks from only a handful of examples. A long-term goal is to develop models that acquire new information at scale without incurring negative transfer and that generalize systematically to new settings. Modular deep learning has emerged as a promising solution to these challenges, by updating parameter-efficient units of computation locally and asynchronously. These units are often implemented as modules that are interleaved between layers, interpolated with pre-trained parameters, or concatenated to the inputs. Conditioned on tasks or examples, information is routed to multiple modules through a fixed or learned function, followed by an aggregation of their outputs. This property enables compositional generalization by disentangling knowledge and recombining it in new ways. In this thesis, we provide a unified view of modularity in natural language processing spanning four dimensions; specifically, we disentangle modularity into computation functions, routing functions, aggregation functions, and the training setting. Along those axes, we propose multiple contributions: a research framework which encompasses all four dimensions; a novel attention-based aggregation function which combines the knowledge stored within different modules; routing mechanisms for out-of-distribution generalization in cross-lingual transfer scenarios; a dataset and modular training strategies for multimodal and multilingual transfer learning; and a modular pre-training strategy to tackle catastrophic interference arising from heterogeneous data.
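To make the three ingredients named above concrete, the following is a minimal sketch (not the thesis implementation) of a parameter-efficient computation module (a bottleneck adapter), a learned routing function over several such modules, and a weighted aggregation of their outputs. All class names, dimensions, and the routing-on-pooled-representation choice are illustrative assumptions, written in PyTorch.

```python
# Minimal sketch: computation modules + learned routing + output aggregation.
# Hypothetical names and sizes; not the thesis code.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck computation module inserted after a (frozen) transformer layer."""

    def __init__(self, d_model: int, d_bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pre-trained representation intact.
        return hidden + self.up(torch.relu(self.down(hidden)))


class RoutedAdapterLayer(nn.Module):
    """Routes each example to several adapters and aggregates their outputs."""

    def __init__(self, d_model: int, n_modules: int = 4):
        super().__init__()
        self.adapters = nn.ModuleList([Adapter(d_model) for _ in range(n_modules)])
        # Learned routing function: one score per module, per example.
        self.router = nn.Linear(d_model, n_modules)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model); route on the mean-pooled representation.
        scores = torch.softmax(self.router(hidden.mean(dim=1)), dim=-1)      # (batch, n_modules)
        outputs = torch.stack([m(hidden) for m in self.adapters], dim=1)     # (batch, n_modules, seq, d)
        # Aggregation: convex combination of module outputs, weighted by the router.
        return (scores[:, :, None, None] * outputs).sum(dim=1)


if __name__ == "__main__":
    layer = RoutedAdapterLayer(d_model=32)
    x = torch.randn(8, 10, 32)   # batch of 8 sequences of length 10
    print(layer(x).shape)        # torch.Size([8, 10, 32])
```

Only the adapter and router parameters would be trained here; the softmax routing could be replaced by a fixed (task-conditioned) assignment or a sparse top-k selection, which are the design axes the thesis surveys.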
