Fine-tuning all the layers of a pre-trained neural language encoder (either
using all the parameters or using parameter-efficient methods) is often the
de-facto way of adapting it to a new task. We show evidence that for different
downstream language tasks, fine-tuning only a subset of layers is sufficient to
obtain performance that is close to and often better than fine-tuning all the
layers in the language encoder. We propose an efficient metric based on the
diagonal of the Fisher information matrix (FIM score), to select the candidate
layers for selective fine-tuning. We show, empirically on GLUE and SuperGLUE
tasks and across distinct language encoders, that this metric can effectively
select layers leading to a strong downstream performance. Our work highlights
that task-specific information corresponding to a given downstream task is
often localized within a few layers, and tuning only those is sufficient for
strong performance. Additionally, we demonstrate the robustness of the FIM
score to rank layers in a manner that remains constant during the optimization
process.Comment: Accepted to EMNLP 202