Test suites assess natural language processing models' performance on
specific functionalities: cases of interest involving model robustness,
fairness, or particular linguistic capabilities. They enable fine-grained
evaluations of model aspects that would otherwise go unnoticed in standard
evaluation datasets, but they do not address how to fix the failure cases they
reveal. Previous work has explored functionality learning by fine-tuning
models on suite data. While this improves performance on seen functionalities,
it often does not generalize to unseen ones and can harm general performance.
This paper analyzes a fine-tuning-free approach to functionality learning.
For each functionality in a suite, we generate a specification instruction that
encodes it. We combine the obtained specification instructions to create
specification-augmented prompts, which we feed to language models pre-trained
on natural instruction data to generate suite predictions. A core aspect of our
analysis is to measure the effect that including a set of specifications has on
performance for a held-out set of unseen, qualitatively different specifications. Our
experiments across four tasks and models ranging from 80M to 175B parameters
show that smaller models struggle to follow specification instructions.
However, larger models (>3B parameters) can benefit from specifications and even
generalize desirable behaviors across functionalities.
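
Below is a minimal sketch of how a specification-augmented prompt could be assembled from specification instructions; the prompt template, specification texts, and function name are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical sketch: combining specification instructions into a
# specification-augmented prompt for an instruction-tuned language model.
# All templates and example texts below are assumptions for illustration.

def build_prompt(task_instruction: str, specs: list[str], input_text: str) -> str:
    """Prepend specification instructions to the task instruction and input."""
    spec_block = "\n".join(f"- {s}" for s in specs)
    return (
        f"{task_instruction}\n"
        f"Follow these specifications:\n{spec_block}\n"
        f"Input: {input_text}\n"
        f"Answer:"
    )

# One specification instruction per suite functionality (illustrative).
specifications = [
    "Handle negation correctly: 'not bad' expresses a positive sentiment.",
    "Judge sentiment identically regardless of the names mentioned.",
]

prompt = build_prompt(
    "Classify the sentiment of the input as positive or negative.",
    specifications,
    "The movie was not bad at all.",
)
print(prompt)  # This prompt would be fed to the model to obtain a suite prediction.
```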