3 research outputs found
Understanding In-Context Learning via Supportive Pretraining Data
In-context learning (ICL) improves language models' performance on a variety
of NLP tasks simply by providing a handful of demonstrations at inference time.
It is not well understood why the ICL ability emerges, as the model has never
been specifically trained on such demonstrations. Unlike prior work that
explores the implicit mechanisms behind ICL, we study ICL by investigating the
pretraining data. Specifically, we first adapt an iterative, gradient-based approach to
find a small subset of pretraining data that supports ICL. We observe that
continued pretraining on this small subset significantly improves the model's
ICL ability, by up to 18%. We then contrastively compare the supportive subset
with random subsets of pretraining data and discover: (1) The supportive
pretraining data for ICL do not have higher domain relevance to downstream
tasks. (2) The supportive pretraining data have a higher mass of rarely
occurring, long-tail tokens. (3) The supportive pretraining data are
challenging examples where the information gain from long-range context is
below average, indicating that learning to incorporate difficult long-range
context encourages ICL. Our work takes a first step towards understanding ICL
by analyzing instance-level pretraining data. Our insights have the potential to
enhance the ICL ability of language models by actively guiding the construction
of pretraining data in the future.
Comment: ACL 2023
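The abstract gives the idea of the selection procedure only at a high level. As an illustration, the following minimal Python sketch shows one way an iterative, gradient-based selection could look: rank pretraining examples by how well their language-modeling gradient aligns with the gradient of an ICL loss on a downstream task, and keep the top k. All helper names here (icl_loss_fn, lm_loss_fn, lm_loss_with_context) are hypothetical stand-ins; this is not the paper's actual implementation.

import torch

def flat_grad(loss, model):
    # Flattened gradient of a scalar loss w.r.t. all trainable parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def select_supportive_subset(model, pretrain_examples, icl_batch,
                             icl_loss_fn, lm_loss_fn, k, n_iters=3):
    # Iteratively keep the k pretraining examples whose language-modeling
    # gradient aligns best with the gradient of an ICL loss; in the full
    # recipe, the model would be further pretrained on the kept subset
    # between iterations before re-scoring.
    candidates = list(pretrain_examples)
    for _ in range(n_iters):
        g_icl = flat_grad(icl_loss_fn(model, icl_batch), model)
        scores = []
        for ex in candidates:
            g_ex = flat_grad(lm_loss_fn(model, ex), model)
            scores.append(
                torch.nn.functional.cosine_similarity(g_ex, g_icl, dim=0).item())
        ranked = sorted(zip(scores, range(len(candidates))), reverse=True)
        candidates = [candidates[i] for _, i in ranked[:k]]
    return candidates

def long_range_information_gain(model, example, lm_loss_with_context,
                                short_ctx=64, long_ctx=1024):
    # A plausible proxy for the abstract's "information gain from long-range
    # context": how much the LM loss on the same target span drops when the
    # preceding context grows from short_ctx to long_ctx tokens.
    loss_short = lm_loss_with_context(model, example, context_len=short_ctx)
    loss_long = lm_loss_with_context(model, example, context_len=long_ctx)
    return (loss_short - loss_long).item()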
Text Characterization Toolkit
In NLP, models are usually evaluated by reporting single-number performance
scores on a number of readily available benchmarks, without much deeper
analysis. Here, we argue that deeper analysis of results should become the
de-facto standard when presenting new models or benchmarks, especially given
the well-known fact that benchmarks often contain biases, artefacts, and
spurious correlations. We present a tool that researchers can use to study properties
of the dataset and the influence of those properties on their models'
behaviour. Our Text Characterization Toolkit includes both an easy-to-use
annotation tool and off-the-shelf scripts that can be used for specific
analyses. We also present use cases from three different domains: we use the
tool to predict which examples are difficult for well-known trained models
and to identify (potentially harmful) biases and heuristics that are present
in a dataset.
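To make the kind of analysis the toolkit enables concrete, here is a minimal sketch in the same spirit: compute simple per-example text characteristics and correlate them with per-example model outcomes. The function names and the specific characteristics are illustrative placeholders, not the toolkit's actual API.

from scipy.stats import pearsonr

def characteristics(text):
    # A few illustrative text properties; a real toolkit computes many more.
    tokens = text.split()
    n = max(len(tokens), 1)
    return {
        "length": len(tokens),
        "type_token_ratio": len(set(tokens)) / n,
        "avg_word_length": sum(len(t) for t in tokens) / n,
    }

def correlate_with_performance(texts, correct):
    # texts: input strings; correct: parallel list of 0/1 model outcomes.
    # Returns, per characteristic, its Pearson correlation with correctness,
    # which flags dataset properties that predict model difficulty.
    props = [characteristics(t) for t in texts]
    report = {}
    for name in props[0]:
        values = [p[name] for p in props]
        r, p_value = pearsonr(values, correct)
        report[name] = (r, p_value)
    return report

A strong correlation between, say, example length and errors would suggest a heuristic or bias worth investigating before trusting a single-number benchmark score.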
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
Recent work has shown that fine-tuning large pre-trained language models on a
collection of tasks described via instructions, a.k.a. instruction-tuning,
improves their zero- and few-shot generalization to unseen tasks. However, there
is a limited understanding of the performance trade-offs of different decisions
made during the instruction-tuning process. These decisions include the scale
and diversity of the instruction-tuning benchmark, different task sampling
strategies, fine-tuning with and without demonstrations, training using
specialized datasets for reasoning and dialogue, and finally, the fine-tuning
objectives themselves. In this paper, we characterize the effect of
instruction-tuning decisions on downstream task performance when scaling both
model and benchmark sizes. To this end, we create OPT-IML Bench: a large
benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated
into task categories from 8 existing benchmarks, and prepare an evaluation
framework to measure three types of model generalizations: to tasks from fully
held-out categories, to held-out tasks from seen categories, and to held-out
instances from seen tasks. Through the lens of this framework, we first present
insights about instruction-tuning decisions as applied to OPT-30B and further
exploit these insights to train OPT-IML 30B and 175B, which are
instruction-tuned versions of OPT. OPT-IML demonstrates all three
generalization abilities at both scales on four different evaluation benchmarks
with diverse tasks and input formats: PromptSource, FLAN,
Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly
outperform OPT on all benchmarks, but it is also highly competitive with
existing models fine-tuned on each specific benchmark. We release OPT-IML at both
scales, together with the OPT-IML Bench evaluation framework.
Comment: 55 pages
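The three generalization settings above follow directly from how tasks are grouped into categories. As a rough sketch under assumed inputs (a task-to-category dict, not the released OPT-IML Bench format), one could construct the splits like this:

import random

def make_splits(task_to_category, heldout_cat_frac=0.2,
                heldout_task_frac=0.1, seed=0):
    # Partition tasks to support the three evaluation settings:
    # (1) tasks whose entire category is held out,
    # (2) held-out tasks from otherwise seen categories,
    # (3) training tasks, from which held-out *instances* would later be
    #     carved out at the example level.
    rng = random.Random(seed)
    categories = sorted(set(task_to_category.values()))
    n_cats = max(1, int(len(categories) * heldout_cat_frac))
    heldout_cats = set(rng.sample(categories, n_cats))

    fully_heldout, seen = [], []
    for task in sorted(task_to_category):
        (fully_heldout if task_to_category[task] in heldout_cats
         else seen).append(task)

    rng.shuffle(seen)
    n_tasks = max(1, int(len(seen) * heldout_task_frac))
    return fully_heldout, seen[:n_tasks], seen[n_tasks:]

Evaluating a model on each of the three returned groups separately is what distinguishes category-level, task-level, and instance-level generalization.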