Instruction tuning has emerged to enhance the capabilities of large language
models (LLMs) to comprehend instructions and generate appropriate responses.
Existing methods either manually annotate or employ LLM (e.g., GPT-series) to
generate data for instruction tuning. However, they often overlook associating
instructions with existing annotated datasets. In this paper, we propose
Dynosaur, a dynamic growth paradigm for the automatic curation of
instruction-tuning data. Based on the metadata of existing datasets, we use
LLMs to automatically construct instruction-tuning data by identifying relevant
data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several
advantages: 1) it reduces the API cost for generating instructions (e.g., it
costs less than $12 USD by calling GPT-3.5-turbo for generating 800K
instruction tuning samples; 2) it provides high-quality data for instruction
tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform
with comparable data sizes); and 3) it supports the continuous improvement of
models by generating instruction-tuning data when a new annotated dataset
becomes available. We further investigate a continual learning scheme for
learning with the ever-growing instruction-tuning dataset, and demonstrate that
replaying tasks with diverse instruction embeddings not only helps mitigate
forgetting issues but generalizes to unseen tasks better.
Code and data are available at https://github.com/WadeYin9712/Dynosaur.Comment: EMNLP 2023. Code and data are available at
https://github.com/WadeYin9712/Dynosau