The rapid advancement of large language models (LLMs) has sparked interest in
data synthesis techniques that aim to generate diverse, high-quality
synthetic datasets. However, such synthetic datasets often suffer from limited
diversity and added noise. In this paper, we present TarGEN, a multi-step
prompting strategy for generating high-quality synthetic datasets using an
LLM. An advantage of TarGEN is its seedless nature; it does not require
specific task instances, broadening its applicability beyond task replication.
We augment TarGEN with a method known as self-correction, empowering LLMs to
rectify inaccurately labeled instances during dataset creation, ensuring
reliable labels. To assess our technique's effectiveness, we emulate 8 tasks
from the SuperGLUE benchmark and finetune various language models, including
encoder-only, encoder-decoder, and decoder-only models, on both synthetic and
original training sets. Evaluation on the original test set reveals that models
trained on datasets generated by TarGEN perform approximately 1-2 percentage
points better than those trained on the original datasets (82.84% on synthetic
vs. 81.12% on original data with Flan-T5). With instruction tuning, performance
rises to 84.54% on synthetic vs. 81.49% on original data for Flan-T5. A
comprehensive analysis comparing the synthetic and original datasets reveals
that the synthetic data exhibit similar or higher levels of complexity and
diversity, along with a bias level that aligns closely with the original. Finally,
when pre-finetuned on our synthetic SuperGLUE dataset, T5-3B yields impressive
results on the OpenLLM leaderboard, surpassing a model trained on the
Self-Instruct dataset by 4.14 percentage points. We hope that TarGEN proves
helpful for high-quality data generation and for reducing the human effort
required to create complex benchmarks.
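To make the self-correction step described in the abstract concrete, below is a minimal illustrative sketch in Python of how an LLM could be re-prompted to verify and fix synthetic labels. All names here (call_llm, self_correct, the prompt wording) are hypothetical assumptions for illustration, not the paper's actual pipeline or prompts.

```python
# Illustrative sketch of LLM-based self-correction for synthetic labels.
# Hypothetical example, not the paper's implementation: call_llm is a
# placeholder for whatever LLM API is used, and the prompt is invented.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM completion call (API or local model)."""
    raise NotImplementedError("plug in your LLM backend here")

def self_correct(instances: list[dict]) -> list[dict]:
    """Re-verify each synthetic (input, label) pair with the LLM and
    replace labels the model judges incorrect."""
    corrected = []
    for ex in instances:
        prompt = (
            "You labeled the following instance while generating a dataset.\n"
            f"Input: {ex['input']}\n"
            f"Proposed label: {ex['label']}\n"
            "Respond with the correct label only."
        )
        verified_label = call_llm(prompt).strip()
        corrected.append({"input": ex["input"], "label": verified_label})
    return corrected
```

Under this reading, self-correction is a second LLM pass over the freshly generated dataset, so label quality is checked instance by instance rather than trusted from the initial generation.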