State-of-the-art deep neural networks are trained on large amounts
(millions or even billions) of data samples. The heavy computation and memory costs
make it difficult to train them on limited hardware resources, especially for
recent popular large language models (LLMs) and computer vision (CV) models.
Dataset distillation methods have thus been developed, aiming to
reduce the number of training samples by synthesizing small-scale datasets via
gradient matching. However, as the gradient calculation is coupled with a
specific network architecture, the synthesized dataset is biased and performs
poorly when used to train unseen architectures. To address these
limitations, we present dataset quantization (DQ), a new framework to compress
large-scale datasets into small subsets which can be used to train any
neural network architecture. Extensive experiments demonstrate that DQ is able
to generate condensed small datasets for training unseen network architectures
with state-of-the-art compression ratios for lossless model training. To the
best of our knowledge, DQ is the first method that can successfully distill
large-scale datasets such as ImageNet-1k with a state-of-the-art compression
ratio. Notably, with only 60% of the ImageNet data and 20% of Alpaca's
instruction-tuning data, models can be trained with negligible or no
performance drop on both vision tasks (including classification, semantic
segmentation, and object detection) and language tasks (including
instruction-tuning benchmarks such as BBH and DROP).
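The abstract contrasts DQ with gradient-matching dataset distillation but does not show the mechanism. Below is a minimal, hypothetical PyTorch sketch of gradient matching, intended only to illustrate why the synthesized data is coupled to a specific architecture; the toy network, data shapes, and hyperparameters are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch of gradient-matching dataset distillation (not DQ itself).
# The synthetic images are optimized so that the gradients they induce in a
# particular network match the gradients from real data; the result therefore
# depends on that network's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

def match_loss(grads_real, grads_syn):
    # Cosine distance between real and synthetic gradients, summed over layers.
    loss = 0.0
    for gr, gs in zip(grads_real, grads_syn):
        loss = loss + (1 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0))
    return loss

# Toy network (assumption); the distilled set is biased toward this architecture.
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

# Small learnable synthetic set: 10 images per class (assumed sizes).
syn_x = torch.randn(100, 1, 28, 28, requires_grad=True)
syn_y = torch.arange(10).repeat_interleave(10)
opt_syn = torch.optim.SGD([syn_x], lr=0.1)

# Stand-in for a real training batch.
real_x, real_y = torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,))

for step in range(100):
    params = [p for p in net.parameters() if p.requires_grad]
    g_real = torch.autograd.grad(F.cross_entropy(net(real_x), real_y), params)
    g_syn = torch.autograd.grad(F.cross_entropy(net(syn_x), syn_y), params,
                                create_graph=True)
    loss = match_loss([g.detach() for g in g_real], g_syn)
    opt_syn.zero_grad()
    loss.backward()   # gradients flow back into the synthetic images
    opt_syn.step()
```

Because the matching objective is computed through one concrete network, a different (unseen) architecture yields different gradients, which is the architecture bias DQ is designed to avoid.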