Dataset distillation methods offer the promise of reducing a large-scale
dataset down to a significantly smaller set of (potentially synthetic) training
examples, which preserve sufficient information for training a new model from
scratch. So far, dataset distillation methods have been developed only for image
classification. However, with the rise in capabilities of vision-language
models, and especially given the scale of datasets necessary to train these
models, the time is ripe to expand dataset distillation methods beyond image
classification. In this work, we take the first steps towards this goal by expanding on the idea of trajectory matching to create a distillation method for vision-language datasets.
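As a concrete reference point, the sketch below shows the normalized parameter-matching objective commonly used for trajectory matching in image classification, where a student network trained for a few steps on the distilled data is matched against checkpoints from an expert trained on the real data. The PyTorch framing and function name are illustrative assumptions, not our actual implementation.

    import torch

    def trajectory_matching_loss(student_params, expert_start, expert_target):
        # Squared distance between the student parameters (reached by training
        # on the distilled data, initialized at expert_start) and the expert
        # checkpoint reached by training on the full real dataset ...
        num = sum(((p_s - p_t) ** 2).sum()
                  for p_s, p_t in zip(student_params, expert_target))
        # ... normalized by how far the expert itself moved over that window,
        # so the loss is comparable across stages of training.
        den = sum(((p_0 - p_t) ** 2).sum()
                  for p_0, p_t in zip(expert_start, expert_target))
        return num / den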
The key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed
multimodal dataset distillation method jointly distills the images and their corresponding language descriptions in a contrastive formulation.
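As a rough illustration, one instantiation of such a contrastive formulation is a CLIP-style symmetric InfoNCE loss over the distilled image-text pairs, sketched below; the temperature value and function name are assumptions for illustration, not details of our method.

    import torch
    import torch.nn.functional as F

    def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # Cosine similarity between every distilled image and every caption.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature
        # The i-th image and the i-th caption form the positive pair.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric objective: image-to-text plus text-to-image matching.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))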
Since there are no existing baselines, we compare our approach to three coreset selection
methods (strategic subsampling of the training dataset), which we adapt to the
vision-language setting. We demonstrate significant improvements on the
challenging Flickr30K and COCO retrieval benchmarks: the best coreset selection method, which selects 1000 image-text pairs for training, achieves only
5.6% image-to-text retrieval accuracy (recall@1); in contrast, our dataset
distillation approach almost doubles that with just 100 (an order of magnitude
fewer) training pairs.