Soft-Label Dataset Distillation and Text Dataset Distillation
Dataset distillation is a method for reducing dataset sizes by learning a
small number of synthetic samples containing all the information of a large
dataset. This has several benefits, such as speeding up model training, reducing
energy consumption, and reducing required storage space. Currently, each
synthetic sample is assigned a single `hard' label, and dataset distillation
can only be applied to image data.
We propose to simultaneously distill both images and their labels, thus
assigning each synthetic sample a `soft' label (a distribution of labels). Our
algorithm increases accuracy by 2-4% over the original algorithm for several
image classification tasks. Using `soft' labels also enables distilled datasets
to consist of fewer samples than there are classes, as each sample can encode
information for multiple classes. For example, training a LeNet model with 10
distilled images (one per class) results in over 96% accuracy on MNIST, and
almost 92% accuracy when trained on just 5 distilled images.
We also extend the dataset distillation algorithm to distill sequential
datasets, including text. We demonstrate that text distillation outperforms
other methods across multiple datasets. For example, models attain almost their
original accuracy on the IMDB sentiment analysis task using just 20 distilled
sentences.
Our code can be found at
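To make the soft-label idea concrete, here is a minimal PyTorch-style sketch of jointly learning synthetic images and their soft labels; the model, shapes, learning rate, and single inner step are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

n_syn, n_classes = 10, 10                              # e.g. one synthetic image per MNIST class
syn_images = torch.randn(n_syn, 1, 28, 28, requires_grad=True)        # learnable synthetic images
syn_label_logits = torch.zeros(n_syn, n_classes, requires_grad=True)  # learnable soft labels

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, n_classes))
params = {k: v.detach().clone().requires_grad_(True) for k, v in model.named_parameters()}

def soft_cross_entropy(logits, soft_labels):
    # cross-entropy against a full label distribution rather than a single hard index
    return -(soft_labels * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Inner step: train the network on the distilled data, kept differentiable.
inner_loss = soft_cross_entropy(functional_call(model, params, (syn_images,)),
                                F.softmax(syn_label_logits, dim=1))
grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
new_params = {k: v - 0.01 * g for (k, v), g in zip(params.items(), grads)}

# Outer step: evaluate the updated network on real data and backpropagate all the
# way into both the synthetic images and their soft labels.
real_x, real_y = torch.randn(64, 1, 28, 28), torch.randint(0, n_classes, (64,))
outer_loss = F.cross_entropy(functional_call(model, new_params, (real_x,)), real_y)
outer_loss.backward()   # populates syn_images.grad and syn_label_logits.grad

In a full run, the inner loop would take several steps and an outer optimizer would repeatedly update syn_images and syn_label_logits from these gradients.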
Dataset Distillation for Medical Dataset Sharing
Sharing medical datasets between hospitals is challenging because of
privacy-protection concerns and the massive cost of transmitting and storing
many high-resolution medical images. However, dataset distillation can
synthesize a small dataset such that models trained on it achieve performance
comparable to training on the original large dataset, which shows potential for
solving these medical data sharing problems. Hence, this paper proposes a novel
dataset distillation-based method for medical dataset sharing. Experimental
results on a COVID-19 chest X-ray image dataset show that our method can
achieve high detection performance even using scarce anonymized chest X-ray
images.
Dataset Distillation using Parameter Pruning
In many fields, the acquisition of advanced models relies on large datasets,
which makes storing datasets and training models expensive. As a solution,
dataset distillation can synthesize a small dataset that preserves most
information of the original large dataset. The recently proposed dataset
distillation method by matching network parameters has been proven effective
for several datasets. However, the dimension of the network parameters is
usually large, and we found that a few parameters are difficult to match during
the distillation process, which harms the distillation performance. Based on this
observation, this paper proposes a new method to solve the problem using
parameter pruning. The proposed method can synthesize more robust distilled
datasets and improve the distillation performance by pruning difficult-to-match
parameters in the distillation process. Experimental results on three datasets
show that the proposed method outperforms other state-of-the-art dataset
distillation methods.
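As a rough illustration of the pruning idea, the sketch below masks out the hardest-to-match parameter elements before computing a parameter-matching loss between a student trained on the distilled data and an expert trained on the real data; the ranking criterion, keep ratio, and loss form are assumptions for illustration, not the paper's exact procedure.

import torch

def pruned_matching_loss(student_params, expert_params, start_params, keep_ratio=0.9):
    # Parameter-matching loss with difficult-to-match elements pruned (masked out).
    num, den = 0.0, 0.0
    for s, e, s0 in zip(student_params, expert_params, start_params):
        err = (s - e) ** 2
        # keep the keep_ratio fraction of elements with the smallest error; the
        # remaining (hardest-to-match) elements are excluded from the loss
        k = max(1, int(keep_ratio * err.numel()))
        threshold = err.flatten().kthvalue(k).values
        mask = (err <= threshold).float()
        num = num + (err * mask).sum()
        den = den + (((e - s0) ** 2) * mask).sum()   # normalize by how far the expert moved
    return num / (den + 1e-12)

In a trajectory-matching setup, backpropagating this loss through the student's short training run on the synthetic data updates the distilled images, with the pruned elements no longer dominating the objective.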
Towards Trustworthy Dataset Distillation
Efficiency and trustworthiness are two enduring goals when deploying deep
learning in real-world applications. With regard to efficiency, dataset
distillation (DD) endeavors to reduce training costs by distilling the large
dataset into a tiny synthetic dataset. However, existing methods merely
concentrate on in-distribution (InD) classification in a closed-world setting,
disregarding out-of-distribution (OOD) samples. On the other hand, OOD
detection aims to enhance models' trustworthiness, which is always
inefficiently achieved in full-data settings. For the first time, we
simultaneously consider both issues and propose a novel paradigm called
Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and
outliers, the condensed datasets are capable of training models competent in both
InD classification and OOD detection. To alleviate the requirement for real
outlier data and make OOD detection more practical, we further propose to
corrupt InD samples to generate pseudo-outliers and introduce Pseudo-Outlier
Exposure (POE). Comprehensive experiments on various settings demonstrate the
effectiveness of TrustDD, and the proposed POE surpasses the state-of-the-art
method Outlier Exposure (OE). Compared with preceding DD methods, TrustDD is
more trustworthy and applicable to real open-world scenarios. Our code will be
publicly available.
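A minimal sketch of the Pseudo-Outlier Exposure idea follows: in-distribution images are corrupted into pseudo-outliers (here with a hypothetical patch-shuffle corruption) and an Outlier-Exposure-style term pushes their predictions toward the uniform distribution. The corruption, weighting, and loss form are illustrative assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def make_pseudo_outliers(x, grid=4):
    # Hypothetical corruption: shuffle image patches so the content becomes OOD-like.
    # Assumes the height and width of x are divisible by `grid`.
    b, c, h, w = x.shape
    ph, pw = h // grid, w // grid
    patches = x.unfold(2, ph, ph).unfold(3, pw, pw)        # (b, c, grid, grid, ph, pw)
    patches = patches.reshape(b, c, grid * grid, ph, pw)
    patches = patches[:, :, torch.randperm(grid * grid)]   # random patch permutation
    patches = patches.reshape(b, c, grid, grid, ph, pw)
    return patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)

def trustdd_style_loss(model, x_ind, y_ind, lam=0.5):
    ce = F.cross_entropy(model(x_ind), y_ind)              # InD classification term
    logits_out = model(make_pseudo_outliers(x_ind))
    # OE-style term: cross-entropy to the uniform distribution over classes
    oe = (-F.log_softmax(logits_out, dim=1)).mean()
    return ce + lam * oe

In TrustDD, pseudo-outliers of this kind are distilled alongside the InD samples, so models trained on the condensed set are exposed to both terms.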
Embarrassingly Simple Dataset Distillation
Dataset distillation extracts a small set of synthetic training samples from
a large dataset, with the goal that models trained on this set achieve
competitive performance on test data. In this work, we tackle dataset distillation at
its core by treating it directly as a bilevel optimization problem.
Re-examining the foundational back-propagation through time method, we study
the pronounced variance in the gradients, computational burden, and long-term
dependencies. To address these issues, we introduce an improved method, Random
Truncated Backpropagation Through Time (RaT-BPTT). RaT-BPTT incorporates a truncation
coupled with a random window, effectively stabilizing the gradients and
speeding up the optimization while covering long dependencies. This allows us
to establish new state-of-the-art for a variety of standard dataset benchmarks.
A deeper dive into the nature of distilled data unveils pronounced
intercorrelation. In particular, subsets of distilled datasets tend to exhibit
much worse performance than directly distilled smaller datasets of the same
size. Leveraging RaT-BPTT, we devise a boosting mechanism that generates
distilled datasets that contain subsets with near optimal performance across
different data budgets. (A short version appears at the NeurIPS 2023 WANT workshop.)
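A condensed sketch of the Random Truncated BPTT idea: the inner training loop on the distilled data is unrolled, but the computation graph is only kept inside a randomly placed window of steps, so memory and gradient variance stay bounded while long horizons are still covered on average. The window size, learning rate, and loop structure below are illustrative assumptions.

import random
import torch
import torch.nn.functional as F
from torch.func import functional_call

def rat_bptt_meta_loss(model, params, syn_x, syn_y, real_x, real_y,
                       max_steps=30, window=10, inner_lr=0.02):
    # Place the backprop window at a random position within the unroll horizon.
    start = random.randint(0, max_steps - window)
    for step in range(start + window):
        loss = F.cross_entropy(functional_call(model, params, (syn_x,)), syn_y)
        tracked = step >= start                     # inside the backprop window?
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=tracked)
        params = {k: v - inner_lr * g for (k, v), g in zip(params.items(), grads)}
        if not tracked:
            # truncation: drop the graph for burn-in steps before the window
            params = {k: v.detach().requires_grad_(True) for k, v in params.items()}
    # Meta loss on real data; its gradient reaches syn_x only through the window.
    return F.cross_entropy(functional_call(model, params, (real_x,)), real_y)

Here params would be a dict of network parameters (e.g. {k: v.detach().clone().requires_grad_(True) for k, v in model.named_parameters()}) and syn_x the distilled images with requires_grad=True, so the returned meta loss can be backpropagated into them.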
Multimodal Dataset Distillation for Image-Text Retrieval
Dataset distillation methods offer the promise of reducing a large-scale
dataset down to a significantly smaller set of (potentially synthetic) training
examples, which preserve sufficient information for training a new model from
scratch. So far, dataset distillation methods have been developed for image
classification. However, with the rise in capabilities of vision-language
models, and especially given the scale of datasets necessary to train these
models, the time is ripe to expand dataset distillation methods beyond image
classification. In this work, we take the first steps towards this goal by
expanding on the idea of trajectory matching to create a distillation method
for vision-language datasets. The key challenge is that vision-language
datasets do not have a set of discrete classes. To overcome this, our proposed
multimodal dataset distillation method jointly distills the images and their
corresponding language descriptions in a contrastive formulation. Since there
are no existing baselines, we compare our approach to three coreset selection
methods (strategic subsampling of the training dataset), which we adapt to the
vision-language setting. We demonstrate significant improvements on the
challenging Flickr30K and COCO retrieval benchmarks: the best coreset selection
method, which selects 1000 image-text pairs for training, achieves only 5.6%
image-to-text retrieval accuracy (recall@1); in contrast, our dataset
distillation approach almost doubles that with just 100 (an order of magnitude
fewer) training pairs.
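A minimal sketch of the contrastive formulation over distilled image-text pairs, written as a symmetric CLIP-style InfoNCE loss; the encoders, embedding sizes, and temperature are stand-in assumptions, and the trajectory-matching outer loop from the paper is omitted.

import torch
import torch.nn.functional as F

n_pairs, text_dim = 100, 512
syn_images = torch.randn(n_pairs, 3, 224, 224, requires_grad=True)   # distilled images
syn_text_emb = torch.randn(n_pairs, text_dim, requires_grad=True)    # distilled text embeddings

def contrastive_loss(image_encoder, text_head, temperature=0.07):
    # Symmetric InfoNCE: the i-th distilled image should match the i-th distilled text.
    img = F.normalize(image_encoder(syn_images), dim=1)
    txt = F.normalize(text_head(syn_text_emb), dim=1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(n_pairs)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

This contrastive loss takes the place of the per-class classification loss used in image-only distillation; the distilled pairs are then optimized so that short training trajectories on them track trajectories computed on the full dataset.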