Soft-Label Dataset Distillation and Text Dataset Distillation
Dataset distillation is a method for reducing dataset sizes by learning a
small number of synthetic samples containing all the information of a large
dataset. This has several benefits, such as speeding up model training,
reducing energy consumption, and reducing required storage space. Currently,
each synthetic sample is assigned a single `hard' label, and dataset
distillation can only be applied to image data.
We propose to simultaneously distill both images and their labels, thus
assigning each synthetic sample a `soft' label (a distribution of labels). Our
algorithm increases accuracy by 2-4% over the original algorithm for several
image classification tasks. Using `soft' labels also enables distilled datasets
to consist of fewer samples than there are classes, since each sample can encode
information for multiple classes. For example, training a LeNet model with 10
distilled images (one per class) results in over 96% accuracy on MNIST, and
almost 92% accuracy when trained on just 5 distilled images.
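To make the soft-label idea concrete, here is a minimal, hedged sketch of the bilevel optimization in PyTorch. It is not the paper's implementation: the student is reduced to a linear classifier so that the inner SGD step stays an explicit differentiable expression, and `real_loader` (assumed to yield flattened MNIST-style batches), the learning rates, and the single inner step are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

n_distilled, n_classes, dim = 10, 10, 28 * 28
# Both the synthetic images and their soft labels are learnable.
x_syn = torch.randn(n_distilled, dim, requires_grad=True)
y_syn = torch.randn(n_distilled, n_classes, requires_grad=True)  # soft-label logits
lr_inner = 0.02
outer_opt = torch.optim.SGD([x_syn, y_syn], lr=0.1)

def student(x, w):
    # Deliberately tiny linear student (the paper trains e.g. LeNet), kept
    # functional so the inner update below remains differentiable.
    return x @ w

for x_real, y_real in real_loader:  # hypothetical loader of (batch, 784) images
    # A fresh student is (re-)initialized at every outer step.
    w0 = torch.zeros(dim, n_classes, requires_grad=True)

    # Inner step: fit the student to the synthetic set under its soft labels.
    inner_loss = F.kl_div(F.log_softmax(student(x_syn, w0), dim=1),
                          F.softmax(y_syn, dim=1), reduction="batchmean")
    (g,) = torch.autograd.grad(inner_loss, w0, create_graph=True)
    w1 = w0 - lr_inner * g  # differentiable SGD update

    # Outer step: the distilled set is good if the one-step student does
    # well on real data; gradients flow back into x_syn and y_syn.
    outer_loss = F.cross_entropy(student(x_real, w1), y_real)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```

Because the labels are a learned distribution rather than a fixed one-hot vector, nothing in this loop requires one synthetic sample per class, which is what allows distilled sets smaller than the number of classes.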
We also extend the dataset distillation algorithm to distill sequential
datasets, including text. We demonstrate that text distillation outperforms
other methods across multiple datasets. For example, models attain almost their
original accuracy on the IMDB sentiment analysis task using just 20 distilled
sentences.
Our code can be found at
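As a companion to the image sketch above, the sequential extension can be pictured as follows: discrete tokens cannot be optimized by gradient descent, so the distilled "sentences" are learned directly in embedding space. This is a hedged sketch, not the paper's exact setup; the mean-pooled classifier, the dimensions, and `real_text_loader` (assumed to yield already-embedded real sentences) are illustrative.

```python
import torch
import torch.nn.functional as F

n_sent, seq_len, emb_dim, n_classes = 20, 30, 50, 2  # e.g. IMDB pos/neg
# Distilled "sentences" live in embedding space, with learnable soft labels.
s_syn = torch.randn(n_sent, seq_len, emb_dim, requires_grad=True)
y_syn = torch.randn(n_sent, n_classes, requires_grad=True)
lr_inner = 0.02
outer_opt = torch.optim.SGD([s_syn, y_syn], lr=0.1)

def student(e, w):
    # Mean-pooled bag-of-embeddings classifier; real sentences pass through
    # a frozen embedding table first so both inputs share this space.
    return e.mean(dim=1) @ w

for e_real, y_real in real_text_loader:  # hypothetical: (batch, seq_len, emb_dim)
    w0 = torch.zeros(emb_dim, n_classes, requires_grad=True)
    # Inner step on the synthetic sentences, then outer evaluation on real data.
    inner_loss = F.kl_div(F.log_softmax(student(s_syn, w0), dim=1),
                          F.softmax(y_syn, dim=1), reduction="batchmean")
    (g,) = torch.autograd.grad(inner_loss, w0, create_graph=True)
    outer_loss = F.cross_entropy(student(e_real, w0 - lr_inner * g), y_real)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```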
Neural Dataset Generality
The filters learned by Convolutional Neural Networks (CNNs) from different
datasets often appear similar, most prominently in the first few layers. This
similarity of filters is exploited for transfer learning, and several studies
have analysed the transferability of such features. It is also used as an
initialization technique for different tasks on the same dataset or for the
same task on similar datasets. Off-the-shelf CNN features have capitalized on
this idea, promoting their networks as highly transferable and general, and
they are used in a cavalier manner in day-to-day computer vision tasks.
It is curious that while the filters learned by these CNNs relate to the
atomic structures of the images from which they are learnt, all datasets yield
similar-looking low-level filters. With the understanding that a dataset
containing many such atomic structures learns general filters, which are
therefore useful for initializing other networks, we propose a way to analyse
and quantify generality among datasets from their accuracies on transferred
filters. We applied this metric to several popular character recognition,
natural image, and medical image datasets, and arrived at some interesting
conclusions. On further experimentation, we also discovered that particular
classes within a dataset are more general than others.
Comment: Long version of the paper accepted at IEEE International Conference
on Image Processing 201
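The transfer-based measurement described above can be pictured with a short, hedged sketch. The helper functions `make_model`, `train`, and `evaluate`, the choice of which layers to freeze, and the ratio-based normalization are all illustrative assumptions, not the paper's exact metric.

```python
def generality(source_ds, target_ds, make_model, train, evaluate):
    """Score how general the low-level filters learned on source_ds are,
    judged by accuracy on target_ds when those filters are transferred."""
    # Train on the source dataset, then freeze its early conv filters.
    model = train(make_model(), source_ds)
    for p in model.features[:2].parameters():  # assumed torchvision-style model
        p.requires_grad = False

    # Fine-tune the remaining layers on the target dataset.
    transfer_acc = evaluate(train(model, target_ds), target_ds)

    # Baseline: the same architecture trained from scratch on the target.
    scratch_acc = evaluate(train(make_model(), target_ds), target_ds)

    # A ratio near (or above) 1.0 suggests the transferred filters are general.
    return transfer_acc / scratch_acc
```

Computing this score over all ordered pairs of datasets yields a matrix from which dataset-level generality can be compared; restricting the target to individual classes would give the class-level comparison the abstract mentions.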
Dbnary: Wiktionary as Linked Data for 12 Language Editions with Enhanced Translation Relations
This paper presents the current state of development of the DBnary dataset. DBnary is an RDF dataset, structured using the LEMON vocabulary, that is extracted from twelve different Wiktionary language editions. DBnary also contains additional relations linking translation pairs to their source word senses. The extracted data is registered at http://thedatahub.org/dataset/dbnary
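To show what consuming such LEMON-structured data looks like in practice, here is a hedged Python/rdflib sketch over a tiny inline graph. The namespaces and property names (`dbnary:isTranslationOf`, `dbnary:writtenForm`) follow DBnary's general shape but are illustrative here and should be checked against the registered dataset.

```python
from rdflib import Graph

# Tiny inline LEMON-style graph; URIs and property names are illustrative.
TTL = """
@prefix lemon:  <http://lemon-model.net/lemon#> .
@prefix dbnary: <http://kaiko.getalp.org/dbnary#> .
@prefix ex:     <http://example.org/> .

ex:cat_en a lemon:LexicalEntry ;
    lemon:sense ex:cat_sense1 .

# Enhanced translation relation: the pair points at a source word sense.
ex:trans1 a dbnary:Translation ;
    dbnary:isTranslationOf ex:cat_sense1 ;
    dbnary:writtenForm "chat"@fr .
"""

g = Graph()
g.parse(data=TTL, format="turtle")

# Find translations attached to the senses of each lexical entry.
q = """
PREFIX lemon:  <http://lemon-model.net/lemon#>
PREFIX dbnary: <http://kaiko.getalp.org/dbnary#>
SELECT ?entry ?form WHERE {
    ?entry lemon:sense ?sense .
    ?t dbnary:isTranslationOf ?sense ;
       dbnary:writtenForm ?form .
}
"""
for entry, form in g.query(q):
    print(entry, "->", form)  # http://example.org/cat_en -> chat
```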