Soft-Label Dataset Distillation and Text Dataset Distillation
Dataset distillation is a method for reducing dataset sizes by learning a
small number of synthetic samples containing all the information of a large
dataset. This has several benefits, such as speeding up model training, reducing
energy consumption, and reducing required storage space. Currently, each
synthetic sample is assigned a single `hard' label; moreover, dataset
distillation can only be applied to image data.
We propose to simultaneously distill both images and their labels, thus
assigning each synthetic sample a `soft' label (a distribution of labels). Our
algorithm increases accuracy by 2-4% over the original algorithm for several
image classification tasks. Using `soft' labels also enables distilled datasets
to consist of fewer samples than there are classes, as each sample can encode
information for multiple classes. For example, training a LeNet model with 10
distilled images (one per class) results in over 96% accuracy on MNIST, and
almost 92% accuracy when trained on just 5 distilled images.
We also extend the dataset distillation algorithm to distill sequential
datasets including texts. We demonstrate that text distillation outperforms
other methods across multiple datasets. For example, models attain almost their
original accuracy on the IMDB sentiment analysis task using just 20 distilled
sentences.
Our code can be found at
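For concreteness, the following is a minimal sketch of the soft-label idea under simplifying assumptions (a linear learner in place of LeNet, random placeholder data, and a single inner training step); it is not the authors' released implementation:

```python
# Minimal soft-label dataset distillation sketch (assumptions: linear learner,
# random placeholder data, one inner training step). Both the synthetic images
# and their soft labels are trainable; they are updated so that a learner
# trained on them performs well on real data.
import torch
import torch.nn.functional as F

n_classes, n_distilled, dim = 10, 5, 784          # 5 synthetic samples, fewer than classes
x_syn = torch.randn(n_distilled, dim, requires_grad=True)               # synthetic "images"
y_syn_logits = torch.randn(n_distilled, n_classes, requires_grad=True)  # soft labels (pre-softmax)
meta_opt = torch.optim.SGD([x_syn, y_syn_logits], lr=0.1)
inner_lr = 0.01

for step in range(100):
    # fresh randomly initialised learner (stand-in for LeNet)
    w = (0.01 * torch.randn(dim, n_classes)).requires_grad_()

    # inner step: train the learner on the distilled data with soft-label cross-entropy
    y_syn = F.softmax(y_syn_logits, dim=1)
    inner_loss = -(y_syn * F.log_softmax(x_syn @ w, dim=1)).sum() / n_distilled
    (g_w,) = torch.autograd.grad(inner_loss, w, create_graph=True)
    w_new = w - inner_lr * g_w

    # outer step: evaluate the one-step learner on a real batch (placeholder data here)
    x_real = torch.randn(64, dim)
    y_real = torch.randint(0, n_classes, (64,))
    outer_loss = F.cross_entropy(x_real @ w_new, y_real)

    meta_opt.zero_grad()
    outer_loss.backward()          # gradients flow into x_syn and y_syn_logits
    meta_opt.step()
```

Because each soft label is a full distribution over classes, a single synthetic sample can carry information about several classes at once, which is what allows distilled sets smaller than the number of classes.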
On the Equivalence of Common Approaches to Cross Sectional Weights in Household Panel Surveys
The computation of cross sectional weights in household panels is challenging because household compositions change over time. Sampling probabilities of new household entrants are generally not known, and assigning them zero weight is not satisfactory. Two common approaches to cross sectional weighting address this issue: (1) "shared weights" and (2) modeling or estimating unobserved sampling probabilities based on person-level characteristics. We survey how several well-known national household panels address cross sectional weights for different groups of respondents (including immigrants and births) and in different situations (including household mergers and splits). We show that for certain estimated sampling probabilities the modeling approach gives the same weights as fair shares, the most common of the shared weights approaches. Rather than abandoning the shared weights approach when orphan respondents (respondents in households without sampling weights) exist, we propose a hybrid approach: estimating sampling weights of newly orphaned respondents only.
Keywords: BHPS, HILDA, PSID, SOEP, modeled weights, shared weights, fair shares
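As a concrete illustration of the fair shares rule discussed above (one common formulation, not tied to any particular panel's implementation), each current household member can be given the household mean of the members' design weights, with new entrants of unknown sampling probability contributing zero:

```python
# Fair-shares sketch: every current household member receives the household
# mean of the members' design weights; new entrants whose sampling probability
# is unknown contribute a weight of zero. Names and numbers are illustrative.
from statistics import mean

household = [
    {"person": "sampled adult",   "design_weight": 1500.0},
    {"person": "sampled partner", "design_weight": 1500.0},
    {"person": "new partner",     "design_weight": None},   # joined after wave 1
]

def fair_shares_weights(members):
    shared = mean(m["design_weight"] or 0.0 for m in members)
    return {m["person"]: shared for m in members}

print(fair_shares_weights(household))   # every member gets (1500 + 1500 + 0) / 3 = 1000.0
```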
Nearest Labelset Using Double Distances for Multi-label Classification
Multi-label classification is a type of supervised learning where an instance
may be assigned multiple labels simultaneously. Predicting each label
independently has been criticized for not exploiting any correlation between
labels. In this paper we propose a novel approach, Nearest Labelset using
Double Distances (NLDD), that predicts the labelset observed in the training
data that minimizes a weighted sum of the distances in both the feature space
and the label space to the new instance. The weights specify the relative
tradeoff between the two distances. The weights are estimated from a binomial
regression of the number of misclassified labels as a function of the two
distances. Model parameters are estimated by maximum likelihood. NLDD only
considers labelsets observed in the training data, thus implicitly taking into
account label dependencies. Experiments on benchmark multi-label data sets show
that the proposed method on average outperforms other well-known approaches in
terms of Hamming loss, 0/1 loss, and multi-label accuracy, and ranks second
after ECC on the F-measure.
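A minimal sketch of the prediction step may help; it assumes the new instance's position in label space is given by per-label predicted probabilities (as in binary relevance), uses Euclidean distances, and fixes the two weights instead of estimating them by the binomial regression described above:

```python
# Nearest-labelset sketch: pick the observed training labelset minimising a
# weighted sum of feature-space and label-space distances to the new instance.
# w_feat and w_label are placeholders for the regression-estimated weights.
import numpy as np

def nldd_predict(X_train, Y_train, x_new, p_new, w_feat=1.0, w_label=1.0):
    d_feat = np.linalg.norm(X_train - x_new, axis=1)     # distance to each training instance
    d_label = np.linalg.norm(Y_train - p_new, axis=1)    # distance of its labelset to predicted probs
    best = np.argmin(w_feat * d_feat + w_label * d_label)
    return Y_train[best]                                  # predicted labelset (a row of 0/1s)

# toy example: 3 training instances, 4 labels
X_train = np.array([[0.0, 0.1], [1.0, 0.9], [0.5, 0.5]])
Y_train = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
x_new = np.array([0.9, 1.0])
p_new = np.array([0.1, 0.8, 0.7, 0.2])    # per-label probabilities for the new instance
print(nldd_predict(X_train, Y_train, x_new, p_new))       # -> [0 1 1 0]
```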
What do web survey panel respondents answer when asked “Do you have any other comment?”
Near the end of a web survey, respondents are often asked whether they have additional comments. Such final comments are usually ignored, partly because open-ended questions are more challenging to analyze. A random sample of final comments in the LISS panel and the Dutch Immigrant panel was categorized into one of nine categories (neutral, positive, and multiple subcategories of negative). While few respondents chose to make a final comment, doing so is more common in the Immigrant panel (5.7%) than in the LISS panel (3.6%). In both panels there are slightly more neutral than negative comments, and very few positive comments. The number of final comments about unclear questions was 2.7 times larger in the Immigrant panel than in the LISS panel. The number of final comments complaining about survey length, on the other hand, was 2.7 times larger in the LISS panel than in the Immigrant panel. Researchers might want to consider additional pretesting of questions when fielding a questionnaire in the Immigrant panel.
Household Survey Panels: How Much Do Following Rules Affect Sample Size?
In household panels, typically all household members are surveyed. Because household composition changes over time, so-called following rules are implemented to decide whether to continue surveying household members who leave the household (e.g. former spouses/partners, grown children) in subsequent waves. Following rules have been largely ignored in the literature, leaving panel designers unaware of the breadth of their options and forcing them to make ad hoc decisions. In particular, the extent to which various following rules affect sample size over time is unknown. From an operational point of view, such knowledge is important because sample size greatly affects costs. Moreover, the decision of whom to follow has irreversible consequences, as finding household members who moved out years earlier is very difficult. We find that household survey panels implement a wide variety of following rules but that their effect on sample size is relatively limited. Even after 25 years, the rule "follow only wave 1 respondents" still captures 85% of the respondents of the rule "follow everyone who can be traced back to a wave 1 household through living arrangements". Almost all of the remaining 15% live in households of children of wave 1 respondents who have grown up (5%) and in households of former spouses/partners (10%). Unless attrition is low, there is no danger of an ever-expanding panel, because the growth induced by even wide following rules does not typically exceed attrition.
Keywords: Survey panels, Survey methodology
Automated classification for open-ended questions with BERT
Manual coding of text data from open-ended questions into different
categories is time consuming and expensive. Automated coding uses
statistical/machine learning to train on a small subset of manually coded text
answers. Recently, pre-training a general language model on vast amounts of
unrelated data and then adapting the model to the specific application has
proven effective in natural language processing. Using two data sets, we
empirically investigate whether BERT, the currently dominant pre-trained
language model, is more effective at automated coding of answers to open-ended
questions than other non-pre-trained statistical learning approaches. First, we
found that fine-tuning the pre-trained BERT parameters is essential, as otherwise
BERT is not competitive. Second, we found that fine-tuned BERT barely beats the
non-pre-trained statistical learning approaches in terms of classification
accuracy when trained on 100 manually coded observations. However, BERT's
relative advantage increases rapidly when more manually coded observations
(e.g. 200-400) are available for training. We conclude that for automatically
coding answers to open-ended questions, BERT is preferable to non-pre-trained
models such as support vector machines and boosting.
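As an illustration of the fine-tuning setup (a minimal sketch with assumed category codes, toy data, and generic hyper-parameters, not the paper's exact configuration):

```python
# Fine-tune a pre-trained BERT classifier on a small set of manually coded
# open-ended answers. Texts, codes and hyper-parameters are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts  = ["Too many questions about income.", "I enjoyed this survey."]
labels = torch.tensor([1, 0])      # hypothetical codes, e.g. 0 = neutral, 1 = negative

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)   # all BERT parameters are updated,
                                                         # which the paper finds essential
model.train()
for epoch in range(3):
    out = model(**enc, labels=labels)   # returns loss and logits
    optim.zero_grad()
    out.loss.backward()
    optim.step()

# prediction for a new, uncoded answer
model.eval()
with torch.no_grad():
    new = tok(["The questionnaire was far too long."], return_tensors="pt")
    print(model(**new).logits.argmax(dim=-1))
```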
Coding Text Answers to Open-ended Questions: Human Coders and Statistical Learning Algorithms Make Similar Mistakes
Text answers to open-ended questions are often manually coded into one of several predefined categories or classes. More recently, researchers have begun to employ statistical models to automatically classify such text responses. It is unclear whether such automated coders and human coders find the same type of observations difficult to code, or whether humans and models might be able to compensate for each other’s weaknesses. We analyze correlations between estimated error probabilities of human and automated coders and find: 1) statistical models have higher error rates than human coders; 2) automated coders (models) and human coders tend to make similar coding mistakes; specifically, the correlation between the estimated coding error of a statistical model and that of a human is comparable to that of two humans; 3) two very different statistical models give highly correlated estimated coding errors. Therefore, a) the choice of statistical model does not matter, and b) having a second automated coder would be redundant.
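A small illustration of the kind of comparison described above (not the paper's estimation procedure): take the model's estimated error probability as one minus its predicted probability of the gold-standard code, take a human coder's error as disagreement with the gold standard, and correlate the two across answers.

```python
# Correlate a model's estimated error probability with a human coder's errors.
# Codes and probabilities below are made-up illustrative data.
import numpy as np

gold        = np.array([0, 1, 2, 1, 0, 2, 1, 0])    # gold-standard codes
human_code  = np.array([0, 1, 2, 0, 0, 2, 2, 0])    # one human coder's codes
model_probs = np.array([                             # model's class probabilities
    [0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.4, 0.35, 0.25],
    [0.8, 0.1, 0.1],   [0.15, 0.15, 0.7], [0.3, 0.4, 0.3], [0.6, 0.3, 0.1],
])

model_error = 1.0 - model_probs[np.arange(len(gold)), gold]   # estimated model error prob.
human_error = (human_code != gold).astype(float)              # observed human error (0/1)

print(np.corrcoef(model_error, human_error)[0, 1])            # error-error correlation
```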
ConvART: Improving Adaptive Resonance Theory for Unsupervised Image Clustering
While supervised learning techniques have become increasingly adept at separating images into different classes, these techniques require large amounts of labelled data which may not always be available. We propose a novel neuro-dynamic method for unsupervised image clustering by combining 2 biologically-motivated models: Adaptive Resonance Theory (ART) and Convolutional Neural Networks (CNN). ART networks are unsupervised clustering algorithms that have high stability in preserving learned information while quickly learning new information. Meanwhile, a major property of CNNs is their translation and distortion invariance, which has led to their success in the domain of vision problems. By embedding convolutional layers into an ART network, the useful properties of both networks can be leveraged to identify different clusters within unlabelled image datasets and classify images into these clusters. In exploratory experiments, we demonstrate that this method greatly increases the performance of unsupervised ART networks on a benchmark image dataset.
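A rough sketch of the general idea (not the authors' ConvART architecture): convolutional layers supply translation-tolerant features, and an ART-style routine either assigns each feature vector to the best matching category when it passes a vigilance test or creates a new category otherwise.

```python
# CNN features feeding an ART-style clustering routine with a vigilance test.
# The network, similarity measure and hyper-parameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Sequential(                       # stand-in convolutional feature extractor
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

def art_cluster(features, vigilance=0.9, lr=0.5):
    """Assign each feature vector to an ART-style category prototype."""
    prototypes, assignments = [], []
    for f in features:
        f = F.normalize(f, dim=0)                       # unit-length feature
        if prototypes:
            sims = torch.stack([torch.dot(f, p) for p in prototypes])
            best = int(sims.argmax())
            if sims[best] >= vigilance:                 # resonance: update the winner
                prototypes[best] = F.normalize(prototypes[best] + lr * (f - prototypes[best]), dim=0)
                assignments.append(best)
                continue
        prototypes.append(f)                            # mismatch: create a new category
        assignments.append(len(prototypes) - 1)
    return assignments, prototypes

images = torch.randn(32, 1, 28, 28)                     # placeholder for unlabelled images
with torch.no_grad():
    assignments, protos = art_cluster(conv(images))
print(f"{len(protos)} clusters found")
```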
- …