Unsupervised Pool-Based Active Learning for Linear Regression
In many real-world machine learning applications, unlabeled data can be
obtained easily, but labeling them is time-consuming and/or expensive. It is
therefore desirable to select the optimal samples to label, so that a good
machine learning model can be trained from a minimal amount of labeled data.
Active learning (AL) has been widely used for this purpose. However, most
existing AL approaches are supervised: they train an initial model from a small
amount of labeled samples, query new samples based on the model, and then
update the model iteratively. Few have considered the completely unsupervised
AL problem: starting from zero, how should the very first few samples to label
be chosen when no label information is available at all? This setting is
especially challenging precisely because there are no labels to exploit. This
paper studies unsupervised pool-based AL for linear regression problems. We
propose a novel AL approach that considers simultaneously the informativeness,
representativeness, and diversity, three essential criteria in AL. Extensive
experiments on 14 datasets from various application domains, using three
different linear regression models (ridge regression, LASSO, and linear support
vector regression), demonstrate the effectiveness of our proposed approach.
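As a rough illustration of the unsupervised setting described above, the Python
sketch below greedily selects an initial batch to label by combining
representativeness (start from the most central point of the pool) and
diversity (then repeatedly pick the point farthest from everything already
selected). It is a generic baseline in the spirit of two of the three criteria
the abstract names, not the paper's proposed approach; the function name
select_initial_samples and all parameters are illustrative.

```python
# Generic unsupervised pool-based selection sketch (NOT the paper's algorithm).
import numpy as np

def select_initial_samples(X, k):
    """Greedily pick k points to label from an unlabeled pool X of shape (n, d)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    # Representativeness: start from the medoid, the point closest to all others.
    selected = [int(np.argmin(dists.sum(axis=1)))]
    while len(selected) < k:
        # Diversity: score each candidate by its distance to the nearest selected point.
        min_dist = dists[:, selected].min(axis=1)
        min_dist[selected] = -np.inf  # never re-pick an already selected point
        selected.append(int(np.argmax(min_dist)))
    return selected

# Usage: label only the selected rows, then fit e.g. ridge regression on them.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 5))
print(select_initial_samples(X_pool, k=10))
```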
Theory of Machine Learning Debugging via M-estimation
We investigate problems in penalized M-estimation, inspired by applications
in machine learning debugging. Data are collected from two pools: one
containing points with possibly contaminated labels, and the other known to
contain only cleanly labeled points. We first formulate a general
statistical algorithm for identifying buggy points and provide rigorous
theoretical guarantees under the assumption that the data follow a linear
model. We then present two case studies to illustrate the results of our
general theory and the dependence of our estimator on clean versus buggy
points. We further propose a procedure for selecting the tuning parameter of
our Lasso-based algorithm and provide corresponding theoretical guarantees.
Finally, we consider a two-person "game" played between a bug generator and a
debugger, where the debugger can augment the contaminated data set with cleanly
labeled versions of points in the original data pool. We establish a
theoretical result showing a sufficient condition under which the bug generator
can always fool the debugger. Nonetheless, we provide empirical results showing
that such a situation may not occur in practice, making it possible for natural
augmentation strategies combined with our Lasso debugging algorithm to succeed.
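For intuition about Lasso-based debugging, the sketch below assumes the common
mean-shift formulation y = X beta + gamma + noise, in which a sparse vector
gamma absorbs contaminated labels, and minimizes
0.5 * ||y - X beta - gamma||^2 + lambda * ||gamma||_1 by alternating least
squares for beta with soft-thresholding for gamma. This is one standard
estimator consistent with the abstract, not necessarily the authors' exact
algorithm, and it ignores the clean pool for simplicity; all names are
illustrative.

```python
# Hedged sketch of a Lasso-style debugging estimator under a linear model.
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def debug_lasso(X, y, lam, n_iter=100):
    """Minimize 0.5*||y - X@beta - gamma||^2 + lam*||gamma||_1.

    Alternates exact minimization over beta (least squares on the adjusted
    labels) and over gamma (soft-thresholding of the residuals). Nonzero
    entries of the returned gamma mark suspected buggy points.
    """
    gamma = np.zeros_like(y)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)
        gamma = soft_threshold(y - X @ beta, lam)
    return beta, gamma

# Usage: inject a few label bugs and check that they are flagged.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
y[:5] += 5.0  # contaminate the first five labels
beta_hat, gamma_hat = debug_lasso(X, y, lam=1.0)
print(np.nonzero(gamma_hat)[0])  # indices of suspected buggy labels
```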