Model Uncertainty based Active Learning on Tabular Data using Boosted Trees
Supervised machine learning relies on the availability of good labelled data
for model training. Labelled data is acquired by human annotation, which is a
cumbersome and costly process, often requiring subject matter experts. Active
learning is a sub-field of machine learning which helps in obtaining the
labelled data efficiently by selecting the most valuable data instances for
model training and querying the labels only for those instances from the human
annotator. Recently, a lot of research has been done in the field of active
learning, especially for deep neural network based models. Although deep
learning shines when dealing with image, textual, or multimodal data, gradient
boosting methods still tend to achieve much better results on tabular data. In
this work, we explore active learning for tabular data using boosted trees.
Uncertainty based sampling in active learning is the most commonly used
querying strategy, wherein the labels of those instances are sequentially
queried for which the current model prediction is maximally uncertain. Entropy
is often the choice for measuring uncertainty. However, entropy is not exactly
a measure of model uncertainty. Although there has been a lot of work in deep
learning for measuring model uncertainty and employing it in active learning,
it is yet to be explored for non-neural network models. To this end, we explore
the effectiveness of boosted trees based model uncertainty methods in active
learning. Leveraging this model uncertainty, we propose an uncertainty based
sampling in active learning for regression tasks on tabular data. Additionally,
we also propose a novel cost-effective active learning method for regression
tasks along with an improved cost-effective active learning method for
classification tasks.
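The uncertainty-based sampling strategy described above can be sketched for the regression case. The paper's exact uncertainty estimator is not given in the abstract, so this illustration uses a common stand-in: disagreement (variance) across an ensemble of gradient-boosted trees trained with different random seeds. The function name and parameters are hypothetical.

```python
# Illustrative uncertainty-based active learning step for regression:
# variance across a small ensemble of boosted trees serves as a proxy
# for model uncertainty. This is a sketch, not the paper's method.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def query_most_uncertain(X_labeled, y_labeled, X_pool, n_models=5, n_queries=10):
    """Return pool indices whose predictions disagree most across the ensemble."""
    preds = []
    for seed in range(n_models):
        model = GradientBoostingRegressor(subsample=0.8, random_state=seed)
        model.fit(X_labeled, y_labeled)
        preds.append(model.predict(X_pool))
    variance = np.var(np.stack(preds), axis=0)  # per-instance disagreement
    return np.argsort(variance)[-n_queries:]    # most uncertain instances

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
idx = query_most_uncertain(X[:40], y[:40], X[40:], n_queries=5)
print(idx.shape)  # (5,)
```

In a full active learning loop, the selected instances would be sent to the annotator, added to the labeled set, and the ensemble retrained before the next query round.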
Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings
Generalizing deep neural networks to new target domains is critical to their
real-world utility. In practice, it may be feasible to get some target data
labeled, but to be cost-effective it is desirable to select a
maximally-informative subset via active learning (AL). We study the problem of
AL under a domain shift, called Active Domain Adaptation (Active DA). We
empirically demonstrate how existing AL approaches based solely on model
uncertainty or diversity sampling are suboptimal for Active DA. Our algorithm,
Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings
(ADA-CLUE), i) identifies target instances for labeling that are both uncertain
under the model and diverse in feature space, and ii) leverages the available
source and target data for adaptation by optimizing a semi-supervised
adversarial entropy loss that is complementary to our active sampling
objective. On standard image classification-based domain adaptation benchmarks,
ADA-CLUE consistently outperforms competing active adaptation, active learning,
and domain adaptation methods across domain shifts of varying severity.
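The sampling step of ADA-CLUE, selecting target instances that are both uncertain and diverse, can be sketched as uncertainty-weighted clustering. The details below (embedding source, softmax probabilities, nearest-to-centroid querying) are assumptions in the spirit of the abstract, not the authors' code.

```python
# Minimal sketch of uncertainty-weighted clustering for sample selection:
# cluster feature embeddings with per-point weights given by predictive
# entropy, then query the point nearest each centroid (uncertain AND diverse).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def select_by_weighted_clustering(embeddings, probs, budget):
    # Predictive entropy of the (softmax) class probabilities.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    km = KMeans(n_clusters=budget, n_init=10, random_state=0)
    km.fit(embeddings, sample_weight=entropy)  # uncertain points pull centroids
    return pairwise_distances_argmin(km.cluster_centers_, embeddings)

# Toy usage with random embeddings and probabilities.
rng = np.random.default_rng(1)
emb = rng.normal(size=(300, 16))
logits = rng.normal(size=(300, 5))
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
chosen = select_by_weighted_clustering(emb, p, budget=8)
print(len(chosen))  # 8
```

Weighting the clustering by entropy biases centroids toward uncertain regions, while taking one point per cluster preserves diversity across feature space.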
Minimizing Supervision in Multi-label Categorization
Most images contain objects from multiple categories, so treating the task as
multi-class classification is not justified; we instead treat it as a
multi-label classification problem. In this paper, we aim to minimize the
supervision required for multi-label classification.
Specifically, we investigate an effective class of approaches that associate a
weak localization with each category either in terms of the bounding box or
segmentation mask. Doing so improves the accuracy of multi-label
categorization. The approach we adopt is one of active learning, i.e.,
incrementally selecting a set of samples that need supervision based on the
current model, obtaining supervision for these samples, retraining the model
with the additional set of supervised samples and proceeding again to select
the next set of samples. A crucial concern is the choice of this set of
samples. Here we provide a novel insight: no single measure succeeds in
yielding a consistently improved selection criterion. We therefore propose a
selection criterion that consistently improves over the baseline by pooling
the top-k samples chosen under a varied set of criteria. Using this
criterion, we are able to show that we can retain more than 98% of the fully
supervised performance with just 20% of the samples (and more than 96% with
just 10%) on PASCAL VOC 2007 and 2012. Also, our proposed approach
consistently outperforms all other baseline metrics for all benchmark datasets
and model combinations. Comment: Accepted in CVPR-W 202
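The incremental loop described in this abstract, scoring the unlabeled pool under several criteria, taking the top-k under each, and querying the union, can be sketched as follows. The classifier and the two scoring criteria (entropy and margin) are illustrative placeholders, not the paper's models or measures.

```python
# Hedged sketch of one round of pool-based active learning with top-k
# selection pooled across multiple criteria. Names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy_score(probs):
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def margin_score(probs):
    top2 = np.sort(probs, axis=1)[:, -2:]
    return -(top2[:, 1] - top2[:, 0])  # small margin -> high uncertainty

def active_learning_round(model, X_lab, y_lab, X_pool, k=5):
    model.fit(X_lab, y_lab)
    probs = model.predict_proba(X_pool)
    selected = set()
    for score in (entropy_score, margin_score):
        s = score(probs)
        selected.update(np.argsort(s)[-k:].tolist())  # top-k per criterion
    return sorted(selected)  # union of candidates to send for annotation

# Toy usage: 30 labeled points, 90-point pool.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
picks = active_learning_round(LogisticRegression(), X[:30], y[:30], X[30:], k=5)
print(len(picks))  # between 5 and 10, depending on criterion overlap
```

After the queried samples are labeled, they join the training set and the model is retrained, closing the loop the abstract describes.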