13,616 research outputs found
Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains
There has been increased interest in devising learning techniques that
combine unlabeled data with labeled data ? i.e. semi-supervised learning.
However, to the best of our knowledge, no study has been performed across
various techniques and different types and amounts of labeled and unlabeled
data. Moreover, most of the published work on semi-supervised learning
techniques assumes that the labeled and unlabeled data come from the same
distribution. It is possible for the labeling process to be associated with a
selection bias such that the distributions of data points in the labeled and
unlabeled sets are different. Not correcting for such bias can result in biased
function approximation with potentially poor performance. In this paper, we
present an empirical study of various semi-supervised learning techniques on a
variety of datasets. We attempt to answer various questions such as the effect
of independence or relevance amongst features, the effect of the size of the
labeled and unlabeled sets and the effect of noise. We also investigate the
impact of sample-selection bias on the semi-supervised learning techniques
under study and implement a bivariate probit technique particularly designed to
correct for such bias
Identifying leading indicators of product recalls from online reviews using positive unlabeled learning and domain adaptation
Consumer protection agencies are charged with safeguarding the public from
hazardous products, but the thousands of products under their jurisdiction make
it challenging to identify and respond to consumer complaints quickly. From the
consumer's perspective, online reviews can provide evidence of product defects,
but manually sifting through hundreds of reviews is not always feasible. In
this paper, we propose a system to mine Amazon.com reviews to identify products
that may pose safety or health hazards. Since labeled data for this task are
scarce, our approach combines positive unlabeled learning with domain
adaptation to train a classifier from consumer complaints submitted to the U.S.
Consumer Product Safety Commission. On a validation set of manually annotated
Amazon product reviews, we find that our approach results in an absolute F1
score improvement of 8% over the best competing baseline. Furthermore, we apply
the classifier to Amazon reviews of known recalled products; the classifier
identifies reviews reporting safety hazards prior to the recall date for 45% of
the products. This suggests that the system may be able to provide an early
warning system to alert consumers to hazardous products before an official
recall is announced
- …