On Consistency of Graph-based Semi-supervised Learning
Graph-based semi-supervised learning is one of the most popular methods in
machine learning. Some of its theoretical properties such as bounds for the
generalization error and the convergence of the graph Laplacian regularizer
have been studied in the computer science and statistics literature. However, a
fundamental statistical property, the consistency of the estimator from this
method, has not been proved. In this article, we study the consistency problem
under a non-parametric framework. We prove the consistency of graph-based
learning when the estimated scores are constrained to be equal to the
observed responses for the labeled data. The sample sizes of both labeled and
unlabeled data are allowed to grow in this result. When the estimated scores
are not required to be equal to the observed responses, a tuning parameter is
used to balance the loss function and the graph Laplacian regularizer. We give
a counterexample demonstrating that the estimator for this case can be
inconsistent. The theoretical findings are supported by numerical studies.
Comment: This paper is accepted by 2019 IEEE 39th International Conference on
Distributed Computing Systems (ICDCS
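The two estimators contrasted in this abstract can be sketched as follows. This is a minimal illustration, not the paper's setup: the Gaussian similarity weights, the `bandwidth` value, the Gauss-Seidel sweeps, and every function name are assumptions made for the example. With `lam=None` the labeled scores are clamped to the observed responses (the hard-constraint case); with `lam > 0` a squared loss on the labeled points is balanced against the graph Laplacian regularizer by the tuning parameter.

```python
import math

def gaussian_weights(points, bandwidth):
    """Dense similarity matrix w_ij = exp(-(x_i - x_j)^2 / (2 h^2)), zero diagonal."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = math.exp(-((points[i] - points[j]) ** 2) / (2 * bandwidth ** 2))
    return w

def graph_ssl(points, labels, bandwidth=0.3, lam=None, sweeps=500):
    """Graph-based scores for 1-D points; labels[i] is a float or None (unlabeled).

    lam=None : hard constraint, labeled scores stay equal to the responses.
    lam > 0  : soft constraint, minimizes
               sum over labeled (f_i - y_i)^2 + lam * f^T L f
               by coordinate-wise (Gauss-Seidel) updates.
    """
    w = gaussian_weights(points, bandwidth)
    n = len(points)
    f = [y if y is not None else 0.0 for y in labels]
    for _ in range(sweeps):
        for i in range(n):
            is_labeled = labels[i] is not None
            if is_labeled and lam is None:
                continue  # clamped to the observed response
            neighbor_sum = sum(w[i][j] * f[j] for j in range(n))
            degree = sum(w[i])
            if is_labeled:
                # Stationarity: (f_i - y_i) + lam * (d_i f_i - sum_j w_ij f_j) = 0
                f[i] = (labels[i] + lam * neighbor_sum) / (1.0 + lam * degree)
            else:
                # Unlabeled scores are the weighted average of their neighbors
                f[i] = neighbor_sum / degree
    return f

# Two well-separated groups with one labeled point each (toy data).
xs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
ys = [0.0, None, None, 1.0, None, None]
hard = graph_ssl(xs, ys)            # labeled scores equal the responses
soft = graph_ssl(xs, ys, lam=1.0)   # labeled scores are shrunk toward neighbors
```

In the hard-constraint run the labeled scores are exactly the responses, while in the soft run they are pulled toward their neighbors' scores; the sketch only shows that difference and does not reproduce the paper's counterexample.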
ResumeNet: A Learning-based Framework for Automatic Resume Quality Assessment
Recruitment of appropriate people for certain positions is critical for any
company or organization. Manually screening large numbers of resumes to select
appropriate candidates can be exhausting and time-consuming. However,
there is no public tool that can be directly used for automatic resume quality
assessment (RQA). This motivates us to develop a method for automatic RQA.
Since there is also no public dataset for model training and evaluation, we
build a dataset for RQA by collecting around 10K resumes, which are provided by
a private resume management company. By investigating the dataset, we identify
some factors or features that could be useful to discriminate good resumes from
bad ones, e.g., the consistency between different parts of a resume. Then a
neural-network model is designed to predict the quality of each resume, where
some text processing techniques are incorporated. To deal with the label
deficiency issue in the dataset, we propose several variants of the model by
either utilizing the pair/triplet-based loss, or introducing some
semi-supervised learning technique to make use of the abundant unlabeled data.
Both the presented baseline model and its variants are general and easy to
implement. Various popular criteria including the receiver operating
characteristic (ROC) curve, F-measure and ranking-based average precision (AP)
are adopted for model evaluation. We compare the different variants with our
baseline model. Since there is no public algorithm for RQA, we further compare
our results with those obtained from a website that can score a resume.
Experimental results in terms of different criteria demonstrate the
effectiveness of the proposed method. We foresee that our approach would
transform future human resources management.
Comment: ICD
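The pair- and triplet-based losses mentioned in this abstract are commonly written as margin (hinge) losses. The sketch below shows those standard forms only; the abstract does not give the exact formulation used in ResumeNet, so the function names, the squared Euclidean distance, and the default `margin` are assumptions for illustration.

```python
def pairwise_hinge_loss(score_good, score_bad, margin=1.0):
    """Zero once the better resume outscores the worse one by at least `margin`."""
    return max(0.0, margin - (score_good - score_bad))

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the anchor is closer to the positive than to the negative by `margin`."""
    return max(0.0, margin
               + squared_distance(anchor, positive)
               - squared_distance(anchor, negative))

# Correctly ordered pair: no penalty. Reversed pair: penalized.
ok_pair = pairwise_hinge_loss(2.0, 0.5)
bad_pair = pairwise_hinge_loss(0.5, 2.0)
# Anchor near the positive and far from the negative: no penalty.
ok_triplet = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
bad_triplet = triplet_loss([0.0, 0.0], [3.0, 0.0], [0.0, 1.0])
```

Losses of this kind need only relative judgments (which resume is better, or which two are alike), which is one reason they are often used when absolute quality labels are scarce.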
A bagging SVM to learn from positive and unlabeled examples
We consider the problem of learning a binary classifier from a training set
of positive and unlabeled examples, both in the inductive and in the
transductive setting. This problem, often referred to as PU learning,
differs from the standard supervised classification problem by the lack of
negative examples in the training set. It corresponds to a ubiquitous
situation in many applications such as information retrieval or gene ranking,
when we have identified a set of data of interest sharing a particular
property, and we wish to automatically retrieve additional data sharing the
same property among a large and easily available pool of unlabeled data. We
propose a conceptually simple method, akin to bagging, to approach both
inductive and transductive PU learning problems, by converting them into a series
of supervised binary classification problems discriminating the known positive
examples from random subsamples of the unlabeled set. We empirically
demonstrate the relevance of the method on simulated and real data, where it
performs at least as well as existing methods while being faster.
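The bagging idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: a nearest-centroid scorer stands in for the SVM to keep the example dependency-free, scores are averaged only over rounds in which a point was left out of the subsample, and all names (`bagged_pu_scores`, `subsample_size`, `rounds`) are hypothetical.

```python
import random

def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(r[k] for r in rows) / len(rows) for k in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_centroid_scorer(pos, neg):
    """Stand-in base classifier (NOT an SVM): higher score = closer to positives."""
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: sq_dist(x, cn) - sq_dist(x, cp)

def bagged_pu_scores(positives, unlabeled, subsample_size, rounds, seed=0):
    """Average, for each unlabeled point, the scores from rounds where it was
    not part of the random subsample treated as negatives."""
    rng = random.Random(seed)
    indices = list(range(len(unlabeled)))
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(rounds):
        chosen = set(rng.sample(indices, subsample_size))
        scorer = fit_centroid_scorer(positives, [unlabeled[i] for i in chosen])
        for i in indices:
            if i not in chosen:
                totals[i] += scorer(unlabeled[i])
                counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

# Toy data: known positives near (2, 2); the unlabeled pool hides two
# positives (first two rows) among six points near the origin.
positives = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1], [2.2, 2.0]]
unlabeled = [[2.0, 2.1], [1.8, 2.0],
             [0.0, 0.0], [0.1, -0.1], [-0.2, 0.1],
             [0.0, 0.2], [0.1, 0.1], [-0.1, -0.1]]
scores = bagged_pu_scores(positives, unlabeled, subsample_size=3, rounds=50)
```

Ranking the unlabeled points by `scores` gives a transductive output; an inductive variant would instead keep the per-round scorers and average their predictions on new points.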
Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains
There has been increased interest in devising learning techniques that
combine unlabeled data with labeled data, i.e., semi-supervised learning.
However, to the best of our knowledge, no study has been performed across
various techniques and different types and amounts of labeled and unlabeled
data. Moreover, most of the published work on semi-supervised learning
techniques assumes that the labeled and unlabeled data come from the same
distribution. It is possible for the labeling process to be associated with a
selection bias such that the distributions of data points in the labeled and
unlabeled sets are different. Not correcting for such bias can result in biased
function approximation with potentially poor performance. In this paper, we
present an empirical study of various semi-supervised learning techniques on a
variety of datasets. We attempt to answer various questions such as the effect
of independence or relevance amongst features, the effect of the size of the
labeled and unlabeled sets and the effect of noise. We also investigate the
impact of sample-selection bias on the semi-supervised learning techniques
under study and implement a bivariate probit technique particularly designed to
correct for such bias.