Mixed-Integer Quadratic Optimization and Iterative Clustering Techniques for Semi-Supervised Support Vector Machines
Among the most famous algorithms for solving classification problems are
support vector machines (SVMs), which find a separating hyperplane for a set of
labeled data points. In some applications, however, labels are only available
for a subset of points. Furthermore, this subset can be non-representative,
e.g., due to self-selection in a survey. Semi-supervised SVMs tackle the
setting of labeled and unlabeled data and can often improve the reliability of
the results. Moreover, additional information about the size of the classes can
be available from undisclosed sources. We propose a mixed-integer quadratic
optimization (MIQP) model that covers the setting of labeled and unlabeled data
points as well as the overall number of points in each class. Since the MIQP's
solution time rapidly grows as the number of variables increases, we introduce
an iterative clustering approach to reduce the model's size. Moreover, we
present an update rule for the required big-M values, prove the correctness
of the iterative clustering method, and derive tailored
dimension-reduction and warm-starting techniques. Our numerical results show
that our approach achieves accuracy and precision similar to the MIQP
formulation but at much lower computational cost. Thus, we can solve
larger problems. With respect to the original SVM formulation, we observe that
our approach has even better accuracy and precision for biased samples.
Comment: 33 pages, 18 figures
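For orientation, a big-M model of this kind can be sketched as follows. This is a generic textbook-style formulation written under our own assumptions, not the model from the paper: the symbols $C$, $\xi_i$, $z_j$, $M$ and $\tau$ are illustrative, with $\tau$ standing for the known number of unlabeled points in the positive class.

```latex
\begin{align*}
\min_{w,\,b,\,\xi,\,z}\quad & \tfrac{1}{2}\lVert w\rVert^2 + C \sum_{i \in L} \xi_i \\
\text{s.t.}\quad
  & y_i \,(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, && i \in L \ \text{(labeled points)} \\
  & w^\top x_j + b \le M z_j, && j \in U \ \text{(unlabeled points)} \\
  & w^\top x_j + b \ge -M (1 - z_j), && j \in U \\
  & \sum_{j \in U} z_j = \tau, \quad z_j \in \{0, 1\}, && j \in U
\end{align*}
```

With $z_j = 1$ the first big-M constraint is inactive and the point is forced onto the nonnegative side of the hyperplane; with $z_j = 0$ the roles swap. The constant $M$ must be large enough to bound $\lvert w^\top x_j + b \rvert$, which is why an update rule for its value matters in practice.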
Applicability of semi-supervised learning assumptions for gene ontology terms prediction
Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task for bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods arise as a powerful solution that exploits the information contained in unlabelled data in order to improve the estimations of traditional supervised approaches. However, semi-supervised learning methods have to make strong assumptions about the nature of the training data, and thus the performance of the predictor is highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, focused on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform the traditional supervised methods and that the highest performance is reached when applying the cluster assumption. In addition, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be correctly predicted under each assumption.
Postprint (published version)
Machine Learning Methods for Attack Detection in the Smart Grid
Attack detection problems in the smart grid are posed as statistical learning
problems for different attack scenarios in which the measurements are observed
in batch or online settings. In this approach, machine learning algorithms are
used to classify measurements as being either secure or attacked. An attack
detection framework is provided to exploit any available prior knowledge about
the system and surmount constraints arising from the sparse structure of the
problem in the proposed approach. Well-known batch and online learning
algorithms (supervised and semi-supervised) are employed with decision and
feature level fusion to model the attack detection problem. The relationships
between statistical and geometric properties of attack vectors employed in the
attack scenarios and learning algorithms are analyzed to detect unobservable
attacks using statistical learning methods. The proposed algorithms are
examined on various IEEE test systems. Experimental analyses show that machine
learning algorithms can detect attacks with performances higher than the attack
detection algorithms which employ state vector estimation methods in the
proposed attack detection framework.
Comment: 14 pages, 11 figures
Deep Generative Models for Reject Inference in Credit Scoring
Credit scoring models based on accepted applications may be biased and their
consequences can have a statistical and economic impact. Reject inference is
the process of attempting to infer the creditworthiness status of the rejected
applications. In this research, we use deep generative models to develop two
new semi-supervised Bayesian models for reject inference in credit scoring, in
which we model the data generating process to be dependent on a Gaussian
mixture. The goal is to improve the classification accuracy in credit scoring
models by adding rejected applications. Our proposed models infer the unknown
creditworthiness of the rejected applications by exact enumeration of the two
possible outcomes of the loan (default or non-default). The efficient
stochastic gradient optimization technique used in deep generative models makes
our models suitable for large data sets. Finally, the experiments in this
research show that our proposed models perform better than classical and
alternative machine learning models for reject inference in credit scoring.
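The idea of enumerating the two possible outcomes can be illustrated with a toy example. The sketch below is not the authors' deep generative model: it assumes one-dimensional Gaussian class-conditional densities, and the names `log_marginal_unlabeled`, `posterior_default`, `prior_default` and `params` are hypothetical. It only shows how an unlabeled (rejected) application contributes to the likelihood by summing over both labels.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian (toy stand-in for a learned decoder)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_marginal_unlabeled(x, prior_default, params):
    """log p(x) = log sum_y p(y) p(x | y), enumerating y in {0, 1}.

    params maps each label (0 = non-default, 1 = default) to (mean, std).
    """
    log_terms = []
    for y, prior in ((0, 1.0 - prior_default), (1, prior_default)):
        mean, std = params[y]
        log_terms.append(math.log(prior) + gaussian_logpdf(x, mean, std))
    # Log-sum-exp for numerical stability.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def posterior_default(x, prior_default, params):
    """p(y = 1 | x), obtained from the same two-term enumeration."""
    mean, std = params[1]
    log_joint = math.log(prior_default) + gaussian_logpdf(x, mean, std)
    return math.exp(log_joint - log_marginal_unlabeled(x, prior_default, params))
```

Because there are only two outcomes, the sum is exact and cheap, so no sampling over the missing label is needed.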
Extension of TSVM to Multi-Class and Hierarchical Text Classification Problems With General Losses
Transductive SVM (TSVM) is a well-known semi-supervised large margin learning
method for binary text classification. In this paper we extend this method to
multi-class and hierarchical classification problems. We point out that the
determination of labels of unlabeled examples with fixed classifier weights is
a linear programming problem. We devise an efficient technique for solving it.
The method is applicable to general loss functions. We demonstrate the value of
the new method using large margin loss on a number of multi-class and
hierarchical classification datasets. For maxent loss we show empirically that
our method is better than expectation regularization/constraint and posterior
regularization methods, and competitive with the version of entropy
regularization method which uses label constraints.
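The label-determination step with fixed classifier weights can be illustrated in its simplest form. The sketch below covers only the binary case with hinge loss and assumes a constraint that exactly `num_positive` unlabeled examples receive the label +1; both the constraint and the function names are our assumptions, not the paper's multi-class technique. Under that assumption the linear program has an integral optimum that can be found by sorting.

```python
def hinge(margin):
    """Hinge loss for a signed margin y * f(x)."""
    return max(0.0, 1.0 - margin)


def assign_labels(scores, num_positive):
    """Pick labels in {-1, +1} for unlabeled points with fixed scores f(x).

    Minimizes the total hinge loss subject to exactly `num_positive`
    points being labeled +1 (an assumed class-size constraint).
    """
    if not 0 <= num_positive <= len(scores):
        raise ValueError("num_positive must be between 0 and len(scores)")
    # Cost of labeling point i as +1 minus the cost of labeling it -1.
    extra_cost = [hinge(s) - hinge(-s) for s in scores]
    # The cheapest points to flip to +1 come first.
    order = sorted(range(len(scores)), key=lambda i: extra_cost[i])
    labels = [-1] * len(scores)
    for i in order[:num_positive]:
        labels[i] = 1
    return labels
```

For example, `assign_labels([2.0, 0.3, -0.1, -1.5], 2)` labels the two highest-scoring points as positive. With more than two classes or general losses, the sorting shortcut no longer applies and a proper LP or assignment solver would be needed.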