335 research outputs found
Active Learning with Semi-Supervised Support Vector Machines
A significant problem in many machine learning tasks is that gathering the labeled data needed to train a learning algorithm to a reasonable level of performance is time consuming and costly. In practice, a small amount of labeled data is often available, and more unlabeled data can be labeled on demand at a cost. If the labeled data is obtained by a process outside the control of the learner, the learner is passive. If the learner picks the data to be labeled, this becomes active learning, which has the advantage that the learner can pick data to gain specific information that will speed up the learning process. Support Vector Machines (SVMs) have many properties that make them attractive as a learning algorithm for many real-world applications, including classification tasks. Some researchers have proposed algorithms for active learning with SVMs, i.e., algorithms for choosing the next unlabeled instance to label. Their approach is supervised in nature, since it does not consider all unlabeled instances when looking for the next instance. In this thesis, we propose three new algorithms for applying active learning to SVMs in a semi-supervised setting, which takes advantage of the presence of all unlabeled points. By reducing the number of experiments needed, the suggested approaches might yield considerable savings in costly classification problems where obtaining training data for a classifier is expensive
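As a rough illustration of what "choosing the next unlabeled instance" can mean, the sketch below implements a common margin-based heuristic: query the unlabeled point closest to the current separating hyperplane. This is the kind of supervised selection rule the abstract contrasts itself with, not one of the thesis's three semi-supervised algorithms, which the abstract does not describe. The function names and the toy hyperplane are assumptions made for illustration only.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm

def select_query(w, b, unlabeled):
    """Index of the unlabeled point the current classifier is least sure about.

    Hypothetical helper: assumes a linear SVM has already been trained on the
    labeled data and is described by the weights w and the bias b.
    """
    return min(
        range(len(unlabeled)),
        key=lambda i: distance_to_hyperplane(w, b, unlabeled[i]),
    )

# Toy example (assumed values): hyperplane x1 - x2 = 0 in two dimensions.
w, b = [1.0, -1.0], 0.0
pool = [[3.0, 0.0], [0.5, 0.4], [-2.0, 1.0]]
query_index = select_query(w, b, pool)  # index of the point nearest the boundary
```

In an active learning loop, the selected point would be sent for labeling, added to the labeled set, and the SVM retrained before the next query.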
Optimization meets Machine Learning: An Exact Algorithm for Semi-Supervised Support Vector Machines
Support vector machines (SVMs) are well-studied supervised learning models
for binary classification. In many applications, large amounts of samples can
be cheaply and easily obtained. What is often a costly and error-prone process
is to manually label these instances. Semi-supervised support vector machines
(S3VMs) extend the well-known SVM classifiers to the semi-supervised approach,
aiming at maximizing the margin between samples in the presence of unlabeled
data. By leveraging both labeled and unlabeled data, S3VMs attempt to achieve
better accuracy and robustness compared to traditional SVMs. Unfortunately, the
resulting optimization problem is non-convex and hence difficult to solve
exactly. In this paper, we present a new branch-and-cut approach for S3VMs
using semidefinite programming (SDP) relaxations. We apply optimality-based
bound tightening to bound the feasible set. Box constraints allow us to include
valid inequalities, strengthening the lower bound. The resulting SDP relaxation
provides bounds significantly stronger than the ones available in the
literature. For the upper bound, instead, we define a local search exploiting
the solution of the SDP relaxation. Computational results highlight the
efficiency of the algorithm, showing its capability to solve instances with a
number of data points 10 times larger than the ones solved in the literature
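For orientation, one common textbook way to write the S3VM problem is sketched below. It is an assumption for illustration rather than the exact model of this paper, which may use a different loss or constraints. Here $L$ and $U$ denote the labeled and unlabeled index sets, and $C_l$, $C_u$ are penalty parameters; all of these symbols are introduced here and do not appear in the abstract.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w \rVert^2
  + C_l \sum_{i \in L} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)
  + C_u \sum_{j \in U} \max\bigl(0,\; 1 - \lvert w^\top x_j + b \rvert\bigr)
```

The last term penalizes unlabeled points that fall inside the margin. Because it depends on $\lvert w^\top x_j + b \rvert$, the objective is non-convex, which is the difficulty the abstract refers to.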
Feature-Based Sentiment Analysis of Online Reviews Using Semi-supervised Support Vector Machines (S3VMs)
Online review sites let internet users post reviews about a given subject. The sentiment contained in a collection of reviews about a product is useful and influences the decisions of individuals and organizations. Within a single opinion, a reviewer can give both positive and negative assessments. This is because the target of an opinion is often not the product as a whole but a part of the product, called a feature, which has strengths and weaknesses in the reviewer's view. In this final project, research is carried out to identify the sentiment of opinions on mobile phone products based on their product features. The opinion data used in this final project is in English and was taken from the site www.cnet.com. Accordingly, two processes are carried out in this final project: (1) extraction of product features from opinions, and (2) identification of the sentiment for each product feature. Feature extraction is performed by searching for phrases that match dependency relation templates, followed by feature filtering. For sentiment identification, the positive and negative probability values, together with the target class labels from data preparation, serve as the input parameters of the S3VMs classifier. In the experiments with S3VMs, some of the data is treated as unlabeled data. The evaluation of sentiment identification yields an F1-measure of 86% for the positive class and 70% for the negative class, while feature identification achieves an accuracy of 82%. Keywords: reviews, sentiment, product features, S3VMs, feature-based opinio
Mixed-Integer Quadratic Optimization and Iterative Clustering Techniques for Semi-Supervised Support Vector Machines
Among the most famous algorithms for solving classification problems are
support vector machines (SVMs), which find a separating hyperplane for a set of
labeled data points. In some applications, however, labels are only available
for a subset of points. Furthermore, this subset can be non-representative,
e.g., due to self-selection in a survey. Semi-supervised SVMs tackle the
setting of labeled and unlabeled data and can often improve the reliability of
the results. Moreover, additional information about the size of the classes can
be available from undisclosed sources. We propose a mixed-integer quadratic
optimization (MIQP) model that covers the setting of labeled and unlabeled data
points as well as the overall number of points in each class. Since the MIQP's
solution time rapidly grows as the number of variables increases, we introduce
an iterative clustering approach to reduce the model's size. Moreover, we
present an update rule for the required big-M values, prove the correctness
of the iterative clustering method, and derive tailored
dimension-reduction and warm-starting techniques. Our numerical results show
that our approach achieves accuracy and precision similar to those of the MIQP
formulation but at much lower computational cost. Thus, we can solve
larger problems. With respect to the original SVM formulation, we observe that
our approach has even better accuracy and precision for biased samples. Comment: 33 pages, 18 figure
Detecting genuine multipartite entanglement via machine learning
In recent years, supervised and semi-supervised machine learning methods such
as neural networks, support vector machines (SVM), and semi-supervised support
vector machines (S4VM) have been widely used in quantum entanglement and
quantum steering verification problems. However, few studies have focused on
detecting genuine multipartite entanglement based on machine learning. Here, we
investigate supervised and semi-supervised machine learning for detecting
genuine multipartite entanglement of three-qubit states. We randomly generate
three-qubit density matrices, and train an SVM for the detection of genuine
multipartite entangled states. Moreover, we improve the training method of
S4VM, which optimizes the grouping of prediction samples and then performs
iterative predictions. Through numerical simulation, it is confirmed that this
method can significantly improve the prediction accuracy. Comment: 9 pages, 8 figure
Applicability of semi-supervised learning assumptions for gene ontology terms prediction
Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task in bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods arise as a powerful solution that exploits the information contained in unlabelled data in order to improve the estimates of traditional supervised approaches. However, semi-supervised learning methods have to make strong assumptions about the nature of the training data, and thus the performance of the predictor is highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, focused on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform the traditional supervised methods and that the highest performance is reached when applying the cluster assumption. In addition, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be correctly predicted under each assumption. Postprint (published version
Deep Generative Models for Reject Inference in Credit Scoring
Credit scoring models based on accepted applications may be biased and their
consequences can have a statistical and economic impact. Reject inference is
the process of attempting to infer the creditworthiness status of the rejected
applications. In this research, we use deep generative models to develop two
new semi-supervised Bayesian models for reject inference in credit scoring, in
which we model the data generating process to be dependent on a Gaussian
mixture. The goal is to improve the classification accuracy in credit scoring
models by adding rejected applications. Our proposed models infer the unknown
creditworthiness of the rejected applications by exact enumeration of the two
possible outcomes of the loan (default or non-default). The efficient
stochastic gradient optimization technique used in deep generative models makes
our models suitable for large data sets. Finally, the experiments in this
research show that our proposed models perform better than classical and
alternative machine learning models for reject inference in credit scoring