Population-Guided Large Margin Classifier for High-Dimension Low-Sample-Size Problems
Various applications in different fields, such as gene expression analysis and
computer vision, suffer from high-dimensional low-sample-size (HDLSS) data
sets, which pose significant challenges for standard statistical and modern
machine learning methods. In this paper, we propose a novel linear binary
classifier, the population-guided large margin classifier (PGLMC), which is
applicable to any sort of data, including HDLSS. PGLMC is built around a
projection direction w obtained by jointly considering the local structural
information of the hyperplane and the statistics of the training samples. Our
proposed model has several advantages over widely used approaches. First, it is
not sensitive to the intercept term b. Second, it handles imbalanced data well.
Third, it is relatively simple to implement via quadratic programming. Fourth,
it is robust to model specification across various real applications. The
theoretical properties of PGLMC are proven. We conduct a series of evaluations
on two simulated and six real-world benchmark data sets, covering DNA
classification, digit recognition, medical image analysis, and face
recognition. PGLMC outperforms state-of-the-art classification methods in most
cases, or at least obtains comparable results.
Comment: 48 pages
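The abstract does not spell out the PGLMC objective, so the following is only a
minimal sketch of what "simple to implement via quadratic programming" usually
looks like for a linear large-margin classifier: a generic soft-margin QP
solved with cvxpy. The plain SVM-style objective and the function name are my
assumptions, not the paper's model.

```python
import cvxpy as cp

def qp_margin_classifier(X, y, C=1.0):
    # Generic soft-margin linear classifier posed as a QP.
    # NOT the PGLMC objective (not given in the abstract); this only
    # illustrates the quadratic-programming template such methods use.
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)          # slack variables
    margin = cp.multiply(y, X @ w + b)        # y_i * (w^T x_i + b)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, [margin >= 1 - xi]).solve()
    return w.value, b.value
```

Any QP solver can handle a problem of this shape, which is what makes such
classifiers easy to implement compared to more exotic conic programs.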
Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS
Distance weighted discrimination (DWD) is a margin-based classifier with an
interesting geometric motivation. DWD was originally proposed as a superior
alternative to the support vector machine (SVM); however, DWD has yet to become
as popular as the SVM. The main reasons are twofold. First, the
state-of-the-art algorithm for solving DWD is based on second-order cone
programming (SOCP), while the SVM is a quadratic programming problem that is
much more efficient to solve. Second, the current statistical theory of DWD
mainly focuses on the linear DWD for the high-dimension-low-sample-size setting
and data-piling, while the learning theory for the SVM mainly focuses on the
Bayes risk consistency of the kernel SVM. In fact, the Bayes risk consistency
of DWD is presented as an open problem in the original DWD paper. In this work,
we advance the current understanding of DWD from both computational and
theoretical perspectives. We propose a novel efficient algorithm for solving
DWD, and our algorithm can be several hundred times faster than the existing
state-of-the-art algorithm based on the SOCP. In addition, our algorithm can
handle the generalized DWD, while the SOCP algorithm works well only for a
special case of DWD, not the generalized version. Furthermore, we consider a natural
kernel DWD in a reproducing kernel Hilbert space and then establish the Bayes
risk consistency of the kernel DWD. We compare DWD and the SVM on several
benchmark data sets and show that the two have comparable classification
accuracy, but DWD equipped with our new algorithm can be much faster to compute
than the SVM.
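For reference, the standard DWD optimization problem, as I recall it from
Marron et al.'s original DWD paper, can be written as follows; the generalized
DWD mentioned above replaces $1/r_i$ with $1/r_i^q$ for an exponent $q > 0$:

\[
\min_{w,\,b,\,\xi}\ \sum_{i=1}^{n}\Big(\frac{1}{r_i} + C\,\xi_i\Big)
\quad\text{s.t.}\quad r_i = y_i\,(w^\top x_i + b) + \xi_i,\ \ r_i > 0,\ \ \xi_i \ge 0,\ \ \|w\|_2 \le 1.
\]

The sum of reciprocal margins is SOCP-representable, which is why the original
solver was SOCP-based and why a faster non-conic algorithm is a meaningful
contribution.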
Weighted second-order cone programming twin support vector machine for imbalanced data classification
We propose a weighted second-order cone programming twin support vector machine
(WSOCP-TWSVM) for imbalanced data classification. The method first applies a
graph-based under-sampling step to remove outliers and discard dispensable
majority-class samples. Then, appropriate weights are set in order to decrease
the impact of the majority class and increase the effect of the minority class
in the classifier's optimization formula, as sketched below. These weights are
embedded in the optimization problem of the second-order cone programming
(SOCP) twin support vector machine formulation. The method is tested on
standard datasets and its performance compared to previous methods; the
experimental results confirm the feasibility and efficiency of the proposed
method.
Comment: This manuscript is under revision at Pattern Recognition Letters
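A common way to realize the weighting principle the abstract describes is to
scale each sample's slack penalty inversely to its class frequency. The sketch
below shows that generic scheme only; the actual WSOCP-TWSVM weights may be
chosen differently.

```python
import numpy as np

def inverse_frequency_weights(y):
    # Per-sample weights inversely proportional to class frequency:
    # down-weights the majority class, up-weights the minority class.
    # Illustrative only; not necessarily the paper's exact weighting.
    classes, counts = np.unique(y, return_counts=True)
    freq = dict(zip(classes, counts / len(y)))
    return np.array([1.0 / freq[label] for label in y])
```

These weights would then multiply the slack terms in the twin-SVM objectives,
so that minority-class violations cost more than majority-class ones.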
Optimal arrangements of hyperplanes for multiclass classification
In this paper, we present a novel approach to constructing multiclass
classifiers by means of arrangements of hyperplanes. We propose different
mixed-integer (linear and nonlinear) programming formulations for the problem,
using extensions of widely used measures of misclassified observations, to
which the \textit{kernel trick} can be adapted. Dimensionality-reduction and
variable-fixing strategies are also developed for these models. An extensive
battery of experiments reveals the power of our proposal compared with
previously proposed methodologies.
Comment: 8 figures, 2 tables
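The abstract leaves the formulations to the paper; as a hedged illustration of
the standard building block such mixed-integer models extend, the
misclassification count for a single separating hyperplane can be modeled with
binary indicators and a big-M constant $M$:

\[
\min_{w,\,b,\,z}\ \sum_{i=1}^{n} z_i
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - M z_i,\qquad z_i \in \{0,1\}.
\]

Here $z_i = 1$ relaxes the margin constraint for a misclassified observation.
Arrangements of several hyperplanes then require additional variables assigning
observations to cells of the arrangement; those details are in the paper.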
Sparse Distance Weighted Discrimination
Distance weighted discrimination (DWD) was originally proposed to handle the
data piling issue in the support vector machine. In this paper, we consider the
sparse penalized DWD for high-dimensional classification. The state-of-the-art
algorithm for solving the standard DWD is based on second-order cone
programming; however, such an algorithm does not work well for the sparse
penalized DWD with high-dimensional data. To overcome this computational
difficulty, we develop a very efficient algorithm to compute the solution path
of the sparse DWD on a fine grid of regularization parameters. We implement the
algorithm in the publicly available R package sdwd, and we conduct extensive
numerical experiments to demonstrate the computational efficiency and
classification performance of our method.
Comment: 16 pages
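As best I recall from the sparse DWD literature, the elastic-net penalized DWD
minimizes an average DWD loss plus lasso and ridge penalties; treat the exact
constants here as my assumption rather than the paper's statement:

\[
\min_{w,\,b}\ \frac{1}{n}\sum_{i=1}^{n} V\big(y_i(w^\top x_i + b)\big)
+ \lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\|w\|_2^2,
\qquad
V(u) =
\begin{cases}
1 - u, & u \le \tfrac12,\\[2pt]
\dfrac{1}{4u}, & u > \tfrac12.
\end{cases}
\]

The loss $V$ is differentiable everywhere (the two branches match in value and
slope at $u = \tfrac12$), which is part of what makes a fast solution-path
algorithm tractable where an SOCP solver is not.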
Transductive Optimization of Top k Precision
Consider a binary classification problem in which the learner is given a
labeled training set and an unlabeled test set, and is restricted to choosing
exactly k test points to output as positive predictions. Problems of this
kind---{\it transductive precision@k}---arise in information retrieval,
digital advertising, and reserve design for endangered species. Previous
methods separate the training of the model from its use in scoring the test
points. This paper introduces a new approach, Transductive Top K (TTK), that
seeks to minimize the hinge loss over all training instances under the
constraint that exactly k test instances are predicted as positive. The paper
presents two optimization methods for this challenging problem. Experiments and
analysis confirm the importance of incorporating the knowledge of k into the
learning process. Experimental evaluations of the TTK approach show that the
performance of TTK matches or exceeds existing state-of-the-art methods on 7
UCI datasets and 3 reserve design problem instances.
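For contrast with TTK's constrained training, here is a hedged sketch of the
baseline the abstract describes: train any scorer separately, then label
exactly the k highest-scoring test points positive. The function name and
interface are mine.

```python
import numpy as np

def top_k_predictions(scores, k):
    # Baseline "score then threshold" approach: output exactly k test
    # points with the highest scores as positive predictions.
    idx = np.argsort(scores)[::-1][:k]
    y_pred = np.zeros(len(scores), dtype=int)
    y_pred[idx] = 1
    return y_pred
```

TTK instead bakes the exactly-k constraint into the loss minimization itself,
which is what the paper's two optimization methods address.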
Flexible High-dimensional Classification Machines and Their Asymptotic Properties
Classification is an important topic in statistics and machine learning with
great potential in many real applications. In this paper, we investigate two
popular large margin classification methods, Support Vector Machine (SVM) and
Distance Weighted Discrimination (DWD), in two contexts: high-dimensional
low-sample-size data and imbalanced data. A unified family of classification
machines, the FLexible Assortment MachinE (FLAME), is proposed, within which
DWD and SVM are special cases. The FLAME family helps to identify the
similarities and differences between SVM and DWD. It is well known that many
classifiers overfit the data in the high-dimensional setting, while others are
sensitive to imbalanced data; that is, the class with the larger sample size
overly influences the classifier and pushes the decision boundary towards the
minority class. SVM is resistant to the imbalanced data issue, but it overfits
high-dimensional data sets, exhibiting the undesired data-piling phenomenon.
The DWD method was proposed to improve on SVM in the high-dimensional setting,
but its decision boundary is sensitive to the ratio of the class sample sizes.
Our FLAME family helps to explain the intrinsic connection between SVM and DWD,
and improves both methods by providing a better trade-off between sensitivity
to imbalanced data and overfitting of high-dimensional data. Several asymptotic
properties of the FLAME classifiers are studied. Simulations and real data
applications illustrate the usefulness of the FLAME classifiers.
Comment: 49 pages, 11 figures
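Since the abstract states that SVM and DWD are special cases of FLAME, the two
endpoint surrogate losses are worth seeing side by side. This small sketch
evaluates both on the functional margin u = y * f(x); how FLAME interpolates
between them is parameterized in the paper, not reproduced here.

```python
import numpy as np

def hinge_loss(u):
    # SVM surrogate loss on the functional margin u = y * f(x).
    return np.maximum(0.0, 1.0 - u)

def dwd_loss(u):
    # Standard DWD surrogate loss; linear for u <= 1/2, then 1/(4u).
    # np.maximum in the divisor avoids division by zero on the unused branch.
    u = np.asarray(u, dtype=float)
    return np.where(u <= 0.5, 1.0 - u, 1.0 / (4.0 * np.maximum(u, 0.5)))

# e.g. hinge_loss(np.linspace(-1, 3, 9)) vs dwd_loss(np.linspace(-1, 3, 9))
```

The hinge loss is exactly zero beyond u = 1, while the DWD loss decays but
never vanishes, which is one intuition for their different behavior on
imbalanced and high-dimensional data.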
Gaussian Robust Classification
Supervised learning is all about the ability to generalize knowledge.
Specifically, the goal of learning is to train a classifier on training data in
such a way that it will be capable of classifying new, unseen data correctly.
To achieve this goal, it is important to carefully design the learner so that
it does not overfit the training data. The latter is usually done by adding a
regularization term. Statistical learning theory explains the success of this
method by claiming that it restricts the complexity of the learned model. This
explanation, however, is rather abstract and lacks geometric intuition. The
generalization error of a classifier may be thought of as correlated with its
robustness to perturbations of the data: a classifier that copes with
disturbance is expected to generalize well. Indeed, Xu et al. [2009] have shown
that the SVM formulation is equivalent to a robust optimization (RO)
formulation, in which an adversary displaces the training and testing points
within a ball of pre-determined radius. In this work we explore a different
kind of robustness, namely replacing each data point with a Gaussian cloud
centered at the sample; the loss is evaluated as the expectation of an
underlying loss function over the cloud. This setup reflects the fact that in
many applications the data are sampled along with noise. We develop an RO
framework in which the adversary chooses the covariance of the noise. In our
algorithm, named GURU, the tuning parameter is a spectral bound on the noise,
so it can be estimated using physical or application-specific considerations.
Our experiments show that this framework performs as well as SVM, and even
slightly better in some cases. Generalizations for Mercer kernels and for the
multiclass case are presented as well. We also show that our framework may be
further generalized using the technique of convex perspective functions.
Comment: Master's dissertation of the first author, carried out under the
supervision of the second author
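To make the "Gaussian cloud" loss concrete, here is a hedged Monte-Carlo sketch
of the expected hinge loss under Gaussian perturbation of a single sample. GURU
itself optimizes a robust formulation over the noise covariance rather than
sampling, so this is purely illustrative; the function name and interface are
mine.

```python
import numpy as np

def expected_hinge_under_gaussian(w, b, x, y, cov, n_samples=10_000, seed=0):
    # Monte-Carlo estimate of E[hinge(y * (w^T z + b))] for z ~ N(x, cov):
    # the data point x is replaced by a Gaussian cloud centered at x and
    # the loss is averaged over that cloud.
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(mean=x, cov=cov, size=n_samples)
    margins = y * (z @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins))
```

In the robust-optimization view, the adversary picks cov subject to a spectral
bound, which is the single tuning parameter the abstract mentions.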
A Summary Of The Kernel Matrix, And How To Learn It Effectively Using Semidefinite Programming
Kernel-based learning algorithms are widely used in machine learning for
problems that make use of the similarity between object pairs. Such algorithms
first embed all data points into an alternative space in which the inner
product between object pairs specifies their distance. Applying kernel methods
to partially labeled datasets is a classical challenge in this regard,
requiring that the distances between unlabeled pairs somehow be learnt using
the labeled data. In this independent study, I summarize G. Lanckriet et al.'s
work on "Learning the Kernel Matrix with Semidefinite Programming", as used in
support vector machine (SVM) algorithms for the transduction problem.
Throughout the report, I provide alternative explanations, derivations, and
analyses related to this work, designed to ease the understanding of the
original article.
Comment: Independent study / summary, 20 pages total (14 pages main body, 6
pages appendices and proofs)
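One of the SDP variants in Lanckriet et al. learns the weights of a linear
combination of base kernels. The sketch below maximizes alignment with the
label matrix under a trace constraint using cvxpy; the alignment objective is
just one of the criteria in their paper (not the full soft-margin SDP), and the
function interface is my own.

```python
import cvxpy as cp
import numpy as np

def learn_kernel_weights(K_list, y):
    # Learn mu for K = sum_i mu_i K_i by maximizing the alignment
    # <K, y y^T> subject to trace(K) = 1 and K positive semidefinite.
    # K_list: symmetric PSD base kernel matrices (n x n); y: labels in {-1, +1}.
    m = len(K_list)
    mu = cp.Variable(m)
    K = sum(mu[i] * K_list[i] for i in range(m))
    yyT = np.outer(y, y)
    constraints = [K >> 0, cp.trace(K) == 1]   # PSD and trace normalization
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(K, yyT))), constraints).solve()
    return mu.value
```

Allowing negative mu while constraining K to remain PSD is exactly what turns
this into a semidefinite program rather than a simple linear one.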
Fast SVM-based Feature Elimination Utilizing Data Radius, Hard-Margin, Soft-Margin
Margin maximization in the hard-margin sense, proposed as a feature elimination
criterion by the MFE-LO method, is combined here with utilization of the data
radius, with the further aim of lowering generalization error: several
published bounds and bound-related formulations pertaining to misclassification
risk involve the radius, e.g. the product of the squared radius and the squared
norm of the weight vector. Additionally, we propose novel feature elimination
criteria that are instead in the soft-margin sense yet can also utilize the
data radius, building on previously published bound-related formulations that
approximate the radius in the soft-margin setting, where e.g. the focus was on
the principle that "finding a bound whose minima are in a region with small
leave-one-out values may be more important than its tightness". These
additional criteria combine radius utilization with a novel, computationally
low-cost soft-margin light classifier retraining approach we devise, named QP1;
QP1 is the soft-margin alternative to the hard-margin LO. We correct an error
in the MFE-LO description, find that MFE-LO achieves the highest generalization
accuracy among the previously published margin-based feature elimination (MFE)
methods, discuss some limitations of MFE-LO, and find that our novel methods
outperform MFE-LO, attaining lower test set classification error rates. Our
novel methods give promising results both on several datasets that have a large
number of features and fall into the `large features, few samples' category,
and on datasets with a low-to-intermediate number of features. In particular,
among our methods, the tunable ones, which do not employ the (non-tunable) LO
approach, can be tuned more aggressively in the future than they are here, to
demonstrate even higher performance.
Comment: Incomplete but good, again. Relative to the Apr 28 version, made a few
miscellaneous text and notation improvements, including typo corrections,
mostly in the Appendix, but it is probably best to read the whole paper again.
New results for one of the datasets (Leukemia gene dataset)
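For context on the radius-margin quantity the abstract invokes: Vapnik's
classical leave-one-out bound for the hard-margin SVM states that, with $R$ the
radius of the smallest ball enclosing the $m$ training points and $1/\|w\|$ the
margin,

\[
\text{LOO error} \ \le\ \frac{R^2\,\|w\|^2}{m},
\]

which motivates feature elimination criteria that track the product
$R^2\|w\|^2$ rather than the margin alone.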
- …