Noise Tolerance under Risk Minimization
In this paper we explore noise-tolerant learning of classifiers. We formulate
the problem as follows. We assume that there is an ideal
training set which is noise-free. The actual training set given to the learning
algorithm is obtained from this ideal data set by corrupting the class label of
each example. The probability that the class label of an example is corrupted
is a function of the feature vector of the example. This model accounts for most
kinds of noisy data encountered in practice. We say that a learning method
is noise tolerant if the classifiers learnt with the ideal noise-free data and
with noisy data, both have the same classification accuracy on the noise-free
data. In this paper we analyze the noise tolerance properties of risk
minimization (under different loss functions), which is a generic method for
learning classifiers. We show that risk minimization under the 0-1 loss function
has impressive noise-tolerance properties, that risk minimization under the
squared-error loss is tolerant only to uniform noise, and that risk minimization
under the other loss functions considered is not noise tolerant. We conclude the
paper with a discussion of the implications of these theoretical results.
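The uniform-noise case of this tolerance property can be illustrated with a small simulation: minimizing the empirical 0-1 risk on labels flipped uniformly at random still recovers a classifier that is accurate on the clean data. This is an illustrative sketch on synthetic 1-D data, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D problem: the clean label is sign(x), so the ideal threshold
# classifier has threshold 0.  (Illustrative data, not from the paper.)
n = 5000
x = rng.uniform(-1, 1, n)
y = np.sign(x)

# Uniform label noise: flip each label independently with rate 0.3.
rho = 0.3
flip = rng.random(n) < rho
y_noisy = np.where(flip, -y, y)

def empirical_01_risk(threshold, x, y):
    pred = np.where(x > threshold, 1.0, -1.0)
    return np.mean(pred != y)

# Minimize the empirical 0-1 risk over a grid of thresholds, using
# only the NOISY labels.
candidates = np.linspace(-1, 1, 401)
risks = [empirical_01_risk(t, x, y_noisy) for t in candidates]
t_hat = candidates[int(np.argmin(risks))]

# Evaluate the learned classifier against the CLEAN labels: the
# threshold stays near 0 and the clean accuracy stays high.
clean_acc = 1.0 - empirical_01_risk(t_hat, x, y)
print(f"threshold: {t_hat:.3f}, clean accuracy: {clean_acc:.3f}")
```

The same experiment with a squared-error fit and noise rates that depend on x would show the contrast the abstract describes.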
Penalized Orthogonal Iteration for Sparse Estimation of Generalized Eigenvalue Problem
We propose a new algorithm for sparse estimation of eigenvectors in
generalized eigenvalue problems (GEP). The GEP arises in a number of modern
data-analytic situations and statistical methods, including principal component
analysis (PCA), multiclass linear discriminant analysis (LDA), canonical
correlation analysis (CCA), sufficient dimension reduction (SDR) and invariant
coordinate selection. We propose to modify the standard generalized orthogonal
iteration with a sparsity-inducing penalty for the eigenvectors. To achieve
this goal, we generalize the equation-solving step of orthogonal iteration to a
penalized convex optimization problem. The resulting algorithm, called
penalized orthogonal iteration, provides accurate estimation of the true
eigenspace, when it is sparse. Also proposed is a computationally more
efficient alternative, which works well for PCA and LDA problems. Numerical
studies reveal that the proposed algorithms are competitive, and that our
tuning procedure works well. We demonstrate applications of the proposed
algorithm to obtain sparse estimates for PCA, multiclass LDA, CCA and SDR.
Supplementary materials are available online.
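A minimal sketch of the idea for the special case B = I (sparse PCA), where the penalized equation-solving step has a soft-thresholding closed form; the function names and the data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_threshold(M, lam):
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

def sparse_pca_poi(A, k, lam, n_iter=100, seed=0):
    """Sketch of penalized orthogonal iteration for the GEP A v = lambda B v
    in the special case B = I.  Each iteration replaces the equation-solving
    step of orthogonal iteration by a penalized least-squares problem, whose
    closed form here is soft-thresholding, followed by re-orthonormalization
    via QR."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    V, _ = np.linalg.qr(rng.standard_normal((p, k)))
    for _ in range(n_iter):
        W = soft_threshold(A @ V, lam)  # penalized update step
        V, _ = np.linalg.qr(W)         # restore orthonormal columns
    return V

# Covariance with a sparse leading eigenvector supported on the first
# 3 of 20 coordinates (synthetic example).
p = 20
v = np.zeros(p); v[:3] = 1 / np.sqrt(3)
A = 5.0 * np.outer(v, v) + np.eye(p)
V_hat = sparse_pca_poi(A, k=1, lam=0.1)
align = abs(float(v @ V_hat[:, 0]))  # alignment with the true eigenvector
print(align)
```

For a general B one would replace the soft-thresholding step by a penalized convex program, as the abstract describes.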
A Direct Estimation Approach to Sparse Linear Discriminant Analysis
This paper considers sparse linear discriminant analysis of high-dimensional
data. In contrast to existing methods, which are based on separate
estimation of the precision matrix Ω and the difference δ of the mean
vectors, we introduce a simple and effective classifier by estimating the
product Ωδ directly through constrained minimization. The
estimator can be implemented efficiently using linear programming and the
resulting classifier is called the linear programming discriminant (LPD) rule.
The LPD rule is shown to have desirable theoretical and numerical properties.
It exploits the approximate sparsity of Ωδ and, as a consequence, can
still perform well even in cases where Ω and/or δ cannot be
estimated consistently. Asymptotic properties of the LPD rule are investigated
and consistency and rate of convergence results are given. The LPD classifier
has superior finite sample performance and significant computational advantages
over the existing methods that require separate estimation of Ω and δ.
The LPD rule is also applied to analyze real datasets from lung cancer and
leukemia studies. The classifier performs favorably in comparison to existing
methods. Comment: 39 pages. To appear in the Journal of the American Statistical Association.
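The abstract says the estimator is computed by linear programming; one Dantzig-selector-style formulation consistent with this is to minimize the ℓ1 norm of β subject to an ℓ∞ constraint on the residual Σ̂β − δ̂, which becomes a linear program after the standard split β = u − v with u, v ≥ 0. This is a hedged sketch with synthetic data, not the authors' code.

```python
import numpy as np
from scipy.optimize import linprog

def lpd_direction(Sigma_hat, delta_hat, lam):
    """Estimate beta ~ Omega * delta by solving
       min ||beta||_1  s.t.  ||Sigma_hat @ beta - delta_hat||_inf <= lam,
    written as an LP in (u, v) with beta = u - v, u, v >= 0."""
    p = len(delta_hat)
    c = np.ones(2 * p)                      # sum(u) + sum(v) = ||beta||_1
    A = np.hstack([Sigma_hat, -Sigma_hat])  # Sigma_hat @ beta in (u, v)
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([delta_hat + lam, lam - delta_hat])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p))
    assert res.success
    u, v = res.x[:p], res.x[p:]
    return u - v

# Tiny synthetic two-class Gaussian example with a sparse direction.
rng = np.random.default_rng(0)
p, n = 10, 200
beta_true = np.zeros(p); beta_true[:2] = [2.0, -1.5]
Sigma = np.eye(p)
delta = Sigma @ beta_true
X1 = rng.multivariate_normal(delta / 2, Sigma, n)
X2 = rng.multivariate_normal(-delta / 2, Sigma, n)
Sigma_hat = (np.cov(X1.T) + np.cov(X2.T)) / 2
delta_hat = X1.mean(0) - X2.mean(0)
beta_hat = lpd_direction(Sigma_hat, delta_hat, lam=0.2)
```

A new observation x would then be classified by the sign of (x − (x̄1 + x̄2)/2)ᵀβ̂, the usual linear discriminant rule with the estimated direction.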
Comment on "Support Vector Machines with Applications"
Comment on "Support Vector Machines with Applications" [math.ST/0612817]. Comment: Published at http://dx.doi.org/10.1214/088342306000000475 in
Statistical Science (http://www.imstat.org/sts/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
A CASE STUDY ON SUPPORT VECTOR MACHINES VERSUS ARTIFICIAL NEURAL NETWORKS
The capability of artificial neural networks for pattern recognition of real-world problems is well known. In recent years, the support vector machine has been advocated for its structural risk minimization, which leads to tolerance margins around decision boundaries. The structures and performances of these pattern classifiers depend on the feature dimension and the training data size. The objective of this research is to compare these pattern recognition systems in a case study: the classification of hypertensive and normotensive right ventricle (RV) shapes obtained from Magnetic Resonance Image (MRI) sequences. In this case the feature dimension is reasonable, but the available training data set is small and the decision surface is highly nonlinear. For diagnosis of congenital heart defects, especially those associated with pressure and volume overload problems, a reliable pattern classifier for determining right ventricle function is needed. The RV's global and regional surface-to-volume ratios are assessed from an individual's MRI heart images and used as features for the pattern classifiers. We first considered two linear classification methods: the Fisher linear discriminant and the linear classifier trained by the Ho-Kashyap algorithm. Since the data are not linearly separable, artificial neural networks with back-propagation training and radial basis function networks were then considered, providing nonlinear decision surfaces. Thirdly, a support vector machine was trained, which gives tolerance margins on both sides of the decision surface. We found in this case study that the back-propagation training of an artificial neural network depends heavily on the selection of initial weights, even when they are randomized. The support vector machine with radial basis function kernels is easily trained and provides decision tolerance margins, albeit small ones.
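The small-sample, nonlinear-boundary setting of this comparison can be mimicked on synthetic data. The sketch below uses scikit-learn with a stand-in two-moons data set (not the MRI surface-to-volume features from the study): an RBF-kernel support vector machine and a back-propagation-trained multilayer perceptron are fit on a small training set and scored on a large test set.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Small training set with a highly nonlinear decision boundary,
# loosely echoing the small-sample RV-shape classification setting.
X_train, y_train = make_moons(n_samples=60, noise=0.2, random_state=0)
X_test, y_test = make_moons(n_samples=1000, noise=0.2, random_state=1)

# RBF-kernel SVM: margin-based, little sensitivity to initialization.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# Small MLP trained by back-propagation: depends on (seeded) initial weights.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_train, y_train)

svm_acc = svm.score(X_test, y_test)
mlp_acc = mlp.score(X_test, y_test)
print("SVM test accuracy:", svm_acc)
print("MLP test accuracy:", mlp_acc)
```

Re-running the MLP with different `random_state` seeds illustrates the initial-weight sensitivity the study reports.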
On surrogate loss functions and f-divergences
The goal of binary classification is to estimate a discriminant function
from observations of covariate vectors and corresponding binary
labels. We consider an elaboration of this problem in which the covariates are
not available directly but are transformed by a dimensionality-reducing
quantizer. We present conditions on loss functions such that empirical risk
minimization yields Bayes consistency when both the discriminant function and
the quantizer are estimated. These conditions are stated in terms of a general
correspondence between loss functions and a class of functionals known as
Ali-Silvey or f-divergence functionals. Whereas this correspondence was
established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1 (1951)
93--102. Univ. California Press, Berkeley] for the 0--1 loss, we extend the
correspondence to the broader class of surrogate loss functions that play a key
role in the general theory of Bayes consistency for binary classification. Our
result makes it possible to pick out the (strict) subset of surrogate loss
functions that yield Bayes consistency for joint estimation of the discriminant
function and the quantizer. Comment: Published at http://dx.doi.org/10.1214/08-AOS595 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
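Blackwell's correspondence for the 0-1 loss can be illustrated numerically: with equal priors, the Bayes-optimal 0-1 risk equals (1 − TV)/2, where TV is the total variation distance, the f-divergence generated by f(t) = |t − 1|/2. A minimal sketch with made-up discrete class-conditional distributions:

```python
import numpy as np

# Two class-conditional distributions on a 3-symbol alphabet, equal priors.
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.2, 0.6])

# Total variation distance: the f-divergence with f(t) = |t - 1| / 2.
tv = 0.5 * np.sum(np.abs(P0 - P1))

# Bayes-optimal 0-1 risk with equal priors: at each symbol, predict the
# class with the larger conditional probability, so the error mass is
# the smaller of the two half-weighted probabilities.
bayes_risk = np.sum(0.5 * np.minimum(P0, P1))

print(tv, bayes_risk)  # the values satisfy bayes_risk == (1 - tv) / 2
```

The paper's contribution is extending this kind of loss/divergence link from the 0-1 loss to the broader class of surrogate losses.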
Target Contrastive Pessimistic Discriminant Analysis
Domain-adaptive classifiers learn from a source domain and aim to generalize
to a target domain. If the classifier's assumptions on the relationship between
domains (e.g. covariate shift) are valid, then it will usually outperform a
non-adaptive source classifier. Unfortunately, it can perform substantially
worse when its assumptions are invalid. Validating these assumptions requires
labeled target samples, which are usually not available. We argue that, in
order to make domain-adaptive classifiers more practical, it is necessary to
focus on robust methods; robust in the sense that the model still achieves a
particular level of performance without making strong assumptions on the
relationship between domains. With this objective in mind, we formulate a
conservative parameter estimator that only deviates from the source classifier
when a lower or equal risk is guaranteed for all possible labellings of the
given target samples. We derive the corresponding estimator for a discriminant
analysis model, and show that its risk is actually strictly smaller than that
of the source classifier. Experiments indicate that our classifier outperforms
state-of-the-art classifiers for geographically biased samples. Comment: 9 pages, no figures, 2 tables. arXiv admin note: substantial text
overlap with arXiv:1706.0808