4,773 research outputs found
Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier
The data functions that are studied in the course of functional data analysis
are assembled from discrete data, and the level of smoothing that is used is
generally that which is appropriate for accurate approximation of the
conceptually smooth functions that were not actually observed. Existing
literature shows that this approach is effective, and even optimal, when using
functional data methods for prediction or hypothesis testing. However, in the
present paper we show that this approach is not effective in classification
problems. There a useful rule of thumb is that undersmoothing is often
desirable, but there are several surprising qualifications to that approach.
First, the effect of smoothing the training data can be more significant than
that of smoothing the new data set to be classified; second, undersmoothing is
not always the right approach, and in fact in some cases using a relatively
large bandwidth can be more effective; and third, these perverse results are
the consequence of very unusual properties of error rates, expressed as
functions of smoothing parameters. For example, the orders of magnitude of
optimal smoothing parameter choices depend on the signs and sizes of terms in
an expansion of error rate, and those signs and sizes can vary dramatically
from one setting to another, even for the same classifier.Comment: Published in at http://dx.doi.org/10.1214/13-AOS1158 the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
A U-statistic estimator for the variance of resampling-based error estimators
We revisit resampling procedures for error estimation in binary classification in terms of U-statistics. In particular, we exploit the fact that the error rate estimator involving all learning-testing splits is a U-statistic. Therefore, several standard theorems on properties of U-statistics apply.
In particular, it has minimal variance among all unbiased estimators and is asymptotically normally distributed. Moreover, there is an unbiased estimator for this minimal variance if the total sample size is at least the double learning set size plus two. In this case, we exhibit such an estimator which is another U-statistic. It enjoys, again, various optimality properties and yields an asymptotically exact hypothesis test of the equality of error rates when two learning algorithms are compared. Our statements apply to any deterministic learning algorithms under weak non-degeneracy assumptions.
In an application to tuning parameter choice in lasso regression on a gene expression data set, the test does not reject the null hypothesis of equal rates between two different parameters
Classification and Error Estimation for Discrete Data
Discrete classification is common in Genomic Signal Processing applications, in particular in classification of discretized gene expression data, and in discrete gene expression prediction and the inference of boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must then be employed to obtain reliable estimates of the classification error based on the available data. Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of small-sample data sets in such applications. This paper presents a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics, focusing on the study of performance in small sample scenarios, as well as asymptotic behavior
- …