4,773 research outputs found

Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier

Authors: Carroll Raymond J.; Delaigle Aurore; Hall Peter
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2013

The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There, a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier.

Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/13-AOS1158

arXiv.org e-Print Archive

OPUS - University of Technology Sydney

Texas A&M Repository

PubMed Central
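
To make the role of the smoothing parameter in the first abstract concrete, here is a minimal sketch of how a bandwidth turns discrete observations into a smooth curve. It assumes a Nadaraya-Watson smoother with a Gaussian kernel, which is an illustrative choice and not necessarily the smoother used in the paper; the function name `nw_smooth` and the toy data are hypothetical.

```python
import math

def nw_smooth(t_obs, y_obs, t_grid, h):
    """Nadaraya-Watson estimate on t_grid with a Gaussian kernel of bandwidth h.

    Assumption: every grid point is close enough to some observation that
    the kernel weights do not all underflow to zero.
    """
    smoothed = []
    for t in t_grid:
        # Weight each observation by its kernel distance to t.
        weights = [math.exp(-0.5 * ((t - s) / h) ** 2) for s in t_obs]
        weighted_sum = sum(w * y for w, y in zip(weights, y_obs))
        smoothed.append(weighted_sum / sum(weights))
    return smoothed

# Hypothetical discrete observations of one curve.
t = [0.0, 0.25, 0.5, 0.75, 1.0]
y = [0.0, 1.0, 0.0, 1.0, 0.0]

undersmoothed = nw_smooth(t, y, t, h=0.01)   # nearly interpolates the data
oversmoothed = nw_smooth(t, y, t, h=100.0)   # nearly flat, close to the mean of y
```

A small bandwidth follows the raw observations closely, while a large one flattens the curve toward its average. In a classifier built on such curves, `h` would be one of the tuning parameters whose effect on error rate the abstract describes.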

A U-statistic estimator for the variance of resampling-based error estimators

Authors: Boulesteix Anne-Laure; De Bin Riccardo; Fuchs Mathias; Hornung Roman
Publication date: 01/01/2013

We revisit resampling procedures for error estimation in binary classification in terms of U-statistics. In particular, we exploit the fact that the error rate estimator involving all learning-testing splits is a U-statistic. Therefore, several standard theorems on properties of U-statistics apply. In particular, it has minimal variance among all unbiased estimators and is asymptotically normally distributed. Moreover, there is an unbiased estimator for this minimal variance if the total sample size is at least twice the learning set size plus two. In this case, we exhibit such an estimator which is another U-statistic. It enjoys, again, various optimality properties and yields an asymptotically exact hypothesis test of the equality of error rates when two learning algorithms are compared. Our statements apply to any deterministic learning algorithm under weak non-degeneracy assumptions. In an application to tuning parameter choice in lasso regression on a gene expression data set, the test does not reject the null hypothesis of equal rates between two different parameters.

arXiv.org e-Print Archive

CiteSeerX

Open Access LMU
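
The "error rate estimator involving all learning-testing splits" mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a toy nearest-class-mean learner on one-dimensional inputs, and the names `nearest_mean_fit`, `nearest_mean_predict`, and `complete_split_error` are hypothetical.

```python
from itertools import combinations

def nearest_mean_fit(xs, ys):
    """Toy deterministic learner: store the mean of each class seen in training."""
    means = {}
    for label in set(ys):
        points = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(points) / len(points)
    return means

def nearest_mean_predict(means, x):
    # Ties go to the smaller label so that the rule stays deterministic.
    return min(sorted(means), key=lambda label: abs(x - means[label]))

def complete_split_error(xs, ys, g):
    """Average test error over every learning set of size g.

    Each split trains on g observations and tests on the remaining n - g,
    so every (learning set, test point) pair gets equal weight.
    """
    n = len(xs)
    errors = 0
    tests = 0
    for learn in combinations(range(n), g):
        learn_set = set(learn)
        model = nearest_mean_fit([xs[i] for i in learn], [ys[i] for i in learn])
        for j in range(n):
            if j not in learn_set:
                errors += nearest_mean_predict(model, xs[j]) != ys[j]
                tests += 1
    return errors / tests

# Hypothetical toy sample with two well-separated classes.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
ys = [0, 0, 0, 1, 1, 1]
err = complete_split_error(xs, ys, g=4)
```

Enumerating all splits costs a number of model fits equal to the binomial coefficient of n and g, so this form is only practical for very small samples; in larger problems one would typically average over a random subset of splits instead.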

Classification and Error Estimation for Discrete Data

Author: Braga-Neto Ulisses M
Publication venue: Bentham Science Publishers Ltd.
Publication date: 01/01/2009

Discrete classification is common in Genomic Signal Processing applications, in particular in classification of discretized gene expression data, and in discrete gene expression prediction and the inference of Boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must then be employed to obtain reliable estimates of the classification error based on the available data. Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of small-sample data sets in such applications. This paper presents a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics, focusing on the study of performance in small sample scenarios, as well as asymptotic behavior.

CiteSeerX

Crossref

PubMed Central

A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies

Authors: Boulesteix Anne-Laure; Eugster Manuel J. A.; Hable Robert; Lauer Sabine
Publication date: 01/01/2013

Open Access LMU