    An adaptive multiclass nearest neighbor classifier

    We consider the problem of multiclass classification, where the training sample $S_n = \{(X_i, Y_i)\}_{i=1}^n$ is generated from the model $\mathbb P(Y = m \mid X = x) = \eta_m(x)$, $1 \leq m \leq M$, and $\eta_1(x), \dots, \eta_M(x)$ are unknown $\alpha$-Hölder continuous functions. Given a test point $X$, our goal is to predict its label. A widely used $\mathsf k$-nearest-neighbor classifier constructs estimates of $\eta_1(X), \dots, \eta_M(X)$ and uses a plug-in rule for the prediction. However, it requires a proper choice of the smoothing parameter $\mathsf k$, which may become tricky in some situations. In our solution, we fix several integers $n_1, \dots, n_K$, compute the corresponding $n_k$-nearest-neighbor estimates for each $m$ and each $n_k$, and apply an aggregation procedure. We study an algorithm that constructs a convex combination of these estimates such that the aggregated estimate behaves approximately as well as an oracle choice. We also provide a non-asymptotic analysis of the procedure, prove its adaptation to the unknown smoothness parameter $\alpha$ and to the margin, and establish rates of convergence under mild assumptions. Comment: Accepted in ESAIM: Probability & Statistics. The original publication is available at www.esaim-ps.or
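
    A minimal sketch of the idea of combining several $n_k$-nearest-neighbor estimates of the class probabilities into a convex combination is shown below. The weighting rule (exponential weights computed on a held-out split), the grid of neighbor counts, and the temperature are illustrative assumptions, not the aggregation procedure analysed in the paper.

```python
# Sketch: convex aggregation of several k-NN class-probability estimates.
# Exponential weights on a held-out split are an illustrative stand-in only.
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def aggregated_knn_proba(X_train, y_train, X_test, ks=(5, 10, 20, 40), temperature=1.0):
    """Convex combination of n_k-nearest-neighbor class-probability estimates."""
    X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)
    probas, losses = [], []
    for k in ks:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
        losses.append(log_loss(y_val, clf.predict_proba(X_val), labels=clf.classes_))
        probas.append(clf.predict_proba(X_test))
    losses = np.asarray(losses)
    weights = np.exp(-(losses - losses.min()) / temperature)   # smaller held-out loss -> larger weight
    weights /= weights.sum()                                   # convex weights
    return np.tensordot(weights, np.stack(probas), axes=1)     # aggregated estimates of eta_m at X_test
```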

    Simultaneous adaptation to the margin and to complexity in classification

    We consider the problem of adaptation to the margin and to complexity in binary classification. We suggest an exponential weighting aggregation scheme. We use this aggregation procedure to construct classifiers that adapt automatically to margin and complexity. Two main examples are worked out in which adaptivity is achieved in frameworks proposed by Steinwart and Scovel [Learning Theory. Lecture Notes in Comput. Sci. 3559 (2005) 279--294. Springer, Berlin; Ann. Statist. 35 (2007) 575--607] and Tsybakov [Ann. Statist. 32 (2004) 135--166]. Adaptive schemes, like ERM or penalized ERM, usually involve a minimization step. This is not the case for our procedure. Comment: Published at http://dx.doi.org/10.1214/009053607000000055 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
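
    A minimal sketch of an exponential-weighting aggregate of candidate binary classifiers appears below. The temperature and the plain weighted vote are illustrative assumptions; the precise scheme and its theoretical tuning are specified in the paper itself.

```python
# Sketch: exponential-weighting aggregation of candidate classifiers f_1, ..., f_K
# with labels in {-1, +1}. Weights come from empirical risks only (no minimization).
import numpy as np

def exponential_weighting_aggregate(classifiers, X, y, temperature=1.0):
    """Return a predictor formed as a weighted vote of the candidate classifiers."""
    n = len(y)
    risks = np.array([np.mean(f(X) != y) for f in classifiers])   # empirical 0-1 risks
    weights = np.exp(-n * risks / temperature)                    # exponential weights ...
    weights /= weights.sum()                                      # ... normalized to be convex

    def predict(X_new):
        votes = np.array([f(X_new) for f in classifiers], dtype=float)
        return np.sign(weights @ votes)                           # weighted majority vote
    return predict
```

    Note that, consistent with the abstract's remark, forming the weights only requires evaluating empirical risks, not solving an optimization problem.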

    On Rate-Optimal Partitioning Classification from Observable and from Privatised Data

    In this paper we revisit the classical method of partitioning classification and study its convergence rate under relaxed conditions, both for observable (non-privatised) and for privatised data. Let the feature vector $X$ take values in $\mathbb{R}^d$ and denote its label by $Y$. Previous results on the partitioning classifier worked with the strong density assumption, which is restrictive, as we demonstrate through simple examples. We assume that the distribution of $X$ is a mixture of an absolutely continuous and a discrete distribution, such that the absolutely continuous component is concentrated on a $d_a$-dimensional subspace. Here, we study the problem under much milder assumptions: in addition to the standard Lipschitz and margin conditions, a novel characteristic of the absolutely continuous component is introduced, by which the exact convergence rate of the classification error probability is calculated, both for the binary and for the multi-label cases. Interestingly, this rate of convergence depends only on the intrinsic dimension $d_a$. The privacy constraints mean that the data $(X_1, Y_1), \dots, (X_n, Y_n)$ cannot be directly observed, and the classifiers are functions of the randomised outcome of a suitable local differential privacy mechanism. The statistician is free to choose the form of this privacy mechanism, and here we add Laplace distributed noise to the discretisations of all possible locations of the feature vector $X_i$ and to its label $Y_i$. Again, tight upper bounds on the rate of convergence of the classification error probability are derived, without the strong density assumption, such that this rate depends on $2\,d_a$.
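
    A minimal sketch of a partitioning (histogram) classifier on $[0,1)^d$, together with a locally privatised variant in which each sample releases its label-signed one-hot cell indicator perturbed by Laplace noise before aggregation, is given below. The noise scale and the signed-indicator encoding are illustrative assumptions, not necessarily the exact mechanism analysed in the paper.

```python
# Sketch: partitioning classifier with an optional local-DP Laplace mechanism.
import numpy as np

def cell_indices(X, h):
    """Flat indices of the cubic cells of side h containing the rows of X in [0, 1)^d."""
    m = int(np.ceil(1.0 / h))                                 # cells per axis
    coords = np.minimum((X / h).astype(int), m - 1)
    return np.ravel_multi_index(coords.T, dims=(m,) * X.shape[1]), m ** X.shape[1]

def partitioning_classifier(X, y, h, epsilon=None, rng=None):
    """y in {-1, +1}; if epsilon is given, aggregate Laplace-perturbed signed indicators."""
    rng = np.random.default_rng() if rng is None else rng
    idx, n_cells = cell_indices(X, h)
    votes = np.zeros(n_cells)
    for i, c in enumerate(idx):
        report = np.zeros(n_cells)
        report[c] = y[i]                                      # label-signed cell indicator
        if epsilon is not None:
            report += rng.laplace(scale=4.0 / epsilon, size=n_cells)   # local randomisation
        votes += report                                       # aggregator sees only noisy reports

    def predict(X_new):
        new_idx, _ = cell_indices(X_new, h)
        return np.where(votes[new_idx] >= 0, 1, -1)           # majority vote within each cell
    return predict
```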

    Two Phases of Scaling Laws for Nearest Neighbor Classifiers

    A scaling law refers to the observation that the test performance of a model improves as the amount of training data increases. A fast scaling law implies that one can solve machine learning problems simply by increasing the data and model sizes. Yet, in many cases, the benefit of adding more data can be negligible. In this work, we study the rate of scaling laws of nearest neighbor classifiers. We show that a scaling law can have two phases: in the first phase, the generalization error depends polynomially on the data dimension and decreases quickly, whereas in the second phase, the error depends exponentially on the data dimension and decreases slowly. Our analysis highlights the role of the complexity of the data distribution in determining the generalization error. When the data distribution is benign, our results suggest that the nearest neighbor classifier can achieve a generalization error that depends polynomially, instead of exponentially, on the data dimension.
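
    A minimal sketch of how one might trace such a scaling curve empirically for a 1-nearest-neighbor classifier (test error versus training-set size) follows; the synthetic two-Gaussian data is an illustrative choice only.

```python
# Sketch: empirical scaling curve of a 1-NN classifier on synthetic data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
d = 10                                                    # data dimension

def sample(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + y[:, None]              # two shifted Gaussian classes
    return X, y

X_test, y_test = sample(5000)
for n in (100, 400, 1600, 6400, 25600):
    X_tr, y_tr = sample(n)
    pred = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).predict(X_test)
    print(f"n = {n:6d}   test error = {np.mean(pred != y_test):.3f}")
    # the slope of this curve on log-log axes indicates which scaling phase one is in
```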

    Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy

    We consider challenges that arise in the estimation of the mean outcome under an optimal individualized treatment strategy, defined as the treatment rule that maximizes the population mean outcome, where the candidate treatment rules are restricted to depend on baseline covariates. We prove a necessary and sufficient condition for the pathwise differentiability of the optimal value, a key condition needed to develop a regular and asymptotically linear (RAL) estimator of the optimal value. The stated condition is slightly more general than the previous condition implied in the literature. We then describe an approach to obtain root-$n$ rate confidence intervals for the optimal value even when the parameter is not pathwise differentiable. We provide conditions under which our estimator is RAL and asymptotically efficient when the mean outcome is pathwise differentiable. We also outline an extension of our approach to a multiple time point problem. All of our results are supported by simulations. Comment: Published at http://dx.doi.org/10.1214/15-AOS1384 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
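
    A minimal sketch of a simple plug-in estimate of the optimal value $E[\max_a E(Y \mid A = a, W)]$ is shown below: fit one outcome regression per treatment arm and average the fitted per-subject maxima. This illustrative plug-in is not the paper's RAL estimator and carries no confidence-interval guarantees; the regression model is an arbitrary choice.

```python
# Sketch: naive plug-in estimate of the mean outcome under the optimal rule.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def plug_in_optimal_value(W, A, Y, treatments=(0, 1)):
    """W: baseline covariates (n, p); A: observed treatments (n,); Y: outcomes (n,)."""
    fitted = []
    for a in treatments:
        mask = (A == a)
        model = RandomForestRegressor(n_estimators=200, random_state=0).fit(W[mask], Y[mask])
        fitted.append(model.predict(W))               # predicted outcome had everyone received arm a
    best = np.max(np.column_stack(fitted), axis=1)    # each subject's best predicted outcome
    return float(np.mean(best))                       # estimated mean outcome under the optimal rule
```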

    Choice of neighbor order in nearest-neighbor classification

    The $k$th-nearest-neighbor rule is arguably the simplest and most intuitively appealing nonparametric classification procedure. However, application of this method is inhibited by lack of knowledge about its properties, in particular, about the manner in which it is influenced by the value of $k$, and by the absence of techniques for empirical choice of $k$. In the present paper we detail the way in which the value of $k$ determines the misclassification error. We consider two models, Poisson and Binomial, for the training samples. Under the first model, data are recorded in a Poisson stream and are "assigned" to one or the other of the two populations in accordance with the prior probabilities. In particular, the total number of data in both training samples is a Poisson-distributed random variable. Under the Binomial model, however, the total number of data in the training samples is fixed, although again each data value is assigned in a random way. Although the values of risk and regret associated with the Poisson and Binomial models are different, they are asymptotically equivalent to first order, and also to the risks associated with kernel-based classifiers that are tailored to the case of two derivatives. These properties motivate new methods for choosing the value of $k$. Comment: Published at http://dx.doi.org/10.1214/07-AOS537 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
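
    A minimal sketch of an empirical choice of the neighbor order $k$ is given below. The paper's proposed methods are derived from its risk expansions; the grid and the cross-validation criterion used here are a simple illustrative baseline.

```python
# Sketch: choose k for a k-NN classifier by cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k_by_cv(X, y, k_grid=range(1, 51, 2), folds=5):
    """Return the k from k_grid with the highest cross-validated accuracy."""
    scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=folds).mean()
              for k in k_grid]
    return list(k_grid)[int(np.argmax(scores))]
```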