    Discussion of ``2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization'' by V. Koltchinskii

    Discussion of ``2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization'' by V. Koltchinskii [arXiv:0708.0083]. Comment: Published at http://dx.doi.org/10.1214/009053606000001064 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Simultaneous adaptation to the margin and to complexity in classification

    We consider the problem of adaptation to the margin and to complexity in binary classification. We suggest an exponential weighting aggregation scheme. We use this aggregation procedure to construct classifiers which adapt automatically to margin and complexity. Two main examples are worked out in which adaptivity is achieved in frameworks proposed by Steinwart and Scovel [Learning Theory. Lecture Notes in Comput. Sci. 3559 (2005) 279--294. Springer, Berlin; Ann. Statist. 35 (2007) 575--607] and Tsybakov [Ann. Statist. 32 (2004) 135--166]. Adaptive schemes, like ERM or penalized ERM, usually involve a minimization step. This is not the case for our procedure. Comment: Published at http://dx.doi.org/10.1214/009053607000000055 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
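
A minimal sketch of what an exponential weighting aggregation scheme can look like for a finite dictionary of classifiers with values in {-1, +1}. The abstract does not give the exact weights, so the form exp(-temperature * n * empirical risk), the `temperature` parameter, the use of the 0-1 loss, and all function names are assumptions made for illustration. Note that no minimization over the dictionary is performed: the weights are computed in closed form.

```python
import math


def empirical_risk(classifier, sample):
    """Fraction of misclassified points in sample, a list of (x, y) pairs."""
    return sum(1 for x, y in sample if classifier(x) != y) / len(sample)


def exponential_weights(risks, n, temperature=1.0):
    """Weights proportional to exp(-temperature * n * risk_j) (assumed form)."""
    scores = [-temperature * n * r for r in risks]
    top = max(scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def aggregate_predict(classifiers, weights, x):
    """Sign of the weighted vote of the base classifiers."""
    vote = sum(w * f(x) for f, w in zip(classifiers, weights))
    return 1 if vote >= 0 else -1


# Hypothetical toy dictionary: a threshold rule and a constant rule.
classifiers = [lambda x: 1 if x > 0 else -1, lambda x: -1]
sample = [(1, 1), (2, 1), (-1, -1), (-2, -1)]
risks = [empirical_risk(f, sample) for f in classifiers]
weights = exponential_weights(risks, n=len(sample))
```

Here the threshold rule has zero empirical risk and therefore receives the larger weight, so the aggregate follows it.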

    An adaptive multiclass nearest neighbor classifier

    We consider a problem of multiclass classification, where the training sample $S_n = \{(X_i, Y_i)\}_{i=1}^n$ is generated from the model $\mathbb P(Y = m \mid X = x) = \eta_m(x)$, $1 \leq m \leq M$, and $\eta_1(x), \dots, \eta_M(x)$ are unknown $\alpha$-Hölder continuous functions. Given a test point $X$, our goal is to predict its label. A widely used $\mathsf k$-nearest-neighbors classifier constructs estimates of $\eta_1(X), \dots, \eta_M(X)$ and uses a plug-in rule for the prediction. However, it requires a proper choice of the smoothing parameter $\mathsf k$, which may become tricky in some situations. In our solution, we fix several integers $n_1, \dots, n_K$, compute corresponding $n_k$-nearest-neighbor estimates for each $m$ and each $n_k$, and apply an aggregation procedure. We study an algorithm which constructs a convex combination of these estimates such that the aggregated estimate behaves approximately as well as an oracle choice. We also provide a non-asymptotic analysis of the procedure, prove its adaptation to the unknown smoothness parameter $\alpha$ and to the margin, and establish rates of convergence under mild assumptions. Comment: Accepted in ESAIM: Probability & Statistics. The original publication is available at www.esaim-ps.or
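
The pipeline described above (several nearest-neighbor estimates, a convex combination, then a plug-in rule) can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: points are one-dimensional, labels are coded 0, ..., M-1 rather than 1, ..., M, and the convex weights are passed in by hand, whereas the paper chooses them by an aggregation procedure that the abstract does not specify. All function names are hypothetical.

```python
def knn_class_probs(train, x, k, num_classes):
    """k-NN estimate of the class probabilities at x: label frequencies
    among the k training points closest to x."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    counts = [0] * num_classes
    for _, label in neighbors:
        counts[label] += 1
    return [c / k for c in counts]


def aggregated_probs(train, x, ks, weights, num_classes):
    """Convex combination of the k-NN estimates for each k in ks.
    The weights are assumed nonnegative and summing to one."""
    estimates = [knn_class_probs(train, x, k, num_classes) for k in ks]
    return [
        sum(w * est[m] for w, est in zip(weights, estimates))
        for m in range(num_classes)
    ]


def plug_in_label(probs):
    """Plug-in rule: predict the class with the largest estimated probability."""
    return max(range(len(probs)), key=lambda m: probs[m])


# Hypothetical toy data with three classes.
train = [(0.0, 0), (0.1, 0), (0.2, 0), (1.0, 1), (1.1, 1), (2.0, 2)]
probs = aggregated_probs(train, 0.05, ks=[1, 3, 5],
                         weights=[0.2, 0.3, 0.5], num_classes=3)
label = plug_in_label(probs)
```

Because each k-NN estimate is a probability vector and the weights are convex, the aggregated estimate is again a probability vector.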

    PAC-Bayesian High Dimensional Bipartite Ranking

    This paper is devoted to the bipartite ranking problem, a classical statistical learning task, in a high dimensional setting. We propose a scoring and ranking strategy based on the PAC-Bayesian approach. We consider nonlinear additive scoring functions, and we derive non-asymptotic risk bounds under a sparsity assumption. In particular, oracle inequalities in probability holding under a margin condition assess the performance of our procedure, and prove its minimax optimality. An MCMC-flavored algorithm is proposed to implement our method, and its behavior is illustrated on synthetic and real-life datasets.

    Beyond Disagreement-based Agnostic Active Learning

    We study agnostic active learning, where the goal is to learn a classifier in a pre-specified hypothesis class interactively with as few label queries as possible, while making no assumptions on the true function generating the labels. The main algorithms for this problem are disagreement-based active learning, which has a high label requirement, and margin-based active learning, which only applies to fairly restricted settings. A major challenge is to find an algorithm which achieves better label complexity, is consistent in an agnostic setting, and applies to general classification problems. In this paper, we provide such an algorithm. Our solution is based on two novel contributions: a reduction from consistent active learning to confidence-rated prediction with guaranteed error, and a novel confidence-rated predictor.

    Adapting to Unknown Smoothness by Aggregation of Thresholded Wavelet Estimators

    We study the performance of an adaptive procedure based on a convex combination, with data-driven weights, of term-by-term thresholded wavelet estimators. For the bounded regression model, with random uniform design, and the nonparametric density model, we show that the resulting estimator is optimal in the minimax sense over all Besov balls under the $L^2$ risk, without any logarithmic factor.

    Sharp Oracle Inequalities for Aggregation of Affine Estimators

    We consider the problem of combining a (possibly uncountably infinite) set of affine estimators in a nonparametric regression model with heteroscedastic Gaussian noise. Focusing on the exponentially weighted aggregate, we prove a PAC-Bayesian type inequality that leads to sharp oracle inequalities in discrete but also in continuous settings. The framework is general enough to cover the combinations of various procedures such as least squares regression, kernel ridge regression, shrinking estimators and many other estimators used in the literature on statistical inverse problems. As a consequence, we show that the proposed aggregate provides an adaptive estimator in the exact minimax sense without either discretizing the range of tuning parameters or splitting the set of observations. We also illustrate numerically the good performance achieved by the exponentially weighted aggregate.
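
For orientation, an exponentially weighted aggregate over a possibly continuous family of estimators is commonly written in the following generic form. The notation is assumed for illustration and is not taken from the abstract: $\Lambda$ is the index set, $\hat f_\lambda$ the affine estimators, $\pi$ a prior on $\Lambda$, $\hat r_\lambda$ an estimate of the risk of $\hat f_\lambda$, and $\beta > 0$ a temperature parameter.

```latex
\hat f_{\mathrm{EWA}} = \int_\Lambda \hat f_\lambda \, \hat\pi(d\lambda),
\qquad
\hat\pi(d\lambda) =
  \frac{\exp\!\bigl(-\hat r_\lambda / \beta\bigr)\, \pi(d\lambda)}
       {\int_\Lambda \exp\!\bigl(-\hat r_\omega / \beta\bigr)\, \pi(d\omega)} .
```

When $\pi$ is supported on finitely many points the integrals reduce to sums (the discrete setting). A prior with a density on $\Lambda$ gives the continuous setting, with no need to discretize the range of tuning parameters.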

    A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity

    We present a novel notion of complexity that interpolates between and generalizes some classic existing complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian, $\mathrm{KL}(\text{posterior} \operatorname{\|} \text{prior})$ complexity). For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to $L_2(P)$ entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi who did the log-loss case with $L_\infty$. Together, these results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity. Comment: 38 pages
