The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime

Abstract

Modern machine learning models are often so complex that they achieve vanishing classification error on the training set. Max-margin linear classifiers are among the simplest classification methods that have zero training error (with linearly separable data). Despite their simplicity, their high-dimensional behavior is not yet completely understood. We assume we are given i.i.d. data $(y_i,{\boldsymbol x}_i)$, $i\le n$, with ${\boldsymbol x}_i\sim {\sf N}(0,{\boldsymbol \Sigma})$ a $p$-dimensional feature vector, and $y_i \in\{+1,-1\}$ a label whose distribution depends on a linear combination of the covariates $\langle{\boldsymbol\theta}_*,{\boldsymbol x}_i\rangle$. We consider the proportional asymptotics $n,p\to\infty$ with $p/n\to \psi$, and derive exact expressions for the limiting prediction error. Our asymptotic results match simulations already when $n,p$ are of the order of a few hundred. We explore several choices for $({\boldsymbol \theta}_*,{\boldsymbol \Sigma})$, and show that the resulting generalization curve (test error as a function of the overparametrization $\psi=p/n$) is qualitatively different depending on this choice. In particular, we consider a specific structure of $({\boldsymbol \theta}_*,{\boldsymbol\Sigma})$ that captures the behavior of nonlinear random feature models or, equivalently, two-layer neural networks with random first-layer weights. In this case, we aim at classifying data $(y_i,{\boldsymbol x}_i)$ with ${\boldsymbol x}_i\in{\mathbb R}^d$, but we do so by first embedding them in a $p$-dimensional feature space via ${\boldsymbol x}_i\mapsto\sigma({\boldsymbol W}{\boldsymbol x}_i)$ and then finding a max-margin classifier in this space. We derive exact formulas in the proportional asymptotics $p,n,d\to\infty$ with $p/d\to\psi_1$, $n/d\to\psi_2$, and observe that the test error is minimized in the highly overparametrized regime $\psi_1\gg 0$.

Comment: 73 pages; 12 pdf figures (Added formulas for wide asymptotics, and distribution of the coordinates of the estimator
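For orientation, the objects named in the abstract can be written out as follows. This is a sketch using the standard hard-margin formulation, which the abstract itself does not spell out. The symbols $\hat{\boldsymbol\theta}$, $\mathrm{Err}$, and $({\boldsymbol x}_{\rm new}, y_{\rm new})$ are notation assumed here for illustration, and the dimensions of ${\boldsymbol W}$ are inferred from the stated embedding of ${\mathbb R}^d$ into a $p$-dimensional space.

```latex
% Max-margin linear classifier on linearly separable data (standard
% hard-margin form; \hat{\boldsymbol\theta} is assumed notation):
\[
  \hat{\boldsymbol\theta}
  \in \arg\max_{\|{\boldsymbol\theta}\|_2 \le 1}\;
      \min_{i\le n}\; y_i\,\langle {\boldsymbol\theta}, {\boldsymbol x}_i\rangle .
\]
% Prediction (test) error on a fresh sample drawn from the same
% distribution as the training data (Err is assumed notation):
\[
  \mathrm{Err}
  = {\mathbb P}\bigl( y_{\rm new}\,
      \langle \hat{\boldsymbol\theta}, {\boldsymbol x}_{\rm new}\rangle \le 0 \bigr).
\]
% Random features variant: the same classifier is fit to embedded data,
% with \sigma applied entrywise and {\boldsymbol W} a random matrix
% (dimensions inferred from the abstract):
\[
  {\boldsymbol x}_i \mapsto \sigma({\boldsymbol W}{\boldsymbol x}_i)\in{\mathbb R}^p,
  \qquad {\boldsymbol W}\in{\mathbb R}^{p\times d}.
\]
```

In this notation, the generalization curve discussed above is the limit of $\mathrm{Err}$ viewed as a function of $\psi = p/n$ (or of $\psi_1 = p/d$ at fixed $\psi_2 = n/d$ in the random features setting).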
