High-Dimensional Boosting: Rate of Convergence
Boosting is one of the most significant developments in machine learning.
This paper studies the rate of convergence of Boosting, which is tailored
for regression, in a high-dimensional setting. Moreover, we introduce so-called
"post-Boosting". This is a post-selection
estimator which applies ordinary least squares to the variables selected in the
first stage by Boosting. Another variant is "Orthogonal
Boosting", where after each step an orthogonal projection is
conducted. We show that both post-Boosting and Orthogonal Boosting
achieve the same rate of convergence as LASSO in a sparse, high-dimensional
setting. We show that the rate of convergence of the classical Boosting
depends on the design matrix through a sparse eigenvalue constant. To show
the latter results, we derive new approximation results for the pure greedy
algorithm, based on analyzing the revisiting behavior of Boosting. We also
introduce feasible rules for early stopping, which can be easily implemented
and used in applied work. Our results also allow a direct comparison between
LASSO and boosting which has been missing from the literature. Finally, we
present simulation studies and applications to illustrate the relevance of our
theoretical results and to provide insights into the practical aspects of
boosting. In these simulation studies, post-Boosting clearly outperforms
LASSO.
Comment: 19 pages, 4 tables; AMS 2000 subject classifications: Primary 62J05,
62J07, 41A25; secondary 49M15, 68Q3
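The two-stage idea can be made concrete with a minimal numpy sketch: componentwise L2-Boosting selects variables, then ordinary least squares is refit on the selected set. The step count and learning rate nu below are illustrative placeholders, not the paper's data-driven early-stopping rules, and the data are synthetic.

```python
import numpy as np

def boost_then_ols(X, y, steps=100, nu=0.1):
    """Componentwise L2-Boosting for variable selection, followed by
    an OLS refit on the selected columns (post-Boosting)."""
    n, p = X.shape
    beta, resid = np.zeros(p), y.astype(float).copy()
    for _ in range(steps):
        corr = X.T @ resid
        j = np.argmax(np.abs(corr))            # best-correlated feature
        step = corr[j] / (X[:, j] @ X[:, j])   # componentwise LS coefficient
        beta[j] += nu * step                   # shrunken update
        resid -= nu * step * X[:, j]
    selected = np.flatnonzero(beta)
    beta_post = np.zeros(p)
    beta_post[selected] = np.linalg.lstsq(X[:, selected], y, rcond=None)[0]
    return selected, beta_post

# toy sparse design: 3 active features out of 50
rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)
selected, beta_post = boost_then_ols(X, y)
```

On this toy problem the boosting stage recovers the true support (possibly with a few spurious extras), and the OLS refit removes the shrinkage bias of the boosted coefficients.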
The Discrete Dantzig Selector: Estimating Sparse Linear Models via Mixed Integer Linear Optimization
We propose a novel high-dimensional linear regression estimator: the Discrete
Dantzig Selector, which minimizes the number of nonzero regression coefficients
subject to a budget on the maximal absolute correlation between the features
and residuals. Motivated by the significant advances in integer optimization
over the past 10-15 years, we present a Mixed Integer Linear Optimization
(MILO) approach to obtain certifiably optimal global solutions to this
nonconvex optimization problem. The current state of algorithmics in integer
optimization makes our proposal substantially more computationally attractive
than the least squares subset selection framework based on integer quadratic
optimization, recently proposed by Bertsimas et al. [8], and the continuous
nonconvex quadratic optimization framework of Liu et al. [33]. We propose new
discrete first-order methods,
which when paired with state-of-the-art MILO solvers, lead to good solutions
for the Discrete Dantzig Selector problem for a given computational budget. We
illustrate that our integrated approach provides globally optimal solutions in
significantly shorter computation times, when compared to off-the-shelf MILO
solvers. We demonstrate both theoretically and empirically that in a wide range
of regimes the statistical properties of the Discrete Dantzig Selector are
superior to those of popular ℓ1-based approaches. We illustrate that
our approach can handle problem instances with p = 10,000 features with
certifiable optimality, making it a highly scalable combinatorial variable
selection approach in sparse linear modeling.
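The underlying problem is: minimize the number of nonzeros in beta subject to max_j |X_j^T (y - X beta)| <= delta. The sketch below is not the paper's MILO approach; it brute-forces tiny instances, enumerating supports by size and using the OLS fit on each candidate support as a sufficient (not exact) feasibility certificate for the constraint. All data and the delta value are illustrative.

```python
import numpy as np
from itertools import combinations

def discrete_dantzig_brute(X, y, delta):
    """Smallest support S whose OLS fit satisfies the Dantzig-type
    constraint max_j |X_j^T (y - X beta)| <= delta.
    Checking only the OLS point on each support is a sufficient
    certificate, not the full linear-feasibility subproblem a MILO
    formulation would solve; brute force is viable only for tiny p."""
    n, p = X.shape
    for k in range(p + 1):                     # supports by increasing size
        for S in combinations(range(p), k):
            beta = np.zeros(p)
            if k:
                beta[list(S)] = np.linalg.lstsq(X[:, list(S)], y,
                                                rcond=None)[0]
            if np.max(np.abs(X.T @ (y - X @ beta))) <= delta:
                return S, beta
    return None, None

# tiny instance: 2 active features out of 6
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.01 * rng.standard_normal(50)
S, beta_hat = discrete_dantzig_brute(X, y, delta=1.0)
```

Dropping either active feature leaves a residual correlation far above delta, so the minimal feasible support is exactly the true one; the MILO formulation in the paper certifies this kind of optimality at vastly larger scale.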
A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-ℓ1-Norm Interpolated Classifiers
This paper establishes a precise high-dimensional asymptotic theory for
boosting on separable data, taking statistical and computational perspectives.
We consider a high-dimensional setting where the number of features (weak
learners) scales with the sample size n, in an overparametrized regime.
Under a class of statistical models, we provide an exact analysis of the
generalization error of boosting when the algorithm interpolates the training
data and maximizes the empirical ℓ1-margin. Further, we explicitly pin
down the relation between the boosting test error and the optimal Bayes error,
as well as the proportion of active features at interpolation (with zero
initialization). In turn, these precise characterizations answer certain
questions raised in Breiman (1999) and Schapire et al. (1998)
surrounding boosting, under assumed data generating processes. At the heart of
our theory lies an in-depth study of the maximum ℓ1-margin, which can be
accurately described by a new system of non-linear equations; to analyze this
margin, we rely on Gaussian comparison techniques and develop a novel uniform
deviation argument. Our statistical and computational arguments can handle (1)
any finite-rank spiked covariance model for the feature distribution and (2)
variants of boosting corresponding to general ℓq-geometry. As a final
component, via the Lindeberg principle, we establish a
universality result showcasing that the scaled ℓ1-margin (asymptotically)
remains the same, whether the covariates used for boosting arise from a
non-linear random feature model or an appropriately linearized model with
matching moments.
Comment: 68 pages, 4 figures
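The maximum ℓ1-margin at the center of this analysis is itself computable as a linear program: maximize t subject to y_i x_i^T θ ≥ t for all i and ||θ||_1 ≤ 1. The sketch below, using scipy's linprog with the standard split θ = θ⁺ − θ⁻ and a hand-made separable toy dataset, illustrates the quantity being analyzed, not the paper's asymptotic machinery.

```python
import numpy as np
from scipy.optimize import linprog

def max_l1_margin(X, y):
    """Max-l1-margin of separable data: maximize t subject to
    y_i * (x_i @ theta) >= t and ||theta||_1 <= 1, written as an LP
    over [theta_plus, theta_minus, t]."""
    n, p = X.shape
    c = np.zeros(2 * p + 1)
    c[-1] = -1.0                               # minimize -t, i.e. maximize t
    A = np.zeros((n + 1, 2 * p + 1))
    A[:n, :p] = -y[:, None] * X                # rows: t - y_i x_i^T theta <= 0
    A[:n, p:2 * p] = y[:, None] * X
    A[:n, -1] = 1.0
    A[n, :2 * p] = 1.0                         # sum(theta+ + theta-) <= 1
    b = np.zeros(n + 1)
    b[-1] = 1.0
    bounds = [(0, None)] * (2 * p) + [(None, None)]   # t is free
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)
    theta = res.x[:p] - res.x[p:2 * p]
    return res.x[-1], theta

# hand-made separable toy data (2 features)
X = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
margin, theta = max_l1_margin(X, y)
```

On this toy dataset the optimal direction puts all its ℓ1 budget on the first coordinate, achieving margin 1.5; boosting on separable data drives its normalized iterates toward exactly this kind of max-ℓ1-margin direction.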
A new perspective on boosting in linear regression via subgradient optimization and relatives
We analyze boosting algorithms [Ann. Statist. 29 (2001) 1189-1232; Ann. Statist. 28 (2000) 337-407; Ann. Statist. 32 (2004) 407-499] in linear regression from a new perspective: that of modern first-order methods in convex optimization. We show that classic boosting algorithms in linear regression, namely the incremental forward stagewise algorithm (FSε) and least squares boosting (LS-BOOST(ε)), can be viewed as subgradient descent to minimize the loss function defined as the maximum absolute correlation between the features and residuals. We also propose a minor modification of FSε that yields an algorithm for the LASSO, and that may be easily extended to an algorithm that computes the LASSO path for different values of the regularization parameter. Furthermore, we show that these new algorithms for the LASSO may also be interpreted as the same master algorithm (subgradient descent), applied to a regularized version of the maximum absolute correlation loss function. We derive novel, comprehensive computational guarantees for several boosting algorithms in linear regression (including LS-BOOST(ε) and FSε) by using techniques of first-order methods in convex optimization. Our computational guarantees inform us about the statistical properties of boosting algorithms. In particular, they provide, for the first time, a precise theoretical description of the amount of data-fidelity and regularization imparted by running a boosting algorithm with a prespecified learning rate for a fixed but arbitrary number of iterations, for any dataset.
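The subgradient-descent view can be illustrated with a small numpy implementation of FSε: each step moves a fixed amount eps in the sign direction of the feature most correlated with the current residual, which is a subgradient step on the loss max_j |X_j^T r| / n. The step size, iteration count, and data below are illustrative choices, not the paper's tuned settings.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, iters=800):
    """Incremental forward stagewise (FS_eps): each step moves eps in
    the sign direction of the feature most correlated with the
    residual, i.e. a subgradient step on max_j |X_j^T r| / n."""
    n, p = X.shape
    beta, r = np.zeros(p), y.astype(float).copy()
    loss = []
    for _ in range(iters):
        c = X.T @ r / n                    # correlations with residual
        j = np.argmax(np.abs(c))           # active subgradient coordinate
        loss.append(abs(c[j]))             # current max-abs-correlation loss
        beta[j] += eps * np.sign(c[j])
        r -= eps * np.sign(c[j]) * X[:, j]
    return beta, loss

# sparse toy regression: 2 active features out of 10
rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -1.0]
y = X @ beta_true + 0.05 * rng.standard_normal(n)
beta, loss = forward_stagewise(X, y)
```

Tracking the loss sequence makes the paper's point visible in miniature: the maximum absolute correlation is driven down over the iterations, and stopping early yields a shrunken, regularized coefficient vector while running longer increases data-fidelity.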