Generalized Boosting Algorithms for Convex Optimization
Boosting is a popular way to derive powerful learners from simpler hypothesis
classes. Following previous work (Mason et al., 1999; Friedman, 2000) on
general boosting frameworks, we analyze gradient-based descent algorithms for
boosting with respect to any convex objective and introduce a new measure of
weak learner performance into this setting which generalizes existing work. We
present the weak to strong learning guarantees for the existing gradient
boosting work for strongly-smooth, strongly-convex objectives under this new
measure of performance, and also demonstrate that this work fails for
non-smooth objectives. To address this issue, we present new algorithms which
extend this boosting approach to arbitrary convex loss functions and give
corresponding weak to strong convergence results. In addition, we demonstrate
experimental results that support our analysis and demonstrate the need for the
new algorithms we present.
Comment: Extended version of paper presented at the International Conference
on Machine Learning, 2011. 9 pages + appendix with proof
Proximal boosting and its acceleration
Gradient boosting is a prediction method that iteratively combines weak
learners to produce a complex and accurate model. From an optimization point of
view, the learning procedure of gradient boosting mimics a gradient descent on
a functional variable. This paper proposes to build upon the proximal point
algorithm when the empirical risk to minimize is not differentiable to
introduce a novel boosting approach, called proximal boosting. Besides being
motivated by non-differentiable optimization, the proposed algorithm benefits
from Nesterov's acceleration in the same way as gradient boosting [Biau et al.,
2018]. This leads to a variant, called accelerated proximal boosting.
Advantages of leveraging proximal methods for boosting are illustrated by
numerical experiments on simulated and real-world data. In particular, we
exhibit a favorable comparison over gradient boosting regarding convergence
rate and prediction accuracy
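
The "gradient descent on a functional variable" view mentioned in this abstract can be illustrated with a minimal sketch. It shows plain gradient boosting only, not the proximal or accelerated variants the paper proposes. The squared loss, the 1-D regression stumps, and all function names (`fit_stump`, `gradient_boost`, `predict`, `mse`) are assumptions chosen for illustration and do not come from the paper.

```python
# Sketch of gradient boosting as gradient descent in function space.
# Assumptions (not from the abstract): squared loss, 1-D inputs, regression stumps.

def fit_stump(xs, targets):
    """Least-squares regression stump: returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else lv  # no points to the right of max(xs)
        err = sum((r - (lv if x <= t else rv)) ** 2 for x, r in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def gradient_boost(xs, ys, n_rounds=20, step=0.5):
    """Returns (initial constant, step size, list of stumps)."""
    init = sum(ys) / len(ys)
    preds = [init] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - F(x))^2, the negative gradient with respect
        # to the current prediction F(x) is the residual y - F(x).
        neg_grad = [y - p for y, p in zip(ys, preds)]
        # The weak learner approximates the negative gradient direction.
        stump = fit_stump(xs, neg_grad)
        stumps.append(stump)
        # Descent step in function space.
        preds = [p + step * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return init, step, stumps

def predict(model, x):
    init, step, stumps = model
    return init + step * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy usage on a quadratic target.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
model = gradient_boost(xs, ys)
print(mse(model, xs, ys))
```

For a non-differentiable loss the residual line would have no well-defined gradient to use, which is the situation the abstract says proximal boosting is designed for.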
Boosting with early stopping: Convergence and consistency
Boosting is one of the most significant advances in machine learning for
classification and regression. In its original and computationally flexible
version, boosting seeks to minimize empirically a loss function in a greedy
fashion. The resulting estimator takes an additive function form and is built
iteratively by applying a base estimator (or learner) to updated samples
depending on the previous iterations. An unusual regularization technique,
early stopping, is employed based on CV or a test set. This paper studies
numerical convergence, consistency and statistical rates of convergence of
boosting with early stopping, when it is carried out over the linear span of a
family of basis functions. For general loss functions, we prove the convergence
of boosting's greedy optimization to the infimum of the loss function over
the linear span. Using the numerical convergence result, we find early-stopping
strategies under which boosting is shown to be consistent based on i.i.d.
samples, and we obtain bounds on the rates of convergence for boosting
estimators. Simulation studies are also presented to illustrate the relevance
of our theoretical results for providing insights to practical aspects of
boosting. As a side product, these results also reveal the importance of
restricting the greedy search step-sizes, as known in practice through the work
of Friedman and others. Moreover, our results lead to a rigorous proof that for
a linearly separable problem, AdaBoost with \epsilon\to0 step-size becomes an
L^1-margin maximizer when left to run to convergence.
Comment: Published at http://dx.doi.org/10.1214/009053605000000255 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org
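
The setting described in this abstract (greedy boosting over the linear span of basis functions, with a restricted step size and early stopping on held-out data) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's procedure or its stopping rule: the squared loss, the patience-based stopping criterion, the polynomial basis, and every name (`boost_with_early_stopping`, `patience`, `step`) are hypothetical choices.

```python
# Sketch of greedy boosting over a linear span with early stopping.
# Assumptions (not from the abstract): squared loss, a validation set,
# and a "patience" rule that stops once validation loss stops improving.

def predict(coef, basis, x):
    return sum(c * b(x) for c, b in zip(coef, basis))

def mse(coef, basis, data):
    return sum((y - predict(coef, basis, x)) ** 2 for x, y in data) / len(data)

def boost_with_early_stopping(basis, train, valid,
                              max_rounds=200, step=0.1, patience=10):
    coef = [0.0] * len(basis)
    best_coef, best_val, best_round = list(coef), mse(coef, basis, valid), 0
    for m in range(1, max_rounds + 1):
        residuals = [y - predict(coef, basis, x) for x, y in train]
        # Greedy search: pick the basis function whose least-squares fit
        # to the residuals reduces the training error the most.
        pick, pick_gain, pick_beta = None, 0.0, 0.0
        for j, b in enumerate(basis):
            vals = [b(x) for x, _ in train]
            norm = sum(v * v for v in vals)
            if norm == 0.0:
                continue
            beta = sum(v * r for v, r in zip(vals, residuals)) / norm
            gain = beta * beta * norm
            if gain > pick_gain:
                pick, pick_gain, pick_beta = j, gain, beta
        if pick is None:  # residuals are orthogonal to every basis function
            break
        # Restricted step: move only a fraction of the full least-squares step.
        coef[pick] += step * pick_beta
        val = mse(coef, basis, valid)
        if val < best_val:
            best_coef, best_val, best_round = list(coef), val, m
        elif m - best_round >= patience:
            break  # early stopping
    return best_coef, best_round, best_val

# Toy usage: linear target, polynomial basis.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
valid = [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0)]
coef, stopped_at, val_loss = boost_with_early_stopping(basis, train, valid)
print(coef, stopped_at, val_loss)
```

The `step` parameter plays the role of the restricted greedy step size that the abstract highlights; setting it to 1.0 gives the unrestricted greedy update.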
Structured Learning via Logistic Regression
A successful approach to structured learning is to write the learning
objective as a joint function of linear parameters and inference messages, and
iterate between updates to each. This paper observes that if the inference
problem is "smoothed" through the addition of entropy terms, for fixed
messages, the learning objective reduces to a traditional (non-structured)
logistic regression problem with respect to parameters. In these logistic
regression problems, each training example has a bias term determined by the
current set of messages. Based on this insight, the structured energy function
can be extended from linear factors to any function class where an "oracle"
exists to minimize a logistic loss.
Comment: Advances in Neural Information Processing Systems 201