Optimization by gradient boosting
Gradient boosting is a state-of-the-art prediction technique that
sequentially produces a model in the form of linear combinations of simple
predictors---typically decision trees---by solving an infinite-dimensional
convex optimization problem. We provide in the present paper a thorough
analysis of two widespread versions of gradient boosting, and introduce a
general framework for studying these algorithms from the point of view of
functional optimization. We prove their convergence as the number of iterations
tends to infinity and highlight the importance of having a strongly convex risk
functional to minimize. We also present a reasonable statistical context
ensuring consistency properties of the boosting predictors as the sample size
grows. In our approach, the optimization procedures are run forever (that is,
without resorting to an early stopping strategy), and statistical
regularization is basically achieved via an appropriate penalization of
the loss and strong convexity arguments.
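The functional-optimization view in this abstract can be illustrated with a minimal sketch. It assumes a squared-error loss, one-dimensional inputs, and regression stumps as the simple predictors; the function names and defaults are hypothetical. It is not the paper's algorithm, and it omits the penalization the paper relies on.

```python
# Minimal gradient boosting sketch: squared-error loss, 1-D inputs, stumps.
# Function names and defaults are illustrative assumptions, not from the paper.

def fit_stump(xs, residuals):
    """Least-squares regression stump: returns (threshold, left_mean, right_mean)."""
    mean_all = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], mean_all, mean_all)  # fallback: constant fit
    # Skip the largest x as a threshold: it would leave the right leaf empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]

def predict(model, x):
    """Evaluate the additive model: initial constant plus shrunken stumps."""
    total = model["init"]
    for t, ml, mr in model["stumps"]:
        total += model["step"] * (ml if x <= t else mr)
    return total

def boost(xs, ys, n_iter=50, step=0.1):
    """Run n_iter functional-gradient steps with a fixed step size."""
    model = {"init": sum(ys) / len(ys), "step": step, "stumps": []}
    for _ in range(n_iter):
        # For the loss (1/2)(y - F(x))^2 the negative gradient is the residual.
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model["stumps"].append(fit_stump(xs, residuals))
    return model

def mse(model, xs, ys):
    """Mean squared error of the model on (xs, ys)."""
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each iteration fits a stump to the current residuals and adds a shrunken copy of it to the model, so the training error cannot increase from one step to the next.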
On surrogate loss functions and f-divergences
The goal of binary classification is to estimate a discriminant function
from observations of covariate vectors and corresponding binary
labels. We consider an elaboration of this problem in which the covariates are
not available directly but are transformed by a dimensionality-reducing
quantizer. We present conditions on loss functions such that empirical risk
minimization yields Bayes consistency when both the discriminant function and
the quantizer are estimated. These conditions are stated in terms of a general
correspondence between loss functions and a class of functionals known as
Ali-Silvey or f-divergence functionals. Whereas this correspondence was
established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1 (1951)
93--102. Univ. California Press, Berkeley] for the 0--1 loss, we extend the
correspondence to the broader class of surrogate loss functions that play a key
role in the general theory of Bayes consistency for binary classification. Our
result makes it possible to pick out the (strict) subset of surrogate loss
functions that yield Bayes consistency for joint estimation of the discriminant
function and the quantizer.
Comment: Published at http://dx.doi.org/10.1214/08-AOS595 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
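For orientation, the 0--1 loss case of the correspondence can be written out explicitly. The notation below is assumed for illustration and may differ from the paper's: $Z$ is the quantized covariate, taking finitely many values, with $\mu(z) = P(Y=1, Z=z)$ and $\pi(z) = P(Y=-1, Z=z)$.

```latex
% f-divergence for a convex function f (assuming \pi(z) > 0 for all z)
I_f(\mu,\pi) = \sum_z \pi(z)\, f\!\left(\frac{\mu(z)}{\pi(z)}\right)

% Optimal 0-1 risk given Z, written as a negative f-divergence
R_{\mathrm{Bayes}} = \sum_z \min\{\mu(z), \pi(z)\}
                   = -I_f(\mu,\pi), \qquad f(u) = -\min\{u, 1\}

% Equivalent form via the variational distance, using \sum_z (\mu(z) + \pi(z)) = 1
R_{\mathrm{Bayes}} = \tfrac{1}{2} - \tfrac{1}{2} \sum_z |\mu(z) - \pi(z)|
```

The last line follows from $\min\{a,b\} = \tfrac{1}{2}(a + b - |a - b|)$.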
Boosting with early stopping: Convergence and consistency
Boosting is one of the most significant advances in machine learning for
classification and regression. In its original and computationally flexible
version, boosting seeks to minimize empirically a loss function in a greedy
fashion. The resulting estimator takes an additive function form and is built
iteratively by applying a base estimator (or learner) to updated samples
depending on the previous iterations. An unusual regularization technique,
early stopping, is employed based on CV or a test set. This paper studies
numerical convergence, consistency and statistical rates of convergence of
boosting with early stopping, when it is carried out over the linear span of a
family of basis functions. For general loss functions, we prove the convergence
of boosting's greedy optimization to the infimum of the loss function over
the linear span. Using the numerical convergence result, we find early-stopping
strategies under which boosting is shown to be consistent based on i.i.d.
samples, and we obtain bounds on the rates of convergence for boosting
estimators. Simulation studies are also presented to illustrate the relevance
of our theoretical results for providing insights to practical aspects of
boosting. As a side product, these results also reveal the importance of
restricting the greedy search step-sizes, as known in practice through the work
of Friedman and others. Moreover, our results lead to a rigorous proof that for
a linearly separable problem, AdaBoost with $\epsilon\to 0$ step-size becomes an
$L^1$-margin maximizer when left to run to convergence.
Comment: Published at http://dx.doi.org/10.1214/009053605000000255 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
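The combination of a restricted step size and early stopping described in this abstract might be sketched as follows. This is a hedged illustration only: it assumes a squared-error loss, coordinate functions as the basis family, a fixed shrinkage factor, and a patience rule on a held-out set. The names `make_data` and `boost_early_stop` are hypothetical, and the stopping rule is not the one analyzed in the paper.

```python
import random

def make_data(n, seed=0):
    """Synthetic regression data (illustrative): y = 2*x0 - x1 + small noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [2 * row[0] - row[1] + rng.gauss(0, 0.1) for row in X]
    return X, y

def boost_early_stop(X_tr, y_tr, X_val, y_val, step=0.1, max_iter=500, patience=20):
    """Greedy componentwise least-squares boosting with validation-based stopping."""
    d = len(X_tr[0])
    coef = [0.0] * d

    def preds(X):
        return [sum(c * v for c, v in zip(coef, row)) for row in X]

    def mse(X, y):
        return sum((yi - pi) ** 2 for yi, pi in zip(y, preds(X))) / len(y)

    best_val, best_coef, best_iter = mse(X_val, y_val), coef[:], 0
    for m in range(1, max_iter + 1):
        resid = [yi - pi for yi, pi in zip(y_tr, preds(X_tr))]
        # Greedy step: pick the coordinate whose least-squares fit to the
        # residuals reduces the training sum of squares the most.
        best_j, best_gain, best_beta = None, 0.0, 0.0
        for j in range(d):
            sxx = sum(row[j] ** 2 for row in X_tr)
            if sxx == 0:
                continue
            sxr = sum(row[j] * r for row, r in zip(X_tr, resid))
            gain = sxr * sxr / sxx
            if gain > best_gain:
                best_j, best_gain, best_beta = j, gain, sxr / sxx
        if best_j is None:
            break  # residuals are orthogonal to every basis function
        coef[best_j] += step * best_beta  # restricted (shrunken) step
        val = mse(X_val, y_val)
        if val < best_val:
            best_val, best_coef, best_iter = val, coef[:], m
        elif m - best_iter >= patience:
            break  # validation error has stopped improving
    return best_coef, best_iter, best_val
```

The function returns the coefficients from the iteration with the lowest validation error, not those from the final iteration.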
Constrained Classification and Policy Learning
Modern machine learning approaches to classification, including AdaBoost,
support vector machines, and deep neural networks, utilize surrogate loss
techniques to circumvent the computational complexity of minimizing empirical
classification risk. These techniques are also useful for causal policy
learning problems, since estimation of individualized treatment rules can be
cast as a weighted (cost-sensitive) classification problem. Consistency of the
surrogate loss approaches studied in Zhang (2004) and Bartlett et al. (2006)
crucially relies on the assumption of correct specification, meaning that the
specified set of classifiers is rich enough to contain a first-best classifier.
This assumption is, however, less credible when the set of classifiers is
constrained by interpretability or fairness, leaving the applicability of
surrogate loss based algorithms unknown in such second-best scenarios. This
paper studies consistency of surrogate loss procedures under a constrained set
of classifiers without assuming correct specification. We show that in the
setting where the constraint restricts the classifier's prediction set only,
hinge losses (i.e., support vector machines) are the only surrogate
losses that preserve consistency in second-best scenarios. If the constraint
additionally restricts the functional form of the classifier, consistency of a
surrogate loss approach is not guaranteed even with hinge loss. We therefore
characterize conditions for the constrained set of classifiers that can
guarantee consistency of hinge risk minimizing classifiers. Exploiting our
theoretical results, we develop robust and computationally attractive hinge
loss based procedures for a monotone classification problem.
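As a small illustration of the surrogate-loss idea in this abstract, the sketch below compares the 0-1 loss with the hinge loss in a weighted (cost-sensitive) risk. The function names, and the convention that a score of exactly zero counts as an error, are assumptions made here and are not taken from the paper.

```python
def zero_one_loss(y, score):
    """Classification loss for a label y in {-1, +1}; a zero score counts as an error."""
    return 1.0 if y * score <= 0 else 0.0

def hinge_loss(y, score):
    """Convex surrogate: zero only when the margin y * score is at least 1."""
    return max(0.0, 1.0 - y * score)

def weighted_risk(loss, ys, scores, weights):
    """Weighted empirical risk, normalized by the total weight."""
    total = sum(weights)
    return sum(w * loss(y, s) for y, s, w in zip(ys, scores, weights)) / total
```

Under this convention the hinge loss is at least as large as the 0-1 loss at every point, so the weighted hinge risk upper-bounds the weighted classification risk for any nonnegative weights.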
Greedy Algorithms for Classification - Consistency, Convergence Rates, and Adaptivity
Many regression and classification algorithms proposed over the years can be described as greedy procedures for the stagewise minimization of an appropriate cost function. Some examples include additive models, matching pursuit, and boosting. In this work we focus on the classification problem, for which many recent algorithms have been proposed and applied successfully. For a specific regularized form of greedy stagewise optimization, we prove consistency of the approach under rather general conditions. Focusing on specific classes of problems we provide conditions under which our greedy procedure achieves the (nearly) minimax rate of convergence, implying that the procedure cannot be improved in a worst case setting. We also construct a fully adaptive procedure, which, without knowing the smoothness parameter of the decision boundary, converges at the same rate as if the smoothness parameter were known.
Applications of empirical processes in learning theory : algorithmic stability and generalization bounds
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2006. Includes bibliographical references (p. 141-148).
This thesis studies two key properties of learning algorithms: their generalization ability and their stability with respect to perturbations. To analyze these properties, we focus on concentration inequalities and tools from empirical process theory. We obtain theoretical results and demonstrate their applications to machine learning. First, we show how various notions of stability upper- and lower-bound the bias and variance of several estimators of the expected performance for general learning algorithms. A weak stability condition is shown to be equivalent to consistency of empirical risk minimization. The second part of the thesis derives tight performance guarantees for greedy error minimization methods - a family of computationally tractable algorithms. In particular, we derive risk bounds for a greedy mixture density estimation procedure. We prove that, unlike what is suggested in the literature, the number of terms in the mixture is not a bias-variance trade-off for the performance. The third part of this thesis provides a solution to an open problem regarding the stability of Empirical Risk Minimization (ERM). This algorithm is of central importance in Learning Theory. By studying the suprema of the empirical process, we prove that ERM over Donsker classes of functions is stable in the L1 norm. Hence, as the number of samples grows, it becomes less and less likely that a perturbation of $o(\sqrt{n})$ samples will result in a very different empirical minimizer. Asymptotic rates of this stability are proved under metric entropy assumptions on the function class. Through the use of a ratio limit inequality, we also prove stability of expected errors of empirical minimizers. Next, we investigate applications of the stability result. In particular, we focus on procedures that optimize an objective function, such as k-means and other clustering methods. We demonstrate that stability of clustering, just like stability of ERM, is closely related to the geometry of the class and the underlying measure. Furthermore, our result on stability of ERM delineates a phase transition between stability and instability of clustering methods. In the last chapter, we prove a generalization of the bounded-difference concentration inequality for almost-everywhere smooth functions. This result can be utilized to analyze algorithms which are almost always stable. Next, we prove a phase transition in the concentration of almost-everywhere smooth functions. Finally, a tight concentration of empirical errors of empirical minimizers is shown under an assumption on the underlying space.
by Alexander Rakhlin. Ph.D.