17 research outputs found
Binary Classification with Instance and Label Dependent Label Noise
Learning with label dependent label noise has been extensively explored in
both theory and practice; however, dealing with instance (i.e., feature) and
label dependent label noise continues to be a challenging task. The difficulty
arises from the fact that the noise rate varies for each instance, making it
challenging to estimate accurately. The question of whether it is possible to
learn a reliable model using only noisy samples remains unresolved. We answer
this question with a theoretical analysis that provides matching upper and
lower bounds. Surprisingly, our results show that, without any additional
assumptions, empirical risk minimization achieves the optimal excess risk
bound. Specifically, we first derive a novel excess risk bound proportional to the
noise level, which holds in very general settings, by comparing the empirical
risk minimizers obtained from clean samples and noisy samples. Second, we show
that the minimax lower bound for the 0-1 loss is a constant proportional to the
average noise rate. Our findings suggest that learning solely with noisy
samples is impossible without access to clean samples or strong assumptions on
the distribution of the data.
Generalization Bounds in the Predict-then-Optimize Framework
The predict-then-optimize framework is fundamental in many practical
settings: predict the unknown parameters of an optimization problem, and then
solve the problem using the predicted values of the parameters. A natural loss
function in this environment is to consider the cost of the decisions induced
by the predicted parameters, in contrast to the prediction error of the
parameters. This loss function was recently introduced in Elmachtoub and Grigas
(2017) and referred to as the Smart Predict-then-Optimize (SPO) loss. In this
work, we seek to provide bounds on how well the performance of a prediction
model fit on training data generalizes out-of-sample, in the context of the SPO
loss. Since the SPO loss is non-convex and non-Lipschitz, standard results for
deriving generalization bounds do not apply.
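As a minimal illustration of why the SPO loss is discontinuous, the sketch below evaluates the decision cost of a predicted vector against the true cost vector over a polytope given by an explicit vertex list. The names `spo_loss` and `argmin_vertex` are hypothetical, a minimization problem is assumed, and ties are broken by list order, which is a simplification.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def argmin_vertex(cost: Sequence[float], vertices: Sequence[Vector]) -> Vector:
    """Return a vertex minimizing cost^T w (ties broken by list order)."""
    return min(vertices, key=lambda w: dot(cost, w))


def spo_loss(c_hat: Sequence[float], c: Sequence[float],
             vertices: Sequence[Vector]) -> float:
    """Excess true cost of the decision induced by the prediction c_hat."""
    w_hat = argmin_vertex(c_hat, vertices)   # decision made from the prediction
    w_star = argmin_vertex(c, vertices)      # decision made with full knowledge
    return dot(c, w_hat) - dot(c, w_star)


if __name__ == "__main__":
    # Assumed toy instance: choose one of two items, true costs (1, 2).
    vertices = [(1.0, 0.0), (0.0, 1.0)]
    c_true = (1.0, 2.0)
    # Two nearby predictions induce different decisions and different losses.
    print(spo_loss((0.9, 1.0), c_true, vertices))
    print(spo_loss((1.1, 1.0), c_true, vertices))
```

The two predictions differ only slightly, yet they select different vertices, so the loss jumps between them. This is the kind of behavior a margin around non-unique optimal solutions is meant to control.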
We first derive bounds based on the Natarajan dimension that, in the case of
a polyhedral feasible region, scale at most logarithmically in the number of
extreme points, but, in the case of a general convex feasible region, have
linear dependence on the decision dimension. By exploiting the structure of the
SPO loss function and a key property of the feasible region, which we denote as
the strength property, we can dramatically improve the dependence on the
decision and feature dimensions. Our approach and analysis rely on placing a
margin around problematic predictions that do not yield unique optimal
solutions, and then providing generalization bounds in the context of a
modified margin SPO loss function that is Lipschitz continuous. Finally, we
characterize the strength property and show that the modified SPO loss can be
computed efficiently for both strongly convex bodies and polytopes with an
explicit extreme point representation.
Comment: Preliminary version in NeurIPS 201
New Analysis and Results for the Conditional Gradient Method
We present new results for the conditional gradient method (also known as the Frank-Wolfe method). We derive computational guarantees for arbitrary step-size sequences, which are then applied to various step-size rules, including simple averaging and constant step-sizes. We also develop step-size rules and computational guarantees that depend naturally on the warm-start quality of the initial (and subsequent) iterates. Our results include computational guarantees for both duality/bound gaps and the so-called Wolfe gaps. Lastly, we present complexity bounds in the presence of approximate computation of gradients and/or linear optimization subproblem solutions.
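A minimal sketch of the conditional gradient iteration, assuming a toy problem that is not taken from the paper: minimize 0.5 * ||x - b||^2 over the unit simplex, with the common step-size rule 2/(k+2). The function name `frank_wolfe_simplex` is hypothetical. The Wolfe gap computed here is grad^T (x - s), where s solves the linear optimization subproblem.

```python
from typing import List, Sequence, Tuple


def frank_wolfe_simplex(b: Sequence[float],
                        num_iters: int = 200) -> Tuple[List[float], float]:
    """Run Frank-Wolfe on f(x) = 0.5 * ||x - b||^2 over the unit simplex.

    Returns the final iterate and the Wolfe gap measured at the last
    iterate for which a gradient was evaluated.
    """
    n = len(b)
    x = [1.0 / n] * n                      # start at the simplex barycenter
    gap = float("inf")
    for k in range(num_iters):
        grad = [xi - bi for xi, bi in zip(x, b)]
        # Linear subproblem over the simplex: best vertex e_i has the
        # smallest gradient coordinate.
        i = min(range(n), key=lambda j: grad[j])
        # Wolfe gap: grad^T (x - e_i), always nonnegative on the simplex.
        gap = sum(g * xi for g, xi in zip(grad, x)) - grad[i]
        step = 2.0 / (k + 2)               # assumed step-size rule
        x = [(1.0 - step) * xj for xj in x]
        x[i] += step                       # move toward the chosen vertex
    return x, gap


if __name__ == "__main__":
    x, gap = frank_wolfe_simplex([0.2, 0.3, 0.5])
    print(x, gap)
```

Other step-size sequences, such as a constant step, can be tried by changing the single `step` line.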
Online Contextual Decision-Making with a Smart Predict-then-Optimize Method
We study an online contextual decision-making problem with resource
constraints. At each time period, the decision-maker first predicts a reward
vector and resource consumption matrix based on a given context vector and then
solves a downstream optimization problem to make a decision. The final goal of
the decision-maker is to maximize the summation of the reward and the utility
from resource consumption, while satisfying the resource constraints. We
propose an algorithm that mixes a prediction step based on the "Smart
Predict-then-Optimize (SPO)" method with a dual update step based on mirror
descent. We prove regret bounds and demonstrate that the overall convergence
rate of our method depends on the convergence of online
mirror descent as well as risk bounds of the surrogate loss function used to
learn the prediction model. Our algorithm and regret bounds apply to a general
convex feasible region for the resource constraints, including both hard and
soft resource constraint cases, and they apply to a wide class of prediction
models in contrast to the traditional settings of linear contextual models or
finite policy spaces. We also conduct numerical experiments to empirically
demonstrate the strength of our proposed SPO-type methods, as compared to
traditional prediction-error-only methods, on multi-dimensional knapsack and
longest path instances.
A new perspective on boosting in linear regression via subgradient optimization and relatives
We analyze boosting algorithms [Ann. Statist. 29 (2001) 1189–1232; Ann. Statist. 28 (2000) 337–407; Ann. Statist. 32 (2004) 407–499] in linear regression from a new perspective: that of modern first-order methods in convex optimization. We show that classic boosting algorithms in linear regression, namely the incremental forward stagewise algorithm (FSε) and least squares boosting [LS-BOOST(ε)], can be viewed as subgradient descent to minimize the loss function defined as the maximum absolute correlation between the features and residuals. We also propose a minor modification of FSε that yields an algorithm for the LASSO, and that may be easily extended to an algorithm that computes the LASSO path for different values of the regularization parameter. Furthermore, we show that these new algorithms for the LASSO may also be interpreted as the same master algorithm (subgradient descent), applied to a regularized version of the maximum absolute correlation loss function. We derive novel, comprehensive computational guarantees for several boosting algorithms in linear regression (including LS-BOOST(ε) and FSε) by using techniques of first-order methods in convex optimization. Our computational guarantees inform us about the statistical properties of boosting algorithms. In particular, they provide, for the first time, a precise theoretical description of the amount of data-fidelity and regularization imparted by running a boosting algorithm with a prespecified learning rate for a fixed but arbitrary number of iterations, for any dataset.
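A minimal sketch of the incremental forward stagewise update, assuming feature columns that are already scaled to comparable norm. The function name `forward_stagewise` and its parameters are hypothetical. At each iteration the coefficient of the feature most correlated with the residual moves by a fixed learning rate `eps` in the direction of that correlation.

```python
from typing import List, Sequence, Tuple


def forward_stagewise(columns: Sequence[Sequence[float]], y: Sequence[float],
                      eps: float = 0.01,
                      num_iters: int = 1000) -> Tuple[List[float], List[float]]:
    """Incremental forward stagewise regression with learning rate eps.

    `columns[j]` holds the values of feature j across all samples.
    Returns the coefficient vector and the final residual.
    """
    p = len(columns)
    beta = [0.0] * p
    residual = list(y)
    for _ in range(num_iters):
        # Correlation (inner product) of each feature with the residual.
        corr = [sum(xj * r for xj, r in zip(col, residual)) for col in columns]
        j = max(range(p), key=lambda i: abs(corr[i]))
        if corr[j] == 0.0:
            break                          # residual orthogonal to all features
        sign = 1.0 if corr[j] > 0 else -1.0
        beta[j] += eps * sign              # small step on the selected feature
        residual = [r - eps * sign * xj
                    for r, xj in zip(residual, columns[j])]
    return beta, residual


if __name__ == "__main__":
    # Assumed toy data with two orthonormal feature columns.
    cols = [(1.0, 0.0), (0.0, 1.0)]
    beta, res = forward_stagewise(cols, (0.5, -0.3))
    print(beta, res)
```

Stopping earlier or using a smaller `eps` leaves the coefficients closer to zero, which is the trade-off between data-fidelity and regularization described in the abstract.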
New analysis and results for the Frank–Wolfe method
We present new results for the Frank–Wolfe method (also known as the conditional gradient method). We derive computational guarantees for arbitrary step-size sequences, which are then applied to various step-size rules, including simple averaging and constant step-sizes. We also develop step-size rules and computational guarantees that depend naturally on the warm-start quality of the initial (and subsequent) iterates. Our results include computational guarantees for both duality/bound gaps and the so-called FW gaps. Lastly, we present complexity bounds in the presence of approximate computation of gradients and/or linear optimization subproblem solutions.
United States. Air Force Office of Scientific Research (AFOSR Grant No. FA9550-11-1-0141); Pontifical Catholic University of Chile (MIT-Chile-Pontificia Universidad Católica de Chile Seed Fund); National Science Foundation (U.S.) (NSF Graduate Research Fellowship No. 1122374)