Distributionally Robust Optimization and Robust Statistics
We review distributionally robust optimization (DRO), a principled approach
for constructing statistical estimators that hedge against the impact of
deviations in the expected loss between the training and deployment
environments. Many well-known estimators in statistics and machine learning
(e.g. AdaBoost, LASSO, ridge regression, dropout training, etc.) are
distributionally robust in a precise sense. We hope that, by discussing the DRO
interpretation of well-known estimators, statisticians who are less familiar
with DRO will find an entry point into the DRO literature through the bridge
between classical results and their DRO formulations. On the
other hand, the topic of robustness in statistics has a rich tradition
associated with removing the impact of contamination. Thus, another objective
of this paper is to clarify the difference between DRO and classical
statistical robustness. As we will see, these are two fundamentally different
philosophies leading to completely different types of estimators. In DRO, the
statistician hedges against an environment shift that occurs after the decision
is made; thus DRO estimators tend to be pessimistic in an adversarial setting,
leading to a min-max type formulation. In classical robust statistics, the
statistician seeks to correct contamination that occurred before a decision is
made; thus robust statistical estimators tend to be optimistic, leading to a
min-min type formulation.
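To make the contrast concrete, the two philosophies can be written schematically (notation is ours, not the paper's; here P_n is the empirical or baseline model, ell a loss, and U(P_n) a neighborhood of plausible environments or contaminations):

```latex
% DRO hedges against a worst-case environment shift after the decision:
\min_{\theta} \; \sup_{P \in \mathcal{U}(P_n)} \mathbb{E}_{P}\big[\ell(\theta; X)\big]
% Classical robust statistics corrects contamination before the decision,
% searching for the most favorable "clean" model in the neighborhood:
\min_{\theta} \; \inf_{P \in \mathcal{U}(P_n)} \mathbb{E}_{P}\big[\ell(\theta; X)\big]
```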
Statistical Estimation Under Distribution Shift: Wasserstein Perturbations and Minimax Theory
Distribution shifts are a serious concern in modern statistical learning as
they can systematically change the properties of the data away from the truth.
We focus on Wasserstein distribution shifts, where every data point may undergo
a slight perturbation, as opposed to the Huber contamination model where a
fraction of observations are outliers. We consider perturbations that are
either independent or coordinated joint shifts across data points. We analyze
several important statistical problems, including location estimation, linear
regression, and non-parametric density estimation. Under a squared loss for
mean estimation and prediction error in linear regression, we find the exact
minimax risk, a least favorable perturbation, and show that the sample mean and
least squares estimators are respectively optimal. For other problems, we
provide nearly optimal estimators and precise finite-sample bounds. We also
introduce several tools for bounding the minimax risk under general
distribution shifts, not just for Wasserstein perturbations, such as a
smoothing technique for location families, and generalizations of classical
tools including least favorable sequences of priors, the modulus of continuity,
as well as Le Cam's, Fano's, and Assouad's methods.
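For orientation, the minimax quantity studied in such problems can be written schematically, here in the simplest i.i.d.-shift form for mean estimation under squared loss (our notation, not quoted from the paper; the paper also treats per-data-point independent and joint perturbations):

```latex
% Minimax risk over Wasserstein perturbations of radius epsilon around P
% (schematic; our notation). The paper's optimality claim for the sample
% mean says the outer infimum is attained by the estimator \bar{X}_n.
\mathcal{M}_n(\epsilon) \;=\; \inf_{\hat{\theta}} \;
  \sup_{Q \,:\, W(Q, P) \le \epsilon}
  \mathbb{E}_{X_1,\dots,X_n \sim Q}
  \Big[\big(\hat{\theta}(X_1,\dots,X_n) - \mu(P)\big)^2\Big]
```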
Distributionally Robust Optimization: A Review
The concepts of risk-aversion, chance-constrained optimization, and robust
optimization have developed significantly over the last decade. The statistical
learning community has also witnessed rapid theoretical and applied growth by
relying on these concepts. A modeling framework, called distributionally robust
optimization (DRO), has recently received significant attention in both the
operations research and statistical learning communities. This paper surveys
the main concepts of and contributions to DRO, and its relationships with
robust optimization, risk-aversion, chance-constrained optimization, and
function regularization.
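As one hedged illustration of the regularization connection such surveys discuss (notation ours; an expansion of this flavor appears in the literature, e.g. in work of Duchi and Namkoong and of Lam, for suitably normalized phi):

```latex
% phi-divergence DRO over a shrinking ball is asymptotically a
% variance-regularized empirical objective (illustrative statement):
\sup_{P \,:\, D_{\phi}(P \,\|\, P_n) \le \rho/n}
  \mathbb{E}_{P}\big[\ell(\theta; X)\big]
  \;=\; \mathbb{E}_{P_n}\big[\ell(\theta; X)\big]
  \;+\; \sqrt{\tfrac{2\rho}{n}\,\mathrm{Var}_{P_n}\big(\ell(\theta; X)\big)}
  \;+\; o_{P}\big(n^{-1/2}\big)
```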
Distributionally Robust Optimization and its Applications in Machine Learning
The goal of Distributionally Robust Optimization (DRO) is to minimize the cost of running a stochastic system, under the assumption that an adversary can replace the underlying baseline stochastic model by another model within a family known as the distributional uncertainty region. This dissertation focuses on a class of data-driven DRO problems, which, generally speaking, means that the baseline stochastic model corresponds to the empirical distribution of a given sample.
One of the main contributions of this dissertation is to show that the class of data-driven DRO problems that we study unifies many successful machine learning algorithms, including square-root Lasso, support vector machines, and generalized logistic regression, among others. A key distinctive feature of the class of DRO problems that we consider here is that our distributional uncertainty region is based on optimal transport costs. In contrast, most of the DRO formulations that exist to date rely on a likelihood-based formulation (such as the Kullback-Leibler divergence, among others). Optimal transport costs include as a special case the so-called Wasserstein distance, which is popular in various statistical applications.
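For reference, the optimal transport cost appearing here has the standard definition (with Pi(P, Q) the set of couplings of P and Q, and c >= 0 a ground cost):

```latex
% Optimal transport cost with ground cost c; the Wasserstein distance of
% order p is recovered with c(x, x') = ||x - x'||^p (up to a p-th root).
D_c(P, Q) \;=\; \inf_{\pi \in \Pi(P, Q)}
  \mathbb{E}_{(X, X') \sim \pi}\big[c(X, X')\big]
```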
The use of optimal transport costs is advantageous relative to the use of divergence-based formulations because the region of distributional uncertainty contains distributions that assign mass outside of the support of the empirical measure, thereby explaining why many machine learning algorithms have the ability to improve generalization. Moreover, the DRO representations that we use to unify the previously mentioned machine learning algorithms provide a clear interpretation of the so-called regularization parameter, which is known to play a crucial role in controlling generalization error. As we establish, the regularization parameter corresponds exactly to the size of the distributional uncertainty region.
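As a hedged illustration of the "regularization parameter equals the radius" claim, the equivalence for square-root Lasso is usually reported in roughly the following form (notation ours; the ground cost moves covariates in an l_q norm and forbids moving labels, with 1/p + 1/q = 1):

```latex
% Reported DRO representation of square-root Lasso (illustrative form):
\min_{\beta} \; \sup_{P \,:\, D_c(P, P_n) \le \delta}
  \sqrt{\mathbb{E}_{P}\big[(Y - X^{\top}\beta)^2\big]}
  \;=\; \min_{\beta} \; \sqrt{\mathbb{E}_{P_n}\big[(Y - X^{\top}\beta)^2\big]}
  \;+\; \sqrt{\delta}\,\|\beta\|_{p}
```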
Another contribution of this dissertation is the development of statistical methodology to study data-driven DRO formulations based on optimal transport costs. Using this theory, for example, we provide a sharp characterization of the optimal selection of regularization parameters in machine learning settings such as square-root Lasso and regularized logistic regression.
Our statistical methodology relies on the construction of a key object which we call the robust Wasserstein profile function (RWP function). The RWP function is similar in spirit to the empirical likelihood profile function in the context of empirical likelihood (EL), but the asymptotic analysis of the RWP function differs because of a certain lack of smoothness that arises in a suitable Lagrangian formulation.
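Schematically, for a parameter theta characterized by an estimating equation E_P[h(X, theta)] = 0, the RWP function measures the minimal transport cost needed to make theta consistent with the data (our notation, not quoted from the dissertation):

```latex
% Robust Wasserstein profile function at theta (schematic form):
R_n(\theta) \;=\; \inf\big\{ D_c(P, P_n) \;:\;
  \mathbb{E}_{P}\big[h(X, \theta)\big] = 0 \big\}
```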
Optimal transport costs have many advantages in terms of statistical modeling. For example, we show how to define a class of novel semi-supervised learning estimators which are natural companions of the standard supervised counterparts (such as square-root Lasso, support vector machines, and logistic regression). We also show how to define the distributional uncertainty region in a purely data-driven way. Precisely, the optimal transport formulation allows us to inform the shape of the distributional uncertainty region, not only its center (which is given by the empirical distribution). This shape is informed by establishing connections to the metric learning literature. We develop a class of metric learning algorithms which are based on robust optimization, and we use these robust-optimization-based metric learning algorithms to inform the distributional uncertainty region in our data-driven DRO problem. This means that we endow the adversary with additional constraints that force him to spend effort on regions of importance, further improving the generalization properties of machine learning algorithms.
In summary, we explain how the use of optimal transport costs allows us to construct what we call double-robust statistical procedures. We test all of the procedures proposed in this dissertation on various data sets, showing significant improvement in generalization ability over a wide range of state-of-the-art procedures.
Finally, we also discuss a class of stochastic optimization algorithms of independent interest which are particularly useful for solving DRO problems, especially those that arise when the distributional uncertainty region is based on optimal transport costs.
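As a hedged sketch of one algorithmic idea in this spirit (not the dissertation's specific algorithm): in a penalized, or Lagrangian, form of Wasserstein DRO, each stochastic gradient step can first approximately solve an inner maximization that perturbs the sampled point, and then descend on the model parameters at the perturbed point. Everything below, including function names, the penalty parameter gamma, and the step sizes, is an illustrative assumption; logistic regression is used only to keep the example self-contained.

```python
# Minimal sketch of a two-level stochastic gradient method for a
# penalized Wasserstein-DRO objective: for each sample, approximately
# solve the inner maximization
#     max_z  loss(theta; z, y) - gamma * ||z - x||^2
# by a few gradient-ascent steps, then take an outer descent step on
# theta at the adversarially perturbed point. Illustrative only.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_grad_theta(theta, x, y):
    # Gradient of the logistic loss -log p(y | x) w.r.t. theta, y in {0, 1}.
    return (sigmoid(x @ theta) - y) * x

def loss_grad_x(theta, x, y):
    # Gradient of the same loss w.r.t. the input x.
    return (sigmoid(x @ theta) - y) * theta

def wdro_sgd(X, y, gamma=10.0, lr=0.1, inner_lr=0.05, inner_steps=5,
             epochs=20, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x_i, y_i = X[i], y[i]
            # Inner ascent: penalized worst-case perturbation of x_i.
            z = x_i.copy()
            for _ in range(inner_steps):
                g = loss_grad_x(theta, z, y_i) - 2.0 * gamma * (z - x_i)
                z += inner_lr * g
            # Outer descent at the perturbed point.
            theta -= lr * loss_grad_theta(theta, z, y_i)
    return theta
```

Larger values of gamma make perturbations more expensive, shrinking the adversary's effective budget; as gamma grows, the iteration approaches plain stochastic gradient descent on the empirical loss.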