11 research outputs found
A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers
This paper establishes a precise high-dimensional asymptotic theory for
boosting on separable data, taking statistical and computational perspectives.
We consider a high-dimensional setting where the number of features $p$ (weak
learners) scales with the sample size $n$, in an overparametrized regime.
Under a class of statistical models, we provide an exact analysis of the
generalization error of boosting when the algorithm interpolates the training
data and maximizes the empirical $\ell_1$-margin. Further, we explicitly pin
down the relation between the boosting test error and the optimal Bayes error,
as well as the proportion of active features at interpolation (with zero
initialization). In turn, these precise characterizations answer certain
questions raised in \cite{breiman1999prediction, schapire1998boosting}
surrounding boosting, under assumed data generating processes. At the heart of
our theory lies an in-depth study of the maximum $\ell_1$-margin, which can be
accurately described by a new system of non-linear equations; to analyze this
margin, we rely on Gaussian comparison techniques and develop a novel uniform
deviation argument. Our statistical and computational arguments can handle (1)
any finite-rank spiked covariance model for the feature distribution and (2)
variants of boosting corresponding to general $\ell_q$-geometry, $q \in [1, 2]$.
As a final component, via the Lindeberg principle, we establish a
universality result showcasing that the scaled $\ell_1$-margin (asymptotically)
remains the same, whether the covariates used for boosting arise from a
non-linear random feature model or an appropriately linearized model with
matching moments.
Comment: 68 pages, 4 figures
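As a concrete companion to the abstract above: on linearly separable data, the maximum $\ell_1$-margin it studies has an exact finite-sample definition as the value of the linear program $\max_{\|\theta\|_1 \le 1} \min_i y_i \langle x_i, \theta \rangle$. A minimal sketch of that LP (an illustration of the quantity itself, not of the paper's asymptotic analysis):

```python
import numpy as np
from scipy.optimize import linprog

def max_l1_margin(X, y):
    """Value and direction of max_{||theta||_1 <= 1} min_i y_i <x_i, theta>,
    written as an LP by splitting theta = u - v with u, v >= 0."""
    n, p = X.shape
    c = np.zeros(2 * p + 1)          # variables: [u (p), v (p), t]; minimize -t
    c[-1] = -1.0
    Yx = y[:, None] * X
    A_ub = np.vstack([
        # Margin constraints: t - y_i <x_i, u - v> <= 0 for every sample i.
        np.hstack([-Yx, Yx, np.ones((n, 1))]),
        # l1-ball constraint: sum(u) + sum(v) <= 1.
        np.concatenate([np.ones(2 * p), [0.0]])[None, :],
    ])
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    bounds = [(0, None)] * (2 * p) + [(None, None)]   # u, v >= 0; t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    u, v, t = res.x[:p], res.x[p:2 * p], res.x[-1]
    return t, u - v
```

For separable data the optimal value is positive, and boosting-type procedures with small step sizes are known to approach this max-$\ell_1$-margin direction, which is why the margin governs the interpolating solution.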
Spectrum-Aware Adjustment: A New Debiasing Framework with Applications to Principal Components Regression
We introduce a new debiasing framework for high-dimensional linear regression
that bypasses the restrictions on covariate distributions imposed by modern
debiasing technology. We study the prevalent setting where the number of
features and samples are both large and comparable. In this context,
state-of-the-art debiasing technology uses a degrees-of-freedom correction to
remove shrinkage bias of regularized estimators and conduct inference. However,
this method requires that the observed samples are i.i.d., the covariates
follow a mean zero Gaussian distribution, and reliable covariance matrix
estimates for observed features are available. This approach struggles when (i)
covariates are non-Gaussian with heavy tails or asymmetric distributions, (ii)
rows of the design exhibit heterogeneity or dependencies, and (iii) reliable
feature covariance estimates are lacking.
To address these, we develop a new strategy where the debiasing correction is
a rescaled gradient descent step (suitably initialized) with step size
determined by the spectrum of the sample covariance matrix. Unlike prior work,
we assume that eigenvectors of this matrix are uniform draws from the
orthogonal group. We show this assumption remains valid in diverse situations
where traditional debiasing fails, including designs with complex row-column
dependencies, heavy tails, asymmetric properties, and latent low-rank
structures. We establish asymptotic normality of our proposed estimator
(centered and scaled) under various convergence notions. Moreover, we develop a
consistent estimator for its asymptotic variance. Lastly, we introduce a
debiased Principal Component Regression (PCR) technique using our
Spectrum-Aware approach. In varied simulations and real data experiments, we
observe that our method outperforms degrees-of-freedom debiasing by a margin.
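The correction described above has a simple algorithmic shape: one rescaled gradient-descent step on the squared loss, added to an initial estimate. The sketch below shows that shape only; the paper's actual step size is derived from the spectrum of the sample covariance, and the spectral average used here is purely a placeholder assumption:

```python
import numpy as np

def gradient_step_debias(X, y, theta_hat, tau):
    """Debiasing correction of the general form
        theta_hat + tau * X^T (y - X theta_hat) / n,
    i.e., one gradient step on the squared loss, rescaled by tau."""
    n = X.shape[0]
    return theta_hat + tau * X.T @ (y - X @ theta_hat) / n

def placeholder_tau(X):
    """Illustrative spectrum-based step size (NOT the paper's rule):
    reciprocal of the average eigenvalue of the sample covariance."""
    eigs = np.linalg.eigvalsh(X.T @ X / X.shape[0])
    return 1.0 / eigs.mean()
```

When the spectrum is flat and the response is noiseless, a single such step from theta_hat = 0 already recovers the least-squares solution, which is the intuition behind tying the step size to the spectrum.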
Abstracting Fairness: Oracles, Metrics, and Interpretability
It is well understood that classification algorithms, for example, for
deciding on loan applications, cannot be evaluated for fairness without taking
context into account. We examine what can be learned from a fairness oracle
equipped with an underlying understanding of ``true'' fairness. The oracle
takes as input a (context, classifier) pair satisfying an arbitrary fairness
definition, and accepts or rejects the pair according to whether the classifier
satisfies the underlying fairness truth. Our principal conceptual result is an
extraction procedure that learns the underlying truth; moreover, the procedure
can learn an approximation to this truth given access to a weak form of the
oracle. Since every ``truly fair'' classifier induces a coarse metric, in which
those receiving the same decision are at distance zero from one another and
those receiving different decisions are at distance one, this extraction
process provides the basis for ensuring a rough form of metric fairness, also
known as individual fairness. Our principal technical result is a higher
fidelity extractor under a mild technical constraint on the weak oracle's
conception of fairness. Our framework permits the scenario in which many
classifiers, with differing outcomes, may all be considered fair. Our results
have implications for interpretability -- a highly desired but poorly defined
property of classification systems that endeavors to permit a human arbiter to
reject classifiers deemed to be ``unfair'' or illegitimately derived.
Comment: 17 pages, 1 figure
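The coarse metric mentioned above is direct to transcribe: it is the pseudometric a classifier induces on individuals, with distance 0 within a decision class and 1 across classes. A minimal sketch:

```python
def induced_metric(classifier):
    """Coarse (pseudo)metric induced by a classifier: individuals receiving
    the same decision are at distance 0, different decisions at distance 1."""
    def d(x1, x2):
        return 0.0 if classifier(x1) == classifier(x2) else 1.0
    return d
```

A ``truly fair'' classifier extracted from the oracle thus certifies a rough form of individual (metric) fairness with respect to this induced metric.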
A modern maximum-likelihood theory for high-dimensional logistic regression
Every student in statistics or data science learns early on that when the
sample size largely exceeds the number of variables, fitting a logistic model
produces estimates that are approximately unbiased. Every student also learns
that there are formulas to predict the variability of these estimates which are
used for the purpose of statistical inference; for instance, to produce
p-values for testing the significance of regression coefficients. Although
these formulas come from large sample asymptotics, we are often told that we
are on reasonably safe grounds when $n$ is large in such a way that $n \geq 5p$
or $n \geq 10p$. This paper shows that this is far from the case, and
consequently, inferences routinely produced by common software packages are
often unreliable.
Consider a logistic model with independent features in which $n$ and $p$
become increasingly large in a fixed ratio. Then we show that (1) the MLE is
biased, (2) the variability of the MLE is far greater than classically
predicted, and (3) the commonly used likelihood-ratio test (LRT) is not
distributed as a chi-square. The bias of the MLE is extremely problematic as it
yields completely wrong predictions for the probability of a case based on
observed values of the covariates. We develop a new theory, which
asymptotically predicts (1) the bias of the MLE, (2) the variability of the
MLE, and (3) the distribution of the LRT. We empirically also demonstrate that
these predictions are extremely accurate in finite samples. Further, an
appealing feature is that these novel predictions depend on the unknown
sequence of regression coefficients only through a single scalar, the overall
strength of the signal. This suggests very concrete procedures to adjust
inference; we describe one such procedure learning a single parameter from data
and producing accurate inference.
Comment: 29 pages, 14 figures, 4 tables
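The upward bias reported above is easy to see in simulation. The sketch below fits an unregularized logistic MLE by Newton's method on synthetic data with $p/n = 0.2$; the signal sizes and the slope diagnostic are illustrative choices, not the paper's exact experiment:

```python
import numpy as np

def logistic_mle(X, y, iters=50):
    """Unregularized logistic MLE via Newton's method (y in {0, 1})."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        grad = X.T @ (y - mu)                       # score vector
        H = X.T @ (X * (mu * (1.0 - mu))[:, None])  # observed information
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(0)
n, p = 2000, 400                          # fixed ratio p/n = 0.2
beta_true = np.zeros(p)
beta_true[: p // 8] = 6.0 / np.sqrt(n)    # illustrative signal strength
X = rng.standard_normal((n, p))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat = logistic_mle(X, y)
# Slope of beta_hat on beta_true; values above 1 reveal the upward bias.
alpha_hat = beta_hat @ beta_true / (beta_true @ beta_true)
```

Classical theory would put this slope near 1; in the $p/n = 0.2$ regime it comes out noticeably larger, matching the inflation the abstract describes.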
The Asymptotic Distribution of the MLE in High-dimensional Logistic Models: Arbitrary Covariance
We study the distribution of the maximum likelihood estimate (MLE) in
high-dimensional logistic models, extending the recent results from Sur (2019)
to the case where the Gaussian covariates may have an arbitrary covariance
structure. We prove that in the limit of large problems holding the ratio $p/n$
between the number of covariates and the sample size constant, every
finite list of MLE coordinates follows a multivariate normal distribution.
Concretely, the $j$-th coordinate of the MLE is asymptotically
normally distributed with mean $\alpha_\star \beta_j$ and standard deviation
$\sigma_\star / \tau_j$; here, $\beta_j$ is the value of the true regression
coefficient, and $\tau_j$ the standard deviation of the $j$-th predictor
conditional on all the others. The numerical parameters $\alpha_\star$ and
$\sigma_\star$ only depend upon the problem dimensionality and the
overall signal strength, and can be accurately estimated. Our results imply
that the MLE's magnitude is biased upwards and that the MLE's standard
deviation is greater than that predicted by classical theory. We present a
series of experiments on simulated and real data showing excellent agreement
with the theory.
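For Gaussian covariates, the conditional standard deviation $\tau_j$ in the result above can be read off the precision matrix: $\tau_j = 1/\sqrt{\Theta_{jj}}$ with $\Theta = \Sigma^{-1}$, a standard Gaussian identity. A short sketch (the covariance below is an arbitrary example):

```python
import numpy as np

def conditional_stds(Sigma):
    """tau_j: sd of the j-th Gaussian predictor given all the others,
    equal to 1/sqrt of the j-th diagonal entry of the precision matrix."""
    Theta = np.linalg.inv(Sigma)
    return 1.0 / np.sqrt(np.diag(Theta))

Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
taus = conditional_stds(Sigma)   # both entries equal sqrt(1 - 0.25) = sqrt(0.75)
```

These $\tau_j$ are exactly the per-coordinate scalings that convert the universal pair $(\alpha_\star, \sigma_\star)$ into coordinate-wise confidence intervals.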