
    Sharp oracle inequalities for Least Squares estimators in shape restricted regression

    The performance of Least Squares (LS) estimators is studied in isotonic, unimodal and convex regression. Our results have the form of sharp oracle inequalities that account for the model misspecification error. In isotonic and unimodal regression, the LS estimator achieves the nonparametric rate $n^{-2/3}$ as well as a parametric rate of order $k/n$ up to logarithmic factors, where $k$ is the number of constant pieces of the true parameter. In univariate convex regression, the LS estimator satisfies an adaptive risk bound of order $q/n$ up to logarithmic factors, where $q$ is the number of affine pieces of the true regression function. This adaptive risk bound holds for any design points. While Guntuboyina and Sen (2013) established that the nonparametric rate of convex regression is of order $n^{-4/5}$ for equispaced design points, we show that the nonparametric rate of convex regression can be as slow as $n^{-2/3}$ for some worst-case design points. This phenomenon can be explained as follows: although convexity brings more structure than unimodality, for some worst-case design points this extra structure is uninformative and the nonparametric rates of unimodal regression and convex regression are both $n^{-2/3}$.
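    A minimal sketch (not from the paper) of the isotonic LS estimator discussed above, using scikit-learn's IsotonicRegression; the piecewise-constant signal, sample size and noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n = 300
x = np.arange(n)
# Piecewise-constant nondecreasing signal with k = 3 constant pieces (assumption).
theta = np.repeat([0.0, 1.0, 2.5], n // 3)
y = theta + rng.normal(scale=1.0, size=n)

# Least Squares estimator over the isotonic cone:
# projection of y onto {u : u_1 <= ... <= u_n}.
iso = IsotonicRegression(increasing=True)
theta_hat = iso.fit_transform(x, y)

mse = np.mean((theta_hat - theta) ** 2)
print(f"MSE of isotonic LS fit: {mse:.4f}")
# The oracle inequalities of the paper bound this risk by terms of order
# (k/n) up to logarithmic factors when theta has k constant pieces (here k = 3).
```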

    Concentration of quadratic forms under a Bernstein moment assumption

    A concentration result for quadratic forms of independent subgaussian random variables is derived. If the moments of the random variables satisfy a "Bernstein condition", then the variance term of the Hanson-Wright inequality can be improved. The Bernstein condition is satisfied, for instance, by all log-concave subgaussian distributions. Comment: This short note presents a result that initially appeared in arXiv:1410.0346v1 (see Assumption 3.3). The result was later removed from arXiv:1410.0346 and the published version https://projecteuclid.org/euclid.aos/1519268423 due to space constraints.
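    An illustrative Monte Carlo sketch of the quantity controlled by Hanson-Wright-type inequalities, the quadratic form $z^\top A z$ with independent subgaussian (here Gaussian, hence log-concave subgaussian) entries. It does not implement the note's improved variance bound; the matrix and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 20_000

# A fixed symmetric matrix A (illustrative).
B = rng.normal(size=(n, n))
A = (B + B.T) / 2

# z has independent standard normal entries (log-concave subgaussian).
Z = rng.normal(size=(reps, n))
quad = np.einsum("ri,ri->r", Z @ A, Z)     # z^T A z for each replication

mean_theory = np.trace(A)                  # E[z^T A z] = tr(A)
var_theory = 2 * np.sum(A ** 2)            # Var[z^T A z] = 2 ||A||_F^2 for Gaussian z
print("empirical mean:", quad.mean(), " theory:", mean_theory)
print("empirical var :", quad.var(), " theory:", var_theory)
# Hanson-Wright controls the tails of z^T A z - tr(A) in terms of ||A||_F and
# the operator norm of A; the note sharpens the variance term under a
# Bernstein moment condition.
```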

    Aggregation of supports along the Lasso path

    In linear regression with fixed design, we propose two procedures that aggregate a data-driven collection of supports. The collection is a subset of the $2^p$ possible supports and both its cardinality and its elements can depend on the data. The procedures satisfy oracle inequalities with no assumption on the design matrix. Then we use these procedures to aggregate the supports that appear on the regularization path of the Lasso in order to construct an estimator that mimics the best Lasso estimator. If the restricted eigenvalue condition on the design matrix is satisfied, then this estimator achieves optimal prediction bounds. Finally, we discuss the computational cost of these procedures.
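    A simplified sketch of the pipeline the abstract describes: collect the supports appearing along the Lasso regularization path, then choose among least-squares refits on those supports. The BIC-style selection rule below is only a stand-in for the paper's aggregation procedures, and the design, sparsity, noise level and tuning grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(2)
n, p, s, sigma = 100, 50, 5, 1.0
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:s] = 1.0
y = X @ beta + sigma * rng.normal(size=n)

# Supports appearing along the Lasso regularization path.
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
supports = {tuple(np.flatnonzero(coefs[:, j])) for j in range(coefs.shape[1])}
supports.discard(())

# Stand-in for the aggregation step: refit least squares on each support and
# select by a BIC-style penalized residual criterion (noise level assumed known).
def score(S):
    Xs = X[:, list(S)]
    resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
    return np.sum(resid ** 2) + sigma ** 2 * len(S) * np.log(p)

best = min(supports, key=score)
print("selected support:", best)
```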

    Optimal exponential bounds for aggregation of density estimators

    We consider the problem of model selection type aggregation in the context of density estimation. We first show that empirical risk minimization is sub-optimal for this problem and it shares this property with the exponential weights aggregate, empirical risk minimization over the convex hull of the dictionary functions, and all selectors. Using a penalty inspired by recent works on the $Q$-aggregation procedure, we derive a sharp oracle inequality in deviation under a simple boundedness assumption and we show that the rate is optimal in a minimax sense. Unlike the procedures based on exponential weights, this estimator is fully adaptive under the uniform prior. In particular, its construction does not rely on the sup-norm of the unknown density. By providing lower bounds with exponential tails, we show that the deviation term appearing in the sharp oracle inequalities cannot be improved. Comment: Published at http://dx.doi.org/10.3150/15-BEJ742 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
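    A small sketch of the setting, assuming a dictionary of Gaussian kernel density estimators indexed by bandwidth. The held-out-likelihood rule below is precisely the kind of plain selector the paper proves suboptimal in deviation; it is not the $Q$-aggregation-based procedure of the paper, and the data and bandwidth grid are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
data = rng.normal(size=500)            # unknown density: standard normal (assumption)
train, hold = data[:400, None], data[400:, None]

# Dictionary of density estimators: Gaussian KDEs with different bandwidths.
bandwidths = [0.05, 0.1, 0.2, 0.4, 0.8]
dictionary = [KernelDensity(bandwidth=h).fit(train) for h in bandwidths]

# A plain selector: pick the dictionary element with the best held-out
# log-likelihood (a selector of the type shown to be suboptimal in deviation).
scores = [est.score(hold) for est in dictionary]
print("selected bandwidth:", bandwidths[int(np.argmax(scores))])
```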

    Optimistic lower bounds for convex regularized least-squares

    Minimax lower bounds are pessimistic in nature: for any given estimator, minimax lower bounds yield the existence of a worst-case target vector $\beta^*_{worst}$ for which the prediction error of the given estimator is bounded from below. However, minimax lower bounds shed no light on the prediction error of the given estimator for target vectors different from $\beta^*_{worst}$. A characterization of the prediction error of any convex regularized least-squares estimator is given. This characterization provides both a lower bound and an upper bound on the prediction error. This produces lower bounds that are applicable to any target vector and not only to a single, worst-case $\beta^*_{worst}$. Finally, these lower and upper bounds on the prediction error are applied to the Lasso in sparse linear regression. We obtain a lower bound involving the compatibility constant for any tuning parameter, matching upper and lower bounds for the universal choice of the tuning parameter, and a lower bound for the Lasso with small tuning parameter.

    The cost-free nature of optimally tuning Tikhonov regularizers and other ordered smoothers

    We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, of selecting a linear combination of these regularizers that is as good as the best regularizer in the family. Our theory reveals that if the Tikhonov regularizers share the same penalty matrix with different tuning parameters, a convex procedure based on $Q$-aggregation achieves the mean square error of the best estimator, up to a small error term no larger than $C\sigma^2$, where $\sigma^2$ is the noise level and $C>0$ is an absolute constant. Remarkably, the error term does not depend on the penalty matrix or the number of estimators as long as they share the same penalty matrix, i.e., it applies to any grid of tuning parameters, no matter how large the cardinality of the grid is. This reveals the surprising "cost-free" nature of optimally tuning Tikhonov regularizers, in striking contrast with the existing literature on aggregation of estimators where one typically has to pay a cost of $\sigma^2\log(M)$, where $M$ is the number of estimators in the family. The result holds, more generally, for any family of ordered linear smoothers. This encompasses Ridge regression as well as Principal Component Regression. The result is extended to the problem of tuning Tikhonov regularizers with different penalty matrices.
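    A sketch of the tuning problem under simplifying assumptions: ridge estimators (Tikhonov regularizers sharing the identity penalty matrix) on a grid of tuning parameters, with the best one chosen by an unbiased risk estimate for a known noise level. This plain selector is only a stand-in for the convex $Q$-aggregation procedure analyzed in the paper; the data-generating model is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 100, 40, 1.0
X = rng.normal(size=(n, p))
beta = rng.normal(size=p) / np.sqrt(p)
y = X @ beta + sigma * rng.normal(size=n)

# Family of Tikhonov (ridge) regularizers sharing the identity penalty matrix,
# on a large grid of tuning parameters.
lambdas = np.logspace(-3, 3, 200)

def ridge_fit(lam):
    """Ridge fitted values and the trace of the linear smoothing matrix."""
    G = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    df = np.trace(X @ G)               # effective degrees of freedom
    return X @ (G @ y), df

# Unbiased risk estimate (Mallows Cp / SURE with known sigma) for each lambda.
def cp(lam):
    fit, df = ridge_fit(lam)
    return np.sum((y - fit) ** 2) + 2 * sigma ** 2 * df

best = min(lambdas, key=cp)
print("selected lambda:", best)
```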

    Second order Stein: SURE for SURE and other applications in high-dimensional inference

    Stein's formula states that a random variable of the form $z^\top f(z) - \text{div} f(z)$ is mean-zero for functions $f$ with integrable gradient. Here, $\text{div} f$ is the divergence of the function $f$ and $z$ is a standard normal vector. This paper aims to propose a Second Order Stein formula to characterize the variance of such random variables for all functions $f(z)$ with square integrable gradient, and to demonstrate the usefulness of this formula in various applications. In the Gaussian sequence model, a consequence of Stein's formula is Stein's Unbiased Risk Estimate (SURE), an unbiased estimate of the mean squared risk for almost any estimator $\hat\mu$ of the unknown mean. A first application of the Second Order Stein formula is an Unbiased Risk Estimate for SURE itself (SURE for SURE): an unbiased estimate providing information about the squared distance between SURE and the squared estimation error of $\hat\mu$. SURE for SURE has a simple form as a function of the data and is applicable to all $\hat\mu$ with square integrable gradient, e.g. the Lasso and the Elastic Net. In addition to SURE for SURE, the following applications are developed: (1) Upper bounds on the risk of SURE when the estimation target is the mean squared error; (2) Confidence regions based on SURE; (3) Oracle inequalities satisfied by SURE-tuned estimates; (4) An upper bound on the variance of the size of the model selected by the Lasso; (5) Explicit expressions of SURE for SURE for the Lasso and the Elastic-Net; (6) In the linear model, a general semi-parametric scheme to de-bias a differentiable initial estimator for inference of a low-dimensional projection of the unknown $\beta$, with a characterization of the variance after de-biasing; and (7) An accuracy analysis of a Gaussian Monte Carlo scheme to approximate the divergence of functions $f: R^n \to R^n$.
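    A sketch of SURE itself, the object the Second Order Stein formula is applied to, for the Lasso in a linear model, using the standard fact that the number of selected variables estimates the divergence of the Lasso fit. The paper's SURE-for-SURE expressions are not reproduced here, and the design, sparsity and tuning parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p, sigma = 200, 50, 1.0
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 2.0
mu = X @ beta
y = mu + sigma * rng.normal(size=n)

lasso = Lasso(alpha=0.2, fit_intercept=False).fit(X, y)
mu_hat = X @ lasso.coef_

# For the Lasso, the divergence of y -> mu_hat(y) is the number of selected
# variables, so SURE for the squared error ||mu_hat - mu||^2 takes the form:
df = np.count_nonzero(lasso.coef_)
sure = np.sum((y - mu_hat) ** 2) - n * sigma ** 2 + 2 * sigma ** 2 * df

print("SURE estimate          :", sure)
print("true ||mu_hat - mu||^2 :", np.sum((mu_hat - mu) ** 2))
```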

    Out-of-sample error estimate for robust M-estimators with convex penalty

    A generic out-of-sample error estimate is proposed for robust $M$-estimators regularized with a convex penalty in high-dimensional linear regression where $(X,y)$ is observed and $p,n$ are of the same order. If $\psi$ is the derivative of the robust data-fitting loss $\rho$, the estimate depends on the observed data only through the quantities $\hat\psi = \psi(y - X\hat\beta)$, $X^\top \hat\psi$ and the derivatives $(\partial/\partial y)\hat\psi$ and $(\partial/\partial y)X\hat\beta$ for fixed $X$. The out-of-sample error estimate enjoys a relative error of order $n^{-1/2}$ in a linear model with Gaussian covariates and independent noise, either non-asymptotically when $p/n \le \gamma$ or asymptotically in the high-dimensional asymptotic regime $p/n \to \gamma' \in (0,\infty)$. General differentiable loss functions $\rho$ are allowed provided that $\psi = \rho'$ is 1-Lipschitz. The validity of the out-of-sample error estimate holds either under a strong convexity assumption, or for the $\ell_1$-penalized Huber M-estimator if the number of corrupted observations and the sparsity of the true $\beta$ are bounded from above by $s_* n$ for some small enough constant $s_* \in (0,1)$ independent of $n,p$. For the square loss and in the absence of corruption in the response, the results additionally yield $n^{-1/2}$-consistent estimates of the noise variance and of the generalization error. This generalizes, to arbitrary convex penalties, estimates that were previously known for the Lasso. Comment: This version adds simulations for the nuclear norm penalty.
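    A sketch that only simulates the target quantity, the out-of-sample error, for an $\ell_1$-penalized least-squares estimator (square loss, so $\psi$ is the identity) with Gaussian covariates. It does not implement the paper's data-driven estimate of this quantity; the dimensions, sparsity and tuning parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p, sigma = 300, 150, 1.0          # p and n of the same order
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:10] = 0.5
y = X @ beta + sigma * rng.normal(size=n)

# An l1-penalized least-squares estimator (square loss: psi is the identity).
beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
psi_hat = y - X @ beta_hat           # the residual vector psi(y - X beta_hat)

# Out-of-sample error, approximated on a fresh Monte Carlo sample; the paper's
# estimate recovers this quantity from the observed (X, y) alone.
X_new = rng.normal(size=(10_000, p))
y_new = X_new @ beta + sigma * rng.normal(size=10_000)
print("out-of-sample error              :", np.mean((y_new - X_new @ beta_hat) ** 2))
print("in-sample residual error (biased):", np.mean(psi_hat ** 2))
```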

    Optimal bounds for aggregation of affine estimators

    We study the problem of aggregation of estimators when the estimators are not independent of the data used for aggregation and no sample splitting is allowed. If the estimators are deterministic vectors, it is well known that the minimax rate of aggregation is of order $\log(M)$, where $M$ is the number of estimators to aggregate. It is proved that for affine estimators, the minimax rate of aggregation is unchanged: it is possible to handle the linear dependence between the affine estimators and the data used for aggregation at no extra cost. The minimax rate is not impacted either by the variance of the affine estimators, or by any other measure of their statistical complexity. The minimax rate is attained with a penalized procedure over the convex hull of the estimators, for a penalty that is inspired by the $Q$-aggregation procedure. The results follow from the interplay between the penalty, strong convexity and concentration. Comment: Published at https://projecteuclid.org/euclid.aos/1519268423 in the Annals of Statistics (http://imstat.org/aos/) by the Institute of Mathematical Statistics (http://imstat.org/).

    The noise barrier and the large signal bias of the Lasso and other convex estimators

    Convex estimators such as the Lasso, the matrix Lasso and the group Lasso have been studied extensively in the last two decades, demonstrating great success in both theory and practice. Two quantities are introduced, the noise barrier and the large signal bias, that provide insights on the performance of these convex regularized estimators. It is now well understood that the Lasso achieves fast prediction rates, provided that the correlations of the design satisfy some Restricted Eigenvalue or Compatibility condition, and provided that the tuning parameter is large enough. Using the two quantities introduced in the paper, we show that the compatibility condition on the design matrix is actually unavoidable to achieve fast prediction rates with the Lasso. The Lasso must incur a loss due to the correlations of the design matrix, measured in terms of the compatibility constant. This result holds for any design matrix, any active subset of covariates, and any tuning parameter. It is now well known that the Lasso enjoys a dimension reduction property: the prediction error is of order $\lambda\sqrt{k}$, where $k$ is the sparsity, even if the ambient dimension $p$ is much larger than $k$. Such results require that the tuning parameter is greater than some universal threshold. We characterize sharp phase transitions for the tuning parameter of the Lasso around a critical threshold dependent on $k$. If $\lambda$ is equal to or larger than this critical threshold, the Lasso is minimax over $k$-sparse target vectors. If $\lambda$ is equal to or smaller than this critical threshold, the Lasso incurs a loss of order $\sigma\sqrt{k}$, which corresponds to a model of size $k$, even if the target vector has fewer than $k$ nonzero coefficients. Remarkably, the lower bounds obtained in the paper also apply to random, data-driven tuning parameters. The results extend to convex penalties beyond the Lasso. Comment: This paper supersedes the previous article arXiv:1703.0133
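    A small simulation, under illustrative assumptions on the design, sparsity and noise level, of the phenomenon described above: the Lasso prediction error and support size as the tuning parameter moves below and above a universal-threshold-scale value.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p, k, sigma = 200, 400, 5, 1.0
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:k] = 3.0
y = X @ beta + sigma * rng.normal(size=n)

# sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha ||b||_1, for which a
# universal-scale tuning parameter is about sigma * sqrt(2 log(p) / n).
alpha_univ = sigma * np.sqrt(2 * np.log(p) / n)

for c in (0.25, 0.5, 1.0, 2.0):
    beta_hat = Lasso(alpha=c * alpha_univ, fit_intercept=False,
                     max_iter=50_000).fit(X, y).coef_
    pred_err = np.sum((X @ (beta_hat - beta)) ** 2) / n
    print(f"alpha = {c:>4} x universal  prediction error = {pred_err:.3f}  "
          f"support size = {np.count_nonzero(beta_hat)}")
# Tuning parameters above the critical scale give the lambda*sqrt(k)-type rate;
# smaller ones let spurious coordinates enter, illustrating the noise barrier.
```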