Confidence Intervals for High-Dimensional Linear Regression: Minimax Rates and Adaptivity
Confidence sets play a fundamental role in statistical inference. In this paper, we consider confidence intervals for high-dimensional linear regression with random design. We first establish the convergence rates of the minimax expected length for confidence intervals in the oracle setting where the sparsity parameter is given. The focus is then on the problem of adaptation to sparsity for the construction of confidence intervals. Ideally, an adaptive confidence interval should have its length automatically adjusted to the sparsity of the unknown regression vector, while maintaining a pre-specified coverage probability. It is shown that such a goal is in general not attainable, except when the sparsity parameter is restricted to a small region over which the confidence intervals have the optimal length of the usual parametric rate. It is further demonstrated that the lack of adaptivity is not due to the conservativeness of the minimax framework, but is fundamentally caused by the difficulty of learning the bias accurately.
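To fix ideas, the following is a hedged LaTeX sketch of the oracle minimax rate discussed above, in standard sparse-regression notation (sample size n, dimension p, sparsity k, coordinate of interest beta_1); the notation and the omitted constants and regularity conditions are assumptions of this sketch rather than text quoted from the paper.

```latex
% Hedged sketch: minimax expected length of a (1-\alpha)-level confidence
% interval for a single coordinate \beta_1 over the k-sparse parameter
% space, in the oracle regime where k is known. Constants and regularity
% conditions are omitted.
\[
  L^{*}_{\alpha}(k) \;\asymp\; \frac{k \log p}{n} \;+\; \frac{1}{\sqrt{n}} .
\]
% The n^{-1/2} term is the usual parametric rate; the (k \log p)/n term is
% the price of the unknown support. Adaptation to an unknown k fails
% because no procedure can learn the bias of a point estimator of \beta_1
% accurately enough to safely shorten the interval.
```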
Accuracy Assessment for High-Dimensional Linear Regression
This paper considers point and interval estimation of the ℓq loss of an estimator in high-dimensional linear regression with random design. We establish the minimax rate for estimating the ℓq loss and the minimax expected length of confidence intervals for the ℓq loss of rate-optimal estimators of the regression vector, including commonly used estimators such as the Lasso, the scaled Lasso, the square-root Lasso, and the Dantzig Selector. Adaptivity of the confidence intervals for the ℓq loss is also studied. Both the setting of known identity design covariance matrix and known noise level and the setting of unknown design covariance matrix and unknown noise level are studied. The results reveal interesting and significant differences between estimating the ℓ2 loss and the ℓq loss with 1 ≤ q < 2, as well as between the two settings. New technical tools are developed to establish rate-sharp lower bounds for the minimax estimation error and the expected length of minimax and adaptive confidence intervals for the ℓq loss. A significant difference between loss estimation and traditional parameter estimation is that for loss estimation the constraint is on the performance of the estimator of the regression vector, but the lower bounds are on the difficulty of estimating its ℓq loss. The technical tools developed in this paper can also be of independent interest.
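To make the object of study concrete, here is a minimal simulation sketch, assuming a Gaussian random design with identity covariance and an illustrative tuning rate; it merely computes the ℓq loss of a Lasso fit in a synthetic model and is not the paper's estimation procedure for that loss.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative sketch only: compute the l_q loss ||beta_hat - beta||_q of a
# Lasso estimator in a sparse Gaussian-design model. This shows the quantity
# being estimated, not how the paper estimates it from data.
rng = np.random.default_rng(0)
n, p, k, sigma = 200, 500, 5, 1.0           # sample size, dimension, sparsity, noise level

beta = np.zeros(p)
beta[:k] = 1.0                              # k-sparse regression vector
X = rng.standard_normal((n, p))             # random design, identity covariance
y = X @ beta + sigma * rng.standard_normal(n)

lam = sigma * np.sqrt(2 * np.log(p) / n)    # standard Lasso tuning rate
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

for q in (1.0, 2.0):
    loss_q = np.sum(np.abs(beta_hat - beta) ** q) ** (1.0 / q)
    print(f"l_{q:g} loss of Lasso: {loss_q:.4f}")
```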
Distributionally Robust Transfer Learning
Many existing transfer learning methods rely on leveraging information from source data that closely resembles the target data. However, this approach often overlooks valuable knowledge that may be present in different yet potentially related auxiliary samples. For settings with a limited amount of target data and a diverse range of source models, our paper introduces a novel approach, Distributionally Robust Optimization for Transfer Learning (TransDRO), that breaks free from strict similarity constraints. TransDRO is designed to optimize the most adversarial loss within an uncertainty set, defined as a collection of target populations generated as convex combinations of source distributions that guarantee good prediction performance for the target data. TransDRO effectively bridges the realms of transfer learning and distributionally robust prediction. We establish the identifiability of TransDRO and its interpretation as a weighted average of the source models closest to the baseline model. We also show that TransDRO achieves a faster convergence rate than the model fitted with the target data alone. Our comprehensive numerical studies and analysis of multi-institutional electronic health records data using TransDRO further substantiate the robustness and accuracy of TransDRO, highlighting its potential as a powerful tool in transfer learning applications.
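The min-max structure can be sketched generically. The toy below trains a model against the most adversarial convex combination of source empirical losses, in the style of group DRO. It deliberately omits TransDRO's defining ingredient, the target-guided restriction of the uncertainty set to mixtures that predict well on the target data, so all names and parameter values here are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np

# Generic sketch of distributionally robust optimization over convex
# combinations of source distributions (group-DRO style). NOT the TransDRO
# algorithm: it only illustrates the min-max structure over mixture weights.
rng = np.random.default_rng(1)
M, n, p = 3, 300, 10                        # sources, samples per source, dimension

# Simulated sources: shared signal plus source-specific perturbations.
w_true = rng.standard_normal(p)
sources = []
for m in range(M):
    X = rng.standard_normal((n, p))
    shift = 0.3 * rng.standard_normal(p)    # heterogeneity across sources
    y = X @ (w_true + shift) + 0.5 * rng.standard_normal(n)
    sources.append((X, y))

def source_loss_and_grad(w, X, y):
    r = X @ w - y
    return np.mean(r ** 2), 2 * X.T @ r / len(y)

# Alternating updates: gradient descent on the model w, exponentiated-gradient
# ascent on the mixture weights q over the simplex (the adversary).
w = np.zeros(p)
q = np.full(M, 1.0 / M)
eta_w, eta_q = 0.05, 0.5
for _ in range(500):
    losses, grads = zip(*(source_loss_and_grad(w, X, y) for X, y in sources))
    q *= np.exp(eta_q * np.array(losses))   # adversary up-weights hard sources
    q /= q.sum()
    w -= eta_w * sum(qm * g for qm, g in zip(q, grads))

print("mixture weights:", np.round(q, 3))
print("worst-source MSE:", max(source_loss_and_grad(w, X, y)[0] for X, y in sources))
```

The exponentiated-gradient step is the standard simplex-adversary update; the printed weights show which sources the adversary treats as hardest.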
Efficient Modeling of Surrogates to Improve Multi-source High-dimensional Biobank Studies
Surrogate variables in electronic health records (EHR) and biobank data play an important role in biomedical studies due to the scarcity or absence of chart-reviewed gold-standard labels. We develop a novel approach named SASH, for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. It is a semi-supervised approach that efficiently leverages sizable unlabeled samples with error-prone EHR surrogate outcomes from multiple local sites to improve the learning accuracy achievable with the small gold-labeled data alone. To facilitate stable and efficient knowledge extraction from the surrogates, our method first obtains a preliminary supervised estimator and then uses it to assist in training a regularized single index model (SIM) for the surrogates. Interestingly, through a chain of convex and properly penalized sparse regressions that approximate the SIM loss with bias correction, our method avoids the local-minima issue of SIM training and fully eliminates the impact of the preliminary estimator's large error. In addition, it protects individual-level information through summary-statistics-based data aggregation across the local sites, leveraging a similar idea of bias-corrected approximation for the SIM. Through simulation studies, we demonstrate that our method outperforms existing approaches on finite samples. Finally, we apply our method to develop a high-dimensional genetic risk model for type II diabetes using large-scale data sets from the UK and Mass General Brigham biobanks, where only a small fraction of subjects in one site has been labeled via chart review.
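As a purely illustrative companion, the toy sketch below shows the generic surrogate-assisted, semi-supervised idea: calibrate an error-prone surrogate on the small gold-labeled set, impute outcomes on the large unlabeled set, and refit. It is not the SASH procedure, which instead trains a bias-corrected single index model through penalized convex regressions and aggregates summary statistics across sites; every name and parameter value here is an assumption for the demo.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Toy sketch of the generic surrogate-assisted, semi-supervised idea: a small
# gold-labeled set plus a large unlabeled set with an error-prone surrogate
# outcome. NOT the SASH procedure (no single index model, bias-corrected
# approximation, or cross-site summary-statistic aggregation).
rng = np.random.default_rng(2)
p, k = 200, 5
n_lab, n_unlab = 80, 2000                   # scarce labels, abundant surrogates

beta = np.zeros(p)
beta[:k] = 1.0
X_lab = rng.standard_normal((n_lab, p))
y_lab = X_lab @ beta + rng.standard_normal(n_lab)
X_un = rng.standard_normal((n_unlab, p))
y_un = X_un @ beta + rng.standard_normal(n_unlab)
s_lab = y_lab + rng.standard_normal(n_lab)  # noisy surrogates of the outcome
s_un = y_un + rng.standard_normal(n_unlab)

# Baseline: Lasso on the small gold-labeled sample only.
lam = np.sqrt(2 * np.log(p) / n_lab)
beta_lab_only = Lasso(alpha=lam, fit_intercept=False).fit(X_lab, y_lab).coef_

# Calibrate the surrogate on the labeled set, impute outcomes for the
# unlabeled set, then refit on the large imputed sample. The calibration
# shrinks the refitted coefficients toward zero: one kind of
# surrogate-induced bias that motivates bias-corrected approaches.
calib = LinearRegression().fit(s_lab.reshape(-1, 1), y_lab)
y_imp = calib.predict(s_un.reshape(-1, 1))
lam_un = np.sqrt(2 * np.log(p) / n_unlab)
beta_ssl = Lasso(alpha=lam_un, fit_intercept=False).fit(X_un, y_imp).coef_

for name, b in [("labeled-only", beta_lab_only), ("surrogate-assisted", beta_ssl)]:
    print(f"{name:>18}: l2 error = {np.linalg.norm(b - beta):.3f}")
```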
- …