9 research outputs found
Scalable Approximations of Marginal Posteriors in Variable Selection
In many contexts, there is interest in selecting the most important variables
from a very large collection, commonly referred to as support recovery or
variable, feature or subset selection. There is an enormous literature
proposing a rich variety of algorithms. In scientific applications, it is of
crucial importance to quantify uncertainty in variable selection, providing
measures of statistical significance for each variable. The overwhelming
majority of algorithms fail to produce such measures. This has led to a focus
in the scientific literature on independent screening methods, which examine
each variable in isolation, obtaining p-values measuring the significance of
marginal associations. Bayesian methods provide an alternative, with marginal
inclusion probabilities used in place of p-values. Bayesian variable selection
has advantages, but is impractical computationally beyond small problems. In
this article, we show that approximate message passing (AMP) and Bayesian
compressed regression (BCR) can be used to rapidly obtain accurate
approximations to marginal inclusion probabilities in high-dimensional variable
selection. Theoretical support is provided, simulation studies are conducted to
assess performance, and the method is applied to a study relating brain
networks to creative reasoning.
Comment: 10 pages, 4 figures, PDFLaTeX, submitted to the Twenty-ninth Annual
Conference on Neural Information Processing Systems (NIPS 2015)
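To make concrete why exact Bayesian variable selection becomes impractical beyond small problems, the sketch below computes marginal inclusion probabilities by enumerating all 2^p models under a simple Gaussian spike-and-slab prior. Everything here (prior, sizes, variable names) is an illustrative assumption, not the paper's AMP/BCR method; the point is that the exact computation is exponential in p, which is exactly the cost the proposed approximations avoid.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8                      # enumeration is O(2^p): only feasible for tiny p
sigma2, tau2, q = 1.0, 4.0, 0.2   # noise var, slab var, prior inclusion prob (toy values)
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:2] = 1.5
y = X @ beta + rng.standard_normal(n)

def log_marginal(idx):
    """log p(y | model) with active coefficients integrated out under a Gaussian slab."""
    if len(idx) == 0:
        cov = sigma2 * np.eye(n)
    else:
        Xs = X[:, idx]
        cov = sigma2 * np.eye(n) + tau2 * Xs @ Xs.T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (logdet + y @ np.linalg.solve(cov, y))

# Enumerate all 2^p models and weight each by marginal likelihood times prior.
logw = {}
for k in range(p + 1):
    for idx in itertools.combinations(range(p), k):
        logw[idx] = (log_marginal(list(idx))
                     + len(idx) * np.log(q) + (p - len(idx)) * np.log(1 - q))
m = max(logw.values())
w = {s: np.exp(v - m) for s, v in logw.items()}
Z = sum(w.values())

# Marginal inclusion probability of variable j = weighted share of models containing j.
incl = np.array([sum(w[s] for s in w if j in s) for j in range(p)]) / Z
print(np.round(incl, 3))
```

With the two strong signals in the first coordinates, their inclusion probabilities come out near one while the noise variables stay low; doubling p squares the number of models, which is why scalable approximations are needed.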
Thresholding tests
We derive a new class of statistical tests for generalized linear models
based on thresholding point estimators. These tests can be employed whether the
model includes more parameters than observations or not. For linear models, our
tests rely on pivotal statistics derived from model selection techniques.
Affine lasso, a new extension of the lasso, allows us to unveil new tests and to
develop parametric and nonparametric tests within the same framework. Our tests for
generalized linear models are based on new asymptotically pivotal statistics. A
composite thresholding test successfully achieves high power under both
sparse and dense alternatives. In a simulation study, we compare the
level and power of these tests under sparse and dense alternative hypotheses.
The thresholding tests give better control of the nominal level and higher
power than existing tests.
Comment: 18 pages, 3 figures
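As a generic illustration of the idea of testing via thresholded estimators (not the paper's affine-lasso pivotal statistics, which are more refined), one can count how many standardized coefficient estimates survive a threshold and calibrate that count under the null by Monte Carlo. All names and settings below are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 10
X = rng.standard_normal((n, p))

def threshold_stat(y, X, t=2.0):
    """Count standardized OLS coefficients whose |z| exceeds the threshold t."""
    n, p = X.shape
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ bhat
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return int(np.sum(np.abs(bhat / se) > t))

# Calibrate the statistic under the global null (beta = 0) by simulation.
null_draws = [threshold_stat(rng.standard_normal(n), X) for _ in range(500)]
crit = np.quantile(null_draws, 0.95)

# A sparse alternative: one strong signal.
beta = np.zeros(p); beta[0] = 1.0
y = X @ beta + rng.standard_normal(n)
print(threshold_stat(y, X), "vs critical value", crit)
```

Under a sparse alternative the count jumps above its null quantile; a dense alternative would instead spread small effects that no single coordinate survives, which is why the paper combines statistics into a composite test.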
Approximating posteriors with high-dimensional nuisance parameters via integrated rotated Gaussian approximation
Posterior computation for high-dimensional data with many parameters can be
challenging. This article focuses on a new method for approximating posterior
distributions of a low- to moderate-dimensional parameter in the presence of a
high-dimensional or otherwise computationally challenging nuisance parameter.
The focus is on regression models and the key idea is to separate the
likelihood into two components through a rotation. One component involves only
the nuisance parameters, which can then be integrated out using a novel type of
Gaussian approximation. We provide theory on approximation accuracy that holds
for a broad class of forms of the nuisance component and priors. Applying our
method to simulated and real data sets shows that it can outperform
state-of-the-art posterior approximation approaches.
Comment: 32 pages, 8 figures
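The rotation idea can be sketched in a Gaussian linear model: a full QR decomposition of the design for the parameters of interest rotates the observations so that one block of rows no longer involves those parameters at all, leaving a nuisance-only component that can be handled separately. This is a minimal linear-model sketch of the separation step only, with illustrative dimensions; it is not the paper's integrated rotated Gaussian approximation itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 100, 2, 50                    # d interest parameters, p nuisance parameters
X1 = rng.standard_normal((n, d))
X2 = rng.standard_normal((n, p))
theta = np.array([1.0, -0.5])           # parameters of interest
eta = np.zeros(p); eta[0] = 0.8         # sparse high-dimensional nuisance
y = X1 @ theta + X2 @ eta + rng.standard_normal(n)

# Full QR of the interest design: Q is n x n orthogonal and Q^T X1 = [R; 0].
Q, _ = np.linalg.qr(X1, mode="complete")
ytil, X2til = Q.T @ y, Q.T @ X2
R = (Q.T @ X1)[:d]

# Rows d..n of the rotated model contain no theta: that block of the likelihood
# depends on the nuisance eta only, so it can be integrated out separately.
nuisance_block = ytil[d:] - X2til[d:] @ eta   # at the true eta this is pure noise
interest_block = ytil[:d] - R @ theta - X2til[:d] @ eta
print(round(np.std(nuisance_block), 2))       # close to the noise sd of 1
```

The key structural fact is that `(Q.T @ X1)[d:]` is exactly zero, so the bottom n-d rotated equations form the nuisance-only likelihood component described in the abstract.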
Hierarchical correction of p-values via a tree running Ornstein-Uhlenbeck process
Statistical testing is classically used as an exploratory tool to search for
association between a phenotype and many possible explanatory variables. This
approach often leads to multiple dependence testing under dependence. We assume
a hierarchical structure between tests via an Ornstein-Uhlenbeckprocess on a
tree. The process correlation structure is used for smoothing the p-values. We
design a penalized estimation of the mean of the OU process for p-value
computation. The performances of the algorithm are assessed via simulations.
Its ability to discover new associations is demonstrated on a metagenomic
dataset. The corresponding R package is available from
https://github.com/abichat/zazou.
Comment: 20 pages, 8 figures
Confidence Intervals and Hypothesis Testing for High-Dimensional Regression
Fitting high-dimensional statistical models often requires the use of
non-linear parameter estimation procedures. As a consequence, it is generally
impossible to obtain an exact characterization of the probability distribution
of the parameter estimates. This in turn implies that it is extremely
challenging to quantify the \emph{uncertainty} associated with a certain
parameter estimate. Concretely, no commonly accepted procedure exists for
computing classical measures of uncertainty and statistical significance, such as
confidence intervals or p-values, for these models.
We consider here the high-dimensional linear regression problem, and propose an
efficient algorithm for constructing confidence intervals and p-values. The
resulting confidence intervals have nearly optimal size. When testing for the
null hypothesis that a certain parameter is vanishing, our method has nearly
optimal power.
Our approach is based on constructing a `de-biased' version of regularized
M-estimators. The new construction improves over recent work in the field in
that it does not assume a special structure on the design matrix. We test our
method on synthetic data and a high-throughput genomic data set about
riboflavin production rate.
Comment: 40 pages, 4 PDF figures
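The de-biasing construction can be sketched as follows: take a lasso fit and add back a correction term built from the residuals, bd = bhat + M X^T(y - X bhat)/n, where M approximates an inverse covariance. In the sketch below M is a crude ridge-regularized inverse of the sample covariance; the paper instead constructs M by solving a convex program per coordinate, which is what removes the need for design assumptions. All sizes and tuning values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 120, 200, 0.2               # p > n: high-dimensional regime
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(n)

def lasso_cd(X, y, lam, iters=300):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b, r = np.zeros(p), y.copy()
    col2 = (X ** 2).sum(axis=0) / n
    for _ in range(iters):
        for j in range(p):
            r += X[:, j] * b[j]                       # put coordinate j back in residual
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col2[j]
            r -= X[:, j] * b[j]
    return b

bhat = lasso_cd(X, y, lam)

# De-biasing step: bd = bhat + M X^T (y - X bhat) / n.
Sigma = X.T @ X / n
M = np.linalg.inv(Sigma + 0.1 * np.eye(p))            # illustrative stand-in for the paper's M
bd = bhat + M @ (X.T @ (y - X @ bhat)) / n
print(np.round(bd[:4], 2))                            # signals recovered, shrinkage bias reduced
```

Because bd is approximately Gaussian around the truth, a confidence interval for coordinate j is bd_j plus or minus a normal quantile times an estimated standard error, which is what the paper's theory makes rigorous.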
In Defense of the Indefensible: A Very Naive Approach to High-Dimensional Inference
A great deal of interest has recently focused on conducting inference on the
parameters in a high-dimensional linear model.
In this paper, we consider a simple and very na\"{i}ve two-step procedure for
this task, in which we (i) fit a lasso model in order to obtain a subset of the
variables, and (ii) fit a least squares model on the lasso-selected set.
Conventional statistical wisdom tells us that we cannot make use of the
standard statistical inference tools for the resulting least squares model
(such as confidence intervals and p-values), since we peeked at the data
twice: once in running the lasso, and again in fitting the least squares model.
However, in this paper, we show that under a certain set of assumptions, with
high probability, the set of variables selected by the lasso is identical to
the one selected by the noiseless lasso and is hence deterministic.
Consequently, the na\"{i}ve two-step approach can yield asymptotically valid
inference. We utilize this finding to develop the \emph{na\"ive confidence
interval}, which can be used to draw inference on the regression coefficients
of the model selected by the lasso, as well as the \emph{na\"ive score test},
which can be used to test the hypotheses regarding the full-model regression
coefficients.
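The two-step procedure itself is short enough to sketch directly: fit the lasso, keep the selected variables, refit by least squares, and report the standard OLS confidence intervals as if the selection had not happened. The coordinate-descent lasso and all settings below are illustrative assumptions; the paper's contribution is the theory saying when these naive intervals are asymptotically valid.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 200, 20, 0.3
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:2] = [1.0, -1.0]
y = X @ beta + rng.standard_normal(n)

def lasso_cd(X, y, lam, iters=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b, r = np.zeros(p), y.copy()
    col2 = (X ** 2).sum(axis=0) / n
    for _ in range(iters):
        for j in range(p):
            r += X[:, j] * b[j]
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col2[j]
            r -= X[:, j] * b[j]
    return b

S = np.flatnonzero(np.abs(lasso_cd(X, y, lam)) > 1e-8)   # step (i): lasso-selected set
Xs = X[:, S]
ols = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)               # step (ii): least-squares refit
resid = y - Xs @ ols
s2 = resid @ resid / (n - len(S))
se = np.sqrt(s2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
for j, b, e in zip(S, ols, se):
    print(f"beta[{j}]: {b:+.2f} +/- {1.96 * e:.2f}")     # naive 95% intervals
```

When the lasso selection matches its noiseless counterpart, the selected set is effectively deterministic and these classical intervals inherit their nominal coverage, which is the paper's main finding.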
Sparse Nonlinear Regression: Parameter Estimation and Asymptotic Inference
We study parameter estimation and asymptotic inference for sparse nonlinear
regression. More specifically, we assume the data are given by
$y = f(x^\top \beta^*) + \epsilon$, where $f$ is nonlinear. To recover $\beta^*$, we propose
an $\ell_1$-regularized least-squares estimator. Unlike classical linear
regression, the corresponding optimization problem is nonconvex because of the
nonlinearity of $f$. In spite of the nonconvexity, we prove that under mild
conditions, every stationary point of the objective enjoys an optimal
statistical rate of convergence. In addition, we provide an efficient algorithm
that provably converges to a stationary point. We also assess the uncertainty
of the obtained estimator. Specifically, based on any stationary point of the
objective, we construct valid hypothesis tests and confidence intervals for the
low-dimensional components of the high-dimensional parameter $\beta^*$.
Detailed numerical results are provided to back up our theory.
Comment: 32 pages, 2 figures, 1 table
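A standard way to compute such an $\ell_1$-regularized estimator for a known nonlinear link is proximal gradient descent: a gradient step on the smooth squared-error part followed by soft-thresholding (the $\ell_1$ prox). The sketch below uses tanh as the link and illustrative tuning values; it shows the estimator class, not the paper's specific algorithm or guarantees.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam, step = 300, 40, 0.05, 0.3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:2] = [1.0, 0.5]
f = np.tanh                                     # assumed known nonlinear link
fprime = lambda u: 1.0 - np.tanh(u) ** 2
y = f(X @ beta_star) + 0.1 * rng.standard_normal(n)

# Proximal gradient on the nonconvex objective
#   (1/2n) * sum_i (y_i - f(x_i^T b))^2 + lam * ||b||_1
b = np.zeros(p)
for _ in range(1000):
    u = X @ b
    grad = -X.T @ ((y - f(u)) * fprime(u)) / n  # gradient of the smooth part
    b = b - step * grad
    b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # l1 prox (soft-threshold)
print(np.round(b[:3], 2))
```

The objective is nonconvex because of f, so the iteration is only guaranteed to reach a stationary point; the abstract's message is that under mild conditions every such stationary point is already statistically optimal.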
Confidence intervals for high-dimensional Cox models
The purpose of this paper is to construct confidence intervals for the
regression coefficients in high-dimensional Cox proportional hazards regression
models where the number of covariates may be larger than the sample size. Our
debiased estimator construction is similar to those in Zhang and Zhang (2014)
and van de Geer et al. (2014), but the time-dependent covariates and censored
risk sets introduce considerable additional challenges. Our theoretical
results, which provide conditions under which our confidence intervals are
asymptotically valid, are supported by extensive numerical experiments.
Comment: 36 pages, 1 figure
Testing and Confidence Intervals for High Dimensional Proportional Hazards Model
This paper proposes a decorrelation-based approach to test hypotheses and
construct confidence intervals for the low-dimensional component of
high-dimensional proportional hazards models. Motivated by the geometric projection
principle, we propose new decorrelated score, Wald and partial likelihood ratio
statistics. Without assuming model selection consistency, we prove the
asymptotic normality of these test statistics and establish their semiparametric
optimality. We also develop new procedures for constructing pointwise
confidence intervals for the baseline hazard function and baseline survival
function. Thorough numerical results are provided to back up our theory.
Comment: 42 pages, 4 figures, 5 tables
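The decorrelation idea behind these score statistics can be shown in a linear-model surrogate (the proportional hazards version replaces least squares with the partial likelihood): project the covariate of interest onto the nuisance covariates and use the residual as the score direction, so first-order error in the nuisance fit does not bias the test. Everything below is an illustrative assumption, not the paper's Cox-model construction.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
theta0 = 0.0                                        # H0: theta = 0 holds in this simulation
gamma = np.array([1.0, -1.0, 0.5, 0.0])
x = rng.standard_normal(n)
Z = rng.standard_normal((n, 4)) + 0.3 * x[:, None]  # nuisance covariates, correlated with x
y = theta0 * x + Z @ gamma + rng.standard_normal(n)

# Decorrelation step: residualize x against the nuisance design Z.
w = np.linalg.solve(Z.T @ Z, Z.T @ x)
x_dec = x - Z @ w

# Fit the nuisance model under H0 and form the decorrelated score statistic.
gam_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
resid = y - Z @ gam_hat
score = x_dec @ resid / np.sqrt(n)
z_stat = score / np.sqrt((resid ** 2).mean() * (x_dec ** 2).mean())
print(round(z_stat, 2))                             # approximately N(0, 1) under H0
```

Because x_dec is exactly orthogonal to the columns of Z here, estimation error in gam_hat drops out of the score, which is the geometric projection principle the abstract refers to.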