Regression modeling on stratified data with the lasso
We consider the estimation of regression models on strata defined using a
categorical covariate, in order to identify interactions between this
categorical covariate and the other predictors. A basic approach requires the
choice of a reference stratum. We show that the performance of a penalized
version of this approach depends on this arbitrary choice. We propose a refined
approach that bypasses this arbitrary choice, at almost no additional
computational cost. Regarding model selection consistency, our proposal mimics
the strategy based on an optimal and covariate-specific choice for the
reference stratum. Results from an empirical study confirm that our proposal
generally outperforms the basic approach in the identification and description
of the interactions. An illustration on gene expression data is provided.
Comment: 23 pages, 5 figures
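The reference-stratum parameterization described in this abstract can be written out explicitly. What follows is only a sketch under assumed notation: the number of strata K, the stratum-specific coefficient vectors \beta_k, the reference stratum r, the deviations \delta_k, and the tuning parameters \lambda_1 and \lambda_2 are all introduced here for illustration and do not appear in the abstract. The penalty shown is one natural reading of "a penalized version of this approach", not necessarily the exact criterion the authors study.

```latex
% Assumed notation: y^{(k)} and X^{(k)} are the response vector and design
% matrix observed in stratum k; r is the arbitrarily chosen reference stratum.
\[
  y^{(k)} = X^{(k)} \beta_k + \varepsilon^{(k)}, \qquad k = 1, \dots, K,
  \qquad \beta_k = \beta_r + \delta_k, \quad \delta_r = 0 .
\]
\[
  \min_{\beta_r,\ \{\delta_k\}_{k \neq r}} \;
  \sum_{k=1}^{K} \tfrac{1}{2}
    \bigl\| y^{(k)} - X^{(k)} (\beta_r + \delta_k) \bigr\|_2^2
  \;+\; \lambda_1 \|\beta_r\|_1
  \;+\; \lambda_2 \sum_{k \neq r} \|\delta_k\|_1 .
\]
```

Under this reading, a nonzero entry \delta_{k,j} signals an interaction between stratum k and predictor j relative to stratum r. Because the sparsity pattern of the \delta_k depends on which stratum plays the role of r, the selected model can change with the reference, which is the dependence the abstract points out.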
Analysis of Testing-Based Forward Model Selection
This paper introduces and analyzes a procedure called Testing-based forward
model selection (TBFMS) in linear regression problems. This procedure
inductively selects covariates that add predictive power into a working
statistical model before estimating a final regression. The criterion for
deciding which covariate to include next and when to stop including covariates
is derived from a profile of traditional statistical hypothesis tests. This
paper proves probabilistic bounds, which depend on the quality of the tests,
for prediction error and the number of selected covariates. As an example, the
bounds are then specialized to a case with heteroskedastic data, with tests
constructed with the help of Huber-Eicker-White standard errors. Under the
assumed regularity conditions, these tests lead to estimation convergence rates
matching those of other common high-dimensional estimators, including Lasso.
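To make the idea of test-driven forward selection concrete, here is a minimal sketch in plain Python. It is not the authors' TBFMS procedure: the helper names (`dot`, `residualize`, `robust_t`, `forward_select`), the fixed `threshold` on the absolute t-statistic, and the simulated data are all assumptions made for illustration. At each step the sketch partials the already selected covariates out of every candidate, computes a heteroskedasticity-robust (HC0, White-type) t-statistic for the candidate, adds the candidate with the largest statistic, and stops once no candidate passes the threshold. The data are assumed to be centered, so no intercept is fitted.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residualize(v, basis):
    """Subtract from v its projection onto each vector of an orthogonal basis."""
    out = list(v)
    for q in basis:
        coef = dot(out, q) / dot(q, q)
        out = [a - coef * b for a, b in zip(out, q)]
    return out


def robust_t(x_res, y_res):
    """HC0 (White) t-statistic for regressing y_res on the single column x_res."""
    sxx = dot(x_res, x_res)
    if sxx < 1e-12:  # candidate is numerically collinear with the selected set
        return 0.0
    b = dot(x_res, y_res) / sxx
    # Sandwich "meat": sum of squared (regressor * residual) products.
    meat = sum((xi * (yi - b * xi)) ** 2 for xi, yi in zip(x_res, y_res))
    if meat == 0.0:
        return 0.0 if b == 0.0 else math.copysign(math.inf, b)
    return b * sxx / math.sqrt(meat)


def forward_select(columns, y, threshold=3.0):
    """Greedy forward selection; `threshold` is a hypothetical stopping rule."""
    basis, selected = [], []
    y_res = list(y)
    remaining = set(range(len(columns)))
    while remaining:
        # Test each candidate after partialling out the selected covariates.
        stats = {
            j: abs(robust_t(residualize(columns[j], basis), y_res))
            for j in remaining
        }
        best = max(stats, key=stats.get)
        if stats[best] < threshold:
            break  # no remaining covariate passes the test
        q = residualize(columns[best], basis)
        basis.append(q)  # q is orthogonal to all earlier basis vectors
        y_res = residualize(y_res, [q])
        selected.append(best)
        remaining.remove(best)
    return selected


# Simulated example: only columns 0 and 3 enter the true model.
random.seed(0)
n = 200
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(5)]
y = [2.0 * columns[0][i] - 1.5 * columns[3][i] + random.gauss(0, 1)
     for i in range(n)]
selected = forward_select(columns, y)
print(selected)
```

The threshold plays the role of the test's critical value; the abstract indicates that the paper's bounds depend on the quality of the tests, so a fixed constant like the one used here should be read as a placeholder rather than a recommended choice.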
Weighted-Lasso for Structured Network Inference from Time Course Data
We present a weighted-Lasso method to infer the parameters of a first-order
vector auto-regressive model that describes time course expression data
generated by directed gene-to-gene regulation networks. These networks are
assumed to have a prior internal connectivity structure, which drives the
inference method. This prior structure can be either derived from prior
biological knowledge or inferred by the method itself. We illustrate the
performance of this structure-based penalization both on synthetic data and on
two canonical regulatory networks: first, the yeast cell cycle regulation
network, by analyzing Spellman et al.'s dataset, and second, the E. coli S.O.S.
DNA repair network, by analyzing data from U. Alon's lab.
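A weighted-Lasso criterion for a first-order vector auto-regressive model can be sketched as follows. The notation is assumed for illustration and is not taken from the abstract: x_t is the vector of expression levels of p genes at time t, A is the p-by-p matrix of regulation coefficients, \lambda is a tuning parameter, and w_{ij} are nonnegative weights encoding the prior structure. The exact form of the weights used by the authors is not stated in the abstract.

```latex
% Assumed notation: x_t \in \mathbb{R}^p, t = 1, \dots, T; A_{ij} is the
% effect of gene j at time t-1 on gene i at time t.
\[
  x_t = A\, x_{t-1} + \varepsilon_t, \qquad t = 2, \dots, T,
\]
\[
  \hat{A} \in \arg\min_{A \in \mathbb{R}^{p \times p}} \;
  \sum_{t=2}^{T} \bigl\| x_t - A\, x_{t-1} \bigr\|_2^2
  \;+\; \lambda \sum_{i=1}^{p} \sum_{j=1}^{p} w_{ij}\, |A_{ij}| .
\]
```

In a criterion of this form, a smaller weight w_{ij} penalizes the edge from gene j to gene i less, so edges favored by the prior structure are more likely to be retained in the estimated network.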