329 research outputs found
Structured, sparse regression with application to HIV drug resistance
We introduce a new version of forward stepwise regression. Our modification
finds solutions to regression problems where the selected predictors appear in
a structured pattern, with respect to a predefined distance measure over the
candidate predictors. Our method is motivated by the problem of predicting
HIV-1 drug resistance from protein sequences. We find that our method improves
the interpretability of drug-resistance models while producing predictive
accuracy comparable to that of standard methods. We also demonstrate our method
in a simulation study and present some theoretical results and connections.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/), http://dx.doi.org/10.1214/10-AOAS428, by the Institute of Mathematical Statistics (http://www.imstat.org)
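The structural constraint can be illustrated with a small sketch (an interpretation of the abstract, not the paper's exact algorithm): plain forward stepwise selection in which, after the first pick, a candidate predictor is eligible only if it lies within a radius `d_max` of an already-selected predictor under the predefined distance measure. The function name, the eligibility rule, and the `d_max` parameter are illustrative assumptions.

```python
import numpy as np

def structured_forward_stepwise(X, y, dist, k, d_max):
    """Greedy forward stepwise regression with a structural constraint
    (a sketch): after the first step, a candidate predictor is eligible
    only if it lies within d_max of some already-selected predictor
    under `dist`, a distance matrix over the candidate predictors."""
    n, p = X.shape
    selected, rss_path = [], []
    for _ in range(k):
        if not selected:
            eligible = list(range(p))
        else:
            eligible = [j for j in range(p) if j not in selected
                        and min(dist[j, s] for s in selected) <= d_max]
        best_j, best_rss = None, np.inf
        for j in eligible:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        if best_j is None:
            break  # no candidate satisfies the structural constraint
        selected.append(best_j)
        rss_path.append(best_rss)
    return selected, rss_path
```

With a distance matrix derived from positions along the protein sequence, this kind of rule favors selecting mutations at nearby sites, which is the sort of structured pattern the abstract describes.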
Regression modeling on stratified data with the lasso
We consider the estimation of regression models on strata defined using a
categorical covariate, in order to identify interactions between this
categorical covariate and the other predictors. A basic approach requires the
choice of a reference stratum. We show that the performance of a penalized
version of this approach depends on this arbitrary choice. We propose a refined
approach that bypasses this arbitrary choice, at almost no additional
computational cost. Regarding model selection consistency, our proposal mimics
the strategy based on an optimal and covariate-specific choice for the
reference stratum. Results from an empirical study confirm that our proposal
generally outperforms the basic approach in the identification and description
of the interactions. An illustration on gene expression data is provided.
Comment: 23 pages, 5 figures
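The basic approach's dependence on the reference stratum can be made concrete by sketching its design matrix: shared main effects plus interaction columns for every stratum except the chosen reference, so the penalized interaction coefficients measure deviations from whichever stratum is picked as the reference. This is an illustrative construction of the basic approach only; the paper's refined, reference-free encoding is not reproduced here.

```python
import numpy as np

def stratified_design(X, strata, reference):
    """Design matrix for the 'basic approach' (a sketch): shared main
    effects, plus interaction columns X * 1{stratum == k} for each
    stratum k other than the chosen reference. A penalty on the
    interaction columns then selects stratum-specific deviations, and
    which deviations are detectable depends on the reference choice."""
    levels = [k for k in np.unique(strata) if k != reference]
    blocks = [X] + [X * (strata == k)[:, None] for k in levels]
    return np.hstack(blocks), levels
```

Fitting a lasso on this matrix with different `reference` values gives different interaction encodings, which is exactly the arbitrariness the refined approach is designed to bypass.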
On the total variation regularized estimator over a class of tree graphs
We generalize to tree graphs obtained by connecting path graphs an oracle
result obtained for the Fused Lasso over the path graph. Moreover we show that
it is possible to substitute in the oracle inequality the minimum of the
distances between jumps by their harmonic mean. In doing so we prove a lower
bound on the compatibility constant for the total variation penalty. Our
analysis leverages insights obtained for the path graph with one branch to
understand the case of more general tree graphs.
As a side result, we get insights into the irrepresentable condition for such
tree graphs.
Comment: 42 pages
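The role of the harmonic mean can be seen on a toy piecewise-constant vector over the path graph: since the harmonic mean of positive segment lengths is never smaller than their minimum, a bound stated in terms of the harmonic mean is at least as sharp. A small sketch (the segment-length convention, counting the segments touching the endpoints, is an assumption):

```python
def jump_distances(beta):
    """Lengths of the constant segments of a piecewise-constant vector
    on the path graph, i.e. distances between consecutive jumps,
    including the segments touching the endpoints."""
    jumps = [i for i in range(1, len(beta)) if beta[i] != beta[i - 1]]
    cuts = [0] + jumps + [len(beta)]
    return [b - a for a, b in zip(cuts, cuts[1:])]

def harmonic_mean(ds):
    """Harmonic mean of positive lengths; always >= min(ds)."""
    return len(ds) / sum(1.0 / d for d in ds)
```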
Multiple Change-point Detection: a Selective Overview
Very long and noisy sequence data arise in fields ranging from the biological
sciences to the social sciences, including high-throughput data in genomics and
stock prices in econometrics. Often such data are collected in order to identify and understand
shifts in trend, e.g., from a bull market to a bear market in finance or from a
normal number of chromosome copies to an excessive number of chromosome copies
in genetics. Thus, identifying multiple change points in a long, possibly very
long, sequence is an important problem. In this article, we review both
classical and new multiple change-point detection strategies. Considering the
long history and extensive literature on change-point detection, we provide an
in-depth discussion of a normal mean change-point model from the perspectives
of regression analysis, hypothesis testing, consistency, and inference. In
particular, we present a strategy to gather and aggregate local information for
change-point detection that has become the cornerstone of several emerging
methods because of its attractiveness in both computational and theoretical
properties.
Comment: 26 pages, 2 figures
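The normal mean change-point model and the idea of scanning a segment for its best split can be sketched with classical binary segmentation driven by the CUSUM statistic. This is a textbook baseline, not any specific method surveyed in the article, and the threshold is a free tuning parameter:

```python
import numpy as np

def cusum_stat(x, t):
    """CUSUM statistic for a single mean change at position t
    (normal mean change-point model)."""
    n = len(x)
    return np.sqrt(t * (n - t) / n) * abs(x[:t].mean() - x[t:].mean())

def binary_segmentation(x, threshold, offset=0):
    """Recursively split at the CUSUM maximizer while the maximum
    exceeds the threshold; returns the detected change points."""
    n = len(x)
    if n < 2:
        return []
    stats = [cusum_stat(x, t) for t in range(1, n)]
    t = int(np.argmax(stats)) + 1
    if stats[t - 1] <= threshold:
        return []
    return (binary_segmentation(x[:t], threshold, offset)
            + [offset + t]
            + binary_segmentation(x[t:], threshold, offset + t))
```

The local-aggregation strategies highlighted in the article refine this global scan by gathering evidence for a change from neighborhoods of each candidate point, which improves both computation and theory.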
LASSO ISOtone for High Dimensional Additive Isotonic Regression
Additive isotonic regression attempts to determine the relationship between a
multi-dimensional observation variable and a response, under the constraint
that the estimate is the additive sum of univariate component effects that are
monotonically increasing. In this article, we present a new method for such
regression called LASSO Isotone (LISO). LISO adapts ideas from sparse linear
modelling to additive isotonic regression. Thus, it is viable in many
situations with high-dimensional predictor variables, where a distinction
between significant and insignificant variables is required. We suggest an
algorithm involving a modification of the backfitting algorithm CPAV. We give a
numerical convergence result, and finally examine some of its properties
through simulations. We also suggest some possible extensions that improve
performance and allow the calculation to be carried out when the direction of the
monotonicity is unknown.
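The building blocks can be sketched as follows: the pool adjacent violators (PAV) step for a single monotone least-squares fit, wrapped in a naive backfitting loop for the additive model. This is plain additive isotonic backfitting without LISO's sparsity-inducing shrinkage; `backfit_isotonic` is an illustrative simplification, not the modified CPAV algorithm of the paper.

```python
import numpy as np

def pav(y):
    """Pool Adjacent Violators: least-squares nondecreasing fit to y.
    Maintains a stack of (mean, weight) blocks, merging adjacent
    blocks whenever monotonicity is violated."""
    blocks = []
    for v in y:
        mean, w = float(v), 1
        while blocks and blocks[-1][0] > mean:
            m2, w2 = blocks.pop()
            mean = (mean * w + m2 * w2) / (w + w2)
            w += w2
        blocks.append((mean, w))
    out = []
    for mean, w in blocks:
        out.extend([mean] * w)
    return out

def backfit_isotonic(X, y, sweeps=10):
    """Naive additive isotonic backfitting (no sparsity penalty):
    cycle over features, fitting a monotone function to the partial
    residual, sorted by that feature's values."""
    n, p = X.shape
    f = np.zeros((p, n))
    for _ in range(sweeps):
        for j in range(p):
            r = y - f.sum(axis=0) + f[j]      # partial residual
            order = np.argsort(X[:, j])
            f[j][order] = np.array(pav(r[order]))
            f[j] -= f[j].mean()               # centering for identifiability
    return f
```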
Beyond Support in Two-Stage Variable Selection
Numerous variable selection methods rely on a two-stage procedure, where a
sparsity-inducing penalty is used in the first stage to predict the support,
which is then conveyed to the second stage for estimation or inference
purposes. In this framework, the first stage screens variables to find a set of
possibly relevant variables and the second stage operates on this set of
candidate variables, to improve estimation accuracy or to assess the
uncertainty associated with the variable selection. We advocate that more
information can be conveyed from the first stage to the second one: we use the
magnitude of the coefficients estimated in the first stage to define an
adaptive penalty that is applied at the second stage. We give two examples of
procedures that can benefit from the proposed transfer of information, in
estimation and inference problems respectively. Extensive simulations
demonstrate that this transfer is particularly efficient when each stage
operates on distinct subsamples. This separation plays a crucial role for the
computation of calibrated p-values, making it possible to control the False
Discovery Rate. In this setup, the proposed transfer yields sensitivity gains
ranging from 50% to 100% over state-of-the-art methods.
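The proposed transfer can be sketched with a proximal-gradient (ISTA) lasso solver and an adaptive weight rule w_j = 1/(|beta1_j| + eps), so that large first-stage coefficients are penalized lightly at the second stage. The weighting formula and tuning values here are illustrative assumptions, and the paper's procedures additionally split the two stages across distinct subsamples:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso(X, y, lam, weights, iters=3000):
    """Lasso with per-coefficient penalty weights, solved by proximal
    gradient (ISTA). Weights of 1 give the plain first-stage lasso."""
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam * weights)
    return beta

def two_stage(X, y, lam1, lam2, eps=1e-3):
    """Two-stage procedure (sketch): beyond passing the support, the
    first-stage coefficient magnitudes define adaptive second-stage
    penalty weights, shrinking likely-relevant variables less."""
    beta1 = weighted_lasso(X, y, lam1, np.ones(X.shape[1]))
    weights = 1.0 / (np.abs(beta1) + eps)
    return weighted_lasso(X, y, lam2, weights)
```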
FAST: An Optimization Framework for Fast Additive Segmentation in Transparent ML
We present FAST, an optimization framework for fast additive segmentation.
FAST segments piecewise constant shape functions for each feature in a dataset
to produce transparent additive models. The framework leverages a novel
optimization procedure to fit these models 2 orders of magnitude faster
than existing state-of-the-art methods, such as explainable boosting machines
(Nori et al., 2019). We also develop new feature selection algorithms
in the FAST framework to fit parsimonious models that perform well. Through
experiments and case studies, we show that FAST improves the computational
efficiency and interpretability of additive models.
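The model class can be sketched as quantile-binned, piecewise-constant shape functions fitted by backfitting bin means. This illustrates only the additive piecewise-constant form; it reproduces neither FAST's fast optimization procedure nor its feature selection algorithms, and the bin count and fitting loop are assumptions:

```python
import numpy as np

def fit_binned_additive(X, y, n_bins=8, sweeps=5):
    """Additive model with one piecewise-constant shape function per
    feature (a sketch): cut each feature into quantile bins and
    backfit the mean partial residual within each bin."""
    n, p = X.shape
    intercept = y.mean()
    edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
             for j in range(p)]
    bins = [np.searchsorted(edges[j], X[:, j]) for j in range(p)]
    shape = np.zeros((p, n_bins))
    contrib = np.zeros((p, n))
    for _ in range(sweeps):
        for j in range(p):
            r = y - intercept - contrib.sum(axis=0) + contrib[j]
            for b in range(n_bins):
                mask = bins[j] == b
                shape[j, b] = r[mask].mean() if mask.any() else 0.0
            contrib[j] = shape[j][bins[j]]
    pred = intercept + contrib.sum(axis=0)
    return pred, shape
```

Because each feature's effect is a step function readable from `shape`, models of this form are transparent in the sense the abstract describes; FAST's contribution is fitting them orders of magnitude faster.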
- …