Prediction Weighted Maximum Frequency Selection
Shrinkage estimators that possess the ability to produce sparse solutions
have become increasingly important to the analysis of today's complex datasets.
Examples include the LASSO, the Elastic-Net and their adaptive counterparts.
Estimation of penalty parameters still presents difficulties however. While
variable selection consistent procedures have been developed, their finite
sample performance can often be less than satisfactory. We develop a new
strategy for variable selection using the adaptive LASSO and adaptive
Elastic-Net estimators with a diverging number of predictors. The basic idea first involves
using the trace paths of their LARS solutions to bootstrap estimates of maximum
frequency (MF) models conditioned on dimension. Conditioning on dimension
effectively mitigates overfitting; to deal with underfitting, these MF models
are then prediction-weighted. It is shown that not only can consistent model
selection be achieved, but also attractive convergence rates,
leading to excellent finite sample performance. Detailed numerical studies are
carried out on both simulated and real datasets. Extensions to the class of
generalized linear models are also detailed.
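As a rough illustration of the bootstrap-and-tally idea described above, the following
R sketch (not the authors' implementation) fits a plain LASSO path with the glmnet
package in place of the adaptive LASSO/Elastic-Net LARS paths, tallies the selected
variable sets by dimension across bootstrap replicates, and then, instead of forming
explicit prediction weights, simply picks the dimension with the smallest out-of-bag
error. The simulated data and the constant B = 200 are invented for the example.

    ## Minimal sketch of the maximum-frequency (MF) idea; not the paper's procedure.
    library(glmnet)

    set.seed(1)
    n <- 100; p <- 20
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1] - 2 * x[, 2] + rnorm(n)

    B <- 200                              # bootstrap replicates (assumed)
    counts <- vector("list", p)           # selected-set tallies, by dimension
    err <- numeric(p); hits <- numeric(p) # out-of-bag error accumulators

    for (b in 1:B) {
      idx <- sample(n, replace = TRUE)
      oob <- setdiff(1:n, unique(idx))
      fit <- glmnet(x[idx, ], y[idx], alpha = 1)
      beta <- as.matrix(fit$beta)         # p x nlambda coefficient path
      for (j in 1:ncol(beta)) {
        vars <- which(beta[, j] != 0)
        k <- length(vars)
        if (k < 1 || k > p) next
        counts[[k]] <- c(counts[[k]], paste(vars, collapse = ","))
        pred <- predict(fit, x[oob, , drop = FALSE], s = fit$lambda[j])
        err[k]  <- err[k] + mean((y[oob] - pred)^2)
        hits[k] <- hits[k] + 1
      }
    }

    avg_err <- ifelse(hits > 0, err / hits, Inf)
    best_k  <- which.min(avg_err)                         # dimension with best OOB error
    mf_best <- names(which.max(table(counts[[best_k]])))  # MF model at that dimension
    cat("selected dimension:", best_k, " variables:", mf_best, "\n")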
Spike and slab variable selection: Frequentist and Bayesian strategies
Variable selection in the linear regression model takes many apparent faces
from both frequentist and Bayesian standpoints. In this paper we introduce a
variable selection method referred to as a rescaled spike and slab model. We
study the importance of prior hierarchical specifications and draw connections
to frequentist generalized ridge regression estimation. Specifically, we study
the usefulness of continuous bimodal priors to model hypervariance parameters,
and the effect scaling has on the posterior mean through its relationship to
penalization. Several model selection strategies, some frequentist and some
Bayesian in nature, are developed and studied theoretically. We demonstrate the
importance of selective shrinkage for effective variable selection in terms of
risk misclassification, and show this is achieved using the posterior from a
rescaled spike and slab model. We also show how to verify a procedure's ability
to reduce model uncertainty in finite samples using a specialized forward
selection strategy. Using this tool, we illustrate the effectiveness of
rescaled spike and slab models in reducing model uncertainty.
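The selective-shrinkage mechanism can be illustrated with a small R sketch. Under a
Gaussian linear model with coefficient priors beta_j ~ N(0, v_j), the posterior mean
given the hypervariances v_j is a generalized ridge estimate, so small ("spike")
hypervariances shrink a coordinate strongly while large ("slab") hypervariances leave
it nearly unpenalized. The bimodal mixture used to draw the hypervariances below is an
assumption made for illustration, not the paper's rescaled spike and slab prior.

    ## Sketch only: generalized ridge / posterior mean given hypervariances
    ## drawn from a continuous bimodal prior; not the paper's hierarchy or sampler.
    set.seed(1)
    n <- 50; p <- 5
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(3, 0, 0, 2, 0)
    y <- drop(X %*% beta + rnorm(n))
    sigma2 <- 1

    ## A continuous bimodal hypervariance prior: mixture of two gammas (assumed)
    draw_hypervar <- function(p, w = 0.5) {
      spike <- rgamma(p, shape = 2, rate = 200)   # mass near zero
      slab  <- rgamma(p, shape = 2, rate = 0.05)  # diffuse component
      ifelse(runif(p) < w, spike, slab)
    }

    ## Posterior mean given v = (v_1, ..., v_p): a generalized ridge estimate
    post_mean <- function(X, y, v, sigma2) {
      solve(crossprod(X) + sigma2 * diag(1 / v), crossprod(X, y))
    }

    v <- draw_hypervar(p)
    round(cbind(truth = beta,
                ols = coef(lm(y ~ X - 1)),
                shrunk = drop(post_mean(X, y, v, sigma2)),
                hypervar = v), 3)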
Fence methods for mixed model selection
Many model search strategies involve trading off model fit with model
complexity in a penalized goodness of fit measure. Asymptotic properties for
these types of procedures in settings like linear regression and ARMA time
series have been studied, but these do not naturally extend to nonstandard
situations such as mixed effects models, where simple definition of the sample
size is not meaningful. This paper introduces a new class of strategies, known
as fence methods, for mixed model selection, which includes linear and
generalized linear mixed models. The idea involves a procedure to isolate a
subgroup of what are known as correct models (of which the optimal model is a
member). This is accomplished by constructing a statistical fence, or barrier,
to carefully eliminate incorrect models. Once the fence is constructed, the
optimal model is selected from among those within the fence according to a
criterion which can be made flexible. In addition, we propose two variations of
the fence. The first is a stepwise procedure to handle situations of many
predictors; the second is an adaptive approach for choosing a tuning constant.
We give sufficient conditions for consistency of the fence and its variations, a
desirable property for a good model selection procedure. The methods are
illustrated through simulation studies and real data analysis.
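A minimal R sketch of the fence idea follows, written for an ordinary linear model
rather than a mixed model: the lack-of-fit measure Q (residual sum of squares here),
the crude dispersion estimate, and the fixed tuning constant c are simplifying
assumptions; the paper's adaptive fence would choose the constant data-adaptively.

    ## Fence sketch on a plain linear model; Q, shat and c0 are assumptions.
    set.seed(1)
    n <- 100
    x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
    y  <- 1 + 2 * x1 + rnorm(n)
    dat <- data.frame(y, x1, x2, x3)

    candidates <- list(y ~ 1, y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3)
    Q <- sapply(candidates, function(f) sum(resid(lm(f, dat))^2))  # lack of fit
    k <- sapply(candidates, function(f) length(attr(terms(f), "term.labels")))

    Qmin <- min(Q)                          # Q of the best-fitting model
    shat <- sd(Q - Qmin)                    # crude dispersion estimate (assumed)
    c0   <- 1                               # tuning constant; adaptive in the paper
    inside <- which(Q <= Qmin + c0 * shat)  # models inside the fence

    ## Among models inside the fence, take the most parsimonious one
    chosen <- inside[which.min(k[inside])]
    print(candidates[[chosen]])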
Unsupervised Bump Hunting Using Principal Components
Principal Components Analysis is a widely used technique for dimension
reduction and characterization of variability in multivariate populations. Our
interest lies in studying when and why the rotation to principal components can
be used effectively within a response-predictor set relationship in the context
of mode hunting. Specifically focusing on the Patient Rule Induction Method
(PRIM), we first develop a fast version of this algorithm (fastPRIM) under
normality which facilitates the theoretical studies to follow. Using basic
geometrical arguments, we then demonstrate how the PC rotation of the predictor
space alone can in fact generate improved mode estimators. Simulation results
are used to illustrate our findings.
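The following R sketch illustrates the general idea of peeling on PCA-rotated
predictors with a generic PRIM-style loop; it is not the fastPRIM algorithm, and the
peeling fraction, stopping rule, and simulated bump are assumptions made for the
example.

    ## Generic PRIM-style peeling on PCA-rotated predictors; not fastPRIM itself.
    set.seed(1)
    n <- 500
    X <- cbind(rnorm(n), rnorm(n, sd = 2))
    y <- as.numeric(X[, 1]^2 + X[, 2]^2 < 1) + rnorm(n, sd = 0.1)  # bump near origin

    Z <- prcomp(X, center = TRUE, scale. = FALSE)$x   # rotate to principal components

    peel <- function(Z, y, alpha = 0.05, min_support = 0.1) {
      keep <- rep(TRUE, nrow(Z))
      while (mean(keep) > min_support) {
        best <- NULL
        for (j in seq_len(ncol(Z))) {
          for (side in c("low", "high")) {
            zj  <- Z[keep, j]
            thr <- if (side == "low") quantile(zj, alpha) else quantile(zj, 1 - alpha)
            cand <- keep
            cand[keep] <- if (side == "low") zj >= thr else zj <= thr
            m <- mean(y[cand])
            if (is.null(best) || m > best$mean) best <- list(mean = m, keep = cand)
          }
        }
        if (best$mean <= mean(y[keep])) break   # no peel raises the box mean: stop
        keep <- best$keep
      }
      keep
    }

    box <- peel(Z, y)
    c(box_size = mean(box), box_mean = mean(y[box]), overall_mean = mean(y))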
Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods
We introduce a framework to build a survival/risk bump hunting model with a
censored time-to-event response. Our Survival Bump Hunting (SBH) method is
based on a recursive peeling procedure that uses a specific survival peeling
criterion derived from non/semi-parametric statistics such as the
hazard ratio, the log-rank test or the Nelson-Aalen estimator. To optimize the
tuning parameter of the model and validate it, we introduce an objective
function based on survival or prediction-error statistics, such as the log-rank
test and the concordance error rate. We also describe two alternative
cross-validation techniques adapted to the joint task of decision-rule making
by recursive peeling and survival estimation. Numerical analyses show the
importance of replicated cross-validation and the differences between criteria
and techniques in both low and high-dimensional settings. Although several
non-parametric survival models exist, none addresses the problem of directly
identifying local extrema. We show how SBH efficiently estimates extreme
survival/risk subgroups unlike other models. This provides an insight into the
behavior of commonly used models and suggests alternatives to be adopted in
practice. Finally, our SBH framework was applied to a clinical dataset. In it,
we identified subsets of patients characterized by clinical and demographic
covariates with a distinct extreme survival outcome, for which tailored medical
interventions could be made. An R package `PRIMsrc` is available on CRAN and
GitHub.
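A minimal R sketch of one survival peeling trajectory follows, using the log-rank
statistic from the survival package as the peeling criterion; this illustrates the
general idea rather than the PRIMsrc implementation, and the peeling fraction,
stopping rule, and simulated data are assumptions.

    ## One peeling trajectory with the log-rank statistic as criterion (sketch).
    library(survival)

    set.seed(1)
    n <- 400
    x1 <- runif(n); x2 <- runif(n)
    rate <- ifelse(x1 > 0.7 & x2 > 0.7, 2, 0.5)       # high-risk corner
    time <- rexp(n, rate); status <- rbinom(n, 1, 0.9)

    logrank <- function(inbox, time, status) {
      if (sum(inbox) < 10 || sum(!inbox) < 10) return(-Inf)
      survdiff(Surv(time, status) ~ inbox)$chisq
    }

    X <- cbind(x1, x2)
    inbox <- rep(TRUE, n); alpha <- 0.05
    repeat {
      best <- list(stat = logrank(inbox, time, status), inbox = inbox)
      for (j in 1:ncol(X)) for (side in c(-1, 1)) {
        xj  <- X[inbox, j]
        thr <- quantile(xj, if (side < 0) alpha else 1 - alpha)
        cand <- inbox
        cand[inbox] <- if (side < 0) xj >= thr else xj <= thr
        s <- logrank(cand, time, status)
        if (s > best$stat) best <- list(stat = s, inbox = cand)
      }
      if (identical(best$inbox, inbox)) break         # no peel improves separation
      inbox <- best$inbox
    }
    c(box_size = mean(inbox), logrank_chisq = best$stat)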
On the explanatory power of principal components
We show that if we have an orthogonal basis (u_1, ..., u_d) in a
d-dimensional vector space, and select vectors v and w such that the vectors
traverse the origin, then the probability of v being closer to all the
vectors in the basis than to w is at least 1/2 and converges as d increases
to infinity to a normal distribution on the interval [-1,1]. This result has
relevant consequences for Principal Components Analysis in the context of
regression and other learning settings, if we take the orthogonal basis as the
directions of the principal components.
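The following small Monte Carlo sketch checks one reading of the statement, under
assumptions made for illustration: the basis is taken to be the standard basis of R^d
and the two additional vectors are drawn uniformly on the unit sphere; the code
estimates how often the first vector is closer to every basis vector than it is to
the second.

    ## Monte Carlo sketch; the standard basis and uniform-sphere draws are assumptions.
    set.seed(1)
    runif_sphere <- function(d) { z <- rnorm(d); z / sqrt(sum(z^2)) }

    closer_to_all <- function(d) {
      U <- diag(d)                             # standard orthonormal basis
      v <- runif_sphere(d); w <- runif_sphere(d)
      d_vw <- sqrt(sum((v - w)^2))
      all(sqrt(colSums((U - v)^2)) <= d_vw)    # v closer to every basis vector than to w
    }

    for (d in c(2, 5, 20, 100)) {
      est <- mean(replicate(5000, closer_to_all(d)))
      cat(sprintf("d = %3d   estimated probability = %.3f\n", d, est))
    }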
On A Simple Method For Analyzing Multivariate Survival Data Using Sample Survey Methods
A simple technique is illustrated for analyzing multivariate survival data. The data situation arises when an individual records multiple survival events, or when individuals recording single survival events are grouped into clusters. Past work has focused on developing new methods to handle such data. Here, we use a connection between Poisson regression and survival modeling and a cluster sampling approach to adjust the variance estimates. The approach requires a parametric assumption for the marginal hazard function, but avoids specification of a joint multivariate survival distribution. A simulation study demonstrates that the proposed approach is competitive with recently developed marginal approaches in the literature.
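A minimal R sketch of the Poisson-regression route with cluster-adjusted variances:
an exponential (constant) marginal hazard is fit as a Poisson GLM with a log-time
offset, and the sandwich package's clustered covariance stands in for the
cluster-sampling variance adjustment. The simulated clustered data, the constant
hazard, and the censoring scheme are assumptions for illustration, not the paper's
exact estimator.

    ## Exponential marginal hazard via Poisson GLM + clustered SEs (sketch).
    library(sandwich)
    library(lmtest)

    set.seed(1)
    K <- 100; m <- 3                           # K clusters, m events each
    id  <- rep(1:K, each = m)
    x   <- rbinom(K * m, 1, 0.5)
    fr  <- rep(rgamma(K, 2, 2), each = m)      # shared frailty inducing correlation
    time   <- rexp(K * m, rate = fr * exp(0.7 * x))
    status <- as.numeric(time < 5)             # administrative censoring at t = 5
    time   <- pmin(time, 5)

    ## Constant hazard: event indicator ~ Poisson with offset log(time)
    fit <- glm(status ~ x + offset(log(time)), family = poisson)

    coeftest(fit)                                       # naive (independence) SEs
    coeftest(fit, vcov = vcovCL(fit, cluster = id))     # cluster-adjusted SEs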
Statistical Redundancy Testing for Improved Gene Selection in Cancer Classification Using Microarray Data
In gene selection for cancer classification using microarray data, we define an eigenvalue-ratio statistic to measure a gene’s contribution to the joint discriminability when this gene is included in a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis test for gene statistical redundancy and propose two gene selection methods. Simulation studies illustrate the agreement between statistical redundancy testing and gene selection methods. Real data examples show the proposed gene selection methods can select a compact gene subset that not only can be used to build high-quality cancer classifiers but also shows biological relevance.
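The eigenvalue-ratio statistic itself is not reproduced here; the R sketch below shows
only the kind of quantity involved, comparing the leading eigenvalue of a
between-class versus within-class scatter problem computed with and without a
candidate gene, on invented data. The particular discriminability measure and the gene
sets are assumptions made for illustration, not the paper's statistic or its test.

    ## Illustrative stand-in only, not the paper's eigenvalue-ratio statistic.
    set.seed(1)
    n <- 60; g <- 10
    cls <- rep(0:1, each = n / 2)
    X <- matrix(rnorm(n * g), n, g)
    X[cls == 1, 1] <- X[cls == 1, 1] + 2       # gene 1 carries class signal

    discrim <- function(X, cls) {
      mu  <- colMeans(X)
      mu0 <- colMeans(X[cls == 0, , drop = FALSE])
      mu1 <- colMeans(X[cls == 1, , drop = FALSE])
      B <- sum(cls == 0) * tcrossprod(mu0 - mu) + sum(cls == 1) * tcrossprod(mu1 - mu)
      W <- crossprod(scale(X[cls == 0, , drop = FALSE], center = mu0, scale = FALSE)) +
           crossprod(scale(X[cls == 1, , drop = FALSE], center = mu1, scale = FALSE))
      max(Re(eigen(solve(W, B))$values))       # leading generalized eigenvalue
    }

    base_set <- c(2, 3, 4)                     # genes already selected (arbitrary)
    for (gene in c(1, 5)) {
      ratio <- discrim(X[, c(base_set, gene)], cls) / discrim(X[, base_set], cls)
      cat(sprintf("adding gene %d: discriminability ratio = %.2f\n", gene, ratio))
    }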
Robust Estimation Of Multivariate Failure Data With Time-Modulated Frailty
A time-modulated frailty model is proposed for analyzing multivariate failure data. The effect of frailties, which may not be constant over time, is discussed. We assume a parametric model for the baseline hazard, but avoid a parametric assumption for the frailty distribution. The well-known connection between survival times and the Poisson regression model is used. The parameters of interest are estimated by generalized estimating equations (GEE) or by penalized GEE. Simulation studies show that the procedure successfully detects the effect of time-modulated frailty. The method is also applied to a placebo-controlled randomized clinical trial of gamma interferon, a study of chronic granulomatous disease (CGD).
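A minimal R sketch of the piecewise-Poisson route to a time-modulated effect: the
follow-up is split at an assumed cut point and the covariate effect is allowed to
differ across intervals. The fit shown is the independence-working-correlation GEE
(an ordinary Poisson GLM score equation) with a cluster-robust variance; the cut
point, simulated data, and working model are illustrative assumptions rather than the
paper's (penalized) GEE estimator.

    ## Sketch: assumed cut point and data; independence-GEE fit with clustered SEs.
    library(sandwich)
    library(lmtest)

    set.seed(1)
    K <- 150; m <- 2
    id  <- rep(1:K, each = m)
    trt <- rbinom(K * m, 1, 0.5)
    fr  <- rep(rgamma(K, 2, 2), each = m)              # shared unobserved frailty
    time   <- rexp(K * m, rate = fr * exp(-0.5 * trt))
    status <- as.numeric(time < 4); time <- pmin(time, 4)

    ## Split each record at t = 1 into two Poisson pseudo-observations
    tcut <- 1
    d <- rbind(
      data.frame(id, trt, interval = 1, exposure = pmin(time, tcut),
                 event = as.numeric(status == 1 & time <= tcut)),
      data.frame(id, trt, interval = 2, exposure = pmax(time - tcut, 0),
                 event = as.numeric(status == 1 & time > tcut)))
    d <- d[d$exposure > 0, ]

    fit <- glm(event ~ trt * factor(interval) + offset(log(exposure)),
               family = poisson, data = d)
    coeftest(fit, vcov = vcovCL(fit, cluster = d$id))  # cluster-adjusted inference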
BAMarray™: Java software for Bayesian analysis of variance for microarray data
BACKGROUND: DNA microarrays open up a new horizon for studying the genetic determinants of disease. The high throughput nature of these arrays creates an enormous wealth of information, but also poses a challenge to data analysis. Inferential problems become even more pronounced as experimental designs used to collect data become more complex. An important example is multigroup data collected over different experimental groups, such as data collected from distinct stages of a disease process. We have developed a method specifically addressing these issues termed Bayesian ANOVA for microarrays (BAM). The BAM approach uses a special inferential regularization known as spike-and-slab shrinkage that provides an optimal balance between total false detections and total false non-detections. This translates into more reproducible differential calls. Spike and slab shrinkage is a form of regularization achieved by using information across all genes and groups simultaneously.
RESULTS: BAMarray™ is a graphically oriented Java-based software package that implements the BAM method for detecting differentially expressing genes in multigroup microarray experiments (up to 256 experimental groups can be analyzed). Drop-down menus allow the user to easily select between different models and to choose various run options. BAMarray™ can also be operated in a fully automated mode with preselected run options. Tuning parameters have been preset at theoretically optimal values, freeing the user from such specifications. BAMarray™ provides estimates for gene differential effects and automatically estimates data adaptive, optimal cutoff values for classifying genes into biological patterns of differential activity across experimental groups. A graphical suite is a core feature of the product and includes diagnostic plots for assessing model assumptions and interactive plots that enable tracking of prespecified gene lists to study such things as biological pathway perturbations. The user can zoom in and lasso genes of interest that can then be saved for downstream analyses.
CONCLUSION: BAMarray™ is user-friendly, platform-independent software that effectively and efficiently implements the BAM methodology. Classifying patterns of differential activity is greatly facilitated by a data adaptive cutoff rule and a graphical suite. BAMarray™ is licensed software freely available to academic institutions. More information can be found at