Orthogonalized smoothing for rescaled spike and slab models
Rescaled spike and slab models are a new Bayesian variable selection method
for linear regression models. In high dimensional orthogonal settings such
models have been shown to possess optimal model selection properties. We review
background theory and discuss applications of rescaled spike and slab models to
prediction problems involving orthogonal polynomials. We first consider global
smoothing and discuss potential weaknesses. Some of these deficiencies are
remedied by using local regression. The local regression approach relies on an
intimate connection between local weighted regression and weighted generalized
ridge regression. An important implication is that one can trace the effective
degrees of freedom of a curve as a way to visualize and classify curvature.
Several motivating examples are presented. Comment: Published at
http://dx.doi.org/10.1214/074921708000000192 in the IMS Collections
(http://www.imstat.org/publications/imscollections.htm) by the
Institute of Mathematical Statistics (http://www.imstat.org)
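As a rough illustration of the effective degrees of freedom mentioned in the abstract above, the following sketch computes the trace of the hat matrix for a generalized ridge fit in the special case of an orthogonal design, where it reduces to a sum of per-coordinate shrinkage factors d_k / (d_k + lambda_k). The function name effective_df and the example penalties are hypothetical and are not taken from the paper.

```python
def effective_df(col_norms_sq, penalties):
    """Trace of the generalized ridge hat matrix for an orthogonal design.

    Assumes X'X = diag(col_norms_sq) and a penalty matrix diag(penalties), so
    trace(X (X'X + Lambda)^{-1} X') = sum_k d_k / (d_k + lambda_k).
    """
    if len(col_norms_sq) != len(penalties):
        raise ValueError("need one penalty per column")
    return sum(d / (d + lam) for d, lam in zip(col_norms_sq, penalties))


# No penalty: every coefficient is fully fit, so df equals the number of columns.
df_unpenalized = effective_df([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])

# Heavier penalties on later (higher-order) terms shrink their contribution.
df_penalized = effective_df([1.0, 1.0, 1.0], [0.0, 1.0, 3.0])
```

Tracing this quantity as the penalties vary gives one number per fit, which is one way a smoother's flexibility could be summarized.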
Spike and slab variable selection: Frequentist and Bayesian strategies
Variable selection in the linear regression model takes many apparent faces
from both frequentist and Bayesian standpoints. In this paper we introduce a
variable selection method referred to as a rescaled spike and slab model. We
study the importance of prior hierarchical specifications and draw connections
to frequentist generalized ridge regression estimation. Specifically, we study
the usefulness of continuous bimodal priors to model hypervariance parameters,
and the effect scaling has on the posterior mean through its relationship to
penalization. Several model selection strategies, some frequentist and some
Bayesian in nature, are developed and studied theoretically. We demonstrate the
importance of selective shrinkage for effective variable selection in terms of
risk misclassification, and show this is achieved using the posterior from a
rescaled spike and slab model. We also show how to verify a procedure's ability
to reduce model uncertainty in finite samples using a specialized forward
selection strategy. Using this tool, we illustrate the effectiveness of
rescaled spike and slab models in reducing model uncertainty. Comment: Published
at http://dx.doi.org/10.1214/009053604000001147 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Variable importance in binary regression trees and forests
We characterize and study variable importance (VIMP) and pairwise variable
associations in binary regression trees. A key component involves the node mean
squared error for a quantity we refer to as a maximal subtree. The theory
naturally extends from single trees to ensembles of trees and applies to
methods like random forests. This is useful because, while importance values
from random forests are used to screen variables (for example, to filter
high-throughput genomic data in bioinformatics), very little theory exists
about their properties. Comment: Published at
http://dx.doi.org/10.1214/07-EJS039 in the Electronic Journal of Statistics
(http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics
(http://www.imstat.org)
Characterizing L2Boosting
We consider L2Boosting, a special case of Friedman's generic boosting
algorithm applied to linear regression under L2-loss. We study L2Boosting
for an arbitrary regularization parameter and derive an exact closed form
expression for the number of steps taken along a fixed coordinate direction.
This relationship is used to describe L2Boosting's solution path, to
describe new tools for studying its path, and to characterize some of the
algorithm's unique properties, including active set cycling, a property where
the algorithm spends lengthy periods of time cycling between the same
coordinates when the regularization parameter is arbitrarily small. Our fixed
descent analysis also reveals a repressible condition that limits the
effectiveness of L2Boosting in correlated problems by preventing desirable
variables from entering the solution path. As a simple remedy, a data
augmentation method similar to that used for the elastic net is used to
introduce L2-penalization and is shown, in combination with decorrelation,
to reverse the repressible condition and circumvent L2Boosting's
deficiencies in correlated problems. In itself, this presents a new explanation
for why the elastic net is successful in correlated problems and why methods
like LAR and lasso can perform poorly in such settings. Comment: Published at
http://dx.doi.org/10.1214/12-AOS997 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
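The boosting procedure summarized in the abstract above can be sketched as componentwise boosting for linear regression under squared-error (L2) loss: at each step the residual is fit by the single best column, and only a fraction nu of that fit is added to the coefficients. This is a minimal sketch under stated assumptions, not the authors' code. The function name l2boost, the toy data, and the assumption that no column is identically zero are all illustrative, and the paper's closed form expression for the number of steps is not reproduced here.

```python
def l2boost(X, y, nu=0.1, n_steps=200):
    """Componentwise L2Boosting sketch for linear regression.

    X is a list of rows, y a list of responses, and nu the regularization
    (step size) parameter. Returns the coefficients and the coordinate
    chosen at each step. Assumes no column of X is identically zero.
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][k] for i in range(n)] for k in range(p)]
    norms_sq = [sum(v * v for v in col) for col in cols]
    beta = [0.0] * p
    resid = list(y)  # residual of the zero fit, i.e. the negative L2 gradient
    path = []
    for _ in range(n_steps):
        # Inner product of each column with the current residual.
        ips = [sum(c * r for c, r in zip(col, resid)) for col in cols]
        # Choose the coordinate whose least squares fit to the residual
        # reduces the residual sum of squares the most.
        k = max(range(p), key=lambda j: ips[j] ** 2 / norms_sq[j])
        step = nu * ips[k] / norms_sq[k]
        beta[k] += step
        resid = [r - step * c for r, c in zip(resid, cols[k])]
        path.append(k)
    return beta, path


# Hypothetical toy data generated exactly as y = 2 * x1 - 1 * x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
y = [2.0, -1.0, 1.0, 5.0]
beta, path = l2boost(X, y, nu=0.1, n_steps=500)
```

The returned path records which coordinate was selected at each step, so repeated visits to the same coordinates for small nu can be inspected directly.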
Tree Variable Selection for Paired Case-Control Studies with Application to Microbiome Data
When case-control studies involve paired samples, tree analyses based
on traditional splitting rules are suboptimal as they ignore the paired nature of the
data. Paired samples occur in microbiome studies when they are collected from
different locations of the same individual or when they are collected from paired
individuals with familial ties. Borrowing concepts from tree splitting, we propose
a novel approach that accommodates the paired structure in the data for fast and
effective nonparametric variable ranking. Importantly, this method allows detangling
of different types of associations at play with structured correlated outcomes such
as host genotype and environmental exposure effects. Variable importance measures are
another technique for variable selection. We describe two types of measures
useful for paired data analysis. The methodology is illustrated on the microbiota of
paired samples from a case-control study of obesity.
BAMarray™: Java software for Bayesian analysis of variance for microarray data
BACKGROUND: DNA microarrays open up a new horizon for studying the genetic determinants of disease. The high throughput nature of these arrays creates an enormous wealth of information, but also poses a challenge to data analysis. Inferential problems become even more pronounced as experimental designs used to collect data become more complex. An important example is multigroup data collected over different experimental groups, such as data collected from distinct stages of a disease process. We have developed a method specifically addressing these issues termed Bayesian ANOVA for microarrays (BAM). The BAM approach uses a special inferential regularization known as spike-and-slab shrinkage that provides an optimal balance between total false detections and total false non-detections. This translates into more reproducible differential calls. Spike-and-slab shrinkage is a form of regularization achieved by using information across all genes and groups simultaneously.
RESULTS: BAMarray™ is a graphically oriented Java-based software package that implements the BAM method for detecting differentially expressing genes in multigroup microarray experiments (up to 256 experimental groups can be analyzed). Drop-down menus allow the user to easily select between different models and to choose various run options. BAMarray™ can also be operated in a fully automated mode with preselected run options. Tuning parameters have been preset at theoretically optimal values, freeing the user from such specifications. BAMarray™ provides estimates for gene differential effects and automatically estimates data adaptive, optimal cutoff values for classifying genes into biological patterns of differential activity across experimental groups. A graphical suite is a core feature of the product and includes diagnostic plots for assessing model assumptions and interactive plots that enable tracking of prespecified gene lists to study such things as biological pathway perturbations. The user can zoom in and lasso genes of interest that can then be saved for downstream analyses.
CONCLUSION: BAMarray™ is user-friendly, platform-independent software that effectively and efficiently implements the BAM methodology. Classifying patterns of differential activity is greatly facilitated by a data adaptive cutoff rule and a graphical suite. BAMarray™ is licensed software freely available to academic institutions. More information can be found at
Random survival forests
We introduce random survival forests, a random forests method for the
analysis of right-censored survival data. New survival splitting rules for
growing survival trees are introduced, as is a new missing data algorithm for
imputing missing data. A conservation-of-events principle for survival forests
is introduced and used to define ensemble mortality, a simple interpretable
measure of mortality that can be used as a predicted outcome. Several
illustrative examples are given, including a case study of the prognostic
implications of body mass for individuals with coronary artery disease.
Computations for all examples were implemented using the freely available
R-software package, randomSurvivalForest. Comment: Published at
http://dx.doi.org/10.1214/08-AOAS169 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Satisfaction with web-based training in an integrated healthcare delivery network: do age, education, computer skills and attitudes matter?
BACKGROUND: Healthcare institutions spend enormous time and effort to train their workforce. Web-based training can potentially streamline this process. However, the deployment of web-based training in a large-scale setting with a diverse healthcare workforce has not been evaluated. The aim of this study was to evaluate the satisfaction of healthcare professionals with web-based training and to determine the predictors of such satisfaction, including age, education status and computer proficiency.
METHODS: Observational, cross-sectional survey of healthcare professionals from six hospital systems in an integrated delivery network. We measured overall satisfaction with web-based training and responses to survey items measuring Website Usability, Course Usefulness, Instructional Design Effectiveness, Computer Proficiency and Self-learning Attitude.
RESULTS: A total of 17,891 healthcare professionals completed the web-based training on the HIPAA Privacy Rule; of these, 13,537 completed the survey (response rate 75.6%). Overall course satisfaction was good (median, 4; scale, 1 to 5), with more than 75% of the respondents satisfied with the training (rating 4 or 5) and 65% preferring web-based training over traditional instructor-led training (rating 4 or 5). Multivariable ordinal regression revealed 3 key predictors of satisfaction with web-based training: Instructional Design Effectiveness, Website Usability and Course Usefulness. Demographic predictors such as gender, age and education did not have an effect on satisfaction.
CONCLUSION: The study shows that web-based training, when tailored to learners' background, is perceived as a satisfactory mode of learning by an interdisciplinary group of healthcare professionals, irrespective of age, education level or prior computer experience. Future studies should aim to measure the long-term outcomes of web-based training.