Use of pre-transformation to cope with outlying values in important candidate genes
Outlying values in predictors often strongly affect the results of statistical analyses in high-dimensional settings. Although they frequently occur with most high-throughput techniques, the problem is often ignored in the literature. We suggest using a very simple transformation, proposed earlier in a different context by Royston and Sauerbrei, as an intermediate step between array normalization and high-level statistical analysis. This straightforward univariate transformation identifies extreme values and considerably reduces the influence of outlying values in all subsequent steps of the statistical analysis, without eliminating the incriminated observation or feature. The use of the transformation and its effects are demonstrated for diverse univariate and multivariate statistical analyses using nine publicly available microarray data sets.
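The effect of such a pre-transformation can be illustrated with a rank-based normal-scores transform. This is a sketch in the same spirit as the abstract (extreme values are retained but their leverage is removed), not necessarily the exact transformation of Royston and Sauerbrei:

```python
from statistics import NormalDist

def rank_transform(values):
    """Map each value to a normal score based on its rank.  Extreme
    observations keep their ordering but lose their leverage, since the
    transformed value depends only on the rank, not the magnitude."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):    # 1-based ranks (ties not averaged)
        ranks[i] = r
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]

x = [0.1, 0.4, 0.2, 0.3, 50.0]   # the last value is a gross outlier
z = rank_transform(x)            # the outlier is pulled in to the top normal score
```

Any downstream analysis (testing, clustering, regression) then operates on `z`, where the outlier can no longer dominate.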
Variable Selection and Model Choice in Structured Survival Models
In many situations, medical applications call for flexible survival models that extend the classical Cox model via the
inclusion of time-varying and nonparametric effects. These structured survival models are very flexible, but additional
difficulties arise when model choice and variable selection are desired. In particular, it has to be decided which covariates
should be assigned time-varying effects or whether parametric modeling is sufficient for a given covariate. Component-wise
boosting provides a means of likelihood-based model fitting that enables simultaneous variable selection and model choice. We
introduce a component-wise likelihood-based boosting algorithm for survival data that permits the inclusion of both parametric
and nonparametric time-varying effects as well as nonparametric effects of continuous covariates utilizing penalized splines as
the main modeling technique. Its properties
and performance are investigated in simulation studies.
The new modeling approach is used to build a flexible survival model for
intensive care patients suffering from severe sepsis.
A software implementation is available to the interested reader.
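The component-wise scheme can be sketched for the Cox model in a few lines: in each iteration, a simple least-squares base-learner is fitted per covariate to the negative gradient of the partial log-likelihood, and only the best-fitting component is updated. The following is a minimal numpy illustration on simulated data (linear base-learners only, no censoring, no ties), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# simulated truth: only the first two covariates affect the hazard
times = rng.exponential(1.0 / np.exp(X[:, 0] - X[:, 1]))
events = np.ones(n, dtype=bool)                # no censoring, for brevity

def cox_negative_gradient(eta, times, events):
    """Score residuals of the Cox partial log-likelihood (no ties)."""
    risk = np.exp(eta)
    at_risk = times[:, None] >= times[None, :]             # i at risk at t_j
    denom = (at_risk * risk[:, None]).sum(axis=0)          # risk-set sums per event
    expected = (events[None, :] * at_risk / denom).sum(axis=1) * risk
    return events.astype(float) - expected

beta, nu = np.zeros(p), 0.1                    # nu is the step length
for _ in range(200):
    u = cox_negative_gradient(X @ beta, times, events)
    coefs = X.T @ u / (X ** 2).sum(axis=0)     # one least-squares fit per covariate
    j = int(np.argmax(coefs ** 2 * (X ** 2).sum(axis=0)))  # best base-learner
    beta[j] += nu * coefs[j]                   # update only that component
```

Because only one component moves per iteration, covariates that are never selected keep a zero coefficient, which is what makes variable selection a by-product of the fitting procedure.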
GAMLSS for high-dimensional data – a flexible approach based on boosting
Generalized additive models for location, scale and shape (GAMLSS) are a popular semi-parametric modelling approach that, in contrast to conventional GAMs, regress not only the expected mean but every distribution parameter (e.g., location, scale and shape) on a set of covariates. Current fitting procedures for GAMLSS are infeasible for high-dimensional data settings and require variable selection based on (potentially problematic) information criteria. The present work describes a boosting algorithm for high-dimensional GAMLSS that was developed to overcome these limitations. Specifically, the new algorithm was designed to allow the simultaneous estimation of predictor effects and variable selection. The proposed algorithm was applied to data of the Munich Rental Guide, which is used by
landlords and tenants as a reference for the average rent of a flat depending on its characteristics and spatial features. The net-rent predictions that resulted from the high-dimensional GAMLSS were found to be highly competitive, while covariate-specific prediction intervals showed a major improvement over classical GAMs.
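The idea of boosting every distribution parameter can be sketched for the simplest case, a Gaussian location-scale model: each iteration makes one component-wise update for the mean and one for the log standard deviation, each driven by the corresponding likelihood gradient. A minimal simulation sketch, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 8
X = rng.normal(size=(n, p))
# simulated truth: covariate 0 drives the mean, covariate 1 the scale
y = 2.0 * X[:, 0] + rng.normal(size=n) * np.exp(0.5 * X[:, 1])

def best_component(X, r):
    """Component-wise least-squares base-learner: fit each covariate to the
    working residual r, return the single best update."""
    coefs = X.T @ r / (X ** 2).sum(axis=0)
    j = int(np.argmax(coefs ** 2 * (X ** 2).sum(axis=0)))
    return j, coefs[j]

beta = np.zeros(p)    # additive predictor for the location (mu)
gamma = np.zeros(p)   # additive predictor for log(sigma) (scale)
nu = 0.1              # step length
for _ in range(300):
    sigma = np.exp(X @ gamma)
    # location step: negative Gaussian log-likelihood gradient in mu
    j, c = best_component(X, (y - X @ beta) / sigma ** 2)
    beta[j] += nu * c
    # scale step: negative gradient in log(sigma), using the fresh mu
    j, c = best_component(X, (y - X @ beta) ** 2 / sigma ** 2 - 1.0)
    gamma[j] += nu * c
```

Covariate-specific prediction intervals then follow directly from the fitted pair `(mu, sigma)` per observation, which is what a mean-only GAM cannot provide.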
Stable variable selection for right censored data: comparison of methods
The instability in the selection of models is a major concern with data sets
containing a large number of covariates. This paper deals with variable
selection methodology in the case of high-dimensional problems where the
response variable can be right censored. We focus on new stable variable
selection methods based on bootstrap for two methodologies: the Cox
proportional hazard model and survival trees. As far as the Cox model is
concerned, we investigate the bootstrapping applied to two variable selection
techniques: the stepwise algorithm based on the AIC criterion and the
Lasso L1 penalization. Regarding survival trees, we review two
methodologies: the bootstrap node-level stabilization and random survival
forests. We apply these different approaches to two real data sets. We compare
the methods on the prediction error rate based on the Harrell concordance index
and the relevance of the interpretation of the corresponding selected models.
The aim is to find a compromise between good prediction performance and ease of
interpretation for clinicians. Results suggest that, in the case of a small
number of individuals, bootstrapping adapted to L1 penalization in the Cox
model or bootstrap node-level stabilization in survival trees provides a good
alternative to the random survival forest methodology, which is known to give the
smallest prediction error rate but to be difficult for non-statisticians to
interpret. From a clinical perspective, the complementarity between the
methods based on the Cox model and those based on survival trees would make it
possible to build reliable models that are easy for clinicians to interpret.
Comment: number of pages: 29; number of tables: 2; number of figures:
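The bootstrap stabilization idea is independent of the particular selector: run the selection procedure on many bootstrap resamples and keep only the covariates that are selected in a large fraction of them. A minimal sketch, using a simple marginal-association screen as a hypothetical stand-in for the Cox lasso or a survival tree:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, B = 120, 30, 100
X = rng.normal(size=(n, p))
# simulated outcome: only covariates 0 and 1 are informative
y = X[:, 0] - X[:, 1] + rng.normal(size=n)

def select(X, y, k=3):
    """Hypothetical stand-in selector: keep the k covariates with the largest
    absolute marginal association with the outcome.  In the paper the selector
    is e.g. a Cox lasso or a survival tree; any selector plugs in here."""
    score = np.abs(X.T @ (y - y.mean())) / np.linalg.norm(X, axis=0)
    return set(np.argsort(score)[-k:].tolist())

freq = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)     # bootstrap resample, with replacement
    for j in select(X[idx], y[idx]):
        freq[j] += 1
freq /= B                                # per-covariate selection frequency
stable = np.where(freq >= 0.8)[0]        # stably selected covariates
```

Reporting the selection frequencies alongside the final model is what makes the result interpretable to clinicians: a covariate selected in 95% of resamples is a much safer basis for discussion than one selected once.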
Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost
We provide a detailed hands-on tutorial for the R add-on package mboost. The package implements boosting for optimizing general risk functions, utilizing component-wise (penalized) least squares estimates as base-learners for fitting various kinds of generalized linear and generalized additive models to potentially high-dimensional data. We give a theoretical background and demonstrate how mboost can be used to fit interpretable models of different complexity. As an example, we use mboost throughout the tutorial to predict body fat based on anthropometric measurements.
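The core mechanism behind component-wise least-squares boosting, the method mboost implements, fits every base-learner to the current residuals in each iteration and updates only the best one. The sketch below is a Python stand-in for illustration (mboost itself is an R package), using plain linear base-learners:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 20
X = rng.normal(size=(n, p))
# sparse truth: only three covariates enter the model
y = 1.5 * X[:, 0] - 1.0 * X[:, 3] + 0.5 * X[:, 7] + rng.normal(size=n)

beta, nu = np.zeros(p), 0.1                   # nu is the step length
for _ in range(250):
    r = y - X @ beta                          # current residuals
    coefs = X.T @ r / (X ** 2).sum(axis=0)    # one least-squares fit per covariate
    j = int(np.argmax(coefs ** 2 * (X ** 2).sum(axis=0)))   # best base-learner
    beta[j] += nu * coefs[j]                  # update that component only

selected = np.nonzero(np.abs(beta) > 1e-8)[0]  # intrinsic variable selection
```

Stopping the loop early (the number of iterations is the main tuning parameter) shrinks the coefficients and leaves uninformative covariates at exactly zero, which is why boosting performs variable selection as a side effect.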