ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R
We introduce the C++ application and R package ranger. The software is a fast
implementation of random forests for high dimensional data. Ensembles of
classification, regression and survival trees are supported. We describe the
implementation, provide examples, validate the package with a reference
implementation, and compare runtime and memory usage with other
implementations. The new software proves to scale best with the number of
features, samples, trees, and features tried for splitting. Finally, we show
that ranger is the fastest and most memory efficient implementation of random
forests to analyze data on the scale of a genome-wide association study.
Stable variable selection for right censored data: comparison of methods
The instability in the selection of models is a major concern with data sets
containing a large number of covariates. This paper deals with variable
selection methodology in the case of high-dimensional problems where the
response variable can be right censored. We focus on new stable variable
selection methods based on the bootstrap for two methodologies: the Cox
proportional hazards model and survival trees. As far as the Cox model is
concerned, we investigate bootstrapping applied to two variable selection
techniques: the stepwise algorithm based on the AIC criterion and the
L1-penalization of the Lasso. Regarding survival trees, we review two
methodologies: the bootstrap node-level stabilization and random survival
forests. We apply these different approaches to two real data sets. We compare
the methods on the prediction error rate, based on Harrell's concordance index,
and on the relevance of the interpretation of the corresponding selected models.
The aim is to find a compromise between good prediction performance and ease
of interpretation for clinicians. Results suggest that, with a small
number of individuals, bootstrapping adapted to L1-penalization in the Cox
model or bootstrap node-level stabilization in survival trees gives a good
alternative to the random survival forest methodology, which is known to give
the smallest prediction error rate but is difficult for non-statisticians to
interpret. From a clinical perspective, the complementarity between the
methods based on the Cox model and those based on survival trees would make it
possible to build reliable models that are easy for the clinician to interpret.
Comment: number of pages: 29; number of tables: 2; number of figures:
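The abstract above compares methods on a prediction error rate based on Harrell's concordance index. As a rough illustration of what that index measures, here is a minimal pure-Python sketch. It is not the authors' code: the function name `harrell_c_index`, the argument names, and the choice to skip pairs with tied observed times are all assumptions made for this example.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs ranked correctly by the risk score.

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if right censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption of this sketch: pairs with tied times are skipped
        # Let a be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair cannot be ordered
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk failed first: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (invented for illustration): the third subject is censored.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
c = harrell_c_index(times, events, [4, 3, 2, 1])
error_rate = 1.0 - c  # one common way to turn the index into an error rate
```

A value of 1 means every comparable pair is ordered correctly, and 0.5 corresponds to random ranking. Established libraries handle tied times and other edge cases more carefully than this sketch does.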
Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data
BACKGROUND: Molecular data, e.g. arising from microarray technology, is often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. RESULTS: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is posed by interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real-world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. CONCLUSION: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.