7,545 research outputs found
An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, that can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years.
High dimensional problems are common not only in genetics, but also in some areas of psychological research, where only few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve a high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing
Node harvest
When choosing a suitable technique for regression and classification with
multivariate predictor variables, one is often faced with a tradeoff between
interpretability and high predictive accuracy. To give a classical example,
classification and regression trees are easy to understand and interpret. Tree
ensembles like Random Forests provide usually more accurate predictions. Yet
tree ensembles are also more difficult to analyze than single trees and are
often criticized, perhaps unfairly, as `black box' predictors. Node harvest is
trying to reconcile the two aims of interpretability and predictive accuracy by
combining positive aspects of trees and tree ensembles. Results are very sparse
and interpretable and predictive accuracy is extremely competitive, especially
for low signal-to-noise data. The procedure is simple: an initial set of a few
thousand nodes is generated randomly. If a new observation falls into just a
single node, its prediction is the mean response of all training observation
within this node, identical to a tree-like prediction. A new observation falls
typically into several nodes and its prediction is then the weighted average of
the mean responses across all these nodes. The only role of node harvest is to
`pick' the right nodes from the initial large ensemble of nodes by choosing
node weights, which amounts in the proposed algorithm to a quadratic
programming problem with linear inequality constraints. The solution is sparse
in the sense that only very few nodes are selected with a nonzero weight. This
sparsity is not explicitly enforced. Maybe surprisingly, it is not necessary to
select a tuning parameter for optimal predictive accuracy. Node harvest can
handle mixed data and missing values and is shown to be simple to interpret and
competitive in predictive accuracy on a variety of data sets.Comment: Published in at http://dx.doi.org/10.1214/10-AOAS367 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Fitting Prediction Rule Ensembles with R Package pre
Prediction rule ensembles (PREs) are sparse collections of rules, offering
highly interpretable regression and classification models. This paper presents
the R package pre, which derives PREs through the methodology of Friedman and
Popescu (2008). The implementation and functionality of package pre is
described and illustrated through application on a dataset on the prediction of
depression. Furthermore, accuracy and sparsity of PREs is compared with that of
single trees, random forest and lasso regression in four benchmark datasets.
Results indicate that pre derives ensembles with predictive accuracy comparable
to that of random forests, while using a smaller number of variables for
prediction
- …