6,157 research outputs found
Fitting Prediction Rule Ensembles with R Package pre
Prediction rule ensembles (PREs) are sparse collections of rules, offering
highly interpretable regression and classification models. This paper presents
the R package pre, which derives PREs through the methodology of Friedman and
Popescu (2008). The implementation and functionality of package pre is
described and illustrated through application on a dataset on the prediction of
depression. Furthermore, accuracy and sparsity of PREs is compared with that of
single trees, random forest and lasso regression in four benchmark datasets.
Results indicate that pre derives ensembles with predictive accuracy comparable
to that of random forests, while using a smaller number of variables for
prediction
Stable variable selection for right censored data: comparison of methods
The instability in the selection of models is a major concern with data sets
containing a large number of covariates. This paper deals with variable
selection methodology in the case of high-dimensional problems where the
response variable can be right censored. We focuse on new stable variable
selection methods based on bootstrap for two methodologies: the Cox
proportional hazard model and survival trees. As far as the Cox model is
concerned, we investigate the bootstrapping applied to two variable selection
techniques: the stepwise algorithm based on the AIC criterion and the
L1-penalization of Lasso. Regarding survival trees, we review two
methodologies: the bootstrap node-level stabilization and random survival
forests. We apply these different approaches to two real data sets. We compare
the methods on the prediction error rate based on the Harrell concordance index
and the relevance of the interpretation of the corresponding selected models.
The aim is to find a compromise between a good prediction performance and ease
to interpretation for clinicians. Results suggest that in the case of a small
number of individuals, a bootstrapping adapted to L1-penalization in the Cox
model or a bootstrap node-level stabilization in survival trees give a good
alternative to the random survival forest methodology, known to give the
smallest prediction error rate but difficult to interprete by
non-statisticians. In a clinical perspective, the complementarity between the
methods based on the Cox model and those based on survival trees would permit
to built reliable models easy to interprete by the clinician.Comment: nombre de pages : 29 nombre de tableaux : 2 nombre de figures :
Classification hardness for supervised learners on 20 years of intrusion detection data
This article consolidates analysis of established (NSL-KDD) and new intrusion detection datasets (ISCXIDS2012, CICIDS2017, CICIDS2018) through the use of supervised machine learning (ML) algorithms. The uniformity in analysis procedure opens up the option to compare the obtained results. It also provides a stronger foundation for the conclusions about the efficacy of supervised learners on the main classification task in network security. This research is motivated in part to address the lack of adoption of these modern datasets. Starting with a broad scope that includes classification by algorithms from different families on both established and new datasets has been done to expand the existing foundation and reveal the most opportune avenues for further inquiry. After obtaining baseline results, the classification task was increased in difficulty, by reducing the available data to learn from, both horizontally and vertically. The data reduction has been included as a stress-test to verify if the very high baseline results hold up under increasingly harsh constraints. Ultimately, this work contains the most comprehensive set of results on the topic of intrusion detection through supervised machine learning. Researchers working on algorithmic improvements can compare their results to this collection, knowing that all results reported here were gathered through a uniform framework. This work's main contributions are the outstanding classification results on the current state of the art datasets for intrusion detection and the conclusion that these methods show remarkable resilience in classification performance even when aggressively reducing the amount of data to learn from
Improved customer choice predictions using ensemble methods
In this paper various ensemble learning methods from machinelearning and statistics are considered and applied to the customerchoice modeling problem. The application of ensemble learningusually improves the prediction quality of flexible models likedecision trees and thus leads to improved predictions. We giveexperimental results for two real-life marketing datasets usingdecision trees, ensemble versions of decision trees and thelogistic regression model, which is a standard approach for thisproblem. The ensemble models are found to improve upon individualdecision trees and outperform logistic regression.Next, an additive decomposition of the prediction error of amodel, the bias/variance decomposition, is considered. A modelwith a high bias lacks the flexibility to fit the data well. Ahigh variance indicates that a model is instable with respect todifferent datasets. Decision trees have a high variance componentand a low bias component in the prediction error, whereas logisticregression has a high bias component and a low variance component.It is shown that ensemble methods aim at minimizing the variancecomponent in the prediction error while leaving the bias componentunaltered. Bias/variance decompositions for all models for bothcustomer choice datasets are given to illustrate these concepts.brand choice;data mining;boosting;choice models;Bias/Variance decomposition;Bagging;CART;ensembles
Bagging ensemble selection for regression
Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that using random trees as base classifiers, BES-OOB (the most successful variant of BES) is competitive with (and in many cases, superior to) other ensemble learning strategies, for instance, the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BES-OOB strategy for regression problems. Our results show that the BES-OOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the non negative least squares algorithm is a viable approach for pruning an ensemble of ensembles
An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, that can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years.
High dimensional problems are common not only in genetics, but also in some areas of psychological research, where only few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve a high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing
- …