The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics
The small-sample size issue is a prevalent problem in Genomics and Proteomics today.
Bootstrap, a resampling method which aims at increasing the efficiency of data usage,
is considered to be an effort to overcome the problem of limited sample size. This dissertation
studies the application of bootstrap to two problems of supervised learning with small
sample data: estimation of the misclassification error of Gaussian discriminant analysis,
and the bagging ensemble classification method.
Estimating the misclassification error of discriminant analysis is a classical problem in
pattern recognition and has many important applications in biomedical research. Bootstrap
error estimation has been shown empirically to be one of the best estimation methods in
terms of root mean squared error. In the first part of this work, we conduct a detailed
analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA)
classification rule under Gaussian populations. We derive the exact formulas of the first
and the second moment of the zero bootstrap and the convex bootstrap estimators, as well
as their cross moments with the resubstitution estimator and the true error. Based on these
results, we obtain the exact formulas of the bias, the variance, and the root mean squared
error of the deviation from the true error of these bootstrap estimators. This includes the
moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weight
for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all
the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions.
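As a rough illustration of the estimators named above, the following sketch computes a resubstitution error, a zero bootstrap error, and their convex combination for a univariate LDA rule with equal priors (threshold at the midpoint of the class means). It is a Monte Carlo approximation, not the exact analytical formulas derived in the dissertation. The pooled-count form of the zero bootstrap, the skipping of resamples that miss a class, and all function names and defaults (`fit_lda_1d`, `convex_bootstrap`, `n_boot`) are assumptions made for this example.

```python
import random

def fit_lda_1d(xs, ys):
    """Univariate LDA with equal priors: store the two class means."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return sum(class0) / len(class0), sum(class1) / len(class1)

def predict(model, x):
    """Assign x to the class with the nearer mean (midpoint threshold)."""
    mean0, mean1 = model
    return 0 if abs(x - mean0) <= abs(x - mean1) else 1

def error_rate(model, xs, ys):
    """Fraction of the given points that the model misclassifies."""
    return sum(predict(model, x) != y for x, y in zip(xs, ys)) / len(xs)

def convex_bootstrap(xs, ys, weight=0.632, n_boot=200, seed=0):
    """Return (convex estimate, resubstitution error, zero bootstrap error)."""
    rng = random.Random(seed)
    n = len(xs)
    resub = error_rate(fit_lda_1d(xs, ys), xs, ys)
    wrong = tested = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_y = [ys[i] for i in idx]
        if len(set(boot_y)) < 2:
            continue  # assumption of this sketch: skip resamples missing a class
        model = fit_lda_1d([xs[i] for i in idx], boot_y)
        in_bag = set(idx)
        # Test only on points left out of this bootstrap sample.
        for i in range(n):
            if i not in in_bag:
                tested += 1
                wrong += predict(model, xs[i]) != ys[i]
    if tested == 0:
        raise ValueError("no out-of-sample points were available")
    zero = wrong / tested
    return (1 - weight) * resub + weight * zero, resub, zero
```

With `weight=0.632` the combination corresponds to the .632 bootstrap estimator; other weights give other members of the convex family.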
In the second part of this work, we conduct an extensive empirical investigation of
bagging, which is an application of bootstrap to ensemble classification. We investigate
the performance of bagging in the classification of small-sample gene-expression data and
protein-abundance mass spectrometry data, as well as the accuracy of small-sample error
estimation with this ensemble classification rule. We observed that, under t-test and
RELIEF filter-based feature selection, bagging generally does a good job of improving
the performance of unstable, overfitting classifiers, such as CART decision trees and neural
networks, but that improvement was not sufficient to beat the performance of single stable,
non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or
3-nearest neighbors. Furthermore, the ensemble method did not improve the performance
of these stable classifiers significantly. We give an explicit definition of the out-of-bag estimator
that is intended to remove estimator bias, by formulating carefully how the error
count is normalized, and investigate the performance of error estimation for bagging of
common classification rules, including LDA, 3NN, and CART, applied on both synthetic
and real patient data, corresponding to the use of common error estimators such as resubstitution,
leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus,
bolstering, semi-bolstering, in addition to the out-of-bag estimator. The results from the
numerical experiments indicated that the performance of the out-of-bag estimator is very
similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically
biased. The performance of the other estimators is consistent with their performance
with the corresponding single classifiers, as reported in other studies. The results of this
work are expected to provide helpful guidance to practitioners who are interested in applying
the bootstrap in supervised learning applications.
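To make the idea of an out-of-bag error estimate for bagging concrete, here is a minimal sketch using decision stumps on one-dimensional data. Each training point is classified by a majority vote of only those ensemble members whose bootstrap sample did not contain it, and the error count is divided by the number of points that were out of bag at least once. That normalization is one plausible choice and is an assumption here; the dissertation's exact definition is not reproduced. The stump learner, tie-breaking toward class 0, and all names are likewise illustrative.

```python
import random

def fit_stump(xs, ys):
    """Choose the threshold and polarity with the fewest training errors."""
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = None
    for t in candidates:
        for polarity in (0, 1):
            errors = sum((polarity if x > t else 1 - polarity) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, polarity)
    return best[1], best[2]

def stump_predict(stump, x):
    """Predict `polarity` above the threshold and the other label below it."""
    t, polarity = stump
    return polarity if x > t else 1 - polarity

def bagging_oob_error(xs, ys, n_members=50, seed=0):
    """Out-of-bag error of a bagged stump ensemble (binary labels 0/1)."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]  # out-of-bag votes per point, per label
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:
                votes[i][stump_predict(stump, xs[i])] += 1
    counted = wrong = 0
    for (v0, v1), y in zip(votes, ys):
        if v0 + v1 == 0:
            continue  # point was in every bootstrap sample; leave it out
        counted += 1
        wrong += (1 if v1 > v0 else 0) != y  # ties go to class 0
    if counted == 0:
        raise ValueError("no point was ever out of bag")
    return wrong / counted
```

Because each point is judged only by members that never saw it, the estimate behaves much like a hold-out error computed without setting data aside.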
Bagging ensemble selection for regression
Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that, with random trees as base classifiers, BES-OOB (the most successful variant of BES) is competitive with (and in many cases, superior to) other ensemble learning strategies, for instance, the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BES-OOB strategy for regression problems. Our results show that the BES-OOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the non-negative least squares algorithm is a viable approach for pruning an ensemble of ensembles.
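The pruning idea mentioned at the end of the abstract can be sketched as follows: fit non-negative weights to the members' validation predictions and drop every member whose weight is zero. The example below solves the non-negative least squares problem with plain projected gradient descent, chosen only to keep the sketch dependency-free; it is not the solver used in the paper, and the names `nnls_weights` and `prune` and the tolerance are hypothetical.

```python
def nnls_weights(preds, targets, n_iter=500):
    """Minimise ||P w - y||^2 subject to w >= 0 by projected gradient descent.

    preds[i][j] is the prediction of ensemble member j on validation example i.
    """
    n_members = len(preds[0])
    # 1 / ||P||_F^2 is a safe step size: it never exceeds 1 / lambda_max(P^T P).
    frob_sq = sum(p * p for row in preds for p in row)
    if frob_sq == 0:
        return [0.0] * n_members
    step = 1.0 / frob_sq
    w = [0.0] * n_members
    for _ in range(n_iter):
        residuals = [sum(wj * pj for wj, pj in zip(w, row)) - t
                     for row, t in zip(preds, targets)]
        grad = [sum(r * row[j] for r, row in zip(residuals, preds))
                for j in range(n_members)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, wj - step * gj) for wj, gj in zip(w, grad)]
    return w

def prune(weights, tol=1e-9):
    """Indices of the members that keep a positive weight."""
    return [j for j, wj in enumerate(weights) if wj > tol]
```

For example, if one member reproduces the targets exactly and another predicts their negation, the first receives a weight near 1 and the second is pruned.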
An investigation into machine learning approaches for forecasting spatio-temporal demand in ride-hailing service
In this paper, we present machine learning approaches for characterizing and
forecasting the short-term demand for on-demand ride-hailing services. We
propose the spatio-temporal estimation of the demand that is a function of
variable effects related to traffic, pricing and weather conditions. With
respect to the methodology, a single decision tree, bootstrap-aggregated
(bagged) decision trees, random forest, boosted decision trees, and artificial
neural network for regression have been adapted and systematically compared
using various statistics, e.g. R-square, Root Mean Square Error (RMSE), and
slope. To better assess the quality of the models, they have been tested on a
real case study using the data of DiDi Chuxing, the main on-demand ride hailing
service provider in China. In the current study, 199,584 time-slots describing
the spatio-temporal ride-hailing demand have been extracted with an
aggregated-time interval of 10 minutes. All the methods are trained and validated
on the basis of two independent samples from this dataset. The results revealed
that boosted decision trees provide the best prediction accuracy (RMSE=16.41),
while avoiding the risk of over-fitting, followed by artificial neural network
(20.09), random forest (23.50), bagged decision trees (24.29) and single
decision tree (33.55).
Comment: Currently under review for journal publication
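For readers unfamiliar with the comparison statistics named in the abstract above, the sketch below shows one common way to compute RMSE, R-square, and slope from observed and predicted demand values. The abstract does not give the exact definitions used; here R-square is taken as one minus the ratio of residual to total sum of squares, and slope as the least-squares slope of predicted values regressed on observed values. Both are assumptions, as are the function names.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """1 - SS_res / SS_tot (assumed definition of R-square)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def slope(observed, predicted):
    """Least-squares slope of predicted regressed on observed (1.0 is ideal)."""
    mean_obs = sum(observed) / len(observed)
    mean_pred = sum(predicted) / len(predicted)
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    var = sum((o - mean_obs) ** 2 for o in observed)
    return cov / var
```

A model with perfect predictions would score an RMSE of 0 and an R-square and slope of 1.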
Bounding Optimality Gap in Stochastic Optimization via Bagging: Statistical Efficiency and Stability
We study a statistical method to estimate the optimal value, and the
optimality gap of a given solution for stochastic optimization as an assessment
of the solution quality. Our approach is based on bootstrap aggregating, or
bagging, resampled sample average approximation (SAA). We show how this
approach leads to valid statistical confidence bounds for non-smooth
optimization. We also demonstrate its statistical efficiency and stability that
are especially desirable in limited-data situations, and compare these
properties with some existing methods. We present our theory that views SAA as
a kernel in an infinite-order symmetric statistic, which can be approximated
via bagging. We substantiate our theoretical findings with numerical results.
An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years.
High dimensional problems are common not only in genetics, but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
Localized Regression
The main problem with localized discriminant techniques is the curse of dimensionality, which seems to restrict their use to the case of few variables. This restriction does not hold if localization is combined with a reduction of dimension. In particular, it is shown that localization yields powerful classifiers even in higher dimensions if localization is combined with locally adaptive selection of predictors. A robust localized logistic regression (LLR) method is developed for which all tuning parameters are chosen data-adaptively. In an extended simulation study, we evaluate the potential of the proposed procedure for various types of data and compare it to other classification procedures. In addition, we demonstrate that automatic choice of localization, predictor selection and penalty parameters based on cross-validation works well. Finally, the method is applied to real data sets and its real-world performance is compared to alternative procedures.
Bagging and boosting classification trees to predict churn.
Bagging; Boosting; Classification; Churn;