The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial
In this tutorial paper, we first define mean squared error, variance,
covariance, and bias of both random variables and classification/predictor
models. Then, we formulate the true and generalization errors of the model for
both training and validation/test instances where we make use of Stein's
Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and
generalization using the obtained true and generalization errors. We introduce
cross validation and two well-known examples which are $K$-fold and
leave-one-out cross validations. We briefly introduce generalized cross
validation and then move on to regularization where we use the SURE again. We
work on both $\ell_2$ and $\ell_1$ norm regularizations. Then, we show that
bootstrap aggregating (bagging) reduces the variance of estimation. Boosting,
specifically AdaBoost, is introduced and it is explained as both an additive
model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper
bound on the generalization error of boosting is also provided to show why
boosting resists overfitting. As examples of regularization, the theory
of ridge and lasso regressions, weight decay, noise injection to input/weights,
and early stopping are explained. Random forest, dropout, histogram of oriented
gradients, and single shot multi-box detector are explained as examples of
bagging in machine learning and computer vision. Finally, boosting tree and SVM
models are mentioned as examples of boosting. Comment: 23 pages, 9 figures
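
The following is a minimal sketch, not taken from the tutorial itself, of how K-fold cross validation can be used to choose the strength of an $\ell_2$ (ridge) penalty. The synthetic data, the candidate penalty values, and the helper names `ridge_fit` and `kfold_mse` are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_mse(X, y, lam, k=5, seed=0):
    """Average validation mean squared error of ridge regression over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle once, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errors))

# Synthetic regression problem (hypothetical, for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=60)

# Evaluate each candidate penalty and keep the one with the lowest validation error.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: kfold_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
print(scores, best_lam)
```

Using the same fold split (fixed `seed`) for every candidate penalty keeps the comparison between penalties fair. Setting `k` equal to the number of instances would give leave-one-out cross validation.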
Vote-boosting ensembles
Vote-boosting is a sequential ensemble learning method in which the
individual classifiers are built on different weighted versions of the training
data. To build a new classifier, the weight of each training instance is
determined in terms of the degree of disagreement among the current ensemble
predictions for that instance. For low class-label noise levels, especially
when simple base learners are used, emphasis should be placed on instances for
which the disagreement rate is high. When more flexible classifiers are used
and as the noise level increases, the emphasis on these uncertain instances
should be reduced. In fact, at sufficiently high levels of class-label noise,
the focus should be on instances on which the ensemble classifiers agree. The
optimal type of emphasis can be automatically determined using
cross-validation. An extensive empirical analysis using the beta distribution
as emphasis function illustrates that vote-boosting is an effective method to
generate ensembles that are both accurate and robust.
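
As a rough illustration of the weighting idea described above, the sketch below computes instance weights from the votes of a binary ensemble using a symmetric beta density as the emphasis function. It is not the authors' implementation: the function names, the smoothing of the vote fraction (used here to keep the density finite), and the toy vote matrix are assumptions.

```python
import math
import numpy as np

def beta_pdf(x, a, b):
    """Beta(a, b) density evaluated at x in (0, 1), computed in log space."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(x) + (b - 1) * np.log(1 - x))

def vote_weights(votes, a, b):
    """Instance weights from a (T, n) array of 0/1 votes by T classifiers.

    The fraction of votes for class 1 is smoothed to lie strictly inside
    (0, 1) (an assumption of this sketch), passed through the beta density,
    and normalized to sum to one.
    """
    T = votes.shape[0]
    frac = (votes.sum(axis=0) + 1.0) / (T + 2.0)
    w = beta_pdf(frac, a, b)
    return w / w.sum()

# Toy ensemble: 4 classifiers voting on 4 instances (hypothetical data).
# Instance 1 splits the ensemble 2-2; instances 0 and 2 are unanimous.
votes = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
])

w_disagree = vote_weights(votes, 2.0, 2.0)  # a = b > 1: emphasize disagreement
w_agree = vote_weights(votes, 0.5, 0.5)     # a = b < 1: emphasize agreement
print(w_disagree, w_agree)
```

With a = b > 1 the density peaks at an even split, so the most contested instance receives the largest weight; with a = b < 1 the emphasis shifts to instances on which the classifiers agree. In practice the shape parameter would be chosen by cross-validation, as the abstract notes.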
Phishing Website Detection Using Several Machine Learning Algorithms: A Review Paper
Phishing is one of the major web social engineering attacks. This has led to demand for a better way to predict and stop such attacks in a commercial environment. This paper seeks to understand the research done in the field and analyse the next steps forward. It does so by first focusing on what goes into the selection of proper features, from manual selection to the use of Genetic Algorithms and boosting methods such as AdaBoost and MultiBoost. It then looks at the classifiers in use: Neural Networks and Ensemble algorithms were prominent, alongside some novel approaches. This information is then processed into a framework for cloud-based and client-based phishing website detection, alongside suggestions for possible future research and experiments that could help progress the field.
RiskLogitboost Regression for Rare Events in Binary Response: An Econometric Approach
A boosting-based machine learning algorithm is presented to model a binary response with large imbalance, i.e., a rare event. The new method (i) reduces the prediction error of the rare class, and (ii) approximates an econometric model that allows interpretability. RiskLogitboost regression includes a weighting mechanism that oversamples or undersamples observations according to their misclassification likelihood and a generalized least squares bias correction strategy to reduce the prediction error. An illustration using a real French third-party liability motor insurance data set is presented. The results show that RiskLogitboost regression improves the rate of detection of rare events compared to some boosting-based and tree-based algorithms and some existing methods designed to treat imbalanced responses.