435 research outputs found

    The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial

    In this tutorial paper, we first define mean squared error, variance, covariance, and bias of both random variables and classification/predictor models. Then, we formulate the true and generalization errors of the model for both training and validation/test instances, making use of Stein's Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and generalization using the obtained true and generalization errors. We introduce cross validation and two well-known examples, $K$-fold and leave-one-out cross validation. We briefly introduce generalized cross validation and then move on to regularization, where we use SURE again. We work on both $\ell_2$ and $\ell_1$ norm regularization. Then, we show that bootstrap aggregating (bagging) reduces the variance of estimation. Boosting, specifically AdaBoost, is introduced and explained as both an additive model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper bound on the generalization error of boosting is also provided to show why boosting prevents overfitting. As examples of regularization, the theory of ridge and lasso regression, weight decay, noise injection to input/weights, and early stopping are explained. Random forest, dropout, histogram of oriented gradients, and single shot multi-box detector are explained as examples of bagging in machine learning and computer vision. Finally, boosting tree and SVM models are mentioned as examples of boosting. Comment: 23 pages, 9 figures
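The cross-validation schemes named in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `k_fold_indices` and the choice of contiguous, unshuffled folds are assumptions made for illustration. Leave-one-out cross validation is the special case where the number of folds equals the number of instances.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k contiguous folds over range(n).

    Hypothetical helper for illustration; folds are not shuffled.
    Setting k == n gives leave-one-out cross validation.
    """
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    base, extra = divmod(n, k)  # the first `extra` folds get one more instance
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, validation
        start += size


if __name__ == "__main__":
    for train, validation in k_fold_indices(10, 3):
        print(len(train), validation)
```

Each instance appears in exactly one validation fold, so averaging the validation error over the folds uses every instance once for evaluation.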

    Vote-boosting ensembles

    Vote-boosting is a sequential ensemble learning method in which the individual classifiers are built on different weighted versions of the training data. To build a new classifier, the weight of each training instance is determined by the degree of disagreement among the current ensemble predictions for that instance. For low class-label noise levels, especially when simple base learners are used, emphasis should be placed on instances for which the disagreement rate is high. When more flexible classifiers are used, and as the noise level increases, the emphasis on these uncertain instances should be reduced. In fact, at sufficiently high levels of class-label noise, the focus should be on instances on which the ensemble classifiers agree. The optimal type of emphasis can be determined automatically using cross-validation. An extensive empirical analysis using the beta distribution as the emphasis function illustrates that vote-boosting is an effective method for generating ensembles that are both accurate and robust.
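The weighting idea described in this abstract can be sketched roughly as follows for a binary problem. This is an illustrative reading, not the authors' implementation: the function names, the use of a symmetric or asymmetric Beta(a, b) density evaluated at the vote fraction, and the smoothing `(v + 1) / (T + 2)` that keeps the fraction away from 0 and 1 are all assumptions.

```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def vote_emphasis_weights(votes_for_class_one, ensemble_size, a, b):
    """Normalized instance weights from the current ensemble votes.

    votes_for_class_one[i] is the number of ensemble members that predict
    class 1 for instance i. The smoothed vote fraction (an assumption made
    here) is mapped through the beta density used as emphasis function.
    """
    raw = []
    for votes in votes_for_class_one:
        fraction = (votes + 1) / (ensemble_size + 2)
        raw.append(beta_pdf(fraction, a, b))
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    votes = [0, 5, 10]  # unanimous, evenly split, unanimous
    print(vote_emphasis_weights(votes, 10, 2.0, 2.0))  # emphasis on disagreement
    print(vote_emphasis_weights(votes, 10, 0.5, 0.5))  # emphasis on agreement
```

With `a = b > 1` the density peaks at a vote fraction of 0.5, so instances on which the ensemble disagrees get the largest weights; with `a = b < 1` the density is largest near 0 and 1, which shifts the emphasis to instances on which the ensemble agrees. The shape parameters would then be the quantities selected by cross-validation.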

    Phishing Website Detection Using Several Machine Learning Algorithms: A Review Paper

    Phishing is one of the major web social engineering attacks. This has led to demand for a better way to predict and stop such attacks in a commercial environment. This paper seeks to understand the research done in the field and analyse the next steps forward. It does so by focusing on what goes into the selection of proper features, from manual selection to the use of Genetic Algorithms such as AdaBoost and MultiBoost. It then looks at the classifiers in use, chiefly Neural Networks and Ensemble algorithms, which were prominent, alongside some novel approaches. This information is then processed into a framework for cloud-based and client-based phishing website detection, alongside suggestions for possible future research and experiments that could help progress the field.

    RiskLogitboost Regression for Rare Events in Binary Response: An Econometric Approach

    A boosting-based machine learning algorithm is presented to model a binary response with large imbalance, i.e., a rare event. The new method (i) reduces the prediction error of the rare class, and (ii) approximates an econometric model that allows interpretability. RiskLogitboost regression includes a weighting mechanism that oversamples or undersamples observations according to their misclassification likelihood, and a generalized least squares bias correction strategy to reduce the prediction error. An illustration using a real French third-party liability motor insurance data set is presented. The results show that RiskLogitboost regression improves the rate of detection of rare events compared to some boosting-based and tree-based algorithms and some existing methods designed to treat imbalanced responses.