90 research outputs found

    Vote-boosting ensembles

    Full text link
    Vote-boosting is a sequential ensemble learning method in which the individual classifiers are built on different weighted versions of the training data. To build a new classifier, the weight of each training instance is determined in terms of the degree of disagreement among the current ensemble predictions for that instance. For low class-label noise levels, especially when simple base learners are used, emphasis should be made on instances for which the disagreement rate is high. When more flexible classifiers are used and as the noise level increases, the emphasis on these uncertain instances should be reduced. In fact, at sufficiently high levels of class-label noise, the focus should be on instances on which the ensemble classifiers agree. The optimal type of emphasis can be automatically determined using cross-validation. An extensive empirical analysis using the beta distribution as emphasis function illustrates that vote-boosting is an effective method to generate ensembles that are both accurate and robust

    Bounding Optimality Gap in Stochastic Optimization via Bagging: Statistical Efficiency and Stability

    Full text link
    We study a statistical method to estimate the optimal value, and the optimality gap of a given solution for stochastic optimization as an assessment of the solution quality. Our approach is based on bootstrap aggregating, or bagging, resampled sample average approximation (SAA). We show how this approach leads to valid statistical confidence bounds for non-smooth optimization. We also demonstrate its statistical efficiency and stability that are especially desirable in limited-data situations, and compare these properties with some existing methods. We present our theory that views SAA as a kernel in an infinite-order symmetric statistic, which can be approximated via bagging. We substantiate our theoretical findings with numerical results

    Advancing Inference in Supervised Learning Procedures via Permutation Tests and Importance Sampling, with Applications to Environmental Science

    Get PDF
    Random forests, since being proposed by Breiman (2001), have become popular supervised regression and classification techniques. Their popularity stems from being easy to implement - the default hyper-parameter settings are often not far from optimal and are often competitive with more involved supervised models. While random forests are complex, they are not completely impenetrable to theoretical analysis. In this thesis, we present several contributions to random forest methodology. First, we provide a motivating application of random forests to ornithological data, where we develop a novel hypothesis test for testing equality of distribution of random forest curves. Then, we refine an observation made during that application into a means of testing hypotheses about the validation error of random forests, allowing for computationally efficient tests that are analogous to the F-test for linear regression. Finally, we propose a means of accounting for a discrepancy in test and training distributions, motivated by the problem of forecasting power outages from hurricanes

    Encrypted statistical machine learning: new privacy preserving methods

    Full text link
    We present two new statistical machine learning methods designed to learn on fully homomorphic encrypted (FHE) data. The introduction of FHE schemes following Gentry (2009) opens up the prospect of privacy preserving statistical machine learning analysis and modelling of encrypted data without compromising security constraints. We propose tailored algorithms for applying extremely random forests, involving a new cryptographic stochastic fraction estimator, and na\"{i}ve Bayes, involving a semi-parametric model for the class decision boundary, and show how they can be used to learn and predict from encrypted data. We demonstrate that these techniques perform competitively on a variety of classification data sets and provide detailed information about the computational practicalities of these and other FHE methods.Comment: 39 page
    • …
    corecore