
    Modeling Binary Time Series Using Gaussian Processes with Application to Predicting Sleep States

    Motivated by the problem of predicting sleep states, we develop a mixed effects model for binary time series with a stochastic component represented by a Gaussian process. The fixed component captures the effects of covariates on the binary-valued response, while the Gaussian process captures the residual variation in the binary response that is not explained by covariates and past realizations. We develop a frequentist modeling framework that provides efficient inference and accurate predictions. Results demonstrate improved prediction rates over existing approaches such as logistic regression, generalized additive mixed models, models for ordinal data, gradient boosting, decision trees, and random forests. Using the proposed model, we show that previous sleep state and heart rate are significant predictors of future sleep states. Simulation studies further show that the proposed method is promising and robust. To handle the computational complexity, we utilize a Laplace approximation, golden section search, and successive parabolic interpolation. With this paper, we also submit an R package (HIBITS) that implements the proposed procedure.

    Comment: Journal of Classification (2018)
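    The abstract names golden section search among the numerical tools used to keep inference tractable. As a rough illustration (not the HIBITS implementation), the sketch below minimizes a placeholder one-dimensional objective standing in for a negative Laplace-approximate likelihood over a single hyperparameter; the objective `nll` is an assumption for demonstration only.

```python
import numpy as np

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] via golden section search."""
    invphi = (np.sqrt(5) - 1) / 2            # 1/phi, about 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            # minimum lies in [a, d]: shrink from the right
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            # minimum lies in [c, b]: shrink from the left
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# Hypothetical use: tune a GP length-scale by minimizing a placeholder
# objective (a stand-in for the Laplace-approximate negative likelihood).
nll = lambda ell: (ell - 1.3) ** 2 + 0.5
print(golden_section_search(nll, 0.01, 10.0))  # ~1.3
```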

    Tree Boosting Data Competitions with XGBoost

    The objective of this Master's thesis is to explain how to approach a supervised predictive learning problem and to illustrate the approach with a statistical/machine learning algorithm, Tree Boosting. A review of tree methodology is given to trace its evolution, from Classification and Regression Trees, through Bagging and Random Forests, to present-day Tree Boosting. The methodology is explained following the XGBoost implementation, which has achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is presented along with its key concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross-validation, and feature engineering. All of these concepts are illustrated on a real dataset of videogame churn used in a datathon competition.
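    A minimal sketch of the workflow the thesis describes, using the public xgboost and scikit-learn APIs: a regularized XGBoost classifier scored with 5-fold cross-validation. The synthetic data stands in for the (non-public) videogame-churn dataset, and the hyperparameter values are illustrative assumptions, not the thesis's tuned settings.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score, KFold

# Toy stand-in for the videogame-churn dataset used in the thesis.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Regularized tree booster; reg_lambda / reg_alpha are the L2 / L1
# penalty terms in XGBoost's regularized objective.
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,
    reg_alpha=0.0,
    subsample=0.8,
)

# k-fold cross-validation guards against overfitting when tuning.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"5-fold AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```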

    RandomBoost: Simplified Multi-class Boosting through Randomization

    We propose a novel boosting approach to multi-class classification problems in which, in essence, multiple classes are distinguished by a set of random projection matrices. The approach uses random projections to alleviate the proliferation of binary classifiers typically required to perform multi-class classification. The result is a multi-class classifier with a single vector-valued parameter, irrespective of the number of classes involved. Two variants of this approach are proposed: the first randomly projects the original data into new spaces, while the second randomly projects the outputs of learned weak classifiers. These methods are not only conceptually simple but also effective and easy to implement. A series of experiments on synthetic, machine learning, and visual recognition data sets demonstrates that the proposed methods compare favorably with existing multi-class boosting algorithms in terms of both convergence rate and classification accuracy.

    Comment: 15 pages
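    To make the projection idea concrete, here is a loose sketch (not the authors' RandomBoost algorithm) of the two ingredients the abstract describes: randomly projecting the input data, and randomly projecting weak-learner outputs. The stump scores in `H` are hypothetical stand-ins for learned weak classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 500, 20, 8              # samples, input features, projection dim

X = rng.normal(size=(n, d))

# Variant 1 (sketch): randomly project the original data into a new
# space; a boosted classifier can then learn a single vector-valued
# parameter in R^p regardless of the number of classes.
P = rng.normal(size=(d, p)) / np.sqrt(p)   # random projection matrix
Z = X @ P                                  # projected data, shape (n, p)

# Variant 2 (sketch): randomly project the outputs of weak learners.
# Decision-stump signs here are hypothetical stand-ins for the learned
# weak classifiers used in the paper.
H = np.sign(X[:, :3])                      # weak-learner outputs, (n, 3)
Q = rng.normal(size=(3, p)) / np.sqrt(p)
Zh = H @ Q                                 # projected weak-learner responses
print(Z.shape, Zh.shape)                   # (500, 8) (500, 8)
```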

    Boosting for high-dimensional linear models

    We prove that boosting with the squared error loss, $L_2$Boosting, is consistent for very high-dimensional linear models, where the number of predictor variables is allowed to grow essentially as fast as $O(\exp(\text{sample size}))$, assuming that the true underlying regression function is sparse in terms of the $\ell_1$-norm of the regression coefficients. In the language of signal processing, this means consistency for de-noising using a strongly overcomplete dictionary if the underlying signal is sparse in terms of the $\ell_1$-norm. We also propose here an AIC-based method for tuning, namely for choosing the number of boosting iterations. This makes $L_2$Boosting computationally attractive, since the algorithm need not be run multiple times for cross-validation as commonly used so far. We demonstrate $L_2$Boosting for simulated data, in particular where the predictor dimension is large in comparison to sample size, and for a difficult tumor-classification problem with gene expression microarray data.

    Comment: Published at http://dx.doi.org/10.1214/009053606000000092 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
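    A minimal sketch of componentwise $L_2$Boosting on a sparse high-dimensional linear model: each iteration refits the current residuals on the single best predictor and takes a small shrunken step. The paper's AIC-based choice of the iteration count is omitted here; a fixed budget is used instead, and all constants are illustrative assumptions.

```python
import numpy as np

def l2_boost(X, y, n_iter=200, nu=0.1):
    """Componentwise L2-boosting: at each step, fit the single predictor
    that best reduces the residual sum of squares, shrunken by nu.
    (The paper selects n_iter via AIC; a fixed budget is used here.)"""
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    for _ in range(n_iter):
        # per-column least-squares slope against the current residuals
        b = X.T @ resid / (X ** 2).sum(axis=0)
        sse = ((resid[:, None] - X * b) ** 2).sum(axis=0)
        j = np.argmin(sse)                 # best single predictor
        coef[j] += nu * b[j]               # shrunken update
        resid -= nu * b[j] * X[:, j]
    return intercept, coef

# p >> n with an l1-sparse truth, as in the consistency result.
rng = np.random.default_rng(1)
n, p = 100, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.normal(size=n)
b0, bhat = l2_boost(X, y)
print(np.round(bhat[:5], 2))               # mass concentrates on the truth
```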

    Predicting time to graduation at a large enrollment American university

    The time it takes a student to graduate with a university degree is affected by a variety of factors, such as their background, their academic performance at university, and their integration into the social communities of the university they attend. Different universities have different populations, student services, instruction styles, and degree programs; however, they all collect institutional data. This study presents data for 160,933 students attending a large American research university. The data include performance, enrollment, demographic, and preparation features. Discrete-time hazard models for time-to-graduation are presented in the context of Tinto's Theory of Drop Out. Additionally, a novel machine learning method, gradient boosted trees, is applied and compared with the typical maximum likelihood method. We demonstrate that enrollment factors (such as changing a major) lead to greater increases in model predictive performance for when a student graduates than performance factors (such as grades) or preparation factors (such as high school GPA).

    Comment: 28 pages, 11 figures
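    Discrete-time hazard models can be fit as binary classification on a person-period table, which also makes the gradient-boosted-trees comparison straightforward. The sketch below simulates such a table (the features, coefficients, and sample sizes are invented, not the study's data) and fits both a maximum-likelihood logistic hazard model and gradient boosted trees with scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Person-period format: one row per student per term, with the target
# indicating whether the student graduated in that term. Students drop
# out of the table once the event occurs (others are censored).
rng = np.random.default_rng(0)
rows = []
for student in range(2000):
    gpa = rng.normal(3.0, 0.5)             # hypothetical performance feature
    changed_major = rng.integers(0, 2)     # hypothetical enrollment feature
    for term in range(1, 13):              # up to 12 observed terms
        logit = -6 + 0.4 * term + 0.5 * gpa - 0.7 * changed_major
        hazard = 1 / (1 + np.exp(-logit))  # per-term graduation probability
        grad = rng.random() < hazard
        rows.append([term, gpa, changed_major, int(grad)])
        if grad:
            break

data = np.array(rows)
X, y = data[:, :3], data[:, 3]

# Maximum-likelihood discrete-time hazard (a full model would use term
# dummies; a linear term effect is used here for brevity) versus
# gradient boosted trees on the same person-period rows.
logit_model = LogisticRegression(max_iter=1000).fit(X, y)
gbt_model = GradientBoostingClassifier().fit(X, y)
print(logit_model.score(X, y), gbt_model.score(X, y))
```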