Modeling Binary Time Series Using Gaussian Processes with Application to Predicting Sleep States
Motivated by the problem of predicting sleep states, we develop a mixed
effects model for binary time series with a stochastic component represented by
a Gaussian process. The fixed component captures the effects of covariates on
the binary-valued response. The Gaussian process captures the residual
variations in the binary response that are not explained by covariates and past
realizations. We develop a frequentist modeling framework that provides
efficient inference and more accurate predictions. Results demonstrate
improved prediction rates over existing approaches such as logistic
regression, generalized additive mixed models, models for ordinal data,
gradient boosting, decision trees, and random forests. Using our proposed model,
we show that previous sleep state and heart rates are significant predictors
for future sleep states. Simulation studies also show that our proposed method
is promising and robust. To handle computational complexity, we utilize Laplace
approximation, golden section search and successive parabolic interpolation.
With this paper, we also submit an R-package (HIBITS) that implements the
proposed procedure. Comment: Journal of Classification (2018)
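The abstract mentions golden section search as one of the tools used to keep inference tractable. A minimal sketch of that classical one-dimensional optimizer (a generic illustration, not the HIBITS implementation) looks like this:

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] by golden section search.

    Generic sketch of the classical algorithm named in the abstract;
    the HIBITS package's actual routine may differ.
    """
    inv_phi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            # Minimum lies in [a, d]; reuse c as the new right probe.
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            # Minimum lies in [c, b]; reuse d as the new left probe.
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2

# Example: minimize (x - 2)^2 on [0, 5]
x_min = golden_section_search(lambda x: (x - 2) ** 2, 0, 5)
```

Because each iteration reuses one of the two interior probe points, the method needs only one new function evaluation per step, which matters when each evaluation involves a Laplace-approximated likelihood.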
Tree Boosting Data Competitions with XGBoost
The objective of this Master's thesis is to explain how to approach a supervised predictive-learning problem and to illustrate the process with a statistical/machine-learning algorithm, tree boosting. A review of tree methodology traces its evolution from Classification and Regression Trees through bagging and random forests to modern tree boosting. The methodology is explained following the XGBoost implementation, which has achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is presented with its core concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross-validation, and feature engineering. All of these concepts are illustrated with a real dataset of videogame churn used in a datathon competition.
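The k-fold cross-validation scheme central to the thesis's tuning framework can be sketched in plain Python (a generic illustration of the concept, not the thesis code; `train_fn` and `score_fn` are hypothetical callables standing in for any model and metric):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint, roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(train_fn, score_fn, data, k=5):
    """Average validation score over k folds.

    train_fn(train_rows) -> model; score_fn(model, valid_rows) -> float.
    Each fold serves once as the held-out validation set.
    """
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):
        valid = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(score_fn(train_fn(train), valid))
    return sum(scores) / k
```

In a tuning loop, `cross_validate` would be called once per candidate hyperparameter setting, and the setting with the best average score kept; the held-out folds give an estimate of generalization error that guards against overfitting.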
RandomBoost: Simplified Multi-class Boosting through Randomization
We propose a novel boosting approach to multi-class classification problems,
in which multiple classes are distinguished by a set of random projection
matrices in essence. The approach uses random projections to alleviate the
proliferation of binary classifiers typically required to perform multi-class
classification. The result is a multi-class classifier with a single
vector-valued parameter, irrespective of the number of classes involved. Two
variants of this approach are proposed. The first method randomly projects the
original data into new spaces, while the second method randomly projects the
outputs of learned weak classifiers. These methods are not only conceptually
simple but also effective and easy to implement. A series of experiments on
synthetic, machine learning and visual recognition data sets demonstrate that
our proposed methods compare favorably to existing multi-class boosting
algorithms in terms of both the convergence rate and classification accuracy. Comment: 15 pages
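The data-projection step of the first variant can be sketched with a random Gaussian matrix (a generic illustration of random projection; the paper's exact construction of its projection matrices may differ):

```python
import numpy as np

def random_projection(X, out_dim, seed=0):
    """Project the rows of X into out_dim dimensions with a random
    Gaussian matrix. Sketch of the generic technique; RandomBoost's
    actual projection matrices may be constructed differently.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Scaling by 1/sqrt(out_dim) approximately preserves pairwise
    # distances (Johnson-Lindenstrauss flavor of random projection).
    P = rng.standard_normal((d, out_dim)) / np.sqrt(out_dim)
    return X @ P
```

The appeal for multi-class boosting is that the projection is data-independent and cheap to draw, so many candidate projections can be generated without training extra binary classifiers.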
Boosting for high-dimensional linear models
We prove that boosting with the squared error loss, L2Boosting, is
consistent for very high-dimensional linear models, where the number of
predictor variables is allowed to grow essentially as fast as exp(sample
size), assuming that the true underlying regression function is sparse in
terms of the ℓ1-norm of the regression coefficients. In the language of
signal processing, this means consistency for de-noising using a strongly
overcomplete dictionary if the underlying signal is sparse in terms of the
ℓ1-norm. We also propose here an AIC-based method for tuning,
namely for choosing the number of boosting iterations. This makes L2Boosting
computationally attractive since it is not required to run the algorithm
multiple times for cross-validation as commonly used so far. We demonstrate
L2Boosting for simulated data, in particular where the predictor dimension
is large in comparison to sample size, and for a difficult tumor-classification
problem with gene expression microarray data. Comment: Published at
http://dx.doi.org/10.1214/009053606000000092 in the Annals of Statistics
(http://www.imstat.org/aos/) by the Institute of Mathematical Statistics
(http://www.imstat.org)
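Componentwise boosting with the squared error loss can be sketched in a few lines of NumPy (a generic illustration under the assumption of standardized predictors; the stopping rule here is a fixed iteration count rather than the paper's information-criterion-based tuning):

```python
import numpy as np

def l2_boost(X, y, n_iter=50, nu=0.1):
    """Componentwise squared-error-loss boosting for linear models.

    At each step, fit the current residuals by simple least squares on
    the single best predictor and take a shrunken step of size nu.
    Generic sketch, not the paper's implementation; columns of X are
    assumed roughly standardized, and stopping is a fixed iteration
    count instead of the AIC-style criterion the paper proposes.
    """
    n, p = X.shape
    beta = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    for _ in range(n_iter):
        # Least-squares coefficient of each predictor vs. the residuals.
        coefs = X.T @ resid / (X ** 2).sum(axis=0)
        # Pick the predictor whose one-variable fit leaves the smallest
        # residual sum of squares.
        sse = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
        j = int(np.argmin(sse))
        beta[j] += nu * coefs[j]
        resid -= nu * coefs[j] * X[:, j]
    return intercept, beta
```

Because each iteration updates only one coordinate, the fitted coefficient vector stays sparse for early stopping times, which is why the procedure remains sensible even when the number of predictors far exceeds the sample size.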
Predicting time to graduation at a large enrollment American university
The time it takes a student to graduate with a university degree is shaped
by a variety of factors, such as their background, their academic performance
at university, and their integration into the social communities of the
university they attend. Different universities have different populations,
student services, instruction styles, and degree programs; however, they all collect
institutional data. This study presents data for 160,933 students attending a
large American research university. The data includes performance, enrollment,
demographics, and preparation features. Discrete time hazard models for the
time-to-graduation are presented in the context of Tinto's Theory of Drop Out.
Additionally, a novel machine learning method, gradient boosted trees, is
applied and compared to the typical maximum likelihood method. We demonstrate
that enrollment factors (such as changing a major) lead to greater increases in
model predictive performance of when a student graduates than performance
factors (such as grades) or preparation factors (such as high school GPA). Comment: 28 pages, 11 figures
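Discrete-time hazard models are typically fit on data expanded into "person-period" form: one row per student per enrolled term, with a binary event indicator for the graduation term. A minimal sketch of that expansion (field names here are illustrative, not the study's actual schema):

```python
def person_period_rows(students):
    """Expand student records into person-period rows for a
    discrete-time hazard model.

    Each input record is (terms_enrolled, graduated, covariates);
    each output row is (term, covariates, event), where event = 1
    only in the term the student graduated. Illustrative schema,
    not the study's institutional data format.
    """
    rows = []
    for terms, graduated, cov in students:
        for t in range(1, terms + 1):
            event = 1 if (graduated and t == terms) else 0
            rows.append((t, cov, event))
    return rows
```

Once the data are in this form, the hazard of graduating in term t can be estimated with an ordinary binary classifier on the rows, which is what makes both maximum likelihood logistic models and gradient boosted trees directly comparable on the same expanded dataset.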