Generating Compact Tree Ensembles via Annealing
Tree ensembles are flexible predictive models that can capture relevant
variables and to some extent their interactions in a compact and interpretable
manner. Most algorithms for obtaining tree ensembles are based on versions of
boosting or Random Forest. Previous work showed that boosting algorithms
exhibit a cyclic behavior of selecting the same tree again and again due to the
way the loss is optimized. At the same time, Random Forest is not based on loss
optimization and obtains a more complex and less interpretable model. In this
paper we present a novel method for obtaining compact tree ensembles by growing
a large pool of trees in parallel with many independent boosting threads and
then selecting a small subset and updating their leaf weights by loss
optimization. We allow for the trees in the initial pool to have different
depths which further helps with generalization. Experiments on real datasets
show that the obtained model usually has a smaller loss than boosting, which is
also reflected in a lower misclassification error on the test set.
Comment: Comparison with Random Forest included in the results section
Machine Learning Techniques for Stellar Light Curve Classification
We apply machine learning techniques in an attempt to predict and classify
stellar properties from noisy and sparse time series data. We preprocessed over
94 GB of Kepler light curves from MAST to classify according to ten distinct
physical properties using both representation learning and feature engineering
approaches. Studies using machine learning in the field have been primarily
done on simulated data, making our study one of the first to use real light
curve data for machine learning approaches. We tuned our data using previous
work with simulated data as a template and achieved mixed results between the
two approaches. Representation learning using a Long Short-Term Memory (LSTM)
Recurrent Neural Network (RNN) produced no successful predictions, but our work
with feature engineering was successful for both classification and regression.
In particular, we were able to achieve values for stellar density, stellar
radius, and effective temperature with low error (~ 2 - 4%) and good accuracy
(~ 75%) for classifying the number of transits for a given star. The results
show promise for improvement for both approaches upon using larger datasets
with a larger minority class. This work has the potential to provide a
foundation for future tools and techniques to aid in the analysis of
astrophysical data.
Comment: Accepted to The Astronomical Journal
Tree Boosting Data Competitions with XGBoost
The objective of this Master's thesis is to explain how to approach a supervised-learning prediction problem and to illustrate the approach with a statistical/machine learning algorithm, tree boosting. The evolution of tree methodology is reviewed, from Classification and Regression Trees through Bagging and Random Forest to present-day tree boosting. The methodology is explained following the XGBoost implementation, which has achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is presented along with its core concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross validation and feature engineering. All these concepts are illustrated with a real dataset on video game churn that was used in a datathon competition.
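As a rough illustration of the k-fold cross validation concept named in this abstract, the following sketch splits sample indices into k contiguous folds using only the Python standard library. The function name `k_fold_indices` and the contiguous, unshuffled split are assumptions made for this example; they are not taken from the thesis or from the XGBoost API.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds.

    Hypothetical helper for illustration only; real pipelines usually
    shuffle (and often stratify) before splitting.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first `extra` folds get one additional sample each.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


# Each sample appears in exactly one validation fold.
folds = list(k_fold_indices(10, 3))
```

In a tuning loop, a model would be fitted on each `train_idx` and scored on the matching `val_idx`, and the scores averaged per hyperparameter setting.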
Massive Open Online Courses Temporal Profiling for Dropout Prediction
Massive Open Online Courses (MOOCs) are attracting the attention of people
all over the world. Regardless of the platform, the numbers of registrants for
online courses are impressive, but at the same time completion rates are
disappointing. Understanding the mechanisms of dropping out based on the
learner profile arises as a crucial task in MOOCs, since it will allow
intervening at the right moment in order to assist the learner in completing
the course. In this paper, the dropout behaviour of learners in a MOOC is
thoroughly studied by first extracting features that describe the behavior of
learners within the course and then by comparing three classifiers (Logistic
Regression, Random Forest and AdaBoost) in two tasks: predicting which users
will have dropped out by a certain week and predicting which users will drop
out on a specific week. The former proved to be considerably easier, with
all three classifiers performing equally well. However, the accuracy for the
second task is lower, and Logistic Regression tends to perform slightly better
than the other two algorithms. We found that features that reflect an active
attitude of the user towards the MOOC, such as submitting their assignment,
posting on the Forum and filling in their Profile, are strong indicators of
persistence.
Comment: 8 pages, ICTAI1
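The difference between the two prediction tasks in this abstract can be sketched as two labeling rules. This is a minimal sketch under stated assumptions: the abstract does not define dropout, so the example assumes a learner drops out in the week after their last recorded activity, and the names `dropout_week`, `labels_by_week`, and `labels_on_week` are hypothetical.

```python
def dropout_week(last_active_week, course_weeks):
    """Return the assumed dropout week, or None if the learner was
    still active in the final week of the course."""
    if last_active_week >= course_weeks:
        return None
    return last_active_week + 1


def labels_by_week(last_active, course_weeks, week):
    """Task 1: has the learner dropped out by the given week?"""
    labels = {}
    for learner, last in last_active.items():
        d = dropout_week(last, course_weeks)
        labels[learner] = d is not None and d <= week
    return labels


def labels_on_week(last_active, course_weeks, week):
    """Task 2: does the learner drop out exactly in the given week?"""
    labels = {}
    for learner, last in last_active.items():
        labels[learner] = dropout_week(last, course_weeks) == week
    return labels


# Made-up toy data: last active week per learner in a 6-week course.
last_active = {"a": 1, "b": 3, "c": 6}
by_week_4 = labels_by_week(last_active, 6, 4)
on_week_4 = labels_on_week(last_active, 6, 4)
```

Under this assumed definition the first task has cumulative positives, while the second has few positives per week, which is one plausible reason it would be the harder of the two.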