
    Generating Compact Tree Ensembles via Annealing

    Tree ensembles are flexible predictive models that can capture relevant variables and, to some extent, their interactions in a compact and interpretable manner. Most algorithms for obtaining tree ensembles are based on versions of boosting or Random Forest. Previous work showed that boosting algorithms exhibit a cyclic behavior, selecting the same tree again and again because of the way the loss is optimized. Random Forest, by contrast, is not based on loss optimization and yields a more complex and less interpretable model. In this paper we present a novel method for obtaining compact tree ensembles: grow a large pool of trees in parallel with many independent boosting threads, then select a small subset and update their leaf weights by loss optimization. We allow the trees in the initial pool to have different depths, which further helps with generalization. Experiments on real datasets show that the resulting model usually has a smaller loss than boosting, which is also reflected in a lower misclassification error on the test set. Comment: Comparison with Random Forest included in the results section

    Machine Learning Techniques for Stellar Light Curve Classification

    We apply machine learning techniques in an attempt to predict and classify stellar properties from noisy and sparse time series data. We preprocessed over 94 GB of Kepler light curves from MAST to classify according to ten distinct physical properties using both representation learning and feature engineering approaches. Studies using machine learning in the field have primarily been done on simulated data, making our study one of the first to use real light curve data for machine learning approaches. We tuned our data using previous work with simulated data as a template and achieved mixed results between the two approaches. Representation learning using a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) produced no successful predictions, but our work with feature engineering was successful for both classification and regression. In particular, we were able to estimate stellar density, stellar radius, and effective temperature with low error (~2-4%) and to classify the number of transits for a given star with good accuracy (~75%). The results show promise for improvement for both approaches upon using larger datasets with a larger minority class. This work has the potential to provide a foundation for future tools and techniques to aid in the analysis of astrophysical data. Comment: Accepted to The Astronomical Journal

    Tree Boosting Data Competitions with XGBoost

    The objective of this Master's thesis is to provide an understanding of how to approach a supervised learning prediction problem and to illustrate it using a statistical/machine learning algorithm, Tree Boosting. A review of tree methodology is given in order to trace its evolution, from Classification and Regression Trees, through Bagging and Random Forest, to present-day Tree Boosting. The methodology is explained following the XGBoost implementation, which achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is explained along with its core concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross validation and feature engineering. All these concepts are illustrated with a real dataset of videogame churn, used in a datathon competition
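    The k-fold cross validation mentioned in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the thesis: the helper names `k_fold_indices` and `cross_validate`, and the `fit`/`score` callables, are hypothetical, and the folds are contiguous (no shuffling or stratification).

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k contiguous folds.

    Hypothetical helper for illustration; not taken from the thesis or XGBoost.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def cross_validate(
    xs: Sequence,
    ys: Sequence,
    k: int,
    fit: Callable,    # assumed signature: fit(train_xs, train_ys) -> model
    score: Callable,  # assumed signature: score(model, val_xs, val_ys) -> float
) -> float:
    """Return the mean validation score over k folds."""
    scores = []
    for train, validation in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            score(model, [xs[i] for i in validation], [ys[i] for i in validation])
        )
    return sum(scores) / len(scores)
```

    Each sample is used for validation exactly once, so the averaged score is an estimate of out-of-sample performance that can guide hyperparameter tuning.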

    Massive Open Online Courses Temporal Profiling for Dropout Prediction

    Massive Open Online Courses (MOOCs) are attracting the attention of people all over the world. Regardless of the platform, the numbers of registrants for online courses are impressive, but at the same time completion rates are disappointing. Understanding the mechanisms of dropping out based on the learner profile is a crucial task in MOOCs, since it allows intervening at the right moment to assist the learner in completing the course. In this paper, the dropout behaviour of learners in a MOOC is thoroughly studied by first extracting features that describe the behaviour of learners within the course and then comparing three classifiers (Logistic Regression, Random Forest and AdaBoost) on two tasks: predicting which users will have dropped out by a certain week and predicting which users will drop out in a specific week. The former has proved to be considerably easier, with all three classifiers performing equally well. However, the accuracy for the second task is lower, and Logistic Regression tends to perform slightly better than the other two algorithms. We found that features that reflect an active attitude of the user towards the MOOC, such as submitting assignments, posting on the Forum and filling in their Profile, are strong indicators of persistence. Comment: 8 pages, ICTAI1