
    Model Selection in Data Analysis Competitions

    Abstract. The use of data analysis competitions for selecting the most appropriate model for a problem is a recent innovation in the field of predictive machine learning. Two of the most well-known examples of this trend were the Netflix Competition and, more recently, the competitions hosted on the online platform Kaggle. In this paper, we state and try to verify a set of qualitative hypotheses about predictive modelling, both in general and in the scope of data analysis competitions. To verify our hypotheses, we examine previous competitions and their outcomes, conduct qualitative interviews with top performers on Kaggle, and draw on personal experience from competing in Kaggle competitions. The stated hypotheses about feature engineering, ensembling, overfitting, model complexity and evaluation metrics give indications and guidelines on how to select a proper model for performing well in a competition on Kaggle.
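
The ensembling idea the abstract refers to can be sketched in a few lines: averaging predictions from models with complementary errors can reduce overall error. The numbers below are made up for illustration and are not from the paper.

```python
import math

def rmse(pred, truth):
    """Root mean squared error between predictions and ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

truth   = [1.0, 2.0, 3.0, 4.0]
model_a = [1.5, 2.5, 3.5, 4.5]   # hypothetical model biased high
model_b = [0.5, 1.5, 2.5, 3.5]   # hypothetical model biased low
blend   = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(rmse(model_a, truth))  # 0.5
print(rmse(model_b, truth))  # 0.5
print(rmse(blend, truth))    # 0.0 -- opposite biases cancel in the average
```

In practice the gain is smaller, since competitors' models are correlated; the benefit comes from whatever errors are uncorrelated.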

    Genetic variation in competition traits at different ages and time periods and correlations with traits at field tests of 4-year-old Swedish Warmblood horses

    For many years, the breeding value estimation for Swedish riding horses has been based solely on results from Riding Horse Quality Tests (RHQTs) of 4-year-olds. Traits tested are conformation, gaits and jumping ability. An integrated index including competition results is under development, both to obtain proofs that are as reliable as possible and to increase the credibility of the indexes among breeders, trainers and riders. The objectives of this study were to investigate the suitability of competition data for use in genetic evaluations of horses and to examine how well young-horse performance agrees with performance later in life. Competition results in dressage and show jumping for almost 40 000 horses from the beginning of the 1960s until 2006 were available. For the RHQT, data on 14 000 horses judged between 1988 and 2007 were used. Genetic parameters were estimated for accumulated competition results defined for different age groups (4 to 6 years of age, 4 to 9 years of age and lifetime) and for different birth-year groups. Genetic correlations between results at the RHQT and in competition were estimated with a multi-trait animal model. Heritabilities were higher for show jumping than for dressage and increased with the age of the horse and the amount of information. For dressage, heritabilities increased from 0.11 for the youngest group to 0.16 for lifetime results; for show jumping, the corresponding values increased from 0.24 to 0.28. Genetic correlations between competition results for the different age groups were highly positive (0.84 to 1.00), as were those between jumping traits at the RHQT and competition results in show jumping (0.87 to 0.89). For dressage-related traits as 4-year-olds and dressage competition results, the estimated genetic correlations were between 0.47 and 0.77. We suggest that lifetime results from competitions should be integrated into the genetic evaluation system. However, the genetic parameters showed that the traits had changed over the more than 35-year period covered, owing to the development of the sport, which needs to be considered in future genetic evaluations.
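
For readers unfamiliar with the heritabilities quoted above: narrow-sense heritability is the ratio of additive genetic variance to total phenotypic variance. A minimal sketch with hypothetical variance components (the paper itself estimates these via a multi-trait animal model, not this simple ratio):

```python
def heritability(var_additive, var_residual):
    """Narrow-sense heritability h^2 = additive genetic variance /
    total phenotypic variance (additive + residual)."""
    return var_additive / (var_additive + var_residual)

# Hypothetical variance components for illustration only.
print(heritability(0.24, 1.76))  # 0.12 -- comparable to the dressage estimates
```

A heritability of 0.11 to 0.28, as reported, means most phenotypic variation in competition results is non-genetic, which is why accumulating lifetime records improves the estimates.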

    Tree Boosting Data Competitions with XGBoost

    The objective of this Master's thesis is to provide an understanding of how to approach a supervised predictive learning problem and to illustrate it with a statistical/machine-learning algorithm, tree boosting. A review of tree methodology is presented in order to trace its evolution, from Classification and Regression Trees, through bagging and random forests, to present-day tree boosting. The methodology is explained following the XGBoost implementation, which has achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is explained with its core concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross-validation and feature engineering. All these concepts are illustrated with a real dataset of videogame churn used in a datathon competition.
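
One of the concepts listed above, k-fold cross-validation, can be sketched without any library: partition the instance indices into k folds, and use each fold once as the validation set while training on the rest. This is a generic illustration, not the thesis's own code.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds and yield
    (train_indices, valid_indices) pairs, one per fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx, start = list(range(n)), 0
    for size in fold_sizes:
        valid = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, valid
        start += size

for train, valid in k_fold_indices(6, 3):
    print(valid)  # [0, 1] then [2, 3] then [4, 5]
```

Averaging the validation score over the k folds gives the estimate of generalization error used to tune hyperparameters such as XGBoost's learning rate or tree depth.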

    Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction

    This paper reports the results of the NN3 competition, a replication of the M3 competition extended towards neural network (NN) and computational intelligence (CI) methods, in order to assess what progress has been made in the 10 years since the M3 competition. Two masked subsets of the M3 monthly industry data, containing 111 and 11 empirical time series respectively, were chosen, controlling for multiple data conditions of time-series length (short/long), data patterns (seasonal/non-seasonal) and forecasting horizons (short/medium/long). Relative forecasting accuracy was assessed using the metrics from the M3, together with later extensions of scaled measures, and non-parametric statistical tests. The NN3 competition attracted 59 submissions from NN, CI and statistics, making it the largest CI competition on time-series data. Its main findings include: (a) only one NN outperformed the damped trend method on the sMAPE, but more contenders outperformed the AutomatANN of the M3; (b) ensembles of CI approaches performed very well, better than combinations of statistical methods; (c) a novel, complex statistical method outperformed all statistical and CI benchmarks; and (d) for the most difficult subset of short and seasonal series, a methodology employing echo state neural networks outperformed all others. The NN3 results highlight the ability of NNs to handle complex data, including short and seasonal time series, beyond prior expectations, and thus identify multiple avenues for future research. (C) 2011 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
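
The sMAPE metric mentioned in finding (a) is the symmetric mean absolute percentage error used in the M-competitions: the absolute error scaled by the mean of the absolute actual and forecast values, averaged and expressed as a percentage. A minimal sketch with made-up numbers:

```python
def smape(actual, forecast):
    """Symmetric MAPE, in percent: mean of |a - f| / ((|a| + |f|) / 2)."""
    terms = [2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)]
    return 100 * sum(terms) / len(terms)

print(smape([100, 200], [110, 180]))  # roughly 10% average symmetric error
print(smape([100, 200], [100, 200]))  # 0.0 for a perfect forecast
```

Because the denominator depends on both series, sMAPE is scale-free across time series, which is why the M3 and NN3 competitions could average it over hundreds of heterogeneous series.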

    ASlib: A Benchmark Library for Algorithm Selection

    The task of algorithm selection involves choosing an algorithm from a set of algorithms on a per-instance basis in order to exploit the varying performance of algorithms over a set of instances. The algorithm selection problem is attracting increasing attention from researchers and practitioners in AI. Years of fruitful applications in a number of domains have resulted in a large amount of data, but the community lacks a standard format or repository for this data. This situation makes it difficult to share and compare different approaches effectively, as is done in other, more established fields. It also unnecessarily hinders new researchers who want to work in this area. To address this problem, we introduce a standardized format for representing algorithm selection scenarios and a repository that contains a growing number of data sets from the literature. Our format has been designed to be able to express a wide variety of different scenarios. Demonstrating the breadth and power of our platform, we describe a set of example experiments that build and evaluate algorithm selection models through a common interface. The results display the potential of algorithm selection to achieve significant performance improvements across a broad range of problems and algorithms. (Comment: accepted for publication in the Artificial Intelligence Journal.)
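
The gain that per-instance algorithm selection can offer is usually framed as the gap between the single best solver overall and an oracle ("virtual best solver") that picks the fastest solver for each instance. A minimal sketch with hypothetical runtimes, not data from the ASlib repository:

```python
# Hypothetical runtimes (seconds) of three solvers on three instances.
runtimes = {
    "inst1": {"A": 1.0, "B": 5.0, "C": 2.0},
    "inst2": {"A": 9.0, "B": 0.5, "C": 4.0},
    "inst3": {"A": 3.0, "B": 8.0, "C": 1.0},
}

def single_best(runtimes):
    """Solver with the lowest total runtime over all instances."""
    algos = next(iter(runtimes.values())).keys()
    return min(algos, key=lambda a: sum(r[a] for r in runtimes.values()))

def oracle_time(runtimes):
    """Total runtime if a perfect selector chose the best solver per instance."""
    return sum(min(r.values()) for r in runtimes.values())

best = single_best(runtimes)
print(best, sum(r[best] for r in runtimes.values()))  # C 7.0
print(oracle_time(runtimes))                          # 2.5
```

A learned selector sits between these two bounds; ASlib scenarios provide exactly this kind of per-instance performance data, plus instance features, so that selectors can be trained and compared on it.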