34 research outputs found

    Filtering participants improves generalization in competitions and benchmarks

    We address the problem of selecting a winning algorithm in a challenge or benchmark. While evaluations of algorithms carried out by third-party organizers eliminate the inventor-evaluator bias, little attention has been paid to the risk of organizers over-fitting the selection of the winner. In this paper, we carry out an empirical evaluation using the results of several challenges and benchmarks, evidencing this phenomenon. We show that a heuristic commonly used by organizers, which consists of pre-filtering participants using a trial run, reduces over-fitting. We formalize this method and derive a semi-empirical formula to determine the optimal number k of top participants to retain from the trial run.
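    The pre-filtering heuristic described in this abstract can be illustrated with a minimal sketch. It assumes higher scores are better and treats k as a free parameter; the paper's semi-empirical formula for k is not given in the abstract and is not reproduced here. The function name `select_winner` and the example scores are hypothetical.

```python
def select_winner(trial_scores, final_scores, k):
    """Keep the k best participants on the trial run, then pick the
    best of those on the final run. Higher scores are assumed better."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Shortlist: top k by trial score, ties broken by name for determinism.
    shortlist = sorted(trial_scores, key=lambda p: (-trial_scores[p], p))[:k]
    # Winner: best final score among the shortlisted participants only.
    winner = min(shortlist, key=lambda p: (-final_scores[p], p))
    return winner, shortlist


# Hypothetical scores, for illustration only.
trial = {"a": 0.90, "b": 0.85, "c": 0.60, "d": 0.55}
final = {"a": 0.80, "b": 0.83, "c": 0.70, "d": 0.95}

print(select_winner(trial, final, k=2))  # ('b', ['a', 'b'])
print(select_winner(trial, final, k=4))  # ('d', ['a', 'b', 'c', 'd'])
```

    With k=4 (no filtering), participant "d" wins on a single high final score despite a weak trial run; with k=2 it is filtered out before the final comparison.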

    Judging competitions and benchmarks: a candidate election approach

    Machine learning progress relies on algorithm benchmarks. We study the problem of declaring a winner, or ranking "candidate" algorithms, based on results obtained by "judges" (scores on various tasks). Inspired by social science and game theory on fair elections, we compare various ranking functions, ranging from simple score averaging to Condorcet methods. We devise novel empirical criteria to assess the quality of ranking functions, including the generalization to new tasks and the stability under judge or candidate perturbation. We conduct an empirical comparison on the results of 5 competitions and benchmarks (one artificially generated). While prior theoretical analyses indicate that no single ranking function satisfies all desired properties, our empirical study reveals that the classical "average rank" method fares well. However, some pairwise comparison methods can achieve better empirical results.
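    To make the contrast between "average rank" and pairwise comparison methods concrete, here is a minimal sketch. It uses Copeland scoring as one example of a Condorcet-consistent pairwise method; the abstract does not say which pairwise methods the paper evaluates, so that choice, the function names, and the score matrix are all assumptions made for illustration.

```python
from itertools import combinations


def ranks_for_judge(scores):
    """Rank candidates for one judge (1 = best); tied candidates share the mean rank."""
    ordered = sorted(scores, key=lambda c: -scores[c])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for c in ordered[i:j + 1]:
            ranks[c] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def average_rank(matrix):
    """Mean rank of each candidate across judges (lower is better)."""
    per_judge = [ranks_for_judge(s) for s in matrix.values()]
    candidates = list(per_judge[0])
    return {c: sum(r[c] for r in per_judge) / len(per_judge) for c in candidates}


def copeland(matrix):
    """Copeland score: 1 per pairwise majority win, 0.5 per tie (higher is better)."""
    candidates = list(next(iter(matrix.values())))
    wins = {c: 0.0 for c in candidates}
    for a, b in combinations(candidates, 2):
        a_pref = sum(s[a] > s[b] for s in matrix.values())
        b_pref = sum(s[b] > s[a] for s in matrix.values())
        if a_pref == b_pref:
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[a if a_pref > b_pref else b] += 1.0
    return wins


# Hypothetical judge-by-candidate score matrix (judges are tasks).
example = {
    "task1": {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6},
    "task2": {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6},
    "task3": {"A": 0.1, "B": 0.9, "C": 0.8, "D": 0.7},
}
avg = average_rank(example)
cop = copeland(example)
print(min(avg, key=avg.get))  # B
print(max(cop, key=cop.get))  # A
```

    In this toy matrix the two ranking functions disagree: "A" beats every other candidate on a majority of tasks, so it wins under Copeland, while "B" has the better mean rank because "A" comes last on one task.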

    CodaLab Competitions: An open source platform to organize scientific challenges

    CodaLab Competitions is an open source web platform designed to help data scientists and research teams to crowd-source the resolution of machine learning problems through the organization of competitions, also called challenges or contests. CodaLab Competitions provides useful features such as multiple phases, results and code submissions, multi-score leaderboards, and jobs running inside Docker containers. The platform is very flexible and can handle large scale experiments, by allowing organizers to upload large datasets and provide their own CPU or GPU compute workers

    Towards Automated Deep Learning: Analysis of the AutoDL challenge series 2019

    We present the design and results of recent competitions in Automated Deep Learning (AutoDL). In the AutoDL challenge series 2019, we organized 5 machine learning challenges: AutoCV, AutoCV2, AutoNLP, AutoSpeech and AutoDL. The first 4 challenges each concern a specific application domain, such as computer vision, natural language processing and speech recognition. As of March 2020, the last challenge, AutoDL, is still ongoing and we only present its design; its results will be presented in future work, together with a detailed introduction of the winning solutions of each challenge. Some highlights of this work include: (1) a benchmark suite of baseline AutoML solutions, with emphasis on domains for which Deep Learning methods have had prior success (image, video, text, speech, etc.); (2) a novel "anytime learning" framework, which opens doors for further theoretical consideration; (3) a repository of around 100 datasets (from all the above domains), over half of which are released as public datasets to enable research on meta-learning; (4) analyses revealing that winning solutions generalize to new unseen datasets, validating progress towards universal AutoML.

    Aircraft Numerical "Twin": A Time Series Regression Competition

    This paper presents the design and analysis of a data science competition on a problem of time series regression from aeronautics data. For the purpose of performing predictive maintenance, aviation companies seek to create aircraft "numerical twins", which are programs capable of accurately predicting strains at strategic positions in various body parts of the aircraft. Given a number of input parameters (sensor data) recorded in sequence during the flight, the competition participants had to predict output values (gauges), also recorded sequentially during test flights, but not recorded during regular flights. The competition data included hundreds of complete flights. It was a code submission competition with complete blind testing of algorithms. The results indicate that such a problem can be effectively solved with gradient boosted trees, after preprocessing and feature engineering. Deep learning methods did not prove as efficient.
