
    Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

    As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input or prior knowledge from the user. We also address the tendency for TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design.
    Comment: 8 pages, 5 figures, preprint to appear in GECCO 2016, edits not yet made from reviewer comment
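
The Pareto optimization mentioned in the abstract above can be illustrated with a minimal sketch. This is not TPOT's actual implementation. It assumes two objectives (maximize accuracy, minimize the number of pipeline operators), and the candidate names and scores are hypothetical values chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical candidate pipelines: (name, accuracy, number of operators).
# These values are made up for illustration; they are not results from the paper.
Candidate = Tuple[str, float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives
    (higher accuracy, fewer operators) and strictly better on at least one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


candidates = [
    ("A", 0.90, 5),
    ("B", 0.90, 2),  # same accuracy as A with fewer operators, so B dominates A
    ("C", 0.85, 1),
    ("D", 0.80, 3),  # dominated by B (less accurate and more complex)
    ("E", 0.92, 6),
]

front = pareto_front(candidates)
print([name for name, _, _ in front])
```

In this sketch, a pipeline stays on the front only if no other pipeline is at least as accurate and at least as compact, which is one way to favor compact pipelines without giving up accuracy.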

    An ADMM Based Framework for AutoML Pipeline Configuration

    We study the AutoML problem of automatically configuring machine learning pipelines by jointly selecting algorithms and their appropriate hyper-parameters for all steps in supervised learning pipelines. This black-box (gradient-free) optimization with mixed integer and continuous variables is a challenging problem. We propose a novel AutoML scheme by leveraging the alternating direction method of multipliers (ADMM). The proposed framework is able to (i) decompose the optimization problem into easier sub-problems that have a reduced number of variables and circumvent the challenge of mixed variable categories, and (ii) incorporate black-box constraints alongside the black-box optimization objective. We empirically evaluate the flexibility (in utilizing existing AutoML techniques), effectiveness (against open source AutoML toolkits), and unique capability (of executing AutoML with practically motivated black-box constraints) of our proposed scheme on a collection of binary classification data sets from the UCI ML and OpenML repositories. We observe that, on average, our framework provides significant gains in comparison to other AutoML frameworks (Auto-sklearn and TPOT), highlighting the practical advantages of this framework.

    Performance and Competitiveness of Tree-Based Pipeline Optimization Tool

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
    Automated machine learning (AutoML) is the process of automating the entire machine learning workflow when applied to real-world problems. AutoML can increase data science productivity while keeping the same performance and accuracy, allowing non-experts to use complex machine learning methods. Tree-based Pipeline Optimization Tool (TPOT) was one of the first AutoML methods created by data scientists and is targeted to optimize machine learning pipelines using genetic programming. While still under active development, TPOT is a very promising AutoML tool. This thesis aims to explore the algorithm and analyse its performance using real-world data. Results show that evolution-based optimization is at least as accurate as TPOT initialization. The effectiveness of genetic operators, however, depends on the nature of the test case.