
    Structure-Aware Dynamic Scheduler for Parallel Machine Learning

    Training large machine learning (ML) models with many variables or parameters can take a long time if one employs sequential procedures, even with stochastic updates. A natural solution is to turn to distributed computing on a cluster; however, naive, unstructured parallelization of ML algorithms does not usually lead to a proportional speedup and can even result in divergence, because dependencies between model elements can attenuate the computational gains from parallelization and compromise the correctness of inference. Recent efforts to address this issue have benefited from exploiting the static, a priori block structures residing in ML algorithms. In this paper, we take this path further by exploring the dynamic block structures and workloads that arise during ML program execution, which offer new opportunities for improving convergence, correctness, and load balancing in distributed ML. We propose and showcase a general-purpose scheduler, STRADS, for coordinating distributed updates in ML algorithms, which harnesses these opportunities in a systematic way. We provide theoretical guarantees for our scheduler and demonstrate its efficacy versus static block structures on Lasso and Matrix Factorization.
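    To make the dependency idea concrete, here is a minimal Python sketch, not STRADS itself, of how a dynamic scheduler might pick a block of nearly independent Lasso coordinates for safe parallel updates; the correlation test, threshold, and normalization assumption are all illustrative.

        import numpy as np

        def select_independent_block(X, candidates, max_block, corr_threshold=0.1):
            """Greedily build a block of coordinates whose feature columns are
            nearly uncorrelated, so their Lasso updates can run in parallel
            without interfering. Assumes the columns of X are normalized."""
            block = []
            for j in candidates:
                # Accept j only if it is weakly coupled to every coordinate
                # already chosen for this block.
                if all(abs(X[:, j] @ X[:, k]) < corr_threshold for k in block):
                    block.append(j)
                if len(block) == max_block:
                    break
            return block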

    Petuum: A New Platform for Distributed Machine Learning on Big Data

    What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial-scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithm design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, allowing ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters. (Comment: 15 pages, 10 figures; final version in KDD 2015 under the same title.)
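    As an illustration of bounded-error network synchronization, the sketch below implements the core rule of a stale synchronous clock: a worker may run ahead of the slowest worker by at most a fixed staleness. It is a toy model of the idea, not Petuum's parameter server.

        import threading

        class SSPClock:
            """Bounded-staleness synchronization: a worker at iteration c may
            proceed only while c - min(all clocks) <= staleness."""
            def __init__(self, n_workers, staleness):
                self.clocks = [0] * n_workers
                self.staleness = staleness
                self.cond = threading.Condition()

            def advance(self, worker_id):
                with self.cond:
                    self.clocks[worker_id] += 1
                    self.cond.notify_all()
                    # Block while this worker is too far ahead of the slowest.
                    while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                        self.cond.wait()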

    Integrating labor awareness to energy-efficient production scheduling under real-time electricity pricing: an empirical study

    With the penetration of the smart grid into factories, energy-efficient production scheduling has emerged as a promising method for industrial demand response. It shifts flexible production loads to lower-priced periods to reduce energy cost for the same production task. However, existing methods focus only on integrating energy awareness into conventional production scheduling models. They ignore labor cost, which is shift-based and follows the opposite trend to energy cost: for instance, energy cost is lower during the night while labor cost is higher. Therefore, this paper proposes a method for energy-efficient and labor-aware production scheduling at the unit process level. This integrated scheduling model is mathematically formulated. Besides the state-based energy model and genetic algorithm-based optimization, a continuous-time shift accumulation heuristic is proposed to synchronize power states and labor shifts. In a case study of a Belgian plastic bottle manufacturer, a set of empirical sensitivity analyses was performed to investigate the impact of energy and labor awareness, as well as the production-related factors that influence the economic performance of a schedule. Furthermore, the method was demonstrated on 9 large-scale test instances, which encompass cases where the energy cost is minor, moderate, and major compared to the joint energy and labor cost. The results show that ignoring labor, as existing energy-efficient production scheduling studies do, increases the joint energy and labor cost even when the energy cost itself is minimized. To achieve effective production cost reduction, energy and labor awareness should be jointly considered in production scheduling.
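    The opposing trends of energy and labor cost can be captured in a few lines; the sketch below (hourly granularity, a single unit process, illustrative names and units, not the paper's model) shows the joint objective whose energy term alone is what existing methods minimize.

        def joint_cost(run_hours, energy_price, labor_rate, power_kw):
            """Joint energy + labor cost of running a unit process during the
            given hours. energy_price[h] is EUR/kWh, labor_rate[h] is EUR/h;
            night hours typically have cheap energy but expensive labor."""
            energy = sum(power_kw * energy_price[h] for h in run_hours)
            labor = sum(labor_rate[h] for h in run_hours)
            return energy + labor

    Minimizing only the energy term pushes the schedule into night hours, where the labor term grows; the joint objective balances the two.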

    The Price of Anarchy for Minsum Related Machine Scheduling

    We address the classical uniformly related machine scheduling problem with the minsum objective. The problem is solvable in polynomial time by the algorithm of Horowitz and Sahni; in that solution, each machine sequences its jobs shortest first. However, when jobs may choose the machine on which they are processed, while keeping the same sequencing rule per machine, the resulting Nash equilibria are in general not optimal. The price of anarchy measures this optimality gap. By means of a new characterization of the optimal solution, we show that the price of anarchy in this setting is bounded from above by 2. We also give a lower bound of e/(e-1). This complements recent results on the price of anarchy for the more general unrelated machine scheduling problem, where the price of anarchy equals 4. Interestingly, since Nash equilibria coincide with shortest processing time first (SPT) schedules, the same bounds hold for SPT schedules. Thereby, our work also fills a gap in the literature.
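    Since Nash equilibria coincide with SPT schedules here, they are easy to compute; the following Python sketch (tie-breaking is an assumption) builds one greedily: jobs in shortest-first order each join the machine minimizing their own completion time, and the minsum objective is accumulated along the way.

        def spt_equilibrium(jobs, speeds):
            """Greedy SPT schedule on uniformly related machines.
            jobs: processing requirements; speeds: machine speeds.
            Returns the minsum objective (sum of job completion times)."""
            loads = [0.0] * len(speeds)   # current finish time per machine
            total = 0.0
            for p in sorted(jobs):        # shortest processing time first
                i = min(range(len(speeds)), key=lambda i: loads[i] + p / speeds[i])
                loads[i] += p / speeds[i]
                total += loads[i]
            return total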

    A general guide to applying machine learning to computer architecture

    The resurgence of machine learning since the late 1990s has been enabled by significant advances in computing performance and the growth of big data. The ability of these algorithms to detect complex patterns in data, which is extremely difficult to achieve manually, helps to produce effective predictive models. Whilst computer architects have been accelerating the performance of machine learning algorithms with GPUs and custom hardware, there have been few implementations leveraging these algorithms to improve computer system performance. The work that has been conducted, however, has produced considerably promising results. The purpose of this paper is to serve as a foundational base and guide for future computer architecture research seeking to make use of machine learning models to improve system efficiency. We describe a method that highlights when, why, and how to utilize machine learning models for improving system performance, and provide a relevant example showcasing the effectiveness of applying machine learning in computer architecture. We describe a process of generating data at every execution quantum, followed by parameter engineering. This is followed by a survey of a set of popular machine learning models. We discuss their strengths and weaknesses, and provide an evaluation of implementations for the purpose of creating a workload performance predictor for different core types in an x86 processor. The predictions can then be exploited by a scheduler for heterogeneous processors to improve the system throughput. The algorithms of focus are stochastic gradient descent-based linear regression, decision trees, random forests, artificial neural networks, and k-nearest neighbors. This work has been supported by the European Research Council (ERC) Advanced Grant RoMoL (Grant Agreement 321253) and by the Spanish Ministry of Science and Innovation (contract TIN 2015-65316P).
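    As a flavor of the kind of predictor surveyed, here is a minimal NumPy sketch of SGD-based linear regression over per-quantum hardware counters; the feature layout, learning rate, and epoch count are illustrative assumptions, not the paper's configuration.

        import numpy as np

        def train_sgd_predictor(X, y, lr=1e-3, epochs=20):
            """SGD-based linear regression: X holds one row of hardware-counter
            features per execution quantum, y the observed performance (e.g.
            IPC) on the target core type. Returns weights and bias."""
            w, b = np.zeros(X.shape[1]), 0.0
            for _ in range(epochs):
                for i in np.random.permutation(len(y)):
                    err = (X[i] @ w + b) - y[i]
                    w -= lr * err * X[i]
                    b -= lr * err
            return w, b

    A heterogeneous scheduler could then place each thread, for its next quantum, on the core type with the higher predicted performance.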

    Scheduling MapReduce Jobs under Multi-Round Precedences

    We consider non-preemptive scheduling of MapReduce jobs with multiple tasks in the practical scenario where each job requires several map-reduce rounds. We seek to minimize the average weighted completion time and consider scheduling on identical and unrelated parallel processors. For identical processors, we present LP-based O(1)-approximation algorithms. For unrelated processors, the approximation ratio naturally depends on the maximum number of rounds of any job. Since the number of rounds per job in typical MapReduce algorithms is a small constant, our scheduling algorithms achieve a small approximation ratio in practice. For the single-round case, we substantially improve on the previously best known approximation guarantees for both identical and unrelated processors. Moreover, we conduct an experimental analysis and compare the performance of our algorithms against a fast heuristic and a lower bound on the optimal solution, thus demonstrating their promising practical performance.
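    For intuition about the objective, the sketch below implements the kind of fast heuristic such experiments compare against, not the paper's LP-based algorithms: Smith's rule (sort by processing time over weight) plus greedy list scheduling on m identical processors; the names and the positive-weight assumption are illustrative.

        def wspt_list_schedule(jobs, m):
            """Heuristic for average weighted completion time on m identical
            processors: order jobs by Smith's rule p/w, then greedily append
            each to the least-loaded machine. jobs: (time, weight) pairs with
            positive weights. Returns the total weighted completion time."""
            loads = [0.0] * m
            objective = 0.0
            for p, w in sorted(jobs, key=lambda j: j[0] / j[1]):
                i = min(range(m), key=loads.__getitem__)
                loads[i] += p           # job completes at the machine's new load
                objective += w * loads[i]
            return objective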