
    Improving Backfilling by using Machine Learning to predict Running Times

    The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever-increasing amount of data, they are characterized by uncertainty in parameters such as job running times. The question raised in this work is: to what extent is it possible and useful to take predictions of job running times into account to improve global scheduling? We present a comprehensive study answering this question under the popular EASY backfilling policy. More precisely, we rely on classical machine learning methods and propose new cost functions well suited to the problem. We then assess the proposed solutions through intensive simulations using several production logs. Finally, we propose a new scheduling algorithm that outperforms EASY backfilling by 28% on the average bounded slowdown objective.
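    To make the two key ingredients of this abstract concrete, here is a minimal sketch (not the authors' implementation) of how a predicted running time could feed the EASY backfilling test, together with the average bounded slowdown metric used as the objective. The Job fields and the 10-second threshold are common conventions, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Job:
    wait_time: float       # seconds spent in the queue
    run_time: float        # actual running time, seconds
    predicted_time: float  # ML-predicted running time, seconds
    cores: int

def bounded_slowdown(job: Job, tau: float = 10.0) -> float:
    """Classical bounded slowdown: (wait + run) / max(run, tau), floored at 1."""
    return max((job.wait_time + job.run_time) / max(job.run_time, tau), 1.0)

def avg_bounded_slowdown(jobs: list[Job]) -> float:
    return sum(bounded_slowdown(j) for j in jobs) / len(jobs)

def can_backfill(candidate: Job, free_cores: int, shadow_time: float, now: float) -> bool:
    """EASY rule: a candidate may jump ahead only if it fits in the current hole
    and, based on its *predicted* runtime, finishes before the reservation made
    for the first queued job (the 'shadow time')."""
    return candidate.cores <= free_cores and now + candidate.predicted_time <= shadow_time
```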

    Data-driven job dispatching in HPC systems

    As High Performance Computing (HPC) systems get closer to exascale performance, job dispatching strategies become critical for keeping system utilization high and waiting times low for jobs competing for HPC resources. In this paper, we take a data-driven approach and investigate whether better dispatching decisions can be made by transforming the log data produced by an HPC system into useful knowledge about its workload. In particular, we focus on job duration, develop a data-driven approach to job duration prediction, and analyze the effect of different prediction approaches on dispatching decisions using a real workload dataset collected from Eurora, a hybrid HPC system. Experiments on various dispatching methods show promising results.
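    A minimal sketch of the kind of log-to-prediction pipeline this abstract describes, assuming a pandas DataFrame of historical job records with columns such as user_id, queue, requested_cores, requested_walltime and the observed duration. The column names and the model choice are illustrative assumptions; the paper's Eurora feature set is not reproduced here.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def train_duration_model(log: pd.DataFrame) -> GradientBoostingRegressor:
    # One-hot encode categorical job attributes, keep numeric requests as-is.
    features = pd.get_dummies(
        log[["user_id", "queue", "requested_cores", "requested_walltime"]],
        columns=["user_id", "queue"])
    # Time-ordered split: train on older jobs, evaluate on more recent ones.
    X_train, X_test, y_train, y_test = train_test_split(
        features, log["duration"], test_size=0.2, shuffle=False)
    model = GradientBoostingRegressor()
    model.fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))
    return model
```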

    Metascheduling of HPC Jobs in Day-Ahead Electricity Markets

    High performance grid computing is a key enabler of large-scale collaborative computational science. With the promise of exascale computing, high performance grid systems are expected to incur electricity bills that grow super-linearly over time. In order to achieve cost effectiveness in these systems, it is essential for the scheduling algorithms to exploit electricity price variations, both in space and time, that are prevalent in dynamic electricity markets. In this paper, we present a metascheduling algorithm to optimize the placement of jobs in a compute grid which consumes electricity from the day-ahead wholesale market. We formulate the scheduling problem as a Minimum Cost Maximum Flow problem and leverage queue waiting time and electricity price predictions to accurately estimate the cost of job execution at a system. Using trace-based simulation with real and synthetic workload traces, and real electricity price data sets, we demonstrate our approach on two currently operational grids, XSEDE and NorduGrid. Our experimental setup collectively comprises more than 433K processors spread across 58 compute systems in 17 geographically distributed locations. Experiments show that our approach simultaneously optimizes the total electricity cost and the average response time of the grid, without being unfair to users of the local batch systems. (Appears in IEEE Transactions on Parallel and Distributed Systems.)
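    As an illustration only (not the paper's formulation), job placement of this kind can be cast as a min-cost max-flow problem with networkx: jobs on one side, compute systems on the other, and edge costs combining predicted electricity price with a queue-waiting-time penalty. The node naming, the cost function and the capacity units below are assumptions.

```python
import networkx as nx

def place_jobs(jobs, systems, cost):
    """jobs: {job_id: node_hours}; systems: {sys_id: available_node_hours};
    cost(job_id, sys_id) -> integer cost (e.g. cents plus a scaled wait-time
    penalty); networkx's network simplex prefers integer weights."""
    G = nx.DiGraph()
    for j, demand in jobs.items():
        G.add_edge("source", f"job:{j}", capacity=demand, weight=0)
        for s in systems:
            G.add_edge(f"job:{j}", f"sys:{s}", capacity=demand, weight=cost(j, s))
    for s, avail in systems.items():
        G.add_edge(f"sys:{s}", "sink", capacity=avail, weight=0)
    flow = nx.max_flow_min_cost(G, "source", "sink")
    # flow[f"job:{j}"][f"sys:{s}"] gives the node-hours of job j routed to system s.
    return flow
```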

    Improving the Performance of Batch Schedulers Using Online Job Size Classification

    Job scheduling in high-performance computing platforms is a hard problem that involves uncertainty in both the job arrival process and job execution times. Users typically provide loose upper-bound estimates of job execution times that are of little use. Previous studies attempted to improve these estimates using regression techniques. Although these attempts provide reasonable predictions, they require a large amount of historical training data. Furthermore, aiming for perfect prediction may be of limited use for scheduling purposes. In this work, we propose a simpler approach: classifying jobs as small or large and prioritizing the execution of small jobs over large ones. Indeed, small jobs are the most impacted by queuing delays, yet they typically represent a light load and impose only a small burden on other jobs. The classifier operates online and learns from data collected over the previous weeks, facilitating its deployment and enabling fast adaptation to changes in workload characteristics. We evaluate our approach using four scheduling policies on six HPC platform workload traces. We show that: (i) incorporating such classification reduces the average bounded slowdown of jobs in all scenarios, and (ii) the obtained improvements are comparable, in most scenarios, to the ideal hypothetical situation where the scheduler would know the exact running time of jobs in advance.
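    A minimal sketch, not the paper's exact pipeline: classify incoming jobs as "small" or "large" from jobs finished in the previous weeks and sort predicted-small jobs to the front of the queue. The one-hour threshold, the feature layout and the classifier choice are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

SMALL_THRESHOLD_S = 3600  # jobs shorter than 1 hour count as "small" (assumption)

def weekly_retrain(history_X: np.ndarray, history_runtimes: np.ndarray):
    """history_X: features of recently finished jobs (e.g. requested walltime,
    cores, user); history_runtimes: their observed runtimes in seconds."""
    y = (history_runtimes < SMALL_THRESHOLD_S).astype(int)
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(history_X, y)
    return clf

def queue_order(clf, waiting_X: np.ndarray) -> np.ndarray:
    """Return indices of waiting jobs, predicted-small ones first."""
    predicted_small = clf.predict(waiting_X)
    # Stable sort keeps first-come-first-served order within each class.
    return np.argsort(-predicted_small, kind="stable")
```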

    Adapting Batch Scheduling to Workload Characteristics: What Can We Expect From Online Learning?

    Despite the impressive growth and size of supercomputers, the computational power they provide still cannot match the demand. Efficient and fair resource allocation is a critical task. Supercomputers use Resource and Job Management Systems to schedule applications, which is generally done by relying on generic index policies such as First Come First Served and Shortest Processing Time First in combination with backfilling strategies. Unfortunately, such generic policies often fail to exploit specific characteristics of real workloads. In this work, we focus on improving the performance of online schedulers. We study mixed policies, which are created by combining multiple job characteristics in a weighted linear expression, as opposed to classical pure policies which use only a single characteristic. This larger class of scheduling policies aims at providing more flexibility and adaptability. We use space coverage and black-box optimization techniques to explore this new space of mixed policies and study how they can adapt to changes in the workload. We perform an extensive experimental campaign through which we show that (1) even the best pure policy is far from optimal, (2) a carefully tuned mixed policy can significantly improve the performance of the system, and (3) there is no one-size-fits-all policy: the rapid evolution of the workload seems to prevent classical online learning algorithms from being effective.
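    A minimal sketch of the "mixed policy" idea: the queue is sorted by a weighted linear combination of job characteristics instead of a single one (pure FCFS or SPF), and the weights are tuned by black-box search against a simulated objective. The feature set, the random-search tuner and the simulate() callback are illustrative assumptions, not the paper's setup.

```python
import random

def mixed_score(job: dict, w: list[float]) -> float:
    # job: dict with e.g. submit_time, requested_time, requested_cores.
    return (w[0] * job["submit_time"]          # FCFS component
            + w[1] * job["requested_time"]     # SPF component
            + w[2] * job["requested_cores"])   # job-size component

def schedule(queue: list[dict], w: list[float]) -> list[dict]:
    return sorted(queue, key=lambda job: mixed_score(job, w))

def tune_weights(simulate, trials: int = 200):
    """Black-box random search for weights minimizing a simulated objective
    (e.g. average bounded slowdown). simulate(w) -> float is assumed given."""
    best_w, best_obj = None, float("inf")
    for _ in range(trials):
        w = [random.uniform(-1.0, 1.0) for _ in range(3)]
        obj = simulate(w)
        if obj < best_obj:
            best_w, best_obj = w, obj
    return best_w, best_obj
```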