123 research outputs found

    Self-Adaptive Scheduler Parameterization

    High-end parallel systems present a tremendous research challenge on how to best allocate their resources to match dynamic workload characteristics and user habits that are often unique to each system. Although thoroughly investigated, job scheduling for production systems remains an inexact science, requiring significant experience and intuition from system administrators to properly configure batch schedulers. State-of-the-art schedulers provide many parameters for their configuration, but tuning these to optimize performance and to appropriately respond to the continuously varying characteristics of the workloads can be very difficult: the effects of different parameters and their interactions are often unintuitive. In this paper, we introduce a new and general methodology for automating the difficult process of job scheduler parameterization. Our proposed methodology is based on online simulations of a model of the actual system to provide on-the-fly suggestions to the scheduler for automated parameter adjustment. Detailed performance comparisons via simulation using actual supercomputing traces from the Parallel Workloads Archive indicate that this self-adaptive parameterization via online simulation consistently outperforms other workload-aware methods for scheduler parameterization. This methodology is unique, flexible, and practical in that it requires no a priori knowledge of the workload, it works well even in the presence of poor user runtime estimates, and it can be used to address any system statistic of interest.

    BSLD threshold driven parallel job scheduling for energy efficient HPC centers

    Recently, power awareness in the high performance computing (HPC) community has increased significantly. While CPU power reduction of HPC applications using Dynamic Voltage Frequency Scaling (DVFS) has been explored thoroughly, CPU power management for large scale parallel systems at the system level has been left unexplored. In this paper we propose a power-aware parallel job scheduler assuming DVFS enabled clusters. Traditional parallel job schedulers determine when a job will be run; power-aware ones should also assign the CPU frequency at which it will be run. We introduce two adjustable thresholds to enable fine-grained control of the energy-performance trade-off. Since our power reduction approach is policy independent, it can be added to any parallel job scheduling policy. Furthermore, we analyze HPC system dimensioning. Running an application at a lower frequency on more processors can be more energy efficient than running it at the highest CPU frequency on fewer processors. This paper investigates whether having more DVFS enabled processors under the same load can lead to better energy efficiency and performance. Five workload logs from systems in production use with up to 9,216 processors are simulated to evaluate the proposed algorithm and the dimensioning problem. Our approach decreases CPU energy by 7% to 18% on average, depending on the allowed job performance penalty. Applying the same frequency scaling algorithm to a 20% larger system, the CPU energy needed to execute the same load can be decreased by almost 30% while achieving the same or better job performance. Postprint (published version).
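The idea of assigning a CPU frequency under a slowdown threshold can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not define its two thresholds, so the sketch uses only a single bounded-slowdown (BSLD) threshold, and the function names, the constant `tau`, and the assumption that runtime scales inversely with frequency are all hypothetical choices made for illustration.

```python
def predicted_bsld(wait, runtime_at_fmax, stretch, tau=600.0):
    """Bounded slowdown of a job whose runtime is stretched by `stretch`.

    Assumption: the denominator uses the runtime at the top frequency,
    so running slower shows up as a penalty. `tau` bounds the slowdown
    of very short jobs (hypothetical default of 600 seconds).
    """
    return max((wait + runtime_at_fmax * stretch) / max(runtime_at_fmax, tau), 1.0)


def pick_frequency(wait, runtime_at_fmax, freqs, bsld_threshold, tau=600.0):
    """Return the lowest frequency whose predicted BSLD stays within the threshold.

    Assumption: runtime is inversely proportional to frequency, which is
    pessimistic for jobs that are not CPU bound.
    """
    f_max = max(freqs)
    for f in sorted(freqs):
        stretch = f_max / f
        if predicted_bsld(wait, runtime_at_fmax, stretch, tau) <= bsld_threshold:
            return f
    # No frequency satisfies the threshold: run at full speed.
    return f_max


if __name__ == "__main__":
    gears = [1.0, 1.5, 2.0]  # hypothetical DVFS gears in GHz
    print(pick_frequency(wait=0, runtime_at_fmax=1000, freqs=gears, bsld_threshold=2.0))
    print(pick_frequency(wait=0, runtime_at_fmax=1000, freqs=gears, bsld_threshold=1.5))
```

Raising the threshold lets more jobs run at lower gears and so trades job performance for energy; lowering it pushes jobs back toward the top frequency.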

    The extension of TORQUE scheduler allowing the use of planning and optimization in grids

    In this work we present a major extension of the open source TORQUE Resource Manager system. We have replaced the naive scheduler provided in the TORQUE distribution with a complex scheduling system that makes it possible to plan job execution ahead and predict the behavior of the system. It is based on a job schedule, which represents the jobs’ execution plan. Such functionality is very useful, as users can consult the plan to see when and where their jobs will be executed. Moreover, created plans can easily be evaluated to identify possible inefficiencies. Repair actions can then be taken immediately to fix the inefficiencies, producing better schedules with respect to the considered criteria.

    Including accurate user estimates in HPC schedulers: an empirical analysis

    This article focuses on the problem of dealing with the low accuracy of job runtime estimates provided by users of high performance computing systems. The main goal of the study is to evaluate the benefits for system utilization of providing accurate estimates, in order to motivate users to make an effort to provide better estimates. We propose the Penalty Scheduling Policy for including information about user estimates. The experimental evaluation is performed over realistic workloads and scenarios, and validated by the use of a job scheduler simulator. We simulated different static and dynamic scenarios, which emulate diverse user behavior regarding the estimation of job runtimes. Results demonstrate that the accuracy of users' runtime estimates influences the waiting time of jobs. Under our proposed policy, in a scenario where users improve their estimates, the waiting time of users with high accuracy can be up to 2.43 times lower than that of users with the lowest accuracy. XV Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI).

    Backfilling with fairness and slack for parallel job scheduling

    Parallel jobs have different runtimes and numbers of threads/processes. Thus, scheduling parallel jobs involves a packing problem. If jobs are packed as tightly as possible, utilization will be improved. Otherwise, some resources have to stay idle. The common solution for dealing with idle resources is backfilling, which schedules smaller jobs submitted later to execute earlier as long as they do not postpone the first job or all the previous jobs in the waiting queue. Traditionally, backfilling uses first fit for idle resources, according to the submission order. However, in this case, a better packing of jobs could be missed. Hence, we propose an algorithm which looks further ahead when doing so significantly improves utilization. At the same time, however, this could be unfair to some jobs ahead in the queue, so we use a delay factor as a constraint to limit unfairness. We propose a branch and bound algorithm which selects jobs for backfilling that keep utilization high, while trying to stay close to First-Come-First-Served (FCFS). We evaluate relative response time and utilization and compare to other backfilling approaches. The selection of jobs for backfilling to optimize for high utilization and low delay is implemented as an extension of the existing Scojo-PECT preemptive scheduler.
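The traditional first-fit backfilling that this abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the common variant that protects only the first waiting job, not the paper's branch and bound algorithm or its delay factor; the `Job` fields, the function name, and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int    # processors requested
    runtime: int  # user-estimated runtime


def first_fit_backfill(queue, running, total_procs, now=0):
    """Return the names of queued jobs that may start at time `now`.

    queue:   waiting jobs in submission (FCFS) order
    running: list of (end_time, procs) pairs for jobs already executing
    """
    queue = list(queue)
    running = list(running)
    free = total_procs - sum(p for _, p in running)
    started = []

    # Start jobs strictly in FCFS order for as long as they fit.
    while queue and queue[0].procs <= free:
        job = queue.pop(0)
        free -= job.procs
        running.append((now + job.runtime, job.procs))
        started.append(job.name)
    if not queue:
        return started

    # Reserve for the blocked head job: find the earliest time (shadow)
    # at which enough processors are free, and how many are left over.
    head = queue[0]
    avail, shadow, extra = free, None, 0
    for end, procs in sorted(running):
        avail += procs
        if avail >= head.procs:
            shadow, extra = end, avail - head.procs
            break
    if shadow is None:  # head job asks for more than the machine has
        return started

    # Backfill later jobs, first fit in submission order, only if they
    # cannot postpone the head job's reserved start.
    for job in queue[1:]:
        if job.procs > free:
            continue
        if now + job.runtime <= shadow:
            free -= job.procs      # finishes before the reservation
            started.append(job.name)
        elif job.procs <= extra:
            free -= job.procs      # uses only processors the head job will not need
            extra -= job.procs
            started.append(job.name)
    return started


if __name__ == "__main__":
    waiting = [Job("A", 8, 5), Job("B", 2, 20), Job("C", 2, 5), Job("D", 4, 1)]
    print(first_fit_backfill(waiting, running=[(10, 6)], total_procs=10))
```

In the example, job A is blocked until time 10; B runs past that point but fits in the two processors A leaves over, C finishes before time 10, and D no longer fits. Because the scan takes the first jobs that fit, a tighter packing further down the queue can be missed, which is the gap the abstract addresses.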

    Parallel job scheduling policies to improve fairness: a case study


    A hybrid Markov chain modeling architecture for workload on parallel computers

    This paper proposes a comprehensive modeling architecture for workloads on parallel computers using Markov chains in combination with state-dependent empirical distribution functions. This hybrid approach is based on the requirements of scheduling algorithms: the model considers the four essential job attributes, namely submission time, number of required processors, estimated processing time, and actual processing time. To assess the goodness-of-fit of a workload model, the similarity between sequences of real jobs and jobs generated from the model needs to be captured. We propose to reduce the complexity of this task by instead evaluating the model through a comparison of the results of a widely used scheduling algorithm. This approach is demonstrated with commonly used scheduling objectives such as the Average Weighted Response Time and total Utilization. We compare their outcomes on the simulated workload traces from our model with those of an original workload trace from a real Massively Parallel Processing system installation. To verify this new evaluation technique, standard criteria for assessing the goodness-of-fit of workload models are additionally applied.