The Workflow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures -- Technical Report
Realistic, relevant, and reproducible experiments often need input traces
collected from real-world environments. We focus in this work on traces of
workflows---common in datacenters, clouds, and HPC infrastructures. We show
that the state-of-the-art in using workflow traces raises important issues: (1)
the use of realistic traces is infrequent, and (2) the use of realistic,
open-access traces even more so. Alleviating these issues, we introduce the
Workflow Trace Archive (WTA), an open-access archive of workflow traces from
diverse computing infrastructures and tooling to parse, validate, and analyze
traces. The WTA includes million workflows captured from
computing infrastructures, representing a broad diversity of trace domains and
characteristics. To emphasize the importance of trace diversity, we
characterize the WTA contents and analyze in simulation the impact of trace
diversity on experiment results. Our results indicate significant differences
in characteristics, properties, and workflow structures between workload
sources, domains, and fields.
Tuning EASY-Backfilling Queues
EASY-Backfilling is a popular scheduling heuristic for allocating jobs on large-scale High Performance Computing platforms. While its aggressive reservation mechanism is fast and prevents job starvation, it does not try to optimize any scheduling objective per se. We consider in this work the problem of tuning EASY using queue reordering policies. More precisely, we propose to tune the reordering using a simulation-based methodology. For a given system, we choose the policy that minimizes the average waiting time. This methodology departs from the First-Come, First-Served rule and introduces a risk on the maximum values of the waiting time, which we control using a queue-thresholding mechanism. This new approach is evaluated through a comprehensive experimental campaign on five production logs. In particular, we show that the behavior of the systems under study is stable enough to learn a heuristic that generalizes in a train/test fashion. Indeed, the average waiting time can be reduced consistently (by 11% to 42% on the logs used) compared to EASY, with almost no increase in maximum waiting times. This work departs from previous learning-based approaches and shows that scheduling heuristics for HPC can be learned directly in a policy space.
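As a rough illustration of the mechanism described above, the following sketch plugs a reordering policy into a single EASY-backfilling pass. The `Job` fields, the conservative backfill condition, and the shortest-walltime-first reordering are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Job:
    id: int
    procs: int        # processors requested
    walltime: float   # user-supplied runtime estimate

def easy_backfill(queue, free, now, running_ends, reorder=None):
    """One EASY scheduling pass. `running_ends` maps each running job's
    end time to the number of processors it will release."""
    if reorder:                          # the tuning knob: reorder the queue
        queue = reorder(queue)
    queue, started = list(queue), []
    # start jobs from the front of the queue while they fit
    while queue and queue[0].procs <= free:
        job = queue.pop(0)
        started.append(job)
        free -= job.procs
    if not queue:
        return started
    # reservation for the blocked head job: earliest time at which
    # enough processors become available
    head, avail, t_res = queue[0], free, now
    for t in sorted(running_ends):
        avail += running_ends[t]
        t_res = t
        if avail >= head.procs:
            break
    # conservative backfill: a later job may jump ahead only if it fits
    # now and its estimate says it finishes before the reservation
    for job in queue[1:]:
        if job.procs <= free and now + job.walltime <= t_res:
            started.append(job)
            free -= job.procs
    return started

# 4 processors free; a running job releases 4 more at t=10
queue = [Job(1, 6, 20.0), Job(2, 2, 5.0), Job(3, 2, 15.0)]
print([j.id for j in easy_backfill(queue, free=4, now=0.0,
                                   running_ends={10.0: 4})])  # [2]
```

With a shortest-walltime-first reordering (`lambda q: sorted(q, key=lambda j: j.walltime)`), short jobs reach the head of the queue first, which is the kind of average-waiting-time reduction the tuning targets.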
Improving Backfilling by using Machine Learning to predict Running Times
The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever-increasing amount of data, they are characterized by uncertainties on some parameters, such as job running times. The question raised in this work is: to what extent is it possible and useful to take predictions of job running times into account to improve global scheduling? We present a comprehensive study answering this question for the popular EASY backfilling policy. More precisely, we rely on classical machine-learning methods and propose new cost functions well adapted to the problem. We then assess the proposed solutions through intensive simulations on several production logs. Finally, we propose a new scheduling algorithm that outperforms the popular EASY backfilling algorithm by 28% on the average bounded-slowdown objective.
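One way to picture a cost function "well adapted to the problem" is an asymmetric loss in which under-predicting a running time (which can break the backfill guarantee) costs more than over-predicting it. This is a hedged sketch; the weight `alpha` and the form of the loss are assumptions, not the paper's actual cost functions:

```python
def asymmetric_loss(predicted, actual, alpha=3.0):
    """Illustrative training loss: under-prediction (predicted < actual)
    can break the backfill guarantee, so it is penalized `alpha` times
    more heavily than over-prediction. `alpha` is an assumed weight."""
    err = predicted - actual
    return -alpha * err if err < 0 else err

# a predictor trained with such a loss is biased toward slight
# over-estimation, which is the safer direction for EASY backfilling
print(asymmetric_loss(8.0, 10.0), asymmetric_loss(12.0, 10.0))  # 6.0 2.0
```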
Supercomputer Emulation For Evaluating Scheduling Algorithms
Scheduling algorithms have a significant impact on the efficient utilization of HPC facilities, yet the vast majority of the research in this area is done using simulations. In simulations, many factors that affect a real scheduler, such as its scheduling processing time, communication latencies, and the intrinsic complexity of the scheduler implementation, are not considered. As a result, despite theoretical improvements reported in several articles, practically none of the newly proposed algorithms have been implemented in real schedulers, and HPC facilities still use the basic first-come, first-served (FCFS) with backfilling scheduling policy.
A better approach, therefore, could be to use real schedulers in an emulation environment to evaluate new algorithms.
This thesis investigates two related challenges in emulations:
computational cost and faithfulness of the results to real
scheduling environments.
It finds that the sampling, shrinking, and shuffling of a trace must be done carefully to keep the classical metrics invariant, or linearly varying, with respect to the size and times of the original workload. This is accomplished by careful control of the submission period and by accounting for drifts in the submission period and trace duration.
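A minimal sketch of the shrinking idea, under assumed SWF-like job fields: keeping every k-th job and compressing the submission axis by the same factor keeps the offered load comparable to the original trace. This illustrates the principle, not the thesis's exact procedure:

```python
def shrink_trace(jobs, k):
    """Keep every k-th job of an SWF-like trace (dicts with 'submit',
    'runtime', 'procs') and compress the submission axis by k, so that
    the reduced trace offers a load comparable to the original."""
    sampled = jobs[::k]
    if not sampled:
        return []
    t0 = sampled[0]['submit']
    return [dict(j, submit=t0 + (j['submit'] - t0) / k) for j in sampled]

jobs = [{'submit': i * 10.0, 'runtime': 5.0, 'procs': 1} for i in range(10)]
print([j['submit'] for j in shrink_trace(jobs, 2)])
# [0.0, 10.0, 20.0, 30.0, 40.0]
```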
This methodology can help researchers to better evaluate their
scheduling algorithms and help HPC administrators to optimize the
parameters of production schedulers.
In order to assess the proposed methodology, we evaluated both the FCFS-with-backfilling and Suspend/Resume scheduling algorithms. The results strongly suggest that Suspend/Resume leads to better utilization of a supercomputer when high priorities are given to large jobs.
Personal mobile grids with a honeybee inspired resource scheduler
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The overall aim of the thesis has been to introduce Personal Mobile Grids (PM-Grids) as a novel paradigm in grid computing that scales grid infrastructures to mobile devices and extends grid entities to individual personal users. In this thesis, architectural designs as well as simulation models for PM-Grids are developed.
The core of any grid system is its resource scheduler. However, virtually all current conventional grid schedulers do not address the non-clairvoyant scheduling problem, where job information is not available before the end of execution. Therefore, this thesis proposes a honeybee-inspired resource scheduling heuristic for PM-Grids (HoPe), incorporating a radical approach to grid resource scheduling to tackle this problem. A detailed design and implementation of HoPe, with a decentralised self-management and adaptive policy, are presented.
Among the other main contributions are a comprehensive taxonomy of grid systems as well as a detailed analysis of the honeybee colony and its nectar acquisition process (NAP), from the resource scheduling perspective, which have not been presented in any previous work, to the best of our knowledge.
PM-Grid designs and HoPe implementation were evaluated thoroughly through a strictly controlled empirical evaluation framework with a well-established heuristic in high throughput computing, the opportunistic scheduling heuristic (OSH), as a benchmark algorithm. Comparisons with optimal values and worst bounds are conducted to gain a clear insight into HoPe behaviour, in terms of stability, throughput, turnaround time and speedup, under different running conditions of number of jobs and grid scales.
Experimental results demonstrate the superiority of HoPe, which successfully maintained optimum stability and throughput in more than 95% of the experiments and performed three times better than the OSH under extremely heavy loads. Regarding turnaround time and speedup, HoPe effectively incurred less than 50% of the turnaround time of the OSH while doubling its speedup in more than 60% of the experiments. These results indicate the potential of both PM-Grids and HoPe in realising futuristic grid visions. Therefore, the deployment of PM-Grids in real-life scenarios and the utilisation of HoPe in other parallel processing and high-throughput computing systems are recommended.
DESIGN AND EVALUATION OF RESOURCE ALLOCATION AND JOB SCHEDULING ALGORITHMS ON COMPUTATIONAL GRIDS
The grid, an infrastructure for resource sharing, has shown its importance in many scientific applications requiring very high computational power. Grid computing enables the sharing, selection, and aggregation of resources for solving complex and large-scale scientific problems. Grid computing, whose resources are distributed, heterogeneous, and dynamic in nature, introduces a number of fascinating issues in resource management. Grid scheduling is the key issue in a grid environment: the system must meet the functional requirements of heterogeneous domains (user, application, and network), which are sometimes conflicting in nature. Moreover, the system must satisfy non-functional requirements such as reliability, efficiency, performance, effective resource utilization, and scalability. Thus, the overall aim of this research is to introduce new grid scheduling algorithms for resource allocation as well as for job scheduling, enabling a highly efficient and effective utilization of the resources in executing various applications.
The four prime aspects of this work are: firstly, a model of the grid scheduling problem for a dynamic grid computing environment; secondly, the development of a new web-based simulator (SyedWSim), enabling grid users to conduct a statistical analysis of grid workload traces and providing a realistic basis for experimentation with resource allocation and job scheduling algorithms on a grid; thirdly, the proposal of a new grid resource allocation method that achieves optimal computational cost with respect to other allocation methods, using synthetic and real workload traces; and finally, the proposal of new job scheduling algorithms of optimal performance, considering parameters such as waiting time, turnaround time, response time, bounded slowdown, completion time, and stretch time. The issue is not only to develop new algorithms, but also to evaluate them on an experimental computational grid, using synthetic and real workload traces, alongside other existing job scheduling algorithms. Experimental evaluation confirmed that the proposed grid scheduling algorithms possess a high degree of optimality in performance, efficiency, and scalability.
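For reference, the per-job metrics listed above can be computed as follows. This is a conventional sketch; the bounded-slowdown threshold of 10 seconds is common practice in the literature, not necessarily this thesis's choice:

```python
def job_metrics(submit, start, runtime, tau=10.0):
    """Classical per-job scheduling metrics; `tau` is the bounded-slowdown
    threshold (10 s is a common choice in the literature)."""
    wait = start - submit
    turnaround = wait + runtime          # also called response/completion time
    return {
        'wait': wait,
        'turnaround': turnaround,
        'slowdown': turnaround / runtime,            # also called stretch
        'bounded_slowdown': max(turnaround / max(runtime, tau), 1.0),
    }

# bounding tames the slowdown of very short jobs:
print(job_metrics(0.0, 30.0, 2.0)['slowdown'])           # 16.0
print(job_metrics(0.0, 30.0, 2.0)['bounded_slowdown'])   # 3.2
```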
Workload modeling and performance evaluation in parallel systems
Scheduling plays a significant role in producing good performance for clusters and grids. Smart scheduling policies in these systems are essential to enable efficient resource allocation mechanisms. One of the key factors that has a strong effect on scheduling is the workload. This workload problem is associated with four research topics required to obtain an effective scheduler, namely workload characterisation, workload modeling, performance evaluation and prediction, and scheduling design. Workload data collected from real systems are the best source for improving our knowledge about performance issues of clusters and grids. Observed features of these workloads are precious sources of clues, which can be utilized to enhance scheduling. To this end, several long-term parallel and grid workloads have been collected, and this thesis used these real workloads in the study of workload characterisation, workload modeling, performance evaluation, and prediction. Our research resulted in many workload modeling tools, a performance predictor, and several useful clues that are essential to develop efficient cluster and grid schedulers.
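As a small illustration of the workload-modeling step, one common convention is to fit a lognormal distribution to observed inter-arrival times and draw synthetic arrivals from it. The distribution choice and function names here are assumptions for the sketch, not this thesis's actual models:

```python
import math
import random
import statistics

def fit_lognormal(inter_arrivals):
    """Fit a lognormal by matching the mean/stdev of the log samples."""
    logs = [math.log(x) for x in inter_arrivals]
    return statistics.mean(logs), statistics.stdev(logs)

def synthesize_arrivals(mu, sigma, n, seed=42):
    """Draw n synthetic submission times with lognormal inter-arrivals."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(n):
        t += rng.lognormvariate(mu, sigma)
        arrivals.append(t)
    return arrivals

mu, sigma = fit_lognormal([1.0, 2.0, 4.0, 8.0])
trace = synthesize_arrivals(mu, sigma, 100)   # strictly increasing times
```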