The Workflow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures -- Technical Report
Realistic, relevant, and reproducible experiments often need input traces
collected from real-world environments. We focus in this work on traces of
workflows---common in datacenters, clouds, and HPC infrastructures. We show
that the state-of-the-art in using workflow traces raises important issues: (1)
the use of realistic traces is infrequent, and (2) the use of realistic,
open-access traces even more so. Alleviating these issues, we introduce the
Workflow Trace Archive (WTA), an open-access archive of workflow traces from
diverse computing infrastructures and tooling to parse, validate, and analyze
traces. The WTA includes million workflows captured from
computing infrastructures, representing a broad diversity of trace domains and
characteristics. To emphasize the importance of trace diversity, we
characterize the WTA contents and analyze in simulation the impact of trace
diversity on experiment results. Our results indicate significant differences
in characteristics, properties, and workflow structures between workload
sources, domains, and fields.
Tuning EASY-Backfilling Queues
EASY-Backfilling is a popular scheduling heuristic for allocating jobs on large-scale High Performance Computing platforms. While its aggressive reservation mechanism is fast and prevents job starvation, it does not try to optimize any scheduling objective per se. We consider in this work the problem of tuning EASY using queue reordering policies. More precisely, we propose to tune the reordering using a simulation-based methodology. For a given system, we choose the policy that minimizes the average waiting time. This methodology departs from the First-Come, First-Served rule and introduces a risk on the maximum values of the waiting time, which we control using a queue-thresholding mechanism. This new approach is evaluated through a comprehensive experimental campaign on five production logs. In particular, we show that the behavior of the systems under study is stable enough to learn a heuristic that generalizes in a train/test fashion. Indeed, the average waiting time can be reduced consistently (by 11% to 42% on the logs used) compared to EASY, with almost no increase in maximum waiting times. This work departs from previous learning-based approaches and shows that scheduling heuristics for HPC can be learned directly in a policy space.
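As a rough illustration of the mechanism described above, the following sketch plugs a reordering policy into a single EASY-backfilling pass. The `Job` fields, the conservative backfill condition, and the shortest-walltime-first reordering are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Job:
    id: int
    procs: int        # processors requested
    walltime: float   # user-supplied runtime estimate

def easy_backfill(queue, free, now, running_ends, reorder=None):
    """One EASY scheduling pass. `running_ends` maps each running job's
    end time to the number of processors it will release."""
    if reorder:                          # the tuning knob: reorder the queue
        queue = reorder(queue)
    queue, started = list(queue), []
    # start jobs from the front of the queue while they fit
    while queue and queue[0].procs <= free:
        job = queue.pop(0)
        started.append(job)
        free -= job.procs
    if not queue:
        return started
    # reservation for the blocked head job: earliest time at which
    # enough processors become available
    head, avail, t_res = queue[0], free, now
    for t in sorted(running_ends):
        avail += running_ends[t]
        t_res = t
        if avail >= head.procs:
            break
    # conservative backfill: a later job may jump ahead only if it fits
    # now and its estimate says it finishes before the reservation
    for job in queue[1:]:
        if job.procs <= free and now + job.walltime <= t_res:
            started.append(job)
            free -= job.procs
    return started

# 4 processors free; a running job releases 4 more at t=10
queue = [Job(1, 6, 20.0), Job(2, 2, 5.0), Job(3, 2, 15.0)]
print([j.id for j in easy_backfill(queue, free=4, now=0.0,
                                   running_ends={10.0: 4})])  # [2]
```

With a shortest-walltime-first reordering (`lambda q: sorted(q, key=lambda j: j.walltime)`), short jobs reach the head of the queue first, which is the kind of average-waiting-time reduction the tuning targets.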
Improving Backfilling by using Machine Learning to predict Running Times
The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever-increasing amount of data, they are characterized by uncertainties on some parameters, such as job running times. The question raised in this work is: to what extent is it possible and useful to take predictions of job running times into account to improve global scheduling? We present a comprehensive study answering this question for the popular EASY backfilling policy. More precisely, we rely on classical machine-learning methods and propose new cost functions well adapted to the problem. We then assess the proposed solutions through intensive simulations on several production logs. Finally, we propose a new scheduling algorithm that outperforms the popular EASY backfilling algorithm by 28% on the average bounded-slowdown objective.
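One way to picture a cost function "well adapted to the problem" is an asymmetric loss in which under-predicting a running time (which can break the backfill guarantee) costs more than over-predicting it. This is a hedged sketch; the weight `alpha` and the form of the loss are assumptions, not the paper's actual cost functions:

```python
def asymmetric_loss(predicted, actual, alpha=3.0):
    """Illustrative training loss: under-prediction (predicted < actual)
    can break the backfill guarantee, so it is penalized `alpha` times
    more heavily than over-prediction. `alpha` is an assumed weight."""
    err = predicted - actual
    return -alpha * err if err < 0 else err

# a predictor trained with such a loss is biased toward slight
# over-estimation, which is the safer direction for EASY backfilling
print(asymmetric_loss(8.0, 10.0), asymmetric_loss(12.0, 10.0))  # 6.0 2.0
```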
Supercomputer Emulation For Evaluating Scheduling Algorithms
Scheduling algorithms have a significant impact on the efficient utilization of HPC facilities, yet the vast majority of the research in this area is done using simulations. In simulations, many factors that affect a real scheduler, such as its scheduling processing time, communication latencies, and the intrinsic complexity of the scheduler implementation, are not considered. As a result, despite theoretical improvements reported in several articles, practically none of the newly proposed algorithms have been implemented in real schedulers, and HPC facilities still use the basic first-come, first-served (FCFS) with backfilling scheduling policy.
A better approach, therefore, could be to use real schedulers in an emulation environment to evaluate new algorithms.
This thesis investigates two related challenges in emulations:
computational cost and faithfulness of the results to real
scheduling environments.
It finds that the sampling, shrinking, and shuffling of a trace must be done carefully to keep the classical metrics invariant, or linearly varying, with respect to the size and times of the original workload. This is accomplished by careful control of the submission period and by accounting for drifts in the submission period and trace duration.
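A minimal sketch of the shrinking idea, under assumed SWF-like job fields: keeping every k-th job and compressing the submission axis by the same factor keeps the offered load comparable to the original trace. This illustrates the principle, not the thesis's exact procedure:

```python
def shrink_trace(jobs, k):
    """Keep every k-th job of an SWF-like trace (dicts with 'submit',
    'runtime', 'procs') and compress the submission axis by k, so that
    the reduced trace offers a load comparable to the original."""
    sampled = jobs[::k]
    if not sampled:
        return []
    t0 = sampled[0]['submit']
    return [dict(j, submit=t0 + (j['submit'] - t0) / k) for j in sampled]

jobs = [{'submit': i * 10.0, 'runtime': 5.0, 'procs': 1} for i in range(10)]
print([j['submit'] for j in shrink_trace(jobs, 2)])
# [0.0, 10.0, 20.0, 30.0, 40.0]
```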
This methodology can help researchers to better evaluate their
scheduling algorithms and help HPC administrators to optimize the
parameters of production schedulers.
In order to assess the proposed methodology, we evaluated both the FCFS-with-backfilling and Suspend/Resume scheduling algorithms. The results strongly suggest that Suspend/Resume leads to better utilization of a supercomputer when high priorities are given to large jobs.
Personal mobile grids with a honeybee inspired resource scheduler
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The overall aim of the thesis has been to introduce Personal Mobile Grids (PM-Grids) as a novel paradigm in grid computing that scales grid infrastructures to mobile devices and extends grid entities to individual personal users. In this thesis, architectural designs as well as simulation models for PM-Grids are developed.
The core of any grid system is its resource scheduler. However, virtually all current conventional grid schedulers do not address the non-clairvoyant scheduling problem, where job information is not available before the end of execution. Therefore, this thesis proposes a honeybee-inspired resource scheduling heuristic for PM-Grids (HoPe), incorporating a radical approach to grid resource scheduling to tackle this problem. A detailed design and implementation of HoPe, with a decentralised self-management and adaptive policy, are presented.
Among the other main contributions are a comprehensive taxonomy of grid systems as well as a detailed analysis of the honeybee colony and its nectar acquisition process (NAP), from the resource scheduling perspective, which have not been presented in any previous work, to the best of our knowledge.
PM-Grid designs and HoPe implementation were evaluated thoroughly through a strictly controlled empirical evaluation framework with a well-established heuristic in high throughput computing, the opportunistic scheduling heuristic (OSH), as a benchmark algorithm. Comparisons with optimal values and worst bounds are conducted to gain a clear insight into HoPe behaviour, in terms of stability, throughput, turnaround time and speedup, under different running conditions of number of jobs and grid scales.
Experimental results demonstrate the superiority of HoPe, which successfully maintained optimum stability and throughput in more than 95% of the experiments and performed three times better than the OSH under extremely heavy loads. Regarding turnaround time and speedup, HoPe effectively incurred less than 50% of the turnaround time of the OSH while doubling its speedup in more than 60% of the experiments. These results indicate the potential of both PM-Grids and HoPe in realising futuristic grid visions. Therefore, the deployment of PM-Grids in real-life scenarios and the utilisation of HoPe in other parallel processing and high-throughput computing systems are recommended.
DESIGN AND EVALUATION OF RESOURCE ALLOCATION AND JOB SCHEDULING ALGORITHMS ON COMPUTATIONAL GRIDS
The grid, an infrastructure for resource sharing, has shown its importance in many scientific applications requiring very high computational power. Grid computing enables the sharing, selection, and aggregation of resources for solving complex and large-scale scientific problems. Grid computing, whose resources are distributed, heterogeneous, and dynamic in nature, introduces a number of fascinating issues in resource management. Grid scheduling is the key issue in a grid environment: the system must meet the functional requirements of heterogeneous domains (user, application, and network), which are sometimes conflicting in nature. Moreover, the system must satisfy non-functional requirements such as reliability, efficiency, performance, effective resource utilization, and scalability. Thus, the overall aim of this research is to introduce new grid scheduling algorithms for resource allocation as well as for job scheduling, enabling a highly efficient and effective utilization of the resources in executing various applications.
The four prime aspects of this work are: firstly, a model of the grid scheduling problem for a dynamic grid computing environment; secondly, the development of a new web-based simulator (SyedWSim), enabling grid users to conduct a statistical analysis of grid workload traces and providing a realistic basis for experimentation with resource allocation and job scheduling algorithms on a grid; thirdly, the proposal of a new grid resource allocation method that achieves optimal computational cost with respect to other allocation methods, using synthetic and real workload traces; and finally, the proposal of new job scheduling algorithms of optimal performance, considering parameters such as waiting time, turnaround time, response time, bounded slowdown, completion time, and stretch time. The issue is not only to develop new algorithms, but also to evaluate them on an experimental computational grid, using synthetic and real workload traces, alongside other existing job scheduling algorithms. Experimental evaluation confirmed that the proposed grid scheduling algorithms possess a high degree of optimality in performance, efficiency, and scalability.
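For reference, the per-job metrics listed above can be computed as follows. This is a conventional sketch; the bounded-slowdown threshold of 10 seconds is common practice in the literature, not necessarily this thesis's choice:

```python
def job_metrics(submit, start, runtime, tau=10.0):
    """Classical per-job scheduling metrics; `tau` is the bounded-slowdown
    threshold (10 s is a common choice in the literature)."""
    wait = start - submit
    turnaround = wait + runtime          # also called response/completion time
    return {
        'wait': wait,
        'turnaround': turnaround,
        'slowdown': turnaround / runtime,            # also called stretch
        'bounded_slowdown': max(turnaround / max(runtime, tau), 1.0),
    }

# bounding tames the slowdown of very short jobs:
print(job_metrics(0.0, 30.0, 2.0)['slowdown'])           # 16.0
print(job_metrics(0.0, 30.0, 2.0)['bounded_slowdown'])   # 3.2
```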
Workload modeling and performance evaluation in parallel systems
Scheduling plays a significant role in producing good performance for clusters and grids. Smart scheduling policies in these systems are essential to enable efficient resource allocation mechanisms. One of the key factors that has a strong effect on scheduling is the workload. This workload problem is associated with four research topics required to obtain an effective scheduler, namely workload characterisation, workload modeling, performance evaluation and prediction, and scheduling design. Workload data collected from real systems are the best source for improving our knowledge about performance issues of clusters and grids. Observed features of these workloads are precious sources of clues, which can be utilized to enhance scheduling. To this end, several long-term parallel and grid workloads have been collected, and this thesis used these real workloads in the study of workload characterisation, workload modeling, performance evaluation, and prediction. Our research resulted in many workload modeling tools, a performance predictor, and several useful clues that are essential to develop efficient cluster and grid schedulers.
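As a small illustration of the workload-modeling step, one common convention is to fit a lognormal distribution to observed inter-arrival times and draw synthetic arrivals from it. The distribution choice and function names here are assumptions for the sketch, not this thesis's actual models:

```python
import math
import random
import statistics

def fit_lognormal(inter_arrivals):
    """Fit a lognormal by matching the mean/stdev of the log samples."""
    logs = [math.log(x) for x in inter_arrivals]
    return statistics.mean(logs), statistics.stdev(logs)

def synthesize_arrivals(mu, sigma, n, seed=42):
    """Draw n synthetic submission times with lognormal inter-arrivals."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(n):
        t += rng.lognormvariate(mu, sigma)
        arrivals.append(t)
    return arrivals

mu, sigma = fit_lognormal([1.0, 2.0, 4.0, 8.0])
trace = synthesize_arrivals(mu, sigma, 100)   # strictly increasing times
```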