Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels
Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels
concurrently. On these GPUs, the thread block scheduler (TBS) uses the FIFO
policy to schedule their thread blocks. We show that FIFO leaves performance to
chance, resulting in significant loss of performance and fairness. To improve
performance and fairness, we propose use of the preemptive Shortest Remaining
Time First (SRTF) policy instead. Although SRTF requires an estimate of the
runtime of GPU kernels, we show that such an estimate can be easily
obtained using online profiling and by exploiting a simple observation about GPU
kernels' grid structure. Specifically, we propose a novel Structural Runtime
Predictor. Using a simple Staircase model of GPU kernel execution, we show that
the runtime of a kernel can be predicted by profiling only the first few thread
blocks. We evaluate an online predictor based on this model on benchmarks from
ERCBench, and find that it can estimate the actual runtime reasonably well
after the execution of only a single thread block. Next, we design a thread
block scheduler that is both concurrent kernel-aware and uses this predictor.
We implement the SRTF policy and evaluate it on two-program workloads from
ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. When compared
to MPMax, a state-of-the-art resource allocation policy for concurrent kernels,
SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also
propose SRTF/Adaptive which controls resource usage of concurrently executing
kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by
2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of
SRTF achieves system throughput to within 12.64% of Shortest Job First (SJF, an
oracle optimal scheduling policy), bridging 49% of the gap between FIFO and
SJF.

Comment: 14 pages, full pre-review version of PACT 2014 poster
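The following Python sketch illustrates the shape of the technique described above; it is not the authors' implementation, and the names (Kernel, blocks_per_wave) and timings are assumptions. It predicts a kernel's runtime from a single profiled thread block via the Staircase model, then orders kernels by shortest predicted remaining time:

```python
# A sketch of Staircase prediction + SRTF ordering, not the paper's code;
# Kernel, blocks_per_wave and all timings are illustrative assumptions.
import heapq
import math
from dataclasses import dataclass, field

@dataclass(order=True)
class Kernel:
    predicted_remaining: float                    # SRTF sort key (seconds)
    name: str = field(compare=False)
    total_blocks: int = field(compare=False)

def staircase_prediction(block_time: float, total_blocks: int,
                         blocks_per_wave: int) -> float:
    """Staircase model: thread blocks execute in waves of `blocks_per_wave`;
    runtime is the number of waves times the time of one profiled block."""
    return math.ceil(total_blocks / blocks_per_wave) * block_time

def srtf_pick(ready: list) -> Kernel:
    """Preemptively favour the kernel with the shortest predicted remaining time."""
    heapq.heapify(ready)
    return ready[0]

# Two concurrent kernels, each profiled for a single thread block.
k1 = Kernel(staircase_prediction(0.8, 512, blocks_per_wave=64), "matmul", 512)
k2 = Kernel(staircase_prediction(0.2, 128, blocks_per_wave=64), "reduce", 128)
print(srtf_pick([k1, k2]).name)   # -> "reduce" (shorter predicted remaining time)
```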
Predicting batch queue job wait times for informed scheduling of urgent HPC workloads
There is increasing interest in the use of HPC machines for urgent workloads
to help tackle disasters as they unfold. Whilst batch queue systems are not
ideal in supporting such workloads, many disadvantages can be worked around by
accurately predicting when a waiting job will start to run. However there are
numerous challenges in achieving such a prediction with high accuracy, not
least because the queue's state can change rapidly and depend upon many
factors. In this work we explore a novel machine learning approach for
predicting queue wait times, hypothesising that such a model can capture the
complex behaviour resulting from the queue policy and other interactions to
generate accurate job start times.
For ARCHER2 (HPE Cray EX), Cirrus (HPE 8600) and 4-cabinet (HPE Cray EX) we
explore how different machine learning approaches and techniques improve the
accuracy of our predictions, comparing against the estimation generated by
Slurm. We demonstrate that our techniques deliver the most accurate predictions
across our machines of interest, with the result of this work being the ability
to predict job start times within one minute of the actual start time for
around 65% of jobs on ARCHER2 and 4-cabinet, and 76% of jobs on Cirrus. When
compared against what Slurm can deliver, this represents around 3.8 times
better accuracy on ARCHER2 and 18 times better for Cirrus. Furthermore, our
approach can accurately predict the start time for three quarters of all jobs
within ten minutes of the actual start time on ARCHER2 and 4-cabinet, and for
90% of jobs on Cirrus. Whilst the driver of this work has been to better
facilitate placement of urgent workloads across HPC machines, the insights
gained can be used to provide wider benefits to users and also enrich existing
batch queue systems and inform policy too.

Comment: Preprint of article at the 2022 Cray User Group (CUG)
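As a hedged sketch of the general approach only (the paper's actual feature set, model family and results are not reproduced here), a regressor can be trained on queue-state snapshots taken at submission time; the feature list and synthetic data below are assumptions:

```python
# Illustrative sketch: map the queue state at submission time to a wait time.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Assumed queue-state features: requested nodes, requested walltime,
# jobs queued ahead, free nodes, node-hours queued ahead.
X = rng.random((n, 5))
y = 3600 * (2 * X[:, 2] + X[:, 4]) + rng.normal(0, 60, n)   # synthetic waits (s)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
print(f"within 60 s of truth: {np.mean(np.abs(pred - y_te) <= 60):.1%}")
```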
Autonomous grid scheduling using probabilistic job runtime scheduling
Computational Grids are evolving into a global, service-oriented architecture –
a universal platform for delivering future computational services to a range of
applications of varying complexity and resource requirements. The thesis focuses
on developing a new scheduling model for general-purpose, utility clusters
based on the concept of user-requested job completion deadlines. In such a
system, a user would be able to request that each job finish by a certain deadline,
possibly at a certain monetary cost. Implementing deadline scheduling is
dependent on the ability to predict the execution time of each queued job, and
on an adaptive scheduling algorithm able to use those predictions to maximise
deadline adherence. The thesis proposes novel solutions to these two problems
and documents their implementation in a largely autonomous and self-managing
way.
The starting point of the work is an extensive analysis of a representative
Grid workload, revealing consistent workflow patterns, usage cycles and correlations between the execution times of jobs and their properties commonly collected
by the Grid middleware for accounting purposes. An automated approach is
proposed to identify these dependencies and use them to partition the highly
variable workload into subsets of more consistent and predictable behaviour.
A range of time-series forecasting models, applied in this context for the first
time, were used to model the job execution times as a function of their historical
behaviour and associated properties. Based on the resulting predictions of job
runtimes, a novel scheduling algorithm is able to estimate the latest job start
time necessary to meet the requested deadline and sort the queue accordingly to
minimise the amount of deadline overrun.
The testing of the proposed approach was done using the actual job trace
collected from a production Grid facility. The best performing execution time
predictor (the auto-regressive moving average method) coupled to workload
partitioning based on three simultaneous job properties returned the median
absolute percentage error centroid of only 4.75%. This level of prediction
accuracy enabled the proposed deadline scheduling method to reduce the average deadline overrun time ten-fold compared to the benchmark batch scheduler.
Overall, the thesis demonstrates that deadline scheduling of computational
jobs on the Grid is achievable using statistical forecasting of job execution times
based on historical information. The proposed approach is easily implementable,
substantially self-managing and better matched to the human workflow, making
it well suited for implementation in the utility Grids of the future.
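A minimal sketch of the deadline-ordering step, assuming each queued job already carries a runtime forecast (the thesis obtains these from time-series models such as the auto-regressive moving average method); the class and field names are illustrative:

```python
# Sort the queue by latest feasible start time, given forecast runtimes.
from dataclasses import dataclass

@dataclass
class QueuedJob:
    name: str
    predicted_runtime: float   # seconds, from the time-series forecaster
    deadline: float            # absolute time in seconds

    @property
    def latest_start(self) -> float:
        return self.deadline - self.predicted_runtime

def deadline_order(queue: list) -> list:
    """Jobs whose deadline leaves the least slack run first, which minimises
    the total deadline overrun when predictions are accurate."""
    return sorted(queue, key=lambda j: j.latest_start)

jobs = [QueuedJob("sim-a", 7200, deadline=20000),
        QueuedJob("sim-b", 600,  deadline=9000)]
print([j.name for j in deadline_order(jobs)])   # -> ['sim-b', 'sim-a']
```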
Specification-Driven Predictive Business Process Monitoring
Predictive analysis in business process monitoring aims at forecasting the
future information of a running business process. The prediction is typically
made based on the model extracted from historical process execution logs (event
logs). In practice, different business domains might require different kinds of
predictions. Hence, it is important to have a means for properly specifying the
desired prediction tasks, and a mechanism to deal with these various prediction
tasks. Although there have been many studies in this area, they mostly focus on
a specific prediction task. This work introduces a language for specifying the
desired prediction tasks that is expressive enough to capture various kinds of
prediction tasks, together with a mechanism for automatically creating the
corresponding prediction model from a given specification. Thus, instead of
targeting a particular prediction task as previous studies do, our approach
handles whichever prediction tasks the specification expresses. We also
provide an implementation of the approach, which is used to conduct experiments
using real-life event logs.

Comment: This article significantly extends the previous work in
https://doi.org/10.1007/978-3-319-91704-7_7, which has a technical report in
arXiv:1804.00617. This article and the previous work have a coauthor in
common.
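To make the specification-driven idea concrete, here is a deliberately toy sketch: the paper defines a formal specification language, whereas the dict-based spec, the single "remaining_time" target and the feature choice below are all assumptions of this illustration:

```python
# One generic routine turns any (toy) spec into a trained prediction model by
# extracting per-prefix features and the spec's target from the event log.
from sklearn.tree import DecisionTreeRegressor

def target_remaining_time(trace, i):
    return trace[-1]["t"] - trace[i]["t"]

TARGETS = {"remaining_time": target_remaining_time}

def build_predictor(event_log, spec):
    """event_log: list of traces; each trace is a list of {"t": timestamp} events.
    spec: e.g. {"predict": "remaining_time", "prefix_len": 2}."""
    X, y = [], []
    for trace in event_log:
        i = spec["prefix_len"] - 1
        if i < len(trace) - 1:
            X.append([trace[i]["t"] - trace[0]["t"], i + 1])  # elapsed time, prefix length
            y.append(TARGETS[spec["predict"]](trace, i))
    return DecisionTreeRegressor().fit(X, y)

log = [[{"t": 0}, {"t": 5}, {"t": 30}], [{"t": 0}, {"t": 2}, {"t": 12}]]
model = build_predictor(log, {"predict": "remaining_time", "prefix_len": 2})
print(model.predict([[4, 2]]))   # predicted remaining time for a running case
```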
"Virtual malleability" applied to MPI jobs to improve their execution in a multiprogrammed environment"
This work focuses on the scheduling of MPI jobs when executing in shared-memory multiprocessors (SMPs). The objective was to obtain the best response-time performance in multiprogrammed multiprocessor systems using batch systems, assuming all jobs have the same priority. To that end, the benefits of supporting malleability in MPI jobs to reduce fragmentation, and consequently improve the performance of the system, were studied. The contributions made in this work can be summarized as follows:

· Virtual malleability: a mechanism where a job is assigned a dynamic processor partition in which the number of processes is greater than the number of processors. The partition size is modified at runtime according to external requirements, such as the load of the system, by varying the multiprogramming level, making the job contend for resources with itself. In addition, a mechanism that decides at runtime whether to apply local or global process queues to an application, depending on the load balance between its processes.

· A job scheduling policy that takes decisions such as how many processes to start with and the maximum multiprogramming degree, based on the type and number of applications running and queued. Moreover, whenever a job finishes execution and there are queued jobs, this algorithm analyzes whether it is better to start execution of another job immediately or to wait until more resources are available.

· A new alternative to backfilling strategies for the problem of the execution-time window expiring. Virtual malleability is applied to the backfilled job, reducing its partition size but without aborting or suspending it as in traditional backfilling.

The evaluation of this thesis was done using a practical approach. All the proposals were implemented, modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions was studied under several types of workloads, varying machine utilization, communication and balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability over MPI jobs. An application obtained better performance when contending for resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was also taken into account, obtaining better performance when the right queue type was applied to each application independently. The proposed job scheduling policy exploited virtual malleability by choosing, at the beginning of execution, parameters such as the number of processes and the maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilization. However, as the load increases, virtual malleability is not enough: when the machine is heavily loaded, jobs, once shrunk, are not able to expand, so they must be executed all the time with a partition smaller than the job size, thus degrading performance. At this point the job scheduling policy concentrated only on moldability. Fragmentation was also alleviated by applying backfilling techniques to the job scheduling algorithm. Virtual malleability proved to be an interesting improvement for the window-expiring problem.
Backfilled jobs, even on a smaller partition, can continue execution, reducing the memory swapping generated by aborts/suspensions. In this way the queuing system is prevented from reinserting the backfilled job in the queue and re-executing it in the future.
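The core resource-allocation move of virtual malleability can be sketched in a few lines, under a simplified policy; the parameter names and the ceiling rule below are illustrative assumptions, not the thesis' exact algorithm:

```python
# A job keeps its fixed number of MPI processes, but the processor partition
# it runs on is shrunk or grown at runtime, so the job time-shares processors
# with itself instead of being suspended or aborted.
def partition_size(n_processes: int, free_cpus: int, max_mpl: int) -> int:
    """Choose how many CPUs to give a job of `n_processes` processes.
    max_mpl bounds the multiprogramming level (processes per CPU)."""
    min_cpus = -(-n_processes // max_mpl)           # ceiling division
    return max(min_cpus, min(n_processes, free_cpus))

# A 64-process job on a loaded machine with 16 free CPUs and MPL <= 4:
# it runs on 16 CPUs with 4 processes each instead of waiting in the queue.
print(partition_size(n_processes=64, free_cpus=16, max_mpl=4))   # -> 16
```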
An occam Style Communications System for UNIX Networks
This document describes the design of a communications system which provides occam-style communications primitives under a Unix environment, using TCP/IP protocols, and any number of other protocols deemed suitable as underlying transport layers. The system will integrate with a low-overhead scheduler/kernel without incurring significant costs to the execution of processes within the runtime environment. A survey of relevant occam and occam3 features and related research is followed by a look at the Unix and TCP/IP facilities that determine our working constraints, and a description of the T9000 transputer's Virtual Channel Processor, which was instrumental in our formulation. Drawing from the information presented here, a design for the communications system is subsequently proposed. Finally, a preliminary investigation is made of methods for lightweight access control to shared resources in an environment that does not provide support for critical sections, semaphores, or busy waiting. This is presented with relevance to mutual exclusion problems that arise within the proposed design. Future directions for the evolution of this project are discussed in conclusion.
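As a hedged illustration of occam-style synchronous (rendezvous) communication over TCP, the sketch below blocks the sender until the receiver has taken the message; the framing and acknowledgement byte are assumptions of this sketch, not the design described above:

```python
# Synchronous, unbuffered channel over a TCP socket: send() completes only
# when the peer's recv() has consumed the value, mirroring occam's ! and ?.
import socket
import struct

class NetChannel:
    def __init__(self, sock: socket.socket):
        self.sock = sock

    def send(self, data: bytes) -> None:
        self.sock.sendall(struct.pack("!I", len(data)) + data)
        if self.sock.recv(1) != b"\x01":           # block until receiver acks
            raise ConnectionError("channel closed")

    def recv(self) -> bytes:
        (n,) = struct.unpack("!I", self._read(4))
        data = self._read(n)
        self.sock.sendall(b"\x01")                 # release the blocked sender
        return data

    def _read(self, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("channel closed")
            buf += chunk
        return buf
```

With one endpoint per process, send() and recv() then pair up like occam's unbuffered channel operations, which is the semantics the proposed system must preserve over whatever transport it uses.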
Supercomputer Emulation For Evaluating Scheduling Algorithms
Scheduling algorithms have a significant impact on the optimal
utilization of HPC facilities, yet the vast majority of the
research in this area is done using simulations. Simulations
neglect many factors that affect a real scheduler, such as its
scheduling processing time, communication latencies and the
scheduler's intrinsic implementation complexity. As a result,
despite theoretical improvements reported in several articles,
practically none of the proposed algorithms have been implemented
in real schedulers, with HPC facilities still using the basic
first-come-first-served (FCFS) with backfill scheduling policy.
A better approach could be, therefore, the use of real schedulers
in an emulation environment to evaluate new algorithms.
This thesis investigates two related challenges in emulations:
computational cost and faithfulness of the results to real
scheduling environments.
It finds that the sampling, shrinking and shuffling of a trace
must be done carefully to keep the classical metrics invariant,
or varying linearly with the size and times of the original
workload. This is accomplished by careful control of the
submission period and by accounting for drifts in the
submission period and trace duration.
This methodology can help researchers to better evaluate their
scheduling algorithms and help HPC administrators to optimize the
parameters of production schedulers.
In order to assess the proposed methodology, we evaluated both
the FCFS with Backfill and Suspend/Resume scheduling algorithms.
The results strongly suggest that Suspend/Resume leads to
better utilization of a supercomputer when high priorities are
given to big jobs.
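A minimal sketch of the submission-period control described above; the uniform sampling and mean-period re-spacing below are assumed details, not the thesis' exact procedure:

```python
# Shrink a workload trace while keeping the offered load invariant.
import random

def shrink_trace(jobs, k, seed=0):
    """jobs: list of (submit_time, job_descriptor), sorted by submit_time.
    Sample k jobs, then re-space their submissions at the original mean
    submission period, so utilization stays invariant while the trace
    duration shrinks linearly with its size."""
    random.seed(seed)
    sample = sorted(random.sample(jobs, k), key=lambda j: j[0])
    first, last = jobs[0][0], jobs[-1][0]
    period = (last - first) / (len(jobs) - 1)      # original mean inter-arrival
    return [(first + i * period, desc) for i, (_, desc) in enumerate(sample)]

trace = [(t * 10.0, f"job{t}") for t in range(100)]   # 100 jobs, 10 s apart
small = shrink_trace(trace, k=10)
print(small[0], small[-1])   # spans ~90 s instead of ~990 s, same mean period
```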
EECluster: An Energy-Efficient Tool for managing HPC Clusters
High Performance Computing clusters have become a very important element in research, academic and industrial communities because they are an excellent platform for solving a wide range of problems through parallel and distributed applications. Nevertheless, this high performance comes at the price of consuming large amounts of energy, which, combined with notably increasing electricity prices, has a significant economic impact, driving up power and cooling costs and forcing IT companies to reduce operating costs. To reduce the high energy consumption of HPC clusters we propose a tool, named EECluster, for managing the energy-efficient allocation of cluster resources. It works with both OGE/SGE and PBS/TORQUE Resource Management Systems (RMS), and its decision-making mechanism is tuned automatically using a machine learning approach. Experimental studies were carried out using actual workloads from the Scientific Modelling Cluster at Oviedo University and the academic cluster used by Oviedo University for teaching high performance computing subjects, to evaluate the results obtained with the adoption of this tool.
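The kind of decision EECluster automates can be illustrated with a simple rule-based sketch; the thresholds and names below are assumptions for illustration, whereas the tool itself tunes its decision mechanism with machine learning rather than fixed rules:

```python
# Power nodes down after a sustained idle period, keeping a small reserve
# powered on for incoming jobs.
def nodes_to_power_off(idle_seconds_per_node, queued_jobs, reserve=2,
                       idle_threshold=600):
    """Return indices of nodes that can be powered down."""
    if queued_jobs > 0:                     # demand exists: keep everything up
        return []
    idle = [i for i, t in enumerate(idle_seconds_per_node) if t >= idle_threshold]
    idle.sort(key=lambda i: -idle_seconds_per_node[i])   # longest-idle first
    return idle[:max(0, len(idle) - reserve)]

print(nodes_to_power_off([900, 30, 1200, 700, 650], queued_jobs=0))  # -> [2, 0]
```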