243 research outputs found
Towards Optimality in Parallel Scheduling
To keep pace with Moore's law, chip designers have focused on increasing the
number of cores per chip rather than single core performance. In turn, modern
jobs are often designed to run on any number of cores. However, to effectively
leverage these multi-core chips, one must address the question of how many
cores to assign to each job. Given that jobs receive sublinear speedups from
additional cores, there is an obvious tradeoff: allocating more cores to an
individual job reduces the job's runtime, but in turn decreases the efficiency
of the overall system. We ask how the system should schedule jobs across cores
so as to minimize the mean response time over a stream of incoming jobs.
To answer this question, we develop an analytical model of jobs running on a
multi-core machine. We prove that EQUI, a policy which continuously divides
cores evenly across jobs, is optimal when all jobs follow a single speedup
curve and have exponentially distributed sizes. EQUI requires jobs to change
their level of parallelization while they run. Since this is not possible for
all workloads, we consider a class of "fixed-width" policies, which choose a
single level of parallelization, k, to use for all jobs. We prove that,
surprisingly, it is possible to achieve EQUI's performance without requiring
jobs to change their levels of parallelization by using the optimal fixed level
of parallelization, k*. We also show how to analytically derive the optimal k*
as a function of the system load, the speedup curve, and the job size
distribution.
In the case where jobs may follow different speedup curves, finding a good
scheduling policy is even more challenging. We find that policies like EQUI
which performed well in the case of a single speedup function now perform
poorly. We propose a very simple policy, GREEDY*, which performs near-optimally
when compared to the numerically-derived optimal policy
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly-efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load
Quality of Service Driven Runtime Resource Allocation in Reconfigurable HPC Architectures
Heterogeneous System Architectures (HSA) are gaining importance in the High Performance Computing (HPC) domain due to increasing computational requirements coupled with energy consumption concerns, which conventional CPU architectures fail to effectively address. Systems based on Field Programmable Gate Array (FPGA) recently emerged as an effective alternative to Graphical Processing Units (GPUs) for demanding HPC applications, although they lack the abstractions available in conventional CPU-based systems. This work tackles the problem of runtime resource management of a system using FPGA-based co-processors to accelerate multi-programmed HPC workloads. We propose a novel resource manager able to dynamically vary the number of FPGAs allocated to each of the jobs running in a multi-accelerator system, with the goal of meeting a given Quality of Service metric for the running jobs measured in terms of deadline or throughput. We implement the proposed resource manager in a commercial HPC system, evaluating its behavior with representative workloads
A Dynamic Run-Profile Energy-Aware Approach for Scheduling Computationally Intensive Bioinformatics Applications
High Performance Computing (HPC) resources are housed in large datacenters, which consume exorbitant amounts of energy and are quickly demanding attention from businesses as they result in high operating costs. On the other hand HPC environments have been very useful to researchers in many emerging areas in life sciences such as Bioinformatics and Medical Informatics. In an earlier work, we introduced a dynamic model for energy aware scheduling (EAS) in a HPC environment; the model is domain agnostic and incorporates both the deadline parameter as well as energy parameters for computationally intensive applications. Our proposed EAS model incorporates 2-phases. In the Offline Phase, we use a run profile based approach to generate the initial schedule. In the Online Phase a feedback mechanism is incorporated between the EAS Engine and the master scheduling process. As scheduled tasks are completed, actual execution times are used to adjust the resources required for scheduling remaining tasks using the least number of nodes while meeting a given deadline. In this paper we study the impact of the quality of initial schedule using different run profiles which is the starting point for the EAS algorithm on the number of adjustments which is critical to the overall energy optimization as every adjustment made has an overhead. The conducted experiments show that the proposed approach succeeded in meeting preset deadlines while minimizing the number of nodes; thus reducing overall energy utilized and that choosing the right profile in the Offline phase has an impact on the energy optimization achieved by the EAS algorithm
Idle regulation in non-clairvoyant scheduling of parallel jobs
AbstractThe optimization of parallel applications is difficult to achieve by classical optimization techniques because of their diversity and the variety of actual parallel and distributed platforms and/or environments. Adaptive algorithmic schemes, capable of dynamically changing the allocation of jobs during the execution to optimize global system behavior, are the best alternatives for solving this problem. In this paper, we focus on non-clairvoyant scheduling of parallel jobs with known resource requirements but unknown running times, with emphasis on the regulation of idle periods in the context of general list policies. We consider a new family of scheduling strategies based on two phases which successively combine sequential and parallel execution of jobs. We generalize known worst-case performance bounds by considering two extra parameters, in addition to the number of processors and maximum processor requirements considered in the literature, namely, job parallelization penalty and idle regulation factor. Furthermore, we prove that under certain conditions of idle regulation, the performance guarantee of parallel job scheduling in space-sharing mode can be improved
Energy-Efficient Multiprocessor Scheduling for Flow Time and Makespan
We consider energy-efficient scheduling on multiprocessors, where the speed
of each processor can be individually scaled, and a processor consumes power
when running at speed , for . A scheduling algorithm
needs to decide at any time both processor allocations and processor speeds for
a set of parallel jobs with time-varying parallelism. The objective is to
minimize the sum of the total energy consumption and certain performance
metric, which in this paper includes total flow time and makespan. For both
objectives, we present instantaneous parallelism clairvoyant (IP-clairvoyant)
algorithms that are aware of the instantaneous parallelism of the jobs at any
time but not their future characteristics, such as remaining parallelism and
work. For total flow time plus energy, we present an -competitive
algorithm, which significantly improves upon the best known non-clairvoyant
algorithm and is the first constant competitive result on multiprocessor speed
scaling for parallel jobs. In the case of makespan plus energy, which is
considered for the first time in the literature, we present an
-competitive algorithm, where is the total number of
processors. We show that this algorithm is asymptotically optimal by providing
a matching lower bound. In addition, we also study non-clairvoyant scheduling
for total flow time plus energy, and present an algorithm that achieves -competitive for jobs with arbitrary release time and
-competitive for jobs with identical release time. Finally,
we prove an lower bound on the competitive ratio of
any non-clairvoyant algorithm, matching the upper bound of our algorithm for
jobs with identical release time
Distributed Training Large-Scale Deep Architectures
Scale of data and scale of computation infrastructures together enable the
current deep learning renaissance. However, training large-scale deep
architectures demands both algorithmic improvement and careful system
configuration. In this paper, we focus on employing the system approach to
speed up large-scale training. Via lessons learned from our routine
benchmarking effort, we first identify bottlenecks and overheads that hinter
data parallelism. We then devise guidelines that help practitioners to
configure an effective system and fine-tune parameters to achieve desired
speedup. Specifically, we develop a procedure for setting minibatch size and
choosing computation algorithms. We also derive lemmas for determining the
quantity of key components such as the number of GPUs and parameter servers.
Experiments and examples show that these guidelines help effectively speed up
large-scale deep learning training
- …