Local search performance guarantees for restricted related parallel machine scheduling
We consider the problem of minimizing the makespan on restricted related parallel machines. In restricted machine scheduling, each job may only be scheduled on a given subset of the machines. We study the worst-case behavior of local search algorithms. In particular, we analyze the quality of local optima with respect to the jump, swap, push and lexicographical jump neighborhoods.
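To make the jump neighborhood concrete, here is a minimal Python sketch of a jump-based local search. It is an illustration under stated assumptions rather than the authors' procedure: the function names, the input layout (`proc`, `speed`, `allowed`) and the rule "move a job off a critical machine only if its new machine ends up strictly below the current makespan" are all choices made for this example. Exact fractions are used so that load comparisons do not suffer from rounding.

```python
from fractions import Fraction


def loads(assign, proc, speed):
    """Load of each machine: total assigned work divided by the machine speed."""
    load = [Fraction(0)] * len(speed)
    for job, machine in enumerate(assign):
        load[machine] += Fraction(proc[job], speed[machine])
    return load


def jump_local_search(assign, proc, speed, allowed):
    """Apply improving jumps until none exists (hypothetical helper).

    assign[j]  : machine currently running job j
    proc[j]    : processing requirement of job j (integer)
    speed[i]   : speed of machine i (integer)
    allowed[j] : machines on which job j may run (the restriction)
    """
    assign = list(assign)
    while True:
        load = loads(assign, proc, speed)
        makespan = max(load)
        improved = False
        for job, src in enumerate(assign):
            if load[src] != makespan:
                continue  # only jobs on a critical machine are candidates
            for dst in allowed[job]:
                new_load = load[dst] + Fraction(proc[job], speed[dst])
                if dst != src and new_load < makespan:
                    assign[job] = dst  # perform the jump
                    improved = True
                    break
            if improved:
                break
        if not improved:
            return assign, makespan  # jump-optimal under the rule above


# Three jobs, two machines (machine 1 is twice as fast); job 2 is restricted to machine 0.
final_assign, final_makespan = jump_local_search(
    assign=[0, 0, 0], proc=[4, 3, 2], speed=[1, 2], allowed=[[0, 1], [0, 1], [0]]
)
print(final_assign, final_makespan)  # [1, 1, 0] 7/2
```

Each jump either lowers the makespan or reduces the number of machines that attain it, so the loop terminates. The resulting schedule is only a local optimum, which is exactly the kind of solution whose quality the abstract analyzes.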
Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling
We propose a model for scheduling jobs in a parallel machine setting that takes into account the cost of migrations by assuming that the processing time of a job may depend on the specific set of machines among which the job is migrated. For the makespan minimization objective, the model generalizes classical scheduling problems such as unrelated parallel machine scheduling, as well as novel ones such as semi-partitioned and clustered scheduling. In the case of a hierarchical family of machines, we derive a compact integer linear programming formulation of the problem and leverage its fractional relaxation to obtain a polynomial-time 2-approximation algorithm. Extensions that incorporate memory capacity constraints are also discussed.
Non-clairvoyant Scheduling Games
In a scheduling game, each player owns a job and chooses a machine to execute
it. While the social cost is the maximal load over all machines (makespan), the
cost (disutility) of each player is the completion time of its own job. In the
game, players may follow selfish strategies to optimize their cost and
therefore their behaviors do not necessarily lead the game to an equilibrium.
Even when an equilibrium exists, its makespan might be much larger
than the social optimum, and this inefficiency is measured by the price of
anarchy -- the worst ratio between the makespan of an equilibrium and the
optimum. Coordination mechanisms aim to reduce the price of anarchy by
designing scheduling policies that specify how jobs assigned to the same machine
are to be scheduled. Typically these policies define the schedule according to
the processing times as announced by the jobs. One could wonder if there are
policies that do not require this knowledge, and still provide a good price of
anarchy. This would let the processing times remain private information and avoid
the problem of truthfulness. In this paper we study these so-called
non-clairvoyant policies. In particular, we study the RANDOM policy that
schedules the jobs in a random order without preemption, and the EQUI policy
that schedules the jobs in parallel using time-multiplexing, assigning each job
an equal fraction of CPU time.
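As a small illustration of the two policies on a single machine, the Python sketch below computes each job's completion time under EQUI and its expected completion time under RANDOM. The function names are hypothetical, and the formulas are the standard ones for equal processor sharing and for a uniformly random non-preemptive order. They are not taken from the paper itself.

```python
def equi_completion_times(proc):
    """Completion times when all unfinished jobs share the machine equally.

    While job i is running, every other job j also receives service, but
    never more than its own length, so job i finishes at sum_j min(p_i, p_j).
    """
    return [sum(min(p, q) for q in proc) for p in proc]


def random_expected_completion_times(proc):
    """Expected completion times under a uniformly random order, no preemption.

    Each other job precedes job i with probability 1/2, so the expectation
    is p_i plus half of the remaining total processing time.
    """
    total = sum(proc)
    return [p + (total - p) / 2 for p in proc]


jobs = [1, 2, 3]  # processing times of the jobs assigned to one machine
print(equi_completion_times(jobs))             # [3, 5, 6]
print(random_expected_completion_times(jobs))  # [3.5, 4.0, 4.5]
```

Neither policy needs the processing times in advance to run the schedule; the functions use them only to evaluate the resulting costs, which are the quantities a player would compare when choosing a machine.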
Data Structures for Task-based Priority Scheduling
Many task-parallel applications can benefit from attempting to execute tasks
in a specific order, as for instance indicated by priorities associated with
the tasks. We present three lock-free data structures for priority scheduling
with different trade-offs on scalability and ordering guarantees. First we
propose a basic extension to work-stealing that provides good scalability, but
cannot provide any guarantees for task-ordering between threads. Next, we
present a centralized priority data structure based on k-fifo queues, which
provides strong (but still relaxed with regard to a sequential specification)
guarantees. The parameter k allows the trade-off between scalability and the
required ordering guarantee to be configured dynamically. Third, and finally, we
combine both data structures into a hybrid k-priority data structure, which
provides scalability similar to the work-stealing based approach for larger
k, while giving strong ordering guarantees for smaller k. We argue for
using the hybrid data structure as the best compromise for generic,
priority-based task-scheduling.
We analyze the behavior and trade-offs of our data structures in the context
of a simple parallelization of Dijkstra's single-source shortest path
algorithm. Our theoretical analysis and simulations show that both the
centralized and the hybrid k-priority based data structures can give strong
guarantees on the useful work performed by the parallel Dijkstra algorithm. We
support our results with experimental evidence on an 80-core Intel Xeon system.
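The effect of relaxed ordering on Dijkstra's algorithm can be explored with a toy sequential model. The Python sketch below is not the lock-free data structure from the paper. It assumes a simplified relaxation in which each pop may return any of the k smallest queued entries, and it counts node expansions as a rough proxy for work. The function name, the `seed` parameter and the example graph are made up for illustration.

```python
import heapq
import random


def relaxed_dijkstra(graph, source, k, seed=0):
    """Shortest paths with a pop that may return any of the k smallest entries.

    graph maps a node to a list of (neighbor, non-negative weight) pairs.
    Returns (distances, number of node expansions).
    """
    rng = random.Random(seed)
    dist = {source: 0}
    heap = [(0, source)]
    expansions = 0
    while heap:
        # Relaxed pop: draw up to k smallest entries, keep one, return the rest.
        candidates = [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
        d, u = candidates.pop(rng.randrange(len(candidates)))
        for item in candidates:
            heapq.heappush(heap, item)
        if d > dist[u]:
            continue  # stale entry: a shorter distance to u is already known
        expansions += 1
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd  # improved label, so v must be (re-)expanded
                heapq.heappush(heap, (nd, v))
    return dist, expansions


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
print(relaxed_dijkstra(graph, "a", k=1))  # exact ordering: one expansion per node
print(relaxed_dijkstra(graph, "a", k=3))  # relaxed ordering: same distances
```

With k = 1 this reduces to ordinary Dijkstra. For larger k the distances stay correct because improved labels are re-inserted, but nodes may be expanded more than once, which is the wasted work that stronger ordering guarantees are meant to limit.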
Towards a High-Level Implementation of Execution Primitives for Unrestricted, Independent And-Parallelism
Most efficient implementations of parallel logic programming rely on complex low-level machinery which is arguably difficult to implement and modify. We explore an alternative approach aimed at taming that complexity by raising core parts of the implementation to the source language level for the particular case of and-parallelism. We handle a significant portion of the parallel implementation at the Prolog level with the help of a comparatively small number of concurrency-related primitives which take care of lower-level tasks such as locking, thread management, stack set management, etc. The approach does not eliminate altogether modifications to the abstract machine, but it does greatly simplify them and it also facilitates experimenting with different alternatives. We show how this approach allows implementing both restricted and unrestricted (i.e., non fork-join) parallelism. Preliminary experiments show that the performance sacrificed is reasonable, although granularity control is required in some cases. Also, we observe that the availability of unrestricted parallelism contributes to better observed speedups.
Towards high-level execution primitives for and-parallelism: preliminary results
Most implementations of parallel logic programming rely on complex low-level machinery which is arguably difficult to implement and modify. We explore an alternative approach aimed at taming that complexity by raising core parts of the implementation to the source language level for the particular case of and-parallelism. Therefore, we handle a significant portion of the parallel implementation mechanism at the Prolog level with the help of a comparatively small number of concurrency-related primitives which take care of lower-level tasks such as locking, thread management, stack set management, etc. The approach does not eliminate altogether modifications to the abstract machine, but it does greatly simplify them and it also facilitates experimenting with different alternatives. We show how this approach allows implementing both restricted and unrestricted (i.e., non fork-join) parallelism. Preliminary experiments show that the amount of performance sacrificed is reasonable, although granularity control is required in some cases. Also, we observe that the availability of unrestricted parallelism contributes to better observed speedups.
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs, whose programming is conventionally an RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve FPGA
programmability, they still leave programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels.