Configurable Strategies for Work-stealing
Work-stealing systems are typically oblivious to the nature of the tasks they
are scheduling. For instance, they do not know or take into account how long a
task will take to execute or how many subtasks it will spawn. Moreover, the
actual task execution order is typically determined by the underlying task
storage data structure, and cannot be changed. There are thus possibilities for
optimizing task parallel executions by providing information on specific tasks
and their preferred execution order to the scheduling system.
We introduce scheduling strategies to enable applications to dynamically
provide hints to the task-scheduling system on the nature of specific tasks.
Scheduling strategies can be used to independently control both local task
execution order and steal order. In contrast to conventional scheduling
policies that are normally global in scope, strategies allow the scheduler to
apply optimizations on individual tasks. This flexibility greatly improves
composability as it allows the scheduler to apply different, specific
scheduling choices for different parts of applications simultaneously. We
present a number of benchmarks that highlight diverse, beneficial effects that
can be achieved with scheduling strategies. Some benchmarks (branch-and-bound,
single-source shortest path) show that prioritization of tasks can reduce the
total amount of work compared to standard work-stealing execution order. For
other benchmarks (triangle strip generation) qualitatively better results can
be achieved in shorter time. Other optimizations, such as dynamic merging of
tasks or stealing of half the work, instead of half the tasks, are also shown
to improve performance. Composability is demonstrated by examples that combine
different strategies, both within the same kernel (prefix sum) and when
scheduling multiple kernels (prefix sum and unbalanced tree search).
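The difference between stealing half the tasks and stealing half the work can be illustrated with a minimal sketch. The abstract does not show the authors' implementation, so everything here is an assumption made for illustration: the `Task` type, its `weight` field (a hypothetical per-task work estimate supplied by the application), and both function names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    weight: int  # hypothetical estimate of the work contained in this task


def steal_half_tasks(victim: deque) -> list:
    """Count-based policy: move half of the queued tasks, ignoring their size."""
    count = len(victim) // 2
    return [victim.popleft() for _ in range(count)]


def steal_half_work(victim: deque) -> list:
    """Weight-based policy: move tasks until about half the estimated work has moved."""
    target = sum(t.weight for t in victim) / 2
    stolen, moved = [], 0
    # Always leave at least one task with the victim, and never exceed the target.
    while len(victim) > 1 and moved + victim[0].weight <= target:
        task = victim.popleft()
        stolen.append(task)
        moved += task.weight
    return stolen


def make_queue(weights):
    """Build a victim queue; the left end is the end thieves steal from."""
    return deque(Task(f"t{i}", w) for i, w in enumerate(weights))


if __name__ == "__main__":
    weights = [4, 2, 2, 1, 1, 1, 1]  # 12 units of estimated work in total
    by_count = steal_half_tasks(make_queue(weights))
    by_work = steal_half_work(make_queue(weights))
    print("half the tasks moves", sum(t.weight for t in by_count), "units")
    print("half the work moves", sum(t.weight for t in by_work), "units")
```

With these weights the count-based policy moves 8 of the 12 units, while the weight-based policy moves exactly 6, so the load ends up evenly split between thief and victim.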
Optimized Compilation of Aggregated Instructions for Realistic Quantum Computers
Recent developments in engineering and algorithms have made real-world
applications in quantum computing possible in the near future. Existing quantum
programming languages and compilers use a quantum assembly language composed of
1- and 2-qubit (quantum bit) gates. Quantum compiler frameworks translate this
quantum assembly to electric signals (called control pulses) that implement the
specified computation on specific physical devices. However, there is a
mismatch between the operations defined by the 1- and 2-qubit logical ISA and
their underlying physical implementation, so the current practice of directly
translating logical instructions into control pulses results in inefficient,
high-latency programs. To address this inefficiency, we propose a universal
quantum compilation methodology that aggregates multiple logical operations
into larger units that manipulate up to 10 qubits at a time. Our methodology
then optimizes these aggregates by (1) finding commutative intermediate
operations that result in more efficient schedules and (2) creating custom
control pulses optimized for the aggregate (instead of individual 1- and
2-qubit operations). Compared to the standard gate-based compilation, the
proposed approach realizes a deeper vertical integration of high-level quantum
software and low-level, physical quantum hardware. We evaluate our approach on
important near-term quantum applications on simulations of superconducting
quantum architectures. Our proposed approach provides a mean speedup of
, with a maximum of . Because latency directly affects the
feasibility of quantum computation, our results not only improve performance
but also have the potential to enable quantum computation sooner than otherwise
possible.
Comment: 13 pages, to appear in ASPLOS
Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
This paper introduces Tiramisu, a polyhedral framework designed to generate
high performance code for multiple platforms including multicores, GPUs, and
distributed machines. Tiramisu introduces a scheduling language with novel
extensions to explicitly manage the complexities that arise when targeting
these systems. The framework is designed for the areas of image processing,
stencils, linear algebra and deep learning. Tiramisu has two main features: it
relies on a flexible representation based on the polyhedral model and it has a
rich scheduling language allowing fine-grained control of optimizations.
Tiramisu uses a four-level intermediate representation that allows full
separation between the algorithms, loop transformations, data layouts, and
communication. This separation simplifies targeting multiple hardware
architectures with the same algorithm. We evaluate Tiramisu by writing a set of
image processing, deep learning, and linear algebra benchmarks and compare them
with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu
matches or outperforms existing compilers and libraries on different hardware
architectures, including multicore CPUs, GPUs, and distributed machines.
Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041
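The separation between algorithm and loop transformations described in the Tiramisu abstract can be sketched in a few lines. This is not Tiramisu's API; it is a plain Python illustration in which all names (`algorithm`, `row_major`, `tiled`, `run`) are hypothetical. The point it shows is that the per-element definition stays fixed while the iteration order is swapped independently.

```python
from typing import Callable, Iterator, List, Tuple

Index = Tuple[int, int]
Schedule = Callable[[int, int], Iterator[Index]]


def algorithm(src: List[List[int]], i: int, j: int) -> int:
    """What to compute: a pure per-element definition (here, doubling)."""
    return 2 * src[i][j]


def row_major(n: int, m: int) -> Iterator[Index]:
    """One schedule: visit the iteration space row by row."""
    for i in range(n):
        for j in range(m):
            yield i, j


def tiled(tile: int) -> Schedule:
    """Another schedule: visit the same space in tile x tile blocks."""
    def order(n: int, m: int) -> Iterator[Index]:
        for ii in range(0, n, tile):
            for jj in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, m)):
                        yield i, j
    return order


def run(src: List[List[int]], schedule: Schedule) -> List[List[int]]:
    """Apply the fixed algorithm in whatever order the schedule dictates."""
    n, m = len(src), len(src[0])
    out = [[0] * m for _ in range(n)]
    for i, j in schedule(n, m):
        out[i][j] = algorithm(src, i, j)
    return out


if __name__ == "__main__":
    src = [[r * 5 + c for c in range(5)] for r in range(4)]
    # Both schedules compute the same result; only the visiting order differs.
    print(run(src, row_major) == run(src, tiled(2)))
```

Because the element computation has no dependence between iterations, any schedule that covers each index exactly once yields the same output. A polyhedral framework additionally has to verify that a transformation preserves dependences, which this sketch does not attempt.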