Efficient management of backtracking in and-parallelism
A backtracking algorithm for AND-Parallelism and its implementation at the Abstract Machine level are presented: first, a class of AND-Parallelism models based on goal independence is defined, and a generalized version of Restricted AND-Parallelism (RAP) is introduced as characteristic of this class. A simple and efficient backtracking algorithm for RAP is then discussed. An implementation scheme is presented for this algorithm which offers minimum overhead, while retaining the performance and storage economy of sequential implementations and taking advantage of goal independence to avoid unnecessary backtracking ("restricted intelligent backtracking"). Finally, the implementation of backtracking in sequential and AND-Parallel systems is explained through a number of examples.
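As a rough illustration of the goal-independence idea (not the paper's Abstract Machine scheme), the following C++ sketch runs the goals of a parallel conjunction concurrently; because the goals are assumed independent, the failure of any goal fails the whole conjunction without re-enumerating solutions of its siblings, which is the intuition behind restricted intelligent backtracking. The names and types are hypothetical.

#include <functional>
#include <future>
#include <vector>

// A "goal" is modelled as a callable returning true (success) or false (failure).
using Goal = std::function<bool()>;

// Evaluate a parallel conjunction of independent goals.
bool parallel_conjunction(const std::vector<Goal>& goals) {
    std::vector<std::future<bool>> results;
    for (const Goal& g : goals)                      // fork: goals share no unbound variables
        results.push_back(std::async(std::launch::async, g));

    bool all_succeeded = true;
    for (auto& r : results) {                        // join
        bool ok = r.get();
        all_succeeded = all_succeeded && ok;
    }
    // Independence means no alternative solution of one goal can cure another
    // goal's failure, so on failure we backtrack past the whole conjunction
    // instead of retrying siblings ("restricted intelligent backtracking").
    return all_succeeded;
}
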
Compiler-assisted workload consolidation to efficiently exploit dynamic parallelism for recursive applications
GPUs have been widely used to parallelize and accelerate applications because of their high throughput. Traditionally, a GPU function can only be launched from the CPU side. As a result, GPUs are best suited to applications that exhibit flat data parallelism: simple data parallelism that is known at compile time and can easily be distributed across GPU blocks and threads. However, for applications that contain nested data parallelism, which is not known a priori and can only be discovered at run time, it is difficult to write a GPU function that achieves high performance. One can easily end up with a GPU function that is either too coarse-grained or too fine-grained. With the Kepler architecture, Nvidia introduced a new feature, Dynamic Parallelism (DP), which enables GPU functions to be launched from inside other GPU functions. This makes nested parallelism easy to exploit on the GPU, since a new GPU function can be launched whenever local nested parallelism is encountered during execution. Moreover, DP makes it possible to implement recursion on the GPU without CPU intervention. Many computations exhibit a pattern of nested data parallelism, and parallel recursion is among them. However, preliminary data shows that simple DP-based implementations of recursion result in poor performance. This work focuses on how to efficiently exploit DP for parallel recursive applications on the GPU. Specifically, the goal is to free users from dealing with the complexity of GPU hardware and software, and to automatically generate high-performance GPU recursive functions implemented with DP from simple parallel CPU recursive functions. To this end, first, I propose several DP-based parallel recursive templates that can be generated from a serial CPU recursive function. I compare these parallel recursive templates with non-DP-based counterparts (flat kernels) to assess whether using DP in parallel recursive applications is beneficial. Second, to reduce the overhead of DP, I propose compiler techniques that improve the efficiency of simple DP-based parallel recursive functions by performing workload consolidation. My evaluation shows that GPU kernels consolidated with the proposed code transformations achieve an average speedup on the order of 1500x over basic implementations using DP and an average speedup of 3.9x over optimized flat GPU kernels for both tree-traversal and graph-based applications.
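As a concrete (and deliberately naive) illustration of Dynamic Parallelism, the CUDA sketch below lets each visited tree node launch a child kernel per child node from device code; this one-launch-per-node pattern is exactly the fine-grained style whose overhead workload consolidation targets. The Node layout and kernel are hypothetical, and the code assumes compilation with relocatable device code (-rdc=true) on an architecture that supports DP.

// Hypothetical device-side tree layout.
struct Node {
    int   value;
    int   num_children;
    Node* children;      // device pointer to this node's children
};

// Naive DP-based traversal: every node launches one tiny child grid per child.
__global__ void visit(Node* node, int* sum) {
    atomicAdd(sum, node->value);
    for (int i = 0; i < node->num_children; ++i) {
        // Device-side (nested) kernel launch -- the Dynamic Parallelism feature.
        visit<<<1, 1>>>(&node->children[i], sum);
    }
}

// Host side: a single launch on the root; recursion then proceeds on the GPU
// without CPU intervention (subject to the device's nesting-depth limit), e.g.:
//   visit<<<1, 1>>>(d_root, d_sum);
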
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
The nested parallel (a.k.a. fork-join) model is widely used for writing
parallel programs. However, the two composition constructs, i.e. "$\parallel$" (parallel) and ";" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign
several divide-and-conquer algorithms ranging from dense linear algebra to
dynamic-programming in the ND model and prove that they all have optimal span
while retaining optimal cache complexity. We propose the design of runtime
schedulers that map ND programs to multicore processors with multiple levels of
possibly shared caches (i.e., Parallel Memory Hierarchies) and provide
theoretical guarantees on their ability to preserve locality and load balance.
For this, we adapt space-bounded (SB) schedulers for the ND model. We show that
our algorithms have increased "parallelizability" in the ND model, and that SB
schedulers can use the extra parallelizability to achieve asymptotically
optimal bounds on cache misses and running time on a greater number of
processors than in the NP model. The running time for the algorithms in this paper is $O\left(\sum_{i=0}^{h-1} Q^{*}(t;\sigma\cdot M_i)\cdot C_i / p\right)$, where $Q^{*}$ is the cache complexity of task $t$, $C_i$ is the cost of a cache miss at the level-$i$ cache, which is of size $M_i$, $\sigma$ is a constant, and $p$ is the number of processors in an $h$-level cache hierarchy.
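To make the limitation of the two NP constructs concrete, here is a small, hedged C++ sketch (not from the paper) with hypothetical tasks: C depends only on A, while D depends on both A and B. With only parallel and serial composition, the join after {A, B} is a full barrier, so C needlessly waits for B; expressing the partial dependency "C needs only A" is what the new dataflow construct is meant to enable.

#include <future>
#include <iostream>

// Hypothetical tasks: C depends only on A, while D depends on A and B.
void A() { std::cout << "A\n"; }
void B() { std::cout << "B\n"; }
void C() { std::cout << "C\n"; }
void D() { std::cout << "D\n"; }

int main() {
    // Fork-join (NP) composition: run {A, B} in parallel, join, then {C, D}.
    auto a = std::async(std::launch::async, A);
    auto b = std::async(std::launch::async, B);
    a.get(); b.get();   // serial composition: full synchronization point
    auto c = std::async(std::launch::async, C);
    auto d = std::async(std::launch::async, D);
    c.get(); d.get();
}
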
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increased data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
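As one small, hedged example of the kind of transformation meant here (not taken from the paper), the HLS C++ kernel below combines loop pipelining with an on-chip buffer so that the input vector is read from external memory once and then reused for every row; the kernel itself is hypothetical, and the pragma spellings follow the Xilinx/Vitis HLS convention.

constexpr int N = 1024;

// Hypothetical HLS kernel: y = A * x, with the vector x buffered on chip.
void matvec(const float A[N][N], const float x[N], float y[N]) {
    float x_local[N];   // on-chip buffer (e.g., BRAM) enabling data reuse
copy_x:
    for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE II=1
        x_local[j] = x[j];          // x is read from external memory only once
    }
rows:
    for (int i = 0; i < N; ++i) {
        float acc = 0.0f;
cols:
        for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE
            // The floating-point accumulation carries a loop dependence, so the
            // achieved initiation interval depends on the target device and tool.
            acc += A[i][j] * x_local[j];
        }
        y[i] = acc;
    }
}
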
Omphale: Streamlining the Communication for Jobs in a Multi Processor System on Chip
Our Multi Processor System on Chip (MPSoC) template provides processing tiles that are connected via a network on chip. A processing tile contains a processing unit and a Scratch Pad Memory (SPM). This paper presents the Omphale tool that performs the first step in mapping a job, represented by a task graph, to such an MPSoC, given the SPM sizes as constraints. Furthermore, a memory tile is introduced. The result of Omphale is a Cyclo Static DataFlow (CSDF) model and a task graph where tasks communicate via sliding windows that are located in circular buffers. The CSDF model is used to determine the size of the buffers and the communication pattern of the data. A buffer must fit in the SPM of the processing unit that is reading from it, such that low-latency access is realized with a minimal number of stall cycles. If a task and its buffer exceed the size of the SPM, the task is examined for additional parallelism or the circular buffer is partly located in a memory tile. This results in an extended task graph that satisfies the SPM size constraints.
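The sliding-window-in-a-circular-buffer communication can be pictured with a small, hedged C++ sketch (the class and its sizes are hypothetical, and the real Omphale buffers live in the consumer tile's scratch-pad memory): the producer writes tokens into a circular buffer, and the consumer reads an overlapping window of consecutive tokens, releasing only the oldest ones so the rest can be re-read by the next window.

#include <array>
#include <cstddef>

// Hypothetical circular buffer supporting overlapping ("sliding") windows.
// In the Omphale setting it would be placed in the consumer tile's SPM.
template <typename T, std::size_t Capacity>
class WindowBuffer {
public:
    bool write(const T& token) {                 // producer side
        if (count_ == Capacity) return false;    // buffer full -> producer stalls
        data_[(head_ + count_) % Capacity] = token;
        ++count_;
        return true;
    }
    // Consumer side: read a window of `size` consecutive tokens starting at
    // the oldest unreleased token; returns false until enough tokens are present.
    bool read_window(T* out, std::size_t size) const {
        if (count_ < size) return false;         // consumer stalls
        for (std::size_t i = 0; i < size; ++i)
            out[i] = data_[(head_ + i) % Capacity];
        return true;
    }
    // Release the `step` oldest tokens; the remaining ones overlap with the
    // next window, which is what makes the window "slide".
    void release(std::size_t step) {
        head_ = (head_ + step) % Capacity;
        count_ -= step;
    }
private:
    std::array<T, Capacity> data_{};
    std::size_t head_ = 0, count_ = 0;
};
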
