Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
This paper introduces Tiramisu, a polyhedral framework designed to generate
high performance code for multiple platforms including multicores, GPUs, and
distributed machines. Tiramisu introduces a scheduling language with novel
extensions to explicitly manage the complexities that arise when targeting
these systems. The framework is designed for the areas of image processing,
stencils, linear algebra and deep learning. Tiramisu has two main features: it
relies on a flexible representation based on the polyhedral model and it has a
rich scheduling language allowing fine-grained control of optimizations.
Tiramisu uses a four-level intermediate representation that allows full
separation between the algorithms, loop transformations, data layouts, and
communication. This separation simplifies targeting multiple hardware
architectures with the same algorithm. We evaluate Tiramisu by writing a set of
image processing, deep learning, and linear algebra benchmarks and compare them
with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu
matches or outperforms existing compilers and libraries on different hardware
architectures, including multicore CPUs, GPUs, and distributed machines.
Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041
An Exact Method for Analysis of Value-based Array Data Dependences
Standard array data dependence testing algorithms give information
about the aliasing of array references. If statement 1 writes a[5],
and statement 2 later reads a[5], standard techniques
describe this as a flow dependence, even if there is an intervening write.
We call a dependence between two references to the same memory
location a memory-based dependence. In contrast, if there are no
intervening writes, the references touch the same value and we call the
dependence a value-based dependence.
There has been a surge of recent work on value-based array data dependence
analysis (also referred to as computation of array data-flow dependence
information). In this paper, we describe a technique that is exact
over programs without control flow (other than loops) and non-linear
references. We compare our proposal with the technique proposed
by Paul Feautrier, which is the other technique that is complete over the same
domain as ours. We also compare our work with that of Tu and Padua, a
representative approximate scheme for array privatization.
(Also cross-referenced as UMIACS-TR-93-137)
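The distinction this abstract draws between memory-based and value-based dependences can be illustrated with a small sketch. It is not the paper's analysis, which works symbolically on loop nests; it only replays a concrete access trace. The function name `flow_dependences`, the trace format, and the statement label `S_mid` for the intervening write are assumptions made for illustration.

```python
def flow_dependences(trace):
    """Classify flow (write -> read) dependences in a concrete access trace.

    trace: list of (statement, op, location) tuples in execution order,
           where op is "W" for a write and "R" for a read.
    Returns (memory_based, value_based) as sets of (writer, reader) pairs.
    """
    memory_based = set()
    value_based = set()
    writers = {}      # location -> every statement that has written it so far
    last_writer = {}  # location -> the most recent statement that wrote it
    for stmt, op, loc in trace:
        if op == "W":
            writers.setdefault(loc, []).append(stmt)
            last_writer[loc] = stmt
        else:
            # Memory-based: any earlier write to the same location counts.
            for w in writers.get(loc, []):
                memory_based.add((w, stmt))
            # Value-based: only the write whose value is actually read counts.
            if loc in last_writer:
                value_based.add((last_writer[loc], stmt))
    return memory_based, value_based


# Statement 1 writes a[5], a hypothetical statement S_mid overwrites it,
# and statement 2 then reads a[5].
trace = [("S1", "W", "a[5]"), ("S_mid", "W", "a[5]"), ("S2", "R", "a[5]")]
memory_based, value_based = flow_dependences(trace)
```

On this trace the memory-based set contains both `("S1", "S2")` and `("S_mid", "S2")`, while the value-based set contains only `("S_mid", "S2")`, because the value written by statement 1 is killed before statement 2 reads the location.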
System Support for Implicitly Parallel Programming
Coordinated Science Laboratory was formerly known as Control Systems Laboratory
Static Analysis of Upper and Lower Bounds on Dependences and Parallelism
Existing compilers often fail to parallelize sequential code, even
when a program can be manually transformed into parallel form
by a sequence of well-understood transformations
(as is the case for many of the Perfect Club Benchmark
programs).
These failures can occur for several reasons: the code transformations
implemented in the compiler may not be sufficient to produce parallel
code, the compiler may not find the proper sequence of
transformations, or the compiler may not be able to prove that one
of the necessary transformations is legal.
When a compiler fails to extract sufficient parallelism from a program,
the programmer may try to extract additional parallelism.
Unfortunately, the programmer is typically left to search for
parallelism without significant assistance.
The compiler generally does not give feedback about which parts of the
program might contain additional parallelism, or about the types of
transformations that might be needed to realize this parallelism.
Standard program transformations and dependence abstractions cannot be
used to provide this feedback.
In this paper, we propose a two-step approach to the search for
parallelism in sequential programs:
We first construct several sets of constraints that describe, for each
statement, which iterations of that statement can be executed
concurrently.
By constructing constraints that correspond to different assumptions
about which dependences might be eliminated through additional
analysis, transformations and user assertions, we can determine
whether we can expose parallelism by eliminating dependences.
In the second step of our search for parallelism, we examine these
constraint sets to identify the kinds of transformations that are
needed to exploit scalable parallelism.
Our tests will identify conditional parallelism and parallelism that
can be exposed by combinations of transformations that reorder the
iteration space (such as loop interchange and loop peeling).
This approach lets us distinguish inherently sequential code from code
that contains unexploited parallelism.
It also produces information about the kinds of transformations that
will be needed to parallelize the code, without worrying about the
order of application of the transformations.
Furthermore, when our dependence test is inexact,
we can identify which unresolved dependences inhibit parallelism
by comparing the effects of assuming dependence or independence.
We are currently exploring the use of this information in
programmer-assisted parallelization.
(Also cross-referenced as UMIACS-TR-94-40)
Speculative parallelization of partially parallel loops
Current parallelizing compilers cannot identify a significant fraction of parallelizable
loops because they have complex or statically insufficiently defined access patterns.
In our previous work, we have speculatively executed a loop as a doall, and applied a
fully parallel data dependence test to determine if it had any cross-processor
dependences. If the test failed, then the loop was re-executed serially. While this method
exploits doall parallelism well, it can cause slowdowns for loops with even one
cross-processor flow dependence because we have to re-execute sequentially. Moreover, the
existing, partial parallelism of loops is not exploited.
We demonstrate a generalization of the speculative doall parallelization technique,
called the Recursive LRPD test, that can extract and exploit the maximum
available parallelism of any loop and that limits potential slowdowns to the
overhead of the run-time dependence test itself. In this thesis, we have presented the
base algorithm and an analysis of the different heuristics for its practical
application. To reduce the run-time overhead of the Recursive LRPD test, we have
implemented on-demand checkpointing and commit, more efficient data dependence
analysis and shadow structures, and feedback-guided load balancing. We obtained
scalable speedups for loops from Track, Spice, and FMA3D that were not parallelizable
by previous speculative parallelization methods.
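The idea of committing the correctly executed part of a speculative doall and re-applying the test to the remainder can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's implementation: processors are simulated one after another, shadow structures are modelled as Python sets, and the names `recursive_speculative_doall`, `body`, `read`, and `write` are hypothetical.

```python
def recursive_speculative_doall(n_iters, body, memory, n_procs=4):
    """Run body(i, read, write) for i in range(n_iters) speculatively.

    Each round splits the remaining iterations into blocks, one per
    simulated processor. A block is committed only if none of its exposed
    reads touches a location written by an earlier block of the same round.
    The first failing block and everything after it are re-executed in the
    next round. Returns the number of rounds needed.
    """
    start, rounds = 0, 0
    while start < n_iters:
        rounds += 1
        block = -(-(n_iters - start) // n_procs)  # ceiling division
        speculated = []
        for lo in range(start, n_iters, block):
            blk = range(lo, min(lo + block, n_iters))
            private = {}           # buffered writes of this block
            exposed_reads = set()  # reads not preceded by a write in this block

            def read(addr, private=private, exposed_reads=exposed_reads):
                if addr in private:
                    return private[addr]
                exposed_reads.add(addr)
                return memory[addr]  # sees only state committed before this round

            def write(addr, value, private=private):
                private[addr] = value

            for i in blk:
                body(i, read, write)
            speculated.append((blk, private, exposed_reads))

        # Dependence test and in-order commit of the correct prefix.
        written_so_far = set()
        for blk, private, exposed_reads in speculated:
            if exposed_reads & written_so_far:
                break  # cross-processor flow dependence: restart from this block
            for addr, value in private.items():
                memory[addr] = value
            written_so_far |= private.keys()
            start = blk.stop
    return rounds


def example_body(i, read, write):
    # Iteration 5 reads a value produced by iteration 1; the rest are independent.
    if i == 5:
        write(5, read(1) + 100)
    else:
        write(i, read(i) * 2)


data = list(range(8))
rounds_needed = recursive_speculative_doall(8, example_body, data)
```

Because the first block of every round has no earlier block to conflict with, each round commits at least one block, so the sketch always terminates and its cost beyond a sequential run is bounded by the speculation and testing work.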
Run-time optimization of adaptive irregular applications
Compared to traditional compile-time optimization, run-time optimization could offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data. In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identified a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors.
Specifically, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems.
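A rough sketch of run-time selection between functionally equivalent reduction algorithms is given below. It assumes a single hand-picked characteristic (the fraction of the reduction array that is touched) and a made-up threshold in place of the dissertation's off-line generated prediction models. The function names, the second algorithm, and the value 0.1 are all hypothetical, and threads are simulated by dealing the work out round-robin.

```python
def replicated_buffer_reduce(indices, values, size, n_threads=4):
    """Each simulated thread accumulates into its own full-size private copy."""
    buffers = [[0] * size for _ in range(n_threads)]
    for k, (i, v) in enumerate(zip(indices, values)):
        buffers[k % n_threads][i] += v
    # The merge visits every element of every copy, touched or not.
    return [sum(column) for column in zip(*buffers)]


def sparse_private_reduce(indices, values, size, n_threads=4):
    """Each simulated thread keeps only the elements it touches (a dict)."""
    tables = [{} for _ in range(n_threads)]
    for k, (i, v) in enumerate(zip(indices, values)):
        table = tables[k % n_threads]
        table[i] = table.get(i, 0) + v
    result = [0] * size
    for table in tables:
        for i, v in table.items():
            result[i] += v
    return result


def select_reduction(indices, size, sparsity_threshold=0.1):
    """Pick an algorithm from one input characteristic measured at run time.

    The threshold is an illustrative stand-in for a prediction model.
    """
    touched_fraction = len(set(indices)) / size
    if touched_fraction < sparsity_threshold:
        return sparse_private_reduce
    return replicated_buffer_reduce


indices = [3, 3, 900, 3]
values = [1, 2, 3, 4]
chosen = select_reduction(indices, size=1000)
result = chosen(indices, values, 1000)
```

Both functions compute the same sums, so the selector only affects cost: with few distinct indices in a large array, the replicated copies spend most of their merge time on untouched elements, which is the kind of input-dependent trade-off the abstract describes.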
On Extracting Coarse-Grained Function Parallelism from C Programs
To efficiently utilize the emerging heterogeneous multi-core architecture, it is essential to exploit the inherent coarse-grained parallelism in applications. In addition to data parallelism, applications like telecommunication, multimedia, and gaming can also benefit from the exploitation of coarse-grained function parallelism. To exploit coarse-grained function parallelism, the common wisdom is to rely on programmers to explicitly express the coarse-grained data-flow between coarse-grained functions using data-flow or streaming languages.
This research sets out to explore another approach to exploiting coarse-grained function parallelism, namely relying on the compiler to extract coarse-grained data-flow from imperative programs. We believe imperative languages and the von Neumann programming model will remain the dominant programming languages and programming model in the future.
This dissertation discusses the design and implementation of a memory data-flow analysis system which extracts coarse-grained data-flow from C programs. The memory data-flow analysis system partitions a C program into a hierarchy of program regions. It then traverses the program region hierarchy from the bottom up, summarizing the exposed memory access patterns for each program region while deriving conservative producer-consumer relations between program regions. An ensuing top-down traversal of the program region hierarchy refines the producer-consumer relations by pruning spurious relations.
We built an in-lining based prototype of the memory data-flow analysis system on top of the IMPACT compiler infrastructure. We applied the prototype to analyze the memory data-flow of several MediaBench programs. The experimental results showed that while the prototype performed reasonably well for the tested programs, the in-lining based implementation may not be efficient for larger programs. Also, there is still room for improving the effectiveness of the memory data-flow analysis system. We did root cause analysis for the inaccuracy in the memory data-flow analysis results, which provided us insights on how to improve the memory data-flow analysis system in the future.
Automatic Parallelization of Tiled Stencil Loop Nests on GPUs
This thesis attempts to design and implement a compiler framework based on the polyhedral model. The compiler automatically parallelizes loop nests, especially stencil kernels, into efficient GPU code by loop tiling transformations which the polyhedral model describes. To enhance parallel performance, we introduce three practically efficient techniques to process different types of loop nests. The experimental results of our compiler framework have demonstrated that these advanced techniques can outperform previous approaches.
Firstly, we aim to find efficient tiling transformations without violating data dependences. How to select a tile's shape and size is an open issue that is performance-critical and influenced by the GPU's hardware constraints.
We propose an approach to determine the tile shapes out of consideration for improving two-level parallelism of GPUs. The new approach finds appropriate tiling hyperplanes by embedding parallelism-enhancing constraints into the polyhedral model to maximize intra-tile, i.e., intra-SM parallelism. This improves the load balance among the streaming processors (SPs), which execute a wavefront of loop iterations within a tile. We eliminate parallelism-hindering false dependences to optimize inter-tile, i.e., inter-SM parallelism. This improves the load balance among the streaming multiprocessors (SMs), which execute a wavefront of tiles.
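The notion of a wavefront of loop iterations mentioned above can be illustrated with a minimal sketch. It assumes a simple two-dimensional loop nest in which each point depends on its upper and left neighbours, which is not a kernel taken from the thesis, and it runs on the CPU rather than generating GPU code. The function names are hypothetical.

```python
def sweep_sequential(grid):
    """Reference loop nest: each point depends on its upper and left neighbour."""
    n, m = len(grid), len(grid[0])
    for i in range(1, n):
        for j in range(1, m):
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid


def wavefronts(n, m):
    """Yield lists of interior points (i, j) that share the same i + j.

    Points on one anti-diagonal do not depend on each other, so a whole
    list could be executed in parallel.
    """
    for d in range(2, n + m - 1):
        first = max(1, d - m + 1)
        last = min(n - 1, d - 1)
        yield [(i, d - i) for i in range(first, last + 1)]


def sweep_wavefront(grid):
    """Same computation, reordered so that each wavefront runs as a unit."""
    n, m = len(grid), len(grid[0])
    for front in wavefronts(n, m):
        for i, j in front:  # iterations in this inner loop are independent
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid


reference = sweep_sequential([[1] * 5 for _ in range(4)])
reordered = sweep_wavefront([[1] * 5 for _ in range(4)])
```

Both sweeps produce the same grid because every point's two inputs lie on the previous anti-diagonal, which has already been completed. The same reasoning applies one level up when whole tiles, rather than single iterations, form the wavefront.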
Furthermore, to avoid a combinatorial explosion of tile-size configurations, we present a model-driven approach to automating tile size selection, which is performance-critical for loop tiling transformations, especially for DOACROSS loop nests. Our tile size selection model accurately estimates the execution times of tiled loop nests running on GPUs. The selected tile sizes lead to performance results that are close to the best observed for a range of problem sizes tested.
Finally, to address the difficulty and low performance of parallelizing widely used SOR stencil loop nests, we present a new tiled parallel SOR method, called MLSOR, which admits more efficient data-parallel SIMD execution on GPUs. Unlike the previous two approaches, which are dependence-preserving, the basic idea is to algorithmically restructure a stencil kernel based on a non-dependence-preserving parallelization scheme to avoid pipelining for higher parallelism. The new approach can be implemented in compilers through a pattern matching pass to optimize SOR-like DOACROSS loop nests on GPUs.
SUDS : automatic parallelization for raw processors
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
Includes bibliographical references (p. 177-181).
A computer can never be too fast or too cheap. Computer systems pervade nearly every aspect of science, engineering, communications and commerce because they perform certain tasks at rates unachievable by any other kind of system built by humans. A computer system's throughput, however, is constrained by that system's ability to find concurrency. Given a particular target work load the computer architect's role is to design mechanisms to find and exploit the available concurrency in that work load. This thesis describes SUDS (Software Un-Do System), a compiler and runtime system that can automatically find and exploit the available concurrency of scalar operations in imperative programs with arbitrary unstructured and unpredictable control flow. The core compiler transformation that enables this is scalar queue conversion. Scalar queue conversion makes scalar renaming an explicit operation through a process similar to closure conversion, a technique traditionally used to compile functional languages. The scalar queue conversion compiler transformation is speculative, in the sense that it may introduce dynamic memory allocation operations into code that would not otherwise dynamically allocate memory. Thus, SUDS also includes a transactional runtime system that periodically checkpoints machine state, executes code speculatively, checks if the speculative execution produced results consistent with the original sequential program semantics, and then either commits or rolls back the speculative execution path. In addition to safely running scalar queue converted code, the SUDS runtime system safely permits threads to speculatively run in parallel and concurrently issue memory operations, even when the compiler is unable to prove that the reordered memory operations will always produce correct results.
Using this combination of compile time and runtime techniques, SUDS can find concurrency in programs where previous compiler based renaming techniques fail because the programs contain unstructured loops, and where Tomasulo's algorithm fails because it sequentializes mispredicted branches. Indeed, we describe three application programs, with unstructured control flow, where the prototype SUDS system, running in software on a Raw microprocessor, achieves speedups equivalent to, or better than, an idealized, and unrealizable, model of a hardware implementation of Tomasulo's algorithm.
by Matthew Ian Frank.
Ph.D.