11 research outputs found
Static Analysis of Upper and Lower Bounds on Dependences and Parallelism
Existing compilers often fail to parallelize sequential code, even
when a program can be manually transformed into parallel form
by a sequence of well-understood transformations
(as is the case for many of the Perfect Club Benchmark
programs).
These failures can occur for several reasons: the code transformations
implemented in the compiler may not be sufficient to produce parallel
code, the compiler may not find the proper sequence of
transformations, or the compiler may not be able to prove that one
of the necessary transformations is legal.
When a compiler extract sufficient parallelism from a program,
the programmer extract additional parallelism.
Unfortunately, the programmer is typically left to search for
parallelism without significant assistance.
The compiler generally does not give feedback about which parts of the
program might contain additional parallelism, or about the types of
transformations that might be needed to realize this parallelism.
Standard program transformations and dependence abstractions cannot be
used to provide this feedback.
In this paper, we propose a two step approach for the search for
parallelism in sequential programs:
We first construct several sets of constraints that describe, for each
statement, which iterations of that statement can be executed
concurrently.
By constructing constraints that correspond to different assumptions
about which dependences might be eliminated through additional
analysis, transformations and user assertions, we can determine
whether we can expose parallelism by eliminating dependences.
In the second step of our search for parallelism, we examine these
constraint sets to identify the kinds of transformations that are
needed to exploit scalable parallelism.
Our tests will identify conditional parallelism and parallelism that
can be exposed by combinations of transformations that reorder the
iteration space (such as loop interchange and loop peeling).
This approach lets us distinguish inherently sequential code from code
that contains unexploited parallelism.
It also produces information about the kinds of transformations that
will be needed to parallelize the code, without worrying about the
order of application of the transformations.
Furthermore, when our dependence test is inexact,
we can identify which unresolved dependences inhibit parallelism
by comparing the effects of assuming dependence or independence.
We are currently exploring the use of this information in
programmer-assisted parallelization.
(Also cross-referenced as UMIACS-TR-94-40
A Simple Data Dependency Analyzer For C Programs.
Data dependencies that exist in a sequential program are a major hindrance towards parallelization
Polly's Polyhedral Scheduling in the Presence of Reductions
The polyhedral model provides a powerful mathematical abstraction to enable
effective optimization of loop nests with respect to a given optimization goal,
e.g., exploiting parallelism. Unexploited reduction properties are a frequent
reason for polyhedral optimizers to assume parallelism prohibiting dependences.
To our knowledge, no polyhedral loop optimizer available in any production
compiler provides support for reductions. In this paper, we show that
leveraging the parallelism of reductions can lead to a significant performance
increase. We give a precise, dependence based, definition of reductions and
discuss ways to extend polyhedral optimization to exploit the associativity and
commutativity of reduction computations. We have implemented a
reduction-enabled scheduling approach in the Polly polyhedral optimizer and
evaluate it on the standard Polybench 3.2 benchmark suite. We were able to
detect and model all 52 arithmetic reductions and achieve speedups up to
2.21 on a quad core machine by exploiting the multidimensional
reduction in the BiCG benchmark.Comment: Presented at the IMPACT15 worksho
Nonlinear Array Dependence Analysis
Standard array data dependence techniques can only reason about linear
constraints. There has also been work on analyzing some dependences
involving polynomial constraints. Analyzing array data dependences in
real-world programs requires handling many ``unanalyzable'' terms:
subscript arrays, run-time tests, function calls.
The standard approach to analyzing such programs has been to omit and
ignore any constraints that cannot be reasoned about. This is unsound
when reasoning about value-based dependences and whether privatization
is legal. Also, this prevents us from determining the conditions that
must be true to disprove the dependence. These conditions could be
checked by a run-time test or verified by a programmer or aggressive,
demand-driven interprocedural analysis.
We describe a solution to these problems. Our solution makes our
system sound and more accurate for analyzing value-based dependences
and derives conditions that can be used to disprove dependences. We
also give some preliminary results from applying our techniques to
programs from the Perfect benchmark suite.
(Also cross-referenced as UMIACS-TR-94-123
Reduction Drawing: Language Constructs and Polyhedral Compilation for Reductions on GPUs
International audienceReductions are common in scientific and data-crunching codes, and a typical source of bottlenecks on massively parallel architectures such as GPUs. Reductions are memory-bound, and achieving peak performance involves sophisticated optimizations. There exist libraries such as CUB and Thrust providing highly tuned implementations of reductions on GPUs. However, library APIs are not flexible enough to express user-defined reductions on arbitrary data types and array indexing schemes. Languages such as OpenACC provide declarative syntax to express reductions. Such approaches support a limited range of reduction operators and do not facilitate the application of complex program transformations in presence of reductions. We present language constructs that let a programmer express arbitrary reductions on user-defined data types matching the performance of tuned library implementations. We also extend a polyhedral compilation flow to process these user-defined reductions, enabling optimizations such as the fusion of multiple reductions, combining reductions with other loop transformations, and optimizing data transfers and storage in the presence of reductions. We implemented these language constructs and compilation methods in the PPCG framework and conducted experiments on multiple GPU targets. For single reductions the generated code performs on par with highly tuned libraries, and for multiple reductions it significantly outperforms both libraries and OpenACC on all platforms
Unified Polyhedral Modeling of Temporal and Spatial Locality
Despite decades of work in this area, the construction of effective loop nest optimizers and parallelizers continues to be challenging due to the increasing diversity of both loop-intensive application workloads and complex memory/computation hierarchies in modern processors. The lack of a systematic approach to optimizing locality and parallelism, with a well-founded data locality model, is a major obstacle to the design of optimizing compilers coping with the variety of software and hardware. Acknowledging the conflicting demands on loop nest optimization, we propose a new unified algorithm for optimizing parallelism and locality in loop nests, that is capable of modeling temporal and spatial effects of multiprocessors and accelerators with deep memory hierarchies and multiple levels of parallelism. It orchestrates a collection of parameterizable optimization problems for locality and parallelism objectives over a polyhedral space of semantics-preserving transformations. The overall problem is not convex and is only constrained by semantics preservation. We discuss the rationale for this unified algorithm, and validate it on a collection of representative computational kernels/benchmarks.Malgré les décennies de travail dans ce domaine, la construction de compilateurs capables de paraléliser et optimiser les nids de boucle reste un problème difficile, dans le contexte d’une augmentation de la diversité des applications calculatoires et de la complexité de la hiérarchie de calcul et de stockage des processeurs modernes. L’absence d’une méthode systématique pour optimiser la localité et le parallélisme, fondée sur un modèle de localité des données pertinent, constitue un obstacle majeur pour prendre en charge la variété des besoins en optimisation de boucles issus du logiciel et du matériel. Dans ce contexte, nous proposons un nouvel algorithme unifié pour l’optimisation du parallélisme et de la localité dans les nids de boucles, capable de modéliser les effets temporels et spatiaux des multiprocesseurs et accélérateurs comportant des hiérarchies profondes de parallélisme et de mémoire. Cet algorithme coordonne la résolution d’une collection de problèmes d’optimisation paramètrés, portant sur des objectifs de localité ou et de parallélisme, dans un espace polyédrique de transformations préservant la sémantique du programme. La conception de cet algorithme fait l’objet d’une discussion systématique, ainsi que d’une validation expérimentale sur des noyaux calculatoires et benchmarks représentatifs
Static Analysis of Upper and Lower Bounds on Dependences and Parallelism
Existing compilers often fail to parallelize sequential code, even when a program can be manually transformed into parallel form by a sequence of well-understood transformations (as is the case for many of the Perfect Club Benchmark programs). These failures can occur for several reasons: the code transformations implemented in the compiler may not be sufficient to produce parallel code, the compiler may not find the proper sequence of transformations, or the compiler may not be able to prove that one of the necessary transformations is legal. When a compiler extract sufficient parallelism from a program, the programmer extract additional parallelism. Unfortunately, the programmer is typically left to search for parallelism without significant assistance. The compiler generally does not give feedback about which parts of the program might contain additional parallelism, or about the types of transformations that might be needed to realize this parallelism. Standard program transformati..