37 research outputs found

    Optimization Techniques for Stencil Data Parallel Programs: Methodologies and Applications

    The optimization of data parallel programs is a challenging open problem. We analyzed in detail the optimization techniques for stencil computations, which are a subset of data parallel computations. Drawing from previous research, we developed a structured model to describe the program transformations. We used this model to compare the different optimizations presented in the literature and to study the interactions between them.
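
    To fix the pattern under discussion: a stencil computation updates every grid point from a fixed neighborhood of nearby points. Below is a minimal 5-point Jacobi sweep in C, an illustrative sketch rather than code from the paper, showing the kind of kernel these optimization techniques target.

        /* One Jacobi relaxation sweep: each interior point becomes the
           average of its four neighbors. Tiling, fusion, and
           vectorization are typical optimizations for such kernels. */
        void jacobi_step(int n, const double in[n][n], double out[n][n]) {
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                                      + in[i][j-1] + in[i][j+1]);
        }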

    Generalized full sparse tiling of loop chains

    Computer and computational scientists are tackling increasingly large and complex problems and are seeking ways of improving the performance of their codes. The key issue faced is how to reach an effective balance between parallelism and locality. In trying to reach this balance, a problem commonly encountered is that of ascertaining the data dependences. Approaches that rely on automatic extraction of data dependences are frequently stymied by complications such as interprocedural and alias analysis. Placing the dependence analysis burden upon the programmer creates a significant barrier to adoption. In this work, we present a new programming abstraction, the loop chain, that specifies a series of loops and the data they access. Given this abstraction, a compiler, inspector, or runtime optimizer can avoid the computationally expensive process of formally determining data dependences, yet still determine beneficial and legal data and iteration reorderings.

    One optimization method that has previously been applied to irregular scientific codes is full sparse tiling. Full sparse tiling has been used to improve the performance of a handful of scientific codes, but in each case the technique had to be applied from scratch by an expert after careful manual analysis of the possible data dependence patterns. The full sparse tiling approach was extended and generalized as part of this work to apply to any code represented by the loop chain abstraction. Using only the abstraction, the generalized algorithm can produce a new data and iteration ordering as well as a parallel execution schedule. Insight into tuning a generalized full sparse tiled application was gained through a study of the different factors influencing tile count. This work lays the foundation for an efficient autotuning approach to optimizing tile count.
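
    A loop chain is, in essence, a sequence of loops annotated with the data each loop reads and writes. The C sketch below is hypothetical (the identifiers and comment conventions are invented for illustration, not the abstraction's actual interface), but it shows the information a loop chain records: with the access sets stated explicitly, an inspector can derive legal reorderings across both loops without general dependence analysis.

        /* Two loops over an unstructured mesh forming a loop chain.
           The read/write sets in the comments are what the loop-chain
           abstraction would capture. */
        void residual_update(int n_edges, int n_nodes, const int *left,
                             const int *right, double *x, double *f,
                             double dt) {
            /* Loop 1 -- reads: x at edge endpoints; writes: f there */
            for (int e = 0; e < n_edges; e++) {
                double flux = x[left[e]] - x[right[e]];
                f[left[e]]  -= flux;
                f[right[e]] += flux;
            }
            /* Loop 2 -- reads: f[v]; writes: x[v] */
            for (int v = 0; v < n_nodes; v++)
                x[v] += dt * f[v];
        }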

    Automatic Parallelization of Tiled Stencil Loop Nests on GPUs

    This thesis designs and implements a compiler framework based on the polyhedral model. The compiler automatically parallelizes loop nests, especially stencil kernels, into efficient GPU code using the loop tiling transformations that the polyhedral model describes. To enhance parallel performance, we introduce three practically efficient techniques to process different types of loop nests. Our experimental results demonstrate that these advanced techniques can outperform previous approaches.

    Firstly, we aim to find efficient tiling transformations without violating data dependences. How to select a tile's shape and size is an open, performance-critical issue constrained by the GPU's hardware. We propose an approach that determines tile shapes with a view to improving the two-level parallelism of GPUs. The new approach finds appropriate tiling hyperplanes by embedding parallelism-enhancing constraints into the polyhedral model to maximize intra-tile, i.e., intra-SM, parallelism. This improves the load balance among the streaming processors (SPs), which execute a wavefront of loop iterations within a tile. We also eliminate parallelism-hindering false dependences to optimize inter-tile, i.e., inter-SM, parallelism. This improves the load balance among the streaming multiprocessors (SMs), which execute a wavefront of tiles.

    Furthermore, to avoid a combinatorial explosion of tile size configurations, we present a model-driven approach to automating tile size selection, which is performance-critical for loop tiling transformations, especially for DOACROSS loop nests. Our tile size selection model accurately estimates the execution times of tiled loop nests running on GPUs. The selected tile sizes lead to performance results close to the best observed for a range of problem sizes tested.

    Finally, to address the difficulty and low performance of parallelizing the widely used SOR stencil loop nests, we present a new tiled parallel SOR method, called MLSOR, which admits more efficient data-parallel SIMD execution on GPUs. Unlike the previous two approaches, which are dependence-preserving, the basic idea is to algorithmically restructure a stencil kernel based on a non-dependence-preserving parallelization scheme that avoids pipelining for higher parallelism. The new approach can be implemented in compilers through a pattern matching pass to optimize SOR-like DOACROSS loop nests on GPUs.
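
    Stripped of the GPU mapping, the underlying transformation is classical rectangular loop tiling; in the thesis the inter-tile loops are distributed across SMs and the intra-tile iterations across SPs. A host-side C sketch of the tiled form (illustrative only; the actual compiler emits CUDA and chooses tile shapes via the polyhedral model):

        #define TI 32
        #define TJ 32
        /* 2D stencil sweep after rectangular tiling: the ii/jj loops
           enumerate tiles, the i/j loops enumerate points in a tile. */
        void tiled_sweep(int n, const double in[n][n], double out[n][n]) {
            for (int ii = 1; ii < n - 1; ii += TI)
                for (int jj = 1; jj < n - 1; jj += TJ)
                    for (int i = ii; i < ii + TI && i < n - 1; i++)
                        for (int j = jj; j < jj + TJ && j < n - 1; j++)
                            out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                                              + in[i][j-1] + in[i][j+1]);
        }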

    Predictive Modeling in a Polyhedral Optimization Space

    High-level program optimizations, such as loop transformations, are critical for high performance on multi-core targets. However, complex sequences of loop transformations are often required to expose parallelism (both coarse-grain and fine-grain) and improve data locality. The polyhedral compilation framework has proved to be very effective at representing these complex sequences and restructuring compute-intensive applications, seamlessly handling perfectly and imperfectly nested loops. Nevertheless, identifying the most effective loop transformations remains a major challenge. We address the problem of selecting the best polyhedral optimizations with dedicated machine learning models, trained specifically on the target machine. We show that these models can quickly select high-performance optimizations with very limited iterative search. Our end-to-end framework is validated using numerous benchmarks on two modern multi-core platforms. We investigate a variety of different machine learning algorithms and hardware counters, and we obtain performance improvements over production compilers ranging on average from 3.2x to 8.7x, while running no more than 6 program variants from a polyhedral optimization space.
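
    The selection idea can be caricatured in a few lines: describe a program by a vector of hardware-counter features and reuse the optimization sequence that worked best on the most similar training program measured on the same machine. The C sketch below is a deliberately simplified nearest-neighbor stand-in for the paper's models; all names and the feature count are invented.

        #include <math.h>
        #define NFEAT 8   /* hypothetical number of counter features */

        /* Return the optimization-sequence id that performed best on
           the training program whose counter profile is closest (in
           Euclidean distance) to the new program's profile. */
        int pick_variant(int ntrain, const double feat[NFEAT],
                         const double train[][NFEAT], const int best[]) {
            int arg = 0;
            double dmin = INFINITY;
            for (int p = 0; p < ntrain; p++) {
                double d = 0.0;
                for (int k = 0; k < NFEAT; k++) {
                    double t = feat[k] - train[p][k];
                    d += t * t;
                }
                if (d < dmin) { dmin = d; arg = p; }
            }
            return best[arg];
        }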

    Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles to Reconcile Parallelism and Locality, Avoiding Divergence and Load Imbalance

    Tiling is a key technique for increasing data reuse in computation kernels. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops is generally not profitable because it does not enable enough data reuse. To combine parallelism and locality, several tiling algorithms propose to tile the time loop together with one or more of the parallel inner loops. However, all of these algorithms have limitations: they are either restricted to special computation patterns, require the redundant execution of certain iterations (overlapped tiling), or rely on wavefront parallelism, which leaves the parallel workload unbalanced. One approach that addresses most of these issues is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split-tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate an affine schedule and index-set splitting that enable us to tile for locality and synchronization avoidance while maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions without constructing any large integer linear programming problem. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
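
    To make the construction concrete, here is a minimal one-dimensional rendering of split tiling in C, an illustrative sketch under simplifying assumptions (W >= 2*H, N divisible by W) rather than PPCG output. Within each band of H time steps, phase 1 computes shrinking trapezoids that are mutually independent, and phase 2 then fills the growing gaps at the tile seams, again fully in parallel: no skewing, no wavefront, no redundant work.

        #include <stdio.h>

        #define N 1024   /* grid points; assumed divisible by W */
        #define T 64     /* time steps                          */
        #define H 8      /* time-band height; requires W >= 2*H */
        #define W 64     /* space-tile width                    */

        static double A[T + 1][N];   /* keep all time levels, for clarity */

        static void point(int t, int i) {   /* 3-point stencil update */
            if (i < 1 || i > N - 2) return; /* walls stay fixed       */
            A[t][i] = 0.25 * (A[t-1][i-1] + 2.0 * A[t-1][i] + A[t-1][i+1]);
        }

        int main(void) {
            for (int i = 0; i < N; i++) A[0][i] = (double)i;
            for (int t = 1; t <= T; t++) {   /* fixed Dirichlet walls */
                A[t][0] = A[0][0];
                A[t][N - 1] = A[0][N - 1];
            }
            for (int t0 = 0; t0 < T; t0 += H) {
                int h = (t0 + H <= T) ? H : T - t0;
                /* Phase 1: shrinking trapezoids, one per tile; their
                   in-band reads stay inside the tile, so all tiles
                   run in parallel. */
                #pragma omp parallel for
                for (int s = 0; s < N; s += W)
                    for (int k = 1; k <= h; k++)
                        for (int i = s + k - 1; i <= s + W - k; i++)
                            point(t0 + k, i);
                /* Phase 2: growing trapezoids at the tile seams; they
                   read only phase-1 results and their own earlier
                   steps, so all seams run in parallel too. */
                #pragma omp parallel for
                for (int b = 0; b <= N; b += W)
                    for (int k = 2; k <= h; k++)
                        for (int i = b - k + 1; i <= b + k - 2; i++)
                            point(t0 + k, i);
            }
            printf("A[T][N/2] = %f\n", A[T][N / 2]);
            return 0;
        }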

    Polyhedral+Dataflow Graphs

    This research presents an intermediate compiler representation that is designed for optimization and that emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decompositions, and structured multigrid methods.
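
    A drastically simplified C rendering of the graph structure (the real specification language is a C++ DSL, and every name below is invented for illustration): each node pairs a statement with its iteration domain and the data mappings it reads and writes, while edges carry producer-consumer dataflow between statements.

        /* Toy polyhedral+dataflow graph for y = A*x: statements,
           domains, and data mappings as strings; a real IR would use
           structured objects rather than text. */
        typedef struct {
            const char *stmt;       /* statement text                    */
            const char *domain;     /* iteration domain (polyhedral set) */
            const char *reads[4];   /* data spaces consumed              */
            const char *writes[4];  /* data spaces produced              */
        } Node;

        typedef struct { int src, dst; const char *map; } Edge;

        static const Node nodes[] = {
            { "S0: t[i] += A[i][j] * x[j]", "{ [i,j] : 0 <= i,j < N }",
              { "A", "x", "t" }, { "t" } },
            { "S1: y[i] = t[i]",            "{ [i] : 0 <= i < N }",
              { "t" }, { "y" } },
        };
        static const Edge edges[] = {   /* S0 produces t, consumed by S1 */
            { 0, 1, "{ S0[i,j] -> S1[i] }" },
        };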

    TRACO: Source-to-Source Parallelizing Compiler

    The paper presents a source-to-source compiler, TRACO, for the automatic extraction of both coarse- and fine-grained parallelism available in C/C++ loops. The parallelization techniques implemented in TRACO are based on the transitive closure of a relation describing all the dependences in a loop. Coarse- and fine-grained parallelism are represented with synchronization-free slices (space partitions) and a legal loop statement instance schedule (time partitions), respectively. TRACO also supports scalar and array variable privatization as well as parallel reduction. As output, TRACO produces compilable parallel OpenMP C/C++ and/or OpenACC C/C++ code. The effectiveness of TRACO, the efficiency of the parallel code it produces, and the time needed to produce that code are evaluated by means of the NAS Parallel Benchmark and Polyhedral Benchmark suites. These features of TRACO are compared with those of closely related compilers such as ICC, Pluto, Par4All, and Cetus. Future work is outlined.
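
    As an illustration of synchronization-free slices (an invented example, not literal TRACO output): in the loop nest below, every dependence chain stays within one value of j, so the iteration space decomposes into independent slices, one per column, and the outer loop parallelizes with no synchronization.

        /* Each column j is an independent slice: a[i][j] depends only
           on a[i-1][j], so the j loop runs in parallel. */
        void column_scan(int n, int m, double a[n][m], const double b[n][m]) {
            #pragma omp parallel for
            for (int j = 0; j < m; j++)        /* space partition: slices */
                for (int i = 1; i < n; i++)    /* sequential within slice */
                    a[i][j] = a[i - 1][j] + b[i][j];
        }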

    Beyond shared memory loop parallelism in the polyhedral model

    With the introduction of multi-core processors, motivated by power and energy concerns, parallel processing has become mainstream. Parallel programming is much more difficult than sequential programming because of its non-deterministic nature and the bugs that arise from non-determinacy. One solution is automatic parallelization, where it is entirely up to the compiler to efficiently parallelize sequential programs. However, automatic parallelization is very difficult, and only a handful of successful techniques are available, even after decades of research. Automatic parallelization for distributed memory architectures is even more problematic in that it requires explicit handling of data partitioning and communication. Since data must be partitioned among multiple nodes that do not share memory, the original memory allocation of sequential programs cannot be directly used.

    One of the main contributions of this dissertation is the development of techniques for generating distributed memory parallel code with parametric tiling. Our approach builds on important contributions to the polyhedral model, a mathematical framework for reasoning about program transformations. We show that many affine control programs can be uniformized with only simple techniques. Being able to assume uniform dependences significantly simplifies distributed memory code generation and also enables parametric tiling. Our approach is implemented in the AlphaZ system, a system for prototyping analyses, transformations, and code generators in the polyhedral model. The key features of AlphaZ are memory re-allocation and the explicit representation of reductions. We evaluate our approach on a collection of polyhedral kernels from the PolyBench suite and show that it scales as well as PLuTo, a state-of-the-art shared memory automatic parallelizer using the polyhedral model.

    Automatic parallelization is only one approach to dealing with the non-deterministic nature of parallel programming, one that leaves the difficulty entirely to the compiler. Another approach is to develop novel parallel programming languages. These languages, such as X10, aim to provide a highly productive parallel programming environment by including parallelism in the language design. However, even in these languages, parallel bugs remain an important issue that hinders programmer productivity. Another contribution of this dissertation is the extension of array dataflow analysis to handle a subset of X10 programs. We apply the results of the dataflow analysis to statically guarantee determinism. Providing static guarantees can significantly increase programmer productivity by catching questionable implementations at compile time, or even while programming.
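
    "Parametric" tiling means the tile size remains a symbolic parameter of the generated code rather than a compile-time constant, so a single code version can be tuned per machine or problem size. A one-dimensional caricature in C (illustrative; the dissertation's generator handles the full multi-dimensional, distributed-memory case):

        /* The tile size ts is a runtime argument. Classical polyhedral
           tiling must fix it at code-generation time because a symbolic
           size makes the transformation non-affine; that is the
           restriction parametric tiling removes. */
        void sweep_tiled(int n, int ts, const double *in, double *out) {
            for (int tt = 1; tt < n - 1; tt += ts)              /* tiles  */
                for (int i = tt; i < tt + ts && i < n - 1; i++) /* points */
                    out[i] = 0.5 * (in[i - 1] + in[i + 1]);
        }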

    Multicoloring of grid-structured PDE solvers on shared-memory multiprocessors

    In order to execute a parallel PDE (partial differential equation) solver on a shared-memory multiprocessor, we have to avoid memory conflicts in accessing multidimensional data grids. A new multicoloring technique is proposed for speeding up sparse matrix operations. The new technique enables parallel access of grid-structured data elements in shared memory without causing conflicts. The coloring scheme is formulated as an algebraic mapping that can be easily implemented with low overhead on commercial multiprocessors. The proposed multicoloring scheme has been tested on an Alliant FX/80 multiprocessor for solving 2D and 3D problems using the CGNR method. Compared to the results reported by Saad (1989) on an identical Alliant system, our results show 30 times higher performance in Mflops. Multicoloring transforms sparse matrices into ones with a diagonal diagonal block (DDB) structure, in which each diagonal block is itself a diagonal matrix, enabling parallel LU decomposition in solving PDE problems. The multicoloring technique can also be extended to other scientific problems characterized by sparse matrices.
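
    Red-black ordering is the simplest instance of the principle: with two colors on a 5-point grid, no point neighbors another point of its own color, so each color sweep can update all of its points in parallel without access conflicts. The C sketch below shows this two-color case only; the paper's contribution is a generalized algebraic multicoloring mapping.

        /* Two-color Gauss-Seidel sweep: points with (i + j) even are
           "red", odd are "black"; all points of one color update
           concurrently without conflicts. */
        void red_black_sweep(int n, double u[n][n], const double f[n][n]) {
            for (int color = 0; color < 2; color++) {
                #pragma omp parallel for
                for (int i = 1; i < n - 1; i++)
                    for (int j = 1 + (i + 1 + color) % 2; j < n - 1; j += 2)
                        u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                        + u[i][j-1] + u[i][j+1] + f[i][j]);
            }
        }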