8 research outputs found

    A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures

    Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application, whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically, with less thread synchronization and a reduced number of conflicts compared to Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels. Comment: To appear in the proceedings of Euro Par 201
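The speculate-then-resolve idea mentioned in this abstract can be illustrated with a minimal sequential sketch. It is not Catalyurek et al.'s algorithm nor the improved version from the paper: the function name `speculative_coloring`, the snapshot used to imitate threads reading stale neighbour colors, and the "lower-numbered vertex keeps its color" tie-break are all assumptions made for illustration.

```python
def speculative_coloring(adj):
    """Color an undirected graph given as {vertex: set_of_neighbours}.

    Hypothetical, single-threaded simulation of speculative coloring:
    each round colors all pending vertices against a stale snapshot
    (as racing threads might), then re-queues vertices that conflict.
    """
    color = {v: None for v in adj}
    worklist = sorted(adj)
    while worklist:
        # Phase 1: tentative coloring. Every pending vertex picks the
        # smallest color not present in the snapshot of its neighbours.
        snapshot = dict(color)
        for v in worklist:
            forbidden = {snapshot[u] for u in adj[v] if snapshot[u] is not None}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c
        # Phase 2: conflict detection. If two adjacent vertices share a
        # color, the higher-numbered one is recolored in the next round.
        worklist = [
            v for v in worklist
            if any(color[u] == color[v] and u < v for u in adj[v])
        ]
    return color


if __name__ == "__main__":
    triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    print(speculative_coloring(triangle))
```

In this sketch the lowest-numbered pending vertex always keeps its color, so the worklist shrinks every round and the loop terminates. A real multi-threaded implementation would run both phases in parallel over the worklist.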

    Scheduling Transformation and Dependence Tests for Recursive Programs

    Scheduling transformations reorder the execution of operations in a program to improve locality and/or parallelism. The polyhedral model provides a general framework for performing instance-wise scheduling transformations for regular programs, reordering the iterations of loops that operate over dense arrays through transformations like tiling. There is no analogous framework for recursive programs—despite recent interest in optimizations like tiling and fusion for recursive applications. This paper presents PolyRec, the first general framework for applying scheduling transformations—like inlining, interchange, and code motion—to nested recursive programs and reasoning about their correctness. We describe the phases of PolyRec—representing dynamic instances, applying transformations, reasoning about correctness—and show that PolyRec is able to apply sophisticated, composed transformations to complex, nested recursive programs and improve performance through enhanced locality

    Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS

    The key common bottleneck in most stencil codes is data movement, and prior research has shown that optimisations which improve data locality by scheduling across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data locality improving optimisation called iteration space slicing for use in large OPS applications both in shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2× on the Cloverleaf 2D/3D proxy applications, which contain 83/141 loops respectively, 3.5× on the linear solver TeaLeaf, and 1.7× on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16GB, and we do scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per loop data access information

    Generalizing Run-Time Tiling with the Loop Chain Abstraction


    Polyhedral+Dataflow Graphs

    This research presents an intermediate compiler representation that is designed for optimization, and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both the regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language implemented as a domain specific language in C++ describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods

    Communication-avoiding optimizations for large-scale unstructured-mesh applications with OP2

    This thesis presents data movement-reducing and communication-avoiding optimizations and their practicable implementation for large-scale unstructured-mesh numerical simulation applications. Utilizing the high-level abstractions of the OP2 domain-specific library, we reason about techniques for reduced communications across a consecutive sequence of loops – a loop-chain. The optimizations are explored for shared-memory systems where multiple processors share a common memory space and distributed-memory systems that comprise separate memory spaces across multiple nodes. We elucidate the challenges when executing unstructured-mesh applications on large-scale high-performance systems that are specifically related to data sharing and movement, synchronization, and communication among processes. A key feature of the work is to mitigate these problems for real-world, large-scale applications and computing kernels, bringing together proven and effective techniques within a DSL framework. On shared-memory systems, we explore cache-blocking tiling, a key technique for exploiting data locality, in unstructured-mesh applications by integrating the SLOPE library, a cache-blocking tiling library, with OP2. For distributed-memory systems, we analyze the trade-off between increased redundant computation in place of data movement and design a new communication-avoiding back-end for OP2 that applies these techniques automatically to any OP2 application targeting CPUs and GPUs. The communication-avoiding optimizations are applied to two non-trivial applications, including the OP2 version of Rolls Royce’s production CFD application, Hydra, on problem sizes representative of real-world workloads. Results demonstrate how, for select configurations, the new communication-avoiding back-end provides between 30 – 65% runtime reductions for the loop-chains in these applications on both an HPE Cray EX system and an NVIDIA V100 GPU cluster.
    We model and examine the determinants and characteristics of a given unstructured-mesh loop-chain that lead to performance benefits with communication-avoidance techniques, providing insights into the general feasibility and profitability of using the optimizations for this class of applications