26 research outputs found

    Autotuning wavefront applications for multicore multi-GPU hybrid architectures

    Get PDF

    Doctor of Philosophy

    Get PDF
    dissertationMemory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented

    Doctor of Philosophy in Computer Science

    Get PDF
    dissertationStencil computations are operations on structured grids. They are frequently found in partial differential equation solvers, making their performance critical to a range of scientific applications. On modern architectures where data movement costs dominate computation, optimizing stencil computations is a challenging task. Typically, domain scientists must reduce and orchestrate data movement to tackle the memory bandwidth and latency bottlenecks. Furthermore, optimized code must map efficiently to ever increasing parallelism on a chip. This dissertation studies several stencils with varying arithmetic intensities, thus requiring contrasting optimization strategies. Stencils traditionally have low arithmetic intensity, making their performance limited by memory bandwidth. Contemporary higher-order stencils are designed to require smaller grids, hence less memory, but are bound by increased floating-point operations. This dissertation develops communication-avoiding optimizations to reduce data movement in memory-bound stencils. For higher-order stencils, a novel transformation, partial sums, is designed to reduce the number of floating-point operations and improve register reuse. These optimizations are implemented in a compiler framework, which is further extended to generate parallel code targeting multicores and graphics processor units (GPUs). The augmented compiler framework is then combined with autotuning to productively address stencil optimization challenges. Autotuning explores a search space of possible implementations of a computation to find the optimal code for an execution context. In this dissertation, autotuning is used to compose sequences of optimizations to drive the augmented compiler framework. This compiler-directed autotuning approach is used to optimize stencils in the context of a linear solver, Geometric Multigrid (GMG). GMG uses sequences of stencil computations, and presents greater optimization challenges than isolated stencils, as interactions between stencils must also be considered. The efficacy of our approach is demonstrated by comparing the performance of generated code against manually tuned code, over commercial compiler-generated code, and against analytic performance bounds. Generated code outperforms manually optimized codes on multicores and GPUs. Against Intel's compiler on multicores, generated code achieves up to 4x speedup for stencils, and 3x for the solver. On GPUs, generated code achieves 80% of an analytically computed performance bound

    Portable performance on heterogeneous architectures

    Get PDF
    Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine. To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.United States. Dept. of Energy (Award DE-SC0005288)United States. Defense Advanced Research Projects Agency (Award HR0011-10-9-0009)National Science Foundation (U.S.) (Award CCF-0632997

    Enhancing productivity and performance portability of opencl applications on heterogeneous systems using runtime optimizations

    Get PDF
    Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity and the unachievable flexibility and portability necessary to optimize for each target individually, heterogeneous systems remain typically vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non intrusive methods in the form of compiler tools and implementing efficient abstractions to automatically tune parameters for a restricted domain are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces significant software engineering effort spent to tune application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library based approach is designed to provide a high level abstraction for complex problems in a specific domain, stencil computation. Using domain specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs and 3.6x on four
    corecore