
    Architecture-Aware Optimization on a 1600-core Graphics Processor

    The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, comparatively little exists for OpenCL on AMD GPUs. Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU optimizations by applying each optimization in isolation as well as in concert to a large-scale molecular modeling application called GEM. With these AMD-specific GPU optimizations, the AMD Radeon HD 5870 GPU delivers 65% better performance than with the well-known NVIDIA-specific optimizations.
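    As a rough illustration of two of these optimization ideas, the sketch below contrasts a scalar, branchy kernel with a variant that uses vector types and branch-free arithmetic. It is written in CUDA purely for illustration; the paper's actual optimizations target OpenCL on the AMD Radeon HD 5870, and the kernel shown here is a hypothetical example rather than code from GEM.

        #include <cuda_runtime.h>

        // Scalar, branchy baseline: one element per thread, with a
        // data-dependent branch that may diverge within a warp/wavefront.
        __global__ void saxpy_clamp_scalar(const float* x, float* y, float a, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float v = a * x[i] + y[i];
                if (v < 0.0f) v = 0.0f;
                y[i] = v;
            }
        }

        // Vectorized, branch-free variant: float4 loads/stores (one wide
        // memory transaction per thread) and fmaxf instead of a branch.
        __global__ void saxpy_clamp_vec4(const float4* x, float4* y, float a, int n4) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n4) {
                float4 xv = x[i], yv = y[i];
                yv.x = fmaxf(a * xv.x + yv.x, 0.0f);
                yv.y = fmaxf(a * xv.y + yv.y, 0.0f);
                yv.z = fmaxf(a * xv.z + yv.z, 0.0f);
                yv.w = fmaxf(a * xv.w + yv.w, 0.0f);
                y[i] = yv;
            }
        }

    The intuition behind optimizations (ii) and (iii) is that vectorized, branch-free code of this kind maps more directly onto the wide VLIW/SIMD execution units of GPUs such as the HD 5870.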

    Tiling Optimization for Nested Loops on GPUs

    Optimizing nested loops is an important and widely studied topic in parallel programming. With the development of GPU architectures, such computations can be significantly accelerated by massively parallel hardware; general matrix-matrix multiplication is a typical example, where a GPU implementation outperforms multicore CPUs. However, achieving ideal performance on GPUs usually requires substantial human effort to manage the massively parallel computation resources, so the efficient optimization of nested loops on GPUs has become a popular topic in recent years. This dissertation presents work based on the tiling strategy to address three kinds of problems. Different kinds of computations raise different performance issues: dependencies in a computation may leave insufficient parallelism, while computations without dependencies may be degraded by intensive memory accesses. We tackle the challenges of each kind of problem and believe that other computations performed in nested loops can also benefit from the presented techniques.
    First, we improve a parallel approximation algorithm for scheduling jobs on parallel identical machines to minimize makespan, using a high-dimensional tiling method. The algorithm is designed and optimized to solve this kind of problem efficiently on GPUs. Because the algorithm is based on a higher-dimensional dynamic programming approach, where dimensionality refers to the number of variables in the dynamic programming equation characterizing the problem, the existing implementation suffers from the curse of dimensionality and cannot fully utilize GPU resources. We design a novel data-partitioning technique to accelerate the higher-dimensional dynamic programming component of the algorithm, addressing both load imbalance and the problem of exceeding memory capacity in our GPU solution. We present performance results showing how the proposed design improves GPU utilization and makes it possible to solve large higher-dimensional dynamic programming problems within the limited GPU memory. Experimental results show that the GPU implementation achieves up to a 25X speedup over the best existing OpenMP implementation.
    Second, we focus on optimizing wavefront parallelism on GPUs. Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from their massively parallel computing resources. Wavefront parallelism suffers from load imbalance because the available parallelism grows and shrinks as the computation sweeps along the diagonals. Tiling is a popular way to address this issue, but the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. We present a highly optimized implementation of wavefront parallelism that harnesses the GPU architecture, achieving a balanced workload and maximum resource utilization with extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and introduce an inter-block lock to minimize the overhead of each synchronization. We evaluate the proposed technique on four applications: Sequence Alignment, Edit Distance, Summed-Area Table, and 2D SOR. The results demonstrate speedups of up to six times over the previous best-known hyperplane-tiling-based GPU implementation.
    Finally, we extend hyperplane tiling to high-order 2D stencil computations. Unlike wavefront parallelism, which has dependences within the spatial dimensions, stencil computations carry dependences only between two adjacent time steps along the temporal dimension. Although the absence of spatial dependences greatly increases the parallelism available within each time step, exploiting that parallelism in full is not necessarily efficient on GPUs: because each streaming multiprocessor has only limited cache capacity, full parallelism can be served only from global memory, which has high access latency. Tiling can therefore improve memory efficiency by caching small tiled blocks. Since widely studied tiling methods such as overlapped tiling and split tiling incur considerable computation overhead from load imbalance or redundant operations, we propose a time-skewed tiling method designed around the GPU architecture. We avoid serializing the computation and coordinate intra-tile and inter-tile parallelism to minimize the load imbalance caused by pipelined processing. Moreover, our development addresses high-order stencil computations, which have not been comprehensively studied. The proposed method achieves up to a 3.5X performance improvement when the stencil computation is performed on a Moore neighborhood pattern.
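    To make the wavefront dependence structure concrete, the sketch below shows the basic (untiled) pattern that the dissertation's tiled approach improves upon: cells on one anti-diagonal of an edit-distance table are independent and can be computed in parallel, while consecutive anti-diagonals must be processed in order, here enforced with one kernel launch per diagonal. This is a hedged, illustrative CUDA example, not the dissertation's implementation; the optimized version replaces the per-diagonal launches with hyperplane tiles and an inter-block lock.

        #include <cuda_runtime.h>
        #include <algorithm>

        // Dependence structure of a 2D dynamic-programming table: cell (i, j)
        // depends on (i-1, j), (i, j-1), and (i-1, j-1), so all cells with
        // i + j == d (one anti-diagonal) are independent of each other.
        __global__ void edit_distance_diag(int* D, const char* a, const char* b,
                                           int n, int m, int d) {
            int i = blockIdx.x * blockDim.x + threadIdx.x + max(1, d - m);
            int j = d - i;
            if (i >= 1 && i <= n && j >= 1 && j <= m) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int del  = D[(i - 1) * (m + 1) + j] + 1;
                int ins  = D[i * (m + 1) + (j - 1)] + 1;
                int sub  = D[(i - 1) * (m + 1) + (j - 1)] + cost;
                D[i * (m + 1) + j] = min(min(del, ins), sub);
            }
        }

        // Host-side wavefront sweep. D, a, b are device pointers; row 0 and
        // column 0 of D are assumed to be initialized already. Each kernel
        // launch processes one anti-diagonal and doubles as the
        // synchronization point between diagonals.
        void run_wavefront(int* D, const char* a, const char* b, int n, int m) {
            for (int d = 2; d <= n + m; ++d) {
                int lo  = std::max(1, d - m);
                int len = std::min(n, d - 1) - lo + 1;
                if (len <= 0) continue;
                int threads = 256;
                int blocks  = (len + threads - 1) / threads;
                edit_distance_diag<<<blocks, threads>>>(D, a, b, n, m, d);
            }
            cudaDeviceSynchronize();
        }

    The per-diagonal launches make the synchronization cost and the load imbalance (short diagonals near the corners of the table) easy to see, which is exactly the overhead that the tiled design with an inter-block lock targets.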

    Astaroth: A software library for stencil computations on graphics processors

    Graphics processing units (GPUs) are coprocessors that offer higher throughput and better power efficiency than central processing units in data-parallel tasks. For this reason, graphics processors provide a good platform for high-performance computing. However, programming GPUs so that all the available performance is utilized requires in-depth knowledge of the hardware architecture. Additionally, the problem of high-order stencil computations on GPUs in challenging multiphysics applications has not been adequately explored in previous work. In this thesis, we address these issues by presenting a library, an efficient algorithm, and a domain-specific language for solving stencil computations within a structured grid. We tested our implementation by simulating magnetohydrodynamics, which involved the computation of first, second, and cross partial derivatives using second-, fourth-, sixth-, and eighth-order finite differences in single and double precision. The running time of our integration kernel was 2.8–9.1 times the theoretical minimum time it would take to read the computational domain and write it back to device memory exactly once, without taking into account the effects of finite caches or arithmetic operations on performance. Additionally, we made a performance comparison with a CPU solver widely used for scientific computations, benchmarked on a total of 24 cores of two Intel Xeon E5-2690 v3 processors. Our solver, benchmarked on a Tesla P100 PCIe GPU, outperformed the CPU solver by factors of 6.7 and 10.4 in single and double precision, respectively.
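    To give an idea of the kind of kernel involved, the following is a minimal sketch, not Astaroth's actual code, of a sixth-order central finite difference for a first derivative along x on a structured 3D grid; Astaroth generates kernels of this kind (plus the corresponding second and cross derivatives) from its domain-specific language and applies further caching optimizations that are omitted here.

        #include <cuda_runtime.h>

        __global__ void ddx_6th_order(const float* f, float* dfdx,
                                      int nx, int ny, int nz, float inv_dx) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            int j = blockIdx.y * blockDim.y + threadIdx.y;
            int k = blockIdx.z * blockDim.z + threadIdx.z;
            // Skip the 3-cell halo required by the 6th-order stencil along x.
            if (i < 3 || i >= nx - 3 || j >= ny || k >= nz) return;

            long idx = ((long)k * ny + j) * nx + i;
            // Standard 6th-order central-difference coefficients: 3/4, -3/20, 1/60.
            float d = (45.0f * (f[idx + 1] - f[idx - 1])
                     -  9.0f * (f[idx + 2] - f[idx - 2])
                     +          (f[idx + 3] - f[idx - 3])) / 60.0f;
            dfdx[idx] = d * inv_dx;
        }

    Each output point reads six neighbors along x but writes only one value, so kernels of this kind are dominated by memory traffic; this is why the abstract compares the running time against the minimum time needed to read and write the domain exactly once.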

    Massively parallel lattice–Boltzmann codes on large GPU clusters

    This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on a single GPU and for good scaling behavior on a large number of GPUs; versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications because they deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communication must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical performance models and experimental benchmarks. We present an extensive set of performance measurements, identify the main bottlenecks, and compare the results of our GPU code with those measured on other currently available high-performance processors. The outcome is a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used to develop other high-performance applications for computational physics.
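    The node-to-node communication overheads mentioned above are commonly hidden by overlapping halo exchange with the update of the interior of the local lattice. The sketch below shows that generic pattern under several assumptions (a 1D domain split along z, CUDA-aware MPI, neighbor ranks left/right, and a placeholder collide_and_stream kernel); it illustrates the idea only and does not reproduce the structure of the paper's code.

        #include <mpi.h>
        #include <cuda_runtime.h>

        // Placeholder for the real collide+stream update over z-planes [z0, z1).
        __global__ void collide_and_stream(float* f, int nx, int ny, int z0, int z1) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            int z = blockIdx.z + z0;
            if (x < nx && y < ny && z < z1)
                f[((long)z * ny + y) * nx + x] += 1.0f;   // stand-in for the LBM update
        }

        // One time step with communication/computation overlap. Planes 0 and
        // nz-1 are ghost layers; planes 1 and nz-2 are the boundary planes to
        // exchange. Assumes nz >= 5, left/right are neighbor ranks (or
        // MPI_PROC_NULL at domain edges), and d_f is usable by CUDA-aware MPI.
        void timestep(float* d_f, int nx, int ny, int nz, int left, int right,
                      cudaStream_t s_bulk, cudaStream_t s_halo) {
            dim3 block(32, 8, 1);
            dim3 grid_plane((nx + 31) / 32, (ny + 7) / 8, 1);
            size_t plane = (size_t)nx * ny;

            // 1. Update the two boundary planes first, on the halo stream.
            collide_and_stream<<<grid_plane, block, 0, s_halo>>>(d_f, nx, ny, 1, 2);
            collide_and_stream<<<grid_plane, block, 0, s_halo>>>(d_f, nx, ny, nz - 2, nz - 1);

            // 2. Update the interior on the bulk stream; it overlaps with step 3.
            dim3 grid_bulk((nx + 31) / 32, (ny + 7) / 8, nz - 4);
            collide_and_stream<<<grid_bulk, block, 0, s_bulk>>>(d_f, nx, ny, 2, nz - 2);

            // 3. Exchange boundary planes while the interior update is in flight.
            cudaStreamSynchronize(s_halo);
            MPI_Request req[4];
            MPI_Irecv(d_f,                    (int)plane, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(d_f + (nz - 1) * plane, (int)plane, MPI_FLOAT, right, 1, MPI_COMM_WORLD, &req[1]);
            MPI_Isend(d_f + plane,            (int)plane, MPI_FLOAT, left,  1, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(d_f + (nz - 2) * plane, (int)plane, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &req[3]);
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

            // 4. Make sure the interior update is done before the next step.
            cudaStreamSynchronize(s_bulk);
        }

    Whether this overlap actually hides the communication time depends on how long the interior update takes relative to the halo exchange, which is the kind of trade-off the paper's performance models and benchmarks are used to evaluate.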

    Generating and auto-tuning parallel stencil codes

    In this thesis, we present a software framework, Patus, which generates high-performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and its performance), and high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general-purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The key ingredients proposed to achieve the goals of productivity, portability, and performance are domain-specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation concisely and independently of hardware architecture-specific details. It thus increases programmer productivity by relieving the programmer of low-level programming model issues and of manually applying hardware platform-specific code optimizations. The use of domain-specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Gearing the language towards a special purpose also makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automatically adapting implementation-specific parameters to the characteristics of the hardware on which the code will run. Parameter tuning essentially amounts to an integer programming problem whose objective function is the code's measured performance as a function of the parameter configuration; by automating this process, the system can be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and its ability to produce high-performance code.
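    As a minimal sketch of the auto-tuning idea (not of Patus itself), the CUDA program below treats the thread-block size as the only tunable parameter, uses the measured kernel time as the objective function, and sweeps the candidate values exhaustively; Patus explores much richer parameter spaces (tile sizes, unrolling factors, and so on) over code it generates from the stencil DSL and can use smarter search strategies than a plain sweep.

        #include <cuda_runtime.h>
        #include <cstdio>

        // Simple 3-point stencil used only as a tuning target.
        __global__ void stencil_1d(const float* in, float* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i > 0 && i < n - 1)
                out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
        }

        int main() {
            const int n = 1 << 24;
            float *in, *out;
            cudaMalloc(&in,  n * sizeof(float));
            cudaMalloc(&out, n * sizeof(float));
            cudaMemset(in, 0, n * sizeof(float));

            int candidates[] = {64, 128, 192, 256, 384, 512, 768, 1024};
            int best_block = 0; float best_ms = 1e30f;
            cudaEvent_t t0, t1;
            cudaEventCreate(&t0); cudaEventCreate(&t1);

            // Exhaustive search: time each configuration and keep the best.
            for (int block : candidates) {
                int grid = (n + block - 1) / block;
                stencil_1d<<<grid, block>>>(in, out, n);        // warm-up run
                cudaEventRecord(t0);
                for (int rep = 0; rep < 10; ++rep)              // timed runs
                    stencil_1d<<<grid, block>>>(in, out, n);
                cudaEventRecord(t1);
                cudaEventSynchronize(t1);
                float ms; cudaEventElapsedTime(&ms, t0, t1);
                printf("block %4d : %.3f ms\n", block, ms);
                if (ms < best_ms) { best_ms = ms; best_block = block; }
            }
            printf("best block size: %d\n", best_block);
            cudaFree(in); cudaFree(out);
            return 0;
        }

    Extending the parameter set and replacing the exhaustive sweep with a search heuristic is what turns a toy loop like this into a practical auto-tuner of the kind the abstract describes.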