5,851 research outputs found

    GPU Accelerated Particle Visualization with Splotch

    Get PDF
    Splotch is a rendering algorithm for exploration and visual discovery in particle-based datasets coming from astronomical observations or numerical simulations. The strengths of the approach are production of high quality imagery and support for very large-scale datasets through an effective mix of the OpenMP and MPI parallel programming paradigms. This article reports our experiences in re-designing Splotch for exploiting emerging HPC architectures nowadays increasingly populated with GPUs. A performance model is introduced for data transfers, computations and memory access, to guide our re-factoring of Splotch. A number of parallelization issues are discussed, in particular relating to race conditions and workload balancing, towards achieving optimal performances. Our implementation was accomplished by using the CUDA programming paradigm. Our strategy is founded on novel schemes achieving optimized data organisation and classification of particles. We deploy a reference simulation to present performance results on acceleration gains and scalability. We finally outline our vision for future work developments including possibilities for further optimisations and exploitation of emerging technologies.Comment: 25 pages, 9 figures. Astronomy and Computing (2014

    Doctor of Philosophy

    Get PDF
    dissertationMemory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented

    Doctor of Philosophy

    Get PDF
    dissertationEmerging trends such as growing architectural diversity and increased emphasis on energy and power efficiency motivate the need for code that adapts to its execution context (input dataset and target architecture). Unfortunately, writing such code remains difficult, and is typically attempted only by a small group of motivated expert programmers who are highly knowledgeable about the relationship between software and its hardware mapping. In this dissertation, we introduce novel abstractions and techniques based on automatic performance tuning that enable both experts and nonexperts (application developers) to produce adaptive code. We present two new frameworks for adaptive programming: Nitro and Surge. Nitro enables expert programmers to specify code variants, or alternative implementations of the same computation, together with meta-information for selecting among them. It then utilizes supervised classification to select an optimal code variant at runtime based on characteristics of the execution context. Surge, on the other hand, provides a high-level nested data-parallel programming interface for application developers to specify computations. It then employs a two-level mechanism to automatically generate code variants and then tunes them using Nitro. The resulting code performs on par with or better than handcrafted reference implementations on both CPUs and GPUs. In addition to abstractions for expressing code variants, this dissertation also presents novel strategies for adaptively tuning them. First, we introduce a technique for dynamically selecting an optimal code variant at runtime based on characteristics of the input dataset. On five high-performance GPU applications, variants tuned using this strategy achieve over 93% of the performance of variants selected through exhaustive search. Next, we present a novel approach based on multitask learning to develop a code variant selection model on a target architecture from training on different source architectures. We evaluate this approach on a set of six benchmark applications and a collection of six NVIDIA GPUs from three distinct architecture generations. Finally, we implement support for combined code variant and frequency selection based on multiple objectives, including power and energy efficiency. Using this strategy, we construct a GPU sorting implementation that provides improved energy and power efficiency with less than a proportional drop in sorting throughput

    A Survey on Compiler Autotuning using Machine Learning

    Full text link
    Since the mid-1990s, researchers have been trying to use machine-learning based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the obtained results, the fine-grain classification among different approaches and finally, the influential papers of the field.Comment: version 5.0 (updated on September 2018)- Preprint Version For our Accepted Journal @ ACM CSUR 2018 (42 pages) - This survey will be updated quarterly here (Send me your new published papers to be added in the subsequent version) History: Received November 2016; Revised August 2017; Revised February 2018; Accepted March 2018

    Sparse matrix-vector multiplication on GPGPUs

    Get PDF
    The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrix-vector multiplication is therefore crucial and has been the subject of an immense amount of research, with interest renewed with every major new trend in high performance computing architectures. The introduction of General Purpose Graphics Processing Units (GPGPUs) is no exception, and many articles have been devoted to this problem. With this paper we provide a review of the techniques for implementing the SpMV kernel on GPGPUs that have appeared in the literature of the last few years. We discuss the issues and trade-offs that have been encountered by the various researchers, and a list of solutions, organized in categories according to common features. We also provide a performance comparison across different GPGPU models and on a set of test matrices coming from various application domains
    • …
    corecore