2,608 research outputs found

    A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters

    Full text link
    Sustaining a large fraction of single GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail and it is demonstrated that the kernel performance can be sustained to a large extent. With our GPU implementation, we achieve nearly perfect weak scalability on InfiniBand clusters. However, in strong scaling scenarios multi-GPUs make less efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost analysis must determine the best course of action for a particular simulation task. Additionally, weak scaling results of heterogeneous simulations conducted on CPUs and GPUs simultaneously are presented using clusters equipped with varying node configurations.Comment: 20 pages, 12 figure

    PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

    Full text link
    High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie

    Performance analysis and optimization of automatic speech recognition

    Get PDF
    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main memory accesses are the main performance limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we use a refactored implementation of the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once in the on-chip caches and reused across multiple frames, significantly reducing memory bandwidth usage. We evaluate our optimizations using both hardware counters on real CPUs and simulations. Our experimental results show that the proposed optimizations provide 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59 percent energy savings without any loss in the accuracy of the ASR system.Peer ReviewedPostprint (author's final draft

    The Iray Light Transport Simulation and Rendering System

    Full text link
    While ray tracing has become increasingly common and path tracing is well understood by now, a major challenge lies in crafting an easy-to-use and efficient system implementing these technologies. Following a purely physically-based paradigm while still allowing for artistic workflows, the Iray light transport simulation and rendering system allows for rendering complex scenes by the push of a button and thus makes accurate light transport simulation widely available. In this document we discuss the challenges and implementation choices that follow from our primary design decisions, demonstrating that such a rendering system can be made a practical, scalable, and efficient real-world application that has been adopted by various companies across many fields and is in use by many industry professionals today

    Matching non-uniformity for program optimizations on heterogeneous many-core systems

    Get PDF
    As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Besides traditional non-uniform memory accesses, much deeper non-uniformity shows in a processor, runtime, and application, exemplified by the asymmetric cache sharing, memory coalescing, and thread divergences on multicore and many-core processors. Being oblivious to the non-uniformity, current applications fail to tap into the full potential of modern computing devices.;My research presents a systematic exploration into the emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges for establishing a non-uniformity--aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement and a controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise of these techniques in maximizing computing throughput, especially for programs with complex data access patterns
    • …
    corecore