
    Optimization of lattice Boltzmann simulations on heterogeneous computers

    High-performance computing systems are increasingly based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.
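    As a worked illustration of the kind of partitioning model mentioned above (a minimal sketch, not the paper's actual model): for a memory-bound kernel, if the host sustains an effective memory bandwidth B_h, the accelerator sustains B_a, and a fraction α of the data volume N moved per iteration is assigned to the accelerator, the two devices finish at the same time when

```latex
% Hedged sketch of a simple host/accelerator balance condition for a
% memory-bound kernel; B_h, B_a, N and \alpha are assumed notation,
% not the paper's.
\[
  \frac{\alpha N}{B_a} \;=\; \frac{(1-\alpha)\,N}{B_h}
  \qquad\Longrightarrow\qquad
  \alpha \;=\; \frac{B_a}{B_a + B_h},
  \qquad
  T_{\mathrm{best}} \;\approx\; \frac{N}{B_a + B_h}.
\]
```

    so the optimally achievable performance is set by the aggregate bandwidth of host and accelerator; a compute-bound kernel would follow the same argument with peak arithmetic throughputs in place of bandwidths.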

    Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications

    The Knights Landing (KNL) is the codename for the latest generation of Intel processors based on the Intel Many Integrated Core (MIC) architecture. It relies on massive thread and data parallelism, and on fast on-chip memory. This processor operates in standalone mode, booting an off-the-shelf Linux operating system. The KNL peak performance is very high, approximately 3 Tflops in double precision and 6 Tflops in single precision, but sustained performance depends critically on how well all parallel features of the processor are exploited by real-life applications. We assess the performance of this processor for Lattice Boltzmann codes, widely used in computational fluid dynamics. In our OpenMP code we consider several memory data layouts that meet the conflicting computing requirements of distinct parts of the application and sustain a large fraction of peak performance. We make some performance comparisons with other processors and accelerators, and also discuss the impact of the various memory layouts on energy efficiency.
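    A minimal sketch of the kind of data-layout trade-off discussed above, assuming a D2Q37-style population set (the identifiers and layouts below are illustrative, not the paper's actual code):

```cpp
// Hedged sketch: two candidate memory layouts for LBM population arrays.
// NPOP, SiteAoS and LatticeSoA are assumptions for illustration only.
#include <cstddef>
#include <vector>

constexpr std::size_t NPOP = 37;   // populations per lattice site (D2Q37 assumed)

// Array of Structures (AoS): the populations of one site are contiguous,
// which suits the compute-bound collision step that reads a whole site.
struct SiteAoS { double p[NPOP]; };
using LatticeAoS = std::vector<SiteAoS>;

// Structure of Arrays (SoA): each population is contiguous over all sites,
// so the memory-bound streaming step walks long unit-stride arrays that
// map well onto wide SIMD units and fast on-chip memory.
struct LatticeSoA {
    std::size_t nsites;
    std::vector<double> p;          // NPOP * nsites values
    double& at(std::size_t pop, std::size_t site) { return p[pop * nsites + site]; }
};

int main() {
    const std::size_t nsites = 1024;
    LatticeAoS aos(nsites);
    LatticeSoA soa{nsites, std::vector<double>(NPOP * nsites, 0.0)};
    aos[5].p[0] = 1.0;              // same logical element, two layouts
    soa.at(0, 5) = 1.0;
    return 0;
}
```

    Layouts of this kind, and intermediates between the two, are what allow the conflicting requirements of the collision and propagation kernels to be met at the same time.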

    Massively parallel lattice–Boltzmann codes on large GPU clusters

    This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on one GPU and for good scaling behavior when extended to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. The result of our work is a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
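    The overlap of node-to-node communication with computation mentioned above can be sketched as follows (a minimal host-side illustration assuming a 1D domain decomposition; the function and buffer names are placeholders, not the structure of the paper's code):

```cpp
// Hedged sketch: overlapping halo exchange with bulk lattice updates using
// non-blocking MPI. All function and buffer names are illustrative.
#include <mpi.h>
#include <vector>

// Placeholder kernels: in a real code these would be the collide/stream
// updates (e.g. CUDA kernels launched on the local GPU).
void update_bulk(std::vector<double>& lattice)     { (void)lattice; }
void update_boundary(std::vector<double>& lattice) { (void)lattice; }

void lbm_step(std::vector<double>& lattice,
              std::vector<double>& halo_send, std::vector<double>& halo_recv,
              int left, int right, MPI_Comm comm)
{
    const int half = static_cast<int>(halo_send.size() / 2);
    MPI_Request req[4];

    // Post non-blocking halo exchange with the two neighbouring ranks.
    MPI_Irecv(halo_recv.data(),        half, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(halo_recv.data() + half, half, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(halo_send.data(),        half, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(halo_send.data() + half, half, MPI_DOUBLE, right, 0, comm, &req[3]);

    // Interior sites need no remote data, so their update overlaps with
    // the communication posted above.
    update_bulk(lattice);

    // Wait for the halos, then update the boundary sites that depend on them.
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    update_boundary(lattice);
}
```

    Whether such an overlap actually hides the communication cost is exactly the kind of question the theoretical performance models and benchmarks mentioned in the abstract are meant to answer.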

    Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond

    In this and a set of companion whitepapers, the USQCD Collaboration lays out a program of science and computing for lattice gauge theory. These whitepapers describe how calculations using lattice QCD (and other gauge theories) can aid the interpretation of ongoing and upcoming experiments in particle and nuclear physics, as well as inspire new ones. (44 pages; one of the USQCD whitepapers.)

    Efficient Algorithms And Optimizations For Scientific Computing On Many-Core Processors

    Designing efficient algorithms for many-core and multicore architectures requires different strategies to best exploit the hardware resources of those architectures. Researchers have ported many scientific applications to modern many-core and multicore parallel architectures and, by doing so, have achieved significant speedups over running on single CPU cores. While many applications have achieved significant speedups, some still require more effort to accelerate due to their inherently serial behavior. One class of applications with this serial behavior is Monte Carlo simulations. Monte Carlo simulations have been used to simulate many problems in statistical physics and statistical mechanics that were not possible to simulate using molecular dynamics. While there are a fair number of well-known and recognized GPU molecular dynamics codes, existing Monte Carlo ensemble simulations have not been ported to the GPU, so they are relatively slow and cannot run large systems in a reasonable amount of time.

    Because of these shortcomings of existing Monte Carlo ensemble codes, and because researchers want a fast Monte Carlo simulation framework that can handle large systems, a new GPU framework called GOMC was implemented to simulate different particle- and molecular-based force fields and ensembles. GOMC simulates several Monte Carlo ensembles, such as the canonical, grand canonical, and Gibbs ensembles. This work describes the many challenges in developing a GPU Monte Carlo code for such ensembles and how I addressed them. It also describes efficient many-core and multicore large-scale energy calculations for the Monte Carlo Gibbs ensemble using cell lists. Designing Monte Carlo molecular simulations is challenging, as they have less computation and parallelism than comparable molecular dynamics applications. The modified cell list allows for larger speedups in energy calculations on both many-core and multicore architectures when compared to other implementations that do not use the conventional cell lists. The work presents results and analysis of the cell list algorithms for each of the parallel architectures using top-of-the-line GPUs, CPUs, and Intel’s Phi coprocessors, and evaluates the performance of the cell list algorithms for different problem sizes and different radial cutoffs. It also evaluates two cell list approaches, a hybrid MPI+OpenMP approach and a hybrid MPI+CUDA approach; these methods are evaluated on a small cluster of multicore CPUs, Intel Phi coprocessors, and GPUs, with performance results measured for different combinations of MPI processes, threads, and problem sizes.

    Another application presented in this dissertation involves understanding the properties of crystalline materials and their design and control. Recent developments include the introduction of new models to simulate system behavior and properties that are of large experimental and theoretical interest. One of those models is the Phase-Field Crystal (PFC) model, which has enabled researchers to simulate 2D and 3D crystal structures and study defects such as dislocations and grain boundaries. In this work, GPUs are used to accelerate the computation of various dynamic properties of polycrystals in the 2D PFC model. Some of these properties require very intensive computation that may involve hundreds of thousands of atoms. The GPU implementation has achieved significant speedups of more than 46 times for some large-system simulations.
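    A minimal sketch of the conventional cell-list energy calculation that the work above parallelizes (illustrative only; GOMC's actual data structures, force field, and modified cell list are not reproduced here):

```cpp
// Hedged sketch: serial cell-list energy evaluation for a Lennard-Jones
// fluid in a periodic cubic box. All names are illustrative; the outer
// particle loop is the natural target for OpenMP threads or GPU blocks.
#include <cassert>
#include <cmath>
#include <vector>

struct Particle { double x, y, z; };

// LJ pair energy (reduced units) with minimum-image convention and cutoff.
double pair_energy(const Particle& a, const Particle& b, double box, double rcut2) {
    double dx = a.x - b.x; dx -= box * std::round(dx / box);
    double dy = a.y - b.y; dy -= box * std::round(dy / box);
    double dz = a.z - b.z; dz -= box * std::round(dz / box);
    double r2 = dx * dx + dy * dy + dz * dz;
    if (r2 >= rcut2 || r2 == 0.0) return 0.0;
    double inv6 = 1.0 / (r2 * r2 * r2);
    return 4.0 * (inv6 * inv6 - inv6);
}

// Total energy: each particle interacts only with particles in its own
// cell and the 26 neighbouring cells, instead of with all N particles.
double total_energy(const std::vector<Particle>& p, double box, double rcut) {
    const int ncell = static_cast<int>(box / rcut);   // cells per dimension
    assert(ncell >= 3 && "sketch assumes at least 3 cells per dimension");
    const double cell = box / ncell, rcut2 = rcut * rcut;

    auto wrap = [ncell](int v) { return (v % ncell + ncell) % ncell; };
    auto idx  = [&](int i, int j, int k) { return (wrap(i) * ncell + wrap(j)) * ncell + wrap(k); };

    // Bin particles into cells (coordinates assumed in [0, box)).
    std::vector<std::vector<int>> cells(ncell * ncell * ncell);
    for (int n = 0; n < static_cast<int>(p.size()); ++n)
        cells[idx(int(p[n].x / cell), int(p[n].y / cell), int(p[n].z / cell))].push_back(n);

    double e = 0.0;
    for (int n = 0; n < static_cast<int>(p.size()); ++n) {
        const int ci = int(p[n].x / cell), cj = int(p[n].y / cell), ck = int(p[n].z / cell);
        for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj)
                for (int dk = -1; dk <= 1; ++dk)
                    for (int m : cells[idx(ci + di, cj + dj, ck + dk)])
                        if (m > n) e += pair_energy(p[n], p[m], box, rcut2);
    }
    return e;
}
```

    Loops of this general shape are the natural candidates for distribution over MPI ranks, OpenMP threads, and GPU thread blocks in the hybrid MPI+OpenMP and MPI+CUDA variants the dissertation evaluates.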

    Simulating Nonlinear Neutrino Oscillations on Next-Generation Many-Core Architectures

    In this work an astrophysical simulation code, XFLAT, is developed to study neutrino oscillations in supernovae. XFLAT is a hybrid modular code designed to exploit multiple levels of parallelism through MPI, OpenMP, and SIMD instructions (vectorization). It can run on both the CPU and the Xeon Phi coprocessor, the latter of which is based on the Intel Many Integrated Core (MIC) architecture. The performance of XFLAT on various system configurations and physics scenarios has been analyzed. In addition, the impact of I/O and of multi-node configurations on Xeon Phi-equipped heterogeneous supercomputers, such as Stampede at the Texas Advanced Computing Center (TACC), was investigated.
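    A minimal sketch of the thread-plus-SIMD levels of parallelism described above (the MPI level is omitted; the array names and the toy update are assumptions, not XFLAT's actual physics modules):

```cpp
// Hedged sketch: OpenMP threads split the work while each thread's loop is
// vectorized, mirroring the OpenMP + SIMD levels used by hybrid codes such
// as the one described above. The update itself is a placeholder.
#include <cstddef>
#include <vector>

void advance(std::vector<double>& phase, const std::vector<double>& omega, double dt) {
    const std::size_t n = phase.size();
    // Threads share the loop; 'simd' asks the compiler to vectorize each chunk.
    #pragma omp parallel for simd schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        phase[i] += omega[i] * dt;   // toy phase update, not the real oscillation Hamiltonian
}

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> phase(n, 0.0), omega(n, 1.0);
    advance(phase, omega, 0.01);     // compile with -fopenmp (or equivalent) to enable threading
    return 0;
}
```

    On a Xeon Phi, loops structured this way can exercise all cores and the wide vector units at once, which is the kind of utilization the performance analysis above measures.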