15 research outputs found

    Multi-Node Advanced Performance and Power Analysis with Paraver

    Get PDF
    Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a preliminary performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. Moreover we show how the same analysis techniques are applicable on different architectures, analyzing the same HPC application running on two clusters, based respectively on Intel Haswell and Arm Cortex-A57 CPUs.The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects, grant agreements n. 288777, 610402 and 671697. E.C. was partially founded by “Contributo 5 per mille assegnato all’Universit`a degli Studi di Ferrara - dichiarazione dei redditi dell’anno 2014”.Peer ReviewedPostprint (author's final draft

    Optimization of lattice Boltzmann simulations on heterogeneous computers

    Get PDF
    High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performances. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs

    Massively parallel lattice–Boltzmann codes on large GPU clusters

    Get PDF
    This paper describes a massively parallel code for a state-of-the art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have been already used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally we compare the results of our GPU code with those measured on other currently available high performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops as well as a design and optimization methodology that can be used for the development of other high performance applications for computational physics

    exploiting parallelism in many core architectures lattice boltzmann models as a test case

    Get PDF
    In this paper we address the problem of identifying and exploiting techniques that optimize the performance of large scale scientific codes on many-core processors. We consider as a test-bed a state-of-the-art Lattice Boltzmann (LB) model, that accurately reproduces the thermo-hydrodynamics of a 2D-fluid obeying the equations of state of a perfect gas. The regular structure of Lattice Boltzmann algorithms makes it relatively easy to identify a large degree of available parallelism; the challenge is that of mapping this parallelism onto processors whose architecture is becoming more and more complex, both in terms of an increasing number of independent cores and – within each core – of vector instructions on longer and longer data words. We take as an example the Intel Sandy Bridge micro-architecture, that supports AVX instructions operating on 256-bit vectors; we address the problem of efficiently implementing the key computational kernels of LB codes – streaming and collision – on this family of processors; we introduce several successive optimization steps and quantitatively assess the impact of each of them on performance. Our final result is a production-ready code already in use for large scale simulations of the Rayleigh-Taylor instability. We analyze both raw performance and scaling figures, and compare with GPU-based implementations of similar codes

    Evolution of a double-front Rayleigh-Taylor system using a GPU-based high resolution thermal Lattice-Boltzmann model

    Full text link
    We study the turbulent evolution originated from a system subjected to a Rayleigh-Taylor instability with a double density at high resolution in a 2 dimensional geometry using a highly optimized thermal Lattice Boltzmann code for GPUs. The novelty of our investigation stems from the initial condition, given by the superposition of three layers with three different densities, leading to the development of two Rayleigh-Taylor fronts that expand upward and downward and collide in the middle of the cell. By using high resolution numerical data we highlight the effects induced by the collision of the two turbulent fronts in the long time asymptotic regime. We also provide details on the optimized Lattice-Boltzmann code that we have run on a cluster of GPU
    corecore