25 research outputs found

    Optimization of lattice Boltzmann simulations on heterogeneous computers

    Get PDF
    High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performances. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs

    Optimisation of computational fluid dynamics applications on multicore and manycore architectures

    Get PDF
    This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact is demonstrated in a block-structured and an unstructured code of representative size to industrial applications and across a variety of processor architectures that make up contemporary high-performance computing systems. The importance of vectorization and the ways through which this can be achieved is demonstrated in both structured and unstructured solvers together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicitly or implicitly, is shown to be imperative for best performance. The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and 5.7X to 24X on the manycore processors.Open Acces

    Algorithmische und Code-Optimierungen Molekulardynamiksimulationen fĂĽr Verfahrenstechnik

    Get PDF
    The focus of this work lies on implementational improvements and, in particular, node-level performance optimization of the simulation software ls1-mardyn. Through data structure improvements, SIMD vectorization and, especially, OpenMP parallelization, the world’s first simulation of 2*1013 molecules at over 1 PFLOP/sec was enabled. To allow for long-range interactions, the Fast Multipole Method was introduced to ls1-mardyn. The algorithm was optimized for sequential, shared-memory, and distributed-memory execution on up to 32,768 MPI processes.Der Fokus dieser Arbeit liegt auf Code-Optimierungen und insbesondere Leistungsoptimierung auf Knoten-Ebene für die Simulationssoftware ls1-mardyn. Durch verbesserte Datenstrukturen, SIMD-Vektorisierung und vor allem OpenMP-Parallelisierung wurde die weltweit erste Petaflop-Simulation von 2*1013 Molekülen ermöglicht. Zur Simulation von langreichweitigen Wechselwirkungen wurde die Fast-Multipole-Methode in ls1-mardyn eingeführt. Sequenzielle, Shared- und Distributed-Memory-Optimierungen wurden angewandt und erlaubten eine Ausführung auf bis zu 32768 MPI-Prozessen

    Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS

    Get PDF
    The key common bottleneck in most stencil codes is data movement, and prior research has shown that improving data locality through optimisations that schedule across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data locality improving optimisation called iteration space slicing for use in large OPS applications both in shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2Ă—\times on the Cloverleaf 2D/3D proxy application, which contain 83/141 loops respectively, 3.5Ă—3.5\times on the linear solver TeaLeaf, and 1.7Ă—1.7\times on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16GB, and we do scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per loop data access information
    corecore