3 research outputs found

    Embracing a new era of highly efficient and productive quantum Monte Carlo simulations

    Full text link
QMCPACK has enabled cutting-edge materials research on supercomputers for over a decade. It scales nearly ideally but has low single-node efficiency because its physics-based abstractions use array-of-structures objects, which vectorize poorly. We present a systematic approach to transforming QMCPACK to better exploit the new hardware features of modern CPUs in portable and maintainable ways. We develop miniapps for fast prototyping and optimization. We implement new containers in a structure-of-arrays data layout to facilitate vectorization by the compilers. Further speedups and smaller memory footprints are obtained by computing data on the fly with the vectorized routines and by expanding the use of single precision. All of these changes are seamlessly incorporated into production QMCPACK. We demonstrate up to 4.5x speedups on recent Intel processors and IBM Blue Gene/Q for representative workloads. Energy consumption is reduced significantly, commensurate with the speedup factor. Memory footprints are reduced by up to 3.8x, opening the possibility of solving much larger problems in the future.
    Comment: 12 pages, 10 figures, 2 tables, to be published at SC1

    Honing and proofing Astrophysical codes on the road to Exascale. Experiences from code modernization on many-core systems

    Full text link
The complexity of modern and upcoming computing architectures poses severe challenges for code developers and application specialists, forcing them to expose the highest possible degree of parallelism in order to make the best use of the available hardware. The second-generation Intel Xeon Phi (code-named Knights Landing, henceforth KNL) is the latest many-core system, implementing several interesting hardware features such as a large number of cores per node (up to 72), 512-bit-wide vector registers, and high-bandwidth memory. The unique features of KNL make this platform a powerful testbed for modern HPC applications, and the performance of codes on KNL is therefore a useful proxy of their readiness for future architectures. In this work we describe the lessons learnt during the optimisation of the widely used computational astrophysics codes P-Gadget-3, Flash and Echo. Moreover, we present results for the visualisation and analysis tools VisIt and yt. These examples show that modern architectures benefit from code optimisation at different levels, even more than traditional multi-core systems do. However, the level of modernisation of typical community codes still needs improvement for them to fully utilise the resources of novel architectures.
    Comment: 16 pages, 10 figures, 4 tables. To be published in Future Generation Computer Systems (FGCS), Special Issue "On The Road to Exascale II: Advances in High Performance Computing and Simulations"

    Memory-Efficient Object-Oriented Programming on GPUs

    Full text link
Object-oriented programming is often regarded as too inefficient for high-performance computing (HPC), despite the fact that many important HPC problems have an inherent object structure. Our goal is to bring efficient object-oriented programming to massively parallel SIMD architectures, especially GPUs. In this thesis, we develop various techniques for optimizing object-oriented GPU code. Most notably, we identify the object-oriented Single-Method Multiple-Objects (SMMO) programming model. We first develop an embedded C++ Structure of Arrays (SOA) data-layout DSL for SMMO applications. We then design a lock-free, dynamic memory allocator that stores allocations in SOA layout. Finally, we show how to further optimize the memory access of SMMO applications with memory defragmentation.
    Comment: Ph.D. Thesis