3,592 research outputs found

    The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform

    Get PDF
    The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular memory access pattern workloads, therefore its popularity and acceptance is raising within the HPC community. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared memory techniques (OpenMP), we introduce in this report two OpenMP parallelization methods. Due to the increasing importance of Arm architecture in the HPC scenario, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on Cavium ThunderX2 SoC. We consider our work as a contribution to the Arm ecosystem: along with this technical report, we plan in fact to release our code for boosting the tuning of the HPCG benchmark within the Arm community.Postprint (author's final draft

    A Note on Solving Problem 7 of the SIAM 100-Digit Challenge Using C-XSC

    Get PDF
    C-XSC is a powerful C++ class library which simplifies the development of selfverifying numerical software. But C-XSC is not only a development tool, it also provides a lot of predefined highly accurate routines to compute reliable bounds for the solution to standard numerical problems. In this note we discuss the usage of a reliable linear system solver to compute the solution of problem 7 of the SIAM 100-digit challenge. To get the result we have to solve a 20 000 × 20 000 system of linear equations using interval computations. To perform this task we run our software on the advanced Linux cluster engine ALiCEnext located at the University of Wuppertal and on the high performance computer HP XC6000 at the computing center of the University of Karlsruhe. The main purpose of this note is to demonstrate the power/weakness of our approach to solve linear interval systems with a large dense system matrix using C-XSC and to get feedback from other research groups all over the world concerned with the topic described. We are very much interested to see comparisons concerning different methods/algorithms, timings, memory consumptions, and different hardware/software environments. It should be easy to adapt our main routine (see Section 3 below) to other programming languages, and different computing environments. Changing just one variable allows the generation of arbitrary large system matrices making it easy to do sound (reproducible and comparable) timings and to check for the largest possible system size that can be handled successfully by a specific package/environment

    A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices

    Get PDF
    We present the submatrix method, a highly parallelizable method for the approximate calculation of inverse p-th roots of large sparse symmetric matrices which are required in different scientific applications. We follow the idea of Approximate Computing, allowing imprecision in the final result in order to be able to utilize the sparsity of the input matrix and to allow massively parallel execution. For an n x n matrix, the proposed algorithm allows to distribute the calculations over n nodes with only little communication overhead. The approximate result matrix exhibits the same sparsity pattern as the input matrix, allowing for efficient reuse of allocated data structures. We evaluate the algorithm with respect to the error that it introduces into calculated results, as well as its performance and scalability. We demonstrate that the error is relatively limited for well-conditioned matrices and that results are still valuable for error-resilient applications like preconditioning even for ill-conditioned matrices. We discuss the execution time and scaling of the algorithm on a theoretical level and present a distributed implementation of the algorithm using MPI and OpenMP. We demonstrate the scalability of this implementation by running it on a high-performance compute cluster comprised of 1024 CPU cores, showing a speedup of 665x compared to single-threaded execution

    A fast GPU Monte Carlo Radiative Heat Transfer Implementation for Coupling with Direct Numerical Simulation

    Full text link
    We implemented a fast Reciprocal Monte Carlo algorithm, to accurately solve radiative heat transfer in turbulent flows of non-grey participating media that can be coupled to fully resolved turbulent flows, namely to Direct Numerical Simulation (DNS). The spectrally varying absorption coefficient is treated in a narrow-band fashion with a correlated-k distribution. The implementation is verified with analytical solutions and validated with results from literature and line-by-line Monte Carlo computations. The method is implemented on GPU with a thorough attention to memory transfer and computational efficiency. The bottlenecks that dominate the computational expenses are addressed and several techniques are proposed to optimize the GPU execution. By implementing the proposed algorithmic accelerations, a speed-up of up to 3 orders of magnitude can be achieved, while maintaining the same accuracy
    • …
    corecore