1,840 research outputs found

    An Efficient OpenMP Runtime System for Hierarchical Arch

    Get PDF
    Exploiting the full computational power of always deeper hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture. The emergence of multi-core chips and NUMA machines makes it important to minimize the number of remote memory accesses, to favor cache affinities, and to guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we are able to easily transpose affinities of thread teams into scheduling hints using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. The resulting preliminary performance evaluations show an important improvement of the speedup on a typical NAS OpenMP benchmark application

    RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors

    Get PDF
    Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors. This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance

    Complex pipelined executions in OpenMP parallel applications

    Get PDF
    This paper proposes a set of extensions to the OpenMP programming model to express complex pipelined computations. This is accomplished by defining, in the form of directives, precedence relations among the tasks originated from work-sharing constructs. The proposal is based on the definition of a name space that identifies the work parceled out by these work-sharing constructs. Then the programmer defines the precedence relations using this name space. This relieves the programmer from the burden of defining complex synchronization data structures and the insertion of explicit synchronization actions in the program that make the program difficult to understand and maintain. This work is transparently done by the compiler with the support of the OpenMP runtime library. The proposal is motivated and evaluated with a synthetic multi-block example. The paper also includes a description of the compiler and runtime support in the framework of the NanosCompiler for OpenMP.This research has been supported by the Ministry of Education of Spain under contract TIC98-511 and the CEPBA (European Center for Parallelism of Barcelona).Peer ReviewedPostprint (author's final draft

    The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform

    Get PDF
    The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular memory access pattern workloads, therefore its popularity and acceptance is raising within the HPC community. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared memory techniques (OpenMP), we introduce in this report two OpenMP parallelization methods. Due to the increasing importance of Arm architecture in the HPC scenario, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on Cavium ThunderX2 SoC. We consider our work as a contribution to the Arm ecosystem: along with this technical report, we plan in fact to release our code for boosting the tuning of the HPCG benchmark within the Arm community.Postprint (author's final draft
    • …
    corecore