10 research outputs found

    Enabling Locality-Aware Computations in OpenMP

    Get PDF

    An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines

    Get PDF
    International audienceNowadays shared memory HPC platforms expose a large number of cores organized in a hierarchical way. Parallel application programmers strug- gle to express more and more fine-grain parallelism and to ensure locality on such NUMA platforms. Independent loops stand as a natural source of paral- lelism. Parallel environments like OpenMP provide ways of parallelizing them efficiently, but the achieved performance is closely related to the choice of pa- rameters like the granularity of work or the loop scheduler. Considering that both can depend on the target computer, the input data and the loop workload, the application programmer most of the time fails at designing both portable and ef- ficient implementations. We propose in this paper a new OpenMP loop scheduler, called adaptive, that dynamically adapts the granularity of work considering the underlying system state. Our scheduler is able to perform dynamic load balancing while taking memory affinity into account on NUMA architectures. Results show that adaptive outperforms state-of-the-art OpenMP loop schedulers on memory- bound irregular applications, while obtaining performance comparable to static on parallel loops with a regular workload

    Automatic, Abstracted and Portable Topology-Aware Thread Placement

    Get PDF
    International audienceEfficiently programming shared-memory machines is a difficult challenge because mapping application threads onto the memory hierarchy has a strong impact on the performance. However, optimizing such thread placement is difficult: architectures become increasingly complex and application behavior changes with implementations and input parameters, e.g problem size and number of threads. In this work, we propose a fully automatic, abstracted and portable affinity module. It produces and implements an optimized affinity strategy that combines knowledge about application characteristics and the platform topology. Implemented in the back-end of our runtime system (ORWL), our approach was used to enhance the performance and the scalability of several unmodified ORWL-coded applications: matrix multiplication, a 2D stencil (Livermore Kernel 23), and a video tracking real world application. On two SMP machines with quite different hardware characteristics, our tests show spectacular performance improvements for these unmodified application codes due to a dramatic decrease of cache misses and pipeline stalls. A comparison to reference implementations using OpenMP confirms this performance gain of almost one order of magnitude

    Optimizing Data Locality for Fork/Join Programs Using Constrained Work Stealing

    Full text link

    Center for Programming Models for Scalable Parallel Computing - Towards Enhancing OpenMP for Manycore and Heterogeneous Nodes

    Get PDF
    OpenMP was not well recognized at the beginning of the project, around year 2003, because of its limited use in DoE production applications and the inmature hardware support for an efficient implementation. Yet in the recent years, it has been graduately adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked deligiently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standard organization to make sure OpenMP are evolved in the direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support

    Locality Awareness for Task Parallel Computation

    Get PDF
    The task parallel programming model allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the run time system. Efficient scheduling of tasks on multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In this dissertation, we study the performance impact of these issues and other performance factors that limit parallel speedup in task parallel program executions and propose new scheduling strategies to improve performance. Our performance model characterizes lost efficiency in terms of overhead time, idle time, and work time inflation due to increased data access costs. We introduce a hierarchical run time scheduler that combines the benefits of work stealing and parallel depth-first schedulers. Matching the scheduler design to the memory hierarchy of multicore NUMA systems limits costly remote data accesses while maintaining load balance and exploiting constructive data sharing among threads that share a cache. We also propose a locality- based scheduling framework based on locality domains and comprising an API for programmers to specify application locality and a scheduler that honors those specifications. Implementations of the hierarchical and locality-based schedulers in our OpenMP run time system exhibit performance improvements on several task parallel benchmark applications over existing scheduling strategies and production OpenMP run time systems.Doctor of Philosoph

    Enabling Locality-Aware Computations in OpenMP

    No full text
    Locality of computation is key to obtaining high performance on a broad variety of parallel architectures and applications. It is moreover an essential component of strategies for energy-efficient computing. OpenMP is a widely available industry standard for shared memory programming. With the pervasive deployment of multi-core computers and the steady growth in core count, a productive programming model such as OpenMP is increasingly expected to play an important role in adapting applications to this new hardware. However, OpenMP does not provide the programmer with explicit means to program for locality. Rather it presents the user with a “flat” memory model. In this paper, we discuss the need for explicit programmer control of locality within the context of OpenMP and present some ideas on how this might be accomplished. We describe potential extensions to OpenMP that would enable the user to manage a program's data layout and to align tasks and data in order to minimize the cost of data accesses. We give examples showing the intended use of the proposed features, describe our current implementation and present some experimental results. Our hope is that this work will lead to efforts that would help OpenMP to be a major player on emerging, multi- and many-core architectures
    corecore