
    A ROSE-Based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries

    OpenMP is a popular and evolving programming model for shared-memory platforms. It relies on compilers to target modern hardware architectures for optimal performance. A variety of extensible and robust research compilers are key to OpenMP's sustainable success in the future. In this paper, we present our efforts to build an OpenMP 3.0 research compiler for C, C++, and Fortran using the ROSE source-to-source compiler framework. Our goal is to support OpenMP research for ourselves and others. We have extended ROSE's internal representation to handle all OpenMP 3.0 constructs, thus facilitating experimentation with them. Since OpenMP research is often complicated by the tight coupling of the compiler translation and the runtime system, we present a set of rules to define a common OpenMP runtime library (XOMP) on top of multiple runtime libraries. These rules additionally define how to build a set of translations targeting XOMP. Our work demonstrates how to reuse OpenMP translations across different runtime libraries. This work simplifies OpenMP research by decoupling the problematic dependence between the compiler translations and the runtime libraries. We present an evaluation of our work by demonstrating an analysis tool for OpenMP correctness. We also show how XOMP can be defined using both GOMP and Omni. Our comparative performance results against other OpenMP compilers demonstrate that our flexible runtime support does not incur additional overhead.
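
    To make the translation scheme concrete, the sketch below shows the general shape of such a source-to-source rewrite: a parallel region is outlined into a function and the pragma is replaced by calls into a common runtime layer. The XOMP_* names and signatures here are illustrative stand-ins, not the verbatim ROSE/XOMP API, and the layer is stubbed to run serially where a real implementation would dispatch to GOMP or Omni.

    #include <stdio.h>

    /* Hypothetical common-runtime entry points (stubbed for illustration). */
    static void XOMP_parallel_start(void (*body)(void *), void *data, int num_threads)
    {
        (void)num_threads;  /* a real layer would spawn or wake a thread team */
        body(data);         /* serial stand-in: run the outlined body once */
    }

    static void XOMP_parallel_end(void)
    {
        /* join/barrier would go here */
    }

    /* Outlined by the translator from: #pragma omp parallel { printf(...); } */
    static void OUT_region_0(void *arg)
    {
        printf("%s\n", (const char *)arg);
    }

    int main(void)
    {
        /* The translator replaces the pragma with runtime calls: */
        XOMP_parallel_start(OUT_region_0, (void *)"hello from outlined region", 0);
        XOMP_parallel_end();
        return 0;
    }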

    XcalableMP PGAS Programming Language

    XcalableMP is a directive-based parallel programming language based on Fortran and C, supporting a Partitioned Global Address Space (PGAS) model for distributed-memory parallel systems. This open access book presents the XcalableMP language, from its programming model and basic concepts to the experience and performance of applications written in XcalableMP. XcalableMP was adopted as a parallel programming language project in the FLAGSHIP 2020 project, which developed the Japanese flagship supercomputer Fugaku, to improve the productivity of parallel programming. XcalableMP is now available on Fugaku, and its performance is enhanced by the Fugaku interconnect, Tofu-D. The global-view programming model of XcalableMP, inherited from High Performance Fortran (HPF), provides an easy and useful way to parallelize data-parallel programs with directives for distributed global arrays, work distribution, and shadow communication. The local-view programming model adopts coarray notation from Coarray Fortran (CAF) to describe explicit communication in a PGAS model. The language specification was designed and proposed by the XcalableMP Specification Working Group, organized in the PC Cluster Consortium, Japan. The Omni XcalableMP compiler is a production-level reference implementation of the XcalableMP compiler for C and Fortran 2008, developed by RIKEN CCS and the University of Tsukuba. XcalableMP has been used on Fugaku as well as the K computer. A performance study showed that XcalableMP achieves scalable performance comparable to message passing interface (MPI) versions, with a clean and easy-to-understand programming style requiring little effort.
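
    As a concrete flavor of the global-view model, the stencil-style sketch below uses the distribution, alignment, shadow, and reflect directives the abstract describes, in XcalableMP's C binding. The directive spellings follow published XMP examples but should be treated as indicative rather than authoritative; the code needs the Omni XMP compiler, while a plain C compiler simply ignores the pragmas and runs it serially.

    #include <stdio.h>
    #define N 16

    #pragma xmp nodes p[4]
    #pragma xmp template t[N]
    #pragma xmp distribute t[block] onto p

    double a[N];
    #pragma xmp align a[i] with t[i]
    #pragma xmp shadow a[1]            /* one halo cell on each side */

    int main(void)
    {
        #pragma xmp loop on t[i]       /* work distribution follows the template */
        for (int i = 0; i < N; i++)
            a[i] = (double)i;

        #pragma xmp reflect (a)        /* shadow (halo) exchange between nodes */

        double sum = 0.0;
        #pragma xmp loop on t[i] reduction(+:sum)
        for (int i = 1; i < N - 1; i++)
            sum += 0.5 * (a[i - 1] + a[i + 1]);

        printf("sum = %f\n", sum);
        return 0;
    }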

    Compiler Optimization Techniques for OpenMP Programs


    Scheduling Dynamic OpenMP Applications over Multicore Architectures

    Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel section, and uses a NUMA-aware work-stealing strategy when load balancing is needed. It has been developed as a plug-in to the ForestGOMP OpenMP platform. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization.
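
    For illustration (this is not the paper's code), the OpenMP fragment below shows the class of workload such a scheduler targets: a divide-and-conquer recursion in which every level opens a nested parallel region, producing the irregular tree of thread teams that an affinity-aware runtime like ForestGOMP tries to keep on nearby cores.

    #include <stdio.h>
    #include <omp.h>

    static long count_leaves(int depth)
    {
        if (depth == 0)
            return 1;
        long left = 0, right = 0;
        /* Nested region: each recursion level may spawn a new thread team. */
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            left = count_leaves(depth - 1);
            #pragma omp section
            right = count_leaves(depth - 1);
        }
        return left + right;
    }

    int main(void)
    {
        omp_set_nested(1);            /* enable nested parallel regions */
        omp_set_max_active_levels(4); /* bound thread creation; deeper levels serialize */
        printf("leaves: %ld\n", count_leaves(6));
        return 0;
    }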

    On Modern Offloading Parallelization Methods: A Critical Analysis of OpenMP

    The very concept of offloading computationally complex routines to a graphics processing unit for general-purpose computing is a problem left wide open to the academic community, both in terms of application and implementation, with several popular interfaces emerging within the last twenty years. The OpenMP standard is among the elites in this category, a parallelization interface that has stood the test of time. The inquiry presented herein has two goals: first, to assess the performance of common sorting algorithms parallelized using OpenMP and offloaded to NVIDIA GPU hardware, and second, to critically analyze the programmer experience of using an implementation of the OpenMP standard (again, with offloading to NVIDIA GPU hardware) to implement these algorithms. For completeness, the empirical analysis contains a comparison to the unparallelized algorithms. From this data and the impressions of the programming experience, strengths and weaknesses of using OpenMP to parallelize and offload sorting algorithms are derived. After discussing each benchmark in depth, as well as the data derived from the parallelized implementations of each, we found that OpenMP's position as one of the foremost parallel programming standards is well justified, with few, but notable, pitfalls for the average programmer. In terms of its performance in parallelizing common sorting algorithms with offloading to NVIDIA GPU hardware, we found that OpenMP fails to deliver implementations of the algorithms that are advantageous over their single-threaded counterparts, though this was found to be not the fault of OpenMP but rather of the inherent nature of offloading to NVIDIA GPU hardware.
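
    To give a flavor of the offloading style under evaluation, the sketch below offloads one pass of an odd-even transposition sort per target region; it is a generic example, not the study's benchmark code. The per-phase map(tofrom:) traffic is exactly the kind of host-device transfer overhead that can erase any GPU advantage for sorting, echoing the conclusion above. Build with an offloading-capable compiler (for example, clang with -fopenmp and an NVPTX offload target; exact flags vary by toolchain); without a device, the region falls back to the host.

    #include <stdio.h>
    #define N 1024

    static void odd_even_sort(int *a, int n)
    {
        for (int phase = 0; phase < n; phase++) {
            int start = phase % 2;  /* alternate between odd and even pairs */
            /* Each pass is offloaded; the map clause transfers the array
               to and from the device every phase. */
            #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
            for (int i = start; i < n - 1; i += 2) {
                if (a[i] > a[i + 1]) {
                    int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                }
            }
        }
    }

    int main(void)
    {
        int a[N];
        for (int i = 0; i < N; i++)
            a[i] = N - i;              /* worst case: descending input */
        odd_even_sort(a, N);
        printf("a[0]=%d a[%d]=%d\n", a[0], N - 1, a[N - 1]);
        return 0;
    }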

    Barcelona OpenMP tasks suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP

    Traditional parallel applications have exploited regular parallelism, based on parallel loops. Only a few applications exploit sections parallelism. With the release of the new OpenMP specification (3.0), this programming model supports tasking. Parallel tasks allow the exploitation of irregular parallelism, but there is a lack of benchmarks exploiting tasks in OpenMP. With current (and projected) multicore architectures, which offer many more alternatives for executing parallel applications than traditional SMP machines, this kind of parallelism is increasingly important, and so is the need for a set of benchmarks to evaluate it. In this paper, we motivate the need for such a benchmark suite, for irregular and/or recursive task parallelism. We present our proposal, the Barcelona OpenMP Tasks Suite (BOTS), a set of applications exploiting regular and irregular parallelism based on tasks. We present an overall evaluation of the BOTS benchmarks on an Altix system and discuss some of the experiments that can be done with the different compilation and runtime alternatives of the benchmarks.
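
    The canonical shape of the task parallelism BOTS exercises is recursive divide-and-conquer; BOTS itself includes a Fibonacci benchmark of this form. The standalone version below is illustrative, requires OpenMP 3.0 or later, and omits the depth cutoffs that the BOTS experiments additionally explore.

    #include <stdio.h>

    static long fib(int n)
    {
        if (n < 2)
            return n;
        long x, y;
        #pragma omp task shared(x)  /* child task; x must be shared to return the result */
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait        /* wait for both children */
        return x + y;
    }

    int main(void)
    {
        long r;
        #pragma omp parallel        /* create the thread team */
        #pragma omp single          /* one thread seeds the task tree */
        r = fib(20);
        printf("fib(20) = %ld\n", r);
        return 0;
    }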

    Towards dynamic threading support for OpenMP


    The AXIOM software layers

    The AXIOM project aims at developing a heterogeneous computing board (SMP-FPGA). This paper explains the software layers developed in the AXIOM project. OmpSs provides an easy way to execute heterogeneous codes on multiple cores. People and objects will soon share the same digital network for information exchange in what has been called the age of cyber-physical systems. The general expectation is that people and systems will interact in real time. This puts pressure on systems design to support increasing demands on computational power while keeping a low power envelope. Additionally, modular scaling and easy programmability are also important to ensure these systems become widespread. This whole set of expectations imposes scientific and technological challenges that need to be properly addressed. The AXIOM project (Agile, eXtensible, fast I/O Module) will research new hardware/software architectures for cyber-physical systems to meet these expectations. The technical approach aims at solving fundamental problems to enable easy programmability of heterogeneous multi-core, multi-board systems. AXIOM proposes the use of the task-based OmpSs programming model, leveraging low-level communication interfaces provided by the hardware. Modular scalability will be possible thanks to a fast interconnect embedded into each module. To this aim, an innovative ARM- and FPGA-based board will be designed, with enhanced capabilities for interfacing with the physical world. Its effectiveness will be demonstrated with key scenarios such as smart video surveillance and smart living/home (domotics).
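
    As a minimal sketch of the OmpSs style mentioned above (not AXIOM project code): OmpSs tasks carry in/out/inout annotations from which the runtime derives a dependency graph, and, unlike plain OpenMP, main itself runs as a task, so no enclosing parallel region is needed. The sketch requires the Mercurium compiler and Nanos++ runtime, and the clause spellings may differ in detail from the current OmpSs specification.

    #include <stdio.h>
    #define N 8

    static void produce(double *v) { for (int i = 0; i < N; i++) v[i] = i; }
    static void scale(double *v)   { for (int i = 0; i < N; i++) v[i] *= 2.0; }

    int main(void)
    {
        double v[N];
        double s = 0.0;

        #pragma omp task out(v)        /* producer */
        produce(v);

        #pragma omp task inout(v)      /* runs after produce: true dependence on v */
        scale(v);

        #pragma omp task in(v) out(s)  /* consumer: waits for scale, produces s */
        {
            s = 0.0;
            for (int i = 0; i < N; i++)
                s += v[i];
        }

        #pragma omp taskwait           /* drain the task graph before printing */
        printf("sum = %f\n", s);
        return 0;
    }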

    Language-Centric Performance Analysis of OpenMP Programs with Aftermath

    We present a new set of tools for the language-centric performance analysis and debugging of OpenMP programs that allows programmers to relate dynamic information from parallel execution to OpenMP constructs. Users can visualize execution traces, examine aggregate metrics on parallel loops and tasks, such as load imbalance or synchronization overhead, and obtain detailed information on specific events, such as the partitioning of a loop's iteration space, its distribution to workers according to the scheduling policy, and fine-grain synchronization. Our work is based on the Aftermath performance analysis tool and a ready-to-use, instrumented version of the LLVM/clang OpenMP runtime with negligible tracing overhead. By analyzing the performance of the MG application of the NPB suite, we show that language-centric performance analysis in general, and our tools in particular, can help significantly improve the performance of large-scale OpenMP applications.
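
    As a small example of the loop-level detail such a tool relates back to the source construct: the schedule clause below determines how the iteration space is partitioned among workers, which is precisely the per-loop information the trace visualization exposes. This is a generic illustration, not the paper's instrumentation.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* schedule(static) would pre-partition iterations into contiguous
           blocks; schedule(dynamic, 4) hands out chunks of 4 on demand,
           which helps when iteration costs are imbalanced. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < 32; i++)
            printf("iter %2d -> thread %d\n", i, omp_get_thread_num());
        return 0;
    }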