353 research outputs found

    Design space exploration in heterogeneous platforms using OpenMP

    Get PDF
    In the fields of high performance computing (HPC) and embedded systems, the current trend is to employ heterogeneous platforms which integrate general purpose CPUs with specialized accelerators such as GPUs and FPGAs. Programming these architectures to approach their theoretical performance limits is a complex issue. In this article, we present a design methodology targeting heterogeneous platforms which combines a novel dynamic offloading mechanism for OpenMP and a scheduling strategy for assigning tasks to accelerator devices. The current OpenMP offloading model depends on the compiler supporting each target device, with many architectures still unsupported by the most popular compilers, such as GCC and Clang. In our approach, the software and/or hardware design flows for programming the accelerators are dissociated from the host OpenMP compiler and the device-specific implementations are dynamically loaded at runtime. Moreover, the assignment of tasks to computing resources is dynamically evaluated at runtime, with the aim of maximizing performance when using the available resources. The proposed methodology has been applied to a video processing system as a test case. The results demonstrate the flexibility of the proposal by exploiting different heterogeneous platforms and design particularities of devices, leading to a significant performance improvement.This work has been funded by FEDER/Ministerio de Ciencia, Innovación y Universidades – Agencia Estatal de Investigacion/TEC2017-86722-C4-3-R, also under the FitOptiVis Project (ECSEL2017-1-737451), which is funded by the EU (H2020) and Ministerio de Ciencia, Innovación y Universidades

    FOTV: a generic device offloading framework for OpenMP

    Get PDF
    Since the introduction of the “target” directive in the 4.0 specification, the usage of OpenMP for heterogeneous computing programming has increased significantly. However, the compiler support limits its usage because the code for the accelerated region has to be generated in compile time. This restricts the usage of accelerator-specific design flows (e.g. FPGA hardware synthesis) and the support of new devices that typically requires extending and modifying the compiler itself. This paper explores a solution to these limitations: a generic device that is supported by the OpenMP compiler but whose functionality is defined at runtime. The generic device framework has been integrated in an OpenMP compiler (LLVM/Clang). It acts as a device type for the compiler and interfaces with the physical devices to execute the accelerated code. The framework has an API that provides support for new devices and accelerated code without additional OpenMP compiler modifications. It also includes a code generator that extracts the source code of OpenMP target regions for external compilation chains. In order to evaluate the approach, we present a new device implementation that allows executing OpenCL code as an OpenMP target region. We study the overhead that the framework produces and show that it is minimal and comparable to other OpenMP devices.This work was done as part of the FitOptiVis project, funded by the ECSEL Joint Undertaking, grant H2020-ECSEL-2017–2-783162, and the Spanish MICINN, grant PCI2018–093057. It was partially funded by the Platino project, funded by the MICINN, grant TEC2017–86722-C4–3-R

    Experiences in porting mini-applications to OpenACC and OpenMP on heterogeneous systems

    Get PDF
    This article studies mini-applications—Minisweep, GenASiS, GPP, and FF—that use computational methods commonly encountered in HPC. We have ported these applications to develop OpenACC and OpenMP versions, and evaluated their performance on Titan (Cray XK7 with K20x GPUs), Cori (Cray XC40 with Intel KNL), Summit (IBM AC922 with Volta GPUs), and Cori-GPU (Cray CS-Storm 500NX with Intel Skylake and Volta GPUs). Our goals are for these new ports to be useful to both application and compiler developers, to document and describe the lessons learned and the methodology to create optimized OpenMP and OpenACC versions, and to provide a description of possible migration paths between the two specifications. Cases where specific directives or code patterns result in improved performance for a given architecture are highlighted. We also include discussions of the functionality and maturity of the latest compilers available on the above platforms with respect to OpenACC or OpenMP implementations

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    Get PDF
    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable a portable data acquisition and a subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specification. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis

    OpenMP to CUDA graphs: a compiler-based transformation to enhance the programmability of NVIDIA devices

    Get PDF
    Heterogeneous computing is increasingly being used in a diversity of computing systems, ranging from HPC to the real-time embedded domain, to cope with the performance requirements. Due to the variety of accelerators, e.g., FPGAs, GPUs, the use of high-level parallel programming models is desirable to exploit the performance capabilities of them, while maintaining an adequate productivity level. In that regard, OpenMP is a well-known high-level programming model that incorporates powerful task and accelerator models capable of efficiently exploiting structured and unstructured parallelism in heterogeneous computing. This paper presents a novel compiler transformation technique that automatically transforms OpenMP code into CUDA graphs, combining the benefits of programmability of a high-level programming model such as OpenMP, with the performance benefits of a low-level programming model such as CUDA. Evaluations have been performed on two NVIDIA GPUs from the HPC and embedded domains, i.e., the V100 and the Jetson AGX respectively.This work has been supported by the EU H2020 project AMPERE under the grant agreement no. 871669.Peer ReviewedPostprint (author's final draft

    Many-task computing on many-core architectures

    Get PDF
    Many-Task Computing (MTC) is a common scenario for multiple parallel systems, such as cluster, grids, cloud and supercomputers, but it is not so popular in shared memory parallel processors. In this sense and given the spectacular growth in performance and in number of cores integrated in many-core architectures, the study of MTC on such architectures is becoming more and more relevant. In this paper, authors present what are those programming mechanisms to take advantages of such massively parallel features for the particular target of MTC. Also, the hardware features of the two dominant many-core platforms (NVIDIA's GPUs and Intel Xeon Phi) are also analyzed for our specific framework. Given the important differences in terms of hardware and software in our two many-core platforms, we have considered different strategies based on CUDA (for GPUs) and OpenMP (for Intel Xeon Phi). We carried out several test cases based on an appropriate and widely studied problem for benchmarking as matrix multiplication. Essentially, this study consisted of comparing the time consumed for computing in parallel several tasks one by one (the whole computational resources are used just to compute one task at a time) with the time consumed for computing in parallel the same set of tasks simultaneously (the whole computational resources are used for computing the set of tasks at very same time). Finally, we compared both software-hardware scenarios to identify the most relevant computer features in each of our many-core architectures
    corecore