3 research outputs found

    Mixed-data-model heterogeneous compilation and OpenMP offloading

    Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever-growing amounts of application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and the accelerator. Today 64-bit host processors are commonplace, but few accelerators exceed 32-bit addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption. In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7% compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.
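    The transparent offloading the abstract describes builds on standard OpenMP target constructs, in which `map()` clauses make the host-to-accelerator data movement explicit. A minimal sketch (not the paper's compiler; just the OpenMP offloading idiom it layers its mixed-data-model support on, with an assumed vector-add kernel):

    ```c
    #include <stdlib.h>

    /* Offload a vector addition to the accelerator via OpenMP target.
       If no device is available, the OpenMP runtime falls back to host
       execution, so the result is the same either way. */
    void vector_add(const double *a, const double *b, double *c, int n) {
        /* map() clauses spell out the host-device transfers: a and b are
           copied to device memory, c is copied back to the host. */
        #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }
    ```

    In a mixed-data-model setting, the array sections in the `map()` clauses are where a 64-bit host pointer must be translated to a 32-bit accelerator address, which is the translation the paper's compiler automates.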

    Evaluation of GPU-specific device directives and multi-dimensional data structures in OpenMP

    OpenMP target offload has been in its inception phase for some time but has been gaining traction in recent years, with more compilers supporting the constructs and optimising them. Its ease of programming compared to other models makes it quite desirable in industry. This work investigates how different compilers interact with the different constructs at runtime and how the callbacks affect the performance of each compiler. We also dive into the programs in the Polybench benchmark suite, with the main focus on linear algebra, to generate an OpenMP GPU target offload implementation with parallelization techniques obtained from DiscoPoP for the C language. The main focus is on DOALL and reduction loops, which fall under loop-level parallelism. We also analyze the compiler used and examine its behaviour in code generation for OpenMP target offload. While converting these benchmarks, we faced a myriad of issues related to mapping multi-dimensional data structures from the host onto the target while using the GCC compiler. The main work done to counter this issue was to propose a code transformation algorithm that efficiently resolves the issues without losing the correctness of the programs.
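    The mapping problem the abstract alludes to typically arises because a pointer-to-pointer matrix is not contiguous, so a single `map()` clause cannot describe it. A common transformation (a hedged sketch of the general flattening idea, not necessarily the paper's exact algorithm; the function name is illustrative) copies the rows into one contiguous buffer, maps that, and indexes it as `i * cols + j` on the device:

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Scale a row-pointer matrix on the device by first flattening it.
       A double** cannot be mapped directly with one map() clause, since
       its rows live at unrelated addresses; a contiguous copy can. */
    void matrix_scale_flat(double **m, int rows, int cols, double s) {
        double *flat = malloc(sizeof(double) * rows * cols);
        for (int i = 0; i < rows; ++i)
            memcpy(flat + (size_t)i * cols, m[i], sizeof(double) * cols);

        /* One contiguous array section is trivially mappable. */
        #pragma omp target teams distribute parallel for collapse(2) \
            map(tofrom: flat[0:rows*cols])
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                flat[i * cols + j] *= s;

        /* Copy the result back into the original row-pointer layout. */
        for (int i = 0; i < rows; ++i)
            memcpy(m[i], flat + (size_t)i * cols, sizeof(double) * cols);
        free(flat);
    }
    ```

    The trade-off is an extra host-side copy per transfer, which is usually cheaper than issuing one `map()` per row.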