166 research outputs found

    Enhancing productivity and performance portability of opencl applications on heterogeneous systems using runtime optimizations

    Get PDF
    Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity and the unachievable flexibility and portability necessary to optimize for each target individually, heterogeneous systems remain typically vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non intrusive methods in the form of compiler tools and implementing efficient abstractions to automatically tune parameters for a restricted domain are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces significant software engineering effort spent to tune application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library based approach is designed to provide a high level abstraction for complex problems in a specific domain, stencil computation. Using domain specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs and 3.6x on four

    Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos

    Get PDF
    RESUMEN Los sistemas heterogéneos son cada vez más relevantes, debido a sus capacidades de rendimiento y eficiencia energética, estando presentes en todo tipo de plataformas de cómputo, desde dispositivos embebidos y servidores, hasta nodos HPC de grandes centros de datos. Su complejidad hace que sean habitualmente usados bajo el paradigma de tareas y el modelo de programación host-device. Esto penaliza fuertemente el aprovechamiento de los aceleradores y el consumo energético del sistema, además de dificultar la adaptación de las aplicaciones. La co-ejecución permite que todos los dispositivos cooperen para computar el mismo problema, consumiendo menos tiempo y energía. No obstante, los programadores deben encargarse de toda la gestión de los dispositivos, la distribución de la carga y la portabilidad del código entre sistemas, complicando notablemente su programación. Esta tesis ofrece contribuciones para mejorar el rendimiento y la eficiencia energética en estos sistemas masivamente paralelos. Se realizan propuestas que abordan objetivos generalmente contrapuestos: se mejora la usabilidad y la programabilidad, a la vez que se garantiza una mayor abstracción y extensibilidad del sistema, y al mismo tiempo se aumenta el rendimiento, la escalabilidad y la eficiencia energética. Para ello, se proponen dos motores de ejecución con enfoques completamente distintos. EngineCL, centrado en OpenCL y con una API de alto nivel, favorece la máxima compatibilidad entre todo tipo de dispositivos y proporciona un sistema modular extensible. Su versatilidad permite adaptarlo a entornos para los que no fue concebido, como aplicaciones con ejecuciones restringidas por tiempo o simuladores HPC de dinámica molecular, como el utilizado en un centro de investigación internacional. Considerando las tendencias industriales y enfatizando la aplicabilidad profesional, CoexecutorRuntime proporciona un sistema flexible centrado en C++/SYCL que dota de soporte a la co-ejecución a la tecnología oneAPI. Este runtime acerca a los programadores al dominio del problema, posibilitando la explotación de estrategias dinámicas adaptativas que mejoran la eficiencia en todo tipo de aplicaciones.ABSTRACT Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications. Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant), the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union’s Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The Integration II: Hybrid programming models of Chapter 4 has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS)

    Towards Performance Portable Programming for Distributed Heterogeneous Systems

    Full text link
    Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware in the future. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and optimizations performed, achieving up to 300% improvement in a shared memory benchmark and up to 10 times in distributed device communication. Preliminary results indicate that our software incurs low overhead and achieves 40% improvement in a distributed Jacobi proxy application while hiding the idiosyncrasies of the hardware

    FLASH 1.0: A Software Framework for Rapid Parallel Deployment and Enhancing Host Code Portability in Heterogeneous Computing

    Full text link
    In this paper, we present FLASH 1.0, a C++-based software framework for rapid parallel deployment and enhancing host code portability in heterogeneous computing. FLASH takes a novel approach in describing kernels and dynamically dispatching them in a hardware-agnostic manner. FLASH features truly hardware-agnostic frontend interfaces, which not only unify the compile-time control flow but also enforces a portability-optimized code organization that imposes a demarcation between computational (performance-critical) and functional (non-performance-critical) codes as well as the separation of hardware-specific and hardware-agnostic codes in the host application. We use static code analysis to measure the hardware independence ratio of popular HPC applications and show that up to 99.72% code portability can be achieved with FLASH. Similarly, we measure the complexity of state-of-the-art portable programming models and show that a code reduction of up to 2.2x can be achieved for two common HPC kernels while maintaining 100% code portability with a normalized framework overhead between 1% - 13% of the total kernel runtime. The codes are available at https://github.com/PSCLab-ASU/FLASH.Comment: 12 page

    OpenMP to CUDA graphs: a compiler-based transformation to enhance the programmability of NVIDIA devices

    Get PDF
    Heterogeneous computing is increasingly being used in a diversity of computing systems, ranging from HPC to the real-time embedded domain, to cope with the performance requirements. Due to the variety of accelerators, e.g., FPGAs, GPUs, the use of high-level parallel programming models is desirable to exploit the performance capabilities of them, while maintaining an adequate productivity level. In that regard, OpenMP is a well-known high-level programming model that incorporates powerful task and accelerator models capable of efficiently exploiting structured and unstructured parallelism in heterogeneous computing. This paper presents a novel compiler transformation technique that automatically transforms OpenMP code into CUDA graphs, combining the benefits of programmability of a high-level programming model such as OpenMP, with the performance benefits of a low-level programming model such as CUDA. Evaluations have been performed on two NVIDIA GPUs from the HPC and embedded domains, i.e., the V100 and the Jetson AGX respectively.This work has been supported by the EU H2020 project AMPERE under the grant agreement no. 871669.Peer ReviewedPostprint (author's final draft

    PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability

    Full text link
    We propose a novel computing runtime that exposes remote compute devices via the cross-vendor open heterogeneous computing standard OpenCL and can execute compute tasks on the MEC cluster side across multiple servers in a scalable manner. Intermittent UE connection loss is handled gracefully even if the device's IP address changes on the way. Network-induced latency is minimized by transferring data and signaling command completions between remote devices in a peer-to-peer fashion directly to the target server with a streamlined TCP-based protocol that yields a command latency of only 60 microseconds on top of network round-trip latency in synthetic benchmarks. The runtime can utilize RDMA to speed up inter-server data transfers by an additional 60% compared to the TCP-based solution. The benefits of the proposed runtime in MEC applications are demonstrated with a smartphone-based augmented reality rendering case study. Measurements show up to 19x improvements to frame rate and 17x improvements to local energy consumption when using the proposed runtime to offload AR rendering from a smartphone. Scalability to multiple GPU servers in real-world applications is shown in a computational fluid dynamics simulation, which scales with the number of servers at roughly 80% efficiency which is comparable to an MPI port of the same simulation.Comment: 13 pages, 17 figure

    X-MAP A Performance Prediction Tool for Porting Algorithms and Applications to Accelerators

    Get PDF
    Most modern high-performance computing systems comprise of one or more accelerators with varying architectures in addition to traditional multicore Central Processing Units (CPUs). Examples of these accelerators include Graphic Processing Units (GPU) and Intel’s Many Integrated Cores architecture called Xeon Phi (PHI). These architectures provide massive parallel computation capabilities, which provide substantial performance benefits over traditional CPUs for a variety of scientific applications. We know that all accelerators are not similar because each of them has their own unique architecture. This difference in the underlying architecture plays a crucial role in determining if a given accelerator will provide a significant speedup over its competition. In addition to the architecture itself, one more differentiating factor for these accelerators is the programming language used to program them. For example, Nvidia GPUs can be programmed using Compute Unified Device Architecture (CUDA) and OpenCL while Intel Xeon PHIs can be programmed using OpenMP and OpenCL. The choice of programming language also plays a critical role in the speedup obtained depending on how close the language is to the hardware in addition to the level of optimization. With that said, it is thus very difficult for an application developer to choose the ideal accelerator to achieve the best possible speedup. In light of this, we present an easy to use Graphical User Interface (GUI) Tool called X-MAP which is a performance prediction tool for porting algorithms and applications to architectures which encompasses a Machine Learning based inference model to predict the performance of an applica-tion on a number of well-known accelerators and at the same time predict the best architecture and programming language for the application. We do this by collecting hardware counters from a given application and predicting run time by providing this data as inputs to a Neural Network Regressor based inference model. We predict the architecture and associated programming language by pro viding the hardware counters as inputs to an inference model based on Random Forest Classification Model. Finally, with a mean absolute prediction error of 8.52 and features such as syntax high-lighting for multiple programming languages, a function-wise breakdown of the entire application to understand bottlenecks and the ability for end users to submit their own prediction models to further improve the system, makes X-MAP a unique tool that has a significant edge over existing performance prediction solutions

    Software/Hardware Co-Design to Improve Productivity, Portability, and Performance of Loop-Task Parallel Applications

    Full text link
    Computer architects are increasingly turning to programmable accelerators tailored for narrower classes of applications in order to achieve high performance and energy efficiency. A continuing challenge with accelerators is enabling the programmer to easily extract maximum performance without intimate knowledge of the underlying microarchitecture. It is important to consider productivity and portability, in addition to performance, as first-class metrics when developing and evaluating modern computing platforms. Software-centric approaches to achieving 3P computing platforms are compelling, but sacrifice efficiency and flexibility by hiding parallel abstractions from hardware and limiting the scope of the application domain. This thesis proposes a new software/hardware co-design approach to achieving 3P platforms, called the loop-task accelerator (LTA) platform, that provides high productivity and portability without sacrificing performance or efficiency across a wide range of applications. The LTA platform addresses the weaknesses of existing approaches that are identified through detailed experimentation with and analysis of modern application development. Discussion of an early attempt at a hardware-centric approach to achieving 3P platforms provides insight into area-efficient accelerator designs and highlights the need for innovations in both software and hardware. The LTA platform focuses on exploiting loop-task parallelism by exposing loop-tasks as a common parallel abstraction at the programming API, runtime, ISA, and microarchitectural levels. The LTA programming API uses the parallel_for construct to express loop-tasks that can be exploited both across cores and within a core, the LTA runtime distributes loop-tasks across cores, and a new xpfor instruction explicitly encodes loop-tasks as functions applied to a range of loop iterations. This thesis introduces a novel task-coupling taxonomy that captures how tasks can be coupled in both space and time. The LTA engine template can be configured at design time with variable spatial and temporal task coupling to accelerate the execution of both regular and irregular loop-tasks within a core. The LTA platform is evaluated with respect to the 3P’s using a vertically integrated research methodology. Compared to an in-order multi-core baseline, the LTA platform yields average improvements of 5.5× in raw performance, 2.5× in performance per area, and 1.2× in energy efficiency, while offering high productivity and portability
    corecore