40 research outputs found

    Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking

    Get PDF
    In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the urge for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workload and real programs, and we compare to several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12 × for real benchmarks, which is 60% higher than what we observe with the original Kalray OpenMP implementation

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    Get PDF
    The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends

    Energy Aware Runtime Systems for Elastic Stream Processing Platforms

    Get PDF
    Following an invariant growth in the required computational performance of processors, the multicore revolution started around 20 years ago. This revolution was mainly an answer to power dissipation constraints restricting the increase of clock frequency in single-core processors. The multicore revolution not only brought in the challenge of parallel programming, i.e. being able to develop software exploiting the entire capabilities of manycore architectures, but also the challenge of programming heterogeneous platforms. The question of “on which processing element to map a specific computational unit?”, is well known in the embedded community. With the introduction of general-purpose graphics processing units (GPGPUs), digital signal processors (DSPs) along with many-core processors on different system-on-chip platforms, heterogeneous parallel platforms are nowadays widespread over several domains, from consumer devices to media processing platforms for telecom operators. Finding mapping together with a suitable hardware architecture is a process called design-space exploration. This process is very challenging in heterogeneous many-core architectures, which promise to offer benefits in terms of energy efficiency. The main problem is the exponential explosion of space exploration. With the recent trend of increasing levels of heterogeneity in the chip, selecting the parameters to take into account when mapping software to hardware is still an open research topic in the embedded area. For example, the current Linux scheduler has poor performance when mapping tasks to computing elements available in hardware. The only metric considered is CPU workload, which as was shown in recent work does not match true performance demands from the applications. Doing so may produce an incorrect allocation of resources, resulting in a waste of energy. The origin of this research work comes from the observation that these approaches do not provide full support for the dynamic behavior of stream processing applications, especially if these behaviors are established only at runtime. This research will contribute to the general goal of developing energy-efficient solutions to design streaming applications on heterogeneous and parallel hardware platforms. Streaming applications are nowadays widely spread in the software domain. Their distinctive characiteristic is the retrieving of multiple streams of data and the need to process them in real time. The proposed work will develop new approaches to address the challenging problem of efficient runtime coordination of dynamic applications, focusing on energy and performance management.Efter en oföränderlig tillväxt i prestandakrav hos processorer, började den flerkärniga processor-revolutionen för ungefär 20 år sedan. Denna revolution skedde till största del som en lösning till begränsningar i energieffekten allt eftersom klockfrekvensen kontinuerligt höjdes i en-kärniga processorer. Den flerkärniga processor-revolutionen medförde inte enbart utmaningen gällande parallellprogrammering, m.a.o. förmågan att utveckla mjukvara som använder sig av alla delelement i de flerkärniga processorerna, men också utmaningen med programmering av heterogena plattformar. Frågeställningen ”på vilken processorelement skall en viss beräkning utföras?” är väl känt inom ramen för inbyggda datorsystem. Efter introduktionen av grafikprocessorer för allmänna beräkningar (GPGPU), signalprocesserings-processorer (DSP) samt flerkärniga processorer på olika system-on-chip plattformar, är heterogena parallella plattformar idag omfattande inom många domäner, från konsumtionsartiklar till mediaprocesseringsplattformar för telekommunikationsoperatörer. Processen att placera beräkningarna på en passande hårdvaruplattform kallas för utforskning av en designrymd (design-space exploration). Denna process är mycket utmanande för heterogena flerkärniga arkitekturer, och kan medföra fördelar när det gäller energieffektivitet. Det största problemet är att de olika valmöjligheterna i designrymden kan växa exponentiellt. Enligt den nuvarande trenden som förespår ökad heterogeniska aspekter i processorerna är utmaningen att hitta den mest passande placeringen av beräkningarna på hårdvaran ännu en forskningsfråga inom ramen för inbyggda datorsystem. Till exempel, den nuvarande schemaläggaren i Linux operativsystemet är inkapabel att hitta en effektiv placering av beräkningarna på den underliggande hårdvaran. Det enda mätsättet som används är processorns belastning vilket, som visats i tidigare forskning, inte motsvarar den verkliga prestandan i applikationen. Användning av detta mätsätt vid resursallokering resulterar i slöseri med energi. Denna forskning härstammar från observationerna att dessa tillvägagångssätt inte stöder det dynamiska beteendet hos ström-processeringsapplikationer (stream processing applications), speciellt om beteendena bara etableras vid körtid. Denna forskning kontribuerar till det allmänna målet att utveckla energieffektiva lösningar för ström-applikationer (streaming applications) på heterogena flerkärniga hårdvaruplattformar. Ström-applikationer är numera mycket vanliga i mjukvarudomän. Deras distinkta karaktär är inläsning av flertalet dataströmmar, och behov av att processera dem i realtid. Arbetet i denna forskning understöder utvecklingen av nya sätt för att lösa det utmanade problemet att effektivt koordinera dynamiska applikationer i realtid och fokus på energi- och prestandahantering

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    Get PDF
    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable a portable data acquisition and a subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specification. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis

    Exploiting Multi-Level Parallelism in Streaming Applications for Heterogeneous Platforms with GPUs

    Get PDF
    Heterogeneous computing platforms support the traditional types of parallelism, such as e.g., instruction-level, data, task, and pipeline parallelism, and provide the opportunity to exploit a combination of different types of parallelism at different platform levels. The architectural diversity of platform components makes tapping into the platform potential a challenging programming task. This thesis makes an important step in this direction by introducing a novel methodology for automatic generation of structured, multi-level parallel programs from sequential applications. We introduce a novel hierarchical intermediate program representation (HiPRDG) that captures the notions of structure and hierarchy in the polyhedral model used for compile-time program transformation and code generation. Using the HiPRDG as the starting point, we present a novel method for generation of multi-level programs (MLPs) featuring different types of parallelism, such as task, data, and pipeline parallelism. Moreover, we introduce concepts and techniques for data parallelism identification, GPU code generation, and asynchronous data-driven execution on heterogeneous platforms with efficient overlapping of host-accelerator communication and computation. By enabling the modular, hybrid parallelization of program model components via HiPRDG, this thesis opens the door for highly efficient tailor-made parallel program generation and auto-tuning for next generations of multi-level heterogeneous platforms with diverse accelerators.Computer Systems, Imagery and Medi

    GPU devices for safety-critical systems: a survey

    Get PDF
    Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel, and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources contributes to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices’ random hardware failures, systematic failures, and independence of execution.This work has been partially supported by the European Research Council with Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020- 045931-I).Peer ReviewedPostprint (author's final draft

    Towards Computational Efficiency of Next Generation Multimedia Systems

    Get PDF
    To address throughput demands of complex applications (like Multimedia), a next-generation system designer needs to co-design and co-optimize the hardware and software layers. Hardware/software knobs must be tuned in synergy to increase the throughput efficiency. This thesis provides such algorithmic and architectural solutions, while considering the new technology challenges (power-cap and memory aging). The goal is to maximize the throughput efficiency, under timing- and hardware-constraints

    Automatically Parallelizing Embedded Legacy Software on Soft-Core SoCs

    Get PDF
    Nowadays, embedded systems are utilized in many areas and become omnipresent, making people's lives more comfortable. Embedded systems have to handle more and more functionality in many products. To maintain the often required low energy consumption, multi-core systems provide high performance at moderate energy consumption. The development started with dual-core processors and has today reached many-core designs with dozens and hundreds of processor cores. However, existing applications can barely leverage the potential of that many cores. Legacy applications are usually written sequentially and thus typically use only one processor core. Thus, these applications do not benefit from the advantages provided by modern many-core systems. Rewriting those applications to use multiple cores requires new skills from developers and it is also time-consuming and highly error prone. Dozens of languages, APIs and compilers have already been presented in the past decades to aid the user with parallelizing applications. Fully automatic parallelizing compilers are seen as the holy grail, since the user effort is kept minimal. However, automatic parallelizers often cannot extract parallelism as good as user aided approaches. Most of these parallelization tools are designed for desktop and high-performance systems and are thus not tuned or applicable for low performance embedded systems. To improve this situation, this work presents an automatic parallelizer for embedded systems, which is able to mostly deliver better quality than user aided approaches and if not allows easy manual fine-tuning. Parallelization tools extract concurrently executable tasks from an application. These tasks can then be executed on different processor cores. Parallelization tools and automatic parallelizers in particular often struggle to efficiently map the extracted parallelism to an existing multi-core processor. This work uses soft-core processors on FPGAs, which makes it possible to realize custom multi-core designs in hardware, within a few minutes. This allows to adapt the multi-core processor to the characteristics of the extracted parallelism. Especially, core-interconnects for communication can be optimized to fit the communication pattern of the parallel application. Embedded applications are often structured as follows: receive input data, (multiple) data processing steps, data output. The multiple processing steps are often realized as consecutive loosely coupled transformations. These steps naturally already model the structure of a processing pipeline. It is the goal of this work to extract this kind of pipeline-parallelism from an application and map it to multiple cores to increase the overall throughput of the system. Multiple cores forming a chain with direct communication channels ideally fit this pattern. The previously described, so called pipeline-parallelism is a barely addressed concept in most parallelization tools. Also, current multi-core designs often do not support the hardware flexibility provided by soft-cores, targeted in this approach. The main contribution of this work is an automatic parallelizer which is able to map different processing steps from the source-code of a sequential application to different cores in a multi-core pipeline. Users only specify the required processing speed after parallelization. The developed tool tries to extract a matching parallelized software design along with a custom multi-core design out of sequential embedded legacy applications. The automatically created multi-core system already contains used peripherals extracted from the source-code and is ready to be used. The presented parallelizer implements multi-objective optimization to generate a minimal hardware design, just fulfilling the user defined requirement. To the best of my knowledge, the possibility to generate such a multi-core pipeline defined by the demands of the parallelized software has never been presented before. The approach is implemented for two soft-core processors and evaluation shows for both targets high speedups of 12x and higher at a reasonable hardware overhead. Compared to other automatic parallelizers, which mainly focus on speedups through latency reduction, significantly higher speedups can be achieved depending on the given application structure

    Adding tightly-integrated task scheduling acceleration to a RISC-V multi-core processor

    Get PDF
    Task Parallelism is a parallel programming model that provides code annotation constructs to outline tasks and describe how their pointer parameters are accessed so that they might be executed in parallel, and asynchronously, by a runtime capable of inferring and honoring their data dependence relationships. It is supported by several parallelization frameworks, as OpenMP and StarSs. Overhead related to automatic dependence inference and to the scheduling of ready-to-run tasks is a major performance limiting factor of Task Parallel systems. To amortize this overhead, programmers usually trade the higher parallelism that could be leveraged from finer-grained work partitions for the higher runtime-efficiency of coarser-grained work partitions. Such problems are even more severe for systems with many cores, as the task spawning frequency required for preserving cores from starvation grows linearly with their number. To mitigate these problems, researchers have designed hardware accelerators to improve runtime performance. Nevertheless, the high CPU-accelerator communication overheads of these solutions hampered their gains. We thus propose a RISC-V based architecture that minimizes communication overhead between the HW Task Scheduler and the CPU by allowing Task Scheduling software to directly interact with the former through custom instructions. Empirical evaluation of the architecture is made possible by an FPGA prototype featuring an eight-core Linux-capable Rocket Chip implementing such instructions. To evaluate the prototype performance, we both (1) adapted Nanos, a mature Task Scheduling runtime, to benefit from the new task-scheduling-accelerating instructions; and (2) developed Phentos, a new HW-accelerated light weight Task Scheduling runtime. Our experiments show that task parallel programs using Nanos-RV --- the Nanos version ported to our system --- are on average 2.13 times faster than those being serviced by baseline Nanos, while programs running on Phentos are 13.19 times faster, considering geometric means. Using eight cores, Nanos-RV is able to deliver speedups with respect to serial execution of up to 5.62 times, while Phentos produces speedups of up to 5.72 times.This work was supported by the Spanish Government (projects SEV-2015-0493 and TIN2015-65316-P), the Generalitat de Catalunya (2017-SGR-1414 and 2017-SGR1328), FAPESP (grants 2017/02682-2, 2018/00687-0, and 2014/25694-8), CNPq (grant 408782/2016-1), and CAPES.Peer ReviewedPostprint (author's final draft

    Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos

    Get PDF
    RESUMEN Los sistemas heterogéneos son cada vez más relevantes, debido a sus capacidades de rendimiento y eficiencia energética, estando presentes en todo tipo de plataformas de cómputo, desde dispositivos embebidos y servidores, hasta nodos HPC de grandes centros de datos. Su complejidad hace que sean habitualmente usados bajo el paradigma de tareas y el modelo de programación host-device. Esto penaliza fuertemente el aprovechamiento de los aceleradores y el consumo energético del sistema, además de dificultar la adaptación de las aplicaciones. La co-ejecución permite que todos los dispositivos cooperen para computar el mismo problema, consumiendo menos tiempo y energía. No obstante, los programadores deben encargarse de toda la gestión de los dispositivos, la distribución de la carga y la portabilidad del código entre sistemas, complicando notablemente su programación. Esta tesis ofrece contribuciones para mejorar el rendimiento y la eficiencia energética en estos sistemas masivamente paralelos. Se realizan propuestas que abordan objetivos generalmente contrapuestos: se mejora la usabilidad y la programabilidad, a la vez que se garantiza una mayor abstracción y extensibilidad del sistema, y al mismo tiempo se aumenta el rendimiento, la escalabilidad y la eficiencia energética. Para ello, se proponen dos motores de ejecución con enfoques completamente distintos. EngineCL, centrado en OpenCL y con una API de alto nivel, favorece la máxima compatibilidad entre todo tipo de dispositivos y proporciona un sistema modular extensible. Su versatilidad permite adaptarlo a entornos para los que no fue concebido, como aplicaciones con ejecuciones restringidas por tiempo o simuladores HPC de dinámica molecular, como el utilizado en un centro de investigación internacional. Considerando las tendencias industriales y enfatizando la aplicabilidad profesional, CoexecutorRuntime proporciona un sistema flexible centrado en C++/SYCL que dota de soporte a la co-ejecución a la tecnología oneAPI. Este runtime acerca a los programadores al dominio del problema, posibilitando la explotación de estrategias dinámicas adaptativas que mejoran la eficiencia en todo tipo de aplicaciones.ABSTRACT Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications. Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant), the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union’s Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The Integration II: Hybrid programming models of Chapter 4 has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS)
    corecore