133 research outputs found

    Performance analysis of a hardware accelerator of dependence management for taskbased dataflow programming models

    Get PDF
    Along with the popularity of multicore and manycore, task-based dataflow programming models obtain great attention for being able to extract high parallelism from applications without exposing the complexity to programmers. One of these pioneers is the OpenMP Superscalar (OmpSs). By implementing dynamic task dependence analysis, dataflow scheduling and out-of-order execution in runtime, OmpSs achieves high performance using coarse and medium granularity tasks. In theory, for the same application, the more parallel tasks can be exposed, the higher possible speedup can be achieved. Yet this factor is limited by task granularity, up to a point where the runtime overhead outweighs the performance increase and slows down the application. To overcome this handicap, Picos was proposed to support task-based dataflow programming models like OmpSs as a fast hardware accelerator for fine-grained task and dependence management, and a simulator was developed to perform design space exploration. This paper presents the very first functional hardware prototype inspired by Picos. An embedded system based on a Zynq 7000 All-Programmable SoC is developed to study its capabilities and possible bottlenecks. Initial scalability and hardware consumption studies of different Picos designs are performed to find the one with the highest performance and lowest hardware cost. A further thorough performance study is employed on both the prototype with the most balanced configuration and the OmpSs software-only alternative. Results show that our OmpSs runtime hardware support significantly outperforms the software-only implementation currently available in the runtime system for finegrained tasks.This work is supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project, by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Research Council RoMoL Grant Agreement number 321253. We also thank the Xilinx University Program for its hardware and software donations.Peer ReviewedPostprint (published version

    A hardware runtime for task-based programming models

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Task-based programming models such as OpenMP 5.0 and OmpSs are simple to use and powerful enough to exploit task parallelism of applications over multicore, manycore and heterogeneous systems. However, their software-only runtimes introduce relevant overhead when targeting fine-grained tasks, resulting in performance losses. To overcome this drawback, we present a hardware runtime Picos++ that accelerates critical runtime functions such as task dependence analysis, nested task support, and heterogeneous task scheduling. As a proof-of-concept, the Picos++ hardware runtime has been integrated with a compiler infrastructure that supports parallel task-based programming models. A FPGA SoC running Linux OS has been used to implement the hardware accelerated part of Picos++, integrated with a heterogeneous system composed of 4 symmetric multiprocessor (SMP) cores and several hardware functional accelerators (HwAccs) for task execution. Results show significant improvements on energy and performance compared to state-of-the-art parallel software-only runtimes. With Picos++, applications can achieve up to 7.6x speedup and save up to 90 percent of energy, when using 4 threads and up to 4 HwAccs, and even reach a speedup of 16x over the software alternative when using 12 HwAccs and small tasks.Peer ReviewedPostprint (author's final draft

    Exploring Processor and Memory Architectures for Multimedia

    Get PDF
    Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using today’s most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture

    Hardware runtime management for task-based programming models

    Get PDF
    Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks although they impose overheads on the application execution to maintain all the information they need to do their work. However, as the current trend in processor architectures keeps including more cores and heterogeneity (in fact complexity) in the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable as they are able to expose higher parallelism in applications but the overheads introduced by the software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task dependence graph management and task scheduling to heterogeneous systems. We propose a hardware architecture Picos, consisting of a hardware task dependence manager including nested task support, and a heterogeneous task scheduler, to accelerate the critical runtime functions for task-based programming models. With Picos, we aim at extending the benefit of these programming models into exploiting fine-grained task parallelism and heterogeneity. As a proof-of-concept, Three prototypes of Picos have been designed in VHDL and implemented in a System-on-chip platform consisting of regular ARM SMP cores and an integrated FPGA. They also have been analyzed with real benchmarks with OmpSs running and Linux on the platform. The first prototype is a hardware task dependence manager, which has been implemented in a Xilinx Zynq 7000 series SoCs. It is connected to a 2-core ARM Cortex A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to 21x speedup. The second prototype Picos++ extended Picos with an exciting new feature for nested task support in hardware. To the best of our knowledge, this is the first time that such a feature has been support fully in hardware task dependence managers. This prototype is fully integrated in not only hardware, but also with a State-of-the-Art parallel programming model, and with Linux. The third prototype includes both a hardware task dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and then schedule them to hardware execution units that have the estimated earliest finish time. It is implemented in a Xilinx Zynq Ultrascale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to 16.2x speedup for real benchmarks, and saves up to 90% of energy.Los modelos de programación basados en tareas permiten a los programadores expresar las aplicaciones como una colección de tareas con dependencias entre ellas. Dichos modelos son simples de usar y mejoran enormemente la programabilidad. Para ello se valen del uso de una runtime que en tiempo de ejecución ayuda a explotar el paralelismo de las tareas cuando se ejecutan en plataformas multi-cores, many-cores y heterogéneas. En estos modelos de programación los runtimes garantizan la ejecución de las tareas en el orden correcto mediante el uso de gráficos de dependencias entre tareas (TDG). Actualmente, los runtimes son lo suficientemente potentes para proporcionar un alto rendimiento con tareas de granularidad gruesa a pesar de que para mantener toda la información que necesitan para hacer su trabajo, introducen un sobrecoste importante en la ejecución de las aplicaciones. El problema viene dado por la tendencia actual en arquitectura de computadores a seguir incluyendo más núcleos y heterogeneidad (de hecho, complejidad) en los sistemas de procesado con lo que el paralelismo de granularidad gruesa no es suficiente para alimentar todos los recursos. En estos entornos complejos las tareas de granularidad fina son preferibles ya que son capaces de exponer un mayor paralelismo de las aplicaciones. Sin embargo, con tareas de granularidad fina, los sobrecostes introducidos por los runtimes software son mayores debido a la necesidad de manejar muchas más tareas más rápido. En general los mayores sobrecostes introducidos por los runtimes son: la administración de los grafos de dependencias que relacionan las tareas y la gestión de las tareas en sistemas heterogéneos. Proponemos una arquitectura hardware, llamada Picos, que consiste en un administrador de dependencias entre tareas incluyendo soporte para tareas anidadas y planificación de tareas heterogéneas. La función principal de dicha arquitectura es acelerar las funciones críticas de los runtimes para modelos de programación basados en tareas. Con Picos, se pretende extender el beneficio de estos modelos de programación para explotar el paralelismo y la heterogeneidad ejecutando tareas de granularidad fina. Como prueba de concepto, tres prototipos de Picos han sido diseñado en VHDL e implementado en una plataforma System-on-chip que consta de varios núcleos ARM integrados junto con una FPGA, y ademas analizados con ejecuciones reales con OmpSs y con Linux. El primer prototipo es un administrador hardware de tareas con dependencias, que se ha implementado en un SoC Xilinx Zynq serie 7000. Está conectado a un procesador ARM Cortex A9 de 2 núcleos, e integrado con el SO. Con 24 núcleos simulados y realizando el análisis real de las dependencias entre tareas en Picos, obtiene hasta un 21x sobre las mismas ejecuciones usando el entorno software. El segundo prototipo, Picos++, amplió Picos incorporando el soporte para la gestión de tareas anidadas en hardware. Hasta donde llega nuestro conocimiento, esta es la primera vez que dicha característica ha sido propuesta y/o incorporada en un administrador hardware de dependencias entre tareas. El segundo prototipo está completamente integrado en el sistema, no solo en hardware, sino también con el modelo de programación paralelo y con el sistema operativo. El tercer prototipo, incluye un administrador y planificador de tareas heterogéneas. El planificador de tareas heterogéneas recibe dichas tareas listas del administrador de dependencias entre tareas y las programa en la unidad de ejecución de hardware adecuada que tenga el tiempo de finalización estimado más corto. Este prototipo se ha implementado en un chip MPSoC Xilinx Zynq Ultrascale+. En dicho sistema con cuatro núcleos ARM y hasta 15 aceleradores HW funcionales, logra una aceleración de hasta 16.2x, y ahorra hasta el 90% de la energía con respecto al software

    Hardware runtime management for task-based programming models

    Get PDF
    Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks although they impose overheads on the application execution to maintain all the information they need to do their work. However, as the current trend in processor architectures keeps including more cores and heterogeneity (in fact complexity) in the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable as they are able to expose higher parallelism in applications but the overheads introduced by the software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task dependence graph management and task scheduling to heterogeneous systems. We propose a hardware architecture Picos, consisting of a hardware task dependence manager including nested task support, and a heterogeneous task scheduler, to accelerate the critical runtime functions for task-based programming models. With Picos, we aim at extending the benefit of these programming models into exploiting fine-grained task parallelism and heterogeneity. As a proof-of-concept, Three prototypes of Picos have been designed in VHDL and implemented in a System-on-chip platform consisting of regular ARM SMP cores and an integrated FPGA. They also have been analyzed with real benchmarks with OmpSs running and Linux on the platform. The first prototype is a hardware task dependence manager, which has been implemented in a Xilinx Zynq 7000 series SoCs. It is connected to a 2-core ARM Cortex A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to 21x speedup. The second prototype Picos++ extended Picos with an exciting new feature for nested task support in hardware. To the best of our knowledge, this is the first time that such a feature has been support fully in hardware task dependence managers. This prototype is fully integrated in not only hardware, but also with a State-of-the-Art parallel programming model, and with Linux. The third prototype includes both a hardware task dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and then schedule them to hardware execution units that have the estimated earliest finish time. It is implemented in a Xilinx Zynq Ultrascale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to 16.2x speedup for real benchmarks, and saves up to 90% of energy.Los modelos de programación basados en tareas permiten a los programadores expresar las aplicaciones como una colección de tareas con dependencias entre ellas. Dichos modelos son simples de usar y mejoran enormemente la programabilidad. Para ello se valen del uso de una runtime que en tiempo de ejecución ayuda a explotar el paralelismo de las tareas cuando se ejecutan en plataformas multi-cores, many-cores y heterogéneas. En estos modelos de programación los runtimes garantizan la ejecución de las tareas en el orden correcto mediante el uso de gráficos de dependencias entre tareas (TDG). Actualmente, los runtimes son lo suficientemente potentes para proporcionar un alto rendimiento con tareas de granularidad gruesa a pesar de que para mantener toda la información que necesitan para hacer su trabajo, introducen un sobrecoste importante en la ejecución de las aplicaciones. El problema viene dado por la tendencia actual en arquitectura de computadores a seguir incluyendo más núcleos y heterogeneidad (de hecho, complejidad) en los sistemas de procesado con lo que el paralelismo de granularidad gruesa no es suficiente para alimentar todos los recursos. En estos entornos complejos las tareas de granularidad fina son preferibles ya que son capaces de exponer un mayor paralelismo de las aplicaciones. Sin embargo, con tareas de granularidad fina, los sobrecostes introducidos por los runtimes software son mayores debido a la necesidad de manejar muchas más tareas más rápido. En general los mayores sobrecostes introducidos por los runtimes son: la administración de los grafos de dependencias que relacionan las tareas y la gestión de las tareas en sistemas heterogéneos. Proponemos una arquitectura hardware, llamada Picos, que consiste en un administrador de dependencias entre tareas incluyendo soporte para tareas anidadas y planificación de tareas heterogéneas. La función principal de dicha arquitectura es acelerar las funciones críticas de los runtimes para modelos de programación basados en tareas. Con Picos, se pretende extender el beneficio de estos modelos de programación para explotar el paralelismo y la heterogeneidad ejecutando tareas de granularidad fina. Como prueba de concepto, tres prototipos de Picos han sido diseñado en VHDL e implementado en una plataforma System-on-chip que consta de varios núcleos ARM integrados junto con una FPGA, y ademas analizados con ejecuciones reales con OmpSs y con Linux. El primer prototipo es un administrador hardware de tareas con dependencias, que se ha implementado en un SoC Xilinx Zynq serie 7000. Está conectado a un procesador ARM Cortex A9 de 2 núcleos, e integrado con el SO. Con 24 núcleos simulados y realizando el análisis real de las dependencias entre tareas en Picos, obtiene hasta un 21x sobre las mismas ejecuciones usando el entorno software. El segundo prototipo, Picos++, amplió Picos incorporando el soporte para la gestión de tareas anidadas en hardware. Hasta donde llega nuestro conocimiento, esta es la primera vez que dicha característica ha sido propuesta y/o incorporada en un administrador hardware de dependencias entre tareas. El segundo prototipo está completamente integrado en el sistema, no solo en hardware, sino también con el modelo de programación paralelo y con el sistema operativo. El tercer prototipo, incluye un administrador y planificador de tareas heterogéneas. El planificador de tareas heterogéneas recibe dichas tareas listas del administrador de dependencias entre tareas y las programa en la unidad de ejecución de hardware adecuada que tenga el tiempo de finalización estimado más corto. Este prototipo se ha implementado en un chip MPSoC Xilinx Zynq Ultrascale+. En dicho sistema con cuatro núcleos ARM y hasta 15 aceleradores HW funcionales, logra una aceleración de hasta 16.2x, y ahorra hasta el 90% de la energía con respecto al software.Postprint (published version

    Efficient design space exploration of embedded microprocessors

    Get PDF

    Combining FPGA prototyping and high-level simulation approaches for Design Space Exploration of MPSoCs

    Get PDF
    Modern embedded systems are parallel, component-based, heterogeneous and finely tuned on the basis of the workload that must be executed on them. To improve design reuse, Application Specific Instruction-set Processors (ASIPs) are often employed as building blocks in such systems, as a solution capable of satisfying the required functional and physical constraints (e.g. throughput, latency, power or energy consumption etc.), while providing, at the same time, high flexibility and adaptability. Composing a multi-processor architecture including ASIPs and mapping parallel applications onto it is a design activity that require an extensive Design Space Exploration process (DSE), to result in cost-effective systems. The work described here aims at defining novel methodologies for the application-driven customizations of such highly heterogeneous embedded systems. The issue is tackled at different levels, integrating different tools. High-level event-based simulation is a widely used technique that offers speed and flexibility as main points of strength, but needs, as a preliminary input and periodically during the iteration process, calibration data that must be acquired by means of more accurate evaluation methods. Typically, this calibration is performed using instruction-level cycleaccurate simulators that, however, turn out to be very slow, especially when complete multiprocessor systems must be evaluated or when the grain of the calibration is too fine, while FPGA approaches have shown to performbetter for this particular applications. FPGA-based emulation techniques have been proposed in the recent past as an alternative solution to the software-based simulation approach, but some further steps are needed before they can be effectively exploitedwithin architectural design space exploration. Firstly, some kind of technology-awareness must be introduced, to enable the translation of the emulation results into a pre-estimation of a prospective ASIC implementation of the design. Moreover, when performing architectural DSE, a significant number of different candidate design points has to be evaluated and compared. In this case, if no countermeasures are taken, the advantages achievable with FPGAs, in terms of emulation speed, are counterbalanced by the overhead introduced by the time needed to go through the physical synthesis and implementation flow. Developed FPGA-based prototyping platform overcomes such limitations, enabling the use of FPGA-based prototyping for micro-architectural design space exploration of ASIP processors. In this approach, to increase the emulation speed-up, two different methods are proposed: the first is based on automatic instantiation of additional hardware modules, able to reconfigure at runtime the prototype, while the second leverages manipulation of application binary code, compiled for a custom VLIW ASIP architecture, that is transformed into code executable on a different configuration. This allows to prototype a whole set of ASIP solutions after one single FPGA implementation flow, mitigating the afore-mentioned overhead.A short overview on the tools used throughout the work will also be offered, covering basic aspects of Intel-Silicon Hive ASIP development toolchain, SESAME framework general description, along with a review of state-of-art simulation and prototyping techniques for complex multi-processor systems. Each proposed approach will be validated through a real-world use case, confirming the validity of this solution
    corecore