4 research outputs found

    Asynchronous Runtime with Distributed Manager for Task-based Programming Models

    Full text link
    Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential code. The OpenMP 4.0 standard introduced the possibility of describing a set of data dependences per task, which the runtime uses to order task execution. This order is calculated using shared graphs, which all threads update under exclusive access, using synchronization mechanisms (locks) to ensure the correctness of the dependence management. The contention in the access to these structures becomes critical in many-core systems, because several threads may waste computation resources waiting for their turn. This paper proposes an asynchronous management of the runtime structures, like task dependence graphs, suitable for task-based programming model runtimes. In this organization, the threads request actions from the runtime instead of performing them directly. The requests are then handled by a distributed runtime manager (DDAST), which does not require dedicated resources. Instead, the manager uses the idle threads to modify the runtime structures. The paper also presents an implementation, analysis and performance evaluation of this runtime organization. The performance results show that the proposed asynchronous organization outperforms the speedup obtained by the original runtime for different benchmarks and different many-core architectures. Comment: 2020 Parallel Computing
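
    For concreteness (this example is standard OpenMP, not code from the paper), the per-task data dependences that feed the shared task-dependence graph are declared with the depend clause introduced in OpenMP 4.0:

    // Standard OpenMP 4.0 task dependences in C: the runtime builds a
    // task-dependence graph from the depend clauses and defers each task
    // until its predecessors have completed. Compile with -fopenmp.
    #include <stdio.h>

    int main(void) {
        int x = 0, y = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: x)                 // producer of x
            x = 42;
            #pragma omp task depend(in: x) depend(out: y)   // waits for x, produces y
            y = x + 1;
            #pragma omp task depend(in: y)                  // waits for y
            printf("y = %d\n", y);
        }                                                   // implicit barrier
        return 0;
    }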

    Asynchronous runtime for task-based dataflow programming models

    Get PDF
    The importance of parallel programming is increasing year after year, since the power wall popularized multi-core processors and, with them, shared-memory parallel programming models. In particular, task-based programming models, like the OpenMP 4.0 standard, have become more and more important. They allow describing a set of data dependences per task, which the runtime uses to order the execution of tasks. This order is calculated using shared graphs, which all threads update under exclusive access, using synchronization mechanisms (locks) to ensure the correctness of the dependences. Although exclusive accesses are necessary to avoid data races, they may imply contention that limits the application parallelism. This becomes critical in many-core systems, because several threads may waste computation resources waiting to access the runtime structures. This master thesis introduces the concept of an asynchronous runtime management suitable for task-based programming model runtimes. The proposal is based on the asynchronous management of the runtime structures, like task dependence graphs: the application threads request actions from the runtime instead of directly executing the needed modifications. The requests are then handled by a runtime manager, which can be implemented in different ways. This master thesis extends a previously implemented centralized runtime manager and presents a novel implementation of a distributed runtime manager. On one hand, the runtime design based on a centralized manager [1] is extended to dynamically adapt the runtime behavior to the manager load, with the objective of being as fast as possible. On the other hand, a novel runtime design based on a distributed manager is proposed to overcome the limitations observed in the centralized design. The distributed implementation allows any thread to become a runtime manager thread if that helps to exploit the application parallelism. This is achieved using a new runtime feature, also implemented in this master thesis, for dispatching runtime functionality through a callback system. The proposals are evaluated on different many-core architectures, and their performance is compared against the baseline runtimes used to implement the asynchronous versions. Results show that the centralized manager extension overcomes the hard limitations of the initial basic implementation, that the distributed manager fixes the problems observed in the previous implementation, and that the proposed asynchronous organization significantly outperforms the speedup obtained by the original runtime for real benchmarks.
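
    As a minimal sketch of this request-based design (all names here are invented for illustration; the thesis implementations are more elaborate), worker threads enqueue short graph-modification requests instead of holding the task-dependence graph lock for the whole update, and any idle thread can drain the queue as a manager:

    /* Hypothetical sketch of the asynchronous-management idea: workers
     * submit requests; an idle thread applies them to the task graph. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef enum { REQ_ADD_TASK, REQ_TASK_FINISHED } req_kind_t;

    typedef struct rt_request {
        req_kind_t kind;
        void *payload;                 /* task descriptor, dependence list, ... */
        struct rt_request *next;
    } rt_request_t;

    static rt_request_t *queue_head, *queue_tail;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by worker threads: an O(1) enqueue instead of a long critical
     * section that walks and updates the shared task-dependence graph. */
    void rt_submit(req_kind_t kind, void *payload) {
        rt_request_t *r = malloc(sizeof *r);
        r->kind = kind; r->payload = payload; r->next = NULL;
        pthread_mutex_lock(&queue_lock);
        if (queue_tail) queue_tail->next = r; else queue_head = r;
        queue_tail = r;
        pthread_mutex_unlock(&queue_lock);
    }

    /* Called by a thread with no ready task: act as manager for a while and
     * apply pending requests to the graph (apply_to_graph is assumed). */
    void rt_drain(void (*apply_to_graph)(rt_request_t *)) {
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            rt_request_t *r = queue_head;
            if (r) { queue_head = r->next; if (!queue_head) queue_tail = NULL; }
            pthread_mutex_unlock(&queue_lock);
            if (!r) break;
            apply_to_graph(r);
            free(r);
        }
    }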

    Characterizing and improving the performance of many-core task-based parallel programming runtimes

    No full text
    Parallel task-based programming models like OpenMP support the declaration of task data dependences. This information is used to delay the execution of a task until its data is available. The dependences between tasks are calculated at runtime using shared graphs that are updated concurrently by all threads. However, only one thread can modify the task graph at a time to ensure correctness; the others need to wait before making their modifications. This waiting limits the application's parallelism and becomes critical in many-core systems. This paper characterizes this behavior, analyzing how it hinders performance, and presents an alternative organization suitable for the runtimes of task-based programming models. This organization allows managing the runtime structures asynchronously or synchronously, adapting the runtime to reduce the waste of computation resources and increase performance. Results show that the new runtime structure outperforms the peak speedup of the original runtime model when contention is high, and achieves similar or better performance for real applications. This work is partially supported by the European Union H2020 Research and Innovation Action through the Mont-Blanc 3 project (GA 671697) and HiPEAC (GA 687698), by the Spanish Government (projects SEV-2015-0493 and TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). Peer reviewed. Postprint (published version).
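
    A hedged sketch of the synchronous/asynchronous adaptation described above (the threshold, counter and function names are assumptions for illustration, not taken from the paper): a thread updates the graph directly while the lock is cheap to acquire, and hands the request to a manager queue once too many threads are already waiting:

    /* Hypothetical adaptive policy: direct, lock-protected graph updates
     * under low contention; queued, asynchronous updates once contention
     * grows. ASYNC_THRESHOLD and waiters are invented for illustration. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define ASYNC_THRESHOLD 8            /* assumed tuning knob */

    static atomic_int waiters;           /* threads blocked on the graph lock */

    bool should_enqueue_instead_of_lock(void) {
        /* With many threads already queued on the lock, handing the request
         * to a manager wastes fewer cycles than waiting for the lock. */
        return atomic_load(&waiters) >= ASYNC_THRESHOLD;
    }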

    Hardware runtime management for task-based programming models

    Get PDF
    Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee a correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks, although they impose overheads on the application execution to maintain all the information they need for their work. However, as the current trend in processor architectures keeps adding more cores and heterogeneity (in fact, complexity) to the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable, as they expose higher parallelism in applications, but the overheads introduced by the software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task-dependence graph management and task scheduling to heterogeneous systems. We propose a hardware architecture, Picos, consisting of a hardware task-dependence manager with nested-task support and a heterogeneous task scheduler, to accelerate the critical runtime functions of task-based programming models. With Picos, we aim at extending the benefit of these programming models to exploiting fine-grained task parallelism and heterogeneity. As a proof of concept, three prototypes of Picos have been designed in VHDL and implemented on a system-on-chip platform consisting of regular ARM SMP cores and an integrated FPGA, and they have been analyzed with real benchmarks running OmpSs and Linux on the platform. The first prototype is a hardware task-dependence manager, implemented in a Xilinx Zynq 7000 series SoC and connected to a 2-core ARM Cortex-A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to a 21x speedup. The second prototype, Picos++, extended Picos with nested-task support in hardware. To the best of our knowledge, this is the first time that such a feature has been fully supported in a hardware task-dependence manager. This prototype is fully integrated not only in hardware, but also with a state-of-the-art parallel programming model and with Linux. The third prototype includes both a hardware task-dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and schedules each of them to the hardware execution unit with the earliest estimated finish time. It is implemented in a Xilinx Zynq Ultrascale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to a 16.2x speedup for real benchmarks and saves up to 90% of the energy. Postprint (published version).
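
    As a brief illustration of the earliest-finish-time heuristic used by the heterogeneous scheduler (a generic software sketch with invented names, not the hardware implementation itself), the scheduler estimates when each execution unit would finish the task and picks the minimum:

    /* Hypothetical earliest-finish-time selection: ready_at and task_cost
     * are assumed per-unit estimates; the real design is in hardware. */
    #include <stddef.h>

    typedef struct {
        double ready_at;      /* when the unit finishes its queued work */
        double task_cost;     /* estimated cost of this task on the unit */
    } exec_unit_t;

    /* Returns the index of the unit with the earliest estimated finish
     * time; assumes n >= 1. */
    size_t pick_unit(const exec_unit_t *units, size_t n) {
        size_t best = 0;
        double best_eft = units[0].ready_at + units[0].task_cost;
        for (size_t i = 1; i < n; i++) {
            double eft = units[i].ready_at + units[i].task_cost;
            if (eft < best_eft) { best_eft = eft; best = i; }
        }
        return best;
    }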
