2,052 research outputs found

    Picos, a hardware task-dependence manager for task-based dataflow programming models

    Task-based programming models such as OpenMP, Intel TBB and OmpSs are widely used to extract high levels of parallelism from applications executed on multi-core and many-core platforms. These programming models allow applications to be expressed as a set of tasks with dependences that drive their execution at runtime. While managing these dependences for tasks with coarse granularity proves to be highly beneficial, it introduces noticeable overheads when targeting fine-grained tasks, diminishing the potential speedups or even introducing performance losses. To overcome this drawback, we propose Picos, a hardware/software co-design that manages inter-task dependences efficiently. In this paper we describe the main ideas of our proposal and a prototype implementation. This prototype is integrated with a parallel task-based programming model and evaluated with real executions on an embedded Linux system with two ARM Cortex-A9 cores and an FPGA. When compared with a software runtime, our solution achieves more than 1.8x speedup and 40% energy savings with only 2 threads. This work is supported by projects SEV-2015-0493 and TIN2015-65316-P, by projects 2014-SGR-1051 and 2014-SGR-1272, by the RoMoL GA 321253 and by the project cooperation agreement with LG Electronics; the authors also thank the Xilinx University Program. Postprint (published version)
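
    As an illustration of what "a set of tasks with dependences" looks like at the programming-model level, the following minimal OpenMP sketch declares three tasks whose execution order is driven entirely by depend clauses. The arrays and the trivial computations are placeholders; the fragment shows the programming interface only, not Picos itself.

```cpp
#include <cstdio>

// Minimal OpenMP sketch of tasks with dependences: the runtime (or, in a
// hardware manager such as Picos, the FPGA logic) derives the dependence
// graph from the depend clauses and releases each task only when its
// inputs are ready. Arrays and the trivial computations are placeholders.
int main() {
    float a[1024], b[1024], c[1024];

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)                   // producer of a
        for (int i = 0; i < 1024; ++i) a[i] = 1.0f * i;

        #pragma omp task depend(out: b)                   // producer of b
        for (int i = 0; i < 1024; ++i) b[i] = 2.0f * i;

        #pragma omp task depend(in: a, b) depend(out: c)  // consumer of a and b
        for (int i = 0; i < 1024; ++i) c[i] = a[i] + b[i];

        #pragma omp taskwait
        std::printf("c[10] = %f\n", c[10]);
    }
    return 0;
}
```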

    Hardware runtime management for task-based programming models

    Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee a correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks, although they impose overheads on the application execution to maintain all the information they need to do their work. However, as the current trend in processor architectures keeps adding more cores and heterogeneity (in fact, complexity) to the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable as they expose higher parallelism in applications, but the overheads introduced by software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task-dependence graph management and task scheduling to heterogeneous systems. We propose Picos, a hardware architecture consisting of a hardware task-dependence manager with nested-task support and a heterogeneous task scheduler, to accelerate the critical runtime functions of task-based programming models. With Picos, we aim to extend the benefits of these programming models to the exploitation of fine-grained task parallelism and heterogeneity. As a proof of concept, three prototypes of Picos have been designed in VHDL and implemented on a System-on-Chip platform consisting of regular ARM SMP cores and an integrated FPGA. They have also been analyzed with real benchmarks, with OmpSs and Linux running on the platform. The first prototype is a hardware task-dependence manager implemented in a Xilinx Zynq 7000 series SoC. It is connected to a 2-core ARM Cortex-A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to a 21x speedup over the same executions with a software runtime. The second prototype, Picos++, extends Picos with nested-task support in hardware. To the best of our knowledge, this is the first time that such a feature has been fully supported in a hardware task-dependence manager. This prototype is fully integrated not only in hardware, but also with a state-of-the-art parallel programming model and with Linux. The third prototype includes both a hardware task-dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and schedules them to the hardware execution units with the earliest estimated finish time. It is implemented in a Xilinx Zynq UltraScale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to a 16.2x speedup for real benchmarks and saves up to 90% of the energy compared to software. Postprint (published version)
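
    The scheduling policy of the third prototype, assigning each ready task to the execution unit with the earliest estimated finish time, can be sketched in software as follows. The unit names and cost estimates are invented for illustration; the actual Picos scheduler implements this selection in hardware.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Toy sketch of earliest-estimated-finish-time scheduling of ready tasks.
// Each execution unit remembers when it becomes free; a ready task goes to
// the unit that would finish it first. Unit names and costs are invented.
struct Unit {
    std::string name;
    double free_at = 0.0;              // time at which the unit is available
};

struct ReadyTask {
    std::string name;
    std::vector<double> cost;          // estimated execution time per unit
};

static std::size_t pick_unit(const ReadyTask& t, const std::vector<Unit>& units) {
    std::size_t best = 0;
    double best_finish = units[0].free_at + t.cost[0];
    for (std::size_t u = 1; u < units.size(); ++u) {
        double finish = units[u].free_at + t.cost[u];
        if (finish < best_finish) { best_finish = finish; best = u; }
    }
    return best;
}

int main() {
    std::vector<Unit> units = {{"cpu-thread"}, {"fpga-acc0"}, {"fpga-acc1"}};
    std::vector<ReadyTask> ready = {
        {"matmul-0", {9.0, 2.0, 2.0}},  // much faster on the accelerators
        {"reduce-0", {1.0, 4.0, 4.0}},  // faster on the CPU thread
        {"matmul-1", {9.0, 2.0, 2.0}},
    };

    for (const auto& t : ready) {
        std::size_t u = pick_unit(t, units);
        units[u].free_at += t.cost[u];  // the unit is busy until the task ends
        std::cout << t.name << " -> " << units[u].name
                  << " (estimated finish " << units[u].free_at << ")\n";
    }
    return 0;
}
```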

    Parallel Architectures for Planetary Exploration Requirements (PAPER)

    The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration, with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concept appears to be a promising candidate, except that more detailed information is required. The feasibility of adding neural computation capability to this architecture needs to be studied. Key impact issues for the architectural design of computing systems meant for planetary missions were also identified.

    Dynamic Control Flow in Large-Scale Machine Learning

    Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability. Comment: Appeared in EuroSys 2018. 14 pages, 16 figures
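
    The conditionals described above are built from a small set of dataflow primitives; the paper relies on operators such as Switch (route a value to one of two branches depending on a predicate) and Merge (forward whichever branch produced a value). The fragment below is a toy, single-process illustration of that routing idea under assumed names, not TensorFlow's implementation: only the taken branch receives a value, so only its operations need to run, possibly on a different device in the distributed setting.

```cpp
#include <iostream>
#include <optional>

// Toy, single-process illustration of dataflow-style control flow:
// Switch routes a value to one of two branches based on a predicate, and
// Merge forwards whichever branch actually produced a value. Only the taken
// branch receives data, so only its operations need to execute. This is a
// sketch of the idea, not TensorFlow's implementation.

template <typename T>
struct SwitchOut {
    std::optional<T> true_branch;   // set when the predicate is true
    std::optional<T> false_branch;  // set when the predicate is false
};

template <typename T>
SwitchOut<T> Switch(const T& value, bool predicate) {
    SwitchOut<T> out;
    if (predicate) out.true_branch = value; else out.false_branch = value;
    return out;
}

template <typename T>
T Merge(const std::optional<T>& a, const std::optional<T>& b) {
    return a ? *a : *b;             // forward whichever input arrived
}

int main() {
    double x = 3.0;
    bool pred = x > 0.0;

    // A conditional: only the taken branch sees a value, so the untaken
    // branch's operations (possibly placed on another device) never run.
    SwitchOut<double> routed = Switch(x, pred);
    std::optional<double> then_val, else_val;
    if (routed.true_branch)  then_val = *routed.true_branch * 2.0;   // "then" body
    if (routed.false_branch) else_val = *routed.false_branch - 1.0;  // "else" body

    std::cout << Merge(then_val, else_val) << "\n";  // prints 6
    return 0;
}
```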

    TD-NUCA: runtime driven management of NUCA caches in task dataflow programming models

    In high performance processors, the design of on-chip memory hierarchies is crucial for performance and energy efficiency. Current processors rely on large shared Non-Uniform Cache Architectures (NUCA) to improve performance and reduce data movement. Multiple solutions exploit information available at the microarchitecture level or in the operating system to optimize NUCA performance. However, existing methods have not taken advantage of the information captured by task dataflow programming models to guide the management of NUCA caches. In this paper we propose TD-NUCA, a hardware/software co-designed approach that leverages information present in the runtime system of task dataflow programming models to efficiently manage NUCA caches. TD-NUCA identifies the data access and reuse patterns of parallel applications in the runtime system and guides the operation of the NUCA caches in the hardware. As a result, TD-NUCA achieves a 1.18x average speedup over the baseline S-NUCA while requiring only 0.62x the data movement. This work has been supported by the Spanish Ministry of Science and Technology (contract PID2019-107255GB-C21) and the Generalitat de Catalunya (contract 2017-SGR-1414). M. Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and the ESF ‘Investing in your future’. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship No. RYC-2016-21104. Peer Reviewed. Postprint (published version)
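
    The reuse information that a task dataflow runtime can extract is illustrated by the rough sketch below: it walks the in/out annotations of a task graph and classifies each data region by how many distinct tasks consume it, the kind of signal that could then be turned into a cache-placement hint. The task names, data regions and hint categories are all invented; the abstract does not describe TD-NUCA's actual runtime/hardware interface, and none is reproduced here.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Rough illustration (not TD-NUCA's real interface): derive per-data reuse
// information from the task in/out annotations a dataflow runtime already
// has, and turn it into a coarse cache-placement hint. Everything here
// (task names, data regions, hint categories) is invented.
struct Task {
    std::string name;
    std::vector<std::string> ins;   // data regions the task reads
    std::vector<std::string> outs;  // data regions the task writes
};

enum class Hint { PrivateLocalBank, SharedSpread, WriteOnceBypass };

int main() {
    std::vector<Task> tdg = {
        {"t0", {},         {"A"}},
        {"t1", {"A"},      {"B"}},
        {"t2", {"A"},      {"C"}},
        {"t3", {"B", "C"}, {"D"}},
    };

    std::map<std::string, std::set<std::string>> readers;
    std::map<std::string, std::set<std::string>> writers;
    for (const auto& t : tdg) {
        for (const auto& d : t.ins)  readers[d].insert(t.name);
        for (const auto& d : t.outs) writers[d].insert(t.name);
    }

    for (const auto& kv : writers) {
        const std::string& data = kv.first;
        std::size_t nreaders = readers.count(data) ? readers.at(data).size() : 0;
        Hint h = (nreaders == 0) ? Hint::WriteOnceBypass    // written, never read again
               : (nreaders == 1) ? Hint::PrivateLocalBank   // one consumer: keep it close
                                 : Hint::SharedSpread;      // many consumers: spread banks
        std::cout << data << ": " << nreaders << " reader task(s), hint "
                  << static_cast<int>(h) << "\n";
    }
    return 0;
}
```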

    Parallel architectures and runtime systems co-design for task-based programming models

    The increasing parallelism levels in modern computing systems have made evident the need for a holistic vision when designing multiprocessor architectures, one that takes into account the needs of the programming models and applications. Nowadays, system design consists of several layers on top of each other, from the architecture up to the application software. Although this design allows for a separation of concerns, where layers can be changed independently thanks to well-defined interfaces between them, it hampers future system design as Moore's Law reaches its end. Current performance improvements in computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power-efficient chips to be made. However, technology is reaching physical limitations where the transistor size cannot be reduced any further, which requires a change of paradigm in system design. This thesis proposes to break this layered design and advocates for a system where the architecture and the programming model runtime system are able to exchange information towards a common goal: improving performance and reducing power consumption. By making the architecture aware of runtime information, such as the Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to reduce power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads, making the execution of fine-grained tasks possible and increasing the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform more efficient communication scheduling, which also creates new opportunities for overlapping computation and communication that were not possible before. An evaluation of the proposals introduced in this thesis is provided, and a methodology to simulate and characterize the application behavior is also presented. Postprint (published version)
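
    One concrete use of the runtime information mentioned above is the critical path of the Task Dependency Graph: the longest dependence chain bounds the execution time, so its tasks are the ones worth running on the fastest (or highest-frequency) resources, while the remaining tasks can be slowed down to save power. The sketch below computes that critical path for a small invented TDG with a memoized longest-path traversal; it illustrates the idea only and is not the thesis' mechanism.

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy critical-path computation over a Task Dependency Graph (TDG): the
// longest dependence chain bounds the execution time, so its tasks are the
// candidates for the fastest resources while the rest can be slowed down
// to save power. Task names and costs are invented.
struct TaskNode {
    double cost;                      // estimated execution time
    std::vector<std::string> deps;    // tasks that must finish first
};

// Earliest possible finish time of task 'name', memoized in 'finish'.
static double finish_time(const std::string& name,
                          const std::map<std::string, TaskNode>& tdg,
                          std::map<std::string, double>& finish) {
    auto it = finish.find(name);
    if (it != finish.end()) return it->second;
    const TaskNode& n = tdg.at(name);
    double start = 0.0;
    for (const auto& d : n.deps)
        start = std::max(start, finish_time(d, tdg, finish));
    return finish[name] = start + n.cost;
}

int main() {
    // A small diamond-shaped TDG: t0 -> {t1, t2} -> t3.
    std::map<std::string, TaskNode> tdg = {
        {"t0", {2.0, {}}},
        {"t1", {5.0, {"t0"}}},
        {"t2", {1.0, {"t0"}}},
        {"t3", {3.0, {"t1", "t2"}}},
    };

    std::map<std::string, double> finish;
    double critical = 0.0;
    for (const auto& kv : tdg)
        critical = std::max(critical, finish_time(kv.first, tdg, finish));

    std::cout << "critical path length = " << critical << "\n";  // 2 + 5 + 3 = 10
    // The chain t0 -> t1 -> t3 is critical; t2 could run at a lower frequency.
    return 0;
}
```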