1,789 research outputs found

    Pipelined Asynchronous High Level Synthesis for General Programs

    Get PDF
    High-level synthesis (HLS) translates algorithms from software programming language into hardware. We use the dataflow HLS methodology to translate programs into asynchronous circuits by implementing programs using asynchronous dataflow elements as hardware building blocks. We extend the prior work in dataflow synthesis in the following aspects:i) we propose Fluid to synthesize pipelined dataflow circuits for real-world programs with complex control flows, which are not supported in the previous work; ii) we propose PipeLink to permit pipelined access to shared resources in the dataflow circuit. Dataflow circuit results in distributed control and an implicitly pipelined implementation. However, resource sharing in the presence of pipelining is challenging in this context due to the absence of a global scheduler. Traditional solutions to this problem impose restrictions on pipelining to guarantee mutually exclusive access to the shared resource, but PipeLink removes such restrictions and can generate pipelined asynchronous dataflow circuits for shared function calls, pipelined memory accesses and function pointers; iii) we apply several dataflow optimizations to improve the quality of the synthesized dataflow circuits; iv) we implement our system (Fluid + PipeLink) on the LLVM compiler framework, which allows us to take advantage of the optimization efforts from the compiler community; v) we compare our system with a widely-used academic HLS tool and two commercial HLS tools. Compared to commercial (academic) HLS tools, our system results in 12X (20X) reduction in energy, 1.29X (1.64X) improvement in throughput, 1.27X (1.61X) improvement in latency at a cost of 2.4X (1.61X) increase in the area

    Why High-Performance Modelling and Simulation for Big Data Applications Matters

    Get PDF
    Modelling and Simulation (M&S) offer adequate abstractions to manage the complexity of analysing big data in scientific and engineering domains. Unfortunately, big data problems are often not easily amenable to efficient and effective use of High Performance Computing (HPC) facilities and technologies. Furthermore, M&S communities typically lack the detailed expertise required to exploit the full potential of HPC solutions while HPC specialists may not be fully aware of specific modelling and simulation requirements and applications. The COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications has created a strategic framework to foster interaction between M&S experts from various application domains on the one hand and HPC experts on the other hand to develop effective solutions for big data applications. One of the tangible outcomes of the COST Action is a collection of case studies from various computing domains. Each case study brought together both HPC and M&S experts, giving witness of the effective cross-pollination facilitated by the COST Action. In this introductory article we argue why joining forces between M&S and HPC communities is both timely in the big data era and crucial for success in many application domains. Moreover, we provide an overview on the state of the art in the various research areas concerned

    OptiTrust: an Interactive Framework for Source-to-Source Transformations

    Get PDF
    This paper presents an interactive framework for developing high-performance C code via series of source-to-source transformations. Optimization steps are described in transformation scripts, expressed as OCaml programs. The programmer can interactively visualize the textual differences associated with any step of the script. We demonstrate the effectiveness of OptiTrust by reproducing a manually optimized Particle-In-Cell numerical simulation, starting from a direct, unoptimized version of the algorithm. This case study covers many state-of-the-art optimization patterns that appear in numerical simulation codes. We argue that, compared with optimizing code by hand, deriving high performance code using a transformation script makes the code easier to review, easier to debug, and easier to maintain as the intended program or as the target hardware evolves

    FPGAs for the Masses: Affordable Hardware Synthesis from Domain-Specific Languages

    Get PDF
    Field Programmable Gate Arrays (FPGAs) have the unique ability to be configured into application-specific architectures that are well suited to specific computing problems. This enables them to achieve performances and energy efficiencies that outclass other processor-based architectures, such as Chip Multiprocessors (CMPs), Graphic Processing Units (GPUs) and Digital Signal Processors (DSPs). Despite this, FPGAs are yet to gain widespread adoption, especially among application and software developers, because of their laborious application development process that requires hardware design expertise. In some application areas, domain-specific hardware synthesis tools alleviate this problem by using a Domain-Specific Language (DSL) to hide the low-level hardware details and also improve productivity of the developer. Additionally, these tools leverage domain knowledge to perform optimizations and produce high-quality hardware designs. While this approach holds great promise, the significant effort and cost of developing such domain-specific tools make it unaffordable in many application areas. In this thesis, we develop techniques to reduce the effort and cost of developing domain-specific hardware synthesis tools. To demonstrate our approach, we develop a toolchain to generate complete hardware systems from high-level functional specifications written in a DSL. Firstly, our approach uses language embedding and type-directed staging to develop a DSL and compiler in a cost-effective manner. To further reduce effort, we develop this compiler by composing reusable optimization modules, and integrate it with existing hardware synthesis tools. However, most synthesis tools require users to have hardware design knowledge to produce high-quality results. Therefore, secondly, to facilitate people without hardware design skills to develop domain-specific tools, we develop a methodology to generate high-quality hardware designs from well known computational patterns, such as map, zipWith, reduce and foreach; computational patterns are algorithmic methods that capture the nature of computation and communication and can be easily understood and used without expert knowledge. In our approach, we decompose the DSL specifications into constituent computational patterns and exploit the properties of these patterns, such as degree of parallelism, interdependence between operations and data-access characteristics, to generate high-quality hardware modules to implement them, and compose them into a complete system design. Lastly, we extended our methodology to automatically parallelize computations across multiple hardware modules to benefit from the spatial parallelism of the FPGA as well as overcome performance problems caused by non-sequential data access patterns and long access latency to external memory. To achieve this, we utilize the data-access properties of the computational patterns to automatically identify synchronization requirements and generate such multi-module designs from the same high-level functional specifications. Driven by power and performance constraints, today the world is turning to reconfigurable technology (i.e., FPGAs) to meet the computational needs of tomorrow. In this light, this work addresses the cardinal problem of making tomorrow's computing infrastructure programmable to application developers

    Hardware runtime management for task-based programming models

    Get PDF
    Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks although they impose overheads on the application execution to maintain all the information they need to do their work. However, as the current trend in processor architectures keeps including more cores and heterogeneity (in fact complexity) in the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable as they are able to expose higher parallelism in applications but the overheads introduced by the software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task dependence graph management and task scheduling to heterogeneous systems. We propose a hardware architecture Picos, consisting of a hardware task dependence manager including nested task support, and a heterogeneous task scheduler, to accelerate the critical runtime functions for task-based programming models. With Picos, we aim at extending the benefit of these programming models into exploiting fine-grained task parallelism and heterogeneity. As a proof-of-concept, Three prototypes of Picos have been designed in VHDL and implemented in a System-on-chip platform consisting of regular ARM SMP cores and an integrated FPGA. They also have been analyzed with real benchmarks with OmpSs running and Linux on the platform. The first prototype is a hardware task dependence manager, which has been implemented in a Xilinx Zynq 7000 series SoCs. It is connected to a 2-core ARM Cortex A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to 21x speedup. The second prototype Picos++ extended Picos with an exciting new feature for nested task support in hardware. To the best of our knowledge, this is the first time that such a feature has been support fully in hardware task dependence managers. This prototype is fully integrated in not only hardware, but also with a State-of-the-Art parallel programming model, and with Linux. The third prototype includes both a hardware task dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and then schedule them to hardware execution units that have the estimated earliest finish time. It is implemented in a Xilinx Zynq Ultrascale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to 16.2x speedup for real benchmarks, and saves up to 90% of energy.Los modelos de programación basados en tareas permiten a los programadores expresar las aplicaciones como una colección de tareas con dependencias entre ellas. Dichos modelos son simples de usar y mejoran enormemente la programabilidad. Para ello se valen del uso de una runtime que en tiempo de ejecución ayuda a explotar el paralelismo de las tareas cuando se ejecutan en plataformas multi-cores, many-cores y heterogéneas. En estos modelos de programación los runtimes garantizan la ejecución de las tareas en el orden correcto mediante el uso de gráficos de dependencias entre tareas (TDG). Actualmente, los runtimes son lo suficientemente potentes para proporcionar un alto rendimiento con tareas de granularidad gruesa a pesar de que para mantener toda la información que necesitan para hacer su trabajo, introducen un sobrecoste importante en la ejecución de las aplicaciones. El problema viene dado por la tendencia actual en arquitectura de computadores a seguir incluyendo más núcleos y heterogeneidad (de hecho, complejidad) en los sistemas de procesado con lo que el paralelismo de granularidad gruesa no es suficiente para alimentar todos los recursos. En estos entornos complejos las tareas de granularidad fina son preferibles ya que son capaces de exponer un mayor paralelismo de las aplicaciones. Sin embargo, con tareas de granularidad fina, los sobrecostes introducidos por los runtimes software son mayores debido a la necesidad de manejar muchas más tareas más rápido. En general los mayores sobrecostes introducidos por los runtimes son: la administración de los grafos de dependencias que relacionan las tareas y la gestión de las tareas en sistemas heterogéneos. Proponemos una arquitectura hardware, llamada Picos, que consiste en un administrador de dependencias entre tareas incluyendo soporte para tareas anidadas y planificación de tareas heterogéneas. La función principal de dicha arquitectura es acelerar las funciones críticas de los runtimes para modelos de programación basados en tareas. Con Picos, se pretende extender el beneficio de estos modelos de programación para explotar el paralelismo y la heterogeneidad ejecutando tareas de granularidad fina. Como prueba de concepto, tres prototipos de Picos han sido diseñado en VHDL e implementado en una plataforma System-on-chip que consta de varios núcleos ARM integrados junto con una FPGA, y ademas analizados con ejecuciones reales con OmpSs y con Linux. El primer prototipo es un administrador hardware de tareas con dependencias, que se ha implementado en un SoC Xilinx Zynq serie 7000. Está conectado a un procesador ARM Cortex A9 de 2 núcleos, e integrado con el SO. Con 24 núcleos simulados y realizando el análisis real de las dependencias entre tareas en Picos, obtiene hasta un 21x sobre las mismas ejecuciones usando el entorno software. El segundo prototipo, Picos++, amplió Picos incorporando el soporte para la gestión de tareas anidadas en hardware. Hasta donde llega nuestro conocimiento, esta es la primera vez que dicha característica ha sido propuesta y/o incorporada en un administrador hardware de dependencias entre tareas. El segundo prototipo está completamente integrado en el sistema, no solo en hardware, sino también con el modelo de programación paralelo y con el sistema operativo. El tercer prototipo, incluye un administrador y planificador de tareas heterogéneas. El planificador de tareas heterogéneas recibe dichas tareas listas del administrador de dependencias entre tareas y las programa en la unidad de ejecución de hardware adecuada que tenga el tiempo de finalización estimado más corto. Este prototipo se ha implementado en un chip MPSoC Xilinx Zynq Ultrascale+. En dicho sistema con cuatro núcleos ARM y hasta 15 aceleradores HW funcionales, logra una aceleración de hasta 16.2x, y ahorra hasta el 90% de la energía con respecto al software.Postprint (published version

    Hardware runtime management for task-based programming models

    Get PDF
    Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks although they impose overheads on the application execution to maintain all the information they need to do their work. However, as the current trend in processor architectures keeps including more cores and heterogeneity (in fact complexity) in the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable as they are able to expose higher parallelism in applications but the overheads introduced by the software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task dependence graph management and task scheduling to heterogeneous systems. We propose a hardware architecture Picos, consisting of a hardware task dependence manager including nested task support, and a heterogeneous task scheduler, to accelerate the critical runtime functions for task-based programming models. With Picos, we aim at extending the benefit of these programming models into exploiting fine-grained task parallelism and heterogeneity. As a proof-of-concept, Three prototypes of Picos have been designed in VHDL and implemented in a System-on-chip platform consisting of regular ARM SMP cores and an integrated FPGA. They also have been analyzed with real benchmarks with OmpSs running and Linux on the platform. The first prototype is a hardware task dependence manager, which has been implemented in a Xilinx Zynq 7000 series SoCs. It is connected to a 2-core ARM Cortex A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to 21x speedup. The second prototype Picos++ extended Picos with an exciting new feature for nested task support in hardware. To the best of our knowledge, this is the first time that such a feature has been support fully in hardware task dependence managers. This prototype is fully integrated in not only hardware, but also with a State-of-the-Art parallel programming model, and with Linux. The third prototype includes both a hardware task dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and then schedule them to hardware execution units that have the estimated earliest finish time. It is implemented in a Xilinx Zynq Ultrascale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to 16.2x speedup for real benchmarks, and saves up to 90% of energy.Los modelos de programación basados en tareas permiten a los programadores expresar las aplicaciones como una colección de tareas con dependencias entre ellas. Dichos modelos son simples de usar y mejoran enormemente la programabilidad. Para ello se valen del uso de una runtime que en tiempo de ejecución ayuda a explotar el paralelismo de las tareas cuando se ejecutan en plataformas multi-cores, many-cores y heterogéneas. En estos modelos de programación los runtimes garantizan la ejecución de las tareas en el orden correcto mediante el uso de gráficos de dependencias entre tareas (TDG). Actualmente, los runtimes son lo suficientemente potentes para proporcionar un alto rendimiento con tareas de granularidad gruesa a pesar de que para mantener toda la información que necesitan para hacer su trabajo, introducen un sobrecoste importante en la ejecución de las aplicaciones. El problema viene dado por la tendencia actual en arquitectura de computadores a seguir incluyendo más núcleos y heterogeneidad (de hecho, complejidad) en los sistemas de procesado con lo que el paralelismo de granularidad gruesa no es suficiente para alimentar todos los recursos. En estos entornos complejos las tareas de granularidad fina son preferibles ya que son capaces de exponer un mayor paralelismo de las aplicaciones. Sin embargo, con tareas de granularidad fina, los sobrecostes introducidos por los runtimes software son mayores debido a la necesidad de manejar muchas más tareas más rápido. En general los mayores sobrecostes introducidos por los runtimes son: la administración de los grafos de dependencias que relacionan las tareas y la gestión de las tareas en sistemas heterogéneos. Proponemos una arquitectura hardware, llamada Picos, que consiste en un administrador de dependencias entre tareas incluyendo soporte para tareas anidadas y planificación de tareas heterogéneas. La función principal de dicha arquitectura es acelerar las funciones críticas de los runtimes para modelos de programación basados en tareas. Con Picos, se pretende extender el beneficio de estos modelos de programación para explotar el paralelismo y la heterogeneidad ejecutando tareas de granularidad fina. Como prueba de concepto, tres prototipos de Picos han sido diseñado en VHDL e implementado en una plataforma System-on-chip que consta de varios núcleos ARM integrados junto con una FPGA, y ademas analizados con ejecuciones reales con OmpSs y con Linux. El primer prototipo es un administrador hardware de tareas con dependencias, que se ha implementado en un SoC Xilinx Zynq serie 7000. Está conectado a un procesador ARM Cortex A9 de 2 núcleos, e integrado con el SO. Con 24 núcleos simulados y realizando el análisis real de las dependencias entre tareas en Picos, obtiene hasta un 21x sobre las mismas ejecuciones usando el entorno software. El segundo prototipo, Picos++, amplió Picos incorporando el soporte para la gestión de tareas anidadas en hardware. Hasta donde llega nuestro conocimiento, esta es la primera vez que dicha característica ha sido propuesta y/o incorporada en un administrador hardware de dependencias entre tareas. El segundo prototipo está completamente integrado en el sistema, no solo en hardware, sino también con el modelo de programación paralelo y con el sistema operativo. El tercer prototipo, incluye un administrador y planificador de tareas heterogéneas. El planificador de tareas heterogéneas recibe dichas tareas listas del administrador de dependencias entre tareas y las programa en la unidad de ejecución de hardware adecuada que tenga el tiempo de finalización estimado más corto. Este prototipo se ha implementado en un chip MPSoC Xilinx Zynq Ultrascale+. En dicho sistema con cuatro núcleos ARM y hasta 15 aceleradores HW funcionales, logra una aceleración de hasta 16.2x, y ahorra hasta el 90% de la energía con respecto al software

    Data layout types : a type-based approach to automatic data layout transformations for improved SIMD vectorisation

    Get PDF
    The increasing complexity of modern hardware requires sophisticated programming techniques for programs to run efficiently. At the same time, increased power of modern hardware enables more advanced analyses to be included in compilers. This thesis focuses on one particular optimisation technique that improves utilisation of vector units. The foundation of this technique is the ability to chose memory mappings for data structures of a given program. Usually programming languages use a fixed layout for logical data structures in physical memory. Such a static mapping often has a negative effect on usability of vector units. In this thesis we consider a compiler for a programming language that allows every data structure in a program to have its own data layout. We make sure that data layouts across the program are sound, and most importantly we solve a problem of automatic data layout reconstruction. To consistently do this, we formulate this as a type inference problem, where type encodes a data layout for a given structure as well as implied program transformations. We prove that type-implied transformations preserve semantics of the original programs and we demonstrate significant performance improvements when targeting SIMD-capable architectures
    corecore