83 research outputs found

    Empowering parallel computing with field programmable gate arrays

    Get PDF
    After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements

    Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL

    Get PDF
    Heterogeneous systems are the core architecture of most of the high-performance computing nodes, due to their excellent performance and energy efficiency. However, a key challenge that remains is programmability, specifically, releasing the programmer from the burden of managing data and devices with different architectures. To this end, we extend EngineCL to support FPGA devices. Based on OpenCL, EngineCL is a high-level framework providing load balancing among devices. Our proposal fully integrates FPGAs into the framework, enabling effective cooperation between CPU, GPU, and FPGA. With command overlapping and judicious data management, our work improves performance by up to 96% compared with single-device execution and delivers energy-delay gains of up to 37%. In addition, adopting FPGAs does not require programmers to make big changes in their applications because the extensions do not modify the user-facing interface of EngineCL

    Hardware Parallelization of Cores Accessing Memory with Irregular Access Patterns

    Full text link
    This project studies FPGA-based heterogeneous computing architectures with the objective of discovering their ability to optimize the performances of algorithms characterized by irregular memory access patterns. The example used to achieve this is a graph algorithm known as Triad Census Algorithm, whose implementation has been developed and tested. First of all, the triad census algorithm is presented, explaining the possible variants and reviewing the existing implementations upon different architectures. The analysis focuses on the parallelization techniques which have allowed to boost performance, thus reducing execution time. Besides, the study tackles the OpenCL programming model, the standard used to develop the final application. Special attention is paid to the language details that have motivated some of the most important design decisions. The dissertation continues with the description of the project implementation, including the application objectives, the system design, and the different variants developed to enhance algorithm performance. Finally, some of the experimental results are presented and discussed. All implemented versions are evaluated and compared to decide which is the best in terms of scalability and execution time

    FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis

    Get PDF
    La evolución de las FPGAs como dispositivos para el procesamiento con alta eficiencia energética y baja latencia de control, comparada con dispositivos como las CPUs y las GPUs, las han hecho atractivas en el ámbito de la computación de alto rendimiento (HPC).A pesar de las inumerables ventajas de las FPGAs, su inclusión en HPC presenta varios retos. El primero, la complejidad que supone la programación de las FPGAs comparada con dispositivos como las CPUs y las GPUs. Segundo, el tiempo de desarrollo es alto debido al proceso de síntesis del hardware. Y tercero, trabajar con más arquitecturas en HPC requiere el manejo y la sintonización de los detalles de cada dispositivo, lo que añade complejidad.Esta tesis aborda estos 3 problemas en diferentes niveles con el objetivo de mejorar y facilitar la adopción de las FPGAs usando la síntesis de alto nivel(HLS) en sistemas HPC.En un nivel próximo al hardware, en esta tesis se desarrolla un modelo analítico para las aplicaciones limitadas en memoria, que es una situación común en aplicaciones de HPC. El modelo, desarrollado para kernels programados usando HLS, puede predecir el tiempo de ejecución con alta precisión y buena adaptabilidad ante cambios en la tecnología de la memoria, como las DDR4 y HBM2, y en las variaciones en la frecuencia del kernel. Esta solución puede aumentar potencialmente la productividad de las personas que programan, reduciendo el tiempo de desarrollo y optimización de las aplicaciones.Entender los detalles de bajo nivel puede ser complejo para las programadoras promedio, y el desempeño de las aplicaciones para FPGA aún requiere un alto nivel en las habilidades de programación. Por ello, nuestra segunda propuesta está enfocada en la extensión de las bibliotecas con una propuesta para cómputo en visión artificial que sea portable entre diferentes fabricantes de FPGAs. La biblioteca se ha diseñado basada en templates, lo que permite una biblioteca que da flexibilidad a la generación del hardware y oculta decisiones de diseño críticas como la comunicación entre nodos, el modelo de concurrencia, y la integración de las aplicaciones en el sistema heterogéneo para facilitar el desarrollo de grafos de visión artificial que pueden ser complejos.Finalmente, en el runtime del host del sistema heterogéneo, hemos integrado la FPGA para usarla de forma trasparente como un dispositivo acelerador para la co-ejecución en sistemas heterogéneos. Hemos hecho una serie propuestas de altonivel de abstracción que abarca los mecanismos de sincronización y políticas de balanceo en un sistema altamente heterogéneo compuesto por una CPU, una GPU y una FPGA. Se presentan los principales retos que han inspirado esta investigación y los beneficios de la inclusión de una FPGA en rendimiento y energía.En conclusión, esta tesis contribuye a la adopción de las FPGAs para entornos HPC, aportando soluciones que ayudan a reducir el tiempo de desarrollo y mejoran el desempeño y la eficiencia energética del sistema.---------------------------------------------The emergence of FPGAs in the High-Performance Computing domain is arising thanks to their promise of better energy efficiency and low control latency, compared with other devices such as CPUs or GPUs.Albeit these benefits, their complete inclusion into HPC systems still faces several challenges. First, FPGA complexity means its programming more difficult compared to devices such as CPU and GPU. Second, the development time is longer due to the required synthesis effort. And third, working with multiple devices increments the details that should be managed and increase hardware complexity.This thesis tackles these 3 problems at different stack levels to improve and to make easier the adoption of FPGAs using High-Level Synthesis on HPC systems. At a close to the hardware level, this thesis contributes with a new analytical model for memory-bound applications, an usual situation for HPC applications. The model for HLS kernels can anticipate application performance before place and route, reducing the design development time. Our results show a high precision and adaptable model for external memory technologies such as DDR4 and HBM2, and kernel frequency changes. This solution potentially increases productivity, reducing application development time.Understanding low-level implementation details is difficult for average programmers, and the development of FPGA applications still requires high proficiency program- ming skills. For this reason, the second proposal is focused on the extension of a computer vision library to be portable among two of the main FPGA vendors. The template-based library allows hardware flexibility and hides design decisions such as the communication among nodes, the concurrency programming model, and the application’s integration in the heterogeneous system, to develop complex vision graphs easily.Finally, we have transparently integrated the FPGA in a high level framework for co-execution with other devices. We propose a set of high level abstractions covering synchronization mechanism and load balancing policies in a highly heterogeneous system with CPU, GPU, and FPGA devices. We present the main challenges that inspired this research and the benefits of the FPGA use demonstrating performance and energy improvements.<br /

    High-level synthesis of fine-grained weakly consistent C concurrency

    Get PDF
    High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal for software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared memory synchronisation is typically serialised to ensure correct memory behaviour, via locks. Locks are safety resources that ensure exclusive access of shared memory, eliminating data races and providing synchronisation guarantees for programmers.  As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments. Our method of synthesising atomics enables several unique possibilities. Chiefly, we are capable of supporting weakly consistent (`weak') atomics, which necessitate fewer ordering constraints compared to sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone and hence we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can globally analyse the entire program to generate its memory constraints. Additionally, we can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example via an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.Open Acces

    Heterogeneous CPU/GPU Memory Hierarchy Analysis and Optimization

    Get PDF
    In this master thesis, we propose a scheduling reordering for heterogeneous processors based on a hysteresis detector to give some fairness and speedup to the memory request threads taking advantage of the bank level parallelism at the memory system organization
    corecore