43 research outputs found

    Operating System Concepts for Reconfigurable Computing: Review and Survey

    One of the key future challenges for reconfigurable computing is to enable higher design productivity and an easier way to use reconfigurable computing systems for users who are unfamiliar with the underlying concepts. One way of doing this is to provide standardization and abstraction, usually supported and enforced by an operating system. This article gives a historical review and a summary of ideas and key concepts for including reconfigurable computing aspects in operating systems. The article also presents an overview of published and available operating systems targeting the area of reconfigurable computing. The purpose of this article is to identify and summarize common patterns among those systems that can be seen as de facto standards. Furthermore, open problems not covered by these already available systems are identified.

    An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration).
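The abstract above mentions work-item coalescing without describing how the LE1 framework implements it. As a rough illustration of the general idea only, the following Python sketch wraps a per-work-item kernel in a loop so that each core runs whole work-groups sequentially instead of one thread per work-item. All names (`vector_add_kernel`, `run_coalesced`, the round-robin assignment of groups to cores) are hypothetical and are not taken from the thesis.

```python
def vector_add_kernel(gid, a, b, out):
    """Body of a hypothetical OpenCL-style kernel for one work-item."""
    out[gid] = a[gid] + b[gid]


def run_coalesced(kernel, global_size, local_size, num_cores, *args):
    """Run `kernel` for every work-item, one work-group at a time per core.

    Assumption: work-groups are assigned to cores round-robin. The inner
    loop over local ids is the loop that coalescing adds around the kernel
    body, so one core executes a whole work-group sequentially.
    """
    num_groups = (global_size + local_size - 1) // local_size
    schedule = {core: [] for core in range(num_cores)}
    for group in range(num_groups):
        schedule[group % num_cores].append(group)

    # Cores are simulated sequentially here; on real hardware they run in parallel.
    for core, groups in schedule.items():
        for group in groups:
            for lid in range(local_size):
                gid = group * local_size + lid
                if gid < global_size:  # guard the last, possibly partial, group
                    kernel(gid, *args)
    return schedule


a = list(range(8))
b = [10 * x for x in range(8)]
out = [0] * 8
schedule = run_coalesced(vector_add_kernel, 8, 2, 2, a, b, out)
```

In this toy run, eight work-items in groups of two are spread over two simulated cores, and `out` ends up holding the element-wise sums.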

    An Introduction to the MISD Technology

    The growth of data volume, velocity, and variety will be among the global IT challenges of the next decade. To overcome performance limits, the most effective innovations such as cognitive computing, GPU and FPGA acceleration, and heterogeneous computing have to be integrated with traditional microprocessor technology. As a fundamental part of most computational challenges, discrete mathematics should be supported by both computer hardware and software. For now, however, optimization methods on graphs and big data sets are generally based on software technology, although hardware support promises better results. In this paper, a new computing technology with direct hardware support for discrete mathematics functions is presented. A new non-Von Neumann microprocessor named the Structure Processing Unit (SPU), which performs operations over large data sets, data structures, and graphs, was developed and verified at Bauman Moscow State Technical University. The basic principles of implementing the SPU in a computer system with multiple instruction streams and a single data stream (MISD) are presented. We then introduce the programming techniques for such a system comprising a CPU and an SPU. Experimental results and performance tests for the universal MISD computer are shown.

    Enabling Runtime Profiling to Hide and Exploit Heterogeneity within Chip Heterogeneous Multiprocessor Systems (CHMPS)

    The heterogeneity of multiprocessor systems on chip (MPSoC) has presented unique opportunities for furthering today's diverse application needs. FPGA-based MPSoCs have the potential to bridge the gap between generality and specialization but have traditionally been limited to device experts. The flexibility of these systems can enable computation without compromise, but this can only be realized if the flexibility extends throughout the software stack. At the top of this stack, there has been significant effort to leverage the heterogeneity of the architecture. However, the improvement of these abstractions is limited to what the bottom of the stack exposes: the runtime system. The runtime system is conveniently positioned between the heterogeneity of the hardware and the diverse mix of both programming languages and applications. As a result, it is an important enabler for realizing the flexibility of an FPGA-based MPSoC. The runtime system can provide the abstractions for how to make use of the hardware. However, it is also important to know when and which hardware to use. This is a non-issue for a homogeneous system, but is an important challenge to overcome for heterogeneous systems. This thesis presents a self-aware runtime system that is able to adapt to the application's hardware needs with a runtime overhead that is comparable to a naive approach. It achieves this through a combination of pre-generated offline data and the utilization of runtime data. For systems with diminishing hardware, the results confirmed that the runtime system provided high resource efficiency. This thesis also explored different runtime metrics that can affect the application on a heterogeneous system and offers concluding remarks on future work.

    Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos (Performance and Energy Efficiency Optimization in Massively Parallel Systems)

    Heterogeneous systems are becoming increasingly relevant due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity means that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, and makes it difficult to adapt applications. Co-execution allows all devices to compute the same problem simultaneously, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, which significantly complicates programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime is a flexible C++/SYCL-based system that adds co-execution support to oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications. Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant) and the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union's Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The Integration II: Hybrid programming models of Chapter 4 has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS).
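The workload distribution that co-execution requires can be illustrated with a minimal static partitioning sketch. This is not the EngineCL or CoexecutorRuntime API, and the abstract does not specify either runtime's scheduling policies; the function name `split_static` and the relative-speed figures below are assumptions made purely for illustration.

```python
def split_static(total_items, device_speeds):
    """Split the index range [0, total_items) among devices.

    `device_speeds` maps a device name to an assumed relative throughput.
    Each device gets a contiguous (start, end) chunk proportional to its
    speed; the last device absorbs any rounding remainder so the whole
    range is covered exactly once.
    """
    total_speed = sum(device_speeds.values())
    names = list(device_speeds)
    chunks = {}
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            size = total_items - start
        else:
            size = int(total_items * device_speeds[name] / total_speed)
        chunks[name] = (start, start + size)
        start += size
    return chunks


# Hypothetical relative speeds: the GPU is assumed three times faster.
chunks = split_static(1000, {"cpu": 1, "gpu": 3, "fpga": 1})
```

A static split like this depends on knowing the relative speeds in advance. The dynamic adaptive strategies mentioned in the abstract would instead adjust the shares at run time.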

    FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis

    FPGAs are emerging in the High-Performance Computing (HPC) domain thanks to their promise of better energy efficiency and low control latency compared with other devices such as CPUs or GPUs. Despite these benefits, their complete inclusion into HPC systems still faces several challenges. First, the complexity of FPGAs makes them more difficult to program than devices such as CPUs and GPUs. Second, development time is longer due to the required synthesis effort. Third, working with multiple devices increases the number of details that must be managed and increases hardware complexity. This thesis tackles these three problems at different stack levels to improve and ease the adoption of FPGAs using High-Level Synthesis (HLS) on HPC systems. At a level close to the hardware, this thesis contributes a new analytical model for memory-bound applications, a common situation in HPC. The model for HLS kernels can anticipate application performance before place and route, reducing design development time. Our results show that the model is highly precise and adapts to external memory technologies such as DDR4 and HBM2 and to kernel frequency changes. This solution potentially increases productivity by reducing application development time. Understanding low-level implementation details is difficult for average programmers, and the development of FPGA applications still requires highly proficient programming skills. For this reason, the second proposal focuses on extending a computer vision library to be portable between two of the main FPGA vendors. The template-based library allows hardware flexibility and hides design decisions such as the communication among nodes, the concurrency programming model, and the application's integration in the heterogeneous system, making it easy to develop complex vision graphs. Finally, we have transparently integrated the FPGA in a high-level framework for co-execution with other devices. We propose a set of high-level abstractions covering synchronization mechanisms and load balancing policies in a highly heterogeneous system with CPU, GPU, and FPGA devices. We present the main challenges that inspired this research and the benefits of using the FPGA, demonstrating performance and energy improvements.

    Optimization of Deep Convolutional Neural Network with the Integrated Batch Normalization and Global pooling

    Deep convolutional neural networks (DCNN) have made significant progress in a wide range of applications in recent years, including image identification, audio recognition, and machine translation. These tasks assist machine intelligence in a variety of ways. However, because of the large number of parameters, floating-point operations and conversion to machine terminals remain difficult. To handle this issue, an optimization of convolution in the DCNN is proposed that adjusts the characteristics of the neural network so that the loss of information is minimized and performance is improved. Minimizing the convolution function addresses the optimization issues. First, batch normalization is applied; then, instead of reducing neighborhood values, a full feature map is reduced to a single value using the global pooling approach. Traditional convolution is split into depthwise and pointwise convolutions to decrease the model size and the number of calculations. The performance of the optimized convolution-based DCNN is evaluated in terms of accuracy and error rate. The optimized DCNN is compared with existing state-of-the-art techniques and outperforms them.
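The two size-reduction ideas named in the abstract above can be made concrete with a small sketch. It counts the weights of a standard convolution against a depthwise plus pointwise split, and reduces each feature map to one value by global pooling. Assumptions: biases are ignored, and averaging is used as the pooling function (the abstract says only "global pooling"). The function names are illustrative and do not come from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (a list of rows) to a single mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


full = standard_conv_params(3, 64, 128)
split = separable_conv_params(3, 64, 128)
pooled = global_average_pool([[[1, 2], [3, 4]], [[0, 0], [0, 8]]])
```

For a 3 x 3 layer with 64 input and 128 output channels, the split form needs 8,768 weights instead of 73,728, roughly an 8.4-fold reduction under these assumptions.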