13 research outputs found

    Compiling Geometric Algebra Computations into Reconfigurable Hardware Accelerators

    Get PDF
    Geometric Algebra (GA), a generalization of quaternions and complex numbers, is a very powerful framework for intuitively expressing and manipulating the complex geometric relationships common to engineering problems. However, actual processing of GA expressions is very compute intensive, and acceleration is generally required for practical use. GPUs and FPGAs offer such acceleration, while requiring only low-power per operation. In this paper, we present key components of a proof-of-concept compile flow combining symbolic and hardware optimization techniques to automatically generate hardware accelerators from the abstract GA descriptions that are suitable for high-performance embedded computing

    White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing

    Full text link
    In numerical computations, precision of floating-point computations is a key factor to determine the performance (speed and energy-efficiency) as well as the reliability (accuracy and reproducibility). However, precision generally plays a contrary role for both. Therefore, the ultimate concept for maximizing both at the same time is the minimal-precision computing through precision-tuning, which adjusts the optimal precision for each operation and data. Several studies have been already conducted for it so far (e.g. Precimoniuos and Verrou), but the scope of those studies is limited to the precision-tuning alone. Hence, we aim to propose a broader concept of the minimal-precision computing system with precision-tuning, involving both hardware and software stack. In 2019, we have started the Minimal-Precision Computing project to propose a more broad concept of the minimal-precision computing system with precision-tuning, involving both hardware and software stack. Specifically, our system combines (1) a precision-tuning method based on Discrete Stochastic Arithmetic (DSA), (2) arbitrary-precision arithmetic libraries, (3) fast and accurate numerical libraries, and (4) Field-Programmable Gate Array (FPGA) with High-Level Synthesis (HLS). In this white paper, we aim to provide an overview of various technologies related to minimal- and mixed-precision, to outline the future direction of the project, as well as to discuss current challenges together with our project members and guest speakers at the LSPANC 2020 workshop; https://www.r-ccs.riken.jp/labs/lpnctrt/lspanc2020jan/

    An Execution Model and High-Level-Synthesis System for Generating SIMT Multi-Threaded Hardware from C Source Code

    Get PDF
    The performance improvement of conventional processor has begun to stagnate in recent years. Because of this, researchers are looking for new possibilities to improve the performance of computing systems. Heterogeneous systems turned out to be a powerful possibility. In the context of this thesis, a heterogeneous system consists of a software-programmable processor and a FPGA based configurable hardware accelerator. By using an accelerator specifically tailored to a particular application, heterogeneous system can achieve a higher performance that conventional processors. Due to their increased complexity, it is more complicated to develop applications for heterogeneous systems than for conventional systems based on a software-programmable processor. For programming the software and hardware parts, different languages have to be used and additional specialised hardware-knowledge is required. Both factors increase the development cost. This work presents the compiler framework Nymble which allows to program a heterogeneous system with only a single high-level language. In the high-level language the developer only has to select which parts of the application should be executed in hardware. Nymble then generates a program for the software-processor, the configuration of the hardware, and all interfaces between software and hardware. All heterogeneous systems supported by Nymble have in common that the software and hardware parts of an application have access to a shared memory. As this memory is external RAM with high access latency, it is necessary to insert a cache between the memory and hardware. With this cache, memory accesses can vary between very short or long access latency depending on whether the data is available in the cache. To hide long latencies, this thesis presents a novel execution model which allows the simultaneous execution of multiple threads in a single accelerator. Additionally, the model enables threads to be dynamically reordered at specific points in the common accelerator pipeline. This capability is used to let other (non-waiting) threads overtake a thread which is waiting for a memory access. Thus, these other threads can execute their calculations independently of the waiting thread to bridge the latency of memory accesses. Previous works are using execution models which only allow a single thread to be active in the accelerator. In case of a memory access with long latency, the thread is exchanged with another non-waiting thread. This design of the hardware often causes many resources to lie idle for a significant amount of time. In contrast, the presented novel execution model dynamically spreads multiple threads over the pipeline. This results in a higher utilisation of the resources by using resources more effectively. Furthermore, the simultaneous execution of multiple threads can achieve similar throughput as multiple copies of a single-threaded accelerator running in parallel. The new execution model makes it possible to combine the improved throughput of multiple copies with the increased efficiency of simultaneous threads in a single accelerator. Thread reordering allows the new model to be effectively used with a cached shared-memory. In comparison, between four copies of a single-threaded accelerator and a multi-thread accelerator with four thread (both created by Nymble), a resource efficiency of up to factor 2.6x can be achieved. At the same time, four simultaneous threads can be up to 4x as fast as four threads executed consecutively on a single accelerator. Compared to other, more optimised compilers, Nymble can still achieve up to 2x faster runtime with 1.5x resource efficiency

    An Execution Model and High-Level-Synthesis System for Generating SIMT Multi-Threaded Hardware from C Source Code

    No full text
    The performance improvement of conventional processor has begun to stagnate in recent years. Because of this, researchers are looking for new possibilities to improve the performance of computing systems. Heterogeneous systems turned out to be a powerful possibility. In the context of this thesis, a heterogeneous system consists of a software-programmable processor and a FPGA based configurable hardware accelerator. By using an accelerator specifically tailored to a particular application, heterogeneous system can achieve a higher performance that conventional processors. Due to their increased complexity, it is more complicated to develop applications for heterogeneous systems than for conventional systems based on a software-programmable processor. For programming the software and hardware parts, different languages have to be used and additional specialised hardware-knowledge is required. Both factors increase the development cost. This work presents the compiler framework Nymble which allows to program a heterogeneous system with only a single high-level language. In the high-level language the developer only has to select which parts of the application should be executed in hardware. Nymble then generates a program for the software-processor, the configuration of the hardware, and all interfaces between software and hardware. All heterogeneous systems supported by Nymble have in common that the software and hardware parts of an application have access to a shared memory. As this memory is external RAM with high access latency, it is necessary to insert a cache between the memory and hardware. With this cache, memory accesses can vary between very short or long access latency depending on whether the data is available in the cache. To hide long latencies, this thesis presents a novel execution model which allows the simultaneous execution of multiple threads in a single accelerator. Additionally, the model enables threads to be dynamically reordered at specific points in the common accelerator pipeline. This capability is used to let other (non-waiting) threads overtake a thread which is waiting for a memory access. Thus, these other threads can execute their calculations independently of the waiting thread to bridge the latency of memory accesses. Previous works are using execution models which only allow a single thread to be active in the accelerator. In case of a memory access with long latency, the thread is exchanged with another non-waiting thread. This design of the hardware often causes many resources to lie idle for a significant amount of time. In contrast, the presented novel execution model dynamically spreads multiple threads over the pipeline. This results in a higher utilisation of the resources by using resources more effectively. Furthermore, the simultaneous execution of multiple threads can achieve similar throughput as multiple copies of a single-threaded accelerator running in parallel. The new execution model makes it possible to combine the improved throughput of multiple copies with the increased efficiency of simultaneous threads in a single accelerator. Thread reordering allows the new model to be effectively used with a cached shared-memory. In comparison, between four copies of a single-threaded accelerator and a multi-thread accelerator with four thread (both created by Nymble), a resource efficiency of up to factor 2.6x can be achieved. At the same time, four simultaneous threads can be up to 4x as fast as four threads executed consecutively on a single accelerator. Compared to other, more optimised compilers, Nymble can still achieve up to 2x faster runtime with 1.5x resource efficiency

    A.: Accelerating high-level engineering computations by automatic compilation of geometric algebra to hardware accelerators

    No full text
    Abstract-Geometric Algebra (GA), a generalization of quaternions, is a very powerful form for intuitively expressing and manipulating complex geometric relationships common to engineering problems. The actual evaluation of GA expressions, though, is extremely compute intensive due to the high-dimensionality of data being processed. On standard desktop CPUs, GA evaluations take considerably longer than conventional mathematical formulations. GPUs do offer sufficient throughput to make the use of concise GA formulations practical, but require power far exceeding the budgets for most embedded applications. While the suitability of low-power reconfigurable accelerators for evaluating specific GA computations has already been demonstrated, these often required a significant manual design effort. We present a proof-of-concept compile flow combining symbolic and hardware optimization techniques to automatically generate accelerators from the abstract GA descriptions without user intervention that are suitable for high-performance embedded computing

    Minimal-Precision Computing for High-Performance, Energy-Efficient, and Reliable Computations

    No full text
    International audienceWe propose a new systematic approach for minimal-precisioncomputations. This approach is reliable, general, comprehensive,high-performance, and realistic. Although the proposed systemis still in development, this presentation shows that the systemcould be constructed by combining already available (developed)in-house technologies as well as extending them
    corecore