Compiling Geometric Algebra Computations into Reconfigurable Hardware Accelerators
Geometric Algebra (GA), a generalization of quaternions and complex numbers, is a very powerful framework for intuitively expressing and manipulating the complex geometric relationships common to engineering problems. However, the actual evaluation of GA expressions is very compute intensive, and acceleration is generally required for practical use. GPUs and FPGAs offer such acceleration while requiring only little power per operation.
In this paper, we present key components of a proof-of-concept compile flow that combines symbolic and hardware optimization techniques to automatically generate, from abstract GA descriptions, hardware accelerators suitable for high-performance embedded computing.
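To make the computational load concrete, the following minimal C sketch (not taken from the paper's tool chain; the type and function names are ours) evaluates the full geometric product of two dense multivectors in the 2D algebra Cl(2,0). Even in two dimensions a dense multivector has 2^2 = 4 coefficients and the product needs 4 x 4 = 16 multiply-adds; the count grows as 4^n with the dimension n, which is what makes direct GA evaluation expensive and an attractive target for generated hardware.

    #include <stdio.h>

    /* Dense multivector in Cl(2,0): scalar, e1, e2 and the bivector e12. */
    typedef struct { float s, e1, e2, e12; } mv2;

    /* Full geometric product: 16 multiply-adds even for this tiny algebra. */
    static mv2 geometric_product(mv2 a, mv2 b)
    {
        mv2 r;
        r.s   = a.s*b.s   + a.e1*b.e1 + a.e2*b.e2  - a.e12*b.e12;
        r.e1  = a.s*b.e1  + a.e1*b.s  - a.e2*b.e12 + a.e12*b.e2;
        r.e2  = a.s*b.e2  + a.e2*b.s  + a.e1*b.e12 - a.e12*b.e1;
        r.e12 = a.s*b.e12 + a.e12*b.s + a.e1*b.e2  - a.e2*b.e1;
        return r;
    }

    int main(void)
    {
        mv2 e1 = {0.0f, 1.0f, 0.0f, 0.0f};
        mv2 e2 = {0.0f, 0.0f, 1.0f, 0.0f};
        mv2 c  = geometric_product(e1, e2);   /* expected result: the bivector e12 */
        printf("%g + %g e1 + %g e2 + %g e12\n", c.s, c.e1, c.e2, c.e12);
        return 0;
    }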
White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing
In numerical computations, the precision of floating-point operations is a key factor in determining both performance (speed and energy efficiency) and reliability (accuracy and reproducibility). However, the two goals generally pull precision in opposite directions. The ultimate concept for maximizing both at the same time is therefore minimal-precision computing through precision-tuning, which selects the optimal precision for each operation and each datum. Several studies have already addressed precision-tuning (e.g. Precimonious and Verrou), but their scope is limited to the tuning step alone. Hence, we aim to propose a broader concept of a minimal-precision computing system with precision-tuning, involving both the hardware and the software stack.
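As a purely illustrative example (not part of the white paper), the C program below accumulates the same series in single and double precision. The lower precision is typically cheaper in time and energy, but it loses several digits here, which is exactly the per-operation trade-off that precision-tuning tries to resolve.

    #include <stdio.h>

    int main(void)
    {
        float  sum_f = 0.0f;
        double sum_d = 0.0;

        /* Accumulate the harmonic series; the float accumulator stops
         * absorbing terms once they fall below one ulp of the running sum. */
        for (int i = 1; i <= 10000000; i++) {
            sum_f += 1.0f / (float)i;
            sum_d += 1.0  / (double)i;
        }
        printf("single precision: %.7f\n",  sum_f);
        printf("double precision: %.15f\n", sum_d);
        return 0;
    }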
In 2019, we started the Minimal-Precision Computing project to realize this broader concept. Specifically, our
system combines (1) a precision-tuning method based on Discrete Stochastic
Arithmetic (DSA), (2) arbitrary-precision arithmetic libraries, (3) fast and
accurate numerical libraries, and (4) Field-Programmable Gate Array (FPGA) with
High-Level Synthesis (HLS).
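As a rough illustration of the idea behind component (1), the sketch below reruns the same single-precision summation under different IEEE rounding modes and uses the spread of the results to estimate how many decimal digits can be trusted. This is only a toy stand-in for Discrete Stochastic Arithmetic, not the CADNA/Verrou machinery, and it may need compiler support for runtime rounding modes (e.g. -frounding-math).

    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    /* Sum the series under a chosen rounding mode, then restore to-nearest. */
    static float sum_series(int rounding_mode)
    {
        fesetround(rounding_mode);
        float s = 0.0f;
        for (int i = 1; i <= 1000000; i++)
            s += 1.0f / (float)i;
        fesetround(FE_TONEAREST);
        return s;
    }

    int main(void)
    {
        float up   = sum_series(FE_UPWARD);
        float down = sum_series(FE_DOWNWARD);
        float near = sum_series(FE_TONEAREST);

        /* The spread between rounding modes bounds the accumulated rounding
         * error; log10(|value| / spread) estimates the reliable digits. */
        float  spread = fabsf(up - down);
        double digits = spread > 0.0f ? log10(fabs((double)near) / spread) : 7.0;

        printf("result %.7f, spread %.2e, about %.1f reliable digits\n",
               near, spread, digits);
        return 0;
    }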
In this white paper, we provide an overview of various technologies related to minimal- and mixed-precision computing, outline the future direction of the project, and discuss current challenges together with our project members and the guest speakers of the LSPANC 2020 workshop:
https://www.r-ccs.riken.jp/labs/lpnctrt/lspanc2020jan/
An Execution Model and High-Level-Synthesis System for Generating SIMT Multi-Threaded Hardware from C Source Code
The performance improvement of conventional processors has begun to stagnate in recent years. Because of this, researchers are looking for new ways to improve the performance of computing systems.
Heterogeneous systems have turned out to be a powerful option. In the context of this thesis, a heterogeneous system consists of a software-programmable processor and an FPGA-based configurable hardware accelerator. By using an accelerator specifically tailored to a particular application, heterogeneous systems can achieve higher performance than conventional processors.
Due to their increased complexity, it is more complicated to develop applications for heterogeneous systems than for conventional systems based on a software-programmable processor. Different languages have to be used to program the software and hardware parts, and additional specialised hardware knowledge is required. Both factors increase the development cost.
This work presents the compiler framework Nymble, which makes it possible to program a heterogeneous system with only a single high-level language.
In this high-level language, the developer only has to select which parts of the application should be executed in hardware. Nymble then generates a program for the software processor, the configuration of the hardware, and all interfaces between software and hardware.
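The abstract does not show Nymble's concrete annotation syntax, so the pragma in the sketch below is purely hypothetical; it only illustrates the idea that one marked C function is turned into an accelerator while the rest of the program stays in software. A standard C compiler will simply ignore the unknown pragma.

    #include <stdio.h>

    /* Hypothetical marker, not actual Nymble syntax: the function below is
     * selected for hardware execution; the compiler would generate the
     * accelerator, the host code and the software/hardware interfaces. */
    #pragma nymble hardware
    static void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        float x[1024], y[1024];
        for (int i = 0; i < 1024; i++) { x[i] = (float)i; y[i] = 1.0f; }
        saxpy(1024, 2.0f, x, y);      /* would run on the FPGA accelerator */
        printf("y[10] = %g\n", y[10]);
        return 0;
    }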
All heterogeneous systems supported by Nymble have in common that the software and hardware parts of an application have access to a shared memory. As this memory is external RAM with a high access latency, it is necessary to insert a cache between the memory and the hardware. With this cache, memory accesses can have either a very short or a very long latency, depending on whether the data is available in the cache.
To hide long latencies, this thesis presents a novel execution model which allows the simultaneous execution of multiple threads in a single accelerator. Additionally, the model enables threads to be dynamically reordered at specific points in the common accelerator pipeline. This capability is used to let other (non-waiting) threads overtake a thread which is waiting for a memory access. Thus, these other threads can execute their calculations independently of the waiting thread to bridge the latency of memory accesses.
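The following C program is only a software analogy of that latency-hiding idea (the thesis performs the reordering inside a single accelerator pipeline, not with operating-system threads): one worker blocks on a simulated memory access while the other, independent workers keep computing, so the long latency is bridged by their work. Build with -pthread.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        long id = (long)arg;
        if (id == 0) {
            /* Thread 0 "waits for memory": a 100 ms stall stands in for a
             * cache miss that goes out to external RAM. */
            usleep(100000);
            printf("thread 0: data arrived\n");
        } else {
            /* The other threads overtake it and do independent work. */
            double acc = 0.0;
            for (long i = 1; i <= 5000000; i++)
                acc += 1.0 / (double)i;
            printf("thread %ld: partial result %.6f\n", id, acc);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }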
Previous works use execution models that allow only a single thread to be active in the accelerator. In the case of a memory access with a long latency, that thread is exchanged for another, non-waiting thread. This design often leaves many hardware resources idle for a significant amount of time.
In contrast, the presented novel execution model dynamically spreads multiple threads over the pipeline, which uses the available resources more effectively and results in higher utilisation. Furthermore, the simultaneous execution of multiple threads can achieve a throughput similar to that of multiple copies of a single-threaded accelerator running in parallel.
The new execution model thus makes it possible to combine the improved throughput of multiple copies with the increased efficiency of simultaneous threads in a single accelerator. Thread reordering allows the new model to be used effectively with a cached shared memory.
Comparing four copies of a single-threaded accelerator with a multi-threaded accelerator running four threads (both created by Nymble), a resource-efficiency improvement of up to 2.6x is achieved. At the same time, four simultaneous threads can be up to 4x as fast as four threads executed consecutively on a single accelerator. Compared to other, more optimised compilers, Nymble can still achieve up to 2x faster runtimes at 1.5x resource efficiency.
Accelerating High-Level Engineering Computations by Automatic Compilation of Geometric Algebra to Hardware Accelerators
Geometric Algebra (GA), a generalization of quaternions, is a very powerful framework for intuitively expressing and manipulating the complex geometric relationships common to engineering problems. The actual evaluation of GA expressions, though, is extremely compute intensive due to the high dimensionality of the data being processed. On standard desktop CPUs, GA evaluations take considerably longer than conventional mathematical formulations. GPUs do offer sufficient throughput to make the use of concise GA formulations practical, but require power far exceeding the budgets of most embedded applications. While the suitability of low-power reconfigurable accelerators for evaluating specific GA computations has already been demonstrated, these often required a significant manual design effort. We present a proof-of-concept compile flow combining symbolic and hardware optimization techniques to automatically generate, without user intervention, accelerators from abstract GA descriptions that are suitable for high-performance embedded computing.
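To hint at what the symbolic step can contribute (the paper's actual flow is not reproduced here; the names below are ours), the C sketch specializes the dense 2D multivector product to operands known to be pure vectors: only the scalar and bivector coefficients survive, so the generated datapath needs four multiplications instead of sixteen.

    #include <stdio.h>

    /* Product of two pure vectors in 2D: a scalar plus a bivector. */
    typedef struct { float s, e12; } rotor2;

    /* Specialized product of a1*e1 + a2*e2 and b1*e1 + b2*e2. */
    static rotor2 vector_product(float a1, float a2, float b1, float b2)
    {
        rotor2 r;
        r.s   = a1*b1 + a2*b2;   /* symmetric (inner) part */
        r.e12 = a1*b2 - a2*b1;   /* antisymmetric (outer) part */
        return r;
    }

    int main(void)
    {
        rotor2 r = vector_product(1.0f, 0.0f, 0.0f, 1.0f);  /* e1 times e2 */
        printf("%g + %g e12\n", r.s, r.e12);                 /* prints 0 + 1 e12 */
        return 0;
    }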
Optimizing Precision for High-Performance, Robust, and Energy-Efficient Computations
Minimal-Precision Computing for High-Performance, Energy-Efficient, and Reliable Computations
We propose a new systematic approach for minimal-precision computations. This approach is reliable, general, comprehensive, high-performance, and realistic. Although the proposed system is still in development, this presentation shows that the system could be constructed by combining already available in-house technologies as well as extending them.