27 research outputs found
A generator of numerically-tailored and high-throughput accelerators for batched GEMMs
We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy efficiency goals. The generated arrays have three main novel aspects. First, the accelerators handle a large variety of computer number formats using intermediate representations based on our Sign Scale Significand (S3) format. Second, the processing elements perform all intermediate dot-product arithmetic operations required by the GEMM kernel without any intermediate rounding, which makes it possible to deliver better energy efficiency than state-of-the-art approaches while offering more accuracy and reproducible results. Third, our accelerators feature the Half-Speed Sink Down (HSSD) mechanism, which maximizes the overlap of host-accelerator data transfers with GEMM computations.We evaluate our automatically generated designs in a cutting-edge setup composed of a POWER9 host, CAPI (Coherent Accelerator Processor Interface) link, and a Virtex Ultrascale Plus FPGA. Arrays can operate at the speed of the link and saturate it to reach a 13GB/s throughput. Our fine-grain customization approach allows to cover a wide range of accuracy versus efficiency scenarios and can reach 0.65GOps/s/W while producing 1024 accurate bits or 148.7GOps/s/W with 6 accurate bits. Our configurations achieve up to 1613GOps/s system performance and power efficiencies of up to 240GOps/s/W for the FPGA. This automatic generator is the first being able to produce such a variety of designs. We improve the single-precision energy efficiency of state-of-the-art FPGA GEMM accelerators by 1.86×.This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 955606 Marc Casas is supported by Grant RYC-2017-23269 funded by MCIN/AEI/ 10.13039/501100011033 and by “ESF Investing in your future”Peer ReviewedPostprint (author's final draft
Computing and Compressing Electron Repulsion Integrals on FPGAs
The computation of electron repulsion integrals (ERIs) over Gaussian-type
orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic
simulations. In practical simulations, several trillions of ERIs may have to be
computed for every time step.
In this work, we investigate FPGAs as accelerators for the ERI computation.
We use template parameters, here within the Intel oneAPI tool flow, to create
customized designs for 256 different ERI quartet classes, based on their
orbitals. To maximize data reuse, all intermediates are buffered in FPGA
on-chip memory with customized layout. The pre-calculation of intermediates
also helps to overcome data dependencies caused by multi-dimensional recurrence
relations. The involved loop structures are partially or even fully unrolled
for high throughput of FPGA kernels. Furthermore, a lossy compression algorithm
utilizing arbitrary bitwidth integers is integrated in the FPGA kernels. To our
best knowledge, this is the first work on ERI computation on FPGAs that
supports more than just the single most basic quartet class. Also, the
integration of ERI computation and compression it a novelty that is not even
covered by CPU or GPU libraries so far.
Our evaluation shows that using 16-bit integer for the ERI compression, the
fastest FPGA kernels exceed the performance of 10 GERIS ( ERIs
per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors
around - Hartree. The measured throughput can be accurately
explained by a performance model. The FPGA kernels deployed on 2 FPGAs
outperform similar computations using the widely used libint reference on a
two-socket server with 40 Xeon Gold 6148 CPU cores of the same process
technology by factors up to 6.0x and on a new two-socket server with 128 EPYC
7713 CPU cores by up to 1.9x
Χρήση μοντέλου παράλληλου προγραμματισμού για σύνθεση αρχιτεκτονικών
The problem of automatically generating hardware modules from high level application representations has been at the forefront of EDA research during the last few years. In this Dissertation we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match the applications memory access pattern and computational complexity and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers thereby expanding the scope of FPGAs beyond the realm of hardware design.To πρόβλημα της αυτόματης δημιουργίας μονάδων υλικό από παραστάσεις υψηλού επιπέδου εφαρμογής είναι στην πρώτη γραμμή της EDA έρευνας κατά τη διάρκεια των τελευταίων ετών. Σε αυτή την διατριβή παρουσιάζουμε μια μεθοδολογία για τη αυτόματη σύνθεση επιταχυντές υλικού από εφαρμογές OpenCL. OpenCL είναι ένα πρόσφατο πρότυπο για τη σύνταξη των προγραμμάτων που εκτελούνται σε πλατφόρμες πολλαπλών πυρήνων και επιταχυντές όπως GPUs. Η μεθοδολογία μας μετατρέπει προγράμματα OpenCL σε επιταχυντές υλικού με βάση αρχιτεκτονικά πρότυπα που ρητά αποσυνδέει τους υπολογισμούς από την μεταφορά δεδομένων από/προς την μνήμη όποτε αυτό είναι δυνατό. Τα πρότυπα μπορούν να συντονιστούν ώστε να παρέχουν ένα ευρύ ρεπερτόριο από επιταχυντές που πληρούν τις απαιτήσεις απόδοσης των χρηστών και τα χαρακτηριστικά της συσκευής FPGA. Επιπλέον ένα σύνολο υψηλής και χαμηλής στάθμης βελτιστοποιήσεις μεταγλωττιστή εφαρμόζεται για να παράγει βελτιστοποιημένα επιταχυντές. Η πειραματική αξιολόγηση δείχνει ότι οι επιταχυντές που δημιουργούνται αποτελεσματικά συντονισμένοι για να ταιριάζει με το μοτίβο πρόσβασης στην μνήμη κάθε εφαρμογής και την υπολογιστική πολυπλοκότητα και να επιτύχουν τις απαιτήσεις απόδοσης των χρηστών. Ένας σημαντικός στόχος του εργαλείου μας είναι η επέκταση της βάσης χρηστών πλατφόρμες FPGA για μηχανικούς λογισμικού ώστε να γίνει ανάπτυξη FPGA συστήματα από μηχανικούς λογισμικού χωρίς την ανάγκη για εμπειρία σχεδιασμού υλικού
FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software
A deep-learning inference accelerator is synthesized from a C-language
software program parallelized with Pthreads. The software implementation uses
the well-known producer/consumer model with parallel threads interconnected by
FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into
parallel FPGA hardware, translating software parallelism into spatial
parallelism. A complete system is generated where convolution, pooling and
padding are realized in the synthesized accelerator, with remaining tasks
executing on an embedded ARM processor. The accelerator incorporates reduced
precision, and a novel approach for zero-weight-skipping in convolution. On a
mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective
GOPS
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Fully Homomorphic Encryption is a technique that allows computation on
encrypted data. It has the potential to change privacy considerations in the
cloud, but computational and memory overheads are preventing its adoption. TFHE
is a promising Torus-based FHE scheme that relies on bootstrapping, the
noise-removal tool invoked after each encrypted logical/arithmetical operation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is
the first hardware accelerator to exploit the inherent noise present in FHE
calculations. Instead of double or single-precision floating-point arithmetic,
it implements TFHE bootstrapping entirely with approximate fixed-point
arithmetic. Using an in-depth analysis of noise propagation in bootstrapping
FFT computations, FPT is able to use noise-trimmed fixed-point representations
that are up to 50% smaller than prior implementations.
FPT is built as a streaming processor inspired by traditional streaming DSPs:
it instantiates directly cascaded high-throughput computational stages, with
minimal control logic and routing networks. We explore throughput-balanced
compositions of streaming kernels with a user-configurable streaming width in
order to construct a full bootstrapping pipeline. Our approach allows 100%
utilization of arithmetic units and requires only a small bootstrapping key
cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS /
35us. This is in stark contrast to the classical CPU approach to FHE
bootstrapping acceleration, which is typically constrained by memory and
bandwidth.
FPT is implemented and evaluated as a bootstrapping FPGA kernel for an Alveo
U280 datacenter accelerator card. FPT achieves two to three orders of magnitude
higher bootstrapping throughput than existing CPU-based implementations, and
2.5x higher throughput compared to recent ASIC emulation experiments.Comment: ACM CCS 202
A Survey and Evaluation of FPGA High-Level Synthesis Tools
High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today's system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources
Rapid Prototyping and Exploration Environment for Generating C-to-Hardware-Compilers
There is today an ever-increasing demand for more computational power
coupled with a desire to minimize energy requirements. Hardware
accelerators currently appear to be the best solution to this problem.
While general purpose computation with GPUs seem to be
very successful in this area, they perform adequately only in those cases where the data
access patterns and utilized algorithms fit the underlying
architecture. ASICs on the other hand can yield even better results
in terms of performance and energy consumption, but are very
inflexible, as they are manufactured with an application specific
circuitry. Field Programmable Gate Arrays (FPGAs) represent a combination of approaches: With their application specific hardware they provide
high computational power while requiring, for many applications, less
energy than a CPU or a GPU. On the other hand they are far more
flexible than an ASIC due to their reconfigurability.
The only remaining problem is the programming of
the FPGAs, as they are far more difficult to program compared to
regular software. To allow common software developers, who have at
best very limited
knowledge in hardware design, to make use of these devices, tools were
developed that take a regular high level language and generate
hardware from it.
Among such tools, C-to-HDL compilers are a particularly wide-spread approach. These compilers attempt to translate common C code into a hardware description language from which a
datapath is generated. Most of these compilers have many
restrictions for the input and differ in their underlying generated
micro architecture, their scheduling method, their applied
optimizations, their execution model and even their target
hardware. Thus, a comparison of a certain aspect alone, like their
implemented scheduling method or their generated micro architecture,
is almost impossible, as they differ in so many other aspects.
This work provides a survey of the existing C-to-HDL compilers and
presents a new approach to evaluating and exploring
different micro architectures for dynamic scheduling used by such
compilers. From a mathematically formulated rule set the Triad
compiler generates a backend for the Scale compiler framework,
which then implements a hardware generation backend with described dynamic
scheduling.
While more than a factor of four slower than hardware from highly
optimized compilers, this environment allows easy comparison and
exploration of different rule sets and the micro architecture for the
dynamically scheduled datapaths generated from them. For demonstration
purposes a rule set modeling the COCOMA token flow model from
the COMRADE 2.0 compiler was implemented. Multiple variants of
it were explored: Savings of up to 11% of the required hardware
resources were possible
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) is a technique that allows computation on encrypted data. It has the potential to drastically change privacy considerations in the cloud, but high computational and memory overheads are preventing its broad adoption. TFHE is a promising Torus-based FHE scheme that heavily relies on bootstrapping, the noise-removal tool invoked after each encrypted logical/arithmetical operation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is the first hardware accelerator to heavily exploit the inherent noise present in FHE calculations. Instead of double or single-precision floating-point arithmetic, it implements TFHE bootstrapping entirely with approximate fixed-point arithmetic. Using an in-depth analysis of noise propagation in bootstrapping FFT computations, FPT is able to use noise-trimmed fixed-point representations that are up to 50% smaller than prior implementations that prefer floating-point or integer FFTs.
FPT is built as a streaming processor inspired by traditional streaming DSPs: it instantiates directly cascaded high-throughput computational stages, with minimal control logic and routing networks. We explore different throughput-balanced compositions of streaming kernels with a user-configurable streaming width in order to construct a full bootstrapping pipeline. Our proposed approach allows 100% utilization of arithmetic units and requires only a small bootstrapping key cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS / 35s. This is in stark contrast to the established classical CPU approach to FHE bootstrapping acceleration, which is typically constrained by memory and bandwidth.
FPT is fully implemented and evaluated as a bootstrapping FPGA kernel for an Alveo U280 datacenter accelerator card. FPT achieves two to three orders of magnitude higher bootstrapping throughput than existing CPU-based implementations, and 2.5 higher throughput compared to recent ASIC emulation experiments