BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing
Matrix-matrix multiplication is a key computational kernel for numerous
applications in science and engineering, with ample parallelism and data
locality that lends itself well to high-performance implementations. Many
matrix multiplication-dependent applications can use reduced-precision integer
or fixed-point representations to increase their performance and energy
efficiency while still offering adequate quality of results. However, precision
requirements may vary between different application phases or depend on input
data, rendering constant-precision solutions ineffective. We present BISMO, a
vectorized bit-serial matrix multiplication overlay for reconfigurable
computing. BISMO utilizes the excellent binary-operation performance of FPGAs
to offer a matrix multiplication performance that scales with required
precision and parallelism. We characterize the resource usage and performance
of BISMO across a range of parameters to build a hardware cost model, and
demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.Comment: To appear at FPL'1
A2Q+: Improving Accumulator-Aware Weight Quantization
Quantization techniques commonly reduce the inference costs of neural
networks by restricting the precision of weights and activations. Recent
studies show that also reducing the precision of the accumulator can further
improve hardware efficiency at the risk of numerical overflow, which introduces
arithmetic errors that can degrade model accuracy. To avoid numerical overflow
while maintaining accuracy, recent work proposed accumulator-aware quantization
(A2Q), a quantization-aware training method that constrains model weights
during training to safely use a target accumulator bit width during inference.
Although this shows promise, we demonstrate that A2Q relies on an overly
restrictive constraint and a sub-optimal weight initialization strategy that
each introduce superfluous quantization error. To address these shortcomings,
we introduce: (1) an improved bound that alleviates accumulator constraints
without compromising overflow avoidance; and (2) a new strategy for
initializing quantized weights from pre-trained floating-point checkpoints. We
combine these contributions with weight normalization to introduce A2Q+. We
support our analysis with experiments that show A2Q+ significantly improves the
trade-off between accumulator bit width and model accuracy and characterize new
trade-offs that arise as a consequence of accumulator constraints.
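The accumulator constraint can be illustrated in miniature: bounding the L1 norm of a layer's integer weights bounds the worst-case dot-product magnitude, which in turn fixes the accumulator width needed to rule out overflow. The sketch below captures that reasoning only; it is not the A2Q or A2Q+ bound itself, and the function name is ours:

```python
import math

def accumulator_bits_needed(weights, act_bits, signed_acts=False):
    """Signed accumulator width that can never overflow for a dot product
    of the given integer weights with act_bits-bit activations.

    The worst-case magnitude is ||w||_1 * max|x|; representing +/- that
    value needs ceil(log2(worst + 1)) magnitude bits plus a sign bit.
    Inverting this relationship (fixing the accumulator width and solving
    for the admissible ||w||_1) is the idea behind accumulator-aware
    quantization constraints."""
    max_act = 2 ** (act_bits - 1) if signed_acts else 2 ** act_bits - 1
    l1 = sum(abs(w) for w in weights)
    worst = l1 * max_act
    return math.ceil(math.log2(worst + 1)) + 1
```

For example, weights [3, -2, 1] with unsigned 8-bit activations give a worst case of 6 * 255 = 1530, which needs a 12-bit signed accumulator.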
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
Research has shown that convolutional neural networks contain significant
redundancy, and high classification accuracy can be obtained even when weights
and activations are reduced from floating point to binary values. In this
paper, we present FINN, a framework for building fast and flexible FPGA
accelerators using a flexible heterogeneous streaming architecture. By
utilizing a novel set of optimizations that enable efficient mapping of
binarized neural networks to hardware, we implement fully connected,
convolutional and pooling layers, with per-layer compute resources being
tailored to user-provided throughput requirements. On a ZC706 embedded FPGA
platform drawing less than 25 W total system power, we demonstrate up to 12.3
million image classifications per second with 0.31 µs latency on the MNIST
dataset with 95.8% accuracy, and 21906 image classifications per second with
283 µs latency on the CIFAR-10 and SVHN datasets with respectively 80.1%
and 94.9% accuracy. To the best of our knowledge, ours are the fastest
classification rates reported to date on these benchmarks.
Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017
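The binarized arithmetic behind such accelerators can be sketched in a few lines: with {-1, +1} values packed into machine words (1 encoding +1, 0 encoding -1), a dot product reduces to an XNOR followed by a popcount. An illustrative sketch assuming this common encoding (the function name is ours):

```python
def xnor_popcount_dot(w_bits, x_bits, n):
    """Dot product of two length-n {-1,+1} vectors packed as bit words.

    XNOR marks the lanes where weight and activation agree; with p agreeing
    lanes the sum of n products, each +1 or -1, is p - (n - p) = 2p - n."""
    mask = (1 << n) - 1
    agree = ~(w_bits ^ x_bits) & mask   # XNOR over the n packed lanes
    p = bin(agree).count("1")           # popcount
    return 2 * p - n
```

This replacement of multiply-accumulate with XNOR-popcount is what lets binarized layers reach the very high throughputs reported above.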
Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing
Matrix-matrix multiplication is a key computational kernel for numerous
applications in science and engineering, with ample parallelism and data
locality that lends itself well to high-performance implementations. Many
matrix multiplication-dependent applications can use reduced-precision integer
or fixed-point representations to increase their performance and energy
efficiency while still offering adequate quality of results. However, precision
requirements may vary between different application phases or depend on input
data, rendering constant-precision solutions ineffective. BISMO, a vectorized
bit-serial matrix multiplication overlay for reconfigurable computing,
previously utilized the excellent binary-operation performance of FPGAs to
offer a matrix multiplication performance that scales with required precision
and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an
arithmetic architecture that better utilizes 6-LUTs. The improved BISMO
achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a
Xilinx UltraScale+ MPSoC.
Comment: Invited paper at ACM TRETS as extension of FPL'18 paper arXiv:1806.0886
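The precision scaling described here follows from simple accounting: a w-bit-by-a-bit product built bit-serially costs w × a binary operations, so effective throughput at a given precision is the binary peak divided by that product. A back-of-the-envelope sketch (the function name is ours):

```python
def effective_tops(peak_binary_tops, w_bits, a_bits):
    """Effective TOPS at a given precision for a bit-serial engine.

    Each w_bits x a_bits product is assembled from w_bits * a_bits
    binary (1-bit) operations, so throughput falls off as the product
    of the two precisions."""
    return peak_binary_tops / (w_bits * a_bits)
```

For example, at the 15.4 binary TOPS peak reported above, 2-bit-by-2-bit matrix multiplication would peak at roughly 3.85 TOPS.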
Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference
Efficient machine learning implementations optimized for inference in
hardware have wide-ranging benefits, depending on the application, from lower
inference latency to higher data throughput and reduced energy consumption. Two
popular techniques for reducing computation in neural networks are pruning,
removing insignificant synapses, and quantization, reducing the precision of
the calculations. In this work, we explore the interplay between pruning and
quantization during the training of neural networks for ultra low latency
applications targeting high energy physics use cases. Techniques developed for
this study have potential applications across many other domains. We study
various configurations of pruning during quantization-aware training, which we
term quantization-aware pruning, and the effect of techniques like
regularization, batch normalization, and different pruning schemes on
performance, computational complexity, and information content metrics. We find
that quantization-aware pruning yields more computationally efficient models
than either pruning or quantization alone for our task. Further,
quantization-aware pruning typically performs similar to or better in terms of
computational efficiency compared to other neural architecture search
techniques like Bayesian optimization. Surprisingly, while networks with
different training configurations can have similar performance for the
benchmark application, the information content in the network can vary
significantly, affecting its generalizability.
Comment: 22 pages, 7 Figures, 1 Table
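As a toy illustration of the interplay studied here, quantization-aware pruning can be sketched as a magnitude pruning mask applied before uniform fake quantization in the forward pass. This is a minimal NumPy sketch, not the authors' training setup: the function names are ours, and the straight-through backward pass used in actual quantization-aware training is omitted:

```python
import numpy as np

def prune_mask(w, sparsity):
    """Magnitude pruning: zero out the smallest-|w| fraction of weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return np.ones_like(w)
    thresh = np.sort(np.abs(w).ravel())[k - 1]
    return (np.abs(w) > thresh).astype(w.dtype)

def fake_quantize(w, bits):
    """Uniform symmetric fake quantization: snap to a bits-wide integer
    grid but return float values, as in the QAT forward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def quantization_aware_pruning_step(w, sparsity, bits):
    """One forward-pass view of quantization-aware pruning:
    prune first, then quantize the surviving weights."""
    return fake_quantize(w * prune_mask(w, sparsity), bits)
```

In real training the mask and quantizer are applied every step so the surviving weights adapt to both constraints jointly, which is where the reported efficiency gains over applying either technique alone come from.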
LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations
We propose two tiers of modifications to FPGA logic cell architecture to
deliver a variety of performance and utilization benefits with only minor area
overheads. In the first tier, we augment existing commercial logic cell
datapaths with a 6-input XOR gate in order to improve the expressiveness of
each element, while maintaining backward compatibility. This new architecture
is vendor-agnostic, and we refer to it as LUXOR. We also consider a secondary
tier of vendor-specific modifications to both Xilinx and Intel FPGAs, which we
refer to as X-LUXOR+ and I-LUXOR+ respectively. We demonstrate that compressor
tree synthesis using generalized parallel counters (GPCs) is further improved
with the proposed modifications. Using both the Intel adaptive logic module and
the Xilinx slice at the 65nm technology node for a comparative study, it is
shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for
LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We
demonstrate that LUXOR can deliver an average reduction of 13-19% in logic
utilization on micro-benchmarks from a variety of domains. BNN benchmarks
benefit the most with an average reduction of 37-47% in logic utilization,
which is due to the highly-efficient mapping of the XnorPopcount operation on
our proposed LUXOR+ logic cells.
Comment: In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), February 23-25, 2020, Seaside, CA, USA
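The compressor trees LUXOR targets can be modeled in software: layers of 3:2 compressors (full adders) reduce columns of 1-bit signals until at most two rows remain, the reduction pattern that generalized parallel counters extend to wider input/output shapes. An illustrative model of that reduction (not the paper's actual mapping; the function name is ours):

```python
def popcount_compressor_tree(bits):
    """Count set bits with layers of 3:2 compressors (full adders),
    the primitive that generalized parallel counters (GPCs) generalize."""
    # columns[w] holds the 1-bit signals of weight 2**w awaiting compression
    columns = {0: list(bits)}
    while any(len(col) > 2 for col in columns.values()):
        nxt = {}
        for w, col in sorted(columns.items()):
            nxt.setdefault(w, [])
            while len(col) >= 3:                       # one full adder per triple
                a, b, c = col.pop(), col.pop(), col.pop()
                s = a ^ b ^ c                          # sum stays at weight w
                carry = (a & b) | (a & c) | (b & c)    # carry moves to weight w + 1
                nxt[w].append(s)
                nxt.setdefault(w + 1, []).append(carry)
            nxt[w].extend(col)                         # leftovers pass through
        columns = nxt
    # at most two signals remain per column: finish with ordinary addition
    return sum(sum(col) << w for w, col in columns.items())
```

Each pass corresponds to one level of the hardware tree; the XOR/majority pair inside the loop is exactly the logic a 6-input XOR-augmented cell can absorb more cheaply.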