Exploring the Potential of Flexible 8-bit Format: Design and Algorithm
Neural network quantization is widely used to reduce model inference
complexity in real-world deployments. However, traditional integer quantization
suffers from accuracy degradation when adapting to various dynamic ranges.
Recent research has focused on a new 8-bit format, FP8, with hardware support for both training and inference of neural networks, but guidance for hardware design is still lacking. In this paper, we analyze the benefits of using FP8
quantization and provide a comprehensive comparison of FP8 with INT
quantization. Then we propose a flexible mixed-precision quantization framework
that supports various number systems, enabling optimal selection of the most
appropriate quantization format for different neural network architectures.
Experimental results demonstrate that our proposed framework achieves
competitive performance compared to full precision on various tasks, including
image classification, object detection, segmentation, and natural language
understanding. Our work provides critical insights into the practical benefits and feasibility of employing FP8 quantization, paving the way for improved neural network efficiency in real-world scenarios. Our code is available in the supplementary material.
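To make the FP8 comparison concrete, here is a minimal NumPy sketch of quantize-dequantize rounding for an E4M3-style format (1 sign, 4 exponent, 3 mantissa bits). The helper name and the IEEE-style exponent layout are illustrative assumptions, not the paper's implementation; the actual OCP E4M3 definition extends the exponent range (saturating at 448) in a way this simplified model does not reproduce.

import numpy as np

def fp8_quantize_dequantize(x, exp_bits=4, man_bits=3):
    # Hypothetical helper: simulate FP8 rounding of a float array
    # using a simplified IEEE-style layout (sign / exponent / mantissa).
    bias = 2 ** (exp_bits - 1) - 1            # exponent bias, 7 for E4M3
    min_exp = 1 - bias                        # smallest normal exponent
    max_exp = (2 ** exp_bits - 1) - bias - 1  # top exponent code reserved

    sign = np.sign(x)
    mag = np.abs(x)

    # Per-element exponent, clamped so tiny values round as subnormals
    # and huge values saturate.
    exp = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** min_exp))),
                  min_exp, max_exp)

    # Round the significand to man_bits fractional bits.
    step = 2.0 ** (exp - man_bits)
    q = np.round(mag / step) * step

    # Saturate to the largest representable magnitude.
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    return sign * np.minimum(q, max_val)

A quantization-aware flow would typically apply such a rounding op to weights and activations in the forward pass and use a straight-through estimator in the backward pass; the flexibility over integer quantization comes from the exponent adapting to each value's magnitude rather than a single per-tensor step size.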
FireFly: A High-Throughput and Reconfigurable Hardware Accelerator for Spiking Neural Networks
Spiking neural networks (SNNs) have been widely used due to their strong
biological interpretability and high energy efficiency. With the introduction
of the backpropagation algorithm and surrogate gradient, the structure of
spiking neural networks has become more complex, and the performance gap with
artificial neural networks has gradually decreased. However, most SNN hardware
implementations for field-programmable gate arrays (FPGAs) cannot meet
arithmetic or memory efficiency requirements, which significantly restricts the
development of SNNs. They either do not delve into the arithmetic operations between binary spikes and synaptic weights or assume unlimited on-chip RAM resources by using overly expensive devices for small tasks. To improve
arithmetic efficiency, we analyze the neural dynamics of spiking neurons,
generalize the SNN arithmetic operation to the multiplex-accumulate operation,
and propose a high-performance implementation of this operation by utilizing the DSP48E2 hard block in Xilinx UltraScale FPGAs. To improve memory efficiency, we design a memory system that enables efficient access to synaptic weights and membrane voltages with reasonable on-chip RAM consumption.
Combining the above two improvements, we propose an FPGA accelerator that can
process spikes generated by firing neurons on the fly (FireFly). FireFly is implemented on several FPGA edge devices with limited resources but still guarantees a peak performance of 5.53 TSOP/s at 300 MHz. As a lightweight accelerator, FireFly achieves the highest computational density efficiency compared with existing research using large FPGA devices.
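To illustrate the multiplex-accumulate idea, the following is a minimal Python model of how binary spikes reduce the usual multiply-accumulate to a select-and-add, wrapped in a simplified leaky integrate-and-fire update. The function names, threshold, leak, and reset rule are assumptions for illustration, not FireFly's actual RTL or neuron model.

import numpy as np

def multiplex_accumulate(spikes, weights):
    # Because spikes are binary, each multiply in a conventional MAC
    # degenerates into a 2:1 multiplex (weight or zero) followed by an
    # add -- the pattern the paper maps onto DSP48E2 hard blocks.
    # Pure-Python model for illustration only.
    acc = np.zeros(weights.shape[0], dtype=np.int32)
    for j, s in enumerate(spikes):
        acc += np.where(s, weights[:, j], 0)  # spike bit selects the weight
    return acc

def lif_step(v, spikes_in, weights, v_th=64, leak=1):
    # One simplified leaky integrate-and-fire step; v_th, leak, and the
    # reset-to-zero rule are assumed values, not taken from the paper.
    v = v - leak + multiplex_accumulate(spikes_in, weights)
    fired = v >= v_th
    v = np.where(fired, 0, v)
    return v, fired.astype(np.uint8)

# Example: 8 output neurons, 16 inputs, random int8 weights and spikes.
rng = np.random.default_rng(0)
weights = rng.integers(-8, 8, size=(8, 16), dtype=np.int8)
spikes = rng.integers(0, 2, size=16, dtype=np.uint8)
v, out_spikes = lif_step(np.zeros(8, dtype=np.int32), spikes, weights)

The key observation is that no multiplier is needed: the spike bit drives a selector in front of an adder, which is what makes a hard DSP block an efficient substrate for this operation.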