2 research outputs found
Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis
Data movement is the dominating factor affecting performance and energy in
modern computing systems. Consequently, many algorithms have been developed to
minimize the number of I/O operations for common computing patterns. Matrix
multiplication is no exception, and lower bounds have been proven and
implemented both for shared and distributed memory systems. Reconfigurable
hardware platforms are a lucrative target for I/O minimizing algorithms, as
they offer full control of memory accesses to the programmer. While bounds
developed in the context of fixed architectures still apply to these platforms,
the spatially distributed nature of their computational and memory resources
requires a decentralized approach to optimize algorithms for maximum hardware
utilization. We present a model to optimize matrix multiplication for FPGA
platforms, simultaneously targeting maximum performance and minimum off-chip
data movement, within constraints set by the hardware. We map the model to a
concrete architecture using a high-level synthesis tool, maintaining a high
level of abstraction, allowing us to support arbitrary data types, and enables
maintainability and portability across FPGA devices. Kernels generated from our
architecture are shown to offer competitive performance in practice, scaling
with both compute and memory resources. We offer our design as an open source
project to encourage the open development of linear algebra and I/O minimizing
algorithms on reconfigurable hardware platforms
FireFly v2: Advancing Hardware Support for High-Performance Spiking Neural Network with a Spatiotemporal FPGA Accelerator
Spiking Neural Networks (SNNs) are expected to be a promising alternative to
Artificial Neural Networks (ANNs) due to their strong biological
interpretability and high energy efficiency. Specialized SNN hardware offers
clear advantages over general-purpose devices in terms of power and
performance. However, there's still room to advance hardware support for
state-of-the-art (SOTA) SNN algorithms and improve computation and memory
efficiency. As a further step in supporting high-performance SNNs on
specialized hardware, we introduce FireFly v2, an FPGA SNN accelerator that can
address the issue of non-spike operation in current SOTA SNN algorithms, which
presents an obstacle in the end-to-end deployment onto existing SNN hardware.
To more effectively align with the SNN characteristics, we design a
spatiotemporal dataflow that allows four dimensions of parallelism and
eliminates the need for membrane potential storage, enabling on-the-fly spike
processing and spike generation. To further improve hardware acceleration
performance, we develop a high-performance spike computing engine as a backend
based on a systolic array operating at 500-600MHz. To the best of our
knowledge, FireFly v2 achieves the highest clock frequency among all FPGA-based
implementations. Furthermore, it stands as the first SNN accelerator capable of
supporting non-spike operations, which are commonly used in advanced SNN
algorithms. FireFly v2 has doubled the throughput and DSP efficiency when
compared to our previous version of FireFly and it exhibits 1.33 times the DSP
efficiency and 1.42 times the power efficiency compared to the current most
advanced FPGA accelerators