Managing HBM’s bandwidth in Multi-Die FPGAs using Overlay NoCs
We can improve HBM bandwidth distribution and utilization on a multi-die FPGA such as the
Xilinx Alveo U280 by using overlay Networks-on-Chip (NoCs). The HBM in the Xilinx Alveo U280
offers 8 GB of memory capacity with a theoretical maximum bandwidth of 460 GB/s, but
all thirty-two HBM ports in the Xilinx Alveo U280 are exposed to the FPGA fabric in
only one die. As a result, processing elements assigned to the other dies must use the scarce
and hard-to-use Super Long Lines (SLLs) to access the HBM’s bandwidth.
Furthermore, the HBM is divided internally into thirty-two smaller memories called pseudo
channels. They are connected by a hardened but flawed crossbar, which enables
global accesses from any of the HBM ports but introduces several throughput bottlenecks,
degrading the achievable throughput when the entire memory space is used.
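
The figures above imply a per-pseudo-channel share of capacity and bandwidth, which the following minimal Python sketch works out. It is illustrative only and not taken from the work itself. It assumes that "8 GB" means 8 GiB, that the address space is split contiguously and evenly across the thirty-two pseudo channels, and that port i attaches directly to pseudo channel i. The helper names `pseudo_channel_of` and `is_global_access` are hypothetical.

```python
# Back-of-the-envelope model of the HBM figures quoted in the abstract.
# Assumptions (not from the source): 8 GB is treated as 8 GiB, addresses
# are mapped contiguously and evenly to pseudo channels, and port i is
# attached directly to pseudo channel i.

HBM_CAPACITY_BYTES = 8 * 2**30   # 8 GiB total capacity (assumed binary units)
HBM_PEAK_GB_PER_S = 460.0        # theoretical maximum bandwidth
NUM_PSEUDO_CHANNELS = 32         # also the number of HBM ports


def per_channel_capacity_bytes() -> int:
    """Capacity of one pseudo channel under an even split."""
    return HBM_CAPACITY_BYTES // NUM_PSEUDO_CHANNELS


def per_channel_bandwidth_gb_per_s() -> float:
    """Peak bandwidth share of one pseudo channel under an even split."""
    return HBM_PEAK_GB_PER_S / NUM_PSEUDO_CHANNELS


def pseudo_channel_of(addr: int) -> int:
    """Pseudo channel holding a byte address (hypothetical contiguous map)."""
    if not 0 <= addr < HBM_CAPACITY_BYTES:
        raise ValueError("address outside the HBM address space")
    return addr // per_channel_capacity_bytes()


def is_global_access(port: int, addr: int) -> bool:
    """True if the access must cross the internal crossbar, i.e. the
    target pseudo channel is not the one attached to the issuing port."""
    return pseudo_channel_of(addr) != port


if __name__ == "__main__":
    print(per_channel_capacity_bytes() // 2**20, "MiB per pseudo channel")
    print(per_channel_bandwidth_gb_per_s(), "GB/s per pseudo channel")
    print(is_global_access(port=0, addr=2**28))
```

Under these assumptions, any access whose address falls outside the issuing port's own 256 MiB region is a global access and is therefore exposed to the crossbar bottlenecks described above.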
An overlay Hybrid NoC that combines the features of Hoplite and Butterfly Fat Tree
(BFT) NoCs offers a high-frequency solution for distributing the HBM’s bandwidth across all
three dies, as well as for overcoming the throughput bottleneck introduced by the internal
crossbar. The Hybrid NoC combines multiple high-frequency Ring NoCs for inter-die
communication with Butterfly Fat Tree NoCs for intra-die communication. In addition, the
routing capability of the NoC can be modified to supplant the HBM’s internal crossbar
for global accesses. We demonstrate this on the Xilinx Alveo U280 using synthetic benchmarks
and two application-based benchmarks, dense matrix-matrix multiplication (DMM) and
sparse matrix-vector multiplication (SpMV). Our experiments show that NoCs can improve
throughput utilization by as much as 8.6× for single-flit global accesses, 1.7× for
multi-flit global accesses with burst length 16, and as much as 1.4× for the SpMV benchmark.