Medusa: A Scalable Interconnect for Many-Port DNN Accelerators and Wide
  DRAM Controller Interfaces by Shen, Yongming et al.
Yongming Shen
Stony Brook University
yoshen@cs.stonybrook.edu
Tianchu Ji
Stony Brook University
tianchu.ji@stonybrook.edu
Michael Ferdman
Stony Brook University
mferdman@cs.stonybrook.edu
Peter Milder
Stony Brook University
peter.milder@stonybrook.edu
Abstract—To cope with the increasing demand and computa-
tional intensity of deep neural networks (DNNs), industry and
academia have turned to accelerator technologies. In particular,
FPGAs have been shown to provide a good balance between
performance and energy efficiency for accelerating DNNs. While
significant research has focused on how to build efficient layer
processors, the computational building blocks of DNN acceler-
ators, relatively little attention has been paid to the on-chip
interconnects that sit between the layer processors and the
FPGA’s DRAM controller.
We observe a disparity between DNN accelerator interfaces,
which tend to comprise many narrow ports, and FPGA DRAM
controller interfaces, which tend to be wide buses. This mismatch
causes traditional interconnects to consume significant FPGA
resources. To address this problem, we designed Medusa: an
optimized FPGA memory interconnect which transposes data in
the interconnect fabric, tailoring the interconnect to the needs
of DNN layer processors. Compared to a traditional FPGA
interconnect, our design reduces LUT and FF use by 4.7x
and 6.0x, respectively, and improves frequency by 1.8x.
I. INTRODUCTION
Deep neural networks (DNNs) [1], [2], [3], [4] are used to
solve challenging machine learning problems. However, CPUs
are failing to meet the high computational demand of DNNs.
GPUs provide sufficient performance, but are limited by their
high power consumption. In contrast, research has shown that
FPGAs strike a good balance between performance and energy
efficiency for accelerating DNNs.
A DNN comprises a pipeline of computing layers (3D
convolution, sub-sampling, nonlinear activation, etc.). Cor-
respondingly, an FPGA-based DNN accelerator comprises
one or more layer processors, where each is specialized for
computing one or more layers of the target DNN [5], [6],
[7], [8], [9]. For large DNNs, DRAM is needed to store DNN
parameters and layer inputs and outputs. Prior work has shown
that DNN computation is highly bandwidth intensive [9], [10].
It is thus essential for the layer processors to fully utilize the
available DRAM bandwidth. However, there exists a mismatch
between the interface of an FPGA DRAM controller and the
layer processors. The nature of FPGAs tends to restrict the
frequency of layer processors, which results in the DRAM
controller using a wide interface to expose the full DRAM
bandwidth to the layer processors (512 bits for a single DDR3
channel). On the other hand, many state-of-the-art FPGA-based
DNN accelerators [8], [9] assume the availability of
many narrow read and write ports (8 or 16 bits), each with
independent DRAM access. This is because narrow ports offer
the most flexibility in optimizing the layer processors for the
target DNN [8]. As such, a memory interconnect must be used
to multiplex the wide DRAM controller interface to a large
number of narrow read and write ports, while maintaining
maximum bandwidth efficiency.
A memory interconnect performs data transfer as well as
request arbitration. The challenge of multiplexing a wide
DRAM controller interface lies in data transfer, which will be
our focus. Mainstream designs of memory interconnects [11],
[12] use a 1-to-N crossbar to multiplex the wide DRAM
controller interface to N narrower ports. The crossbar needs to
have the same width as the DRAM controller to ensure that the
memory bandwidth is fully utilized. Each of the N endpoints
of the crossbar must then connect to a FIFO to buffer burst
transfers and a data-width converter to present a narrow port
to the DNN accelerator. While straightforward, such designs
are severely over-provisioned: the wide crossbar allows the
full DRAM bandwidth to be directed to any narrow port on
any cycle, but each narrow port only uses a fraction of the
full bandwidth. This excessive flexibility of the interconnect
consumes significant logic and wiring resources that can
otherwise be used by the DNN accelerator.
To overcome this over-provisioning, a memory interconnect
should be optimized to take advantage of the data transfer
characteristics of DNN layer processors. In this regard, we
make two critical observations. First, the narrow ports used by
layer processors are all of the same width, and are all expected
to be able to supply one word per cycle. This means that
DRAM bandwidth should be statically and evenly partitioned
across the narrow ports. Second, a layer processor knows its
access pattern and can perform perfect prefetch for future data
access, which means that a moderate latency increase in the
memory interconnect will not affect system performance.
Based on our observations, we designed Medusa, a resource-
efficient, performant, and scalable memory interconnect. In
our design, the crossbar, FIFOs, and data-width converters are
replaced with a transposition unit. Within the transposition
unit, a shifter replaces the crossbar and data-width converters,
resulting in significant logic simplification. Moreover, instead
of a shallow FIFO per port, the transposition unit uses a
deep shared buffer. This allows BRAMs to be efficiently used
for buffering, freeing up LUTs and wires for other uses.
Importantly, with only a minor constant latency increase,
Medusa guarantees the same data transfer characteristics as
the traditional interconnects, and can be used as a drop-in
replacement without changing the layer processor or memory
request arbiter design.
[Figure 1 diagram: a 512-bit DRAM controller interface feeds a demux, which drives 32 FIFOs (512 bits wide), each followed by a data width converter (512 bits to 16 bits) that presents a 16-bit port to the DNN accelerator.]
Fig. 1. The baseline memory read data transfer network.
arXiv:1807.04013v1 [cs.AR] 11 Jul 2018
Compared to a traditional interconnect, Medusa multiplexes
a 512-bit DRAM controller interface across 32 16-bit read
ports and 32 16-bit write ports using 4.7x fewer LUTs and 6.0x
fewer FFs, while also improving frequency by 1.8x. For a 1024-bit
DRAM controller interface, Medusa runs at 225MHz, while
routing congestion limits traditional designs to under 25MHz.
II. TRADITIONAL MEMORY INTERCONNECTS
In this section, we present a baseline memory interconnect
which is representative of existing designs [11], [12], and we
qualitatively discuss its scalability challenges.
A. Baseline Data Transfer Logic
The baseline interconnect’s data transfer logic has two parts:
one for memory read, and the other for memory write.
1) Baseline Memory Read Data Transfer: Figure 1 shows
the baseline memory read data transfer network. In this exam-
ple, 32 16-bit accelerator read ports share access to one 512-bit
DRAM controller interface. The design uses a 1-to-N demux
to route input data from the memory controller to N FIFOs,
where each FIFO has the same width as the memory interface.
This means that the demux can accept a new input from the
memory controller on every cycle, allowing the maximum
memory bandwidth to be consumed. Each FIFO is provisioned
to be large enough to hold the largest burst that a narrow read
port can request, so that burst transfer to a single narrow port
does not create back pressure. The output of each FIFO is
connected to a data width converter, which converts data from
the memory interface width to the narrow read port width.
2) Baseline Memory Write Data Transfer: Figure 2 shows
the baseline memory write data transfer network, which is
similar to the read data transfer network, except the data words
flow in the opposite direction. Each of the N accelerator write
ports feeds into a data width converter, then into a FIFO. Each
FIFO has the same width as the memory controller and can
hold the maximum burst from a port. On each cycle, an N-to-1
mux chooses the output from one of the FIFOs to write to the
memory controller. By using FIFOs to accumulate complete
bursts of data, data from the same burst can be sent to the
memory controller using the full bandwidth of the memory
controller interface.
Fig. 2. The baseline memory write data transfer network.
B. Baseline Logic Complexity
To better understand the baseline design, we perform an
analysis of its logic complexity. Given a memory controller
interface data width of Wline bits and a total of N read (or
write) ports on the accelerator(s), each port must have
Wacc = Wline/N bits for the complete memory bandwidth to
be utilized. For simplicity, this analysis assumes both Wline
and N to be powers of 2.
For the baseline read data transfer network, the main logic
resource use comes from the data width converters and can be
measured in terms of the number of 2-to-1 single bit muxes.
Each converter needs to perform an N-to-1 mux of width Wacc
(or a shift of Wacc bits); either method will have a cost of
Wacc × (N − 1) 2-to-1 single bit muxes. With N read ports, the
total cost is Wacc × (N − 1) × N = Wline × (N − 1) 2-to-1 muxes.
In other words, the complexity is O(Bandwidth × NumPorts).
For the baseline write data transfer network, most of the
logic is used for the N-to-1 mux of width Wline, so the total
resource use complexity is Wline × (N − 1), equal to that of the
read data transfer network.
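The mux counts above can be checked with a short calculation. The following is only an illustrative Python sketch (the function names are ours and are not part of the Bluespec implementation); it evaluates the formulas from this section for the 512-bit, 32-port configuration.

```python
def converter_mux_count(w_line: int, n_ports: int) -> int:
    """2-to-1 one-bit muxes in one data width converter: Wacc * (N - 1)."""
    w_acc = w_line // n_ports
    return w_acc * (n_ports - 1)


def baseline_read_mux_count(w_line: int, n_ports: int) -> int:
    """N converters at Wacc * (N - 1) each, i.e. Wline * (N - 1) in total."""
    return n_ports * converter_mux_count(w_line, n_ports)


def baseline_write_mux_count(w_line: int, n_ports: int) -> int:
    """One N-to-1 mux of width Wline: (N - 1) 2-to-1 muxes per bit."""
    return w_line * (n_ports - 1)


if __name__ == "__main__":
    w_line, n_ports = 512, 32  # 512-bit interface shared by 32 16-bit ports
    print(baseline_read_mux_count(w_line, n_ports))   # 15872
    print(baseline_write_mux_count(w_line, n_ports))  # 15872
```

These are counts of idealized 2-to-1 muxes, not LUTs, so they indicate growth trends rather than post-synthesis resource use.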
C. Baseline Scalability Problems
Although the baseline presents a straightforward solution
to allow full DRAM bandwidth usage and to eliminate any
data switching conflicts among narrow memory ports, its
wide demux and mux are over-provisioned in terms of their
connectivity. For example, the demux used in the read network
has the ability to direct all of the read bandwidth to any
of the read ports on any cycle. Such flexibility is useful in
applications where the partitioning of memory bandwidth to
read ports needs to change over time. However, in the context
of DNN accelerators, the memory read bandwidth is expected
to simply be evenly divided among all the read ports [8]. As
such, the extra flexibility of the wide demux only incurs wasted
logic (the muxes of the data width converters) and wiring
resources. The write network incurs analogous resource waste.
(a) Medusa’s memory read data transfer network
(b) Medusa’s memory write data transfer network
Fig. 3. High-level view of the Medusa memory interconnect. Controller modules keep track of data and space availability in buffers. Buffers next to the
DNN accelerator are double buffered. A line from the DRAM controller is 512 bits wide (Wline). Each port of the DNN accelerator is 16 bits wide (Wacc = 16 bits).
Moreover, the combination of wide and shallow FIFOs leads
to inefficient use of FPGA resources. Implementing the shal-
low FIFOs using BRAMs wastes BRAM capacity, while using
LUTRAM consumes a large amount of logic. Additionally,
a large number of buses (as wide as the DRAM controller
interface) is widely distributed within this design. Handling
wide buses introduces challenges with FPGA routing, greatly
limiting the peak clock frequency when scaling to wider
memory interfaces.
III. MEDUSA: AN OPTIMIZED MEMORY INTERCONNECT
We propose a scalable high performance memory inter-
connect which is based on data transposition. Figures 3a
and 3b provide a high-level overview of the interconnect. Both
memory read and write use two data buffers, a rotation unit,
and control logic.
Our design evenly partitions the DRAM bandwidth to each
port of the DNN accelerator by transposing data instead
of routing it with wide demuxes and muxes, thus reducing
FPGA resource and routing complexity, without compromising
DRAM bandwidth utilization.
A. Bandwidth Partitioning Through Transposition
Here we provide detailed descriptions of how transposition
is used for memory read and write.
1) Transposition for Memory Read: Figure 4 shows an
example of transposition for memory read. Each memory line
is Wline = 64 bits, each accelerator port is Wacc = 16 bits wide,
and N = Wline/Wacc = 4 accelerator ports are used. We mark
each data word with coordinates (x,y), where x represents the
word’s destination accelerator port, and y is the word’s index
within its containing memory line. Words in the same memory
line are always destined to the same accelerator port, and are
sent to the destination port in increasing index order. Each
Wline-bit memory line is stored across the input buffer banks
(seen at the bottom of the figure). Specifically, words that are
destined to accelerator port i are stored in address i of each
of the input buffer banks.
Transposition is performed by reading data words from the
input buffer, rotating them, and storing them in appropriate
locations in the output buffer. First, at cycle c, words along
the diagonal (0, c mod N) to (N − 1, (c + N − 1) mod N) are
read. For example, Figure 4a shows c = 0, where words
(0,0), ..., (3,3) are read, and Figure 4b shows c = 1, where
words (0,1), ..., (3,0) are read. The rotation unit then takes
these N words and rotates them to the left by c mod N
locations. For example, Figure 4c shows that during c = 3,
the words are rotated 3 positions to the left. Lastly, the
output buffer stores the words into transposed locations: on
cycle c, bank i will store data into address (i + c) mod N.
The transposition completes in N cycles, after which each
accelerator port can read from its corresponding output buffer
bank.
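The read transposition can be modeled in software. The Python sketch below is illustrative only and is not the Bluespec implementation. It assumes that input bank j reads address (j − c) mod N on cycle c; this read address is inferred from the storage rule and the diagonal described above rather than stated explicitly in the text.

```python
def transpose_read(input_banks):
    """Simulate the N-cycle read transposition.

    input_banks[j][a] is the word at address a of input bank j. A memory
    line destined to port x occupies address x of every bank, so
    input_banks[y][x] holds word (x, y).
    Returns output_banks, where output_banks[i][a] is address a of bank i.
    """
    n = len(input_banks)
    output_banks = [[None] * n for _ in range(n)]
    for c in range(n):
        # Assumed read address: bank j reads address (j - c) mod n, which
        # selects the diagonal (0, c mod n) ... (n - 1, (c + n - 1) mod n).
        read = [input_banks[j][(j - c) % n] for j in range(n)]
        # Rotate left by c mod n positions.
        k = c % n
        rotated = read[k:] + read[:k]
        # Output bank i stores its word at address (i + c) mod n.
        for i in range(n):
            output_banks[i][(i + c) % n] = rotated[i]
    return output_banks


if __name__ == "__main__":
    n = 4
    # Word (x, y): destination port x, index y within its memory line.
    input_banks = [[(x, y) for x in range(n)] for y in range(n)]
    for i, bank in enumerate(transpose_read(input_banks)):
        print(f"output bank {i}: {bank}")
```

After N cycles, output bank i holds words (i, 0) through (i, N − 1) in address order, which is the layout each accelerator port reads from.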
2) Transposition for Memory Write: Memory writes are
performed similarly, but with data flowing in the opposite
direction, as shown in Figure 3b. Each accelerator port writes
data words into its own bank of the input buffer. The intercon-
nect then transposes input buffer banks to rows in the output
buffer.
For both memory read and write, the interconnect is capable
of processing one Wline-bit line per cycle, as all parts (the
rotation unit, input buffer read/write, output buffer read/write)
operate on Wline-bit data in parallel. Therefore the system can
deliver the full bandwidth of the DRAM controller interface
to the accelerator ports. Furthermore, the bandwidth is evenly
partitioned across the ports, matching the accelerator’s
requirements.
[Figure 4 panels, each showing the input buffer (banks 0 to 3, with per-bank read pointers), the rotation unit, and the output buffer (banks 0 to 3) feeding the accelerator over 16-bit ports: (a) Cycle 0. (b) Cycle 1. (c) Cycle 3 with transposition result.]
Fig. 4. A detailed transposition example for memory read.
B. Rotation Unit Design
The data rotation unit takes N values of Wacc bits each and
left-rotates them in increments of Wacc bits (rotating by Wacc × c
bits in cycle c). Figure 5 shows an example rotation unit with
N = 8 ports. This unit, using a barrel shifter structure, passes
data through log2(N) levels of logic, where level ℓ is capable
of rotating the word by the bit length of 2^ℓ words. Stage ℓ
is controlled by bit ℓ of the binary encoding of the desired
rotation amount, where logic-1 indicates that the stage should
rotate. Data rotation can either be performed in a single cycle
or be pipelined, depending on the frequency requirements.
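The staged structure can be illustrated with a small behavioral model. This Python sketch is ours, not the Bluespec source. It treats each word as a list element and applies one conditional rotation per stage, so it models the single-cycle (unpipelined) behavior only.

```python
def rotate_left_staged(words, amount):
    """Left-rotate N words (N a power of two) through log2(N) stages.

    Stage l rotates by 2**l word positions when bit l of `amount` is 1
    and passes the data through unchanged otherwise.
    """
    n = len(words)
    assert n > 0 and n & (n - 1) == 0, "N must be a power of two"
    stages = n.bit_length() - 1  # log2(N)
    data = list(words)
    for level in range(stages):
        if (amount >> level) & 1:
            shift = 1 << level
            data = data[shift:] + data[:shift]
    return data


if __name__ == "__main__":
    words = list("ABCDEFGH")  # N = 8 ports, so 3 stages and 3 control bits
    for c in range(8):
        print(c, "".join(rotate_left_staged(words, c)))
```

Because only the low log2(N) bits of the rotation amount are examined, passing the cycle count c directly gives a rotation by c mod N.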
[Figure 5 diagram: eight inputs, eight outputs, and 3 rotation control bits.]
Fig. 5. An example data rotation unit for supporting eight ports.
C. Support for Burst Transfer
Support for burst data transfers is necessary to utilize the
bandwidth available from the DRAM controller.
1) Burst Transfer for Memory Read: A request can generate
a burst of line transfers to its port. Therefore, the input buffer
must be large enough to accommodate at least one burst per
port. In other words, the input buffer capacity must be at least
MaxBurstLen × N lines, with N being the number of ports. For each
port, head and tail pointers are maintained to track its input
buffer space. In each cycle, only the lines at the head pointers
participate in rotation. A head pointer is incremented when
the line it points to has finished transposition. Tail pointers
control where incoming memory lines are written.
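One way to picture the pointer bookkeeping is as a circular queue per port. The Python sketch below is a hypothetical model: the class name, the occupancy counter, and the `can_accept` check are our assumptions for illustration, and the text does not specify how the control logic is actually built.

```python
class PortRegion:
    """Per-port slice of the input buffer, modeled as a circular queue.

    Each port reserves `max_burst_len` lines, so the whole input buffer
    holds max_burst_len * n_ports lines.
    """

    def __init__(self, max_burst_len):
        self.capacity = max_burst_len
        self.head = 0   # line currently taking part in transposition
        self.tail = 0   # slot for the next incoming memory line
        self.count = 0  # occupied lines (assumed bookkeeping)

    def can_accept(self, burst_len):
        """True if a burst of `burst_len` lines fits in the free space."""
        return self.capacity - self.count >= burst_len

    def push_line(self):
        """Write one incoming memory line at the tail; return its slot."""
        assert self.count < self.capacity, "region full"
        slot = self.tail
        self.tail = (self.tail + 1) % self.capacity
        self.count += 1
        return slot

    def pop_line(self):
        """Advance the head once its line has finished transposition."""
        assert self.count > 0, "region empty"
        slot = self.head
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        return slot


if __name__ == "__main__":
    regions = [PortRegion(max_burst_len=32) for _ in range(32)]
    print(sum(r.capacity for r in regions))  # total lines: MaxBurstLen * N
```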
2) Burst Transfer for Memory Write: The output buffer
capacity must be at least MaxBurstLen × N lines. Similar to the case
of memory read, head and tail pointers are used to keep track
of the buffer space for each port. Notably, for memory write,
the request arbiter must monitor data coming from the write
ports, and only issue requests for ports that have accumulated
enough data in the output buffer to finish the write request.
This requirement also applies to the baseline interconnect.
D. Interconnect Scalability
The primary use of logic resources for our memory inter-
connect design comes from the data rotation unit. There are
log2(N) layers of muxes in the data rotation unit, and each
layer contains N 2-to-1 muxes of width Wacc. Overall, each
layer contains N × Wacc = Wline 2-to-1 one-bit muxes, and all
layers combined contain Wline × log2(N) 2-to-1 one-bit
muxes, which is a significant improvement over the baseline’s
cost of Wline × (N − 1) muxes.
Furthermore, the Medusa interconnect consolidates the shal-
low and wide FIFOs of the baseline design into large buffers
with deep and narrow banks, making them amenable to
efficient storage in BRAM.
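To make the difference in growth concrete, the two mux-count formulas can be compared across interface widths. This is an illustrative Python sketch with our own function names; it assumes 16-bit ports and power-of-two port counts, and it counts idealized 2-to-1 muxes rather than LUTs.

```python
def baseline_mux_count(w_line: int, n_ports: int) -> int:
    """Baseline cost: Wline * (N - 1) 2-to-1 one-bit muxes."""
    return w_line * (n_ports - 1)


def medusa_mux_count(w_line: int, n_ports: int) -> int:
    """Rotation unit cost: Wline * log2(N) 2-to-1 one-bit muxes."""
    assert n_ports > 0 and n_ports & (n_ports - 1) == 0, "N must be a power of two"
    stages = n_ports.bit_length() - 1  # log2(N)
    return w_line * stages


if __name__ == "__main__":
    w_acc = 16  # port width assumed throughout this sketch
    for w_line in (128, 256, 512, 1024):
        n_ports = w_line // w_acc
        base = baseline_mux_count(w_line, n_ports)
        medusa = medusa_mux_count(w_line, n_ports)
        print(f"Wline={w_line:4d} N={n_ports:2d} "
              f"baseline={base:6d} medusa={medusa:5d} ratio={base / medusa:.2f}")
```

The ratio is (N − 1)/log2(N), so the gap widens as the interface and port count grow.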
E. Latency Overhead
Compared to the baseline, our transposition-based design
has a constant latency overhead of Wline/Wacc cycles. This hap-
pens because a memory line can only be consumed after it has
been transposed. For a typical case, Wline/Wacc = 512/16 = 32.
In the context of DNN accelerators, this latency overhead
has a negligible impact on performance, because DNN layer
processors double buffer their inputs and perform perfect
prefetch of data into the idle buffers.
Note that, even for burst transfers, the latency overhead of
Medusa is still Wline/Wacc cycles. This is because as soon as
the head of a burst arrives, transposition can start.
F. Data Transfer Characteristics
In the example in Figure 4, the buffer has data available for
each port at the time when the transposition begins. However,
this is not a requirement of the design. The control logic starts
transposition for a port without waiting for the other ports, and
a port can join the transposition when transfers on the other
ports are already in progress. In other words, the transposition
design does not incur any interference among ports.
Overall, except for the constant latency overhead explained
in Section III-E, the data transfer characteristics of the Medusa
interconnect are identical to those of the baseline.
G. Handling Irregular Configurations
Thus far, our discussion assumed a power-of-two number
of read/write ports. However, this is not a requirement. When
the number of ports is not a power of two, unused ports are
either left unconnected (for output signals) or connected to
suitable constant values (for input signals), and synthesis and
place and route tools will perform suitable optimizations to
remove unused logic.
IV. EVALUATION
We compare the Medusa transposition-based interconnect
and the baseline interconnect by looking at their resource
use, performance, and scalability. Note that both interconnects
use the same request arbitration logic, hence our evaluation
focuses on the data transfer networks within the interconnects.
A. Methodology
We used Bluespec [13] to implement both interconnects.
The implementations are highly parameterized to allow easy
generation of various design points used in our experiments.
For all designs, we perform synthesis as well as place and
route (P&R) using Xilinx Vivado 2016.4, with the Virtex-7
690T FPGA as the target device.
An important aspect of our evaluation is finding the post-
P&R peak frequencies of different data transfer networks.
However, doing P&R for the data transfer networks alone
will not yield representative results. This is because within a
DNN accelerator, layer processors consume the most resources
and thus will have a significant impact on the results of
place and route. Since the DNN accelerator’s performance is
what ultimately matters, ignoring the P&R impact from layer
processors is unreasonable. As such, when running synthesis
and P&R for a memory interconnect, we also include into
the design a convolutional layer processor, which uses all the
narrow read/write ports of the interconnect. At a high level,
a convolutional layer processor consists of vector dot-product
units, input feature map buffers, output feature map buffers,
and weight buffers [14], [7], [8]. In our case, the number of
vector dot-product units is set differently for different exper-
iments. Each vector dot-product unit is 32-wide and operates
on vectors of 16-bit fixed point values. Correspondingly, the
narrow read/write ports of an interconnect are 16 bits wide.
Each vector dot-product unit uses 32 DSP slices to implement
its 32 multipliers. Input feature map buffers are 2260 deep,
output feature map buffers are 1792 deep, and weight buffers
are 9 deep. These buffer depths are chosen to be suitable
for VGGNet [1] and similar CNNs, and result in BRAM use
comparable to layer processors in existing works [7], [8].
To compare the resource use of different designs, we look
at the post-P&R consumption of the four main types of FPGA
resources: LUTs (look up tables), FFs (flip-flops), 18Kbit
BRAMs (block RAMs), and DSPs (DSP slices).
To compare the performance of different designs, we find
the post-P&R peak obtainable frequency of each design
(searching in steps of 25MHz). We used Vivado’s default syn-
thesis strategy with retiming turned on. For place and route, we
used the “performance explore plus post-route optimization”
strategy.
For the scalability evaluation, we vary the layer processor
and interconnect size, and observe the changes in peak fre-
quency.
To ease the implementation process, we excluded the mem-
ory controller and PCIe controller from our setup and replaced
them with stubs. The exclusion of these two components gives
equal benefit to the baseline designs and Medusa transposition-
based designs in terms of their area consumed and their ability
to reach higher frequencies. Importantly, because these two
components only use a small fraction of the resource of
a Virtex-7 690T and run in their own clock domains, the
frequency benefit from their exclusion is minor, and will not
affect the conclusions of our experiments.
B. Baseline Validation
Even though off-the-shelf IP cores can be used to implement
the baseline read and write interconnects, they often have
limitations that make them unsuitable for our use. For example, the
Xilinx AXI4-Stream Interconnect only supports up to 16 ports,
but we often require more (e.g., to consume a 512-bit DDR3
interface, we need 32 ports of 16 bits each).
To validate that our baseline implementation is resource-
efficient, we compare the post-synthesis resource use of our
baseline data transfer networks with equivalent networks built
from Xilinx AXI4-Stream IP cores. Xilinx AXI4-Stream IP
cores are used because they do not include the overhead of
AXI request arbitration, much like the data transfer networks
that we aim to evaluate. To enable this comparison, we choose
a configuration that fits within the limits of Xilinx AXI4-
Stream IP cores—specifically using a 256-bit wide memory
interface, multiplexed to 16 16-bit ports. We set the FIFO
depth to 32 words. As Table I shows, our baselines have
significantly lower cost than their Xilinx AXI4-Stream IP-based
counterparts. As such, our baseline designs represent
fair reference points with which to compare our Medusa
transposition-based designs.
TABLE I
BASELINE DATA TRANSFER NETWORKS VS. AXI4-STREAM NETWORKS
(1×256-BIT PORT TO 16×16-BIT PORTS. NO DSPS OR BRAMS ARE USED).
      Base (Read)     AXIS (Read)     Base (Write)    AXIS (Write)
LUT   5,313 (1.2%)    11,562 (2.7%)   6,810 (1.6%)    9,170 (2.1%)
FF    5,404 (0.6%)    27,173 (3.1%)   9,023 (1.0%)    26,554 (3.1%)
Note that the data transfer networks in Table I are relatively
small, and do not exhibit the resource consumption problems
we will see when scaling to larger networks. The following
sections investigate larger designs to show this problem.
C. Hardware Resource Usage
To evaluate the hardware resources required by Medusa,
we thoroughly evaluate a representative design point. The
design point we use includes a layer processor with 64 vector
dot-product units. We evaluate this layer processor coupled
with a memory interconnect, built using either the baseline or
the Medusa transposition-based data transfer networks. Each
memory interconnect multiplexes a 512-bit memory interface
to 32 16-bit read ports and 32 16-bit write ports; all ports
are used by the layer processor. For each read/write port, the
maximum burst that the memory interconnect can buffer is
32 × 512 bits. Overall, this setup represents a scenario where
the external memory is a single channel 800MHz DDR3,
which is common on FPGA boards. In such systems, the
memory controller runs in its own clock domain at 200MHz,
and exposes a 512-bit interface to the rest of the FPGA. The
number of DSP slices and BRAMs used by the layer
processor in this setup is representative of existing works [7],
[8], [9].
Table II shows the resource breakdown of the two designs.
For each design, we present the resource use of the whole
design, the read data-transfer network, and the write data-
transfer network in isolation. The percentages show resource
use relative to the capacity of a Virtex-7 690T.
First, we focus on the data transfer networks in isolation.
For memory read, compared to the baseline, the Medusa
transposition-based network reduces LUT use by 3.84x and
FF use by 4.04x, at a cost of 32 BRAMs. For memory write,
the Medusa transposition-based network reduces LUT use by
5.61x and FF use by 8.20x, also at a cost of 32 BRAMs.
Combined, the Medusa networks achieve 4.73x LUT and
6.02x FF savings, at a minor BRAM cost.
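The reduction factors quoted above follow directly from the post-P&R LUT and FF counts reported in Table II. As a cross-check, the following Python sketch (our own helper names, with the data copied from Table II) recomputes them.

```python
# Post-P&R LUT and FF counts of the data transfer networks, from Table II.
TABLE_II = {
    "baseline": {"read": {"lut": 18_168, "ff": 19_210},
                 "write": {"lut": 26_810, "ff": 35_451}},
    "medusa": {"read": {"lut": 4_733, "ff": 4_759},
               "write": {"lut": 4_777, "ff": 4_325}},
}


def reduction(network: str, resource: str) -> float:
    """Baseline-to-Medusa ratio for one network ("read" or "write")."""
    base = TABLE_II["baseline"][network][resource]
    medusa = TABLE_II["medusa"][network][resource]
    return base / medusa


def combined_reduction(resource: str) -> float:
    """Baseline-to-Medusa ratio with read and write networks summed."""
    base = sum(TABLE_II["baseline"][n][resource] for n in ("read", "write"))
    medusa = sum(TABLE_II["medusa"][n][resource] for n in ("read", "write"))
    return base / medusa


if __name__ == "__main__":
    for network in ("read", "write"):
        for resource in ("lut", "ff"):
            print(f"{network} {resource}: {reduction(network, resource):.2f}x")
    for resource in ("lut", "ff"):
        print(f"combined {resource}: {combined_reduction(resource):.2f}x")
```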
We next consider the entire design, including the layer
processor and memory interconnect (Total). The baseline uses
1.27x more LUTs and 1.23x more FFs than the Medusa
transposition-based design, whereas the transposition-based
design uses 1.09x more BRAM. This shows that the LUT and
FF savings achieved by the Medusa data transfer networks
are significant even in the context of a resource-heavy layer
processor. In the baseline design, the combined read and write
data-transfer networks account for 22.6% of the total LUT use
and 22.7% of the total FF use of the accelerator. The Medusa
transposition-based design reduces these to 6.1% and 4.7%,
respectively.
TABLE II
MEDUSA VS. BASELINE (FPGA RESOURCE USE).
Design    Component      LUT              FF               BRAM-18K     DSP
Baseline  Read Network   18,168 (4.2%)    19,210 (2.2%)    0 (0%)       0 (0%)
Baseline  Write Network  26,810 (6.2%)    35,451 (4.1%)    0 (0%)       0 (0%)
Baseline  Total          198,887 (45.9%)  240,449 (27.8%)  726 (24.7%)  2,048 (56.9%)
Medusa    Read Network   4,733 (1.1%)     4,759 (0.6%)     32 (1.1%)    0 (0%)
Medusa    Write Network  4,777 (1.1%)     4,325 (0.5%)     32 (1.1%)    0 (0%)
Medusa    Total          156,409 (36.1%)  195,158 (22.5%)  790 (26.9%)  2,048 (56.9%)
The Medusa network’s efficiency stems from its lower logic
complexity and its ability to make efficient use of BRAMs,
saving LUTs, FFs, and routing resources. The Medusa design
uses a total of 64 BRAMs to efficiently buffer data. In contrast,
if the baseline design were to use BRAMs in its data-transfer
networks, 960 BRAMs would be needed, making it a poor
trade-off with respect to the savings in FFs and LUTs. This
is because each 18-Kbit BRAM is 36 bits wide, and each
32x512-bit FIFO would consume 15 BRAMs, requiring a total
of 960 BRAMs for 32 memory-read FIFOs and 32 memory-
write FIFOs.
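The 960-BRAM figure can be reproduced with a short calculation. The Python sketch below uses our own function names and assumes that only the FIFO width determines the BRAM count, i.e. that the 32-entry depth fits within a single 36-bit-wide BRAM.

```python
def brams_for_fifo(fifo_width_bits: int, bram_width_bits: int = 36) -> int:
    """BRAMs placed side by side to cover the FIFO width (ceiling division)."""
    return -(-fifo_width_bits // bram_width_bits)


def baseline_fifo_brams(w_line: int, n_read_ports: int, n_write_ports: int) -> int:
    """One Wline-wide FIFO per read port and one per write port."""
    return brams_for_fifo(w_line) * (n_read_ports + n_write_ports)


if __name__ == "__main__":
    print(brams_for_fifo(512))               # 15
    print(baseline_fifo_brams(512, 32, 32))  # 960
```

Each such FIFO would store only 32 × 512 bits while occupying 15 BRAMs of 18 Kbit each, which is the capacity waste the paragraph above describes.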
D. Performance and Scalability
Besides saving logic resources, our Medusa transposition-
based design also critically saves on routing resources. The
compounded savings of logic and routing can lead to sig-
nificant improvements in an accelerator’s performance and
scalability. In this section, we evaluate this important quality of
our Medusa memory interconnect by scaling the size of both
the accelerator and the interconnect, and observing how the
baseline and Medusa change the reachable clock frequency.
Our experiment starts with a small layer processor with 16
vector dot-product units. The memory interconnect multiplexes
a 128-bit memory interface to 8 16-bit read ports and 8 16-
bit write ports. We perform place and route to find the peak
obtainable frequency of the accelerator, with a search step
of 25MHz, for both the baseline and for the Medusa-based
design. From this point, we go through several steps where we
scale up the accelerator’s size and number of memory ports,
and we repeat the peak frequency search. At each step, we
increase the number of vector dot-product units by 8, the
number of read ports by 4, and the number of write ports by
4. The width of the memory interface is always set to a power
of two, and is chosen to be wide enough to accommodate all
the read ports and write ports. For example, for (8,16] 16-bit
read ports, a 256-bit interface is needed, and for (16,32] read
ports, 512 bits are needed.
[Figure 6 plot: x-axis, Accelerator Size (DSP Slices), 0 to 2500; y-axis, Frequency (MHz), 0 to 300; series, Medusa and Baseline; regions, 128-bit, 256-bit, 512-bit, and 1024-bit memory interfaces.]
Fig. 6. Change in peak frequency as the accelerator scales.
Figure 6 shows the result of our experiment. The x-axis
shows the size of the accelerator, measured in DSP slices
(equal to the number of vector dot-product units times 32).
The y-axis indicates the maximum reachable frequency for
that point. Separate lines are used to indicate the accelerator
implemented using the Medusa interconnect and the baseline
interconnect; keep in mind that the only difference between
the two lines is the interconnect. Points at 0MHz indicate that
Vivado was not able to meet timing at 25MHz. Vertical dashed
lines are used to partition this figure into four regions, where
each region corresponds to a memory interface width (128-bit
through 1024-bit).
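The design points of the sweep can be enumerated from the scaling rules of this experiment. The following Python sketch is illustrative: the function names and the `steps` parameter are ours, and the number of steps actually run in the experiment is not stated in the text.

```python
def interface_width(n_ports: int, port_bits: int = 16) -> int:
    """Smallest power-of-two width that fits n_ports ports of port_bits each."""
    needed = n_ports * port_bits
    width = 1
    while width < needed:
        width *= 2
    return width


def sweep(steps: int):
    """Yield (dot_product_units, dsp_slices, ports_per_direction, width_bits).

    Starts at 16 units and 8 read (and 8 write) ports; each step adds
    8 units and 4 ports per direction. Each unit uses 32 DSP slices.
    """
    units, ports = 16, 8
    for _ in range(steps):
        yield units, units * 32, ports, interface_width(ports)
        units += 8
        ports += 4


if __name__ == "__main__":
    for units, dsps, ports, width in sweep(9):  # 9 steps chosen arbitrarily
        print(f"units={units:3d} DSPs={dsps:4d} ports={ports:2d} width={width}-bit")
```

Under these rules the 64-unit design has 2,048 DSP slices and 32 ports per direction on a 512-bit interface, which matches the design point of Table II.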
From Figure 6, we can see that starting from the point
with 1024 DSPs, Medusa designs always outperform baseline
designs. Furthermore, as the size of the accelerator increases,
the performance difference between Medusa and the baseline
increases, demonstrating the improved scalability of Medusa
designs.
Besides the general trend, there are several other interesting
things to note in Figure 6. First, within the 512-bit memory
interface region, which is the region that represents the popular
configuration of “FPGA + single channel DDR3,” Medusa
outperforms the baseline by up to 1.8x (the designs with
1280 DSPs and 2048 DSPs). Second, within the 1024-bit
memory interface region, which represents platforms with
higher memory bandwidth, the baseline is barely usable,
with some points failing to make timing even at 50MHz or
lower. Nonetheless, the Medusa designs with the same large
accelerators can keep running at 200 to 225MHz in this region.
Third, we note that although Medusa designs are most efficient
when the number of memory ports is a power of two, most of
the designs included in our experiment have a non-power-of-two
port count. In spite of this, Medusa still shows a clear benefit
over the baseline. Lastly, we note that the 2048-DSP points
in Figure 6 correspond to the designs whose resource use
metrics were evaluated in Table II. This demonstrates that the
Medusa designs simultaneously achieve both resource savings
and performance improvements.
V. RELATED WORK
Our work focuses on providing an efficient memory inter-
connect for DNN accelerators that require access to DRAM
through many narrow read and write ports. Some designs [15],
[16] avoid the need for such an interconnect by altering the
layout of data in DRAM. The main drawback of this approach
is its constraint on the data flow inside layer processors.
Specifically, the output layout of one layer must be compatible
with the input layout of the next, limiting how a layer
processor can perform its computation, which can lead to
underutilization of compute units [8].
Other designs [6], [17] avoid the width mismatch problem
by using narrow memory controller buses. When scaled up,
these designs will either be bottlenecked by DRAM band-
width, or face the interconnect conundrum which is addressed
by Medusa. In particular, [6] uses a multi-cast network for
distributing read data to the accelerators, and a mux-based
design for writing. Data being read from the memory interface
are broadcast to all ports with an ID attached. A port can
decide whether to accept the data by checking the ID. From a
resource use point of view, the multi-cast network is essentially
the read data network in our baseline; both it and the mux-based
write interconnect will have scalability problems similar to
those of the baseline designs considered in this work.
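The ID-based read distribution described above can be illustrated with a toy behavioral model. This is a sketch for intuition only, not the implementation in [6]; the `Port` class, the `broadcast` function, and the (tag, data) beat format are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class Port:
    """A read port that accepts only data tagged with its own ID."""
    port_id: int
    received: List[int] = field(default_factory=list)

    def observe(self, tag: int, data: int) -> None:
        # Every port sees every beat; it keeps a beat only on an ID match.
        if tag == self.port_id:
            self.received.append(data)


def broadcast(ports: List[Port], beats: Iterable[Tuple[int, int]]) -> None:
    """Present each (tag, data) beat from the memory interface to all ports."""
    for tag, data in beats:
        for port in ports:
            port.observe(tag, data)


if __name__ == "__main__":
    ports = [Port(i) for i in range(3)]
    broadcast(ports, [(0, 10), (2, 20), (0, 11), (1, 30)])
    for p in ports:
        print(p.port_id, p.received)
```

In this model every beat is presented to every port, which mirrors the observation that the multi-cast network resembles the baseline read data network from a resource use point of view.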
Yet other designs [18], [19] avoid the need for an advanced
memory interconnect by storing DNN model parameters as
well as intermediate data on chip, so as to reduce bandwidth
use and simplify the interaction between the accelerator and
DRAM. For such designs, the supported DNN size is limited
by on-chip storage size.
VI. CONCLUSIONS
This paper presented a resource-efficient, high-performance
memory interconnect for connecting many-port
DNN accelerators to wide DRAM controller interfaces. We
analyzed and experimented with commonly used mux/demux-based
interconnects, and concluded that they were over-provisioned
and had serious scalability limitations.
To address this problem, we tailored our design to the
needs of DNN accelerators and used a transposition unit to
implement memory bandwidth partitioning. Our design has
lower logic complexity and can efficiently use BRAMs to
reduce LUTRAM use. Experiments showed that, compared
to the baseline design, our Medusa design reduced LUT and
FF usage by 4.7x and 6.0x respectively, and improved peak
frequency by 1.8x.
ACKNOWLEDGMENT
This material is based on work supported by the National
Science Foundation (NSF) under Grant Nos. 1533739 and
1453460. The experiments were conducted with equipment
purchased through NSF CISE Research Infrastructure Grant
No. 1405641.
REFERENCES
[1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with con-
volutions,” in Proceedings of the 2015 IEEE Conference on Computer
Vision and Pattern Recognition, ser. CVPR ’15, 2015, pp. 1–9.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Proceedings of the 25th
International Conference on Neural Information Processing Systems, ser.
NIPS ’12, 2012, pp. 1097–1105.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” CoRR, vol. abs/1512.03385, 2015.
[5] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high per-
formance FPGA-based accelerator for large-scale convolutional neural
networks,” in Proceedings of the 26th International Conference on Field
Programmable Logic and Applications, ser. FPL ’16, 2016, pp. 1–9.
[6] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao,
A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models
to FPGAs,” in Proceedings of the 49th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO ’16, 2016, pp. 1–12.
[7] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 23rd ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, ser. FPGA ’15, 2015, pp. 161–170.
[8] Y. Shen, M. Ferdman, and P. Milder, “Maximizing CNN accelerator
efficiency through resource partitioning,” in Proceedings of the 44th
Annual International Symposium on Computer Architecture, ser. ISCA
’17, 2017, pp. 535–547.
[9] ——, “Escher: A CNN accelerator with flexible buffering to minimize
off-chip transfer,” in Proceedings of the 25th IEEE International Sympo-
sium on Field-Programmable Custom Computing Machines, ser. FCCM
’17, 2017, pp. 93–100.
[10] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN ac-
celerators,” in Proceedings of the 49th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO ’16, 2016, pp. 1–12.
[11] Xilinx, “AXI Interconnect v2.1,” 2017.
[12] Altera, “Qsys Interconnect,” 2013.
[13] R. Nikhil, “Bluespec System Verilog: Efficient, correct RTL from high
level specifications,” in Proceedings of 2nd ACM and IEEE Interna-
tional Conference on Formal Methods and Models for Co-Design, ser.
MEMOCODE ’04, 2004, pp. 69–70.
[14] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“DianNao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” in Proceedings of the 19th International Conference
on Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’14, 2014, pp. 269–284.
[15] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards
uniformed representation and acceleration for deep convolutional neural
networks,” in Proceedings of the 35th International Conference on
Computer-Aided Design, ser. ICCAD ’16, 2016, pp. 1–8.
[16] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang,
N. Xu, S. Song, Y. Wang, and H. Yang, “Going deeper with embedded
FPGA platform for convolutional neural network,” in Proceedings of
the 24th ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, ser. FPGA ’16, 2016, pp. 26–35.
[17] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for
energy-efficient dataflow for convolutional neural networks,” in Proceed-
ings of the 43rd International Symposium on Computer Architecture, ser.
ISCA ’16, 2016, pp. 367–379.
[18] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre,
and K. Vissers, “FINN: A framework for fast, scalable binarized neural
network inference,” in Proceedings of the 2017 ACM/SIGDA Interna-
tional Symposium on Field-Programmable Gate Arrays, ser. FPGA ’17,
2017, pp. 65–74.
[19] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo,
S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel,
A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M.
Caulfield, E. S. Chung, and D. Burger, “A configurable cloud-scale
DNN processor for real-time AI,” in Proceedings of the 45th Annual
International Symposium on Computer Architecture, ser. ISCA ’18,
2018.
