An Open-Source Platform for High-Performance Non-Coherent On-Chip
  Communication by Kurth, Andreas et al.
An Open-Source Platform for High-Performance
Non-Coherent On-Chip Communication
Andreas Kurth, Student Member, IEEE, Wolfgang Rönninger, Thomas Benz, Matheus Cavalcante, Student
Member, IEEE, Fabian Schuiki, Florian Zaruba, Student Member, IEEE, and Luca Benini, Fellow, IEEE
Abstract—On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain
importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of
research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the
needs of heterogeneous many-cores and accelerator-rich SoCs, which are not, or only partially, coherent, are a much less mature research
area. In this work, we present a modular, topology-agnostic, high-performance on-chip communication platform. The platform includes
components to build and link subnetworks with customizable bandwidth and concurrency properties and adheres to a state-of-the-art,
industry-standard protocol. We discuss microarchitectural trade-offs and timing/area characteristics of our modules and show that they can
be composed to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end on-chip communication fabrics (not only network
switches but also DMA engines and memory controllers) with high degrees of concurrency. We design and implement a state-of-the-art ML
training accelerator, where our communication fabric scales to 1024 cores on a die, providing 32 TB/s cross-sectional bandwidth at only
24 ns round-trip latency between any two cores.
1 INTRODUCTION
On-chip networks are the primary means of communication inside modern multi- and many-core processing
SoCs [1]. As the number of cores, the heterogeneity of
components, and the on- and off-chip bandwidth continue
to grow to meet ever higher application demands, on-chip
networks continue to gain importance. Decades of research
on on-chip networks were instrumental for breakthroughs in
scalability of homogeneous shared-memory multiprocessors,
and a continuation of this research is necessary to realize the
full potential of many-core accelerators and accelerator-rich
heterogeneous SoCs.
Ideally, SoC designers could compose on-chip networks
from a platform of components according to the requirements
of their application. The central design goals of such a
platform are: (G1) Elementary, modular components that can
implement any topology and that separate concerns such as
routing and buffering. (G2) Parametrizable components (e.g.,
data width, transaction concurrency) to cover a large design
space. (G3) Bridging components to connect heterogeneous
SoC elements (e.g., GPU SMs, DMA engines, and domain-
specific accelerators) and their subnetworks, each with unique,
application-driven latency and bandwidth requirements. (G4)
Compliance with an industry-standard protocol for extensibil-
ity, third-party compatibility, and verifiability. (G5) Detailed
characterization of the complexity and trade-offs of the
components in terms of performance vs. cost (area, power) to
guide design and optimization efforts.
Commercial offerings that meet (parts of) these goals exist
from multiple vendors (details in § 5), but their microarchi-
tecture, complexity, and performance are well-guarded trade
secrets. Research has also worked toward those goals (details
in § 5), but, to the best of our knowledge, an end-to-end
platform for non-coherent on-chip communication that meets
the needs of heterogeneous SoCs has not been presented yet in
open literature and is not available as open-source hardware.
In this work, we fill this gap with these contributions:
1) We present a modular, topology-agnostic (G1), high-
performance on-chip communication platform of
parametrizable components (G2) for a state-of-the-art,
industry-standard protocol (G4) (§ 2). The compo-
nents include bridges and converters to link sub-
networks with different bandwidth and concurrency
properties (G3). We publish the modules of our
platform, implemented in industry-standard Sys-
temVerilog, under a permissive open-source license
for research and industrial usage.
2) We discuss microarchitectural trade-offs and tim-
ing/area characteristics of the modules in our plat-
form (G5), both theoretically/asymptotically and with
topographical synthesis results (§ 3). We show that our
modules can be composed to build high-bandwidth
(e.g., 2.5 GHz and 1024 bit data width), end-to-end
on-chip communication fabrics (e.g., DMA engine to
memory controller), with high degrees of concurrency
(e.g., up to 256 independent concurrent transactions)
and flexibility (e.g., 64-bit subnetworks).
3) We design and implement (post-P&R) a state-of-
the-art many-core machine learning training (MLT)
accelerator in a modern 22 nm technology (§ 4), where
our communication fabric scales to 1024 cores on a
die, which deliver more than 2 Tdpflop/s, provid-
ing 32 TB/s cross-sectional bandwidth at only 24 ns
round-trip latency between any two cores.
We focus on non-coherent on-chip communication for two
main reasons: First, coherent on-chip communication in
homogeneous many-core processors has been studied ex-
tensively (see § 5 for an overview). Second, many complex
heterogeneous SoCs (e.g., mobile application SoCs [2], high-
speed networking SoCs [3]) and massively parallel data
processing architectures (e.g., GPGPUs [4]) are not or only
partially cache-coherent.
arXiv:2009.05334v1 [cs.AR] 11 Sep 2020
This paper is organized as follows: We present the architecture
of our on-chip communication platform in § 2
and characterize its performance and complexity in § 3. We
then use our platform to design, implement, and evaluate the
communication fabric of a state-of-the-art many-core MLT
accelerator in § 4. Finally, we compare with related work in
§ 5 and conclude in § 6.
2 ARCHITECTURE
Current on-chip communication is centered around the
premise of high-bandwidth point-to-point data transfers. To
fulfill this premise despite increasing point-to-point latency,
three central traits of current on-chip communication protocols
are: burst-based transactions, multiple outstanding transactions,
and transaction reordering. Our design targets these central
traits in general, so the concepts we present potentially
apply to a wide range of modern on-chip protocols. More
tangibly, we adhere to the latest revision (5) of the AMBA
Advanced eXtensible Interface (AXI) [5]. AXI is one of the
industry-dominant protocols and the only protocol with an
open, royalty-free specification and a widespread adoption in
current systems designed by many different companies. Other
protocols with similar properties are discussed in § 5.
Communication between a master port and a slave port is
structured into two directions (read and write), into channels
for commands, data, and responses, and into transfer items called
beats. A write transaction starts with one beat on the write
command channel followed by one or multiple beats on the
write data channel and ends with a single beat on the write
response channel. A read transaction starts with one beat on the
read command channel and ends with one or the last of multiple
beats on the read response channel. Each transaction has an
ID. IDs define the order of transactions and beats according
to the following rules: (O1) Inter-Transaction Ordering: Any
two transactions in the same direction and ID are ordered.
(O2) Response Ordering: Any two responses with the same
direction and ID must be in the same order as their commands.
(O3) Write Beat Ordering: Write data beats do not have
an ID and are therefore always ordered. Each channel has
multiple isodirectional payload signals and two signals for bi-
directional flow control. We focus on valid-ready flow control,
where the channel master drives valid and the payload and the
channel slave drives ready (but other flow control schemes, e.g.,
credit-based, are possible). A handshake occurs when valid and
ready are high on a rising clock edge. There are two essential
rules in this valid-ready flow control: (F1) Stability Rule: Once
valid is high, it and the payload must not change until after
the next handshake. (F2) Acyclicity Rule: The channel receiver
may depend on valid to be high before setting ready high, but
the channel sender may not depend on ready to be high before
setting valid high.
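The stability rule (F1) can be illustrated with a small trace checker. The following Python sketch is not part of the platform; the function name `check_valid_ready` and the trace format (one `(valid, ready, payload)` tuple per rising clock edge) are assumptions made for illustration only. It collects the payloads that were handshaked and the cycles in which (F1) was violated.

```python
def check_valid_ready(trace):
    """Collect handshakes and (F1) violations in a channel trace.

    `trace` is a list of (valid, ready, payload) tuples, one per rising
    clock edge (a hypothetical format chosen for this sketch).
    Returns (handshakes, violations): the payloads that were transferred
    and the cycle indices at which valid or the payload changed while a
    beat was still pending.
    """
    handshakes = []
    violations = []
    pending = None  # payload offered in the previous cycle, not yet accepted
    for cycle, (valid, ready, payload) in enumerate(trace):
        if pending is not None and (not valid or payload != pending):
            # (F1): once valid is high, valid and the payload must remain
            # stable until the handshake has occurred.
            violations.append(cycle)
        if valid and ready:
            handshakes.append(payload)  # handshake: valid and ready both high
            pending = None
        elif valid:
            pending = payload           # beat offered but not yet accepted
        else:
            pending = None
    return handshakes, violations
```

A trace in which the sender holds beat `"a"` until the receiver is ready yields no violations, whereas a sender that swaps the payload or withdraws valid before the handshake is flagged in the corresponding cycles.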
An overview of the modules in our on-chip communication
platform is given in Table 1. In this section, we discuss their
microarchitecture and design trade-offs, from elementary
components through all essential interconnecting modules to
endpoints of increasing complexity.
2.1 Elementary Components: Network (De)Multiplexers
Network multiplexers and demultiplexers are the elementary
components that join multiple ports to one and split one port
into multiple, respectively. In doing so, they must adhere to
the relations between the channels and to the ordering rules
(O1–3). They are obviously used to build network junctions
(e.g., crossbars), but they can be reused far beyond that because
they implement a central part of the communication protocol.
In fact, these elementary components are essential for almost
all modules of our platform.
Table 1. Overview of the modules in our on-chip communication platform.
Category                   Module                      Section
Elementary Components      Network Multiplexer         2.1.1
Elementary Components      Network Demultiplexer       2.1.2
Network Junctions          Crossbar                    2.2.1
Network Junctions          Crosspoint                  2.2.2
Concurrency Control        ID Remapper                 2.3.1
Concurrency Control        ID Serializer               2.3.2
Data Width Converters      Data Upsizer                2.4.1
Data Width Converters      Data Downsizer              2.4.2
Data Movement              DMA Engine                  2.5
On-Chip Memory Endpoints   Simplex Memory Controller   2.6.1
On-Chip Memory Endpoints   Duplex Memory Controller    2.6.2
On-Chip Memory Endpoints   Last Level Cache            2.7
Figure 1. Architecture of our multiplexer, drawn with two slave ports.
2.1.1 Network Multiplexer
The multiplexer, which connects multiple slave ports to one
master port, consists of multiplexing components for the
forward channels and demultiplexing components for the
backward channels. The complexity lies in demultiplexing
the backward channels, because the multiplexer needs the
information to which output a beat on a backward channel
must be routed. Multiplexing the command channels simply
requires the selection of a valid beat, with the restriction that a
selection must be stable once made (F1).
Our multiplexer architecture is shown in Fig. 1. We first
prepend the ID of each command beat with the ID of the
slave port. We then select among beats on the command
channels with round-robin (RR) arbitration trees. For writes,
the decision is forwarded through a first-in first-out buffer
(FIFO) to a multiplexer for the write data beats, which is
sufficient due to (O3). As commands out of our multiplexer
carry the input port information in the most significant
bits (MSBs) of their ID, routing responses is as simple as
demultiplexing based on the MSBs and then truncating
the ID to the original width. Another key advantage is
that transactions with the same ID from any two different
slave ports remain independent, so (O1) does not restrict
communication through our multiplexer. Note that stream
demultiplexing means the payload is the same for all demux
outputs and only the handshake signals are (de)multiplexed.
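The ID extension described above can be sketched in a few lines of Python. This is an illustrative model, not the SystemVerilog implementation; the function names `prepend_port` and `route_response` are hypothetical.

```python
def prepend_port(cmd_id, port, id_width):
    """Extend a command ID by placing the slave port index in the MSBs."""
    assert 0 <= cmd_id < (1 << id_width)
    return (port << id_width) | cmd_id


def route_response(resp_id, id_width):
    """Split a response ID into (slave port index, original ID).

    The MSBs select the slave port; truncating to `id_width` bits
    restores the ID that the slave port originally issued.
    """
    port = resp_id >> id_width
    original_id = resp_id & ((1 << id_width) - 1)
    return port, original_id
```

With a 3-bit ID, the command ID 5 issued at slave ports 0 and 1 leaves the multiplexer as 5 and 13, respectively, so the two transactions remain independent downstream, and each response finds its way back without any lookup table.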
Alternative multiplexer architectures could do without
extending the ID, for example by allowing only transactions
with different IDs concurrently or by remapping IDs internally.
Figure 2. Architecture of our demultiplexer, drawn with two master ports.
However, the former restricts communication, and the latter
significantly increases the complexity of the multiplexer.
Nonetheless, some network modules grow exponentially in
complexity with the ID width. We have a modular solution to
this challenge with the ID width converters discussed in § 2.3.
2.1.2 Network Demultiplexer
The demultiplexer, which connects one slave port to multiple
master ports, is more complex than the multiplexer due to the
ordering rules: When the demultiplexer gets two commands
with the same ID and direction (O1) that go to two different
master ports, it must deliver the corresponding responses
in the same order (O2). After the demultiplexer, however,
transactions on different master ports are independent, so
the demultiplexer cannot rely on the order of downstream
responses to fulfill (O2).
Our demultiplexer architecture, shown in Fig. 2, solves this
by enforcing that all concurrent transactions with the same
direction and ID target the same master port. For example,
when a write with ID A targets master port 0, it is only
forwarded if no writes with ID A to master ports other than 0
have outstanding responses; otherwise, the write must wait. To
track this information, the demultiplexer contains one counter
and one index register per ID and direction. Commands that
fulfill the aforementioned requirement increase the counter;
the (last) response decreases the counter. A stream register
between the write command channel and the demultiplexer
of the write data channel stores the master port index of an
ongoing write burst while the command channel is indepen-
dently handshaked (F1). Write commands and data bursts are
sent in lockstep due to (O3); without this restriction, the write
command and data channels could deadlock downstream.
The multiple read and write response channels are joined
through a round-robin arbitration tree.
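The counter and index register per ID can be modeled as follows for one direction. This Python sketch is a behavioral illustration under simplifying assumptions: the class name `DemuxIdTracker` is hypothetical, dictionaries stand in for the per-ID registers, and counter overflow is not modeled.

```python
class DemuxIdTracker:
    """Per-direction tracker: one counter and one index register per ID."""

    def __init__(self):
        self.count = {}  # ID -> number of outstanding transactions
        self.index = {}  # ID -> master port of the outstanding transactions

    def may_forward(self, txn_id, port):
        """A command may pass if no transaction with this ID is
        outstanding on a different master port."""
        return self.count.get(txn_id, 0) == 0 or self.index[txn_id] == port

    def on_command(self, txn_id, port):
        """Try to forward a command; returns False if it must wait."""
        if not self.may_forward(txn_id, port):
            return False
        self.index[txn_id] = port
        self.count[txn_id] = self.count.get(txn_id, 0) + 1
        return True

    def on_last_response(self, txn_id):
        """The (last) response of a transaction decrements the counter."""
        assert self.count.get(txn_id, 0) > 0, "response without command"
        self.count[txn_id] -= 1
```

In this model, a second write with ID A to master port 0 passes immediately, while a write with ID A to master port 1 is held back until all earlier responses for ID A have returned.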
Alternative demultiplexer architectures could do without
requiring all concurrent transactions with the same direction
and ID to target the same master port, for example by remap-
ping IDs internally. However, this significantly increases the
complexity of the demultiplexer, which would have to reorder
responses internally to fulfill (O2). Instead of introducing this
complexity, we let a master use different IDs for different
endpoints if it can handle out-of-order responses.
2.2 Network Junctions: Crossbars and Crosspoints
2.2.1 Crossbar
The elementary components in § 2.1 can be combined to form
a fully-connected crossbar, shown in Fig. 3, where each slave
port has a dedicated connection to each master port.
At each slave port, two address decoders (one for reads,
one for writes) drive the selection signals of a demultiplexer.
Figure 3. Architecture of our crossbar, drawn with two slave and three
master ports. Each fat arrow represents a five channel (see § 2) connection.
Components with dashed outline are optional.
In the standard configuration, all slave ports use the same
address range for one master port, but different configurations
would be possible. There are two alternatives for handling
transactions to an address range that is not defined in a
decoder. First, one master port can be defined as default port.
This is useful, for example, in a hierarchical topology where
each downlink has a specific address range and any address
outside the downlink addresses is sent to higher hierarchy
levels through the uplink. Second, one can instantiate an
error slave, which terminates all transactions with protocol-
compliant error responses. These two alternatives can be
selected per slave port with a synthesis parameter.
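The two alternatives for undecoded addresses can be sketched as a small decoder function. The rule encoding (a list of `(start, end, port)` tuples with an exclusive end), the function name `decode`, and the string `"error_slave"` are assumptions of this Python illustration, not the interface of the actual module.

```python
def decode(addr, rules, default_port=None):
    """Return the master port index for `addr`.

    `rules` is a list of (start, end, port) tuples with `end` exclusive
    (a hypothetical encoding chosen for this sketch). If no rule matches,
    the default port is used when one is configured; otherwise the string
    "error_slave" stands in for the error slave.
    """
    for start, end, port in rules:
        if start <= addr < end:
            return port
    return default_port if default_port is not None else "error_slave"
```

In a hierarchical topology, the downlinks would be listed in `rules` and the uplink would be passed as `default_port`.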
Optional pipeline registers can be inserted on all or some
of the five channels of each internal connection. These registers
cut all combinational signals (including handshake signals),
thereby adding a cycle of latency per channel and pipelining
the crossbar so its critical path is no longer than that of the
demultiplexer or multiplexer. These pipeline registers can
be added without risking deadlocks, but this is not trivial:
Of the four Coffman conditions [6], (1) Mutual Exclusion is
fulfilled on the write data channel after the multiplexer, (2)
Hold and Wait is fulfilled as each pipeline register must hold
its value once filled, (3) No Preemption is fulfilled by (O3) on
the write data channel, and (4) Circular Wait would be fulfilled
by round-robin arbitration of write command and data beats.
However, the demultiplexer breaks condition (4) by restricting
write commands to be issued in lockstep with write data bursts
(i.e., the next write command is only issued after the previous
write data burst has completed), thereby preventing deadlocks
despite pipeline registers, which introduce condition (2).
2.2.2 Crosspoint
As the multiplexers in the crossbar expand the ID width, the
master ports of the crossbar have a wider ID than the slave
ports. This prevents the direct use of our crossbar as nodes
in a regular on-chip network where each node (also called
“router”) has isomorphous slave and master ports. To solve
this problem, we introduce a crosspoint.
Our crosspoint, shown in Fig. 4, has two additional proper-
ties over the crossbar that make it better suited for composing
arbitrary regular on-chip topologies. First, it contains a crossbar
that is not necessarily fully connected: The connection between
any slave and master port can be omitted with a synthesis
parameter. This is useful to prevent routing loops when a
module has both a master and a slave port into the crosspoint,
and it minimizes the physical resources on links that would
be unused. Second, the crosspoint contains an ID remapper
(§ 2.3.1) on each master port, which reduces the ID width to
that of the slave ports. Thus, the slave and master ports of
each crosspoint are isomorphous.
Figure 4. Architecture of our crosspoint, drawn with four slave and master
ports. Each arrow represents a five channel (see § 2) connection.
Figure 5. Architecture of our ID remapper, drawn with up to four unique
concurrent IDs (per direction).
2.3 Concurrent Transactions: ID Width Converters
The ID of transactions is central to their ordering (O1–
2). Essentially, the commands and responses of any two
transactions can be independently reordered if they have
different IDs. This makes a high number of possible IDs
attractive to prevent bottlenecks due to ordering constraints.
However, tracking a high number of IDs is complex for
network components (e.g., demultiplexer §§ 2.1.2 and 3.1.2).
ID width converters are the on-chip network designer’s
instrument to balance the number of independent concurrent
transactions vs. circuit complexity. We focus on reducing the
ID width (as extending it is trivial). There are two first-order
parameters for ID reduction: the width of IDs at the output, O,
and the maximum number of unique IDs at the input, U. The
relation between O and U determines whether all transactions
that were independent at the input remain independent at
the output: If U ≤ 2^O, every unique ID at the input can be
represented by a unique ID at the output, therefore retaining
transaction independence. This means the sparsely used input
ID space can be ‘compressed’ to a narrower, densely used
output ID space by remapping IDs (§ 2.3.1). If U > 2^O, there
are not enough output IDs to represent all U unique IDs. This
means some transactions with originally different IDs will
have to be mapped to the same ID, thereby serializing them
(§ 2.3.2).
2.3.1 ID Remapper
Our ID remapper, shown in Fig. 5, remaps IDs with one table
per direction. The table has as many entries as there are unique
input IDs, and it is indexed by the output ID. Each table entry
has two fields: the input ID and a counter that records how
many transactions with the same ID are in flight. The counter
is incremented on command handshakes and decremented
on (last) response handshakes. The mapping from input to
output IDs is injective.
Figure 6. Architecture of our ID serializer, drawn with four master port
IDs (per direction).
Obtaining the input ID from an output
ID (to remap responses) is as simple as indexing the table.
Determining the output ID for an input ID (to remap requests)
requires a comparison of the input ID to all IDs in the table. If
the table currently contains an entry for the input ID, the same
output ID must be used (O1). If the table does not currently
contain an entry for the input ID, the output ID is the index
of the next free table entry.
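The table lookup described above can be modeled for one direction as follows. This Python sketch is illustrative only: the class name `IdRemapper` is hypothetical, the "next free entry" is taken to be the free entry with the lowest index, and a full table is assumed to stall the command (returned as `None`).

```python
class IdRemapper:
    """One table per direction, indexed by output ID.

    Each entry holds an input ID and a counter of in-flight transactions;
    an entry with counter 0 is free.
    """

    def __init__(self, num_entries):
        self.in_id = [None] * num_entries
        self.count = [0] * num_entries

    def remap_command(self, in_id):
        """Return the output ID for a command, or None if the table is full."""
        free = None
        for out_id, cnt in enumerate(self.count):
            if cnt > 0 and self.in_id[out_id] == in_id:
                self.count[out_id] += 1  # same input ID -> same output ID (O1)
                return out_id
            if cnt == 0 and free is None:
                free = out_id            # remember the first free entry
        if free is None:
            return None                  # assumption: the command stalls
        self.in_id[free] = in_id
        self.count[free] = 1
        return free

    def remap_response(self, out_id, last=True):
        """Return the input ID for a response; count down on the last beat."""
        assert self.count[out_id] > 0, "response without command"
        in_id = self.in_id[out_id]
        if last:
            self.count[out_id] -= 1
        return in_id
```

The loop in `remap_command` corresponds to the comparison of the input ID against all table entries, which is parallel in hardware, while `remap_response` is a plain indexed read.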
Alternative ID remapper architectures could feature an
additional table indexed by input IDs to look output IDs up.
However, under the assumption of the remapper that the
input ID space is sparse, such an additional table would be
mostly empty. Therefore, it would be a poor usage of hardware
resources and we omit it at the cost of a longer ID translation
path, which could be pipelined.
2.3.2 ID Serializer
If the number of unique IDs at the input of the ID width
converter, U, exceeds the number of available IDs at the
output, 2^O, both the input and the output ID space are
densely used. In this case, it is not possible to retain the
uniqueness of all IDs during conversion, and we call the
transformation that imposes additional ordering serialization.
Serialized transactions still have concurrently outstanding
requests, but they are now required to be handled in-order.
Our ID serializer, shown in Fig. 6, transforms IDs with
one FIFO per direction and master port ID. At the slave port
of the serializer, a demultiplexer assigns transactions to one
of the FIFO submodules through a combinational function f
of the request ID (e.g., the ID modulo the number of master
port IDs). The demultiplexer is a reduced configuration of our
network demultiplexer (§ 2.1.2) without ID counters because
f assigns identical IDs to the same master port (and thus the
same output ID (O1)). In each FIFO submodule, the ID of a
request is pushed into a FIFO and then truncated to zero. This
FIFO reflects the transaction ID in responses (O2), and the last
response of a transaction pops from the FIFO. After the FIFOs,
an instance of our network multiplexer (§ 2.1.1) assigns each
transaction the index of its FIFO and merges the requests to
the single master port of the ID serializer.
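A behavioral model of one direction of the serializer is sketched below in Python. The class name `IdSerializer` is hypothetical, `f` is chosen as the modulo function mentioned above, and the FIFOs are unbounded here, whereas hardware FIFOs have a fixed depth and would exert backpressure when full.

```python
from collections import deque


class IdSerializer:
    """One FIFO per master port ID (for a single direction)."""

    def __init__(self, num_out_ids):
        self.fifos = [deque() for _ in range(num_out_ids)]

    def f(self, in_id):
        """Combinational mapping from input ID to FIFO index (here: modulo)."""
        return in_id % len(self.fifos)

    def on_command(self, in_id):
        """Push the input ID; the command leaves with the FIFO index as ID."""
        out_id = self.f(in_id)
        self.fifos[out_id].append(in_id)
        return out_id

    def on_response(self, out_id, last=True):
        """Reflect the oldest input ID; pop it on the last response beat."""
        in_id = self.fifos[out_id][0]
        if last:
            self.fifos[out_id].popleft()
        return in_id
```

The model relies on (O2): because all transactions that share an output ID return in order, the head of each FIFO always holds the input ID of the next response.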
Alternative ID serializer architectures could use one
memory where one linked list per master port ID is stored for
ID reflection. This would allow queues to grow dynamically
in memory rather than statically provisioning hardware
resources to accommodate a fixed maximum of transactions
per master port ID. However, pushing and popping IDs from
this memory is on the critical path of the serializer, so we
prefer the architecture with multiple FIFOs.
2.4 Data Width Converters
The data width of network components depends on their
bandwidth requirements. For instance, the master port of
a high-performance DMA engine might have 512 bit data
width while that of a 64-bit processor core typically has 64 bit.
This extends to subnetworks, e.g., separate networks for the
DMA engine and the cores. However, as subnetworks with
different data widths are joined, e.g., at endpoints such as
memories, data width converters (DWCs) are required to
convert between data widths. DWCs can be either upsizers,
converting from narrow to wide, or downsizers, converting
from wide to narrow. Although similar in purpose, up- and
downsizer are not fully symmetric. In fact, the upsizer has
higher performance requirements than the downsizer, since
it must utilize the higher-bandwidth network as much as
possible to minimize the impact on other components on the
high-bandwidth network.
2.4.1 Data Upsizer
A data upsizer has a narrow slave port of data width DN and a
wider master port of data width DW. In the simplest operating
mode, pass-through, the upsizer does lane selection on read
responses (Fig. 7a), selecting a slice of a wide incoming word,
and lane steering on write data, aligning narrow incoming data
into the wider outgoing word (Fig. 7b). In pass-through mode,
the upsizer does not change the number of bytes transferred in
each beat. This can be required by transaction attributes (e.g.,
to device memory). In terms of performance, however, this
underutilizes the high-bandwidth network, which inherits the
throughput of its low-bandwidth counterpart. Utilization can
be increased by reshaping incoming bursts with many narrow
beats into bursts with fewer wide beats: several narrow write
data beats are packed into one wide beat, and one wide read
response beat is serialized into several narrow beats.
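Burst reshaping can be illustrated on byte strings. The following Python sketch makes simplifying assumptions that are not taken from the platform: the burst starts at an address aligned to the wide data width, byte strobes are ignored, and a partially filled last wide beat is padded with zeros. The function names `pack_beats` and `serialize_beat` are hypothetical.

```python
def pack_beats(narrow_beats, narrow_bytes, wide_bytes):
    """Pack narrow write data beats into wide beats.

    Each narrow beat is a `bytes` object of length `narrow_bytes`. A
    trailing, partially filled wide beat is padded with zero bytes.
    """
    assert wide_bytes % narrow_bytes == 0
    assert all(len(b) == narrow_bytes for b in narrow_beats)
    data = b"".join(narrow_beats)
    wide_beats = []
    for offset in range(0, len(data), wide_bytes):
        beat = data[offset:offset + wide_bytes]
        wide_beats.append(beat.ljust(wide_bytes, b"\x00"))
    return wide_beats


def serialize_beat(wide_beat, narrow_bytes):
    """Serialize one wide read response beat into narrow beats."""
    return [wide_beat[i:i + narrow_bytes]
            for i in range(0, len(wide_beat), narrow_bytes)]
```

Five 2-byte beats, for example, are packed into two 8-byte beats instead of occupying the wide network for five cycles in pass-through mode.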
Our data upsizer, shown in Fig. 7c, is capable of upsizing
between interfaces of any data width. It is composed of two
modules, read and write upsizers, which perform lane selection
and steering and decide whether to upsize the request
based on the transaction properties. Due to (O3), only one
write upsizer is needed, containing a buffer of width DW to
perform data packing. On the read response channel, the
data upsizer handles a certain number of outstanding read
transactions in parallel. Each incoming read transaction is
assigned an idle read upsizer, unless there is an active upsizer
handling a transaction with the same ID. For that case, we
ensure (O1) by enforcing that incoming transactions with the
same ID are handled by the same read upsizer. Each read
upsizer has a DW buffer to hold incoming beats. This avoids
blocking the wide read response channel during serialization.
2.4.2 Data Downsizer
A data downsizer has a wide slave port of data width DW
and a narrower master port of data width DN. In the simplest
operating mode, pass-through, the downsizer does steering
on the read data channel and selection on the write data
channel, symmetrical to the base operations of the data upsizer.
Our downsizer, shown in Fig. 7d, differs from the upsizer in
two key points: First, the downsizer has lower performance
requirements than the data upsizer, since it connects to a
lower-bandwidth subnetwork, e.g., peripherals. This means it
does not need to support multiple outstanding reads. Second,
when downsizing, the downsizer converts few wide beats
into multiple narrow beats. It is possible that the resulting
burst is longer than the longest burst allowed by the protocol.
In this case, the downsizer needs to break the incoming burst
into a sequence of bursts. To handle this corner case, among
others, the control logic of the read and write downsizers is
more complex than that in the upsizer.
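The burst-splitting corner case can be made concrete with a short calculation. The Python sketch below assumes full-width beats and ignores address-boundary restrictions; the function name `downsize_burst` is hypothetical, and the default of 256 beats corresponds to the limit for incrementing bursts in AXI4 and later revisions.

```python
def downsize_burst(wide_beats, wide_bytes, narrow_bytes, max_beats=256):
    """Return the lengths (in narrow beats) of the bursts that replace one
    wide burst of `wide_beats` full-width beats.

    `max_beats` is the longest burst the protocol allows; the default is
    an assumption of this sketch (incrementing bursts in AXI4 and later).
    """
    assert wide_bytes % narrow_bytes == 0
    total = wide_beats * (wide_bytes // narrow_bytes)
    lengths = []
    while total > 0:
        n = min(total, max_beats)  # never exceed the protocol limit
        lengths.append(n)
        total -= n
    return lengths
```

A burst of 64 beats at 512 bit, downsized to 64 bit, corresponds to 512 narrow beats and therefore has to be issued as two bursts of 256 beats each.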
2.5 Data Movement: DMA Engine
Transferring large amounts of data at high bandwidth requires
dedicated components for data movement called direct memory
access (DMA) engines. Our DMA architecture is designed to be
modular, dividing the unit into two parts: a system-specific
frontend and a backend implementing the data movement
within the on-chip interconnect. We define a simple, yet
well-defined interface uniting both parts: a one-dimensional
and contiguous memory block of arbitrary length, source,
and destination address, called 1D transfer. We chose this
interface abstraction because 1D transfers map very well to
burst-based transactions. More complex transfers, such as
multi-dimensional or strided accesses, are decomposed by
the frontend into 1D transfers. As the frontend is highly
system-specific, we will not discuss it.
In the backend, the burst reshaper, shown in Fig. 8a, divides
the arbitrary-length 1D transfers into protocol-compliant
bursts (adhering to, e.g., address boundaries and maximum
number of beats). On arrival of a new 1D transfer, the
burst reshaper loads length, source address, and destination
address into internal registers. The burst boundaries process
determines the number of bytes that can be requested in
the next burst. With this, the burst reshaper calculates the
address of the next burst and the remaining bytes left in the
1D transfer. Each protocol-compliant burst is then translated
by the data mover unit, shown in Fig. 8b, into a read and a
write command as well as a read and a write data job. The
commands are issued as beats on the command channels.
The data jobs are forwarded to the data path. The data path,
shown in Fig. 8c, receives read data beats, realigns the data to
compensate for different byte offsets between the read and
write data streams, and issues write data beats. The data path
consists of two independent processes. The read process
realigns and buffers incoming data. If a burst starts on an
unaligned address, some leading bytes (“head”) in the first
beat are invalid and are masked. Similarly, a burst may end on
an unaligned address, in which case some trailing bytes in the
last beat (“tail”) need to be masked. The write process drains
data from the buffer as soon as it is available and masks it
according to the destination address offset with the strobe
signal of the write data channel.
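The address and length calculation of the burst reshaper can be sketched as follows. This Python model is an illustration under stated assumptions, not the implemented logic: the function name `reshape_1d_transfer` is hypothetical, and the defaults of 256 beats per burst and a 4 KiB address boundary are the AXI limits for incrementing bursts.

```python
def reshape_1d_transfer(addr, length, beat_bytes, max_beats=256, boundary=4096):
    """Split a 1D transfer into (address, num_bytes) bursts.

    No burst crosses a `boundary`-aligned address, and no burst spans more
    than `max_beats` beats of `beat_bytes` bytes each.
    """
    bursts = []
    while length > 0:
        # Bytes left until the next address boundary.
        to_boundary = boundary - (addr % boundary)
        # Bytes covered by max_beats beats from a possibly unaligned start:
        # the first beat only carries the bytes up to the next beat-aligned
        # address.
        to_max_beats = max_beats * beat_bytes - (addr % beat_bytes)
        num_bytes = min(length, to_boundary, to_max_beats)
        bursts.append((addr, num_bytes))
        addr += num_bytes      # address of the next burst
        length -= num_bytes    # bytes remaining in the 1D transfer
    return bursts
```

Each returned tuple would be applied once to the source address stream (read command) and once to the destination address stream (write command); head and tail masking in the data path then handles the unaligned bytes within the first and last beat.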
2.6 On-Chip Memory Controllers
On-chip memories are an important class of endpoints for
on-chip network transactions. In this section, we describe two
memory controllers through which standard single-port static
random access memory (SRAM) macros can be connected to
the on-chip network.
2.6.1 Simplex Memory Controller
The architecture of our simplex on-chip memory controller
is shown in Fig. 9.
Figure 7. Architecture of our data width converters (DWCs). (a) Data selection in the read response and (b) data steering in the write data channel of
the upsizer. (c) Upsizer, drawn with two outstanding read transactions. (d) Downsizer.
Figure 8. Architecture of our DMA engine. (a) Burst reshaper. (b) Data mover. (c) Data path, drawn for 64 bit data width.
Figure 9. Architecture of our simplex on-chip memory controller, with
the on-chip network slave port at the top and the memory master port
at the bottom. The memory master port has the same data width as the
network slave port.
Simplex in this context means that the
controller in each clock cycle can either read or write memory,
as is natural for a single-port SRAM. The memory controller
first translates read commands and write commands plus
write data into memory requests. An arbiter then forwards
either a read or a write memory request per clock cycle. This
arbiter optionally takes quality of service (QoS) attributes
of a command into account and can prioritize write beats,
which cannot be interleaved due to (O3), over read beats. A
stream fork unit splits address and data, which go to the
memory interface, and meta data (e.g., the transaction ID),
which are used by the memory controller to form responses in
the network protocol. A converter translates the address and
data stream into memory interface signals (with stream flow
control on the request and no handshaking on the response
path). The memory responses are then joined with meta data
to form read or write responses, which are finally issued on
the corresponding network response channel.
The simplex memory controller cannot achieve the full
bidirectional bandwidth of the duplex on-chip network inter-
face, which has separate channels for read and write data. The
duplex memory controller removes this limitation.
2.6.2 Duplex Memory Controller
The architecture of our duplex memory controller is shown in
Fig. 10. To saturate the read and write data channels of the
on-chip network simultaneously (thus duplex), this memory
controller has at least two independent memory master ports
as well as one simplex controller for writes and one for reads.
A network demultiplexer statically routes all writes through
the left controller and all reads through the right controller.
Figure 10. Architecture of our duplex on-chip memory controller with
four address-interleaved memory master ports.
The unused resources inside both simplex controllers are
optimized away during synthesis. A logarithmic memory
interconnect then routes each request to one of the memory
master ports, which are address-interleaved.
The duplex memory controller can fully saturate both the
read and the write data channel of the on-chip network in the
absence of conflicts on the memory ports. However, irregular
traffic (e.g., misaligned addresses, mixed wide and narrow
beats) can give rise to a significant conflict rate. To reduce
conflicts, the banking factor (i.e., the number of memory master
ports per network slave port) can be increased to any integer
higher than 2 (at the cost of more wide and shallow SRAM
macros when the memory capacity is to remain constant).
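The effect of the banking factor on conflicts can be illustrated with a minimal Python sketch. It assumes word-granular interleaving on the low-order word address bits and counts, over all pairs of word addresses in a small window, how often a read and a write issued in the same cycle select the same memory master port. The function names, the 8 B word size, and the exhaustive pairing are illustrative assumptions and are not taken from the implementation.

```python
def bank_of(addr: int, bytes_per_word: int, num_banks: int) -> int:
    """Memory master port selected by word-level address interleaving (assumed scheme)."""
    return (addr // bytes_per_word) % num_banks


def conflicts(read_addr: int, write_addr: int, bytes_per_word: int, num_banks: int) -> bool:
    """True if a read and a write issued in the same cycle target the same bank."""
    return bank_of(read_addr, bytes_per_word, num_banks) == bank_of(
        write_addr, bytes_per_word, num_banks
    )


def conflict_fraction(num_banks: int, bytes_per_word: int = 8, words: int = 64) -> float:
    """Fraction of all (read, write) word-address pairs that collide on one bank."""
    addrs = [w * bytes_per_word for w in range(words)]
    hits = sum(
        conflicts(r, w, bytes_per_word, num_banks) for r in addrs for w in addrs
    )
    return hits / (len(addrs) ** 2)


for banks in (2, 4, 8):
    print(banks, conflict_fraction(banks))
```

Under these assumptions (uniformly distributed, independent accesses), the conflict fraction is the reciprocal of the banking factor, which is why a higher banking factor helps with irregular traffic.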
2.7 Last Level Cache
In contrast to the on-chip memory controllers of § 2.6, where
the memory content is fully managed by software (so-called
scratch-pad memories (SPMs)), a cache provides on-chip
memory fully managed in hardware. As this work focuses
on non-coherent on-chip communication, we present a non-
coherent last level cache (LLC). Even though traditionally caches
are not seen as part of the communication infrastructure, we
include this LLC in ours because it can reduce latency and
bandwidth between its slave (ingress) port and its master
(refill) port. This is very useful, for example, in front of an
off-chip memory controller.
Figure 11. Architecture of our last level cache (LLC).
Our LLC’s set associativity, number of cache lines, and
number of cache blocks per cache line are synthesis parame-
ters, giving complete control over the physical size and shape
of the cache. It uses a write-back, read and write allocate
data policy with pseudo-random eviction. The cache supports
concurrent read and write accesses as well as eviction and
refill operations. Reads are interleaved while adhering to
(O1–2). Transactions that hit in the cache can bypass earlier
transactions that missed in the cache and are currently being
serviced (i.e., eviction and refill) as far as permitted by (O1).
As not all applications benefit from a hardware-managed
cache, our LLC can be reconfigured at runtime to partially
or fully become a software-managed SPM. This option is
available at the granularity of single cache sets. It is possible
to use the entire data memory of our LLC as SPM. In that
case, all accesses outside the address range of the SPM bypass
the core of the LLC and are directly forwarded to the master
port. This bypass is also used for non-cacheable transactions.
The architecture of our LLC is shown in Fig. 11. Like most
components in our platform, the LLC is implemented with
the stream-based control scheme that is natural to on-chip
communication. The main idea is to start from the command
and write data beats at the slave port, then transform, split,
and merge them into descriptors that flow through the cache
and give rise to new commands (for evictions and refills)
and eventually to read and write responses. Starting at the
slave port, commands are decoded by address and memory
attributes and either sent to the bypass or into the core
of the LLC. A command beat enters the cache over the
command splitting units. They split the command down
into descriptors, each of which targets exactly one cache line.
These splitters also determine whether the access targets a
cache set or an SPM region. Afterwards, the descriptors are
arbitrated together with flush descriptors into a common
pipeline. The descriptors then enter the hit-miss detection
unit. Descriptors flagged as SPM simply flow through this
unit, whereas all other descriptors perform a lookup inside
the tag storage. The comparison and eviction unit determines
the exact cache line and set of the descriptor. Additionally, this
unit determines whether the descriptor gives rise to a refill
or eviction. Descriptors that miss in the cache are sent to the
eviction and refill pipeline, whereas descriptors that hit bypass
Figure 12. Minimum clock period and corresponding area of our multi-
plexer in GF22FDX for 2 to 32 slave ports and 6 ID bits.
this pipeline, which reduces their access latency. Two units
ensure the descriptors maintain data consistency and adhere to
(O1–3): The index and miss counters prevent a descriptor
in the hit bypass from overtaking another descriptor in the miss
pipeline with the same ID. The line lock allows only one
descriptor to operate on a cache line and set at a time, which
prevents data corruption that could occur from descriptors
evicting a cache line used by another descriptor. Four units
manipulate the data SRAMs of our LLC: the eviction and refill
units, which update the state of the data prior to a requested
operation, and the read and write units, which perform the
actual cache operation. All four units are connected over a
logarithmic memory interconnect to the data SRAMs. The
data width of the data channels and the SRAM data ports
correspond to the cache block width. This setup allows all
four units to concurrently have one descriptor each active on
the data, thereby using the maximum available bandwidth of
the slave and the master port of the LLC.
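The command splitting step described above can be sketched in Python. The sketch assumes 128 B cache lines (16 blocks of 8 B, the configuration evaluated in § 3.7) and byte-granular bursts; `Descriptor` and `split_command` are hypothetical names, and descriptors in the actual design carry further fields such as the ID and the SPM flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Descriptor:
    line_addr: int   # base address of the targeted cache line
    offset: int      # byte offset of the access inside that line
    num_bytes: int   # number of bytes this descriptor covers


def split_command(addr: int, num_bytes: int, line_bytes: int = 128) -> List[Descriptor]:
    """Split one command into descriptors that each target exactly one cache line."""
    descriptors = []
    end = addr + num_bytes
    while addr < end:
        line_addr = addr - addr % line_bytes
        # Stop at the end of the burst or at the next line boundary, whichever is first.
        chunk = min(end, line_addr + line_bytes) - addr
        descriptors.append(Descriptor(line_addr, addr - line_addr, chunk))
        addr += chunk
    return descriptors


# A 256 B burst starting 16 B before a line boundary touches three cache lines.
for d in split_command(0x70, 0x100):
    print(d)
```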
3 IMPLEMENTATION RESULTS
This section provides quantitative and asymptotic complexity
results for our network components. These results are essential
for architects to assess the feasibility and strike trade-offs in
the design of on-chip networks.
We implement the components presented in § 2 in
GlobalFoundries’ 22 nm fully-depleted silicon-on-insulator
(GF22FDX) technology, using a ten-metal stack and eight-track
SLVT/LVT flip-well standard cells characterized at typical
conditions (0.8 V, 25 °C). We synthesize with Synopsys DesignCompiler
2019.12 using topographical mode, so physical
place-and-route constraints, dimensions, and delays are taken
into account. For the isolated implementation of the modules,
each input is driven by a D-flip-flop (FF), and each output
drives a D-FF. Unless we vary it in the evaluation, we set
the address and data width to 64 bit and the slave port ID
width to 6 bit. Before undergoing synthesis, all modules have
been verified for protocol compliance in RTL simulation under
extensive directed and constrained random verification tests.
3.1 Stream Flow Control: (De)Multiplexers
3.1.1 Network Multiplexer
The critical path of the multiplexer goes from a slave
port command channel through the arbitration tree on its
handshake signals and the multiplexers on its payload signals
to a master port command channel. For S slave ports, it scales
with O (log S) due to the logarithmic depth of the arbitration
tree and the multiplexers. The area scales with O (S) due to the
linear area of the arbitration tree and the multiplexers. The
area is further linear in the ID width and the maximum
number of write transactions due to the FIFO between write
command and data channel, but this part is usually negligible.
Fig. 12 shows the area and timing characteristics of our
Figure 13. Minimum clock period and corresponding area of our demul-
tiplexer in GF22FDX: (a) with 2 to 32 master ports and 6 ID bits, and
(b) with 4 master ports and 2 to 8 ID bits.
Figure 14. Minimum clock period and corresponding area of our crossbar
with 4 slave ports, fully connected and unpipelined, in GF22FDX: (a) with
2 to 8 master ports, 4 slave ports and 6 ID bits, and (b) with 4 master ports
and 2 to 8 ID bits at the slave port.
multiplexer: for 2 to 32 slave ports, the critical path increases
logarithmically from 190 to 270 ps, and the area increases
linearly from 2 to 30 kGE.
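The logarithmic depth and linear size of the arbitration tree can be made concrete with a short Python sketch. It assumes a binary tree of two-input arbitration nodes, which is one common way to build such a tree; the function name is hypothetical.

```python
def arbitration_tree(num_slave_ports: int):
    """Return (depth, nodes) of a binary arbitration tree over the slave ports.

    depth = ceil(log2(S)) levels lie on the critical path,
    nodes = S - 1 two-input arbiters determine the area.
    """
    if num_slave_ports < 1:
        raise ValueError("need at least one slave port")
    depth = (num_slave_ports - 1).bit_length()  # exact ceil(log2(S)) for integers
    nodes = num_slave_ports - 1
    return depth, nodes


for s in (2, 4, 8, 16, 32):
    print(s, arbitration_tree(s))
```

Going from 2 to 32 slave ports multiplies the node count by 31 but adds only four tree levels, which matches the qualitative trend of a slowly growing critical path and a linearly growing area.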
3.1.2 Network Demultiplexer
The critical path of the demultiplexer goes from a slave
command channel through ID lookup to a command channel
on one of the master ports. It scales with O (M) as the stream
demultiplexers grow linearly in area with the master ports
and topographical synthesis takes the distance increase into
account. The area scales with O (M) due to the linear area
of the arbitration trees and the stream demultiplexers. The
ID width I is critical for the demultiplexer: the area scales
with O (2^I) due to the exponential number of counters (one
for every possible ID), and the critical path scales with O (I)
because every ID bit adds a multiplexer level in the indexing
logic of the counters. Fig. 13 shows the area and timing
characteristics of our demultiplexer: For 2 to 32 master ports
and 6 ID bits (Fig. 13a), the critical path increases linearly from
330 to 430 ps, and the area increases linearly from 22 to 38 kGE.
The curve is non-monotonic mainly in two points, where
the synthesizer selects disproportionately strong and large
buffers to reach the target frequency. For 4 master ports and
2 to 8 ID bits (Fig. 13b), the critical path increases linearly
from 250 to 400 ps, and the area increases exponentially from
5 to 95 kGE. Depending on the ID width, the critical path
can be significantly longer than in the multiplexer, so the
demultiplexer will be the critical stage in a pipelined network
junction.
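Why the ID width is so costly for the demultiplexer can be seen from a behavioral Python sketch of a per-ID counter table. The sketch assumes that each counter tracks the in-flight transactions of one ID together with the master port they were routed to, and that a transaction whose ID is in flight towards a different master port must stall; the class and method names are hypothetical and the stall rule is an assumption about how such counters are typically used to preserve per-ID ordering.

```python
class DemuxIdCounters:
    """Hypothetical model of the per-ID in-flight counters of a demultiplexer."""

    def __init__(self, id_width: int, max_per_id: int):
        num_ids = 1 << id_width           # one counter per possible ID: 2^I entries
        self.count = [0] * num_ids        # in-flight transactions per ID
        self.port = [None] * num_ids      # master port used by the in-flight ones
        self.max_per_id = max_per_id

    def can_issue(self, txn_id: int, master_port: int) -> bool:
        if self.count[txn_id] == 0:
            return True                   # ID is idle: any master port is fine
        same_port = self.port[txn_id] == master_port
        return same_port and self.count[txn_id] < self.max_per_id

    def issue(self, txn_id: int, master_port: int) -> None:
        if not self.can_issue(txn_id, master_port):
            raise RuntimeError("transaction must stall")
        self.count[txn_id] += 1
        self.port[txn_id] = master_port

    def retire(self, txn_id: int) -> None:
        if self.count[txn_id] == 0:
            raise RuntimeError("no transaction in flight for this ID")
        self.count[txn_id] -= 1


# The table doubles with every additional ID bit.
print([len(DemuxIdCounters(i, 8).count) for i in (2, 4, 6, 8)])
```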
3.2 Network Junctions: Crossbars and Crosspoints
3.2.1 Crossbar
For a fully-connected crossbar with S slave ports, M master
ports and I bits at the slave port, the critical path is dominated
by the demultiplexer, thus scales with O (M + I). The area
is the sum of the area of the S demultiplexers and M
multiplexers plus a small overhead for each slave port for
address decoding and the error slave (when instantiated). The
area thus scales with O (MS + 2^I S). Fig. 14 shows the area
and timing characteristics of a fully-connected, unpipelined
instance of our crossbar: For 4 slave ports, 2 to 8 master
Figure 15. Minimum clock period and corresponding area of our cross-
point with 4 slave ports, fully connected and pipelined, in GF22FDX:
(a) with 2 to 8 master ports, 4 slave ports and 6 ID bits, and (b) with 4
master ports and 2 to 8 ID bits at the ports.
Figure 16. Minimum clock period and corresponding area of our ID
remapper in GF22FDX: (a) for 1 to 64 concurrent unique IDs and 8
transactions per ID, and (b) for 16 concurrent unique IDs and 1 to 32
transactions per ID.
ports and 6 ID bits (Fig. 14a), the critical path increases
linearly from 400 to 450 ps, and the area increases linearly
from 111 to 156 kGE. As was the case for the demultiplexer
(§ 3.1.2), the ID width of the slave ports has significant impact
on the critical path and area of the crossbar. For 4 master and
4 slave ports and 2 to 8 ID bits (Fig. 14b), the critical path
increases linearly from 340 to 460 ps, and the area increases
exponentially from 42 to 390 kGE.
3.2.2 Crosspoint
The critical path of a fully pipelined crosspoint goes from
the internal pipeline register of a master port into the table
of an ID remapper. For M master ports (Fig. 15a), it scales
with O (M) from 610 to 630 ps as topographical synthesis
takes the area increase into account. The area also scales with
O (M) but much more significantly from 243 to 587 kGE as
the crossbar and the number of ID remappers scale linearly.
Regarding the ID width I , the crosspoint is dominated by
the demultiplexer: For 2 to 8 ID bits in a 4× 4 configuration
(Fig. 15b), the area scales with O (2^I) from 127 to 1181 kGE
and the critical path scales with O (I) from 290 to 800 ps.
3.3 Concurrent Transactions: ID Width Converters
3.3.1 ID Remapper
The critical path of our ID remapper goes from the input ID
through the ID equality comparators in the table, through
a leading-zero counter (LZC) to determine the matching
or the first free output ID, into a table counter entry. For
an input ID width of I , up to U concurrent unique IDs
(per direction), and up to T transactions per ID, it scales
with O (log I + log U + log T). The area is dominated by
the tables, which have U entries with I + log2 T bits each.
Additionally, the LZCs have an area of O (U log U). The total
area thus scales with O (U(I + log T + log U)). Fig. 16 shows
the area and timing characteristics of our ID remapper: For
U = 1 to 64 concurrent unique IDs and T = 8 transactions
per ID (Fig. 16a), the critical path increases logarithmically
from 200 to 520 ps until U = 48 and then linearly to 640 ps
for U = 64 as path delays due to the linearly growing table
start to dominate. The area increases linearly from 1 to 41 kGE.
Figure 17. Minimum clock period and corresponding area of our ID
serializer in GF22FDX: (a) for 1 to 32 IDs at the master port and 8
transactions per master port ID, and (b) for 4 IDs and 1 to 32 transactions
per ID at the master port.
The highest (rightmost) configuration can remap up to 512
transactions in both directions with up to 64 unique IDs
concurrently, but the area and critical path costs are quite
high. In comparison, for U = 16 concurrent unique IDs
and T = 1 to 32 transactions per ID (Fig. 16b), the critical
path increases logarithmically from 300 to 440 ps, and the
area increases logarithmically from 7 to 16 kGE. Thus, the
highest (rightmost) configuration can also remap up to 512
transactions but with only up to 16 unique IDs concurrently,
at a 2.6× lower area and 1.5× shorter critical path.
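The table lookup on the critical path of the ID remapper (match an existing entry, otherwise take the first free one) can be sketched behaviorally in Python. The class below is a hypothetical model for one direction with U table entries and at most T transactions per entry; the names and the stall-by-returning-`None` convention are assumptions for illustration.

```python
from typing import List, Optional


class IdRemapper:
    """Hypothetical behavioral model of one direction of an ID remapper."""

    def __init__(self, max_unique_ids: int, max_txns_per_id: int):
        # One table entry per output ID: [input ID, in-flight counter] or None if free.
        self.table: List[Optional[list]] = [None] * max_unique_ids
        self.max_txns_per_id = max_txns_per_id

    def push(self, in_id: int) -> Optional[int]:
        """Return the output ID for a new transaction, or None if it must stall."""
        # Equality comparators: reuse the entry that already holds this input ID.
        for out_id, entry in enumerate(self.table):
            if entry is not None and entry[0] == in_id:
                if entry[1] >= self.max_txns_per_id:
                    return None
                entry[1] += 1
                return out_id
        # No match: take the first free entry (the role of the leading-zero counter).
        for out_id, entry in enumerate(self.table):
            if entry is None:
                self.table[out_id] = [in_id, 1]
                return out_id
        return None

    def pop(self, out_id: int) -> int:
        """Handle a response: restore the input ID and free the entry when idle."""
        entry = self.table[out_id]
        if entry is None:
            raise RuntimeError("response for an unused output ID")
        in_id = entry[0]
        entry[1] -= 1
        if entry[1] == 0:
            self.table[out_id] = None
        return in_id
```

With U = 16 and T = 32, such a table tracks up to 512 transactions per direction while only 16 equality comparators are needed, which mirrors the trade-off discussed above.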
3.3.2 ID Serializer
The critical path of the ID serializer goes through the demulti-
plexer, the push side of the ID FIFO, and the arbitration tree
in the multiplexer. For UM IDs at the master port and T trans-
actions per master port ID, it scales with O (logUM + log T ).
The area scales with O (UM + T ) due to the linear area of
all components in either UM or T . Fig. 17 shows the area
and timing characteristics of our serializer: For UM = 1 to 32
IDs at the master port and T = 8 transactions per master
port ID (Fig. 17a), the critical path increases logarithmically
from 195 to 410 ps, and the area increases linearly from
2 to 109 kGE. Clearly, compressing a densely used ID space
is expensive in terms of area. This cost can be reduced by
fixing UM at a low value and varying T : For UM = 4 IDs and
T = 1 to 32 transactions per ID at the master port (Fig. 17b),
the critical path increases logarithmically from 245 to 280 ps, and
the area increases linearly from 15 to 51 kGE. 128 concurrent
transactions (in both directions) could therefore be serialized
with UM = 4, T = 32 at 1.28× less area and 1.29× shorter
critical path.
3.4 Data Width Converters
For our data downsizer between a wide slave port of width
DW and a narrow master port of width DN, the critical path
goes through the data selection and steering logic, scaling
logarithmically with the downsize ratio O (log (DW/DN)).
The area is O (DNDW), the first term accounting for the
multiplexing logic for data selection and steering, and the
second accounting for the registers that hold a wide beat for
data packing on the write data channel. Fig. 18a (left side)
shows the area and timing characteristics of our downsizer: for
a master port of width 64 bits and a slave port of width 8 to 32
bits, the critical path decreases with increasing width of the
slave port (and decreasing downsize ratio), from 365 to 390 ps,
while the area grows linearly from 23 to 25 kGE.
For the data upsizer between a narrow slave port of width
DN and a wide master port of width DW, the critical path goes
through the data selection logic and the round-robin arbiter,
scaling linearly with the number of read upsizers R and log-
arithmically with the upsize ratio, O (R log (DW/DN)). The
[Plot for Figure 18b. x-axis: # of read upsizers (1, 2, 4, 8); left y-axis: minimum clock period [ps]; right y-axis: area at minimum clock period [kGE].]
Figure 18. Minimum clock period and corresponding area of: (a) our data
downsizer and upsizer, considering a master port 64-bits wide and a
slave port 8 to 512-bits wide and (b) our data upsizer, considering a
master port 64-bits wide, a slave port 128-bits wide, and 1 to 8 read
upsizers.
Figure 19. Minimum clock period and corresponding area in GF22FDX
of (a) our DMA engine for 16 to 1024 bit data width, and (b) our simplex
on-chip memory controller for 8 to 1024 bit data width.
area of the upsizer scales with O (RDNDW), compounding
the effect of the multiplexing logic for data selection and
steering, DN, and of the R DW-bit registers holding wide
beats for data serialization on the read data channel. Fig. 18a
(right side) shows the area and timing characteristics of our
upsizer: for a master port of width 64 bits and a slave port
of width 128 to 512 bits, the critical path increases with the
increasing upsize ratio, from 380 to 405 ps, while the area
increases from 27 to 35 kGE. Fig. 18b shows the area and
timing characteristics of the data upsizer from 64 to 128 bits,
for 1 to 8 read upsizers. These have an important effect on the
area and critical path of the upsizer. The critical path of the
upsizer increases linearly from 380 to 485 ps, while the area
increases from 27 to 59 kGE.
3.5 Data Streaming: DMA Engine
The area of the DMA engine scales with O (D), where D is
the data width, due to the linearly growing alignment buffer.
The critical path is dominated by the barrel shifter, which
scales with O (log D). For 16 to 1024 bit data width (Fig. 19a),
the critical path increases logarithmically from 290 to 400 ps
and the area increases linearly from 25 to 141 kGE. As the
DMA engine uses the same ID for all transactions, the ID
width affects neither area nor critical path.
3.6 On-Chip Memory Controllers
3.6.1 Simplex Memory Controller
For a simplex on-chip memory controller with a data width
of D, the critical path is constant and found between the
command slave channels and the memory request master port.
The critical path does not depend on D as the transformation
of commands does not depend on the data width. Fig. 19b
shows the area and timing characteristics: The area scales
linearly with O (D) from 13 to 53 kGE; this linear dependency
is caused by the dominant read response buffers needed for
response path decoupling. The critical path remains roughly
constant around 290 ps. The ID width has no impact on the
critical path, as the simplex controller handles all requests in
order and only buffers the ID for the response. The area scales
with O (I) due to these buffers.
Figure 20. Minimum clock period and corresponding area of our duplex
on-chip memory controller in GF22FDX: (a) for 8 to 1024 bit data width
and two memory master ports, and (b) for 64 bit data width and 1 to 8
memory master ports.
Figure 21. Minimum clock period and corresponding area of our last level
cache in GF22FDX, with a set associativity of 4, 16 blocks per cache line,
8 B per block, and 64 bit addresses, (a) without SRAM and (b) with SRAM.
3.6.2 Duplex Memory Controller
The critical path of the duplex controller goes from the
slave port command channels through the demultiplexer,
one simplex memory controller, and the logarithmic memory
interconnect to a memory request port. For a data width of D
and B memory master ports, it scales with O (log D). The area
is composed of the demultiplexer, the two simplex memory
controllers, and the logarithmic interconnect, and thus scales
with O (B +D). Fig. 20 shows the area and timing character-
istics of our duplex memory controller: For D = 8 to 1024 bit
data width and B = 2 memory master ports (Fig. 20a), the
critical path increases logarithmically from 280 to 330 ps, and
the area increases linearly from 20 to 175 kGE. For D = 64 bit
data width and B = 2 to 8 memory master ports (Fig. 20b),
the critical path stays constant around 300 ps and the area
scales with O (B) from 28 to 34 kGE. Regarding the ID width
I , the complexity is defined by the demultiplexer.
3.7 Last Level Cache
We evaluate our LLC with a set associativity of 4, 16 blocks
per cache line, and 8 B per block, and we vary the cache size
through the number of cache lines L. Area and critical path of
a cache are commonly dominated by its SRAM macros, but it
is essential that the control logic adds only minimal overhead.
The control logic remains constant in area when increasing
the cache size with L, as shown in Fig. 21a. The critical path is
inside the tag lookup unit, starting at the tag memory, going
through the tag comparators, and ending again in the tag
memory. The logic on the critical path does not increase with
L; however, the tag memories get larger and thus become
slower (Fig. 21b). Changing the ID width would scale the
area with O (2^I) due to the ID counters instantiated in the
bypass multiplexer and the counters in the hit-miss unit. The
ID width has no influence on the critical path.
The LLC including the SRAM macros is characterized
in Fig. 21b. Compared to the area of the control logic alone
(Fig. 21a), the SRAM macros occupy 8 to 64 times more
area for a cache size of 64 to 1024 KiB. The delay of the
memory dominates the critical path of the design. Thus,
the area occupied for control logic is below 10 % already
at 128 KiB, and becomes marginal at larger sizes.
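The relation between the number of cache lines L and the cache size in this evaluation can be written down in a few lines of Python. The sketch assumes that L counts cache lines per way, so that the data capacity is the product of set associativity, L, blocks per line, and bytes per block; the function names are hypothetical.

```python
def llc_size_bytes(num_lines: int, ways: int = 4,
                   blocks_per_line: int = 16, bytes_per_block: int = 8) -> int:
    """Data capacity, assuming num_lines is the number of cache lines per way."""
    return ways * num_lines * blocks_per_line * bytes_per_block


def lines_for_size(size_bytes: int, ways: int = 4,
                   blocks_per_line: int = 16, bytes_per_block: int = 8) -> int:
    """Number of cache lines per way needed for a given data capacity."""
    return size_bytes // (ways * blocks_per_line * bytes_per_block)


KIB = 1024
for size_kib in (64, 128, 256, 512, 1024):
    print(size_kib, "KiB ->", lines_for_size(size_kib * KIB), "lines per way")
```

Under this assumption, the evaluated range of 64 to 1024 KiB corresponds to 128 to 2048 cache lines per way.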
Component            Critical Path                 Area
Multiplexer          O (log S)                     O (S)
Demultiplexer        O (M + I)                     O (M + 2^I)
Crossbar             O (M + I)                     O (MS + 2^I S)
Crosspoint           O (M + I)                     O (M + 2^I)
ID Remapper          O (log I + log U + log T)     O (U(I + log T + log U))
ID Serializer        O (log UM + log T)            O (UM + T)
Data Upsizer         O (R log (DW/DN))             O (R DW DN)
Data Downsizer       O (log (DW/DN))               O (DW DN)
DMA Engine           O (log D)                     O (D)
Simplex Mem. Ctrl.   O (1)                         O (D)
Duplex Mem. Ctrl.    O (log D + log B + I)         O (D + B + 2^I)
Last Level Cache     O (1)                         O (2^I)
Legend: M = number of master ports; S = number of slave ports;
D = data width; DW = data width of the wide interface; DN = data
width of the narrow interface; I = ID width; U = concurrent unique
IDs; UM = concurrent unique IDs at the master port; T = concurrent
transactions per ID; B = number of memory master ports; R = number of
read upsizers.
Table 2. Overview of the complexity of our network components.
3.8 Complexity Overview and Summary
Table 2 gives an overview of the asymptotic complexity
of our network components. The critical path of all com-
ponents scales at worst linearly in their parameters, for
most components and parameters even logarithmically. As
the absolute results of the minimum clock period show,
the critical path of all components remains below 500 ps
post-topographical-synthesis in the large design space we
evaluated. This shows our components are suited for a wide
range of target frequencies and bandwidths, up to 2 GHz.
When even higher frequencies are required, most components
can be parametrized to have a critical path below 330 ps,
which would allow clocking them at up to 3 GHz. The area of
most components scales linearly in their parameters, with the
notable exception of the ID width, which causes an exponential
growth of the demultiplexer and all components containing it.
As the absolute results show, most components fit a few tens of
kGE when not pushed to the highest possible clock frequency
and parametrization. Even more complex components, such
as a 4 × 4 crossbar with up to 256 independent concurrent
transactions, fit in a modest 100 kGE when clocked at 2.5 GHz.
While component-wise results are important to show the
complexity and trade-offs in the microarchitecture of our on-
chip communication platform, they of course cannot show the
full picture of a real on-chip network. In the next section, we
analyze a full on-chip network.
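The conversion from minimum clock period to clock frequency and link bandwidth used in this summary is simple arithmetic, shown here as a short Python sketch. The function names are hypothetical, and the bandwidth is the peak of a single data channel in one direction, assuming one beat per cycle.

```python
def max_frequency_ghz(min_period_ps: float) -> float:
    """Maximum clock frequency in GHz for a given minimum clock period in ps."""
    return 1000.0 / min_period_ps


def channel_bandwidth_gbyte_s(data_width_bits: int, freq_ghz: float) -> float:
    """Peak bandwidth of one data channel in GB/s, assuming one beat per cycle."""
    return data_width_bits / 8 * freq_ghz


print(max_frequency_ghz(500))                 # critical path of 500 ps
print(round(max_frequency_ghz(330), 2))       # critical path of 330 ps
print(channel_bandwidth_gbyte_s(1024, 2.5))   # 1024 bit data width at 2.5 GHz
```

A 500 ps critical path corresponds to 2 GHz and a 330 ps critical path to slightly more than 3 GHz; a 1024 bit channel at 2.5 GHz peaks at 320 GB/s per direction.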
4 SYSTEM CASE STUDY
In this section, we design, implement, and evaluate the on-
chip networks of a many-core floating-point accelerator (§ 4.1),
using the modules presented in this paper. We use the tech-
nology and synthesis flow described in § 3 and additionally
implement the networks with Cadence Innovus 19.1.
4.1 Many-Core Floating-Point Accelerator
The Manticore architecture [7] is a state-of-the-art manycore
processor for high-performance, high-efficiency, data-parallel
floating-point computing. A Manticore accelerator consists
of four chiplet dies on an interposer. Each chiplet, shown
in Fig. 22, contains 1024 cores grouped in 128 clusters, one
8 GiB HBM2E controller and PHY, 27 MiB L2 memory, one
PCIe 5.0 x16 controller and PHY, and three die-to-die link
Figure 22. Conceptual floorplan of one Manticore [7] chiplet die.
Figure 23. Manticore’s on-chip network. Each arrow represents a five
channel (see § 2) connection from a master port to a slave port. Fat arrows
mean 512 bit data width, thin arrows 64 bit. Numbers above arrows indicate
maximum transaction concurrency in the form unique IDs / transactions
per ID / total transactions per link.
(D2D) PHYs to the other chiplets. Each cluster contains eight
small 32-bit integer RISC-V cores, each controlling a large,
double-precision floating-point unit (FPU), and 128 KiB L1
memory organized in 32 SRAM banks. As primary means
for moving data into and out of L1, each cluster contains
two of our DMA engines (§ 2.5, one for reads and one for
writes), which are attached to the L1 memory and control a
512-bit-wide master port. DMA engines in other clusters can
access the L1 memory through an additional 512-bit-wide
slave port. Each cluster has a 64-bit master port to let its cores
access external memory and a 64-bit slave port to let cores
in other clusters access its L1 memory. Four clusters form an
L1 quadrant, four L1 quadrants form an L2 quadrant, four
L2 quadrants form an L3 quadrant, and two L3 quadrants
form a chiplet. Manticore has been introduced in [7] without
disclosing its on-chip network. In this subsection, we describe
design and implementation of the network.
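The quadrant hierarchy described above fixes the number of clusters, cores, and network instances per chiplet. The following Python sketch recomputes these counts from the fan-out at each level, assuming one network instance per quadrant at each level; the variable names are illustrative.

```python
# Fan-out at each level of the hierarchy, from the chiplet downwards.
L3_PER_CHIPLET = 2
L2_PER_L3 = 4
L1_PER_L2 = 4
CLUSTERS_PER_L1 = 4
CORES_PER_CLUSTER = 8

l3_quadrants = L3_PER_CHIPLET
l2_quadrants = l3_quadrants * L2_PER_L3
l1_quadrants = l2_quadrants * L1_PER_L2
clusters = l1_quadrants * CLUSTERS_PER_L1
cores = clusters * CORES_PER_CLUSTER

print(l1_quadrants, l2_quadrants, l3_quadrants)  # quadrants per level
print(clusters, cores)                           # clusters and cores per chiplet
```

The result is 32 L1, 8 L2, and 2 L3 quadrants with 128 clusters and 1024 cores per chiplet, consistent with the chiplet description.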
4.1.1 Network Design
The on-chip network is designed with four main goals:
(G1) High bandwidth between units within the same quadrant
for effective local data sharing. (G2) High bandwidth between
the chiplet-level I/Os (i.e., HBM2E, PCIe, D2D) and any
cluster for effective data input and output. (G3) Low latency
between any two cores for efficient concurrency. (G4) Minimal
interference between the wide bursts of the DMA engine and
the word-wise accesses of the cores for maximum network
utilization. The network, shown in Fig. 23, has the following
properties to meet these goals: (1) Physically separate net-
works for traffic by DMA engines and cores to meet (G4). (2)
Tree topology to meet (G2–3). (3) Fully-connected crossbars
Figure 24. Microarchitecture and dimensions of Manticore’s on-chip
network. (a) L1 network. (b) L3 network. Only select connections are
drawn for clarity.
within each quadrant to meet (G1). (4) Links with the same
width from the HBM2E controller all the way down to the
DMA engine in each cluster to meet (G2). The clock frequency
of the entire network is 1 GHz. The data width of the DMA
network is set to 512 bit, which corresponds to one of the four
ports into the HBM2E controller. Therefore, saturating the
full HBM2E bandwidth requires concurrent transactions from
only four DMA engines in different L2 quadrants. The data
width of the core network is set to 64 bit, which is native for
the load/store unit of a core.
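The bandwidth implied by these parameters can be checked with a few lines of Python. The sketch assumes one 512 bit beat per clock cycle on each of the four HBM2E controller ports and considers a single direction (read or write); the names are illustrative.

```python
CLOCK_GHZ = 1.0
DMA_DATA_WIDTH_BITS = 512
HBM_PORTS = 4

# Peak bandwidth of one port in one direction, assuming one beat per cycle.
port_bw_gbyte_s = DMA_DATA_WIDTH_BITS / 8 * CLOCK_GHZ
# Peak bandwidth when all four ports are saturated by different DMA engines.
hbm_bw_gbyte_s = HBM_PORTS * port_bw_gbyte_s

print(port_bw_gbyte_s, hbm_bw_gbyte_s)
```

Each port thus peaks at 64 GB/s per direction, and four DMA engines that each saturate one port reach 256 GB/s per direction.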
The concurrency of transactions is another important
aspect of the network design. The numbers above an arrow
in Fig. 23 define the number of concurrent unique IDs,
transactions per ID, and total transactions per link (reads
and writes separate), respectively. ID width converters (§ 2.3)
are placed in the network where required to reduce the ID
width. Starting at the cluster, each DMA engine is in-order
(thus has a single ID) and can have up to 8 outstanding
transactions➊. Transactions by the 8 cores in the cluster are
independent, and each core can have at most 1 outstanding
transaction➋. The L1 network maintains the independence of
all DMA and core transactions, and the number of unique IDs
expands accordingly, as do the total transactions➌. The L2
network maintains the independence of DMA transactions but
limits their total below the sum of the incoming ports➍. The
reason is that the maximum roundtrip latency at this level is 60
cycles, so a higher number of concurrent transactions would
not increase bandwidth or utilization. The concurrency on
downlinks is generally constrained to match that of an uplink
into the lower network, e.g., ➎. This means each network
level can handle transactions from the uplink slave port in the
same way as transactions from downlink slave ports.
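The argument that more concurrent transactions stop helping beyond a certain point follows from the bandwidth-delay product. The Python sketch below estimates how many outstanding transactions keep a data channel busy for one round trip; the burst lengths are hypothetical values chosen for illustration, and the model assumes back-to-back beats with no other stalls.

```python
import math


def transactions_to_cover_latency(round_trip_cycles: int, beats_per_txn: int) -> int:
    """Outstanding transactions needed so that data beats cover one round trip."""
    return math.ceil(round_trip_cycles / beats_per_txn)


ROUND_TRIP_CYCLES = 60  # maximum round-trip latency at the L2 level
for beats in (1, 8, 16):  # hypothetical burst lengths in beats
    print(beats, transactions_to_cover_latency(ROUND_TRIP_CYCLES, beats))
```

Once the number of outstanding transactions reaches this value, additional concurrency no longer increases bandwidth in this simple model.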
4.1.2 Network Microarchitecture and Implementation
The microarchitecture and physical dimensions of one L1 and
L3 network are shown in Fig. 24. (The L2 network is very
similar to the L1 and omitted for brevity.) For the L1 network,
the downlink ports are in the left third of each cluster, close
to the cluster’s memory and internal interconnect, and the
uplink port is in the middle of the narrow side. For the L2
and L3 network, the downlink ports are at one quarter of the
wide side (determined by the lower network level), and the
uplink port is in the middle of the narrow side. To isolate the
timing closure of individual network levels, we cut all paths
Unit                       L1       L2       L3       Entire Network
Clock Frequency [GHz]      1.00     1.00     1.00     1.00
Routing Density* [%]       59.6     49.6     45.7     —
Area per Inst. [mm²]       0.41     1.40     2.99     30.43
# Insts. per Chiplet       32       8        2        1
Area per Chiplet [mm²]     13.21    11.23    5.98     30.43
Area per Chiplet† [%]      9.05     7.69     4.10     20.84
Area per Core+FPU [µm²]    12 900   10 970   5840     29 710
*Routing density along wider dimension (i.e., where routing is denser).
†Relative to chiplet area without I/O controllers and PHYs.
Table 3. Implementation results of Manticore’s on-chip network.
at the uplink ports➏. Correspondingly, all downlink inputs
are driven by FFs and all downlink outputs drive FFs. There
are two central challenges in the physical implementation of
the networks. First, the extremely wide aspect ratio: while one
wide dimension is determined by the side length of the quadrant, the
other dimension should be as narrow as possible to minimize
the area of the network. Second, routing and wire congestion:
each of the five interfaces has ca. 3300 separate wires, and each
network level is fully connected. Routing the wires of a single
interface horizontally occupies a height of ca. 100 µm on all
three metal layers available for inter-cell horizontal routing.
To mitigate congestion, the crossbar, with its fanout of wires
between demultiplexers and multiplexers, should be placed
and routed as compactly as possible➐. The crossbar nonetheless
incurs a significant combinational delay. To accommodate this
despite the long distances due to the extreme aspect ratio, we
insert registers around the crossbar➑. In contrast to pipelining
inside the crossbar, far fewer registers are required, which
again benefits the compact layout of the crossbar. In the L3
network (Fig. 24b), pairs of L2 networks share one port on the
HBM2E controller. Cores on the narrow network access the wide HBM
ports through data width converters. Because the HBM2E
controller is located on the left side of the chiplet, the left
L3 network simply feeds two connections from the right L3
network through pipeline registers to the controller ➒. ID
remappers are used to reduce ID widths according to the
concurrency design described in § 4.1.1➓.
The implementation results of Manticore’s on-chip net-
work are listed in Table 3. We have been able to close timing
and DRC of the entire network after place and route at 1 GHz.
For this, we first loosely constrained the narrow dimension to
determine the required number of pipeline registers around
the crossbar, then we reduced the narrow dimension until the
design could no longer be routed without failing timing or
DRC. As the high routing densities show, the area of each
network level is mainly determined by the available routing
channels. The total area of the network is 30.43 mm², which
is 20.84 % of the chiplet area without I/O controllers and
PHYs. Put differently, Manticore’s entire high-bandwidth, low-latency,
hierarchical on-chip network requires 29 710 µm² per
core. This is merely about the same area as one core (without
any cache) and FPU, which are highly area-efficient.
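The per-core figures in Table 3 can be reproduced from the area per chiplet of each network level and the 1024 cores per chiplet. The Python sketch below does this; rounding each level to the nearest 10 µm² is an assumption that happens to reproduce the tabulated values, and the names are illustrative.

```python
CORES_PER_CHIPLET = 1024
AREA_PER_CHIPLET_MM2 = {"L1": 13.21, "L2": 11.23, "L3": 5.98}  # from Table 3


def area_per_core_um2(area_mm2: float, cores: int = CORES_PER_CHIPLET) -> float:
    """Area per core in square micrometers (1 mm^2 = 1e6 um^2), rounded to 10 um^2."""
    return round(area_mm2 * 1e6 / cores, -1)


per_level = {level: area_per_core_um2(a) for level, a in AREA_PER_CHIPLET_MM2.items()}
total_per_core = sum(per_level.values())

print(per_level)
print(total_per_core)
```

The three levels contribute 12 900, 10 970, and 5840 µm² per core, which sum to the 29 710 µm² reported for the entire network.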
4.1.3 Network Performance
We use cycle-accurate register-transfer level (RTL) simulation
to assess the performance of Manticore’s on-chip network.
As we are not interested in cluster-internal data movement
and simulating 1024 cores and FPUs at this accuracy is
prohibitively slow, we first extract the DMA transactions,
cluster-internal computations, and their interdependencies
from an RTL simulation of an application in an isolated cluster.
We then substitute each cluster by its DMA engine in the
simulation of the entire network and use the extracted patterns
to inject DMA traffic. We characterize the performance of two
fundamental kernels, a convolutional neural network (NN)
layer and a fully-connected NN layer, which together amount
to 95 to 99 % of the floating-point operations (FLOPs) in MLT.

                                   Convolution            Matrix Mul.
Metric [Unit]                 base  chunked  pipe’d     base   pipe’d
Op. Intensity [dpflop/B]       2.2     15.9    15.9      2.7      2.7
HBM BW [GB/s]                 262*      103       6     262*      235
L3 Agg. BW [GB/s]              262      103       6      262      235
L2 Agg. BW [GB/s]              262      103      26      262      235
L1 Agg. BW [GB/s]              262      103     103      262      572
Performance [Gdpflop/s]        571    1638†   1638†     1383    1638†

*Of which 256 GB/s are on the read channel, which is its maximum.
†This corresponds to an FPU utilization of ca. 80 %, which is the maximum
all 8 FPUs in a cluster can sustain for real kernels.
Table 4. Performance of Manticore for different NN layer implementations.

In a convolutional NN layer, a set of input layers (matrices)
is convolved with a filter kernel into a set of output layers.
Each output layer consists of data from all input layers,
and each pair of output and input layers has its own filter
kernel. We use 128 input and output layers, each with 32
rows and columns, and a 3 × 3 kernel. In the baseline
implementation, each cluster computes an entire output layer.
As all input layers do not fit into the local memory of a
cluster, it loads chunks of input layers. Thus, each cluster
needs to load each input layer once per output layer. As
the first result column in Table 4 shows, this implies a very
low operational intensity and entails that performance is
bound by the HBM memory bandwidth. One strategy to
alleviate this is to let each cluster compute a tile in a chunk of
multiple output matrices. As the input layers can be reused
for multiple output layers, this reduces the amount of data
transferred per computation. For a chunk size of 8 (second
column), the operational intensity is sufficiently high that the
performance becomes compute-bound. To save even more
off-chip bandwidth (e.g., for energy efficiency or if no HBM is
available) without sacrificing performance, the hierarchical
network can be used to form a processing pipeline where
clusters obtain their input matrix from another cluster instead
of off-chip memory. The third column shows that when all
16 clusters within one L2 quadrant form such a pipeline,
the off-chip memory traffic can be massively reduced while
performance is maintained. Traffic is also reduced on the L2
and L3 networks because data, once it is in the local memory
of a cluster, is mainly transferred through the L1 networks.
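The operational intensities in the first two columns of Table 4 can be approximated with a simple traffic model. The Python sketch below is an illustration, not the authors' methodology. It assumes double-precision (8 B) elements, two FLOPs per multiply-accumulate, padded convolution so that outputs keep the 32 × 32 size, and that per chunk of output layers a cluster loads every input layer once, loads one filter kernel per input/output pair, and writes each output layer once. The function names are hypothetical; the bandwidth and peak values are copied from Table 4.

```python
# Roofline-style sketch of the convolutional layer under an assumed traffic model.
N_IN = 128            # number of input layers (from the text)
ROWS = COLS = 32      # layer dimensions (from the text)
K = 3                 # filter kernel is K x K (from the text)
BYTES = 8             # assumption: double-precision elements

HBM_BW_GBPS = 262     # HBM bandwidth observed when memory-bound (Table 4)
PEAK_GDPFLOPS = 1638  # sustained compute peak (Table 4)


def operational_intensity(chunk: int) -> float:
    """FLOPs per byte when one cluster computes `chunk` output layers together."""
    macs = chunk * N_IN * ROWS * COLS * K * K
    flops = 2 * macs                              # one multiply and one add per MAC
    input_bytes = N_IN * ROWS * COLS * BYTES      # loaded once, reused across the chunk
    weight_bytes = chunk * N_IN * K * K * BYTES   # one kernel per input/output pair
    output_bytes = chunk * ROWS * COLS * BYTES    # each output layer written once
    return flops / (input_bytes + weight_bytes + output_bytes)


def is_compute_bound(chunk: int) -> bool:
    """True if the bandwidth roof at this intensity reaches the compute peak."""
    return operational_intensity(chunk) * HBM_BW_GBPS >= PEAK_GDPFLOPS


for chunk in (1, 8):
    oi = operational_intensity(chunk)
    bound = "compute" if is_compute_bound(chunk) else "memory"
    print(f"chunk={chunk}: {oi:.1f} dpflop/B, {bound}-bound")
```

Under these assumptions, the model yields about 2.2 dpflop/B for the baseline and about 15.9 dpflop/B for a chunk size of 8, in line with Table 4, and classifies the former as memory-bound and the latter as compute-bound.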
In a fully-connected layer, each cluster computes a tile of
the output matrix in a matrix-matrix multiplication. The tile
size is chosen so that two input matrix tiles and the output
matrix tile together fit twice (for double buffering) into local
memory. With 128 KiB memory, the tile size for a Manticore
cluster is 52. Even though matrix multiplication is theoretically
compute-bound, tiling significantly reduces the operational
intensity. Thus, as the fourth column of Table 4 shows, the
baseline implementation is memory-bound at the HBM. In
the baseline implementation, all clusters within a quadrant
simultaneously load the same tile of one input matrix from
HBM. The hierarchical network presents an opportunity to
reduce this bandwidth: The clusters within one L1 quadrant
can be arranged to form a pipeline, where tiles of the input
matrix rotate between clusters. As the last column shows, this
makes it possible to attain compute-bound peak performance.
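The tile size of 52 can be reproduced from the buffering constraint stated above. The sketch below assumes square tiles of double-precision (8 B) elements and that the full 128 KiB of local memory is available for tiles; the function name is hypothetical.

```python
import math


def max_square_tile(mem_bytes: int, elem_bytes: int = 8,
                    tiles: int = 3, buffers: int = 2) -> int:
    """Largest t such that tiles * buffers * t * t * elem_bytes <= mem_bytes.

    tiles=3 covers two input tiles plus one output tile;
    buffers=2 accounts for double buffering.
    """
    max_elems_per_tile = mem_bytes // (tiles * buffers * elem_bytes)
    return math.isqrt(max_elems_per_tile)


tile = max_square_tile(128 * 1024)
print(f"tile size: {tile}")
```

A tile size of 52 occupies 129 792 B, whereas 53 would need 134 832 B and exceed the 131 072 B available.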
5 RELATED WORKS
Network-on-chip (NoC) topologies, routing algorithms, flow
control schemes, and router architectures have been subject
to a vast amount of research (see [1], [8]–[10] for detailed
reviews). Important conclusions from this research are that
the optimal on-chip network topology highly depends on the
target application and computer architecture, and that routing
strategies and flow control schemes are intertwined with the
communication protocol, which all connected components
need to adhere to. Thus, we do not try to innovate in this field.
Rather, the modules in our platform make it possible to build an on-chip
network with arbitrary topology that adheres to a state-of-
the-art, industry-standard protocol, following the paradigm
put forward by application-specific NoC research efforts (see
[11] for an up-to-date survey). Additionally, our elementary
components make it possible to design custom network modules, should
our pre-configured modules not suffice.
Design space exploration and electronic design automation
(EDA) for on-chip networks is a research field in its own right.
For instance, xENoC [12] is a tool to generate synthesizable
RTL code from an XML specification of a network. HeMPS [13]
is a tool to generate RTL code of a multiprocessor system-on-
chip (MPSoC) including its network from a SystemC model.
Open-Scale [14] is similar to HeMPS but primarily targets
real-time systems and uses the HERMES [15] framework
to generate its NoC. Finally, optimization algorithms are
being employed to design on-chip networks (e.g., [16]). While
our platform is designed with design space exploration in
mind, we consider it an orthogonal problem to designing
and characterizing on-chip network components: our
components could be integrated into a design space exploration
framework, which could then generate on-chip networks
for heterogeneous SoCs that adhere to an industry-standard
communication protocol.
Non-coherent on-chip communication is central for
heterogeneous, accelerator-rich SoCs [17]. Protocols similar to AMBA
AXI5 [5], which our platform directly supports, are IBM’s
CoreConnect [18], Silicore’s Wishbone [19], Accellera’s Open
Core Protocol (OCP) [20], and SiFive’s TileLink Uncached
Heavyweight (TL-UH) [21]. They all, like AXI, are royalty-free
standards. CoreConnect, Wishbone, and OCP provide a subset
of the features of AXI5, and while they have been used in
the past, they are nowadays not nearly as widely used as
AXI. TL-UH, like AXI5, supports burst transactions, multiple
outstanding transactions, and transaction reordering and uses
valid-ready flow control. TL-UH has stricter forward progress
requirements than AXI5, which our modules could also fulfill.
While the specifications define interfaces and protocols for
on-chip communication, they do not describe the architecture
of network modules implementing them; that is an important
contribution of our work. The OpenSoC Fabric [22] is an open-
source implementation of a custom non-coherent protocol,
with an interface to AXI-Lite in development. AXI-Lite does
not support bursts or transaction reordering and is therefore
not suited for high-performance communication. The ESP
project [23] provides an open-source implementation of a
2D-mesh NoC with coherent and non-coherent layers and a
custom protocol. In contrast, our platform is topology-agnostic
and adheres to an industry-standard protocol.
Commercial intellectual property (IP) offerings for AXI
exist from multiple vendors, e.g., Arm’s CoreLink Network
Interconnect IPs [24], Synopsys’ DesignWare IPs [25], and
Arteris’ FlexNoC [26], and they are used in many modern
SoCs. The architecture and performance of these IPs are not
public. To the best of our knowledge, our work is the first to
present the microarchitecture, complexity, and performance of
a state-of-the-art, industry-standard on-chip communication
protocol and to provide a free, open-source implementation
sufficiently mature for ASIC tapeouts (e.g., [7], [27]).
Cache-coherent on-chip communication protocols currently
in use include Intel’s UltraPath Interconnect [28],
AMD’s scalable data fabric [29], IBM’s Power9 on-chip
interconnect [30], AMBA AXI Coherency Extensions (ACE) [5],
AMBA5 Coherent Hub Interface (CHI) [31], and TileLink
Cached (TL-C) [21]. ACE and TL-C are extensions of AXI and
TL-UH, respectively. As such, our platform could be extended
for coherent communication by adding channels, transactions,
and properties defined by these specifications. The other
protocols are standalone specifications with very different
properties. For instance, we refer to [32] for an open-source
bridge for connecting to CHI from AXI. With such a bridge,
our platform can connect to a coherent system interconnect if
needed, possibly extending to multiple chips. Coherency in
on-chip networks has been studied extensively in research,
e.g., [33]–[35]. A prominent system example is SCORPIO [36],
where a coherent mesh NoC interconnects 36 homogeneous
cores on a die. Their work focuses on the NoC and router
architecture for a coherent homogeneous multi-core, while we
design an end-to-end non-coherent on-chip communication
platform suitable for heterogeneous many-cores.
Generators for cache-coherent on-chip networks have been
presented in multiple works: Open2C [37] contains a library
of components and controllers for coherent networks written
in Chisel. Like us, they present an LLC, which is separated
from a coherence directory. In their 512 KiB L2 cache, the
area overhead of control logic and buffers is 38 %, whereas an
identical parametrization of our LLC has only 3 % overhead.
The Rocket chip generator [38] constructs SoCs written in
Chisel, and the coherent NoC adheres to TL-C. OpenPiton [39]
generates tile-based manycore processors with a 2D mesh,
coherent NoC. One tile has an area of 1.17 mm2 when targeting
IBM’s 32 nm SOI process at 1 GHz. Of the tile area, 22.3 % are
occupied by 32 KiB of distributed L2 cache and directory
controller and 2.7 % by the 5 × 5 NoC router. Accounting
for one full technology node difference, the equivalent area
in GF22FDX would be ca. 660 kGE and 80 kGE for 32 KiB
L2 cache and NoC router, respectively. The control logic of
their L2 cache is ca. 3.3 times larger than that of our LLC,
which could be due to the cache directory. Their 5× 5 NoC
router (without any virtual channels) has about the same
size as a 5× 5 configuration of our crosspoint (with up to 16
reorderable IDs). Open2C and OpenPiton implement a custom
protocol, which complicates connectivity with third-party
components, whereas we adhere to an industry-dominant
protocol. The modules in our work are implemented in
synthesizable SystemVerilog, so they could be integrated into
a higher-level generator as well.
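The gate-equivalent figures for OpenPiton quoted above can be approximated as follows. This is a rough sketch under two assumptions that are not stated in this section: area halves for one full technology node, and one gate equivalent (GE) in GF22FDX occupies about 0.199 µm². Both constants are illustrative, and a different GE area shifts the results proportionally.

```python
# Rough conversion of OpenPiton's 32 nm tile areas into GF22FDX gate equivalents.
TILE_AREA_MM2 = 1.17   # OpenPiton tile area in 32 nm SOI (from the text)
L2_SHARE = 0.223       # share occupied by L2 cache and directory controller
ROUTER_SHARE = 0.027   # share occupied by the 5 x 5 NoC router

NODE_SCALING = 0.5     # assumption: area halves per full technology node
GE_AREA_UM2 = 0.199    # assumption: area of one gate equivalent in GF22FDX


def to_kge(area_mm2_32nm: float) -> float:
    """Scale a 32 nm area by one node and express it in kGE."""
    area_um2 = area_mm2_32nm * 1e6 * NODE_SCALING
    return area_um2 / GE_AREA_UM2 / 1e3


l2_kge = to_kge(TILE_AREA_MM2 * L2_SHARE)
router_kge = to_kge(TILE_AREA_MM2 * ROUTER_SHARE)

print(f"L2 cache and directory: ~{l2_kge:.0f} kGE")
print(f"NoC router:             ~{router_kge:.0f} kGE")
```

With these constants, the estimates land close to the ca. 660 kGE and 80 kGE stated above.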
6 CONCLUSION
We presented a high-performance non-coherent on-chip
communication platform that suits the needs of heterogeneous
many-core and accelerator-rich SoCs. The components of the
platform are not only topology-agnostic and parametrizable to
fit a wide design space but also include bridges and converters
to link subnetworks with different bandwidth and concurrency
properties. We characterized microarchitectural trade-offs and
timing/area characteristics and showed that our platform
can be used to build high-bandwidth end-to-end on-chip
networks with high degrees of concurrency. We used our
platform to design and implement a state-of-the-art 1024-
core MLT accelerator in a modern 22 nm technology, where
our communication fabric provides 32 TB/s cross-sectional
bandwidth at only 24 ns round-trip latency between any two
cores. Our platform adheres to an industry-standard, royalty-
free protocol, and its modules, written in SystemVerilog,
are available under a permissive open-source license at
https://github.com/pulp-platform/axi.
REFERENCES
[1] N. Jerger et al., On-Chip Networks: Second Edition. Morgan &
Claypool, 2017.
[2] Qualcomm Inc., “Snapdragon 865 5G mobile platform,” 2020.
[3] B. Wheeler, “Tomahawk 4 switch first to 25.6 Tbps,” Microprocessor
Report, 2019.
[4] R. Smith, “NVIDIA Ampere unleashed: NVIDIA announces new
GPU architecture, A100 GPU, and accelerator,” AnandTech, 2020.
[5] AMBA AXI and ACE Protocol Specification Issue F.b, Arm Ltd., 2017.
[6] E. G. Coffman et al., “System deadlocks,” ACM Comp. Surv., 1971.
[7] F. Zaruba et al., “Manticore: A 4096-core RISC-V chiplet architecture
for ultra-efficient floating-point computing,” in IEEE Hot Chips, Aug.
2020.
[8] S. Pasricha et al., On-Chip Communication Architectures: System on
Chip Interconnect. Elsevier Science, 2010.
[9] J. Flich et al., Designing Network On-Chip Architectures in the Nanoscale
Era. CRC Press, 2010.
[10] S. Kundu et al., Network-on-Chip: The Next Generation of System-on-
Chip Integration. CRC Press, 2014.
[11] A. Cilardo et al., “Design automation for application-specific on-chip
interconnects: A survey,” Integration, 2016.
[12] J. Joven et al., “xENoC - an experimental network-on-chip
environment for parallel distributed computing on NoC-based MPSoC
architectures,” in PDP, 2008.
[13] E. A. Carara et al., “HeMPS - a framework for NoC-based MPSoC
generation,” in IEEE ISCS, 2009.
[14] R. Busseuil et al., “Open-Scale: A scalable, open-source NoC-based
MPSoC for design space exploration,” in ReConFig, 2011.
[15] F. Moraes et al., “HERMES: an infrastructure for low area overhead
packet-switching networks on chip,” Integration, 2004.
[16] B. K. Joardar et al., “Learning-based application-agnostic 3D NoC
design for heterogeneous manycore systems,” IEEE TC, 2019.
[17] D. Giri et al., “Accelerators and coherence: An SoC perspective,”
IEEE Micro, 2018.
[18] CoreConnect Processor Local Bus Specification, IBM Inc., 2007.
[19] Wishbone B4 SoC Interconnection Architecture, Silicore Corp., 2010.
[20] Accellera Inc., Open Core Protocol Specification Release 3.0, 2013.
[21] SiFive TileLink Specification v1.8.0, SiFive Inc., 2019.
[22] F. Fatollahi-Fard et al., “OpenSoC Fabric: On-chip network generator,”
in IEEE ISPASS, 2016.
[23] D. Giri et al., “NoC-based support of heterogeneous cache-coherence
models for accelerators,” in IEEE/ACM NOCS, 2018.
[24] ARM CoreLink NIC-400 TRM, Revision G, Arm Ltd., 2016.
[25] Synopsys Inc., “DesignWare IP solutions for AMBA AXI 4,” 2018.
[26] J.-J. Lecler et al., “Application driven network-on-chip architecture
exploration and refinement for a complex SoC,” Design Automation
for Embedded Systems, Jun 2011.
[27] F. Zaruba et al., “The floating point trinity: A multi-modal approach
to extreme energy-efficiency and performance,” in IEEE ICECS, 2019.
[28] D. Mulnix, “Intel Xeon processor scalable family technical overview,”
Intel Corp., 2017.
[29] T. Burd et al., “Zeppelin: An SoC for multichip architectures,” IEEE
JSSC, 2019.
[30] S. K. Sadasivam et al., “IBM Power9 processor architecture,” IEEE
Micro, 2017.
[31] AMBA5 CHI Specification Issue D, Arm Ltd., 2019.
[32] M. Cavalcante et al., “Design of an open-source bridge between
non-coherent burst-based and coherent cache-line-based memory
systems,” in ACM CF, 2020.
[33] N. Eisley et al., “In-network cache coherence,” in IEEE/ACM MICRO,
2006.
[34] N. D. Enright Jerger et al., “Virtual tree coherence: Leveraging
regions and in-network multicast trees for scalable cache coherence,”
in IEEE/ACM MICRO, 2008.
[35] N. Agarwal et al., “In-network coherence filtering: Snoopy coherence
without broadcasts,” in IEEE/ACM MICRO, 2009.
[36] B. K. Daya et al., “SCORPIO: A 36-core research chip demonstrating
snoopy coherence on a scalable mesh NoC with in-network ordering,”
in ACM/IEEE ISCA, 2014.
[37] A. Butko et al., “Open2C: Open-source generator for exploration of
coherent cache memory subsystems,” in ACM MEMSYS, 2018.
[38] K. Asanović et al., “The Rocket chip generator,” EECS Department,
University of California, Berkeley, Tech. Rep., Apr 2016.
[39] J. Balkind et al., “OpenPiton: An open source manycore research
framework,” in ACM ASPLOS, 2016.
Andreas Kurth received his BSc and MSc degrees in
electrical engineering and information technology from
ETH Zurich in 2014 and 2017, respectively. He is currently
pursuing a PhD degree in the Digital Circuits and Systems
group of Prof. Benini. His research interests include the
architecture and programming of heterogeneous SoCs
and accelerator-rich computing systems.
Wolfgang Rönninger received his BSc and MSc degrees
in electrical engineering and information technology from
ETH Zurich in 2017 and 2019, respectively. He currently
works as a research assistant in the Digital Circuits and
Systems group of Prof. Benini. His research interests
include high-performance on-chip communication networks
and general-purpose memory hierarchies.
Thomas Benz received his BSc and MSc degrees in
electrical engineering and information technology from
ETH Zurich in 2018 and 2020, respectively. He is currently
pursuing a PhD degree in the Digital Circuits and Systems
group of Prof. Benini. His research interests include
energy-efficient high-performance computer architectures
and the design of ASICs.
Matheus Cavalcante received his MSc degree in
integrated electronic systems from the Grenoble Institute of
Technology (Phelma) in 2018. He is currently pursuing a
PhD degree in the Digital Circuits and Systems group
of Prof. Benini. His research interests include vector
processing and high-performance computer architectures.
Fabian Schuiki received his BSc and MSc degrees in
electrical engineering and information technology from
ETH Zurich in 2014 and 2017, respectively. He is currently
pursuing a PhD degree in the Digital Circuits and Systems
group of Prof. Benini. His research interests include
computer architecture, transprecision computing, as well
as near-memory and in-memory processing.
Florian Zaruba received his BSc degree from TU Wien
in 2014 and his MSc from the ETH Zurich in 2017. He is
currently pursuing a PhD degree in the Digital Circuits and
Systems group of Prof. Benini. His research interests
include design of VLSI circuits and high-performance
computer architectures.
Luca Benini (F’07) holds the chair of Digital Circuits
and Systems at ETH Zurich and is Full Professor at the
Università di Bologna. Dr. Benini’s research interests
are in energy-efficient computing systems design, from
embedded to high-performance. He has published more
than 1000 peer-reviewed papers and five books. He is
a Fellow of the ACM and a member of the Academia
Europaea. He is the recipient of the 2016 IEEE CAS Mac Van Valkenburg
award.
