Packetizing OCP Transactions in the MANGO Network-on-Chip by Bjerregaard, Tobias & Sparsø, Jens
Packetizing OCP Transactions in the MANGO Network-on-Chip
Bjerregaard, Tobias; Sparsø, Jens
Published in:
Proceedings of the 9th Euromicro Conference on Digital System Design, August
Link to article, DOI:
10.1109/DSD.2006.75
Publication date:
2006
Document Version
Publisher's PDF, also known as Version of record
Citation (APA):
Bjerregaard, T., & Sparsø, J. (2006). Packetizing OCP Transactions in the MANGO Network-on-Chip. In
Proceedings of the 9th Euromicro Conference on Digital System Design, August IEEE. DOI:
10.1109/DSD.2006.75
Packetizing OCP Transactions in the MANGO Network-on-Chip
Tobias Bjerregaard and Jens Sparsø
Informatics and Mathematical Modelling
Technical University of Denmark (DTU)
2800 Lyngby, Denmark
{tob, jsp}@imm.dtu.dk
Abstract
The scaling of CMOS technology causes a widening gap
between the performance of on-chip communication and
computation. This calls for a communication-centric design
ﬂow. The MANGO network-on-chip architecture enables
globally asynchronous locally synchronous (GALS) system-
on-chip design, while facilitating IP reuse by standard
socket access points. Two types of services are available:
connection-less best-effort routing and connection-oriented
guaranteed service (GS) routing. This paper presents the
core-centric programming model for establishing and using
GS connections in MANGO. We show how OCP transac-
tions are packetized and transmitted across the shared net-
work, and illustrate how this affects the end-to-end perfor-
mance. A high predictability of the latency of communica-
tion on shared links is shown in a MANGO-based demon-
strator system.
1. Introduction
While transistor speeds improve with each new CMOS
fabrication technology, wire speeds worsen. When scal-
ing wires, the resistance per mm increases. The capaci-
tance stays roughly constant, being mostly due to edge ef-
fects [12]. Hence the RC delay for a constant length wire
increases. With a projected processor speed of 40 GHz
in 2016, the latency cost of driving data 1 mm across a
chip, on broad, top-level, global wires, will be 32 clock cy-
cles [1]. While wire segmentation and pipelining can help,
ultimately the result is a widening performance gap between
communication and computation. Thus it has been evident
for some time now, that data trafﬁcking – not processing –
will be the performance bottleneck in future single-chip sys-
tems. This calls for a communication-centric design ﬂow.
Segmented interconnection networks, so-called
networks-on-chip (NoC), constitute a viable solution space
to the communication challenges of future system-on-chip
(SoC) designs [9][3][13]. While there are many approaches
to increasing the throughput in such networks (e.g. pipelining),
the fundamental drawback of scaling technologies
concerns the latency of communication. In previous works,
we have presented novel solutions relevant to the development
of a clockless NoC, MANGO (Message-passing
Asynchronous Network-on-chip providing Guaranteed
services over OCP interfaces) [4][5][7][6]. Of particular
novelty is the ALG (Asynchronous Latency Guarantees)
scheduling discipline employed on the network links. In
contrast to time division multiplexing (TDM) scheduling
schemes commonly used in NoCs [14][10], ALG provides
latency and bandwidth guarantees which are not inversely
dependent.
Figure 1. Core-centric view of a MANGO-based system.
In this paper, we present the use of guaranteed service
(GS) connections in MANGO from a system point-of-view.
We demonstrate this with simulations of a 0.13 μm prototype
and show end-to-end performance results. In Section 2 we
describe the core-centric view of MANGO, and explain the
basics of its layered communication strategy. In Section 3
we describe the encapsulation of OCP (Open Core Protocol [2])
transactions in packets, and the two types of routing
services available. Section 4 presents the service guarantees
of the ALG scheduling scheme, and in Section 5 we
put these in a system context, and relate them to experimental
results from the demonstrator system. Section 6 provides a
conclusion.
Proceedings of the 9th EUROMICRO Conference on Digital System Design (DSD'06)
0-7695-2609-8/06 $20.00  © 2006
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on November 20, 2009 at 09:24 from IEEE Xplore.  Restrictions apply. 
Table 1. OCP signal subset.
Signal Group        OCP Signal        Driver                    Function
Basic               Clk               Master IP / Slave IP      OCP clock
                    MAddr             Master IP / Target NA     transfer address
                    MCmd              Master IP / Target NA     transfer command (write/read)
                    MData             Master IP / Target NA     write data
                    MDataValid        Master IP / Target NA     write data valid
                    MRespAccept       Master IP / Target NA     master accepts response
                    SCmdAccept        Slave IP / Initiator NA   initiator accepts transfer
                    SData             Slave IP / Initiator NA   read data
                    SDataAccept       Slave IP / Initiator NA   initiator accepts write data
                    SResp             Slave IP / Initiator NA   transfer response
Burst extensions    MBurstLength      Master IP / Target NA     burst length
                    MBurstPrecise     Master IP / Target NA     given burst length is precise
                    MBurstSeq         Master IP / Target NA     address sequence of burst
                    MBurstSingleReq   Master IP / Target NA     single request / multiple data
                    MDataLast         Master IP / Target NA     last write data in burst
                    MReqLast          Master IP / Target NA     last request in burst
                    SRespLast         Slave IP / Initiator NA   last response in burst
Thread extensions   MConnID           Master IP                 connection identifier
                    MDataThreadID     Master IP / Target NA     write data thread identifier
                    MThreadID         Master IP / Target NA     request thread identifier
                    SThreadID         Slave IP / Initiator NA   response thread identifier
Sideband            SInterrupt        Slave IP / Initiator NA   slave interrupt
2. Core-Centric View
This section introduces MANGO from the point of view
of the communicating IP cores. As illustrated in Figure 1
a MANGO-based system is composed of master and slave
IP cores, the packet switched network itself, and network
adapters (NAs) through which the IP-cores connect to the
network. A layered communication strategy is adopted in
which read/write-style transactions are provided based on
the message-passing primitives of the underlying shared
network. Figure 2 illustrates the communication layers. The
transaction layer is deﬁned by the OCP speciﬁcation. The
network layer constitutes the encapsulation of OCP transactions
into packets by the NAs. The network uses wormhole
routing, packets being transmitted as sequences of flits (flow
control units) across the network links. This defines the link
layer. Such layering facilitates modularity in system design,
and portability of IP cores (IP reuse).
Figure 2. Layered communication stack.
The network is implemented using clockless circuit tech-
niques, and each IP core may be clocked independently.
The necessary synchronization to the local clock domain
is performed in the NAs. This facilitates a globally asyn-
chronous locally synchronous (GALS) system view [8][15],
further helping to enable a modular system composition.
The IP/NA interface conforms to the OCP speciﬁcation
and provides read/write-style transactions into a distributed
shared address space. In the presented prototype implemen-
tation the following OCP-transactions are supported: sin-
gle reads and writes, burst reads and writes, and the use of
threads, connections, and interrupts. Table 1 lists the subset
of OCP signals implemented. Notice how signals are driven
either by the master IP and the target NA (the target NA
replicating the OCP transaction initiated by the master), or
the slave IP and the initiator NA (the initiator NA replicat-
ing the response of the slave). The only exceptions are the
connection identiﬁer (MConnID) which is only used at the
master IP, for specifying a connection through the network
(further explained below), and the clock which is driven by
the IP cores.
The network provides two types of routing services for
establishing master to slave connectivity: best-effort (BE)
services for which no performance guarantees are pro-
vided, and connection-oriented guaranteed services (GS)
for which latency and bandwidth guarantees are made.
OCP-write transactions require only a forward path (re-
quest) from master IP to slave IP, while read transactions
require also a return path (response). Any combination of
BE and GS routing services is possible for the request and
response.
As illustrated in Figure 1 each NA in the prototype has 4
unidirectional input ports and 4 unidirectional output ports
connecting to the network. One ingoing and one outgoing
port are dedicated to BE packets and the remaining ports
are dedicated to GS connections. The choice of outgoing
port is part of the OCP-transaction and is indicated using
the connection identiﬁer signal, MConnID.
BE packets are source routed, based on routing infor-
mation in the packet header (further details in Section 3.2).
To implement this each NA includes a small routing table
which is indexed by the upper 8 bits of the OCP address,
and which provides the routing information. Thus up to 256
slave units are supported. The entire system may contain
more slave units, but at most 256 can be known to a
given NA at any given time. In an actual implementation the
table will be considerably smaller, as it constitutes a major part
of the NA area. It is implemented as a content addressable
memory and may be dynamically updated.
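The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the NA implementation: it assumes a 32-bit OCP address (the address width is not stated here), uses a plain dictionary in place of the content addressable memory, and the names `NaRoutingTable`, `program` and `lookup` are hypothetical.

```python
ADDR_BITS = 32                 # assumption: the width of MAddr is not given in the text
INDEX_BITS = 8                 # from the text: the upper 8 address bits select the slave
MAX_ENTRIES = 1 << INDEX_BITS  # at most 256 slaves known to one NA


class NaRoutingTable:
    """Toy model of the NA routing table (a dict stands in for the CAM)."""

    def __init__(self, capacity=MAX_ENTRIES):
        if not 0 < capacity <= MAX_ENTRIES:
            raise ValueError("capacity must be between 1 and 256")
        self.capacity = capacity
        self.entries = {}

    def program(self, slave_index, routing_path):
        """Add or update an entry; the table may be updated dynamically."""
        if not 0 <= slave_index < MAX_ENTRIES:
            raise ValueError("slave index must fit in 8 bits")
        is_new = slave_index not in self.entries
        if is_new and len(self.entries) >= self.capacity:
            raise RuntimeError("routing table full")
        self.entries[slave_index] = routing_path

    def lookup(self, maddr):
        """Return (routing_path, remaining_address_bits) for a BE transaction."""
        low_bits = ADDR_BITS - INDEX_BITS
        index = (maddr >> low_bits) & (MAX_ENTRIES - 1)
        return self.entries[index], maddr & ((1 << low_bits) - 1)
```

A table with a small `capacity` mimics the remark that a real implementation would hold far fewer than 256 entries; a lookup of an unprogrammed index raises `KeyError` in this sketch.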
For establishing GS connections, the links of the network
implement a number of independently buffered, logically
separate virtual channels (VCs). GS packets are streamed
from NA to NA along predeﬁned virtual circuits, i.e., a re-
served sequence of VCs through the network. Each router-
node in the network allows static connections from input
VCs to output VCs to be set up (input-to-output hops).
When using GS connections the upper 8 bits of the OCP
address are ignored, as the connection ID uniquely deﬁnes
the destination NA.
The routing tables in the NAs and the "VC hop tables" in
the individual router-nodes are mapped into the distributed
shared address space. Being addressable as IP slave cores,
they can be written to by any master in the system using BE
transactions. In a real application we envision that a central
network controller will be used to configure the network.
3. Message-Passing
The objective of a layered communication strategy is to
enable modularity at the system level. The underlying hard-
ware features are transparent to the system designer. In or-
der to optimally utilize these underlying resources however,
the system programmer should be aware of the basic mecha-
nisms. In the following we will explain the routing features
of MANGO, and by dissecting OCP transactions we will
show how the use of GS routing affects end-to-end perfor-
mance.
Figure 3. Packet formats for encapsulating
OCP transactions for network transmission.
3.1. Packet Formats
Figure 3 details the different types of packets generated
by the NAs. Requests are made by a master IP core while
responses are made by the slave IP cores to these requests.
Interrupts are generated by the slaves. The ﬁrst bit in a ﬂit is
the end-of-packet bit, which is high only in the last ﬂit of a
packet. The ﬁrst ﬂit of BE packets is the header which holds
the routing path. Transmitting on GS connections, there is
no header ﬂit as a connection uniquely deﬁnes the complete
path from master to slave (request) or slave to master (re-
sponse or interrupt).
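The framing rules above (end-of-packet bit set only in the last flit, a header flit only on BE packets) can be illustrated with a short Python sketch. It does not reproduce the field layout of Figure 3; flits are modelled as `(eop_bit, data)` tuples, and the function name `packetize` and its `be_header` parameter are hypothetical.

```python
def packetize(payload_flits, be_header=None):
    """Frame a packet as a list of (eop_bit, data) tuples.

    be_header: routing header for a BE packet, or None for a GS packet,
    which carries no header flit because the connection defines the path.
    """
    words = list(payload_flits)
    if be_header is not None:
        words.insert(0, be_header)      # BE only: header is the first flit
    if not words:
        raise ValueError("a packet needs at least one flit")
    last = len(words) - 1
    # The end-of-packet bit is high only in the last flit of the packet.
    return [(1 if i == last else 0, word) for i, word in enumerate(words)]
```

For example, `packetize([0x10])` yields a single flit with the end-of-packet bit set, while passing `be_header` adds one flit in front of the payload.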
3.2. Best-Eﬀort Routing
No guarantees apart from completion and correctness are
given for BE traffic. OCP commands are issued as BE transactions
by addressing connection 0 (setting MConnID = 0
on the OCP interface). Upon issuing an OCP transaction on
connection 0, the upper 8 bits of the OCP address field are
used to index the NA routing lookup table [4]. This table
maps the global address (8 MSBs of the OCP address field,
MAddr) to a routing path, as explained in Section 2.
Figure 4. A GS connection constitutes a sequence of reserved VC buffers (a virtual circuit).
Figure 5. A MANGO router.
A BE routing path is a hop-by-hop speciﬁcation of the
path through the network. For each hop, a router output
port is speciﬁed. The BE router module shown in Fig-
ure 5 routes the BE packets according to their header. At
each hop, the header is rotated, such that the MSBs always
specify the next hop. With 5x5 routers, each hop requires
3 bits of routing information. At the ﬁnal hop, specify-
ing port 0 (the local port) indicates that the destination has
been reached. After this follows the router programming bit
(router pgm bit); 0 means that the packet should be used to
program the VC hop tables in the router, 1 means it should
be routed to the local target NA. If the target NA is the des-
tination, yet another bit follows – the NA programming bit
(NA pgm bit) – indicating whether the packet is meant for
programming the NA itself (NA pgm bit = 0) or whether it
is meant as an OCP transaction for the slave IP core attached
to the NA (NA pgm bit = 1). The return path (response e.g.
for read transactions) is placed after the forward (request)
routing path. Finally, a high bit indicates that the response
packet should be routed to the initiator NA and not used to
program the router. Thus a complete routing path has the
following syntax:
BERoutingpath =
∗[fwd hop] :: router pgm bit :: NA pgm bit :: ∗[rtn hop] :: 1
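The path syntax and the per-hop header rotation can be sketched as follows. This is an illustration under assumptions: the header is modelled as a bit string of exactly the fields in the syntax (no padding to the flit width, which is not specified here), both hop lists are assumed to end with port 0, and the helper names `encode_be_path` and `route_one_hop` are hypothetical.

```python
HOP_BITS = 3  # from the text: 3 bits of routing information per hop


def encode_be_path(fwd_hops, router_pgm_bit, na_pgm_bit, rtn_hops):
    """Concatenate the fields in the order given by the routing path syntax."""
    for port in list(fwd_hops) + list(rtn_hops):
        if not 0 <= port <= 4:
            raise ValueError("5x5 routers have ports 0..4")
    for bit in (router_pgm_bit, na_pgm_bit):
        if bit not in (0, 1):
            raise ValueError("programming bits must be 0 or 1")
    fwd = "".join(format(p, "03b") for p in fwd_hops)
    rtn = "".join(format(p, "03b") for p in rtn_hops)
    # The trailing 1 marks the response as destined for the initiator NA.
    return fwd + str(router_pgm_bit) + str(na_pgm_bit) + rtn + "1"


def route_one_hop(header):
    """Read the next hop from the MSBs, then rotate the header by one hop."""
    port = int(header[:HOP_BITS], 2)
    rotated = header[HOP_BITS:] + header[:HOP_BITS]
    return port, rotated
```

Calling `route_one_hop` repeatedly on the result of `encode_be_path([2, 3, 0], 1, 1, [1, 0])` yields ports 2, 3 and then 0, the local port that marks the destination.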
The BE router could also be implemented such that routing
is relative. Instead of hops pointing specifically to
an output port, they could be specified as 'go left' or 'go
straight', relative to the input port. A forward path then
uniquely defines the return path, and the target NA can
extract the return path from the header of the incoming
packet. This would save header bits. The present scheme
requires more bits in the header flit, but it enables higher
levels of flexibility, as the forward path and the return path
can be different, and links in the network do not need to be
bi-directional.
3.3. Guaranteed Service Connections
As explained in Section 2, a GS connection is instan-
tiated by reserving a virtual circuit, a sequence of indepen-
dently buffered VCs through the network. This is illustrated
in Figure 4, where a virtual circuit from the initiator NA to
the target NA passes through 3 routers and 2 links. The
ﬁgure also indicates latencies in different segments of the
connection (Linitiator , Lhop1, etc.). These latencies will
be used when analyzing the performance of a connection,
in Sections 4 and 5. In order to establish a virtual circuit
two pointers must be programmed into each router on the
path. This is shown in Figure 5 (forward and backpressure
pointer tables), which illustrates a conceptual view of the
router architecture. At the VC buffers at the router outputs
a pointer forward on the path, to a VC buffer in the next
router, deﬁnes the data forwarding channel. This is the for-
ward pointer, which uniquely deﬁnes an output port in this
next router, and a VC on that port. Since ﬂow control is
handled on a hop-by-hop basis [6], a ﬂow control channel
backwards on the path, to the previous VC buffer, must also
be established. This is the backpressure pointer. At each
input VC on the virtual circuit, the backpressure pointer se-
lects an output port and a VC on that port. For simplicity
Figure 5 only shows 4 VCs on each port, even though the
routers in the demonstrator MANGO implement 8 VCs.
The GS router is a non-blocking crossbar. Hence the
latency through the router, from the input port to the VC
buffers on the output, is predictable and bounded.
The pointer tables can be written to the routers by any
OCP master in the system, using OCP write commands on
connection 0 (the BE connection). To do this a router must
be addressed as a slave IP core: a routing path entry must
be made in the lookup table of the master’s initiator NA,
mapping from a global address to a routing path. In this
routing path, the router programming bit (see Section 3.2)
must be set to 0.
The presented system is constructed of 5x5 routers with
8 VCs on each bidirectional network port, and each pointer
needs 5 bits. Of these, 2 bits identify an output port.
Ports are designated 0 to 4, port 0 being the local port.
Since 2 bits cannot encode five ports, a pointer value of 0
maps to port 4. Routing back on a link is not allowed, so
pointing to the same port number as the input indicates the
local port (port 0, to which the NA, and hence an IP
core, is connected). The remaining 3 bits point to one of 8
VCs on each port. Of these, VCs 0 through 6 can be used for GS
connections, while the last (VC 7) is used for BE routing.
Thus, pointing a GS connection to VC 7 is illegal.
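One possible reading of these pointer rules is sketched below in Python. The bit layout is an assumption (port field in the upper 2 bits, VC in the lower 3 bits; the text does not give the ordering), and `decode_pointer` is a hypothetical helper, not part of the MANGO design.

```python
PORT_BITS = 2
VC_BITS = 3
BE_VC = 7       # from the text: VC 7 is reserved for BE routing
LOCAL_PORT = 0


def decode_pointer(pointer, input_port):
    """Decode a 5-bit pointer into (output_port, vc) for a given input port."""
    if not 0 <= pointer < (1 << (PORT_BITS + VC_BITS)):
        raise ValueError("pointer must fit in 5 bits")
    port_field = pointer >> VC_BITS        # assumed layout: port in the upper bits
    vc = pointer & ((1 << VC_BITS) - 1)
    # A port field of 0 maps to port 4, since 2 bits cannot hold the value 4.
    port = 4 if port_field == 0 else port_field
    # Routing back on a link is not allowed, so the input's own port number
    # is reused to mean the local port.
    if port == input_port:
        port = LOCAL_PORT
    if vc == BE_VC:
        raise ValueError("a GS connection must not point to VC 7")
    return port, vc
```

For example, `decode_pointer(0b10101, input_port=2)` names port 2 on a flit that arrived on port 2, so it resolves to the local port with VC 5.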
4. Service Guarantees
Each VC on a link is associated with a certain hop-
guarantee. This is the service guarantee – in terms of la-
tency and bandwidth – in moving ﬂits from the given VC
buffer, to a VC buffer on an output in the next router. The
end-to-end guarantee on a connection of X hops, Lend2end
being latency and BWend2end being bandwidth, is deter-
mined as:
$$L_{end2end} = L_{hop1} + L_{hop2} + \dots + L_{hopX}$$
$$BW_{end2end} = \min(BW_{hop1}, BW_{hop2}, \dots, BW_{hopX})$$
The latency is the sum of the per-hop latencies. The la-
tency of the ﬁrst hop is the time it takes to access a virtual
circuit in the ﬁrst router, while the latencies of the follow-
ing hops are the link latencies. The bandwidth guarantee is
determined by the bottleneck of the path.
The switching of GS ﬂits, inside the routers, is
congestion-free, hence apart from a constant element
(tlink), the hop-guarantees of the connection are determined
by the link access guarantee (explained further below).
In the presented prototype we use so-called ALG (Asynchronous
Latency Guarantee) scheduling [7]. While time
division multiplexing (TDM) schemes, commonly used for
guaranteeing bandwidth on shared links, result in an inverse
dependency between latency and bandwidth (the lower the
latency required, the more bandwidth must be reserved), the
ALG scheduling facilitates a high degree of decoupling of
latency and bandwidth guarantees. The access to the link
of individual VCs is prioritized. This prioritization together
with a special access scheme facilitates a latency guaran-
tee which is linearly proportional to the priority of the given
VC. Hence a very low latency guarantee can be given, with-
out the need to also reserve a large portion of the available
bandwidth.
In the following, the 8 VCs implemented on each link
are denoted Q ∈ {0, 1, . . . , 7}. Since the network is clock-
less, we specify latency and bandwidth guarantees in terms
of the time unit ﬂit-time (tflit), which is the cycle time of
transmitting one ﬂit on a link. This corresponds to a clock
cycle in a synchronous network. In MANGO, tflit depends
on the actual link implementation: the link encoding and
pipelining, the flit width, etc. We use delay-insensitive
dual-rail encoding [16] in order to improve timing robustness in
the system, employ 2-stage pipelining on the links, and use
flits that are 32 bits wide. This configuration results in
tflit = 3.6 ns in the worst-case process corner.
The hop-guarantees for VC Q are [7]:
$$L_{hop} = (Q + 1) \cdot t_{flit} + t_{link}$$
$$BW_{hop} = \frac{1}{t_{flit} \cdot (N + Q - 1)}$$
Here, N is the total number of VCs on a link and tlink is
the forwarding latency of a ﬂit from one VC buffer, across
the link, through the next router, to the next VC buffer on
the connection. VC control [5] ensures that a ﬂit can only
gain access to a link, if the target VC buffer is free, hence
once access is granted, the ﬂit will experience no conges-
tion. Therefore tlink is constant [6]. In the demonstrator
network, tlink = 7.9 ns. This includes the latency of merg-
ing ﬂits from different VCs onto the link, delay insensitive
(DI) encoding, the link pipeline latency, DI decoding, the
delay of the GS crossbar in the router, and the delay of the
VC control circuits and the VC buffers. The latency guar-
antee is given on a ﬂit by ﬂit basis. The time separation be-
tween two consecutive ﬂits determines the bandwidth guar-
antee. If there are a large number of VCs on a link (large
N ) this time may become large, since it is dependent on N .
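The hop-guarantee formulas and the end-to-end composition from the start of this section can be evaluated with a small Python sketch. It uses the tflit and tlink values quoted above and the bandwidth expression exactly as printed; the function names are hypothetical.

```python
T_FLIT_NS = 3.6   # from the text: flit-time in the worst-case process corner
T_LINK_NS = 7.9   # from the text: constant forwarding latency per link
N_VCS = 8         # from the text: VCs per link


def hop_latency_ns(q, t_flit=T_FLIT_NS, t_link=T_LINK_NS):
    """Latency guarantee of one hop on VC q: (Q + 1) * tflit + tlink."""
    return (q + 1) * t_flit + t_link


def hop_bandwidth_flits_per_ns(q, n=N_VCS, t_flit=T_FLIT_NS):
    """Bandwidth guarantee of one hop on VC q, as printed: 1 / (tflit * (N + Q - 1))."""
    return 1.0 / (t_flit * (n + q - 1))


def end_to_end(hop_latencies, hop_bandwidths):
    """Latencies add up over the hops; the slowest hop sets the bandwidth."""
    return sum(hop_latencies), min(hop_bandwidths)
```

With these values, the highest-priority VC (Q = 0) has a hop latency guarantee of 11.5 ns, and each step down in priority adds one flit-time of 3.6 ns.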
Looking at the packet formats in Figure 3, it is seen how
non-burst reads, on GS connections, require only a single
ﬂit, likewise for non-burst responses and interrupts. This
makes them particularly suitable for exploiting the ALG la-
tency guarantees. This will become more clear in Section 5
where we dissect the end-to-end latency of transmitting a
packet, into its components.
In the MANGO prototype presented in this work, the
links are pipelined. In clocked networks, pipelining has the
side effect of increasing the forward latency by one clock
cycle per pipeline stage. In clockless pipelines however, the
forward latency is only part of the cycle time. Even though
we have placed two pipeline stages between each pair of
routers, the extra pipeline stages increase the forward latency
by only 600 ps. This is not much more than the latency
penalty of wire segmentation.
5. Demonstrator and Results
Our test system is composed of three routers, as shown
in Figure 4. In order to illustrate the GALS capabilities of
MANGO, the master and slave cores are run at different fre-
quencies, 250 MHz and 333 MHz respectively. The master
could be a microprocessor, while the slave could be a shared
memory. Results are based on simulations of a 0.13 μm
standard cell implementation, using worst-case process cor-
ner timing parameters. In the following, we ﬁrst derive con-
nection latency bounds analytically. The latency bound of a
connection can be decomposed into local components. We
ﬁnd actual values for these components by performing gate-
level simulations. We then calculate the latency bound on
two given connections, and ﬁnally we verify these analyti-
cal results by running gate-level simulations of the complete
test system.
The end-to-end latency Lconn of using a GS connection
for OCP transactions, can be broken into components:
$$L_{conn} = L_{initiator} + L_{fwd} + L_{congestion} + L_{serialization} + L_{target}$$
Some of these can be identiﬁed in Figure 4. In the fol-
lowing each component is explained.
Linitiator is the latency in the initiator NA (the master
core’s NA). It is one OCP clock cycle plus some forward
latency in the clockless part of the NA [4].
Lfwd is the forward latency of a ﬂit, in an unloaded net-
work. It is tlink for each link traversed, plus a latency for
engaging the virtual circuit from the initiator NA, tengage .
Lcongestion is latency due to stalling in the network. It is
the time that a ﬂit is waiting in VC buffers, for access to a
link. Its value depends on the link access arbitration.
Lserialization is the latency penalty due to serialization
of the packetized OCP transactions into flits. This is dictated
by the bandwidth guarantee of the connection. As
explained in [7], the latency guarantee of ALG is conditional
on a time separation between flits on the
VC. This separation is the inverse of the bandwidth guarantee. Hence,
in a fully loaded network, the separation between flits will
be close to the maximum (one over the bandwidth guarantee),
while in an uncongested network it will be much lower, as
more bandwidth than guaranteed is available.
Ltarget is the forward latency in the target NA. Resyn-
chronization, from the clockless network to the clock of the
IP core, takes one OCP clock cycle, while there is also half
a cycle latency in the clocked part of the NA. In addition,
there is some latency in the clockless part, plus an unknown
latency of up to one OCP cycle, due to the uncertainty of the
arrival time of the last ﬂit. If this occurs immediately after
the local clock tick, the packet will need to wait a complete
clock cycle before resynchronization can begin.
By simulating a transaction in an unloaded network, we
can measure Linitiator , Lfwd and Ltarget. Lserialization is
the difference between the arrival time of the ﬁrst ﬂit and the
last (zero for single-ﬂit packets, such as a GS read request).
In an unloaded network, Lcongestion is zero.
Two connections between the master and the slave are
tested: conn1 and conn2. The connections consist of three
segments: one from the initiator NA to the ﬁrst VC in the
virtual circuit (engage time, tengage), and one across each
of the two links in the path. The hop from the initiator NA to
the ﬁrst VC does not require access to a shared link, hence
its latency is constant. For conn1, low latency VCs have
been reserved on the two links that the connection traverses:
priority 0 (highest priority) on both links. For conn2 VCs
of priority 3 and 6, i.e. VCs which provide worse latency
guarantees, have been reserved.
From gate-level netlist simulations with back-annotated
timing, we obtain the values: tengage = 3.2 ns, tflit =
3.6 ns and tlink = 7.9 ns. On the two links between mas-
ter and slave, having reserved VCs 0 and 0 for conn1, and
VCs 3 and 6 for conn2, the theoretical latency guarantees,
of transmitting a ﬂit on the virtual circuits, are thus:
$$L_{circuit1} = t_{engage} + (1 + 1) \cdot t_{flit} + 2 \cdot t_{link} = 26.2\ \mathrm{ns}$$
$$L_{circuit2} = t_{engage} + (4 + 7) \cdot t_{flit} + 2 \cdot t_{link} = 58.6\ \mathrm{ns}$$
Note that Lcircuit = Lfwd+Lcongestion,max. The worst
case NA latency occurs when the synchronization clock is
just missed in the target NA. The worst case serialization
penalty for a write (2 ﬂits) is one over the bandwidth guar-
antee of the connection.
$$L_{serialization,conn1} = t_{flit} \cdot (8 + 1 - 1) = 28.8\ \mathrm{ns}$$
$$L_{serialization,conn2} = t_{flit} \cdot (8 + 7 - 1) = 50.4\ \mathrm{ns}$$
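The circuit latency and serialization figures above can be re-derived with a short Python sketch. One point is an observation rather than something stated in the text: the worked serialization numbers insert 1 and 7 for VCs 0 and 6, i.e. the value Q + 1, so the sketch does the same. The function names are hypothetical.

```python
T_ENGAGE_NS = 3.2  # from the text
T_FLIT_NS = 3.6    # from the text
T_LINK_NS = 7.9    # from the text
N_VCS = 8


def circuit_latency_ns(vcs):
    """tengage plus, for each link, (Q + 1) * tflit of link access and tlink."""
    return T_ENGAGE_NS + sum((q + 1) * T_FLIT_NS + T_LINK_NS for q in vcs)


def worst_case_serialization_ns(vcs):
    """Worst-case separation between two flits, set by the lowest-priority VC.

    Assumption: the priority value used is Q + 1, matching the worked
    numbers (8 + 1 - 1) for VC 0 and (8 + 7 - 1) for VC 6.
    """
    priority = max(vcs) + 1
    return T_FLIT_NS * (N_VCS + priority - 1)


conn1_vcs = [0, 0]  # VCs reserved for conn1 on the two links
conn2_vcs = [3, 6]  # VCs reserved for conn2 on the two links
```

Evaluating both functions for `conn1_vcs` and `conn2_vcs` gives the four values quoted above (26.2 ns, 58.6 ns, 28.8 ns and 50.4 ns).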
The total end-to-end latency guarantee is now obtained by
summing the latency of the NAs, the circuit latency and
the serialization penalty:
Table 2. Examples of end-to-end latency of a write, on two connections at varying background loads.
                 conn1 / 0% load   conn1 / 100% load   conn2 / 0% load   conn2 / 100% load
Linitiator       4.9 ns            4.9 ns              4.9 ns            4.9 ns
Lfwd             17.8 ns           17.8 ns             18.6 ns           18.6 ns
Lcongestion      0 ns              12.3 ns             0 ns              35.0 ns
Lserialization   12.0 ns           21.3 ns             12.3 ns           25.0 ns
Ltarget          7.0 ns            7.0 ns              7.0 ns            7.0 ns
Lconn            41.7 ns           63.3 ns             42.8 ns           90.5 ns
$$L_{conn1,max} = L_{initiator} + L_{circuit1} + L_{serialization} + L_{target}$$
$$= 4.9\ \mathrm{ns} + 26.2\ \mathrm{ns} + 28.8\ \mathrm{ns} + 7.5\ \mathrm{ns} = 67.4\ \mathrm{ns}$$
$$L_{conn2,max} = 4.9\ \mathrm{ns} + 58.6\ \mathrm{ns} + 50.4\ \mathrm{ns} + 7.5\ \mathrm{ns} = 121.4\ \mathrm{ns}$$
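As a sanity check, the bound can be recomputed from its four terms and compared with the measured 100% load examples of Table 2. This is a minimal Python sketch; the function names are hypothetical, and the Table 2 entries are typical measurements, not worst cases.

```python
def conn_latency_bound_ns(l_initiator, l_circuit, l_serialization, l_target):
    """Sum of NA latencies, circuit latency and serialization penalty (ns)."""
    return l_initiator + l_circuit + l_serialization + l_target


def within_bound(measured_ns, bound_ns):
    """True if a measured end-to-end latency respects the analytical bound."""
    return measured_ns <= bound_ns


# Terms as listed in the derivation above (all in ns).
conn1_bound = conn_latency_bound_ns(4.9, 26.2, 28.8, 7.5)
conn2_bound = conn_latency_bound_ns(4.9, 58.6, 50.4, 7.5)

# Lconn at 100% background load, copied from Table 2.
measured_full_load = {"conn1": 63.3, "conn2": 90.5}
```

For conn1 the four terms sum to 67.4 ns, and both measured full-load examples stay below the bounds computed for their connections.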
In the following we compare these results with a TDM-
based NoC with 8 time slots, running at 300 MHz (corre-
sponding to a ﬂit-time of 3.33 ns), with one clock cycle for-
ward latency on the links and one in the routers. To access
the time slot allocated to the connection, ﬁrst a waiting time
of up to 8 clock cycles is endured. Adding to this the actual
latency through the network of 3 routers plus 2 links (5 cycles
total), one gets the circuit delay. The serialization penalty
constitutes another 8 cycles, waiting for the reserved time
slot to arrive again for the second ﬂit. The worst case la-
tency in the network is:
$$L_{circuit-TDM} = 8 \cdot t_{clk} + 5 \cdot t_{clk} = 43.3\ \mathrm{ns}$$
$$L_{serialization-TDM} = 8 \cdot t_{clk} = 26.6\ \mathrm{ns}$$
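The TDM estimate generalizes directly to other slot counts, which makes the scaling behaviour easy to explore. The sketch below follows the simple model used above (wait up to one full slot revolution, then one cycle per router and per link); the function name and its parameters are hypothetical.

```python
def tdm_worst_case_ns(slots, f_clk_mhz=300.0, routers=3, links=2):
    """Worst-case TDM circuit latency and serialization penalty in ns.

    Model from the text: up to `slots` cycles waiting for the reserved
    slot, one cycle per router and per link, and another `slots` cycles
    before the second flit of a two-flit packet can be sent.
    """
    t_clk_ns = 1000.0 / f_clk_mhz
    circuit = (slots + routers + links) * t_clk_ns
    serialization = slots * t_clk_ns
    return circuit, serialization


# Both terms grow linearly with the number of time slots.
for slots in (8, 16, 32):
    circuit, serialization = tdm_worst_case_ns(slots)
    print(f"{slots:2d} slots: circuit {circuit:.1f} ns, serialization {serialization:.1f} ns")
```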
This is comparable to the delays of the MANGO network
using the ALG scheduling scheme. However, as the
bandwidth granularity is increased (i.e. as more time slots
are used), the connection latency of TDM increases rapidly.
Using ALG instead, the latency is independent of the
bandwidth reservation. Hence, with regard to latency, TDM
is not a scalable solution. For practical examples of using
TDM for GS in NoC, please refer to
studies made for the Æthereal NoC in [11].
Figure 6 shows end-to-end latencies of issuing OCP
write transactions across the tested connections. First GS
connections are established between the master and slave IP
cores. Then a number of connections are set up between the
other network ports, and these are loaded with random traffic.
This provides a variable background traffic load, simulating
the master/slave subsystem being part of a bigger
system. Test results are sampled over sets of 1000 transactions
under different background loads. Results are taken
from gate-level netlist simulations with back-annotated timing.
During read transactions, the total latency is the sum of
the latency of the request connection, the response time of
the slave IP core and the latency of the response connection.
Note however (Figure 3) that a GS read request and
response both require only a single flit. Hence the serialization
penalty is zero, and the transaction latency guarantee
is truly decoupled from the bandwidth guarantee. Here,
write commands are sufficient for illustration purposes.
Figure 6. Latency distribution of OCP write transactions on a) connection 1 and b) connection 2.
The results illustrate how the latency does not exceed the
analytically guaranteed maximum, even under 100% net-
work load. This shows that a high degree of predictability
can be obtained in using the shared network, despite a shift
in the use of the network by other entities in the system
(different background loads), and despite its asynchronous
nature (clockless implementation).
Table 2 shows the latency breakdown of two write
transactions on each of the two connections, one in an unloaded
and one in a fully loaded network. Note that these are
measured examples from typical traffic scenarios, i.e. not
worst case. It is seen that the forward latency (Lfwd) is
not exactly the same on the two connections, as one might
expect. This is because the clockless implementation of
the link access circuits is not symmetrical, since some VCs
must be prioritized over others [7]. We see how the effect
of Lcongestion is much smaller on connection 1 which has
reserved low latency VCs. We also see that the serializa-
tion penalty is quite high (GS writes consist of two ﬂits),
up to 34% of the total latency on conn1 under 100% net-
work load. Note that since the tested connections only tra-
verse two links, the serialization penalty contributes a rela-
tively large portion of the total. This value is independent of
the number of hops on a connection, and will dwindle rela-
tively on longer connections. The latency due to congestion
is per-link on the other hand, hence it will accumulate on
longer connections. The beneﬁt of scheduling for hard la-
tency guarantees in the link access thus increases as the con-
nections get longer. Since ALG facilitates a very low per-
link latency, by reserving high priority VCs, it is possible to
maintain a low forward latency, even on long connections.
The ﬂexibility of the guarantees provided by ALG, com-
pared with those of traditional TDM-based schemes, makes
it beneﬁcal in scaling, heterogeneous systems.
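The scaling argument above can be illustrated with a minimal sketch. It assumes a simplified linear model in which forward and congestion latency are both incurred once per link while the serialization penalty is paid once per transaction; this model, the function names and all numeric values are illustrative assumptions, not figures from Table 2.

```python
def total_latency(links, per_link_forward, per_link_congestion, serialization):
    """Assumed linear model: per-link terms accumulate, serialization is paid once."""
    return links * (per_link_forward + per_link_congestion) + serialization


def serialization_share(links, per_link_forward, per_link_congestion, serialization):
    """Fraction of the total latency that is due to the serialization penalty."""
    total = total_latency(links, per_link_forward, per_link_congestion, serialization)
    return serialization / total


# Placeholder parameters in arbitrary time units (NOT measured values).
FORWARD, CONGESTION, SERIALIZATION = 5.0, 1.0, 4.0

for links in (2, 4, 8):
    share = serialization_share(links, FORWARD, CONGESTION, SERIALIZATION)
    congestion_total = links * CONGESTION
    print(f"{links} links: serialization share {share:.0%}, "
          f"accumulated congestion {congestion_total}")
```

Under these assumptions the serialization share falls as links are added while the accumulated congestion term grows, which is why keeping the per-link congestion latency low matters more on long connections.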
6. Conclusion
In this paper, we have demonstrated the programmability
and end-to-end performance of guaranteed service
connections in the MANGO NoC. We have described a
programming model for setting up connections in the network,
and shown how a tightly bounded, predictable latency
can be obtained when issuing OCP commands, despite different
levels of background traffic and despite the clockless
implementation of the network. Such predictability is important
in a modular system-on-chip design flow, as it facilitates
analytical verification and a decoupling of sub-systems. Also,
globally asynchronous locally synchronous (GALS) system
composition is enabled by the network adapters, which
synchronize the clocked OCP interfaces with the clockless
network. The work illustrates the advantages of the
architecture from a core-centric point of view.
References
[1] International technology roadmap for semiconductors (ITRS) 2003. Technical report, International Technology Roadmap for Semiconductors, 2003.
[2] Open Core Protocol Specification, Release 2.0. www.ocpip.org, 2003.
[3] L. Benini and G. De Micheli. Networks on chips: A new SoC paradigm. IEEE Computer, 35(1):70–78, January 2002.
[4] T. Bjerregaard, S. Mahadevan, R. G. Olsen, and J. Sparsø. An OCP compliant network adapter for GALS-based SoC design using the MANGO network-on-chip. In Proceedings of International Symposium on System-on-Chip 2005. IEEE, 2005.
[5] T. Bjerregaard and J. Sparsø. Virtual channel designs for guaranteeing bandwidth in asynchronous network-on-chip. In Proceedings of the IEEE Norchip Conference (NORCHIP 2004). IEEE, 2004.
[6] T. Bjerregaard and J. Sparsø. A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip. In Proceedings of Design, Automation and Testing in Europe Conference 2005 (DATE05). IEEE, 2005.
[7] T. Bjerregaard and J. Sparsø. A scheduling discipline for latency and bandwidth guarantees in asynchronous network-on-chip. In Proceedings of the 11th IEEE International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE, 2005.
[8] D. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, 1984.
[9] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In Proceedings of the 38th Design Automation Conference, pages 684–689, June 2001.
[10] J. Dielissen, A. Rădulescu, K. Goossens, and E. Rijpkema. Concepts and implementation of the Philips network-on-chip. In Proceedings of the international workshop on IP-Based SOC Design (IPSOC 2003), Nov. 2003.
[11] K. Goossens, J. Dielissen, O. P. Gangwal, S. G. Pestana, A. Rădulescu, and E. Rijpkema. A design flow for application-specific networks on chip with guaranteed performance to accelerate SoC design and verification. In Proceedings of the Design, Automation and Test in Europe Conference (DATE 2005), pages 1182–1187. IEEE, 2005.
[12] R. Ho, K. W. Mai, and M. A. Horowitz. The future of wires. Proceedings of the IEEE, 89(4):490–504, April 2001.
[13] A. Jantsch and H. Tenhunen. Networks on Chip. Kluwer Academic Publishers, 2003.
[14] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip. In Proceedings of the Design, Automation and Testing in Europe Conference (DATE 2004). IEEE, 2004.
[15] J. Muttersbach, T. Villiger, and W. Fichtner. Practical design of globally-asynchronous locally-synchronous systems. In Proceedings of the Sixth International Symposium on Advanced Research in Asynchronous Circuits and Systems, 2000 (ASYNC 2000), pages 52–59. IEEE Computer Society, April 2000.
[16] J. Sparsø and S. Furber. Principles of Asynchronous Circuit Design - a Systems Perspective. Kluwer Academic Publishers, Boston, 2001.
