Network-on-Chip Assembler Language (Version 0.1) by Zhonghai Lu & Axel Jantsch
June 4, 2003
TRITA-IMIT-LECS R 03:02
ISSN 1651-4661
ISRN KTH/IMIT/LECS/R-03/02–SE
Laboratory of Electronics and Computer Systems
Department of Microelectronics and Information Technology
Royal Institute of Technology
Stockholm, Sweden.
Abstract
Network-on-Chip (NoC) is regarded as a paradigm for tackling the design challenges of the
billion-transistor era. A NoC provides a reusable platform for integrating heterogeneous resources.
This report discusses application design on NoC. We propose the Network-on-Chip Assembler Language
(NoC-AL), which serves as an interface between NoC implementations and applications, much like the
instruction set of a traditional CPU. A central part of NoC-AL is its communication primitives. Every
instance of a NoC must come with a NoC assembler, which translates NoC-AL programs into a set of NoC
configuration files that are in turn handled by standard tools for hardware and software design. In
this report, we motivate our NoC application design approach and discuss NoC communications, in
particular channel communication. Moreover, we define two sets of basic communication primitives for
the two interprocess communication styles, message passing and shared memory. Furthermore, we discuss
language binding and layers for implementing the NoC communication primitives.
Keywords: Network-on-Chip (NoC), Communication Primitive, Channel Communication, Design
Methodology, System Design
Contents
1 Background
1.1 Introduction
1.1.1 The NoC Platform
1.1.2 The NoC Application Design Tasks
1.2 The NoC Assembler Language
1.2.1 The NoC Application Design and Compilation Flow
1.2.2 NoC-AL Communications in the OSI Seven-Layer Model
1.3 Related Work
2 NoC Communications
2.1 Communications on NoC
2.1.1 Types of Communications on NoC
2.1.2 Communication Styles
2.2 Communication Issues
2.2.1 Naming and Addressing Issue
2.2.2 Connection Issue
2.2.3 Synchronization Issue
2.3 Communication Channel Characteristics
3 NoC-AL Communication Primitives
3.1 Message Passing Primitives
3.2 Shared Memory Primitives
3.3 An Example of NoC-AL Program
4 NoC-AL Implementation Issues
4.1 Language Binding
4.1.1 Data Type Mappings
4.1.2 Expressions of Primitives
4.2 Layered Implementation
4.2.1 A Standard Interface
4.2.2 Implementation of Primitives in the OSI Layers
4.2.3 Implementation Layers
4.2.4 Channel Features and their Required Actions
5 Summary and Future Work
5.1 Summary
5.2 Future Work
List of Figures
1.1 A NoC of Mesh Structure with 9 Nodes
1.2 NoC Application Design Tasks
1.3 The NoC Application Design and Compilation Flow
1.4 NoC-AL Communications in OSI’s Seven-Layer Model
2.1 Addressing in the 2D Mesh
2.2 Processes Share Memories
2.3 Shared Memory as an API
2.4 Shared Memory Synchronization Mechanisms
2.5 Processors Share Memories: Dance-hall and Distributed Memory
2.6 Message Passing Scenario from Source to Destination
2.7 A Task Graph with Three Channels
2.8 The Channel Reliability Constellation
2.9 Reliable Session and Transport
2.10 Unreliable Session but Reliable Transport
2.11 Reliable Session but Unreliable Transport
2.12 Unreliable Session and Transport
3.1 Message Passing Procedure Between Processes
3.2 The Shared Memory Procedure
3.3 An Implementation Scheme of The Atomic Read-Modify-Write
3.4 An Example of NoC Application in Task Graph
4.1 A Standard Protocol Enables IP Reuse
4.2 Software Implementation of Primitives
4.3 A Communication Channel Connecting An Initiator and A Target
4.4 The Stack of Communication Layers
4.5 Latency Check
4.6 Bandwidth Check
4.7 Negotiation for Latency and Bandwidth During Channel Setup
List of Tables
4.1 Data Type Mappings
4.2 Comparisons on Reliability Levels
4.3 Reliability Levels and Their Implications
Chapter 1
Background
This chapter first presents a context for Network-on-Chip design. Based on a discussion of the design
challenges of a future System-on-Chip (SoC), we point out that a NoC will be a system platform for
integrating perhaps arbitrarily heterogeneous resources through communication interfaces. As an
instance, we introduce a mesh-structured NoC proposed by KTH. We also identify the design tasks for NoC
application designers. Next, we give a definition of the NoC Assembler Language, which targets NoC
application design. It highlights NoC communications by means of communication primitives that fit
nicely into the session layer of the OSI seven-layer model.
1.1 Introduction
1.1.1 The NoC Platform
Following Moore’s law, which has held in the semiconductor industry for over 35 years, a single chip is
predicted to be able to integrate four billion 50-nm transistors operating below one volt and running
at 10 GHz by the end of the decade [1]. Thanks to its huge capacity, the billion-transistor chip can
take on very complex functionality with hundreds of interconnected microprocessor-sized computing
resources. Such resources may be programmable like CPUs, dedicated like ASICs, configurable like FPGAs,
or passive like memories. However, fully exploiting the capacity offered by the technology faces
challenges. At the physical level, the interference caused by crosstalk, power/ground plane noise, etc.
affects signal integrity to a larger extent. Inductance has to be taken into consideration more closely.
A distributed RLC wire model is necessary to replace a lumped RLC wire model. This makes it more
difficult to keep physical properties under control. At the logic level, the wire delay will soon
dominate the gate delay. Lowering the voltage to reduce power cannot be pushed further because of the
limited headroom left for the threshold voltage of a CMOS transistor. If we move upwards, the
difficulties do not decrease. On the contrary, they become even worse. At the RT level, the purely
synchronous design approach is challenged because a global clock will simply be infeasible. Synchronous
design has to be confined to a small portion of a future chip. At the system level, bus-based systems,
such as multiple bridged buses, face even worse scalability problems. Limited bandwidth and bus length
will make it difficult to interconnect many more resources. Also, at this level, the heterogeneous
resources imply various interfaces or protocols, operating systems, etc., which are hard to integrate
on a single chip.
Seen from the design methodology angle, the problem looms just as large. The gap between methodology
capacity and chip capacity is not decreasing but increasing. Ad hoc RTL design considers too many
implementation details. Its design capacity (in terms of the number of transistors) and productivity
(in terms of design time) will not be sufficient to design the billion-transistor chip. To close the
gap, the abstraction level of design has to be raised, and reuse at all levels of abstraction is a must.
The design challenges of such a complex system on chip (SoC) spark new ideas. Platform-based design [2]
is now a hot topic in academia and industry. It has been very successful among PC makers. Now the
concept is moving into the chip area. Although there is no commonly accepted definition of
platform-based design, the basic idea is to enable architecture reuse in addition to IP reuse in order
to shorten time-to-market and simplify verification. It is neither a top-down nor a bottom-up approach;
it is a meet-in-the-middle approach. Coming up with a platform involves trade-offs between flexibility
or generality and performance. Different classes of applications demand different platforms. For any
system platform, the programming model is essential.
Network-on-Chip is well suited to such a platform concept, which integrates any type of resource via
communication interfaces. One NoC platform [3] proposed by KTH is a mesh structure composed of
switches, with each switch connecting to a resource, as shown in Figure 1.1. The resources are placed
on the slots formed by the switches. The switch network offers communication for the resources. The
resources perform their own computational functions and provide Resource-Network-Interfaces (RNIs).
The maximal resource area is defined by the maximal synchronous area of a technology. The reuse of
system architecture to enable fast IP integration is one major advantage expected from the NoC
platform. Meanwhile, a mesh structure has well-controlled electrical properties, which can largely
alleviate the interconnection difficulties.
[Figure 1.1 placeholder: nine switches (S), each connected to a Resource through an RNI.]
Figure 1.1: A NoC of Mesh Structure with 9 Nodes
1.1.2 The NoC Application Design Tasks
A NoC is inherently a heterogeneous distributed system. The resources are heterogeneous. They can be
programmable hardware (FPGAs), dedicated hardware (ASICs), various processor cores, memories, and IP
blocks. Heterogeneity implies that different elements, such as resources, switches and interfaces, are
designed by various means. A number of languages, synthesis tools, software compilers and linkers exist
for the design of individual elements. However, there is no single design flow that can be applied to
the design of all these elements. The resources are distributed. Distribution implies that processes on
different resources interact with each other via the on-chip communication network. NoC design will be
communication-centric. In addition to the NoC architecture, we also have to address the design of
process communications. If we treat a NoC as a collection of individual elements, the design of NoC
communications may be very complicated. Many communication standards and protocols are needed to
coordinate the programming of NoC communications. The relatively independent design of processes makes
it hard to integrate all these elements. For example, how do we program a process running on an ARM
microprocessor to communicate with a process running on an ASIC through the communication network? The
question is not whether we can do that. The question is one of portability or reusability, and
productivity. One can imagine how much application design work would be repeated if there were no
operating system to hide the details of the instruction set platform. Here the principle is the same.
With a relatively static interface we can hide the dynamics of the lower-layer details. Of course, one
requirement is that the interface should give the designer enough control over the underlying
resources. In this way, the design complexity is decoupled from individual cases. Essentially, the NoC
is treated as a whole unit. Application designers only see the interface, without knowing the
implementation details of communication.
To facilitate the following discussions, we naturally partition a NoC into three parts: a network
backbone, computational resources, and communication interfaces. Here we give rough working definitions
of some terms used in this document in order to avoid confusion. A resource is a processing element of
any type. A NoC backbone, also called a network architecture, includes the switching or routing network
in whatever topology, and the network interfaces (NIs). A network interface (NI) wraps a switch or
router to offer a standard interface or protocol to the outside world. A NoC architecture includes not
only a NoC backbone but also resources such as CPUs, DSPs, etc. as well as RNIs. A software platform,
such as the operating systems for CPUs and DSPs, is part of the NoC architecture. A communication
interface called a Resource-Network-Interface (RNI) can be either a hardware or a software interface.
In hardware, an RNI is an intermediate module connecting a resource to the network. It may enable a
resource to plug-and-play on the NoC backbone, provided the RNI speaks the same protocol as the NI. In
software, an RNI is also called a communication stub or socket, which is software-platform dependent.
It enables custom software to use the network services provided by the NoC backbone.
A NoC is a system platform including both hardware and software platforms. Since some IP blocks, such
as DSPs, CPUs, and memories, are reusable cores, we expect them to be pre-fabricated together with the
NoC backbone. Thus, a NoC itself is a half-customized prototype. Ideally we expect that resources can
simply plug-and-play on the NoC backbone. To this end the interface between NI and RNI should be
standard. As the initial step for platform-based design, we need to define a NoC platform for a
specific application class. Consequently we have an application specification on one hand, and a NoC
platform on the other hand. In short, application design maps this system specification onto the NoC
platform. Specifically, the application design tasks target custom hardware such as ASICs and FPGAs,
custom software for application tasks, and custom communication interfaces in hardware and software, as
illustrated in Figure 1.2, where the bridge is a bus bridge acting as an interpreter between the local
bus protocol and the network interface protocol. A NoC design methodology should be able to describe
both the NoC architecture and the NoC application. The NoC Assembler Language serves this purpose.
[Figure 1.2 placeholder. Labels: Switching or Routing Network, NI, Standard Interface, NoC backbone,
NoC architecture, RNI, Bridge, RNI/Bridge, FPGA, ASIC, OS, SW, CPU, DSP, IP, Analog, D~A, Mem.,
Controller.]
Figure 1.2: NoC Application Design Tasks
1.2 The NoC Assembler Language
1.2.1 The NoC Application Design and Compilation Flow
We define NoC-AL [4] as follows:
    NoC Assembler Language (NoC-AL) serves as an interface between NoC implementations
    and applications, very similar to the instruction set of a traditional CPU. A central part of
    NoC-AL will be communication primitives such as send and receive, open and close, and a
    standardized way of using shared memory. Every instance of a NoC must come with a NoC
    assembler, which translates NoC-AL programs into a set of NoC configuration files.
From the definition, we see that NoC-AL treats a NoC as a whole instead of as individual elements such
as resources, switches, and interfaces. The design languages for these elements, e.g. VHDL/Verilog for
hardware design, C/C++ for software design, and SystemC [5] and SpecC [6] for both hardware and
software design, are integral parts of NoC-AL. NoC-AL offers methods to describe the NoC architecture
and process communications in addition to computational tasks. The NoC architecture concerns the NoC
topology, the resource list and the process-to-resource mappings. The methods used for describing
process communications are communication primitives, which we define in this report. Based on
architecture and IP reuse, a NoC is a communication platform on which applications run. In an
application, we separate computation from communication to highlight the communication problems in a
NoC. The difference between computation and communication is that the former uses only processing
elements, while the latter uses both processing elements and communication media [7]. A conceptual
illustration of a NoC-AL program is shown below, where the NoC backbone is a 2D mesh. Note that some
additional statements may be needed to coordinate the computation code and the communication code of a
process, but they are not shown in the conceptual code. Also, the computation code and the
communication code of a process may actually be combined into one file.
NoC Architecture Description
{Topology: mesh 2 x 2
Resource List:Row1: R1=SHARC DSP, R2=ARM CPU
Row2: R3=FPGA, R4=ASIC}
NoC Application {
R1:{P11:{computation_file11.c; communication_file11.c}
P12:{computation_file12.c; communication_file12.c}...}
R2:{P21:{computation_file21.cpp; communication_file21.cpp}
P22:{computation_file22.cpp; communication_file22.cpp}...}
R3:{P31:{computation_file31.vhdl; communication_file31.vhdl}
P32:{computation_file32.vhdl; communication_file32.vhdl}...}
R4:{P41:{computation_file41.verilog; communication_file41.verilog}
P42:{computation_file42.verilog; communication_file42.verilog}...}
} /*Pij stands for process j on resource i.*/
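The conceptual program above can also be pictured as plain data: a topology, a resource list, and a
process-to-resource mapping. The following Python sketch is only an illustration under our own
assumptions. The class names (Process, Resource, NocProgram) and the row-major placement of R1 to R4 on
the 2 x 2 mesh are hypothetical and are not part of NoC-AL.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process Pij with its computation and communication source files."""
    name: str
    computation_file: str
    communication_file: str


@dataclass
class Resource:
    """One resource Ri and the processes mapped onto it."""
    name: str
    kind: str
    processes: list = field(default_factory=list)


@dataclass
class NocProgram:
    """Hypothetical container: topology, resource list, process mapping."""
    rows: int
    cols: int
    resources: list

    def slot_of(self, resource_name):
        """Return the (row, col) slot, assuming row-major placement."""
        index = [r.name for r in self.resources].index(resource_name)
        return divmod(index, self.cols)


def make_process(i, j, ext):
    # File names follow the pattern used in the conceptual listing.
    return Process(f"P{i}{j}", f"computation_file{i}{j}.{ext}",
                   f"communication_file{i}{j}.{ext}")


# Resource kinds and file extensions taken from the conceptual listing.
kinds = [("SHARC DSP", "c"), ("ARM CPU", "cpp"), ("FPGA", "vhdl"), ("ASIC", "verilog")]
program = NocProgram(
    rows=2, cols=2,
    resources=[Resource(f"R{i}", kind, [make_process(i, 1, ext), make_process(i, 2, ext)])
               for i, (kind, ext) in enumerate(kinds, start=1)],
)
```

Keeping the mapping as data makes it easy to check, for example, that the number of resources matches
the number of mesh slots before any code is generated.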
A NoC-AL program includes a NoC architecture description and an application description. A NoC
application can be expressed as a task graph and then mapped onto the given NoC architecture after
iterations of refinement and hardware/software partitioning until the result is satisfactory. Then we
can write NoC-AL programs for custom hardware such as ASICs and FPGAs, custom software, and
communication interfaces in hardware and software. To translate NoC-AL programs into NoC configuration
files, including both hardware and software parts, we need a NoC assembler that does source-to-source
processing. Standard tools for hardware synthesis and software compilation and linking are then used to
generate code at lower abstraction levels. This procedure is illustrated in Figure 1.3, where the
libraries are implementations of NoC primitives, for example communication primitives.
There are many open NoC-related issues, such as which NoC architectures/topologies are good for which
applications, how to efficiently map processes to resources at run time by task migration, and so on.
Most of these topics are beyond the scope of this report. In this report we concentrate on the central
issue of NoC-AL, the NoC communications, in particular communication primitives for both message
passing and shared memory.
1.2.2 NoC-AL Communications in the OSI Seven-Layer Model
Network communications are usually expressed using ISO’s OSI seven-layer reference model. In this
layered model, NoC-AL communications are handled at the session layer, as shown in Figure 1.4. Seen
from the session layer, the implementation details from the transport layer down to the physical layer
are hidden. The session layer offers a network InterProcess Communication (IPC) service. It should also
provide service access points to its upper layer, the presentation layer (if any) or the application
layer, and it uses the services of its lower layer, the transport layer. A session connection provides
a relationship between functions located in a pair of cooperating systems, established for the purpose
of information transfer between them. Information transfer, originated by application processes, is
carried through the application and presentation layers to the session layer (ISO 8326, 1990). The
services provided by the session layer are concerned with the management of a coherent dialogue between
cooperating systems: session establishment and termination, interaction management, synchronization,
data transfer and exception reporting [8].
[Figure 1.3 placeholder. Design flow labels: Application task graph, NoC architecture, Refinement &
HW/SW partitioning & Mapping, NoC-AL programs. Compilation flow labels: NoC assembler, Libraries, NoC
configuration files, HW codes, SW codes, Synthesis tools, SW compilers, Linkers, Silicon masks,
Configuration files, Executable files, ASIC, FPGA.]
Figure 1.3: The NoC Application Design and Compilation Flow
[Figure 1.4 placeholder. Two seven-layer stacks (Physical, Datalink, Network, Transport, Session,
Presentation, Application) and an intermediate stack (Physical, Datalink, Network). Labels: NoC-AL
Description, NoC-AL Communication, message, "The underlying communication systems viewed by session
layer". Transmission units listed: Message, Datagram/Stream, Packet, Frame, Word.]
Figure 1.4: NoC-AL Communications in OSI’s Seven-Layer Model
1.3 Related Work
While the NoC design process is new, the design of embedded systems and distributed systems is well
developed. The central part of NoC-AL deals with inter-process communication (IPC) on a NoC. IPC has
been extensively researched and used in multitasking operating systems, computer networks and
distributed computing [9] [10] [11]. Many network application paradigms, such as client-server,
producer-consumer, interacting peers, and Remote Procedure Call (RPC), make use of IPC mechanisms.
However, IPC only handles software communication. Whether it works on a distributed or a
non-distributed system, processes communicate via software, underneath which there is always an
operating system. At the task level, however, a process can be either a piece of hardware or a piece of
software. Whether it is hardware or software is the result of partitioning. NoC-AL communications must
deal with processes in both hardware and software forms, which requires attention to chip-level
concerns such as power and latency.
Within the context of a SystemC model, channels provide the means of communication between modules and
between processes within a module [5]. SystemC places few restrictions on the functionality of
channels. Thus channels may vary widely in complexity, from hardware signals to complex protocols with
embedded processes. SystemC defines two types of channels, primitive channels and hierarchical
channels. A primitive channel is one that supports the request-update method of access. A few examples
of primitive channels are the hardware signal sc_signal, the FIFO channel sc_fifo, and the
mutual-exclusion lock sc_mutex. Primitive channels are limited to the simpler communication mechanisms
only. On the other hand, hierarchical channels, which, as modules that implement one or more
interfaces, can have internal processes, offer a more powerful method for modeling complex
communication structures, for instance the on-chip bus (OCB), and are also useful for the refinement of
primitive channels. Since the SystemC channel supports hierarchical communication and communication
refinement, it provides very good back-end support for implementing the channel we put forward in this
report. We discuss this further in Chapter 4.
One question arises when defining NoC message passing primitives: why not simply adopt the Message
Passing Interface (MPI) [12]? MPI, which defines a message passing API library, has become the de facto
standard for distributed programming. It comprises 129 functions offering extensive functionality,
flexibility, and generality. The implementation of MPI demands the support of powerful operating
systems, which is very often not available in embedded operating systems that are real-time oriented
and compact. This may make MPI difficult to implement on a NoC, and less efficient for a specific
application. On the other hand, although NoCs use network communication to overcome the scalability
problem of bus-based Systems-on-Chip (SoCs), some communication features, such as transmission latency,
bandwidth, and traffic type, need to be retained. These features are not available in MPI, yet they are
important issues for chip design in order to achieve efficient implementations in terms of speed, area
and power.
The rest of this report is organized as follows. Chapter 2 discusses NoC communication, including
communication types and styles, common communication issues and communication channel characteristics.
Based on the work in Chapter 2, we define primitives for the two communication styles, message passing
and shared memory, in Chapter 3. In Chapter 4, we present some implementation issues of the primitives
defined in Chapter 3. We summarize this report together with future work in Chapter 5.
Chapter 2
NoC Communications
After classifying the two types of communications on a NoC, we briefly look at the two communication
styles, message passing and shared memory. Then we discuss the general communication issues concerning
naming and addressing, connection, and synchronization. Finally, we propose a communication channel
that carries a set of characteristics imposed by an application.
2.1 Communications on NoC
2.1.1 Types of Communications on NoC
Communications can be classified either physically or logically. From a physical point of view, NoC
communications can be either intra-resource or inter-resource. Intra-resource communication is between
processes on the same resource. Inter-resource communication is between processes on different
resources. From a logical point of view, NoC communications can be either intra-process or
inter-process. Intra-process communication takes place inside a process, and inter-process
communication (IPC) between processes. Inter-resource communication clearly belongs to distributed
interprocess communication, since messages are passed through the chip network. Distributed processes
are concurrent processes that communicate using the message-passing mechanisms found with IPC
facilities.
Although the following discussion also applies to intra-resource communication, we focus on
inter-resource communication in this report. It is the interconnection of heterogeneous resources that
makes on-chip communication a bottleneck for system performance.
2.1.2 Communication Styles
Memory organization plays a decisive role in interprocess communication styles. The memories in a NoC
can be either public/shared or private/local. If they are public or shared, the memories are organized
as a single global address space. Processes implicitly communicate via shared variables. Shared memory
can be designed as shared centralized or distributed memory. If the memories are private or local, the
NoC has multiple address spaces. Processes communicate by message passing, i.e. they explicitly send
and receive messages. In the shared memory style of IPC, concurrent processes share one or more
variables and use the changes in state of these variables to communicate. In the message passing style,
processes send and receive messages explicitly instead of examining the state of a shared variable.
We should mention that the two styles of interprocess communication, shared memory and message passing,
are logically equivalent, i.e. given one, you can build an interface that implements the other.
However, some programs may be easier to write using one rather than the other. In addition, the
hardware platform may make one easier to implement or more efficient than the other [13].
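To make the equivalence concrete, the following Python sketch builds a message passing interface on top
of shared memory. It is a minimal illustration under stated assumptions: threads stand in for
processes, the shared variable is a bounded buffer, and the names SharedMemoryChannel, send, receive
and run_demo are ours rather than NoC-AL primitives.

```python
import threading
from collections import deque


class SharedMemoryChannel:
    """Message passing built on one shared buffer guarded by a lock."""

    def __init__(self, capacity=4):
        self._buffer = deque()          # the shared variable
        self._capacity = capacity
        self._cond = threading.Condition()

    def send(self, message):
        with self._cond:
            # Block while the shared buffer is full.
            while len(self._buffer) >= self._capacity:
                self._cond.wait()
            self._buffer.append(message)
            self._cond.notify_all()

    def receive(self):
        with self._cond:
            # Block until a sender has changed the shared state.
            while not self._buffer:
                self._cond.wait()
            message = self._buffer.popleft()
            self._cond.notify_all()
            return message


def run_demo():
    """One producer sends five messages; one consumer receives them."""
    channel = SharedMemoryChannel(capacity=2)
    received = []

    def producer():
        for i in range(5):
            channel.send(i)

    def consumer():
        for _ in range(5):
            received.append(channel.receive())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

The callers of send and receive never touch the buffer directly, which is the point: the shared
variable and its synchronization are hidden behind an explicit message interface.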
82.2 Communication Issues
Communication is concerned with the following questions: Who communicates with whom, in which language,
and over which medium? And how is the communication procedure conducted? The three elements involved in
any communication are the communicating entities WHO (sender, receiver), the MEDIA (dedicated or shared
channel, connection-less or connection-oriented channel), and the PROTOCOL. The protocol has two
aspects. One is the message format, like language and grammar, which makes it possible to interpret
data correctly. A typical protocol data unit (PDU) consists of three parts: header, payload (data), and
trailer. Obviously all the communicating entities at the same layer must agree on a message format. The
other is synchronization, which governs how the actual communication proceeds. Depending on the
communication style, the synchronization and its implications vary. For instance, in the shared memory
style, how do we allow concurrent access to a piece of shared code? In the message passing style, can
both communicating entities perform a send operation simultaneously? Or may only one send while the
other receives, much as one person talks while the other listens? A solution for network-based
communication has to address the following three major issues [14]:
• Naming and addressing issue: How to designate the communicating entities/parties?
• Connection issue: What type of connection exists between senders and receivers?
• Synchronization issue: How to synchronize between write and read, between send and receive?
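The three-part PDU mentioned above (header, payload, trailer) can be sketched as follows. The field
layout is purely an assumption made for illustration: a header holding source id, destination id and
payload length as 16-bit fields, and a trailer holding a CRC-32 checksum. The report does not prescribe
any particular format, and the function names are hypothetical.

```python
import struct
import zlib

# Assumed header layout: source id, destination id, payload length (big-endian).
HEADER = struct.Struct(">HHH")
# Assumed trailer layout: CRC-32 computed over header and payload.
TRAILER = struct.Struct(">I")


def encode_pdu(source, destination, payload):
    """Build header + payload + trailer as one bytes object."""
    header = HEADER.pack(source, destination, len(payload))
    trailer = TRAILER.pack(zlib.crc32(header + payload))
    return header + payload + trailer


def decode_pdu(pdu):
    """Split a PDU and verify the trailer; raise ValueError if corrupted."""
    source, destination, length = HEADER.unpack_from(pdu, 0)
    body_end = HEADER.size + length
    payload = pdu[HEADER.size:body_end]
    (checksum,) = TRAILER.unpack_from(pdu, body_end)
    if checksum != zlib.crc32(pdu[:body_end]):
        raise ValueError("trailer checksum mismatch")
    return source, destination, payload
```

Both sides must use the same HEADER and TRAILER definitions, which mirrors the requirement that all
communicating entities at the same layer agree on a message format.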
2.2.1 Naming and Addressing Issue
To denote the communicating parties, some naming and addressing scheme is used. There are several ways
to designate whom you want to communicate with:
• By name (e.g. object X)
• By address (e.g. object at destination X)
• By group identifier (e.g. all objects related to X), used to identify a NoC multicast group.
Resource naming and addressing depend on the network architecture and the communication protocol. The
principle of naming lies in uniqueness and efficiency. On a single resource, the name or process-id of
a process is sufficient. If a process name is used, the operating system must resolve the name to a
process-id. In a NoC, a (resource address, process-id) pair can be used to uniquely identify a process
on a resource. We use Pij to represent process j on resource i.
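The designation schemes above can be illustrated with a small registry that resolves a process name to
its (resource address, process-id) pair and keeps multicast groups. This is a hypothetical sketch: the
NameService class, its method names, and the use of mesh coordinates as resource addresses are our
assumptions, not part of the report.

```python
class NameService:
    """Hypothetical registry mapping names to (resource address, process-id)."""

    def __init__(self):
        self._by_name = {}
        self._groups = {}

    def register(self, name, resource_address, process_id, groups=()):
        address = (resource_address, process_id)
        # Uniqueness: neither a name nor an address may be registered twice.
        if name in self._by_name:
            raise ValueError(f"name {name!r} already registered")
        if address in self._by_name.values():
            raise ValueError(f"address {address} already in use")
        self._by_name[name] = address
        for group in groups:
            self._groups.setdefault(group, set()).add(address)
        return address

    def resolve(self, name):
        """By name: return the unique (resource address, process-id) pair."""
        return self._by_name[name]

    def members(self, group):
        """By group identifier: return every address in a multicast group."""
        return frozenset(self._groups.get(group, ()))


names = NameService()
names.register("P11", (0, 0), 1, groups=("video",))
names.register("P12", (0, 0), 2)
names.register("P21", (1, 0), 1, groups=("video",))
```

Here P11 and P21 share the process-id 1 but live on different resources, so only the pair identifies
them uniquely.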
Addressing the destination resource (routing) concerns the routing algorithm, the routing mechanism,
and the routing mode. The routing algorithm determines which of the possible paths from source to
destination are used as routes and how the route followed by each particular packet is determined. It
restricts the set of possible paths to a smaller set of legal paths. The routing algorithm may be
deterministic or adaptive, and minimal or non-minimal. One property that any algorithm has to achieve
is deadlock freedom. The routing mechanism selects an output port for each input packet. Usually there
are three types of mechanisms: dimension-order routing, source-based routing, and table-driven routing.
The routing mode determines how a packet proceeds along its routing path. Typical modes are
store-and-forward, cut-through, and wormhole. The three schemes differ in the buffer size required in
the switches and in packet delivery latency. An interesting extreme case of non-minimal adaptive
routing is what is called “hot potato” routing. In this scheme, the switch never buffers packets. If
more than one packet is destined for the same output link, the switch sends one toward its destination
and temporarily “misroutes” the rest onto other output links. A detailed treatment of routing issues
can be found in [15].
In the mesh structure, it is easy to address a resource by making use of the two-dimensional Cartesian
coordinate (x, y), so-called dimension-order routing. Figure 2.1 shows two possible routes from the source
resource (2, 2) to the destination resource (0, 0) in the 2D mesh. Since a resource and a switch are
connected in a one-to-one pair, a resource and the connected switch can share the same Cartesian coordinate
(x, y). In figure 2.1, they are together viewed as one node.
Figure 2.1: Addressing in the 2D Mesh
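As an illustration of addressing by Cartesian coordinates, the following C sketch computes one possible route between two mesh nodes by correcting the x coordinate first and then the y coordinate. The names node_t, xy_next_hop and xy_route are hypothetical and not part of NoC-AL, and the x-before-y order is an assumption: the report does not fix which dimension is resolved first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node address: the (x, y) coordinate shared by a resource
 * and the switch connected to it. */
typedef struct { int x; int y; } node_t;

/* One step of dimension-order routing: move along x until the column
 * matches, then along y. Returns cur unchanged when cur equals dst. */
static node_t xy_next_hop(node_t cur, node_t dst)
{
    node_t next = cur;
    if (cur.x != dst.x)
        next.x += (dst.x > cur.x) ? 1 : -1;
    else if (cur.y != dst.y)
        next.y += (dst.y > cur.y) ? 1 : -1;
    return next;
}

/* Walks the whole route and stores every visited node (excluding src)
 * in path[]. Returns the hop count, or -1 if path[] is too small. */
static int xy_route(node_t src, node_t dst, node_t *path, size_t cap)
{
    size_t hops = 0;
    assert(path != NULL);
    while (src.x != dst.x || src.y != dst.y) {
        if (hops == cap)
            return -1;
        src = xy_next_hop(src, dst);
        path[hops++] = src;
    }
    return (int)hops;
}
```

For the source (2, 2) and the destination (0, 0), xy_route visits (1, 2), (0, 2), (0, 1) and (0, 0), i.e. four hops. Because the route depends only on the two addresses, this style of routing is deterministic.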
2.2.2 Connection Issue
If processes are all located inside a resource, the connection issue is relatively simple. They are connected by
atomic connections such as signal, FIFO, mutex, or bus connections, and data flows between processes
reliably. The connection issue becomes more critical and intractable when dealing with network message passing.
There are two types of connections, connection-oriented and connection-less. Connection-oriented
communication can take place in two kinds of networking schemes, packet switching and circuit switching.
Circuit switching is inherently connection-oriented. After a connection setup phase, the channel provides
a dedicated circuit connection for both sides. It offers a guaranteed constant data rate and thus a high
Quality of Service (QoS). After data transmission, the circuit connection is torn down. A typical example of
circuit switching is public telephony. Packet switching is based on a routing layer. It may provide both
connection-oriented virtual circuit and connection-less service. In connection-less service, each message,
called a datagram, is routed immediately and independently, without knowledge of the previous message.
A familiar analogy is a courier service (e.g. postal mail). Each message has to carry a fully specified
destination address, which adds considerable overhead to the message. Although datagram-oriented
connection-less service has no setup delay and no tear-down cost, packets may be lost due to network
congestion or buffer contention, the data sequence cannot be guaranteed after transmission, and data may
be duplicated. In connection-oriented service, the connection is called a virtual connection because it is
achieved in software [16]. It also has a channel setup phase, and a route may be shared among several
logical circuits. The routing decision is made once. Afterwards, the messages, called data streams, byte
streams, or simply streams, can be addressed by a simple virtual channel identifier and follow the same path
from the source to the destination. They may be delayed at intermediate nodes, but the data sequence is
guaranteed. In summary, a message passing service can be accomplished by one of three means: dedicated
connection, connection-oriented virtual circuit, and datagram-oriented connection-less message passing.
Each of these methods has strengths and weaknesses, and fits different applications.
Basically we distinguish four types of connections:
• Dedicated connection. For instance, the circuit-switching connection typically used for telephony is
dedicated. A channel is a point-to-point hardwired connection between an initiator and a target. It
provides a constant data rate and guaranteed services. Data order is preserved. The QoS is high while
the interconnect network utilization is lower.
The other three cases deal with packet-switching networks, where the connection type determines whether a
sender needs to keep transmission state information, such as which data is being transferred, which
has been acknowledged, which can be sent right now, and so on.
• Connection-oriented virtual circuit. An end-to-end protocol, similar to TCP [17, 18]. In virtual
circuit services, messages are delivered in order with guaranteed bandwidth. The latency varies
within a bounded range. The upper bound is the worst case latency along the virtual circuit path.
• Connection-less. Similar to UDP [17, 18]. In connection-less or best effort service, messages are
independently routed in the network. Neither the bandwidth nor the latency can be guaranteed.
• Raw. This opens up the possibility for the session layer to bypass the transport layer and thus talk
directly to the network layer [11].
The details of the connection issue are dealt with by the transport layer. For instance, if the
connection-oriented service is chosen, the transport layer is required to support the following:
• Connection management: Signaling procedures are required to open, maintain and tear down
connections between communicating entities if connection-oriented service is used. Concrete
protocols are needed for the three phases.
• Acknowledgments: An acknowledgment scheme is used by receivers to notify senders of the
successful or unsuccessful reception of data. In shared memory communication, it is particularly
important to have acknowledgments for the completion of write operations to maintain correctness [15].
• Error handling: Data loss due to transmission errors (a serious signal integrity problem arising from
deep sub-micron implementation) and buffer overflow at the network and the receivers may occur,
necessitating error detection and recovery schemes.
• Congestion control and flow control: Flow control involves preventing senders from over-running
the capacity of receivers. Congestion control involves preventing too much data from being injected
into the network, thereby causing switches or data links to become overloaded. Congestion happens
at the intermediate routing nodes during buffer contention. Flow control is an end-to-end issue, while
congestion control is concerned with how hosts and networks interact [19].
If the connection-less service is chosen, messages from the session layer are simply encapsulated with
the header of the transport layer protocol, and then routed via the network to the destination.
2.2.3 Synchronization Issue
The fundamental problem introduced by concurrent execution of processes is the possibility of interference
leading to inconsistent states. The role of synchronization is to prevent the possible histories of a concurrent
execution of processes from reaching forbidden states; in other words, to constrain the possible histories to
good states. With respect to the two communication styles, there are two classes of synchronization: one for
shared memory and the other for message passing.
Shared Memory Synchronization
Figure 2.2 illustrates processes sharing memory resources. The memories, which can be physically
centralized or distributed, are organized as a single address space. Allowing multiple processes to share data
structures requires a memory consistency model. This model, which functions as a set of rules,
serializes concurrent, arbitrarily ordered accesses to memories, i.e. it defines the order of execution of
memory accesses from multiple processes. There are several different memory consistency models defined for
multiprocessor systems. Sequential consistency [20] is the strongest model; it guarantees that memory
updates will appear to occur in some sequential order, and that every processor will see the same order. Some
consistency models relax the ordering constraint of sequential consistency by distinguishing between
accesses to synchronization variables and ordinary data. Synchronization operations are dealt with at a
high level of consistency, usually sequential consistency. Ordinary operations are processed with a weaker
consistency such as processor consistency or release consistency, but the presence of synchronization
operations enforces additional ordering restrictions on ordinary operations. Dubois et al. defined the weak
consistency model [21]. Their model requires that (a) accesses to global synchronizing variables are strongly
ordered, i.e. sequentially consistent; (b) no access to a synchronizing variable is issued by a processor before
all previous global data accesses have been performed; and (c) no access to global data is issued by a processor
before a previous access to a synchronizing variable has been performed. An ordinary data access is issued
either before or after a synchronization operation. Thus this model requires strong consistency of global
data accesses with respect to synchronization variables. In other words, all processes must see the ordinary
data accesses occur in a consistent order with respect to synchronization operations.
The memory consistency models present tradeoffs between ease of programming and implementation
overhead. Sequential consistency puts the programmer at ease because he or she does not need to explicitly
achieve synchronization. However, it is costly to implement, and performance is lower. Consequently
multiprocessors typically implement a weaker memory consistency model and leave to programmers the
task of inserting memory synchronization instructions. Compilers and libraries often take care of this so
the application programmer does not have to do so [22].
Figure 2.2: Processes Share Memories
A NoC itself can also be a multiprocessor system, although it may be heterogeneous. The heterogeneity
refers to heterogeneous processor-memory resources on a NoC instead of the homogeneous processor-memory
resources found in many implemented multiprocessor systems, for example, the famous Cosmic
Cube [23]. This further implies heterogeneous software platforms on the processor-memory resources,
such as operating systems and middleware libraries. A NoC architecture implements a relaxed memory
consistency model. It is the programmers’ job to write synchronized programs [15], in which programs are
labeled with synchronization events. To this end, we need a primitive for application programmers to
explicitly label synchronizing operations upon synchronization variables.
With memory consistency models, the concept of shared memory is no longer tied to the physical
implementation of memory banks. A programmer can write a correct program using the abstractions of
concurrent processes and shared memory with little knowledge about the underlying memory
implementation that will eventually execute the program. All that the programmer needs to know is the
consistency model enforced by memory. This leads us to treat the shared memory as an application
programming interface [24], as shown in figure 2.3. The program and memory agree on a consistency model.
Figure 2.3: Shared Memory as an API
In our NoC memory consistency model, we distinguish synchronization variables from ordinary data.
Each critical section is associated with a synchronization variable. The critical section is protected by the
entry and exit protocols. Thus we assume that n processes sharing a critical section have the following form:
    Process Critical Section [i = 1 to n]
    {
        while (true)
        {
            entry protocol;
            critical section;
            exit protocol;
            non-critical section;
        }
    }
The entry and exit protocols work upon the synchronization variable. To implement the synchronization,
we need special-purpose messages to the memory, and also a specific protocol to handle these messages. In
figure 2.2, synchronization messages are sent from the processes $P_1$, $P_2$ and $P_3$, and handled by the
memory controllers in the memories $M_1$ and $M_2$.
Shared memory synchronization can be achieved by mutual exclusion and/or condition synchronization
[22], as shown in figure 2.4. The synchronization problem occurs when multiple processes simultaneously
try to access a critical section that can only be executed by one process at a time. The critical section is a
sequence of statements that accesses the shared objects (variables or resources). Condition synchronization
is concerned with delaying an action until the state satisfies a Boolean condition, e.g. a flag is set, a
semaphore is up, or all processes have reached a barrier. Mutual exclusion is a control mechanism that
guarantees only one process can enter, execute and exit the critical section at a time. Entering and exiting
the critical section must be atomic, i.e. its execution cannot be interleaved with other statements, so that
intermediate states are invisible to outside processes. Locks and their variants such as spin locks and
queue locks are the common mechanisms to provide mutual exclusion. Semaphores provide a low-level
but efficient signaling mechanism for both mutual exclusion and condition synchronization. A semaphore
is a special kind of shared variable that is manipulated by two atomic operations, P(s) and V(s), each
taking one argument. Let s be a semaphore. The value of a semaphore is a nonnegative
integer. Then the definitions of P(s) and V(s) are as follows:
    P(s): <await (s > 0) s = s - 1;>    # wait, down
    V(s): <s = s + 1;>                  # up, signal
where the angle brackets < and > specify atomic actions. For example, <await (B) S> constructs
a general coarse-grained atomic action. The Boolean expression B specifies a delay condition and S is a
sequence of statements that is guaranteed to terminate.
The P operation decrements the value of s, but to be sure that s is never negative, the P operation waits
until s is positive. The V operation atomically increments the value of s. The main disadvantage of
semaphores is that they are a low-level mechanism and thus difficult to use. Monitors and condition
variables are two high-level synchronization mechanisms combined together in order to achieve both mutual
exclusion and condition synchronization. Monitors provide implicit mutual exclusion and condition
variables provide explicit condition synchronization with blocking semantics. A monitor is an abstract data
type that defines monitor variables and atomic operations on them. Collective processes may use barriers
for synchronization. A barrier is a synchronization point that all processes must reach before any process is
allowed to proceed. The basic implementation semantics for such synchronization schemes are busy waiting
and blocking. In busy waiting, a process repeatedly checks a condition until it becomes true. Blocking
synchronization requires a mechanism to suspend and resume processes (context switching) and to maintain
a queue of delayed processes.
[Figure: taxonomy of shared memory synchronization under mutual exclusion and condition synchronization,
for inter-process and collective processes: locks (spin locks, queue locks); semaphores (binary, split,
general counting); monitors with condition variables; condition variables; barriers (counter barrier, flags
and coordinator, symmetric barriers).]
Figure 2.4: Shared Memory Synchronization Mechanisms
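The P(s) and V(s) definitions can be sketched with C11 atomics. This is a minimal busy-waiting sketch, not the NoC-AL implementation: the type noc_sem_t and the noc_sem_* functions are hypothetical names, and the compare-and-swap loop is one assumed way of realizing the atomic action <await (s > 0) s = s - 1;>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical semaphore: a nonnegative integer updated atomically. */
typedef struct { atomic_int value; } noc_sem_t;

static void noc_sem_init(noc_sem_t *s, int initial)
{
    assert(initial >= 0);               /* a semaphore is never negative */
    atomic_init(&s->value, initial);
}

/* One attempt at <await (s > 0) s = s - 1;>. Returns false instead of
 * waiting when the semaphore is zero. */
static bool noc_sem_try_p(noc_sem_t *s)
{
    int old = atomic_load(&s->value);
    while (old > 0) {
        /* On failure the compare-exchange reloads 'old', so retry. */
        if (atomic_compare_exchange_weak(&s->value, &old, old - 1))
            return true;
    }
    return false;
}

/* P(s): busy-wait until the decrement succeeds. */
static void noc_sem_p(noc_sem_t *s)
{
    while (!noc_sem_try_p(s))
        ;                               /* busy waiting */
}

/* V(s): <s = s + 1;> */
static void noc_sem_v(noc_sem_t *s)
{
    atomic_fetch_add(&s->value, 1);
}
```

With an initial value of 1 the sketch behaves as a binary semaphore: noc_sem_p can serve as the entry protocol of a critical section and noc_sem_v as its exit protocol. A blocking variant would suspend the caller and queue it instead of spinning.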
Implementing any atomic operation, such as barriers, locks, P(s) and V(s), requires synchronization
primitives that realize an atomic read-modify-write operation. For instance, many synchronization
primitives have been developed in the form of software procedures, such as test-and-set, compare-and-swap,
fetch-and-add, and fetch-and-increment, to realize locks, i.e. lock on entering the critical section and unlock
on exiting the critical section. These coarse-grained operations are built on special fine-grained machine
instructions that provide atomicity during reading and writing memory. For example, in general-purpose
processors, special load-linked (ll) and store-conditional (sc) instructions (e.g. “ll” and “sc” for the MIPS
R4000 or “lwarx” and “stwcx.” in the MPC860) are implemented in hardware. However, the need for the special
load (ll) and store (sc) can be removed if the underlying hardware architecture has special functional
units such as an arbiter and hardware locks in addition to the memory controller in a multiprocessor SoC [25].
In summary, the implementation of atomic operations requires the support of the underlying system
architecture. But this does not mean that the microprocessors must offer special machine instructions to
support an atomic read-modify-write operation. If there are specific hardware synchronization units handling
access to shared memory, the multiprocessors can exclusively access shared memory without using special
machine instructions to achieve an atomic read-modify-write operation.
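As a concrete instance of a lock built on an atomic read-modify-write primitive, the following C sketch implements a spin lock with the test-and-set operation of C11's atomic_flag. The names noc_spinlock_t and noc_spin_* are hypothetical; on processors with load-linked/store-conditional instructions a compiler would typically map the test-and-set onto such an instruction pair, but that mapping is an assumption about the toolchain, not something the sketch controls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical spin lock: a single flag that is set while the lock is held. */
typedef struct { atomic_flag held; } noc_spinlock_t;

static void noc_spin_init(noc_spinlock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->held);        /* start in the unlocked state */
}

/* One test-and-set attempt: atomically sets the flag and reports whether it
 * was clear before, i.e. whether the caller now owns the lock. */
static bool noc_spin_try_lock(noc_spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Entry protocol: busy-wait until the test-and-set succeeds. */
static void noc_spin_lock(noc_spinlock_t *l)
{
    while (!noc_spin_try_lock(l))
        ;                               /* busy waiting */
}

/* Exit protocol: clear the flag so that another process can enter. */
static void noc_spin_unlock(noc_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A process would call noc_spin_lock before its critical section and noc_spin_unlock after it. The acquire and release orderings keep ordinary data accesses inside the critical section, which is in the spirit of treating the lock as a synchronization variable.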
Figure 2.5: Processors Share Memories: (a) Dance-hall and (b) Distributed Memory
To exploit temporal and spatial locality, microprocessors usually have local caches to enhance
computing performance, as shown in figure 2.5. The memory in figure 2.5(a) is organized symmetrically:
all of main memory is uniformly far away from all processors. Its limitation is that all of memory
is indeed far away from all processors; several “hops” or switches in the interconnect must be traversed
to reach any memory module from any processor. The approach in figure 2.5(b), which uses distributed
memory, is not symmetric. A scalable interconnect is located between processing nodes, but each node has
its own local portion of the global main memory to which it has faster access. By exploiting locality in the
distribution of data, more cache misses may be satisfied in the local memory and may not have to traverse
the network [15]. Shared-memory multiprocessors have an advantage: the simplicity of sharing code and
data structures among the processes comprising the parallel application. Process communication, for
instance, can be implemented by exchanging information through shared variables. This sharing can result
in several copies of a shared block in one or more caches at the same time. To maintain a coherent view
of the memory, these copies must be consistent. This is the cache coherence/consistency problem. Cache
coherence schemes tackle the problem of maintaining data consistency in shared-memory multiprocessors.
They rely on software, hardware, or a combination of both [26]. Hardware-based protocols include snoopy
cache protocols, directory schemes and cache-coherent network architectures. An excellent treatment of
cache coherence can be found in [26, 27].
Message Passing Synchronization
If processes use local memories, the synchronization characteristics of a messaging protocol govern whether
a process stops running when it executes a send or receive [14]. In general, message passing can be
synchronous or asynchronous depending on what synchronization schemes are employed by the send and
receive. Basically a send and a receive can each be blocking or nonblocking. A nonblocking operation allows
the process to continue execution, whereas a blocking operation suspends the process until a certain
prespecified condition turns true, e.g. an acknowledgment is received for a send, data is available for a
receive, or a timeout occurs. A blocking condition decides when the blocking function call will return. The
reliability requirement on communication determines the blocking conditions. A nonblocking operation
should allow the process to check later whether it has completed.
Figure 2.6 shows the communication scenario from a source process to a destination process. We
assume that there are buffers at both the session layer and the transport layer. At the source, a message is
first delivered to the session layer buffer, and then moved to the transport layer buffer. After routing in the
network, the message arrives at the transport layer buffer, and is then dispatched to its corresponding session
layer buffer at the destination.
Figure 2.6: Message Passing Scenario from Source to Destination
The blocking send works as follows:
1. If the source process (the application process) asks for no acknowledgment, the blocking send returns
when the message is delivered to the transport layer successfully. To complete this, first the session
layer buffer and then the transport layer buffer must be available.
2. If the source process requires an acknowledgment from its destination process, the blocking send does
not return until the acknowledgment is received.
The nonblocking send returns when the message is successfully delivered to the session layer. The
requirement to complete this is that the session layer buffer has to be available.
The blocking receive works as follows:
1. If the source process asks for no acknowledgment, the blocking receive returns after receiving a message
from its session layer buffer.
2. If the source process asks for an acknowledgment, the blocking receive does not return until it has
received a message from its session layer buffer and finished sending back an acknowledgment. This implies
that there is a session layer buffer available for sending back the acknowledgment.
The nonblocking receive returns after checking its session layer receive buffer.
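The return conditions of the send operation can be summarized as simple predicates. The C sketch below is a simplified model under stated assumptions, not an implementation of the layers: send_state_t and both functions are hypothetical names, and buffer availability and acknowledgment arrival are reduced to Boolean flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the sender-side state. */
typedef struct {
    bool session_buf_free;    /* session layer buffer can take the message */
    bool transport_buf_free;  /* transport layer buffer can take the message */
    bool ack_received;        /* acknowledgment from the destination arrived */
} send_state_t;

/* Blocking send: may return once the message has reached the transport
 * layer (no acknowledgment requested), or once the acknowledgment has
 * additionally been received (acknowledgment requested). */
static bool blocking_send_done(const send_state_t *st, bool wants_ack)
{
    assert(st != NULL);
    bool delivered = st->session_buf_free && st->transport_buf_free;
    return wants_ack ? (delivered && st->ack_received) : delivered;
}

/* Nonblocking send: returns as soon as the session layer buffer has
 * accepted the message. */
static bool nonblocking_send_done(const send_state_t *st)
{
    assert(st != NULL);
    return st->session_buf_free;
}
```

In this model a blocking send would re-evaluate blocking_send_done until it holds or a timeout expires, whereas a nonblocking send evaluates its predicate once and leaves later status checks to the caller.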
2.3 Communication Channel Characteristics
An application is composed of concurrent communicating processes. It may be represented as a directed
graph, known as a task graph, in which a node denotes a process and an arc a communication channel. Figure
2.7 shows a task graph with three channels.
Figure 2.7: A Task Graph with Three Channels
A channel is an arc in the task graph connecting a communicating pair. It is an abstraction of
communication media, either dedicated or shared. At the task level, it does not incorporate implementation
details, such as interfaces, but a set of characteristics regarding performance, cost and Quality-of-Service
(QoS) which are required by the application in order to satisfy design goals under given constraints. In
the following, we use C syntax and C conventions to facilitate defining communication primitives when
necessary. However, the primitives are language-independent, and can be bound to various hardware and
software design languages. We will show these bindings in the next chapter.
We identify and define some important channel characteristics in a struct as follows:
    struct channel_feature
    {direction, burstiness, latency, bandwidth, quality_class, reliability}
• direction: the orientation of message transfer. It may be simplex, half-duplex, or duplex.
For half-duplex, control messages are needed to switch the message transfer direction.
• burstiness: it reflects the channel traffic characteristics, which can be periodic or aperiodic. If
periodic, the traffic has a burst cycle with a maximum burst length. It can model a channel with a constant
data rate. If aperiodic, it may have a minimum burst interval and a maximum burst length. For example,
keyboard typing is aperiodic, but it reasonably has a maximum speed (frequency). This parameter may
be useful for modeling chip network traffic, saving power consumption, and allocating bandwidth.
• latency: the absolute time for a single unit of data to be transmitted from the sender to the receiver. It
may also be measured in relative time in terms of the number of clock cycles. The latency is one of the
requirements from the application. It may have three values: minimum, average, and maximum.
• bandwidth: it defines the channel's ability to transfer data. It can be measured in different units,
for instance, the number of bits per second, the number of frames per second, etc. Different layers
have different transfer units: the physical layer transmits words, the datalink layer frames, the
network layer packets, the transport layer datagrams or byte streams depending on the connection
type, and the session layer messages. A message is the natural communication unit of an
algorithm; in general, a message must be broken up into packets to be sent on the network. Among
these, the size of a frame, a packet, and a message is variable within a range. It is infeasible to use
units of variable length to represent bandwidth; a unit of fixed size has to be adopted. We
use the number of bytes per second to denote channel bandwidth, which is also one of the requirements
from the application. It may have three values: minimum, average, and maximum.
• quality_class: a natural number representing the quality level of a channel. It is useful for scheduling
purposes, for example, when multiple channels are contending for a shared resource, such as buffers or
communication links. It is a quality of service metric at the channel level. The Quality Class (QC)
is closely related to the connection states of the channel that are built on the NoC backbone
communication services. As discussed previously, the connection can basically be connection-oriented or
connection-less.
• reliability: In a communication sense, it means that data are sent and received correctly, i.e. with no
corruption, no loss, no duplication, and no out-of-order delivery. Reliability is orthogonal: any channel
can be designed with a certain reliability no matter whether the underlying layer is reliable or not. The
channel directly deals with the session layer, whose lower layer is the transport layer. When defining the
channel reliability, we have to take into account the reliability of both the transport layer and the
session layer. We define four levels of reliability as illustrated in figure 2.8, and describe them as
follows:
Figure 2.8: The Channel Reliability Constellation
– Level 1, $S_r T_r$, means that both the session layer and the transport layer are reliable. The session
layer at the destination sends back an acknowledgment to its counterpart at the source, as shown in
figure 2.9. In figure 2.9, the transport layer also sends an acknowledgment back. Please note that this
scenario only refers to a packet-switching network where the network layer is not reliable.
A typical example is TCP built on the Internet. In the case of circuit switching, where the
network layer is reliable, the transport layer acknowledgment may not be necessary.
Figure 2.9: Reliable Session and Transport
– Level 2, $S_u T_r$, refers to an unreliable session layer but a reliable transport layer. The session
layer at the destination does not send back an acknowledgment to its counterpart at the source, as
shown in figure 2.10. Similarly to Level 1, the transport layer does not necessarily send
back an acknowledgment if the underlying network is circuit switching.
Figure 2.10: Unreliable Session but Reliable Transport
– Level 3, $S_r T_u$, refers to a reliable session layer but an unreliable transport layer. As in
reliability Level 1, the session layer at the destination sends back an acknowledgment to its counterpart
at the source, as shown in figure 2.11. This implies that the session layer has to
handle retransmission, and thus message labeling and timeouts. If a nonblocking sender cares about
its messages, it should be prepared to re-send messages if the recipient does not respond after
a reasonable amount of time. This means that the sending process should not terminate until it
is assured that its messages were indeed received.
Figure 2.11: Reliable Session but Unreliable Transport
– Level 4, $S_u T_u$, means that neither the session layer nor the transport layer is reliable. It is
the lowest level of reliability. The transmitted data will be acknowledged neither at the session
layer nor at the transport layer, as shown in figure 2.12.
Figure 2.12: Unreliable Session and Transport
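The channel characteristics listed in this section can be written out as a concrete C declaration. The sketch below is one possible rendering under stated assumptions: the report only names the six fields, so every type, enumerator and helper function here (direction_t, burstiness_t, range_t, reliability_t, session_acks, transport_reliable, range_valid) is hypothetical, as is the choice of clock cycles as the latency unit.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { DIR_SIMPLEX, DIR_HALF_DUPLEX, DIR_DUPLEX } direction_t;
typedef enum { BURST_PERIODIC, BURST_APERIODIC } burstiness_t;

/* Reliability levels 1 to 4: session (S) and transport (T) layer,
 * each reliable (R) or unreliable (U). */
typedef enum {
    REL_SR_TR = 1, REL_SU_TR = 2, REL_SR_TU = 3, REL_SU_TU = 4
} reliability_t;

/* Minimum, average and maximum value of a requirement. */
typedef struct { unsigned long min, avg, max; } range_t;

struct channel_feature {
    direction_t   direction;
    burstiness_t  burstiness;
    range_t       latency;        /* clock cycles (assumed unit) */
    range_t       bandwidth;      /* bytes per second */
    unsigned      quality_class;  /* natural number, used for scheduling */
    reliability_t reliability;
};

/* True when the session layer at the destination sends acknowledgments. */
static bool session_acks(reliability_t r)
{
    assert(r >= REL_SR_TR && r <= REL_SU_TU);
    return r == REL_SR_TR || r == REL_SR_TU;
}

/* True when the transport layer is reliable. */
static bool transport_reliable(reliability_t r)
{
    assert(r >= REL_SR_TR && r <= REL_SU_TU);
    return r == REL_SR_TR || r == REL_SU_TR;
}

/* Sanity check for a (minimum, average, maximum) triple. */
static bool range_valid(range_t r)
{
    return r.min <= r.avg && r.avg <= r.max;
}
```

For a Level 3 channel, session_acks holds while transport_reliable does not, which is the case where the session layer itself must handle retransmission.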
Chapter 3
NoC-AL Communication Primitives
This chapter deﬁnes primitives for the two communication styles, message passing and shared memory,
followed by a simple application example of using the primitives.
3.1 Message Passing Primitives
A process is uniquely identified by a tuple (resource number, process number). A channel is shared by a
source process and a destination process, and is thus identified by a (source process, destination process)
pair. We treat a channel as an object, and define a set of methods to operate on it. A message passing
procedure is a channel-based data transaction that consists of three phases: channel setup, data
transmission, and channel tear down, as illustrated in figure 3.1. A channel is set up by request and
response. The concrete channel setup, such as three-way handshaking, is implemented at the transport layer.
The session layer only needs to know what kind of channel the application asks to establish. This
handshaking procedure may not actually take place if the initiator asks for a connection-less channel. In
this case, opening a channel just assigns the destination address and the channel parameters to the
initiator, and does not expect any response. That is, the channel establishment is handled locally. If the
initiator asks for a connection-oriented channel, the initiating process sends the setup message to
negotiate with the network for bandwidth and delay during the channel setup phase. Once the request is
granted, the channel path is fixed, and the bandwidth is reserved. Data transmission may be one-way or
two-way. The channel initiator/creator does not necessarily send the first message, because a channel
request may be initiated by a destination that wants specific channel characteristics/parameters. After the
data transmission phase, the channel can be torn down by either end of the communicating processes.
Figure 3.1: Message Passing Procedure Between Processes
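The three phases of the message passing procedure can be viewed as a small state machine on the initiator side. The following C sketch is an illustration under stated assumptions: the state and event names and the function ch_step are hypothetical, and a rejected setup request is assumed to return the channel to the closed state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical channel states as seen by the initiator. */
typedef enum { CH_CLOSED, CH_SETUP, CH_OPEN } ch_state_t;

/* Hypothetical events: the local open call, the response from the network
 * or target (granted or rejected), and a close from either end. */
typedef enum { EV_OPEN, EV_GRANTED, EV_REJECTED, EV_CLOSE } ch_event_t;

/* Returns the next state. A connection-less channel is opened locally
 * without waiting for a response; a connection-oriented channel goes
 * through the setup phase first. Unexpected events leave the state
 * unchanged. */
static ch_state_t ch_step(ch_state_t s, ch_event_t ev, bool connection_oriented)
{
    switch (s) {
    case CH_CLOSED:
        if (ev == EV_OPEN)
            return connection_oriented ? CH_SETUP : CH_OPEN;
        break;
    case CH_SETUP:
        if (ev == EV_GRANTED)
            return CH_OPEN;
        if (ev == EV_REJECTED || ev == EV_CLOSE)
            return CH_CLOSED;
        break;
    case CH_OPEN:            /* data transmission phase */
        if (ev == EV_CLOSE)
            return CH_CLOSED;
        break;
    default:
        assert(!"unknown channel state");
    }
    return s;
}
```

Send and receive operations would only be legal in the CH_OPEN state, which corresponds to the data transmission phase in figure 3.1.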
In terms of the message passing procedure, we deﬁne the following communication primitives for
message passing:
￿ Open a channel
int channel(source process, destination process, channel feature)
18Description:
– This function initiated by the source process opens a channel between the source process and
the destination process. It is carried out by the initiator who wants to contact the other end, the
destination. It returns a local channel descriptor, which is a nonnegative integer, if successful,
or a different negative integer for each of the different reasons of failure, such as the network
bandwidth requirement not satisﬁed, or the destination not available.
– The source process is the process who initiates the channel setup.
– The destination process is the process whom the initiator wants to communicate with.
– The channel feature is deﬁned as a struct reﬂecting a set of channel characteristics as deﬁned
in the previous chapter.
￿ Listen to channel
int listen(maxQueueLimit)
Description:
– This function sets the maximum size maxQueueLimit of channel request queue, and causes
internal state changes to permit channel requests.
￿ Accept a channel
int accept(channel)
Description:
– This function used by the destination to read one channel request from incoming buffer, stores
intochannel,andresponsesthe channelinitiatorifnecessary. It returnstheinteger1onsuccess,
oradifferentnegativeintegerforeachofthedifferentreasonsoffailure,suchaschannelrequest
not available, parity check/checksum error, requested channel features not met.
– The channel is the address to put an incoming channel setup request.
• Bind a channel
int bind(expected_channel, channel)
Description:
– This function checks whether an accepted channel matches an expected channel. It returns 1 for
a successful match, -1 for failure.
• Send message
int send(channel, msg, msg_size, msg_type, msg_id, sync_flag, timer, out_of_band, request)
Description:
– This function sends a message to the specified channel. It returns 1 for success, a distinct
negative integer for each reason of failure, and 0 when a timeout occurs if the send is
blocking.
– The channel is the descriptor of the channel to which the message is sent.
– The msg is the initial address of the message to be sent.
– The msg_size is the size of the message.
– The msg_type is the data type of the message. The data type is one of the basic features of a message.
Different data types take up different amounts of memory, and the representations of data types in
different microprocessors and design languages may differ. It is therefore necessary to send
the data type of the message explicitly in order to avoid misinterpretation and wrong conversion.
– The msg_id is the identity number of the outgoing message. It is a natural number used to tag
each message. A message can be corrupted, lost, duplicated, or delivered out of sequence
during network transmission. To achieve reliable transmission, we can use positive
acknowledgment with retransmission. A message ID enables the sender to retransmit,
and the receiver to maintain the correct message sequence and avoid message duplication.
– The sync_flag, with value 1 or 0, specifies whether the send function is blocking or nonblocking.
– The timer is used for two purposes. If the send is blocking, the timer specifies the maximum
amount of time the sender waits for the blocking condition to be cleared, e.g., for the
acknowledgment to be received. When a timeout occurs, the function informs the application by
returning 0. Whether to retransmit is up to the application. If the send is nonblocking, the timer
specifies the minimum amount of waiting time before retransmission; if the timer equals -1, no
acknowledgment from the receiver is required, and thus there is no automatic retransmission.
– The out_of_band is a flag with value 1 or 0 that distinguishes out-of-band data from in-band data.
Out-of-band data is considered higher priority than normal (in-band) data. It is useful for
conveying control information if something important occurs at one end
of the connection and that end wants to inform its peer quickly.
– The request is an optional object associated with nonblocking send and receive. It is used later
to query the status of the nonblocking communication or to wait for its completion.
• Receive message
int receive(channel, msg, msg_size, msg_type, msg_id, sync_flag, timer, request)
Description:
– This function is used by a destination process to receive data from the specified channel. It
returns 1 upon success, 0 when a timeout occurs if the receive is blocking, and a distinct negative
integer for each reason of failure.
– The msg is the initial address of the incoming message buffer.
– The msg_size is the size of the message to be taken from the incoming buffer.
– The msg_type is the data type of the incoming message.
– The msg_id is the identity number of the incoming message.
– The sync_flag specifies whether the receive function is blocking or nonblocking.
– The timer is used for two purposes. If the receive is blocking, the timer specifies the maximum
amount of time the receiver waits for the blocking condition to be cleared, e.g. for a message to
become available. When a timeout occurs, the function informs the application by returning 0. Whether
to continue polling the channel is up to the application. If the receive is nonblocking, the timer
specifies the minimum amount of waiting time before re-polling the channel.
– The request is similar to that explained in the send function.
• Check nonblocking completion
int check(request, status)
Description:
– This function checks whether the operation identified by request has completed. It returns information
on the operation in status, which may be an integer value of 1 or 0 representing completed or
not yet completed, respectively.
• Close a channel
int close(channel)
Description:
– This function closes the specified channel. It can be initiated by either communicating end. It
returns either 1 for success, or -1 for failure. One possible reason of failure is trying to close a
reliable channel before all ongoing messages are received and acknowledged.
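The role of msg_id at the receiver can be sketched as follows. This is a minimal illustration, assuming that IDs on a channel start at 0 and increase by one per message, which the report does not prescribe. The names rx_state, rx_verdict and rx_classify are hypothetical and not part of NoC-AL.

```c
#include <assert.h>
#include <stddef.h>

/* Receiver-side state for one channel: the next msg_id expected.
   Assumption: IDs start at 0 and increase by one per message. */
typedef struct { unsigned next_id; } rx_state;

typedef enum { RX_DELIVER, RX_DUPLICATE, RX_OUT_OF_ORDER } rx_verdict;

/* Classify an incoming message by its msg_id. */
static rx_verdict rx_classify(rx_state *st, unsigned msg_id) {
    assert(st != NULL);
    if (msg_id < st->next_id) return RX_DUPLICATE;     /* already delivered: drop it */
    if (msg_id > st->next_id) return RX_OUT_OF_ORDER;  /* a gap: an earlier message is missing */
    st->next_id++;                                     /* in sequence: deliver and advance */
    return RX_DELIVER;
}
```

A duplicate would typically be acknowledged again and dropped, while an out-of-order message could be buffered or left for the sender to retransmit; that policy is left open here.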
In addition to the basic primitives described above, we need some other primitives for channel
management, such as getchannelopt() and setchannelopt(), data conversions, multicast, and so on.
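The return convention of a blocking send (1 for success, 0 for timeout, negative for failure) leaves the retransmission decision to the application. A minimal sketch of such an application-level policy is given below. The function stub_send is a hypothetical stand-in for the real primitive with a reduced argument list, and send_with_retry and the constant names are likewise illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Return convention from the text: 1 = success, 0 = timeout, <0 = failure. */
enum { NOC_OK = 1, NOC_TIMEOUT = 0 };

/* Hypothetical stand-in for a blocking send(): it times out
   stub_timeouts_left times and then succeeds. */
static int stub_timeouts_left = 2;

static int stub_send(int channel, const void *msg, int msg_size) {
    (void)channel; (void)msg; (void)msg_size;
    if (stub_timeouts_left > 0) { stub_timeouts_left--; return NOC_TIMEOUT; }
    return NOC_OK;
}

/* Retry on timeout at most max_tries times; a failure code (<0) or a
   success ends the loop immediately and is returned unchanged. */
static int send_with_retry(int channel, const void *msg, int msg_size, int max_tries) {
    assert(max_tries > 0);
    int rc = NOC_TIMEOUT;
    for (int i = 0; i < max_tries && rc == NOC_TIMEOUT; i++)
        rc = stub_send(channel, msg, msg_size);
    return rc;
}
```

Bounding the number of attempts avoids the unbounded busy loop that a plain "repeat until success" would produce when the peer is unavailable.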
3.2 Shared Memory Primitives
Shared memories can be used statically, for example by defining global shared variables, or dynamically. Here we are
concerned with the dynamic use of shared memory. The ways of using shared memories should be standardized.
Shared memory means that data is first written into a global memory by a writer, and then read by a reader.
This is in contrast to message passing, where the sender passes data directly and explicitly to the receiver.
A memory sharing procedure consists of three phases: memory allocation, memory access, and memory
release, as shown in figure 3.2. Memory can be written and read in two ways: one byte at a time, or
as a burst of multiple bytes at a time.
[Figure: three phases in sequence: Memory Allocation, Memory Access, Memory Release.]
Figure 3.2: The Shared Memory Procedure
• Memory allocation
int memory(resource, start_address, end_address, memory_type)
int memory(resource, number, memory_type)
Description:
– The two functions request a memory segment from the memory resource. The first one specifies a
start address and an end address. The second one gives the requested number of bytes. If
successful, both return a memory descriptor, which is a nonnegative integer. On failure, both
return a distinct negative integer for each reason of allocation failure, such as
memory resource not available or memory full.
– The resource is the location of the memory.
– The start_address and end_address specify the range of the requested memory segment.
– The number is the size of the requested memory segment in bytes.
– The memory_type is defined as a struct stating whether the memory allows multiple concurrent
reads:
struct memory_type
{read: multiple | single; write: single;}
• Memory access – Read one or multiple bytes
int read(memory, number, start_address, dataarray)
Description:
– This function reads a block of data from the specified memory address space. It returns a
positive integer denoting the number of bytes successfully read upon success, and a
distinct negative integer for each reason of failure, such as read contention or memory
address error.
– The memory is a memory descriptor.
– The number is the number of data bytes to be read out. If the number equals 1, this function
reads only one byte.
– The start_address specifies the starting memory address for reading.
– The dataarray is the pointer to the buffer into which the data is written.
• Memory access – Write one or multiple bytes
int write(memory, number, start_address, dataarray)
Description:
– This function writes one or multiple data bytes to the specified memory address space. It returns
a positive integer denoting the number of bytes successfully written upon success, and a
distinct negative integer for each reason of failure, such as write contention or memory
address error.
– The memory is a memory descriptor.
– The number is the number of bytes of the data array. If the number equals 1, this function
writes only one byte.
– The start_address specifies the starting memory address for writing.
– The dataarray is the pointer to the buffer from which the data is read out.
• Memory release
int free(memory)
Description:
– This function releases the allocated memory space. It returns 1 for successful release or -1 for
failure.
• Atomic read-modify-write
int rmw(variable, relation, value, operation)
Description:
– This function, which is adapted from the general atomic instruction of the Cedar supercomputer [28],
performs an atomic read-modify-write of the synchronization variable. The semantics is
<await (condition == true) operate on the variable>.
The value this function returns depends on the specific atomic operation it performs. For example,
if it does Test-and-Set(lock), it returns the variable value before the operation. If it does
Test-and-Set(unlock), it returns the variable value after the operation.
– The two fields relation and value form a testable condition between the variable and the
value. In general, the condition takes one of three forms: variable == value, variable > value,
and NULL. The value is a nonnegative integer.
– The operation is applied to the synchronization variable. It has five options: INCrement,
DECrement, ADD, SET to 1, and RESET to 0. All of the operations are indivisible.
With this general atomic function, it is straightforward to derive equivalent primitives for Test-and-Set,
Fetch-and-Increment, Fetch-and-Add, as well as the semaphore operations P(s) and V(s).
Test-and-Set(lock): int rmw(lock, "==", 0, SET)
Test-and-Set(unlock): int rmw(lock, NULL, NULL, RESET)
Fetch-and-Increment(s): int rmw(s, NULL, NULL, INC)
Fetch-and-Add(s): int rmw(s, NULL, value, ADD)
Semaphore wait: <wait until s > 0, s = s - 1>
P(s): int rmw(s, ">", 0, DEC)
Semaphore up: <s = s + 1>
V(s): int rmw(s, NULL, NULL, ADD)
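The conditional read-modify-write can be modelled in software as a single indivisible step. The sketch below is a single-threaded model under stated assumptions: instead of blocking as the await semantics would, rmw_try reports whether the condition held, and it always hands back the value read before the operation. The names rmw_try, rmw_relation and rmw_operation are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { REL_NONE, REL_EQ, REL_GT } rmw_relation;            /* NULL, ==, > */
typedef enum { OP_INC, OP_DEC, OP_ADD, OP_SET, OP_RESET } rmw_operation;

/* One conditional read-modify-write step, as a memory-side controller
   might execute it. Returns true if the condition held and the operation
   was applied. *old always receives the value read before the operation. */
static bool rmw_try(int *variable, rmw_relation rel, int value,
                    rmw_operation op, int *old) {
    assert(variable != NULL && old != NULL);
    bool cond = (rel == REL_NONE)
             || (rel == REL_EQ && *variable == value)
             || (rel == REL_GT && *variable > value);
    *old = *variable;
    if (!cond) return false;            /* a blocking version would wait here */
    switch (op) {
    case OP_INC:   *variable += 1;      break;
    case OP_DEC:   *variable -= 1;      break;
    case OP_ADD:   *variable += value;  break;   /* value doubles as the addend */
    case OP_SET:   *variable = 1;       break;
    case OP_RESET: *variable = 0;       break;
    }
    return true;
}
```

In this model, Test-and-Set(lock) corresponds to rmw_try(&lock, REL_EQ, 0, OP_SET, &old) and P(s) to rmw_try(&s, REL_GT, 0, OP_DEC, &old), with the caller retrying while the result is false.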
As discussed previously, the implementation of atomic operations needs the support of the architecture.
To implement this general atomic conditional operation, we can add some special hardware
logic (basically an adder and some control logic) in the memory module [28] to realize the following
instruction format:
{Variable-Address; (Condition)*; Operation on Variable}. Otherwise, one
can use the resource's processor, but this incurs a lot of back-and-forth communication between the processor
and the memory. In addition, before doing so, one has to lock the variable first. The
communication overhead is very high.
[Figure: block diagram with the labels Address, Opcode, Controller, Memory, ALU and Mux.]
Figure 3.3: An Implementation Scheme of The Atomic Read-Modify-Write
A possible implementation scheme is shown in figure 3.3. The microprocessor that wants to operate
on the synchronization variable sends a special message to the shared memory with the variable
address and the operation code representing the condition and the operation in the function call.
The atomicity is achieved at the memory level. The memory has a small control unit (with a limited
number of states) and a data path (fixed-point arithmetic) to realize the conditional operation accordingly. To
implement blocking semantics for semaphores, a queue associated with a semaphore may be built
additionally. The advantage of this general atomic operation is that it does not need any special
instruction support from the microprocessor. Thus it is potentially scalable across heterogeneous
microprocessors. One limitation of this approach is that the synchronization variable is non-cacheable.
This sounds like a drawback, but the overhead of implementing cache coherency over an interconnected
on-chip network is so high that researchers are experimenting with cache-less memory, for example
VTT's ECLIPSE [29].
3.3 An Example of NoC-AL Program
Suppose there is a simple application illustrated by the task graph in ﬁgure 3.4.(a). Assume we have a NoC
which only consists of two resources, a SHARC DSP and an ARM microprocessor.
We manually map the processes P11 and P12 to R1, the SHARC DSP, the process P21 to R2, the ARM
CPU. Using the proposed primitives, a NoC-AL program might be coded as follows:
NoC Architecture {
  Topology: mesh 1 x 2
  Resource List: Row1: R1=SHARC DSP, R2=ARM CPU
}
NoC Application {
  R1: {
    #include <NoC-AL-SHARC.h>
    #include <f1.h>
    #include <f2.h>
    double in1, in2, out, x, y;
    int ch1 = -1, ch2 = -1, ach2 = 0;   // a valid channel descriptor is nonnegative
    Process P11 {
      while (1) {
        x = f1(in1, in2);
        while (ch1 < 0) { ch1 = channel(P11, P21); }   // Open channel ch1 until success
        while (send(ch1, x) != 1) { continue; }        // Wait for send to ch1 success
      }
      while (close(ch1) != 1) { continue; }            // Close channel ch1
    }
    Process P12 {
      while (1) {
        while (ach2 <= 0) { ach2 = accept(&ch2); }     // Wait for channel ch2 accepted
        while (receive(ch2, y) != 1) { continue; }     // Wait for receive from ch2 success
        out = f2(y);
      }
    }
  }
  R2: {
    #include <NoC-AL-ARM.h>
    #include <f3.h>
    double a, b;
    int ch1 = -1, ch2 = -1, ach1 = 0;
    Process P21 {
      while (1) {
        while (ach1 <= 0) { ach1 = accept(&ch1); }     // Wait for channel ch1 accepted
        while (receive(ch1, a) != 1) { continue; }     // Wait for receive from ch1 success
        b = f3(a);
        while (ch2 < 0) { ch2 = channel(P21, P12); }   // Open channel ch2 until success
        while (send(ch2, b) != 1) { continue; }        // Wait for send to ch2 success
      }
      while (close(ch2) != 1) { continue; }            // Close channel ch2
    }
  }
}
[Figure: task graph of the example in two mappings. (a) SHARC DSP + ARM CPU: processes P11 (f1) and P12 (f2) on R1: SHARC DSP, process P21 (f3) on R2: ARM CPU; ch1 carries x from P11 to a in P21, ch2 carries b from P21 to y in P12; in1 and in2 enter P11 and out leaves P12. (b) FPGA + ASIC: the same task graph on R1: FPGA and R2: ASIC.]
Figure 3.4: An Example of NoC Application in Task Graph
Here the channel, accept, send, receive and close primitives are simplified by omitting the additional
arguments. These primitives are implemented in the software libraries "NoC-AL-SHARC.h" for the SHARC DSP and
"NoC-AL-ARM.h" for the ARM CPU. Both libraries are used as include files.
We should note that the task graph can also be mapped onto hardware resources, or onto both hardware
and software execution resources. If we map the processes to hardware resources, say, an FPGA and an ASIC,
the primitives are used similarly, but implemented in VHDL/Verilog/SystemC libraries depending on which
language we use to describe the hardware processes. Figure 3.4.(b) reflects one possible mapping.
Accordingly its architecture and application description will be modified. We show its architecture
description below:
NoC Architecture {
Topology: mesh 1 x 2
Resource List: Row1: R1=FPGA, R2=ASIC}
Chapter 4
NoC-AL Implementation Issues
In this chapter we address the implementation issues of the NoC-AL communication primitives. We focus
on the following two points:
• Language binding: the proposed primitives are language-independent. Although they appear in a
form similar to C syntax and C conventions, they are independent of any specific design language.
They can be bound to a specific hardware/software design language, such as VHDL, C,
or SystemC. In this chapter, we take VHDL and C to discuss binding the primitives to hardware and
software, respectively.
• Layered implementation: the primitives are at a high abstraction level. To be incrementally
synthesized, they should have a clear layering from their abstract definitions down
to implementations. To this end we adopt the SystemC communication layers.
4.1 Language Binding
There are two issues concerning binding the abstract primitives to target languages. One is data type
mapping, the other being expression of primitives.
4.1.1 Data Type Mappings
Data type is fundamental to any design language. Design languages differ in how they represent data types
and the operations on them, because they adopt different syntax and assume different capacities of the
underlying processing elements, such as microprocessors and FPGAs. For example, a floating-point variable
can be defined as single precision or double precision in C; in VHDL, it is defined as a real type. An
integer is defined as int in C; in VHDL, an integer may be defined with a constrained range, as a subtype, in
order to be synthesized effectively.
Due to heterogeneity, NoC resources may work with various data lengths and data types (16-bit, 32-bit,
fixed-point, floating-point etc.), and with various representations for the same or similar data types in different
design languages. To allow data interoperability among heterogeneous resources, we do the following: (1)
use uniform data types, which serve as an abstraction of data types, each of which has a direct mapping to a
data type in the target design language; (2) explicitly send the uniform data type with each message. One place
to handle the mappings is the NoC assembler.
From a language point of view, there are basically four classes of data types as described in the VHDL
1076 speciﬁcation:
1. Scalar types represent a single numeric value or, in the case of enumerated types, an enumeration
value. They are ordered in some way so that relational operations (such as greater than, less than,
etc.) can be applied to them. The standard types that fall into this class are integer, real (floating
point) and enumerated types.
2. Composite types represent a collection of values. Basic composite types are arrays containing
elements of the same type and records containing elements of different types. Composite data types
offer freedom for users to define custom data types.
3. Access/Pointer types provide references to objects/data.
4. File types reference objects (typically disk files) that contain a sequence of values.

NoC DataType    C DataType     VHDL DataType
NoC_CHAR        signed char    string
NoC_INT         signed int     integer
NoC_FLOAT       float          real
NoC_DOUBLE      double         real
NoC_WORD        pointer        bit/std_logic_vector
NoC_Array       array          array
NoC_Struct      struct         record
Table 4.1: Data Type Mappings
NoCs as heterogeneous systems should adopt uniform data types, which are used as an intermediary to
map into target data types. Table 4.1 shows the mappings from some NoC data types to their corresponding
VHDL/C data types. This table is by no means complete. Further elaboration is needed to introduce
some other useful data types, for example, packed or encrypted data types.
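On the C side, the explicit type tag sent with a message can be used to interpret the message size. The following sketch covers only the four scalar rows of the mapping table. The enumerator spellings (NOC_CHAR and so on) and the helper names noc_c_sizeof and noc_element_count are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Uniform scalar type tags; the spelling of the enumerators is assumed. */
typedef enum { NOC_CHAR, NOC_INT, NOC_FLOAT, NOC_DOUBLE } noc_type;

/* Size in bytes of the C type that a uniform type maps to on this target. */
static size_t noc_c_sizeof(noc_type t) {
    switch (t) {
    case NOC_CHAR:   return sizeof(signed char);
    case NOC_INT:    return sizeof(signed int);
    case NOC_FLOAT:  return sizeof(float);
    case NOC_DOUBLE: return sizeof(double);
    }
    assert(!"unknown noc_type");
    return 0;
}

/* Number of elements of type t in a message of msg_size bytes,
   or -1 if the size is not a whole multiple of the element size. */
static long noc_element_count(noc_type t, size_t msg_size) {
    size_t width = noc_c_sizeof(t);
    if (width == 0 || msg_size % width != 0) return -1;
    return (long)(msg_size / width);
}
```

Because the sizes come from sizeof on the receiving target, a receiver on a different processor would compute its own widths; any conversion between representations is outside this sketch.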
4.1.2 Expressions of Primitives
The primitives are abstract. They can be bound to any hardware and software design language. To bind the
primitives to a particular language, we should follow the syntax of the target language.
First we express the two data types channel_feature and process in C and VHDL.
• The channel_feature data type
struct channel_feature
{direction, burstiness, latency, bandwidth, quality_class, reliability}
– Data type channel_feature in C:
typedef enum {Periodic, Aperiodic} burstKind;
typedef struct {burstKind kind; int MinCycle; int MaxBurstLength;} burstType;
typedef enum {Absolute, Relative} latencyKind;
typedef struct {float min_latency; float avg_latency; float max_latency;} latencyValue;
typedef struct {latencyKind kind; latencyValue value;} latencyType;
typedef struct {int min_bandwidth; int avg_bandwidth; int max_bandwidth;} bandwidthType;
typedef struct {
  int direction; burstType burstiness; latencyType latency;
  bandwidthType bandwidth; int quality_class; int reliability;
} channel_feature;
– Data type channel_feature in VHDL:
TYPE burstKind IS (Periodic, Aperiodic);
TYPE burstType IS RECORD
  kind: burstKind; MinCycle: integer; MaxBurstLength: integer;
END RECORD;
TYPE latencyKind IS (Absolute, Relative);
TYPE latencyValue IS RECORD
  min_latency: real; avg_latency: real; max_latency: real;
END RECORD;
TYPE latencyType IS RECORD
  kind: latencyKind; value: latencyValue;
END RECORD;
TYPE bandwidthType IS RECORD
  min_bandwidth: integer; avg_bandwidth: integer; max_bandwidth: integer;
END RECORD;
TYPE channel_feature IS RECORD
  direction: integer; burstiness: burstType;
  latency: latencyType; bandwidth: bandwidthType;
  quality_class: integer; reliability: integer;
END RECORD;
• The process data type
A process is denoted Pij, meaning process j on resource i. We define the data type process in
C and VHDL as follows:
– Data type process in C:
typedef struct {int rscNumber; int prsNumber;} process;
– Data type process in VHDL:
TYPE process IS RECORD
  rscNumber: integer; prsNumber: integer;
END RECORD;
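To illustrate how the two C data types might be filled in before a channel is opened, the sketch below repeats the typedefs and adds one example value and one sanity check. The numeric values, their units, and the encoding of direction, quality_class and reliability are placeholders, since the report does not fix them. The helpers example_feature and feature_is_consistent are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { Periodic, Aperiodic } burstKind;
typedef struct { burstKind kind; int MinCycle; int MaxBurstLength; } burstType;
typedef enum { Absolute, Relative } latencyKind;
typedef struct { float min_latency; float avg_latency; float max_latency; } latencyValue;
typedef struct { latencyKind kind; latencyValue value; } latencyType;
typedef struct { int min_bandwidth; int avg_bandwidth; int max_bandwidth; } bandwidthType;
typedef struct {
    int direction; burstType burstiness; latencyType latency;
    bandwidthType bandwidth; int quality_class; int reliability;
} channel_feature;
typedef struct { int rscNumber; int prsNumber; } process;

/* Check that min <= avg <= max holds for both latency and bandwidth. */
static int feature_is_consistent(const channel_feature *f) {
    assert(f != NULL);
    return f->latency.value.min_latency <= f->latency.value.avg_latency
        && f->latency.value.avg_latency <= f->latency.value.max_latency
        && f->bandwidth.min_bandwidth <= f->bandwidth.avg_bandwidth
        && f->bandwidth.avg_bandwidth <= f->bandwidth.max_bandwidth;
}

/* An example request; all numbers and encodings are placeholders. */
static channel_feature example_feature(void) {
    channel_feature f = {
        .direction = 0,
        .burstiness = { .kind = Periodic, .MinCycle = 10, .MaxBurstLength = 4 },
        .latency = { .kind = Absolute, .value = { 1.0f, 2.0f, 5.0f } },
        .bandwidth = { 100, 200, 400 },
        .quality_class = 1,
        .reliability = 1
    };
    return f;
}
```

A process such as P11 would then be written as process p11 = {1, 1}, and both values passed to channel() together with the feature struct.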
In the following we give expressions of the communication primitives one by one in the two languages
C and VHDL:
• Open a channel
int channel(source_process, destination_process, channel_feature)
– Function channel definition in C:
int channel(process source_process, process destination_process,
            channel_feature channel_type);
– Function channel definition in VHDL:
FUNCTION channel(source_process: process;
                 destination_process: process;
                 channel_type: channel_feature) RETURN integer;
• Listen to channel
int listen(maxQueueLimit)
– Function listen definition in C:
int listen(int MaxQueueLimit);
– Function listen definition in VHDL:
FUNCTION listen(MaxQueueLimit: integer) RETURN integer;
• Accept a channel
int accept(channel)
– Function accept definition in C:
int accept(int *channel);
– Function accept definition in VHDL:
FUNCTION accept(channel: access) RETURN integer;
• Bind a channel
int bind(expected_channel, channel)
– Function bind definition in C:
int bind(int expected_channel, int *channel);
– Function bind definition in VHDL:
FUNCTION bind(expected_channel: integer; channel: access)
RETURN integer;
• Send message
int send(channel, msg, msg_size, msg_type, msg_id, sync_flag, timer, out_of_band, request)
– Function send definition in C:
int send(int channel, void *msg, int msg_size, int msg_type,
         int msg_id, int sync_flag, float timer, int out_of_band,
         int *request);
– Function send definition in VHDL:
FUNCTION send(channel: integer; msg: access; msg_size: integer;
              msg_type: integer; msg_id: integer; sync_flag: bit;
              timer: real; out_of_band: bit;
              request: access) RETURN integer;
• Receive message
int receive(channel, msg, msg_size, msg_type, msg_id, sync_flag, timer, request)
– Function receive definition in C:
int receive(int channel, void *msg, int msg_size, int msg_type,
            int msg_id, int sync_flag, float timer, int *request);
– Function receive definition in VHDL:
FUNCTION receive(channel: integer; msg: access; msg_size: integer;
                 msg_type: integer; msg_id: integer; sync_flag: bit;
                 timer: real; request: access) RETURN integer;
• Check nonblocking completion
int check(request, status)
– Function check definition in C:
int check(int *request, int status);
– Function check definition in VHDL:
FUNCTION check(request: access; status: bit) RETURN integer;
• Close a channel
int close(channel)
– Function close definition in C:
int close(int channel);
– Function close definition in VHDL:
FUNCTION close(channel: integer) RETURN integer;
The following are expressions of the shared memory primitives in C and VHDL.
• The memory_type data type is defined as:
struct memory_type
{read: multiple | single; write: single;}
Since only one writer is ever permitted to access a memory, we only need to define the read mode.
– Data type memory_type in C:
typedef enum {MultipleRead, SingleRead} memoryType;
– Data type memory_type in VHDL:
TYPE memoryType IS (MultipleRead, SingleRead);
• Memory allocation
int memory(resource, start_address, end_address, memory_type)
– Function memory definition in C:
int memory(int resource, int start_address, int end_address, memoryType memory_type);
– Function memory definition in VHDL:
FUNCTION memory(resource: integer; start_address: integer;
                end_address: integer; memory_type: memoryType)
RETURN integer;
int memory(resource, number, memory_type)
– Function memory definition in C:
int memory(int resource, int number, memoryType memory_type);
– Function memory definition in VHDL:
FUNCTION memory(resource: integer; number: integer;
                memory_type: memoryType) RETURN integer;
• Memory access – Read one or multiple words
int read(memory, number, start_address, dataarray, flag)
– Function read definition in C:
int read(int memory, int number, int start_address, void *dataarray, int flag);
– Function read definition in VHDL:
FUNCTION read(memory: integer; number: integer;
              start_address: integer; dataarray: access; flag: bit)
RETURN integer;
• Memory access – Write one or multiple words
int write(memory, number, start_address, dataarray, flag)
– Function write definition in C:
int write(int memory, int number, int start_address, void *dataarray, int flag);
– Function write definition in VHDL:
FUNCTION write(memory: integer; number: integer;
               start_address: integer; dataarray: access; flag: bit)
RETURN integer;
• Memory release
int free(memory)
– Function free definition in C:
int free(int memory);
– Function free definition in VHDL:
FUNCTION free(memory: integer) RETURN integer;
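The descriptor and return-code conventions of the shared memory primitives can be tried out on a host with a small mock. The sketch below is not the NoC implementation: it models one memory resource as a static byte pool, drops the resource, memory_type and flag arguments, never reclaims released bytes, and prefixes the names with noc_ to avoid clashing with the C library's read, write and free. The specific negative codes are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_BYTES   64
#define MAX_SEGMENTS 4

typedef struct { int in_use; int start; int size; } segment;

static unsigned char pool[POOL_BYTES];      /* the mock memory resource */
static segment segments[MAX_SEGMENTS];      /* descriptor table */
static int pool_used = 0;

static int descriptor_ok(int memory) {
    return memory >= 0 && memory < MAX_SEGMENTS && segments[memory].in_use;
}

static int range_ok(int memory, int number, int start_address) {
    return number > 0 && start_address >= 0
        && start_address <= segments[memory].size - number;
}

/* Allocate `number` bytes; returns a nonnegative descriptor or a negative code. */
static int noc_memory(int number) {
    if (number <= 0 || number > POOL_BYTES - pool_used) return -2;   /* memory full */
    for (int d = 0; d < MAX_SEGMENTS; d++) {
        if (!segments[d].in_use) {
            segments[d] = (segment){ 1, pool_used, number };
            pool_used += number;
            return d;
        }
    }
    return -1;                                                       /* no free descriptor */
}

/* Write `number` bytes from dataarray; returns the byte count or a negative code. */
static int noc_write(int memory, int number, int start_address, const void *dataarray) {
    assert(dataarray != NULL);
    if (!descriptor_ok(memory)) return -1;
    if (!range_ok(memory, number, start_address)) return -2;        /* address error */
    memcpy(pool + segments[memory].start + start_address, dataarray, (size_t)number);
    return number;
}

/* Read `number` bytes into dataarray; returns the byte count or a negative code. */
static int noc_read(int memory, int number, int start_address, void *dataarray) {
    assert(dataarray != NULL);
    if (!descriptor_ok(memory)) return -1;
    if (!range_ok(memory, number, start_address)) return -2;        /* address error */
    memcpy(dataarray, pool + segments[memory].start + start_address, (size_t)number);
    return number;
}

/* Release a descriptor; returns 1 on success, -1 on failure. */
static int noc_free(int memory) {
    if (!descriptor_ok(memory)) return -1;
    segments[memory].in_use = 0;
    return 1;
}
```

Addresses in this mock are offsets within the allocated segment, which is one possible reading of start_address; an implementation could equally use absolute addresses in the resource.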
4.2 Layered Implementation
Layering is a powerful means to cope with complexity.
4.2.1 A Standard Interface
To enable IP integration onto a NoC backbone, a standard interface is a cheap solution. In figure 1.2, the
interconnection is wrapped by NIs which speak the standard protocol. The resources are wrapped by RNIs
that speak the same protocol. If the resources are pure hardware execution resources, like FPGAs and
ASICs, the RNIs are parts of the hardware resources. If the resources are software execution resources, like
DSPs and CPUs, we assume they have a local memory bus structure. In this case, a bus
bridge is needed to translate between the specific bus protocol and the standard protocol. A good candidate
for the standard protocol may be the OCP protocol [30] or the VCI protocol [31]. The standard protocol acts
as an interconnection protocol. All IPs wrapped by this protocol can simply plug and play on the
interconnection platform, as shown in figure 4.1. For some applications, there may be more than one
standard interface.
[Figure: several IP blocks, each attached through a standard interface to the interconnection protocol, which may sit on top of any interconnection mode.]
Figure 4.1: A Standard Protocol Enables IP Reuse
4.2.2 Implementation of Primitives in the OSI Layers
For a hardware execution resource, a HW RNI is responsible for implementing the communication primitives.
For a software execution resource, a SW RNI, which can also be called communication stubs,
implements the communication primitives. The SW RNI is built on the operating system, if any. Otherwise,
if there is no operating system, the function libraries may contain implementations of the primitives.
A device driver may be needed if the local microprocessor system connects to the NoC backbone through
an I/O device. The SW RNI is illustrated in figure 4.2, where we assume that the device driver is part of
the operating system.
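[Figure: block diagram with the labels Application processes, Primitives, Software RNI (Com. stubs), OS, Microprocessor, Memory, I/O, Bus, Bridge and NI; the local bus protocol applies on the resource side and the interconnection protocol on the NI side.]
Figure 4.2: Software Implementation of Primitives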
In ISO's OSI seven-layer model, a resource implements the upper four layers, i.e. from the application
layer down to the transport layer, but not the lower three layers, i.e. from the network layer down to
the physical layer. This is in contrast to computer networks, where each computer node has a full
implementation of the seven layers. In resources, HW/SW RNIs should implement both the session layer and
the transport layer. For some custom FPGA/ASIC-type hardware resources, the transport layer RNI may
be standardized while the session layer RNI is customized. This separation keeps the transport layer
relatively stable, so that only the session layer RNI needs to be customized. The same argument
may also apply to some custom SW RNIs. The NoC switches, which route packets from source to destination,
implement the network layer, the data link layer, and the physical layer. The basic assumption of
packet-switched network communication is unreliability. Unreliability usually shows up as data
loss due to network congestion, data corruption due to interference, coupling etc. at the physical layer,
out-of-sequence data delivery due to routing packets along different routes, and data duplication due to
extra retransmission. Each layer should provide services of its own to achieve a certain degree of reliable
data delivery, e.g. the data link layer may have error detection and/or error correction. If the network layer
does not have a reliable transmission scheme, the transport layer needs to provide one, usually
by message labeling, acknowledgment and retransmission.
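As one concrete instance of error detection at the data link layer, a single even-parity bit per byte detects any single-bit corruption. The sketch below is only an illustration of the idea; the report does not say which detection code a NoC would use, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Even parity: the returned bit makes the total number of 1s
   (data bits plus parity bit) even. */
static unsigned even_parity_bit(uint8_t byte) {
    unsigned ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (byte >> i) & 1u;
    return ones & 1u;
}

/* Receiver-side check: recompute the parity and compare with the received bit. */
static int parity_ok(uint8_t byte, unsigned parity) {
    assert(parity <= 1u);
    return even_parity_bit(byte) == parity;
}
```

A single parity bit misses any even number of flipped bits, which is why stronger codes such as checksums or CRCs are common where burst errors are expected.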
Offering network IPC, the proposed primitives deal with the session layer. Applications directly use
the primitives as APIs to program NoC communications. Let us analyze some features of the primitives'
layer from an implementation angle:
• It allows abstract data types. The four classes of data types can be used directly.
• It does not incorporate interfaces with both the initiator and the target.
• The control scheme for starting communication is a request/response pair, which has different
implications under different channel requirements.
• Communication behavior is untimed. The synchronization scheme for send/receive is either blocking
or nonblocking.
4.2.3 Implementation Layers
As mentioned previously, to guarantee interoperability and compatibility among IP cores, a standard
interface is required. Here we adopt the OCP protocol, which aims to be a hardware interconnect standard for IP
cores and interconnect models to facilitate a true plug-and-play methodology [30]. In addition, a white paper
on SystemC-Based SoC Communication Modeling for the OCP Protocol has been published [32]. Its
conceptual framework for modeling communication is based on channel communication. The channel
implements the communication between two modules, where a module is an initiator, a target, or both, as
illustrated in figure 4.3. It is generic in the following aspects:
[Figure: Module 1 (Initiator) and Module 2 (Target) connected by a Channel; the diagram also labels a Port and an Interface.]
Figure 4.3: A Communication Channel Connecting An Initiator and A Target
• No assumptions are made about the communication protocol between the two modules. The
channel just implements the communication; the protocol must be implemented in the modules.
• The same channel can be used at different abstraction layers.
The white paper defines a four-layer communication abstraction, which allows refinement down to
the RTL level. Each communication layer consists of three agents: Initiator, Channel and Target. The Initiator
is connected to the Target with the Channel, through an Interface. The interface presents the target and
the initiator with the services the channel offers; thus the channel implementation can be modified without
the target and the initiator knowing, as long as the minimum services required by the initiator and target
are provided. The communication layers support a true interface-based design methodology. The
interoperability of initiators and targets from different communication abstraction layers can be implemented with
adapter components such as a layer wrapper or multi-layer capable channels.
[Figure: stack of four communication layers, from top to bottom, each annotated with what its abstraction removes:
Message Layer (L-3): resource sharing, time
Transaction Layer (L-2): clock, protocols
Transfer Layer (L-1): wires, registers
RTL Layer (L-0): gates, gate/wire delays]
Figure 4.4: The Stack of Communication Layers
The layers are deﬁned as follows [32]:
• Layer-3 – Message Layer
Layer-3 systems are untimed. The system executes in an event-driven manner. A single message transmission
between initiator and target involves the transfer of several data items, which can be of very abstract data
types. This layer provides point-to-point initiator-target connections.
• Layer-2 – Transaction Layer
Layer-2 systems are timed, but not cycle-accurate. The system executes in an event-driven manner. A single
transaction between initiator and target involves the transfer of several data items (i.e. a burst, or a partial
burst of data). Normally such systems are independent of bus protocols, since bus protocols can only be
implemented with cycle-true systems.
• Layer-1 – Transfer Layer
Layer-1 systems are characterized by cycle-true behavior. Layer-1 channels provide fully cycle-
and protocol-accurate connectivity. Most layer-1 functionality can be achieved in RTL. However, the
benefits of Layer-1 over RTL are: a simpler netlist, with only a single wire for the whole communication
interface; a netlist that does not need to change when parameters (communication protocol or
functionality) change; and faster simulation as well as simpler interface code.
• Layer-0 – RTL
The RTL layer is pin/bit accurate and register transfer accurate. It is written in final
VHDL/Verilog/Synthesizable SystemC.
As we move down the layers, we get closer to a final implementation. Data types proceed from abstract
data types, to bursts of data, to single data items, and finally to bit vectors. Control flow proceeds
from blocking/nonblocking semantics to clocked operations, and time is added, going from untimed to
timed systems. This is a refinement procedure from abstraction down to implementation. The SystemC
channel and the communication layers are therefore well suited to our implementation purposes. We share
the same concept of a channel at a high abstraction level, so the SystemC channel can be used for our
channel implementation. The communication layers also allow refinement and thus enable an incremental
design principle. In addition to the definitions of the communication layers, [32] proposes library
functions for the top three layers. One implication is that we can use these functions to program a
SystemC implementation of the communication primitives at each of the communication layers. Ideally,
such an implementation could then be synthesized by industrial tools. Because the library functions are
not yet stable, we cannot do this at present. However, we believe that a layered implementation approach
for communication refinement will play an important role in communication-centric designs in the near
future.
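The data refinement described above, from an abstract data type down to a bit vector, can be
illustrated with a small Python sketch. The `Message` type, the burst length, and the 32-bit word width
are assumptions chosen for illustration; they are not taken from [32] or from any SystemC library.

```python
from dataclasses import dataclass
from typing import List

WORD_BITS = 32    # assumed word width
BURST_WORDS = 4   # assumed burst length, in words


@dataclass
class Message:
    """Layer-3 view: an abstract, untimed message."""
    payload: bytes


def to_bursts(msg: Message) -> List[List[int]]:
    """Layer-2 view: the payload split into bursts of words."""
    word_bytes = WORD_BITS // 8
    data = msg.payload
    # Pad with zero bytes so the payload is a whole number of words.
    data += b"\x00" * ((-len(data)) % word_bytes)
    words = [int.from_bytes(data[i:i + word_bytes], "big")
             for i in range(0, len(data), word_bytes)]
    return [words[i:i + BURST_WORDS] for i in range(0, len(words), BURST_WORDS)]


def to_words(bursts: List[List[int]]) -> List[int]:
    """Layer-1 view: one word per transfer."""
    return [word for burst in bursts for word in burst]


def to_bit_vector(word: int) -> str:
    """Layer-0 view: a word as a bit vector, most significant bit first."""
    return format(word, f"0{WORD_BITS}b")
```

Each function removes one level of abstraction, so the same payload can be inspected at whichever
layer a model is currently written.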
4.2.4 Channel Features and their Required Actions
An important aspect of implementation is the set of channel characteristics, or features. What are the
implications of the channel features? What should an implementation do in response to a channel feature
requirement? This subsection answers these questions, feature by feature. When speaking of the channel
setup/tear-down phase, we use the terms initiator/source and target/destination. When talking about
data transmission, we use the terms sender and receiver.
1. Direction. There are three options: simplex, full-duplex, and half-duplex. A simplex channel is used
in a non-acknowledged producer-consumer paradigm, where data is always sent from a producer to a
consumer. A duplex channel serves interacting peers and client-server paradigms, where the sender and
receiver interact with each other. With the third option, half-duplex, a sender can send a control
message to a receiver to switch the roles of sending and receiving.
Direction refers only to the data transmission phase. During the channel setup and close phases,
two-way communication may be needed.
2. Burstiness. It may have a great impact on power savings. If transmission is bursty, with a given
period and burst length, the intermediate nodes and the receiver may switch to idle or other lower-power
states during the no-transmission phase. If transmission is random, doing so may not be efficient.
During the channel establishment phase, the burstiness parameter can be used to direct the receiver
and the channel nodes (if there is a virtual or dedicated channel path) to enter power-aware states.
3. Latency. A pair of time values $\{minL, maxL\}$ is set by an application. It may be required to meet
the deadline constraints of real-time applications. Suppose we are using absolute time values. To check
whether a channel meets this requirement, the initiator, which negotiates with the network, needs to
measure the round-trip time. The initiator transmits a few single units of data to the target and waits
for acknowledgments. Upon receiving each acknowledgment, it calculates the round-trip time. The
initiator also starts a timer that bounds the maximum waiting time for the acknowledgments. The
calculated latency value is half of the average of these round-trip times. If the calculated latency
value falls within the range $\{minL, maxL\}$, the latency requirement is satisfied; otherwise, it
fails. This latency check procedure is illustrated in figure 4.5. The destination should acknowledge
the latency-checking messages as soon as possible. The latency check is carried out during the
establishment phase of a connection-oriented channel.
[Figure: flow chart. Start; set (minL, maxL); send n bytes; start a timer = 2*n*maxL; receive
acknowledgements until all acknowledgements are received or the timer expires; calculate each
round-trip time t; calculate the average round-trip time realA; test minL <= realA <= maxL. If yes,
return (1) latency met and (2) (min, avg, max) latency; if no, return (1) latency not met and
(2) (min, avg, max) latency.]
Figure 4.5: Latency Check
4. Bandwidth. A pair of bandwidth values $\{minB, maxB\}$ is set by an application to check whether a
channel can be established that meets this requirement. After fixing a circuit path, the initiator sends
$maxB$ bytes to the target within one second, waits for acknowledgments for up to two seconds, and then
calculates the real allowable bandwidth $realB$ from the number of acknowledgments. Here we assume that
messages are individually acknowledged. If the real bandwidth value $realB$ falls within the range
$\{minB, maxB\}$, the channel meets this constraint. If the calculated value is lower than $minB$, the
initiator is warned that the channel bandwidth cannot be satisfied. The bandwidth check procedure is
illustrated in figure 4.6. Similarly, the destination should acknowledge the bandwidth-checking messages
as soon as possible. Bandwidth can only be reserved with a connection-oriented channel during the
channel establishment phase.
[Figure: flow chart. Start; set (minB, maxB); send maxB bytes in 1 s; start a timer = 2 s; receive
acknowledgements until all are received or the timer expires; calculate the number of acknowledgements;
calculate the real bandwidth realB; test minB <= realB <= maxB. If yes, return (1) bandwidth satisfied
and (2) real bandwidth; if no, return (1) bandwidth not met and (2) real bandwidth.]
Figure 4.6: Bandwidth Check
From the above description, we see that both the latency check and the bandwidth check take place after
a virtual circuit is granted, i.e., (1) fix a virtual circuit path by sending the setup message using
the best-effort service and waiting for the acknowledgment; (2) check the latency and the bandwidth
along this circuit path using the procedures in figure 4.5 and figure 4.6, respectively. These two steps
may be iterated until the results are accepted, as shown in figure 4.7. This negotiation process is
conducted in the session layer.
[Figure: flow chart. Start; set latency and bandwidth values; fix a virtual circuit path; latency and
bandwidth check; accept results? If no, iterate; if yes, return (1) channel setup success or failure
and (2) bandwidth and latency values.]
Figure 4.7: Negotiation for Latency and Bandwidth During Channel Setup
5. Quality Class (QC). Reflecting an application requirement, the quality class is defined for
scheduling purposes. It is used in three ways. (1) For best-effort datagrams. When contending for a
communication link, a higher-QC datagram is routed first or buffered ahead of a lower-QC datagram.
This happens when routing connection-less channel datagrams and system configuration or network
management datagrams, and when establishing connection-oriented channels, whose setup messages are
routed using the best-effort service. (2) For virtual circuit packets. If channels overlap somewhere
(part of a virtual circuit is shared), higher-QC packets are switched out first. (3) For local resource
sharing. One resource may have established multiple connection-oriented channels that talk to the
transport layer of the resource. Although the bandwidth of all these channels may be guaranteed, the
latency may be satisfied with commitment or with relaxed commitment. In the latter case, the channel
does its best to deliver messages, and the upper bound for message delivery is the worst-case latency
along the virtual circuit path. In a resource with multiple open connection-oriented channels, messages
from a higher-QC channel are scheduled first. The QC is clearly related to Quality of Service, since a
connection-oriented channel and a connection-less channel differ in their commitment to bandwidth and
latency. Both the transport layer and the network layer need to be aware of the class levels. We define
the following four quality classes according to how the channel bandwidth and latency requirements are
fulfilled.
• QC3: Neither bandwidth nor latency requires a guarantee.
• QC2: Bandwidth needs to be guaranteed, but the latency commitment may be relaxed.
• QC1: Both bandwidth and latency need to be guaranteed.
• QC0: The highest class is reserved for network management, system reconfiguration, or initialization
messages. For example, the failure of a switch node makes it necessary to dynamically adapt the routes
of the virtual circuits. In such circumstances, the NoC works in a supervising mode. Some of the
virtual circuit services, even if granted, may be temporarily interrupted.
6. Reliability. We have defined four levels of reliability, which put different requirements on the
initiator, the channel nodes, and the target during the channel setup, data transmission, and channel
tear-down phases.
                      $L_1$: $S_rT_r$    $L_2$: $S_uT_r$    $L_3$: $S_rT_u$    $L_4$: $S_uT_u$
Connection            con.-oriented      con.-oriented      connection-less    connection-less
                      dedicated          dedicated
Session direction     duplex             simplex            duplex             simplex
Msg. acknowledgment   yes                no                 yes                no
Msg. labeling         no                 no                 labeled            no
Msg. sequence         maintained         maintained         maintained         no promise
Msg. retransmission   no                 no                 done by the        no (up to the
                                                            session layer      application)
Session sender        no copy            no copy            keep copy          no copy
Intermediate nodes    be aware           be aware           not aware          not aware
Session receiver      ack.               no ack.            ack.               no ack.
Table 4.2: Comparisons on Reliability Levels
• $L_1$: $S_rT_r$. It is built on a connection-oriented packet-switching network or a
dedicated-connection circuit-switching network. The source application process sends messages to the
session layer and waits for acknowledgments from the destination application process. Since the lower
layer, the transport layer, is reliable, the session layer does not need to retransmit messages. In
other words, the transport layer promises to deliver messages correctly. If necessary, the transport
layer has to take care of retransmission, and thus of labeling, assembling, and time-outs. The
intermediate nodes in the network must be aware that the channel messages are delivered as
connection-oriented, since they have to follow the same routing path.
• $L_2$: $S_uT_r$. The transport layer is reliable, but the destination application process does not
send back acknowledgments.
• $L_3$: $S_rT_u$. The transport layer is not reliable: it guarantees neither data integrity nor packet
sequence. However, the messages must be delivered to the destination correctly and in order, so the
session layer has to take care of retransmission, and the destination is required to send back
acknowledgments. Achieving this level of reliability results in a costly session layer, because the
unreliability of the transport layer is compensated for at the session layer.
• $L_4$: $S_uT_u$. The source application process sends messages to the session layer, which in turn
hands them over to the transport layer. It is a purely simplex transfer of messages, and the
destination application process does not send back acknowledgments.
Connection and reliability are closely tied to each other. The four levels of reliability are based on
the two basic types of connection: connection-oriented and connection-less. If the interconnect network
is circuit-switched, it provides reliable service. If the interconnect network is packet-switched, it
offers either reliable or unreliable service. We compare the four levels of reliability in table 4.2.
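The latency check (figure 4.5), the bandwidth check (figure 4.6), and the surrounding negotiation loop
(figure 4.7) from items 3 and 4 can be sketched in Python as follows. This is a minimal simulation
under stated assumptions, not an implementation of a NoC session layer: the probe callables
`probe_rtt` and `probe_acks`, the function names, the use of `None` for a timed-out probe, and the
policy of walking through a list of candidate paths are all hypothetical. Following the text, the
latency is taken as half of the average round-trip time, and the real bandwidth is derived from the
number of acknowledgments.

```python
from typing import Optional, Sequence, Tuple

Latency = Tuple[float, float, float]  # (min, avg, max) one-way latency


def latency_check(rtts: Sequence[Optional[float]], min_l: float,
                  max_l: float) -> Tuple[bool, Optional[Latency]]:
    """Latency check on measured round-trip times.

    None stands for a probe whose acknowledgment did not arrive before the
    timer expired (an assumption about how timeouts are reported).
    """
    received = [t for t in rtts if t is not None]
    if not received:
        return False, None
    one_way = [t / 2 for t in received]  # latency = half the round-trip time
    avg = sum(one_way) / len(one_way)
    return min_l <= avg <= max_l, (min(one_way), avg, max(one_way))


def bandwidth_check(acks: int, msg_bytes: int, min_b: int,
                    max_b: int) -> Tuple[bool, int]:
    """Bandwidth check: derive realB from the number of acknowledgments."""
    real_b = acks * msg_bytes  # bytes acknowledged for a one-second send window
    return min_b <= real_b <= max_b, real_b


def negotiate(paths, probe_rtt, probe_acks, min_l, max_l, min_b, max_b,
              n_probes=4, msg_bytes=1):
    """Negotiation loop: try candidate paths until both checks pass."""
    for path in paths:  # step (1): fix a virtual circuit path
        rtts = [probe_rtt(path) for _ in range(n_probes)]
        lat_ok, lat = latency_check(rtts, min_l, max_l)
        acks = probe_acks(path, max_b // msg_bytes)  # send maxB bytes, count acks
        bw_ok, real_b = bandwidth_check(acks, msg_bytes, min_b, max_b)
        if lat_ok and bw_ok:  # step (2): the results are accepted
            return path, lat, real_b
    return None, None, None  # no candidate path met both requirements
```

Passing the probes in as callables keeps the sketch independent of any particular network model; a
simulator or a test can supply fixed round-trip times and acknowledgment counts per path.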
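The scheduling role of the quality classes in item 5 can be illustrated with a priority queue in which
QC0 is served first and QC3 last. The `QcScheduler` class and its first-in, first-out tie-breaking
within a class are assumptions for illustration; the report does not prescribe a queueing discipline.

```python
import heapq
import itertools


class QcScheduler:
    """Serve messages by quality class: QC0 (highest) before QC3 (lowest)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # keeps FIFO order within one class

    def push(self, qc: int, message: str) -> None:
        if qc not in (0, 1, 2, 3):
            raise ValueError("quality class must be 0, 1, 2, or 3")
        heapq.heappush(self._heap, (qc, next(self._arrival), message))

    def pop(self) -> str:
        """Return the pending message with the highest quality class."""
        _, _, message = heapq.heappop(self._heap)
        return message

    def __len__(self) -> int:
        return len(self._heap)
```

The same ordering applies whether the contended resource is a communication link, a shared part of a
virtual circuit, or the transport layer of a resource with several open channels.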
Different levels of reliability have different implications for the two basic sending and receiving
schemes, i.e. blocking and nonblocking send, and blocking and nonblocking receive. We assume that the
session layer (SL) and the transport layer (TL) have their own send and receive buffers. Table 4.3 shows
when the function calls for blocking and nonblocking send/receive return. If a session layer buffer is
available, a message is delivered to the session layer successfully. If a transport layer buffer is
available, a message is delivered to the transport layer successfully.
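The return conditions summarized in Table 4.3 can be encoded as a small lookup, shown here as a Python
sketch. The enum and function names are hypothetical, and the returned strings paraphrase the table
entries. The grouping follows the tables: at levels L1 and L3 the destination acknowledges messages,
and at levels L2 and L4 it does not.

```python
from enum import Enum


class Level(Enum):
    """The four reliability levels."""
    L1 = "SrTr"
    L2 = "SuTr"
    L3 = "SrTu"
    L4 = "SuTu"


# Levels at which the destination sends back acknowledgments (Table 4.2).
ACKNOWLEDGED = {Level.L1, Level.L3}


def returns_when(level: Level, op: str, blocking: bool) -> str:
    """Describe when a send or receive call returns at a given level."""
    acked = level in ACKNOWLEDGED
    if op == "send":
        if not blocking:
            return "message is delivered to SL"
        return "SL receives ack" if acked else "SL delivers message to TL"
    if op == "receive":
        if not blocking:
            return "SL checks its receive buffer"
        if acked:
            return "SL receives message and sends back ack"
        return "SL receives message"
    raise ValueError("op must be 'send' or 'receive'")
```

Only the blocking calls depend on the reliability level; the nonblocking calls return under the same
condition at all four levels.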
                      $L_1$: $S_rT_r$ and $L_3$: $S_rT_u$    $L_2$: $S_uT_r$ and $L_4$: $S_uT_u$
Blocking send         SL receives ack.                       SL delivers msg to TL
Nonblocking send      Msg is delivered to SL                 Msg is delivered to SL
Blocking receive      SL receives msg                        SL receives msg
                      and sends back ack.
Nonblocking receive   SL checks its receive buffer           SL checks its receive buffer
Table 4.3: Reliability Levels and Their Implications
It is worth mentioning that communication protocols are agreements that guarantee the required channel
features, not features themselves. A protocol is a set of control and data structures understood by the
communicating entities for synchronizing the emission and reception of data and for interpreting data.
In ISO's OSI seven-layer reference model, a channel belongs to the session layer, which offers network
interprocess communication services. Implementing a channel with various features requires the support
of communication protocols at the lower layers. We have mentioned that SystemC channels support
hierarchical communication and communication refinement [5]; they thus provide a good reference for NoC
channel implementations. For software implementation, the Berkeley sockets interface [9] [11] [14] was
designed to provide generic access to IPC services implemented by whatever protocols are available on a
particular platform. In this sense, the sockets API is also a good reference for implementing NoC
channels in software.
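As a concrete software reference point, blocking and nonblocking receive semantics can be observed
with the sockets API itself. The sketch below uses Python's standard `socket` module, with a local
socket pair standing in for a channel; mapping a NoC channel onto a socket pair is an assumption made
for illustration, not something this report specifies.

```python
import socket


def blocking_vs_nonblocking_receive():
    """Show both receive styles on a connected pair of local sockets."""
    sender, receiver = socket.socketpair()
    try:
        # Nonblocking receive: the call only checks the receive buffer and
        # raises BlockingIOError when no data is available.
        receiver.setblocking(False)
        try:
            receiver.recv(16)
            buffer_was_empty = False
        except BlockingIOError:
            buffer_was_empty = True

        sender.sendall(b"hello")

        # Blocking receive: the call returns only once data has arrived.
        receiver.setblocking(True)
        data = receiver.recv(16)
        return buffer_was_empty, data
    finally:
        sender.close()
        receiver.close()
```

The nonblocking call corresponds to the "checks its receive buffer" entries of Table 4.3, while the
blocking call returns only after a message has been received.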
Chapter 5
Summary and Future Work
Network-on-Chip is receiving more and more attention in academia, and perhaps in industry. It is
regarded as a solution for coping with the challenges of future complex Systems-on-Chip. It aims to
provide a network backbone that integrates IP resources via communication interfaces. While a lot of
research is going on regarding the platform itself, how to design applications on NoC, and in particular
interprocess communication, is also a challenge. Due to the heterogeneous and distributed nature of NoC,
no existing design flow can be directly applied to NoC application design.
5.1 Summary
In this report, we have defined the NoC Assembler Language, which targets NoC application design. It
serves as an interface between applications and NoC implementations. We have put forward a NoC
application design and compilation flow, which enables the reuse of design languages and tools that are
familiar to SoC designers. The central part of NoC-AL is its communication primitives, which fit into
the session layer of the OSI seven-layer model. A NoC-AL program consists of both a NoC architecture
description and an application description. In the application, we separate communication from
computation. Computational tasks are designed in one of the design languages such as VHDL, Verilog, or
SystemC. Communication is described with the communication primitives. To translate NoC-AL programs
into NoC configuration files, a NoC assembler is required to do source-to-source processing before
standard tools for hardware and software design are used. We advocate channel communication for NoC
IPC. A channel is naturally an arc in the task graph representing an application. Every channel has its
own features regarding QoS and performance under design constraints. Moreover, we have proposed a set
of basic primitives for each of the two basic communication styles, message passing and shared memory.
Furthermore, we have discussed two implementation issues of the NoC communication primitives. The first
is language binding: as the proposed primitives are abstract, they must be bound to a target design
language. The second is that the implementations should be layered, which allows incremental refinement
that adds details step by step. Finally, we have discussed the implications of some channel features
for implementations.
5.2 Future Work
Just as NoC research is in its infancy, so is NoC-AL. There are many opportunities for future work.
Implementing the proposed primitives will require a large amount of work. Although a complete set of
implementations does not fit well within a research project, a sample implementation of some of the
primitives in hardware, say VHDL, and in software, say C, would be most beneficial for evaluating the
feasibility of the primitives and for exploring NoC communication characteristics at a high abstraction
level. An easier way of evaluating the primitives is to use a NoC simulator. Based on feedback from
these evaluations, some of the primitives may be further elaborated, and more primitives may be added.
Also, one part of NoC-AL, the NoC architecture description, has not been addressed in detail.
Moreover, the development of the NoC assembler can be a long-term goal. Although all the ideas must be
tested in practice before a final judgment can be made, we expect that the NoC communication primitives
can be used for future NoC application design.
Bibliography
[1] Semiconductor Industry Association. The International Technology Roadmap for Semiconductors.
2001.
[2] K. Keutzer et al. System level design: Orthogonalization of concerns and platform-based design.
IEEE Transactions on Computer-Aided Design of Circuits and Systems, 19(12), December 2000.
[3] S. Kumar et al. A network on chip architecture and design methodology. In IEEE Computer Society
Annual Symposium on VLSI, 2002.
[4] A. Jantsch. Networks on chip. In Proceedings of the Conference Radiovetenskap och Kommunication,
Stockholm, June 2002.
[5] T. Grötker et al. System Design with SystemC. Kluwer Academic Publishers, 2002.
[6] D. D. Gajski et al. SpecC: Specification Language and Methodology. Kluwer Academic Publishers,
2000.
[7] T. Yen and W. Wolf. Communication synthesis for distributed embedded systems. In IEEE
International Conference on Computer-Aided Design, 1995.
[8] Richard Lai and Ajin Jirachiefpattana. Communication Protocol Specification and Verification.
Kluwer Academic Publishers, 1998.
[9] W. Richard Stevens. Unix Network Programming, Volume 1 - Networking APIs: Sockets and XTI,
second edition. Prentice Hall, 1998.
[10] W. Richard Stevens. Unix Network Programming, Volume 2 - Interprocess Communications, second
edition. Prentice Hall, 1999.
[11] Alok K. Sinha. Network Programming in Windows NT. Addison-Wesley Publishing Company, 1996.
[12] Message Passing Interface. http://www-unix.mcs.anl.gov/mpi.
[13] Wayne Wolf. Computers as Components. Morgan Kaufmann, 2001.
[14] Paul E. Renaud. Introduction to Client/Server Systems - A Practical Guide for Systems
Professionals. John Wiley & Sons, Inc., 1993.
[15] David E. Culler, Anoop Gupta, and Jaswinder Pal Singh. Parallel Computer Architecture, A
Hardware/Software Approach. Morgan Kaufmann Publishers, Inc, 1999.
[16] Douglas E. Comer. Computer Networks and Internets with Internet Applications, Third edition.
Prentice Hall, 2001.
[17] W. Richard Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison Wesley Professional,
1994.
[18] Gary R. Wright and W. Richard Stevens. TCP/IP Illustrated, Volume 2: The Implementation. Addison
Wesley Professional, 1995.
[19] Larry L. Peterson and Bruce S. Davie. Computer Networks: A Systems Approach, Second Edition.
Morgan Kaufmann Publishers, Inc, 2000.
[20] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess
programs. IEEE Transactions on Computers, 28(9):690–691, September 1979.
[21] M. Dubois, C. Scheurich, and F.A. Briggs. Memory access buffering in multiprocessors. In Proc.
13th Annual International Symposium on Computer Architecture, pages 434–442, Stockholm, June
1986.
[22] G. R. Andrews. Foundations of Multithreaded Parallel and Distributed Programming. Addison
Wesley Longman, Inc., 2000.
[23] Charles L. Seitz. The cosmic cube. Communications of the ACM, 28(1):22–33, 1985.
[24] Robert Christian Steinke. Consistency Model Transitions in shared Memory. PhD thesis, University
of Colorado, 2001.
[25] Bilge E. Saglam and Vincent J. Mooney III. System-on-a-chip processor synchronization support
in hardware. In Proceedings of the DATE 2001 on Design, automation and test in Europe, Munich,
Germany, 2001.
[26] Per Stenström. A survey of cache coherence schemes for multiprocessors. IEEE Computer,
23(6):12–24, June 1990.
[27] David J. Lilja. Cache coherence in large-scale shared-memory multiprocessors: issues and
comparisons. ACM Computing Surveys (CSUR), 25(3):303–338, 1993.
[28] Chuan-Qi Zhu and Pen-Chung Yew. A scheme to enforce data dependence on large multiprocessor
systems. IEEE Transactions on Software Engineering, 13(6):726–739, June 1987.
[29] Martti Forsell. A scalable high performance computing solution for networks on chips. IEEE Micro,
2002.
[30] Open Core Protocol. http://www.ocpip.org.
[31] VSI Alliance. http://www.vsia.org.
[32] T. Haverinen et al. SystemC based SoC communication modeling for the OCP protocol.
www.ocpip.org, 2002.