Architecture, On-Chip Network and Programming Interface Concept for Multiprocessor System-on-Chip by Samman, Faizal Arya et al.
Architecture, On-Chip Network and Programming
Interface Concept for Multiprocessor
System-on-Chip
Faizal Arya Samman
University of Hasanuddin at Makassar
Dept. of Electrical Engineering
Email: faizalas@unhas.ac.id
Björn Dollak, Jonatan Antoni
TU Darmstadt, Germany
Fachbereich Elektrotechnik und
Informationstechnik (Students)
Thomas Hollstein
University of Applied Sciences
Frankfurt, Germany
Email: hollstein@fb2.fra-uas.de
Abstract—This paper presents a system architecture, data
communication scheme, and application programming interface
concept for a multiprocessor system based on a network-on-chip
(NoC) platform. Each processing node connected to a mesh node
has its own local (instruction and data) memory portion and a
global (shared) memory portion. The introduced communication
scheme adds only minimal overhead while offering direct
memory-to-memory data transfer. Each processor can deliver a
message directly to another processor (producer-initiated) or
request a copy of memory blocks from a remote processor
(consumer-initiated). The complete data transmission is handled
by the network interface and a special memory controller. The
network interface, managed by the specialized memory controller,
can directly access the shared memory portion. Thus the
processing node can continue its normal operation and is not
blocked during the data transfer process.
Keywords—Network-on-Chip, Many Core Processors, Application
Programming Interface, Network Interface
I. INTRODUCTION
Parallel multiprocessor systems with multiple cores represent the state
of the art for upcoming computer generations. Exploiting the full
performance of multiprocessor systems requires overcoming a common
bottleneck: the shared memory in a bus-based platform. In a bus-based
multiprocessor system, only one processor at a time can use the bus to
read data from or write data to the memory. Meanwhile, the other
processors have to wait until they can perform their memory accesses.
This idle waiting time wastes processing power, so the performance of
the system cannot be fully exploited. This scaling issue can be
addressed with the Network-on-Chip (NoC) paradigm [3].
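The serialization argument above can be made concrete with a small worked example. The following C function is only an illustrative model, not something taken from the paper: it assumes that all processors request the bus at the same instant and that every access occupies the bus for the same number of cycles. The name `bus_total_wait` and both parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical model (not from the paper): n processors request the bus at
 * the same instant and each access occupies the bus for access_cycles.
 * Accesses are serialized, so the k-th processor served (k = 0 .. n-1)
 * waits k * access_cycles before its own access starts. */
unsigned long bus_total_wait(unsigned n, unsigned access_cycles)
{
    unsigned long total = 0;

    assert(n >= 1);                      /* at least one processor */
    for (unsigned k = 0; k < n; k++)
        total += (unsigned long)k * access_cycles;

    return total;                        /* access_cycles * n * (n - 1) / 2 */
}
```

Under these assumptions, 4 processors with 10-cycle accesses accumulate 60 idle cycles in total, while 8 processors accumulate 280. Doubling the processor count more than quadruples the summed waiting time, which is the scaling problem the text describes.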
Distributed Shared Memory (DSM) has been an active topic
for all kinds of multiprocessor systems in recent years. Memory
access topologies and memory bandwidth are crucial for
achieving the targeted overall system performance. In [6], a performance
evaluation of the Cray X1 DSM architecture is presented. In X1
multistreaming processors (MSPs), memory access is performed via
a cache, which is shared by four single stream processors (SSPs).
Four MSPs share 16 memory banks, having 16 individual memory
controllers. This allows local memory access in parallel to global data
communication, accessing some of the 16 memory banks. Principles
of DSM architectures have already been presented in [9], where
structure, granularity and coherence issues are described. [5] gives a
clear description and evaluation of producer-consumer mechanisms
in shared-memory multiprocessors. In a comparison of producer-initiated
and consumer-initiated data communication schemes, producer-initiated
mechanisms (such as data forwarding and user-level message delivery)
provide the highest efficiency and are comparatively insensitive to
network parameters (latency, bandwidth) [5]. In [2], a dynamic approach
for balancing memory access and avoiding access contention is
presented, which applies memory page migration in consumer-initiated
DSM systems. Interesting DSM reference architectures are
the MIT Alewife Machine architecture [1] and the Stanford DASH
(Directory Architecture for Shared Memory) multiprocessor [7].
Bhuyan et al. [4] present a multistage bus-based architecture for the
realisation of a DSM system. In [8] a crossbar NoC architecture
as a platform for a shared-memory architecture has been presented,
where several processing elements, several shared memory units and
a main memory controller are connected to a central crossbar. This
approach also follows the NUMA paradigm.
II. CONTRIBUTION
In this paper, we present an efficient memory architecture that
is implemented on a scalable mesh-based NoC architecture
(XHiNoC). Conceptually, the system architecture presented in this
paper is a distributed-memory multiprocessor system supporting a
parallel programming model with functional-task-level parallelism.
The presented approach is based on the following main goals:
• Slim architecture with reduced administration effort.
• Support of different DSM data exchange paradigms
(producer-initiated message delivery and consumer-initiated
programming models).
• Enhanced benefit from the NoC multicast capability for
advantageous producer-initiated data communication.
• Low requirements on application programming interfaces
(APIs), which allows integrating processors as well as dedicated
hardware components with low wrapping effort.
• Applicability to heterogeneous NoC-based multiprocessor
systems.
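To illustrate the data exchange paradigms named in the goals above, the following C sketch contrasts a producer-initiated send, a consumer-initiated block request, and a producer-initiated multicast. It is a software stand-in written under stated assumptions, not the API developed in this paper: the names `msg_send`, `mem_request`, `msg_multicast`, `ni_transfer`, `shared_mem` and `demo`, the node count, and the memory size are all hypothetical. The network interface and memory controller are modeled by a plain memory copy, so the concurrency with the processor that the hardware would provide is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names and sizes below are hypothetical, chosen for illustration. */
#define NUM_NODES    4
#define SHARED_WORDS 64

/* Shared (globally visible) memory portion of each node. */
static uint32_t shared_mem[NUM_NODES][SHARED_WORDS];

/* Check that [off, off + words) lies inside the shared portion of node. */
static int range_ok(int node, int off, int words)
{
    return node >= 0 && node < NUM_NODES &&
           off >= 0 && words >= 0 &&
           off <= SHARED_WORDS && words <= SHARED_WORDS - off;
}

/* Stand-in for the network interface plus memory controller: copies a
 * block between the shared portions of two nodes. Returns 0 on success. */
static int ni_transfer(int src, int src_off, int dst, int dst_off, int words)
{
    if (!range_ok(src, src_off, words) || !range_ok(dst, dst_off, words))
        return -1;
    memmove(&shared_mem[dst][dst_off], &shared_mem[src][src_off],
            (size_t)words * sizeof(uint32_t));
    return 0;
}

/* Producer-initiated: the node that owns the data pushes it. */
int msg_send(int self, int self_off, int dest, int dest_off, int words)
{
    return ni_transfer(self, self_off, dest, dest_off, words);
}

/* Consumer-initiated: the node that needs the data requests a copy. */
int mem_request(int self, int self_off, int remote, int remote_off, int words)
{
    return ni_transfer(remote, remote_off, self, self_off, words);
}

/* Producer-initiated multicast: one source block, several destinations. */
int msg_multicast(int self, int self_off, const int *dests, int n_dests,
                  int dest_off, int words)
{
    for (int i = 0; i < n_dests; i++)
        if (ni_transfer(self, self_off, dests[i], dest_off, words) != 0)
            return -1;
    return 0;
}

/* Usage sketch: returns 0 when all three transfer styles succeed. */
int demo(void)
{
    int dests[2] = {2, 3};

    shared_mem[0][0] = 11;
    shared_mem[0][1] = 22;

    /* Node 0 pushes two words to node 1 (producer-initiated). */
    if (msg_send(0, 0, 1, 4, 2) != 0) return -1;
    /* Node 2 pulls the same two words from node 1 (consumer-initiated). */
    if (mem_request(2, 0, 1, 4, 2) != 0) return -1;
    /* Node 0 multicasts one word to nodes 2 and 3. */
    if (msg_multicast(0, 0, dests, 2, 8, 1) != 0) return -1;

    assert(shared_mem[1][5] == 22);
    assert(shared_mem[2][0] == 11 && shared_mem[2][1] == 22);
    assert(shared_mem[2][8] == 11 && shared_mem[3][8] == 11);
    return 0;
}
```

Calling `demo()` from a `main` function exercises all three paths. In this sketch both directions reduce to the same block copy and differ only in which node issues the call, which is one way to read the goal of keeping the API requirements low.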
This paper also presents an efficient approach to
functional-task-based programming using an instruction library
(application programming interface, API) that has been developed
to program MIPS-based multiprocessor systems. Some existing
programming-model concepts for multiprocessor systems
have been presented in [6], [9], [5], [2], [1], [7] and [4], which are
mostly not dedicated to on-chip multiprocessor systems. The work
in [8] presents the commonly used shared-memory programming
model for on-chip multiprocessors. However, the work in [8] cannot
support the producer-initiated message delivery programming mode and
has not yet presented in detail how to create a simple computer
978-1-5090-2690-6/16/$31.00 © 2016 IEEE
