Simulation-based Modeling Frameworks for Networked Multi-processor System-on-Chip by Mahadevan, Shankar
Simulation-based Modeling Frameworks for
Networked Multi-processor System-on-Chip
Shankar Mahadevan
Kongens Lyngby 2006
IMM-PHD-2006-157
Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kongens Lyngby, Denmark
Phone +45 45253351, Fax +45 45882673
reception@imm.dtu.dk
www.imm.dtu.dk
IMM-PHD: ISSN 0909-3192
Abstract
This thesis deals with modeling aspects of multi-processor system-on-chip (Mp-
SoC) design affected by the on-chip interconnect, also called the Network-on-
Chip (NoC), at various levels of abstraction. To begin with, we undertook a
comprehensive survey of research and design practices of networked MpSoC.
The survey presents the challenges of modeling and performance analysis of the
hardware and the software components used in such devices. These challenges
are further exacerbated in a mixed abstraction workspace, which is typical of
complex MpSoC design environments.
We provide two simulation-based frameworks, namely ARTS and RIPE, that
allow modeling of hardware (computation time, power consumption, network
latency, caching effect, etc.) and software (application partition and mapping,
operating system scheduling, interrupt handling, etc.) aspects from system-level
to cycle-true abstraction. Thereby, we can realistically model the application
executing on the architecture. This includes, for example, accurate modeling of
synchronization, cache refills, context switching effects, and so on, which are
critically dependent on the architecture and the performance of the NoC. The foundation
of the ARTS model is abstract tasks, while the foundation of the RIPE model
is cycle-count. For ARTS, using different case-studies with over one hundred
tasks (five applications) from the mobile multimedia domain, we show the po-
tential of the framework under real-time constraints. For RIPE, first using six
applications we derive the requirements to model the application and the archi-
tecture properties independent of the NoC, and then use these applications to
successfully validate the approach against a reference cycle-true system.
The presence of a standard socket at the intellectual property (IP) and the NoC
interface in both the ARTS and the RIPE frameworks allows easy incorporation
of IP cores from either framework into a new instance of the design. This
could pave the way for seamless design evaluation from system-level to cycle-
true abstraction in future component-based MpSoC design practice.
Preface
This thesis was prepared at the institute of Informatics and Mathematical
Modelling, in partial fulfillment of the requirements for acquiring the Ph.D. degree
in the Computer Science and Engineering department at the Technical University
of Denmark. The Ph.D. was supervised by Associate Professor Jens Sparsø and
Professor Jan Madsen.
The thesis stems from the “On-Chip Interconnect Networks” project started
in September 2002. The original Ph.D. study plan proposed an evaluation of
reconfigurable networks for multi-processor systems-on-chip (MPSoC) with fo-
cus on low-power solutions. During the course of the study, it was found that
understanding the application and the architectural properties of the MPSoC
was the first crucial step towards this goal. The investigation of these proper-
ties was found to be a challenge in its own right. In this thesis, the solutions
pursued to meet these challenges are presented for perusal towards the Ph.D.
degree requirements. The outcomes of this thesis are the ARTS and the RIPE
frameworks, which can now allow a realistic investigation of the goals stated in
the original study plan.
The thesis consists of a collection of seven research papers written during the
period 2003–2005, and published elsewhere.
Lyngby, March 2006
Shankar Mahadevan
Manuscript Collection
The following manuscripts contribute directly to the body of this thesis.
#1: Tobias Bjerregaard, and Shankar Mahadevan. “A Survey of Research
and Practices of Network-on-Chip.” To appear in the Journal of ACM
Computing Surveys. ACM, 2006.
#2: Jan Madsen, Shankar Mahadevan, Kashif Virk and Mercury Gonza-
lez. “Network-on-Chip Modeling for System-Level Multiprocessor Simula-
tion.” In Proceedings of the 24th Real-Time Systems Symposium (RTSS),
Cancun Mexico. IEEE, Dec. 2003: 265-274.
#3: Jan Madsen, Shankar Mahadevan, and Kashif Virk. “Network-Centric
System-Level Model for Multiprocessor System-on-Chip Simulation.”
Interconnect-Centric Design for Advanced SoC and NoC. Eds. Nurmi
J., Tenhunen H., Isoaho J., and Jantsch A. Dordrecht, The Netherlands.
Kluwer Publications, 2004: 341-365.
#4: Shankar Mahadevan, Michael Storgaard, Jan Madsen, and Kashif Virk.
“ARTS: A System-Level Framework for Modeling MPSoC Components
and Analysis of their Causality” Modeling, Analysis and Simulation of
Computer and Telecommunication Systems (MASCOTS), Atlanta USA.
IEEE, Sept. 2005: 480-483.
#5: Shankar Mahadevan, Federico Angiolini, Michael Storgaard, Rasmus G.
Olsen, Jens Sparsø and Jan Madsen. “A Network Traffic Generator Model
for Fast Network-on-Chip Simulation.” In Proceedings of Design, Automa-
tion and Testing in Europe Conference (DATE), Munich Germany. IEEE,
Mar. 2005: 780-785.
#6: Federico Angiolini, Shankar Mahadevan, Jan Madsen, Luca Benini and
Jens Sparsø. “Realistically Rendering SoC Traffic Patterns with Interrupt
Awareness.” IFIP Very Large Scale Integration Systems and their Designs
Conference (VLSI-SoC), Perth Australia. IEEE, Oct. 2005: 211-216.
#7: Shankar Mahadevan, Federico Angiolini, Jens Sparsø, Luca Benini and Jan
Madsen. “A Reactive IP Emulator for Multi-Processor System-on-Chip
Exploration.” Submitted for Journal Publication.
The following manuscripts were also published during the course of this PhD,
but are not part of this thesis.
• Tobias Bjerregaard, Shankar Mahadevan, and Jens Sparsø. “A Channel
Library for Asynchronous Circuit Design Supporting Mixed-Mode Mod-
eling.” In Proceedings of the 14th International Workshop on Power and
Timing Modeling, Optimization and Simulation (PATMOS), Isle of San-
torini Greece. Springer Publications, 2004: 301-310.
• Tobias Bjerregaard, Shankar Mahadevan, Rasmus G. Olsen, and Jens
Sparsø. “An OCP Compliant Network Adapter for GALS-based SoC Design
Using the MANGO Network-on-Chip.” Proceedings of the International
Symposium on System-on-Chip (ISSoC), Tampere Finland. IEEE
2005: 171-174.
Acknowledgements
It was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness. . . .
- Charles Dickens, A Tale of Two Cities. London 1859.
In the journey towards my Ph.D. degree, culminating in this thesis, many peo-
ple have shared their wisdom and warned me about pitfalls. My fellow Ph.D.
student and friend, Tobias Bjerregaard for many intense and fruitful discussions.
Thanks Tobias for introducing me to the electronic music scene in Copenhagen.
This thesis would not have been possible without expert guidance and navi-
gation by my supervisors, Associate Professor Jens Sparsø and Professor Jan
Madsen. I am grateful to them for allowing me to follow the path charted
in this thesis. Thanks also go to Kashif Virk for his patience in answering
my many questions. In Bologna, I am very grateful for the academic stimulus
and the camaraderie of Federico Angiolini and the rest of the gang. Thanks
Federico for introducing me to the best-of-the-best pizza and pasta places in
Bologna. Thank you Prof. Luca Benini for many discussions, but mostly for
allowing me to come to Italy and escape the Danish weather. Two Masters’
students, Michael Storgaard and Rasmus Olsen, who partook in the implemen-
tation activities. Thanks Michael for introducing me to <deque> in C/C++.
Maria Jensen for keeping track of my Ph.D. accounts and patience. Per Friis for
twice rescuing my hard disk. For funding my research, I am grateful to Nokia
Denmark, SoC-MobiNET, Thomas B. Thrige Foundation and ARTIST.
Last but not least, my parents and brother for their love - despite seeing me
only for a few weeks in the past three years!
Shankar Mahadevan
Lyngby, March 2006.
Contents
Abstract
Preface
Manuscript Collection
Acknowledgements
I Preamble
1 Introduction
1.1 Gist of the Published Work
1.2 Discussion
1.3 Outline of the Thesis
2 Concluding Remarks
2.1 Contribution of this thesis
2.2 Suggested Future Direction
2.3 Summary and Conclusion
II Body
3 Overview of Networked MPSoC
4 The ARTS Modeling Environment
4.1 Network-Centric System-Level Model for Multiprocessor System-on-Chip Simulation
4.2 ARTS: A System-Level Framework for Modeling MPSoC Components and Analysis of their Causality
5 The RIPE Modeling Environment
III Appendix
6 Network-on-Chip Modeling for System-Level Multiprocessor Simulation
7 A Network Traffic Generator Model for Fast Network-on-Chip Simulation
8 Realistically Rendering SoC Traffic Patterns with Interrupt Awareness
Part I
Preamble
Chapter 1
Introduction
Integrated circuit (IC) design is driven by the target application domain, the
architectural choices and the performance trade-offs. Generally, the application
dictates the architecture and the performance requirements. The architecture is
the composition of hardware and software, while performance is speed, power,
mobility, etc. The flow from specification to a deployable IC is influenced by
the availability and ease of integration of the hardware and the software
components. Investigating the performance of the IC devised by integrating these
components can be a challenge due to many factors. First, the components have
to be designed with a level of accuracy to give confidence in the eventual result.
Second, due to correlations between the behaviors of the components, it is dif-
ficult to postulate how the optimization performed during design of individual
components percolates to the entire IC.
The level of detail at which the IC components are modeled and simulated has a
direct impact on the accuracy and the time for understanding its performance.
The closer the design description is to the eventual IC, the higher is the con-
fidence in its performance. For example, a post-layout simulation accounts for
all variables, i.e. wire and gate delays, suggesting a high degree of accuracy of
the design. However, a large investment in man-hours is required for modeling
and simulation at this level of detail. Given the shrinking time-to-market con-
straints, this investment would not be possible for many of the complex designs
of the future.
A typical approach to IC design starts by taking an existing design methodology
and applying it to the application and architecture in question. As is observed
in [13], while this approach may indeed work for traditional “well-behaved” ap-
plications and architectures, the attempt is more likely to fail for more complex
applications and architectures that can be expected in the forthcoming years.
This is because of the increase in transistor density and the growing gap in
using them productively in a timely fashion. This has given rise to a new IC
design paradigm, namely networked multi-processor system-on-chip (MpSoC).
We explain this new terminology as follows:
networked: This refers to the interconnect fabric used to bind the architectural
components. As has been motivated in [2, 6], the future of IC design will be
limited not by computation, but by communication. Hence a multi-hop,
concurrent and distributed interconnect model, the so-called Network-on-Chip
(NoC), has emerged as a candidate solution. We comprehensively
address the issues related to NoC in Chapter 3.
multi-processor: This refers to the class of components termed intellectual
property (IP), such as the computation and the memory units, that comprise
the architecture. It includes the hardware (ASIC, FPGA, ASIPs,
general purpose processors (GPP)) and the software layers (operating system
and application) stacked on top of the hardware (where applicable).
Over the last two decades, it is not so much their design as the way
these components are modeled and used that has changed. The emphasis
is on re-use, wherein the interfaces of these components are now
well-defined sockets [18]. Further, they were traditionally available
only as RTL entities, while now they are described in a range of abstractions
from un-timed functional to transaction to cycle-true, including
RTL. This expands their availability for performance evaluation at
different stages of the design.
system-on-chip: This refers to the deployment of entire systems on a single
chip in a predictable and timely fashion. Generally, it can be viewed as
concurrent activity on two axes: horizontal, where hardware components
are assembled (processors, ASIC, etc. connected via the interconnect), and
vertical, where the software components are compiled (application software,
device drivers and operating system (OS)).
The basic premise of the networked MpSoC design paradigm is component-based
design practice with emphasis on the separation of computation and communication
concerns. This premise has created a gap between the existing
design and modeling frameworks, which emphasize top-down step-wise design refinement,
and the required frameworks that can undertake a mixed abstraction
design exploration. The goal of the new frameworks must be to provide model-
ing primitives that can realistically capture the application behaviour and the
architectural properties including the assessment of the impact of interconnect
performance. For example, in a networked MpSoC, context switching and cache
refills will be critically affected by the network latency, and thus impact the
processor’s ability to execute the application.
In this thesis, we identify the MpSoC properties affected by the interconnect,
and suggest ways to model them at various levels of abstraction. To assess the
impact of different applications and architectural changes on the performance
of an instance of a networked MpSoC design, we provide two simulation-based
modeling environments: ARTS (at system-level), and RIPE (closer to cycle-true
abstraction). As will be detailed in the body of the thesis, the foundation of the
ARTS framework is abstract tasks, while the foundation of the RIPE framework
is cycle-count. In both cases, the execution of the application is abstracted
away into “time-slices”, albeit at different granularity i.e. at functional-block
level in ARTS and at instruction level in RIPE. Using experiments and by val-
idation with other reference systems, we show the potential of our modeling
environments to handle many classes of applications seen in real-life. These
applications are from different domains, exhibiting real-time constraints,
employing different synchronization schemes, and containing multiple
threads susceptible to interrupts and OS-dependent context switching. The
investigation of such a broad class of applications could produce general guidelines
and recommendations to address many issues in the design of MpSoC systems.
The thesis is organized as a collection of published or submitted manuscripts. In
the remainder of this chapter, we attempt to identify a common theme through
these manuscripts. To do this, we first provide the gist of the concepts and
techniques detailed in the manuscripts. This is followed by a discussion on the
scope of this body of work, where we also fill some gaps in the evolution of the
work. Finally we present an outline of the thesis and some notes for the reader
to keep in mind during the reading of the remainder of the thesis.
1.1 Gist of the Published Work
In this section, we present the gist of the published papers that are part of this
thesis. In this process we also categorize the work. Broadly, the papers can be
collected into three groups (seven papers) as follows:
I. A Survey of Networked MpSoC
#1: A Survey of Research and Practices of Network-on-Chip
(Accepted Journal Publication)
This work highlights many of the challenges in designing and modeling
networked MpSoC. Specifically for this thesis, the motivation and refer-
ence to a large amount of related work can be found in this paper. Overall,
NoC can be application-specific or a generic interconnect which can ac-
commodate several applications. Generally, one can avoid over-design of
the NoC architecture by studying the traffic requirements for a given prob-
lem. The traffic types (latency critical, individual or burst transactions)
generated by the system can vary greatly depending on the application
characteristics and architectural choices. Primarily one can conclude that
these traffic types are the property of the hardware and the software layers
stacked on top of the IP core.
II. The ARTS Modeling Environment
#2: Network-on-Chip Modeling for System-Level Multiproces-
sor Simulation (Conference Publication)
#3: Network-Centric System-Level Model for MpSoC Simula-
tion (Book Chapter)
#4: ARTS: A System-Level Framework for Modeling MpSoC
Components and Analysis of their Causality (Conference Pub-
lication)
This work highlights the requirements to model the application and the
architecture at the system-level while giving a central role to the effects
of the NoC. Overall, the ARTS framework described here is designed to
meet the need for early exploration and understanding of architectural
choices and application mapping in MpSoC designs. It is unlike some of
the previous work at system-level exploration, wherein the frameworks
are limited to exploration of causality between few classes of processors,
memory or interconnect. The ARTS framework is not developed with any
specific problem in mind, but is modularized and extendable in terms of
modeling the different hardware and software layers observed in MpSoC
systems. Further, it allows mixed (in terms of abstraction) instantiation
for complex problems. From the perspective of this thesis, the modeling of the
NoC in a detailed system-level framework such as ARTS allows us to assess
the impacts of OS dynamics, selection of the hardware components, and
mapping of the software tasks, on the system performance early in the
design phase. A case-study with applications (MP3 decoder, GSM en-
coder/decoder, MPEG encoder/decoder) from the real-time multimedia
application domain consisting of 114 tasks on a 6-processor platform for
a hand-held terminal shows the co-exploration capabilities of ARTS. The
case study highlights the impact of changing the underlying processing
element (between ASIC, FPGA and general purpose processor), commu-
nication fabric (bus, mesh and torus) and OS scheduling policy on the
processor utilization, the communication contention and the memory us-
age.
III. The RIPE Modeling Environment
#5: A Network Traffic Generator Model for Fast Network-on-
Chip Simulation (Conference Publication)
#6: Realistically Rendering SoC Traffic Patterns with Interrupt
Awareness (Conference Publication)
#7: A Reactive IP Emulator for Multiprocessor System-on-Chip
Exploration (Submitted for Journal Publication)
This work highlights the requirements to model the application and the
architecture in an environment closer to cycle-true abstraction. The reac-
tive IP emulator (RIPE) described here can model computation behavior
independent of the NoC properties, yet be reactive to changes in NoC architecture.
Thereby, it effectively decouples the simulation of the IP cores
from the NoC. Although RIPE was originally devised merely to mimic a processor’s
behavior for NoC exploration, the reactiveness properties identified for emulation
have opened opportunities for alternate uses, which are explored in a case study
documented in the above papers. The hardware and software properties
captured in this framework are derived from execution of complex real-
life application templates showcasing semaphore-based synchronization,
OS scheduling based on time-slicing (multi-tasking), pipeline multimedia
data processing, and I/O operations. Further we have validated the ap-
proach with a reference cycle-true framework and have determined that
great accuracy (over 99%) and notable speedup can be achieved with our
RIPE framework.
As will be outlined later, this grouping of the papers not only serves the purpose
of categorizing the work covered in this thesis, but also as chapters of this thesis.
1.2 Discussion
The categorization of the work presented above may at first glance appear to
have a diverse focus. Therefore, in this section we attempt to identify
a common theme across the work.
Abstractions          Foundation     Framework   Papers
System-level View     Tasks                      Paper #2
                                     ARTS        Paper #3
Programmer’s View     Memory Map                 Paper #4
with/without timing                              Paper #5
Cycle Accurate        Clock Cycles   RIPE        Paper #6
                                                 Paper #7
Table 1.1: Abstractions of the Networked MpSoC Addressed in the Thesis.
1.2.1 Modeling Scope
The MpSoC design-related problems can be explored either in the analytical or
the simulation domain. The scope, i.e., the problem representation and analysis
style, of the ARTS and the RIPE modeling frameworks falls into the simulation
domain. Analytical approaches to solving MpSoC problems also exist and are
well documented in [22, 21, 12, 16, 24]. However, as is also observed in [28], the
performance of complex systems such as NoC is not easily expressed analytically.
The simulation-based approach, on the other hand, addresses only the average-case
behaviour. We have developed the ARTS and the RIPE frameworks with the
view that one can easily formulate the problem and compare the results across
different platforms and implementations. The frameworks are not developed to
address any specific design problem, but to provide a necessary set of primitives
to model all the required hardware and software components to instantiate the
given design problem and evaluate it effectively in different abstractions. In
order to take advantage of analytical approaches such as guarantees on best-
case and/or worst-case behaviour, we propose a hybrid simulation/analytical
approach as is done in [14] and [3]. Here, a limited part of the system (shared
resource constraints in [14] and performance analysis in [3]) is described formally
within a larger simulation-based setup. Such a design exploration approach can
also be accomplished in our frameworks.
1.2.2 Modeling Abstractions
The MpSoC design-related problems can also be analyzed at many abstraction
levels, with varying detail of the MpSoC layers (i.e. application, operating system
and hardware). Table 1.1, adapted from [5], shows a subdivision of the
various abstractions employed during MpSoC design. These can be used
system-wide, meaning any component, be it the NoC or the IP cores, can be described
at any level of abstraction and then be integrated with other components
Figure 1.1: System-wide Abstraction for Modeling MpSoC Components.
via suitable interfaces for performance analysis. Such a system using components
from ARTS, RIPE and a cycle accurate (CA) framework is illustrated in
Figure 1.1. Here, the components use standard sockets at the network interface
(NI), which in the case of our frameworks are compatible with the Open Core Protocol
(OCP) [17].
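As a minimal sketch of the idea behind Figure 1.1, the following Python fragment shows how IP models at different abstractions could sit behind one socket-like interface. The names Socket, TaskLevelIP, CycleLevelIP and Network are hypothetical and belong neither to ARTS, RIPE nor to the OCP specification; the sketch only assumes that every component exposes the same request call to the network.

```python
from __future__ import annotations

from typing import Protocol


class Socket(Protocol):
    """Hypothetical socket: every IP model offers the same call to the network."""

    def next_request(self) -> tuple[int, int]:
        """Return (destination node, payload size in words)."""
        ...


class TaskLevelIP:
    """Coarse model: one message per finished task (system-level flavour)."""

    def __init__(self, dest: int, words_per_task: int) -> None:
        self.dest = dest
        self.words_per_task = words_per_task

    def next_request(self) -> tuple[int, int]:
        return (self.dest, self.words_per_task)


class CycleLevelIP:
    """Fine model: single-word transactions (closer to a cycle-true flavour)."""

    def __init__(self, dest: int) -> None:
        self.dest = dest

    def next_request(self) -> tuple[int, int]:
        return (self.dest, 1)


class Network:
    """Toy network that only counts the words delivered to each node."""

    def __init__(self) -> None:
        self.delivered: dict[int, int] = {}

    def serve(self, ip: Socket) -> None:
        # The network never inspects which abstraction sits behind the socket.
        dest, words = ip.next_request()
        self.delivered[dest] = self.delivered.get(dest, 0) + words


def run_mixed_system() -> dict[int, int]:
    """Instantiate one coarse and two fine IP models on the same network."""
    noc = Network()
    ips = (TaskLevelIP(dest=0, words_per_task=16), CycleLevelIP(dest=0), CycleLevelIP(dest=1))
    for ip in ips:
        noc.serve(ip)
    return noc.delivered
```

Because the network depends only on the shared call, a coarse model can be swapped for a finer one without touching the other components, which is the property the standard socket is meant to provide.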
In the system-level view (SV), instead of the actual functionality, the execution
time of the task representing the functionality is used to model the application’s
behavior. In this case, the interdependencies between the tasks translate
into communication carried over the NoC. Task graphs are a well-known way to
represent and structure such coarse-grain application behavior at this abstraction.
To associate architecture properties with the application behavior, the
task properties (execution time, memory requirement, power consumption, etc.)
are characterized on various IP cores. However, the impact of cache behavior,
consequences of data dependencies, contention over shared resources, and so on,
are difficult to predict at this abstraction, and hence, a degree of tolerance is in-
troduced while assessing these properties. This observation leads to a spectrum
of behaviors from best-case to worst-case scenarios.
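The system-level view described above can be illustrated with a small task graph. The sketch below is not taken from ARTS: the task names, the execution-time bounds and the communication latency bounds are all hypothetical, and the model assumes that every task runs on its own IP core as soon as its inputs arrive (no contention for processors or links).

```python
# Hypothetical task graph: task -> list of predecessor tasks.
PREDECESSORS = {
    "read": [],
    "decode": ["read"],
    "filter": ["read"],
    "write": ["decode", "filter"],
}

# Hypothetical characterization: (best-case, worst-case) execution time of each
# task on the IP core it is mapped to, in arbitrary time units.
EXEC_TIME = {
    "read": (2, 3),
    "decode": (5, 8),
    "filter": (4, 9),
    "write": (1, 2),
}

# Assumed (best-case, worst-case) latency of one message carried over the NoC.
COMM_LATENCY = (1, 4)


def finish_time(task: str, case: int) -> int:
    """Finish time of `task` when every task starts as soon as its inputs arrive.

    `case` selects the bound: 0 for the best case, 1 for the worst case.
    """
    ready = max(
        (finish_time(p, case) + COMM_LATENCY[case] for p in PREDECESSORS[task]),
        default=0,
    )
    return ready + EXEC_TIME[task][case]


def completion_bounds(sink: str) -> tuple:
    """Return (best-case, worst-case) completion time of the sink task."""
    return finish_time(sink, 0), finish_time(sink, 1)
```

With these assumed numbers, completion_bounds("write") yields (10, 22), which shows how the tolerance on individual task and communication times widens into a spectrum of system-level behaviors.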
Keeping this in mind, a range of frameworks has been proposed in the literature
[9, 1, 10, 13, 27]. They investigate the impact of OS scheduling, and limitations
posed by the processor and the interconnect architectures such as memory
and bandwidth, for a given application domain. Our ARTS model is inspired
by the desire to undertake similar investigation. However, as is distinguished
in the papers, we also attempt to modularize the framework to include a range
of IP cores e.g. ASIC, GPP and FPGA, and a range of OS scheduling policies
such as earliest-deadline-first (EDF) and rate-monotonic (RM), with support for
preemption. Via the framework’s comprehensive support for both hardware and
software layers, i.e. application, OS and the platform architecture, the design-
ers can investigate problems both in the general and the real-time application
domains.
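The difference between the two scheduling policies named above can be sketched in a few lines. This is a generic textbook illustration rather than the ARTS implementation; the Task fields and the two example tasks are hypothetical, and each job's absolute deadline is assumed to be the end of its current period.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    period: int    # activation period of the task
    deadline: int  # absolute deadline of the current job (end of its period here)


def pick_rm(ready: list) -> Task:
    """Rate-monotonic: the ready task with the shortest period has priority."""
    return min(ready, key=lambda t: t.period)


def pick_edf(ready: list) -> Task:
    """Earliest-deadline-first: the job with the nearest absolute deadline runs."""
    return min(ready, key=lambda t: t.deadline)


# Assumed snapshot at time 22: the audio job was released at 20 (period 10),
# the video job was released at 0 (period 25).
READY = [
    Task("audio", period=10, deadline=30),
    Task("video", period=25, deadline=25),
]
```

For this snapshot pick_rm(READY) selects the audio task while pick_edf(READY) selects the video task. With preemption enabled, a scheduler would repeat this selection whenever a new job is released.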
To do this investigation, the ARTS framework utilizes three basic blocks: the
allocator, the scheduler and the synchronizer. The allocator controls the ownership
of resources, be it the execution engine of the processor or the routers/links of
the NoC. The scheduler controls the order in which the tasks execute on the
resource, be it application tasks on the PE or communication tasks in the NoC. The
synchronizer controls the interdependencies, be it precedence constraints
among application tasks or prioritization of communication tasks. The ARTS modeling
primitive is based on the principle of composition outlined in [26]. As a way
of preserving composition, each of the blocks described above handles its relevant data
independently of the others. The communication between the application task
and the RTOS blocks is handled by message exchanges. This way the MpSoC
designer can easily combine alternate allocation, scheduling and synchronization
policies without cumbersome recoding of the entire RTOS or compiling of the
framework. This is the motivation for selecting composition-based modeling.
Additionally, we have found common characteristics that allow both a diverse
range of IP behaviors and interconnect behaviors to be modeled using these three blocks.
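The composition of the three blocks can be sketched as follows. This is an illustrative toy under stated assumptions, not the ARTS code: the class and message names are hypothetical, there is a single resource, tasks run to completion without preemption, and time is not modeled. Each block keeps only its own data and reacts to "ready" and "finished" messages.

```python
from collections import deque


class Synchronizer:
    """Releases a task only when all of its predecessors have finished."""

    def __init__(self, predecessors):
        self.waiting = {t: set(p) for t, p in predecessors.items()}

    def initially_ready(self):
        return [t for t, preds in self.waiting.items() if not preds]

    def on_finished(self, task):
        released = []
        for t, preds in self.waiting.items():
            if task in preds:
                preds.remove(task)
                if not preds:
                    released.append(t)
        return released


class Scheduler:
    """Orders the ready tasks with a pluggable priority (smaller runs first)."""

    def __init__(self, priority):
        self.priority = priority
        self.ready = []

    def on_ready(self, task):
        self.ready.append(task)

    def pick(self):
        if not self.ready:
            return None
        task = min(self.ready, key=self.priority)
        self.ready.remove(task)
        return task


class Allocator:
    """Owns the single resource and grants it to one task at a time."""

    def __init__(self):
        self.owner = None

    def request(self, task):
        if self.owner is None:
            self.owner = task
            return True
        return False

    def release(self, task):
        if self.owner == task:
            self.owner = None


def simulate(predecessors, priority):
    """Return the order in which tasks are granted the resource."""
    sync, sched, alloc = Synchronizer(predecessors), Scheduler(priority), Allocator()
    messages = deque(("ready", t) for t in sync.initially_ready())
    order = []
    while messages:
        kind, task = messages.popleft()
        if kind == "ready":
            sched.on_ready(task)
        else:  # "finished"
            alloc.release(task)
            messages.extend(("ready", t) for t in sync.on_finished(task))
        # Grant the resource once all pending messages have been handled.
        if not messages and alloc.owner is None:
            nxt = sched.pick()
            if nxt is not None and alloc.request(nxt):
                order.append(nxt)
                messages.append(("finished", nxt))  # runs to completion
    return order


# Hypothetical application: c needs a and b, d needs a.
PRED = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"]}
DEADLINE = {"a": 5, "b": 3, "c": 9, "d": 7}
```

Calling simulate(PRED, DEADLINE.get) orders the tasks by deadline, while simulate(PRED, lambda t: t) orders them by name. Only the priority function changes between the two runs, which is the kind of policy swap that composition is meant to make cheap.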
The potential of the ARTS modeling framework has been demonstrated via
case studies of a mobile multimedia terminal where the advantages of introducing
a NoC have to be traded off against performance parameters such as memory,
power and program completion time. In some cases even correct operation of
the system cannot be guaranteed. For example, we show (in Paper #3) that
even a small MpSoC system with three processors connected via a torus NoC
(using wormhole routing protocol) could potentially cause system-deadlock due
to OS preemption of the communicating tasks.
In the programmer’s view (PV) of system design, parts of the architecture
are exposed to the application, thereby introducing a degree of accuracy in the
modeling and performance evaluation. As is discussed in [5], in the untimed
PV the absolute behavior is not guaranteed, but the degree of accuracy can
be postulated based on the description of the IP model such as pessimistic,
optimistic, random, typical or a combination of models in these circumstances.
Communication is point-to-point and based on a common, highly efficient trans-
port mechanism. In the timed PV the request and response are completed in a
single transaction and time is indicated as ‘time-passed’ rather than
event-per-clock-tick. This view is analogous to a range of models also described under
transaction-level models (TLM) [4, 7, 11, 19, 20, 23, 25].
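The notion of 'time-passed' in the timed PV can be made concrete with a short sketch. The code below is a hedged illustration only and does not reproduce any particular TLM library: the TimedMemory class, its transport method and the delay values are hypothetical. Request and response complete in one call, and the call reports how much time elapsed instead of generating an event at every clock tick.

```python
class TimedMemory:
    """Hypothetical target: answers a request and reports how much time passed."""

    def __init__(self, access_time: int) -> None:
        self.access_time = access_time
        self.cells = {}

    def transport(self, command: str, address: int, data: int = 0):
        """Complete request and response in one call; return (data, time_passed)."""
        if command == "write":
            self.cells[address] = data
        return self.cells.get(address, 0), self.access_time


def run_initiator(target: TimedMemory, link_delay: int):
    """Issue a write followed by a read and accumulate the local time."""
    now = 0
    _, passed = target.transport("write", 0x10, 42)
    now += link_delay + passed  # advance time in one step per transaction
    value, passed = target.transport("read", 0x10)
    now += link_delay + passed
    return value, now
```

With an assumed access time of 3 and a link delay of 2, run_initiator(TimedMemory(3), 2) returns (42, 10): two transactions, each advancing the initiator's time by 5 units in a single step.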
By sacrificing simulation speed, the models at this level extract additional accu-
racy for performance evaluation. The goal of such analysis is the same as for SV
i.e. investigating and extracting as much performance as possible out of given
processor and interconnect for a given application. Parts of both the ARTS and
RIPE frameworks straddle this level of abstraction. In ARTS, the communication
interdependencies are triggered by writing to specific addresses in the IP
cores. In RIPE, except for a few special purpose registers, the complete
program, data and register files are addressable. Overall, in either framework,
the presence of OCP [17] inherently allows access to the public memory of the
IP cores.
The RIPE framework was originally devised to optimize the interconnect perfor-
mance at the cycle-true abstraction. To do this it has to be reactive to the NoC
architectural changes. For example, network latency could have different out-
comes on the system performance in cases where synchronization occurs over
the interconnect and OS-dependent context-switching is involved. The RIPE
can be programmed to account for the impact of communication latency on the
application execution. Via a simple non-pipelined instruction set architecture,
implementing basic flow-control instructions, it can be configured to initiate a
range of communication transactions (single read/write, burst read/write, inter-
rupts) separated by idle waits. Thereby, it can mimic the externally observable
behaviour of an IP core executing an application for the rest of the MpSoC.
By introducing a programmable paradigm, the RIPE can be used in association
with manually written programs to generate traffic patterns typical of IPs still
in the design phase, helping in the tuning of the communication performance
or understanding the causality relationship with other IPs in the MpSoC. This
choice allows us to describe reactiveness characteristics of a wide range of IP
cores at different levels of abstraction. Additionally, this choice allows future
deployment as a hardware device in test chips containing interconnect proto-
types. Through case studies based on real-life applications, such as multimedia
data processing, input/output operation, and OS-aware multi-tasking, we have
demonstrated that the RIPE can handle and emulate a wide class of application
behaviours independent of interconnect aspects.
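The reactiveness described above can be sketched with a toy interpreter. The instruction names (IDLE, READ, WRITE, JNE), the program and the network model are hypothetical and are not the RIPE instruction set; burst transactions and interrupts are omitted. The point of the sketch is that the program itself contains no network timing, yet the traffic it generates changes with the network latency because a semaphore is polled over the interconnect.

```python
def run(program, network):
    """Execute `program`; return (finish time in cycles, number of transactions)."""
    pc, now, reg, transactions = 0, 0, 0, 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "IDLE":                # local computation, independent of the NoC
            now += args[0]
        elif op in ("READ", "WRITE"):   # blocking transaction over the NoC
            data, latency = network(op, args, now)
            now += latency
            transactions += 1
            if op == "READ":
                reg = data
        elif op == "JNE":               # flow control: jump if reg != value
            value, target = args
            if reg != value:
                pc = target
    return now, transactions


def make_network(latency, semaphore_free_at):
    """Toy NoC plus semaphore: fixed latency, semaphore released at a given time."""

    def network(op, args, now):
        if op == "READ":
            # The semaphore reads as 1 (acquired) once it has been released.
            return (1 if now >= semaphore_free_at else 0), latency
        return 0, latency

    return network


SEM, DATA = 0x100, 0x200  # assumed addresses

PROGRAM = [
    ("IDLE", 10),        # 0: compute, also serves as the wait between polls
    ("READ", SEM),       # 1: poll the semaphore over the network
    ("JNE", 1, 0),       # 2: not acquired, go back and try again
    ("WRITE", DATA, 7),  # 3: acquired, write the result
]
```

With the semaphore released at cycle 30, a latency of 2 cycles gives run(PROGRAM, make_network(2, 30)) == (38, 4), while a latency of 10 cycles gives (50, 3). The slower network finishes later and also issues one poll fewer, so the traffic pattern reacts to the network rather than replaying a fixed trace.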
In the cycle accurate (CA) view of the system design, nearly all aspects of
the architecture are described. The pipelined behavior, the address and data
encoding/decoding and every other atomic (non-interruptible) action sequence
can be tracked at every clock cycle at this abstraction. The work presented in [8,
15] models this abstraction. Such models provide a high degree of accuracy for
investigating both the interconnect and the processor performance. This affords
us the mechanism to validate the proposed frameworks (ARTS and RIPE). As is
outlined in papers in Group III, this thesis covers the work done to validate the
RIPE against the MPARM proposed in [8]. The validation of the ARTS framework
is left as future work.
From the above discussion, we can visualize a common theme, stretching from
work related to ARTS to work related to RIPE. The commonality between the
two frameworks is that their respective modeling primitives attempt to capture
the interaction among the same three entities i.e. the application, the OS and
the architecture. The difference is that they do so at different abstractions. As
alluded to before, the presence of OCP at the interface of both the ARTS and
RIPE allows easy mixing of modules from one framework with the other (Figure 1.1).
This would allow mixed abstraction design exploration. Though not addressed
in this thesis, a comprehensive framework that can operate at any mode of
abstraction is foreseeable. Instantiation of mixed-abstraction design is already
possible using the components from the ARTS and the RIPE frameworks, which
are the focus of this thesis.
1.3 Outline of the Thesis
The thesis is organized in three parts: Preamble, Body, and Appendix. The
current chapter (Chapter 1) and the following chapter act as a preamble for
the rest of the thesis. As has been demonstrated in this chapter, the preamble
part sets the scene and draws a common theme for the main body of the thesis,
which is a composition of various peer-reviewed published papers. Chapter 2
summarizes the contributions of the papers, presents concluding remarks, and
hints at future directions.
The body of the thesis has three chapters. In Chapter 3, we present the paper
(Paper #1) that provides an overview of issues relating to the NoC aspects and
its impact on MpSoC design and performance. This is followed by two papers
(Paper #3 and #4), which comprise Chapter 4 and detail the work related to
the ARTS framework. In Chapter 5 via the Paper #7, we detail the work related
to the RIPE framework.
Note that we have selectively combined the papers listed in Section 1.1. Papers
#2, #5 and #6 are not part of the main body of the thesis but can be found
in the Appendix part. The reason is as follows. Paper #2 is a limited version of
Paper #3, while Papers #5 and #6 are precursors to Paper #7. Papers #2,
#5 and #6 can be found in Appendices 6, 7 and 8, respectively. This selection ensures
a consistent reading of the thesis and avoids revisiting similar concepts spread
across different papers.
The various papers comprising the main body of the thesis have been published
at different stages of the development of the frameworks. Consequently, a
note on the nomenclature is in order. With regard to the ARTS framework, in
Papers #2 and #3, it is referred to as ‘abstract system-level model’ or ‘system-
level RTOS modeling framework’. With regard to the RIPE framework, in
Papers #5 and #6, it is referred to as simply ‘traffic generator’ or ‘reactive
traffic generator’. The nomenclature reflects the state of the framework at the
time of publication.
Chapter 2
Concluding Remarks
2.1 Contribution of this thesis
Here, we outline the specific ideas, concepts and techniques that have been
contributed by the author of this thesis. We refer to abstractions outlined in
Table 1.1 (in Section 1.2.2) to structure the research work.
i. A structured overview of the networked MpSoC research has been pre-
sented. There are many challenges and opportunities identified in this
overview, ranging from the design of individual NoC components, such
as routers and links, to higher-level architectural concerns. An outline of
modeling and design issues related to NoC in the wider MpSoC is also
presented.
ii. At the system-level, the identification of modeling primitives to capture
the causality between the hardware and the software components, when
taking the behaviour of the NoC into account, has been the main contribu-
tion. The motivation here is to understand the cross-layer dependencies
of the architecture, the OS, the device drivers and the application lay-
ers. The causality is understood by modeling and implementing the NoC
topology and protocol aspects through the basic blocks of the ARTS model,
namely the allocator, the synchronizer and the scheduler. Requirements
and implementation of modeling primitives capturing memory dynamics
for abstract task execution and communication were also addressed. We
have successfully modeled bus, mesh and torus architectures and then per-
formed a co-exploration to demonstrate the impact of these architectures
on the system performance under real-time constraints. The trade-off met-
rics that were monitored include processor utilization, memory usage and
communication contention.
iii. Near the cycle-true abstraction, the contribution of the thesis can be listed
as follows.
• We have identified the so-called reactive behaviour that is essential to
undertake exploration of alternate NoC architectures and features under
realistic application behaviour. The idea is to abstract away
the computation time while maintaining data and interrupt depen-
dent communication sensitivity in the application behaviours. The
reactive behaviours include complex synchronization schemes (as is
observed in multimedia data processing) and OS interaction (as in
multi-tasking and input/output operations).
• We have developed a simple model based on an instruction set architecture,
namely the reactive IP emulator (RIPE), to mimic the IP core’s
reactiveness at its interface with the NoC. This model has three basic
flow-control instructions (IF, JUMP and Set Register), which we have
found to be sufficient to model the wide class of reactive behaviour
mentioned above. Additional instructions support the range of
communication transactions and parameterized computation time (via
idle waits or cycle-count).
• We have successfully validated our RIPE approach with a cycle-true
reference system via executing templates of applications possessing
these reactiveness properties in a multithreaded environment.
• Finally, we have developed a case study to show the potential of
such abstraction of computation time (into cycle-count) in a design
space exploration for reducing communication latency and therefore
execution time.
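The instruction classes listed under item iii can be illustrated with a small interpreter. The following Python sketch is not the RIPE implementation: the opcode names (`READ`, `WRITE`, `IDLE`, `IF`, `JUMP`, `SETREG`), the tuple encoding, the `SharedMemory` class, the addresses and the one-cycle cost per transaction are all assumptions made for illustration. It shows how three flow-control instructions plus transactions and idle waits suffice to express a polling loop whose traffic depends on the state of a shared resource.

```python
SEM, DATA = 0x10, 0x20   # hypothetical addresses of a semaphore and a data word


class SharedMemory:
    """Toy shared memory whose semaphore stays locked for the first few reads."""

    def __init__(self, locked_reads):
        self.data = {}
        self.locked_reads = locked_reads

    def read(self, addr):
        if addr == SEM:
            if self.locked_reads > 0:
                self.locked_reads -= 1
                return 1          # 1 = locked by another IP
            return 0              # 0 = free
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.data[addr] = value


def run(program, memory, max_steps=10_000):
    """Execute a program; return (cycle count, list of issued transactions)."""
    regs, pc, cycles, issued = {}, 0, 0, []
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, *args = program[pc]
        pc += 1
        if op == "SETREG":                    # SETREG reg, value
            regs[args[0]] = args[1]
        elif op == "READ":                    # READ reg, addr (assumed 1 cycle)
            regs[args[0]] = memory.read(args[1])
            issued.append(("RD", args[1]))
            cycles += 1
        elif op == "WRITE":                   # WRITE addr, reg (assumed 1 cycle)
            memory.write(args[0], regs[args[1]])
            issued.append(("WR", args[0]))
            cycles += 1
        elif op == "IDLE":                    # IDLE n: computation as a cycle count
            cycles += args[0]
        elif op == "IF":                      # IF reg, value, target
            if regs.get(args[0]) == args[1]:
                pc = args[2]
        elif op == "JUMP":                    # JUMP target
            pc = args[0]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return cycles, issued


POLL = [
    ("READ", "r0", SEM),     # 0: poll the semaphore
    ("IF", "r0", 0, 4),      # 1: free? continue at 4
    ("IDLE", 5),             # 2: back off for 5 cycles
    ("JUMP", 0),             # 3: poll again
    ("SETREG", "r1", 42),    # 4: value to store
    ("WRITE", DATA, "r1"),   # 5: write to shared data
]
```

With `SharedMemory(locked_reads=2)` the program issues three reads and one write in 14 cycles; with `SharedMemory(locked_reads=0)` it issues one read and one write in 2 cycles. The same program therefore produces different traffic depending on the system state it observes.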
2.2 Suggested Future Direction
In Paper #7 we have validated the RIPE framework against a cycle-true reference
system. In the near term, the validation of the ARTS framework against
RIPE or a cycle-true framework is desirable. This step would allow for a seamless
component-based design flow from abstract to cycle-true environment.
In the long term, the complexity of MpSoC architectures and applications
can only be expected to grow. Due to modularity, the challenge in designing
individual components would diminish; however, the challenge of integrating and
understanding the impact of a collection of these components in an MpSoC will
grow. Overall, frameworks that support mixed-abstraction study in a predictable
and scalable fashion are required.
Given the experience gained during this thesis work, considerable research potential
has been identified in the following two fields:
• Mechanisms and interfaces to complement the simulation-based frame-
works with some analytical models would enhance the solution space cov-
ered during the MpSoC design space exploration.
• Flexible techniques to partition and apply parts of an application in
abstract “task” form and other parts in different (possibly C/C++ code
or cycle-count) form would be very useful during the study of a mixed
abstraction design.
Realization of these goals is not easy by any means. As alluded to in
Section 1.2.1, work presented in [14] and [3] is already addressing preliminary
concerns in mixed simulation/analytical frameworks. For mixed-abstraction
instantiation, considerable understanding of the application behaviour and structure
(e.g. functional blocks, OS access, etc) and underlying architecture (cache con-
figuration, synchronization means, etc) is needed. The literature in Chapter 3
mentions many efforts to address this issue.
The practical uses of instantiating designs at any level, or at mixed levels, of abstraction
are many. First, for design from scratch, it can take advantage of the availability (in
terms of the same entity described at multiple abstractions) and selection of IP
cores for performance evaluation at different stages of the design.
With insight and moderation, this will allow investigation of a greater number
of design instances much earlier in the design phase. For simpler MpSoC design
problems, one could even envision developing an automated computer-aided tool
for taking the design problem from specification to candidate solution in a
fast and rigorous manner. Second, for design re-use, it allows us to assess
the impact of replacing select parts of a design without excessive modeling
and time spent on integration and debugging. However, until mechanisms to
accomplish this type of easy mixing of abstractions with detailed description of
both hardware and software components are available, the separation of the
IP and the NoC related concerns, as prescribed in our work, can assist the
networked MpSoC designer in optimizing the individual components or the system
as a whole.
2.3 Summary and Conclusion
The contributions of this thesis are two simulation-based frameworks, ARTS
and RIPE, that cover a range of abstractions in modeling networked MpSoC.
Crucially, via these frameworks we have attempted to fill the gaps between
the existing design and modeling frameworks, and the required framework for
realistically capturing hardware and software behaviours. Unlike typical MpSoC
frameworks, which operate in one abstraction, these two frameworks can operate
in a mixed abstraction environment. Additionally, they capture many details
of a true MpSoC device, specifically relating to the application behavior in
the presence of the interconnect, taking into account the IPs’ hardware
characteristics and OS properties.
In the ARTS framework, we have focused on understanding the impact of NoC
in conjunction with IP selection, application mapping and OS dynamics on
system performance (memory peaks, PE utilization, etc). Initial results show
the potential of the framework in providing a flexible and fast way to instantiate
these different components. Via case studies we have attempted to investigate
a couple of design problems associated with mobile multimedia terminals.
In the RIPE framework, we have provided an accurate IP emulation device for
performance evaluation of NoCs and prototyping of IPs under design. A thorough
validation of the framework under diverse conditions in terms of context switch-
ing, synchronization and architecture instances has proven the applicability of
the design methodology.
Overall, the body of work presented in this thesis can address a class of prob-
lems associated with networked MpSoC in a mixed abstraction environment, such
as the impact of NoC topology and protocol on the application flow, the impact of
OS scheduling on NoC traffic density, etc. The two frameworks presented here
allow extensive design space exploration in their respective abstrac-
tions. More importantly, their concepts and the implementation could allow the
understanding of the percolation of design decisions made at higher abstraction,
to lower levels of abstraction in a predictable and timely fashion.
Part II
Body
Chapter 3
Overview of Networked
MPSoC
This chapter consists of the following papers.
#1. Tobias Bjerregaard and Shankar Mahadevan. “A Survey of Research
and Practices of Network-on-Chip.” To appear in the Journal of ACM
Computing Surveys. ACM, 2006.
Chapter 4
The ARTS Modeling
Environment
This chapter consists of the following papers.
#2: Jan Madsen, Shankar Mahadevan, Kashif Virk and Mercury Gonza-
lez. “Network-on-Chip Modeling for System-Level Multiprocessor Simula-
tion.” In Proceedings of the 24th Real-Time Systems Symposium (RTSS),
Cancun Mexico. IEEE, Dec. 2003: 265-274.
#3: Jan Madsen, Shankar Mahadevan, and Kashif Virk. “Network-Centric
System-Level Model for Multiprocessor System-on-Chip Simulation.”
Interconnect-Centric Design for Advanced SoC and NoC. Eds. Nurmi
J., Tenhunen H., Isoaho J., and Jantsch A. Dordrecht, The Netherlands.
Kluwer Publications, 2004: 341-365.
#4: Shankar Mahadevan, Michael Storgaard, Jan Madsen, and Kashif Virk.
“ARTS: A System-Level Framework for Modeling MPSoC Components
and Analysis of their Causality” Modeling, Analysis and Simulation of
Computer and Telecommunication Systems (MASCOTS), Atlanta USA.
IEEE, Sept. 2005: 480-483.
From this group only Paper #3 and Paper #4 are presented in this chapter.
Paper #3 covers the concepts and results presented in Paper #2 and therefore,
Paper #2 is not presented here. We refer the interested readers to Appendix 6
for the full text of Paper #2. With regards to nomenclature, the ARTS frame-
work in Paper #2 and #3 is referred to as ‘abstract system-level model’ or
‘system-level RTOS modeling framework’.
4.1 Network-Centric System-Level Model for
Multiprocessor System-on-Chip Simulation
4.2 ARTS: A System-Level Framework for Mod-
eling MPSoC Components and Analysis of
their Causality
Chapter 5
The RIPE Modeling
Environment
This chapter consists of the following papers.
#5: Shankar Mahadevan, Federico Angiolini, Michael Storgaard, Rasmus G.
Olsen, Jens Sparsø and Jan Madsen. “A Network Traffic Generator Model
for Fast Network-on-Chip Simulation.” In Proceedings of Design, Automa-
tion and Testing in Europe Conference (DATE), Munich Germany. IEEE,
Mar. 2005: 780-785.
#6: Federico Angiolini, Shankar Mahadevan, Jan Madsen, Luca Benini and
Jens Sparsø. “Realistically Rendering SoC Traffic Patterns with Interrupt
Awareness.” IFIP Very Large Scale Integration Systems and their Designs
Conference (VLSI-SoC), Perth Australia. IEEE, Oct. 2005: 211-216.
#7: Federico Angiolini, Shankar Mahadevan, Jens Sparsø, Luca Benini and Jan
Madsen. “A Reactive IP Emulator for Multi-Processor System-on-Chip
Exploration.” Submitted for Journal Publication.
From this group only Paper #7 has been presented as it is a comprehensive
extension of concepts presented in Paper #5 and #6 with new implementation
and case studies. We refer the interested readers to Appendix 7 and 8 for the full
text of Paper #5 and #6. With regards to nomenclature, the RIPE framework
in Paper #5 and #6 is referred to as simply ‘traffic generator’ or ‘reactive traffic
generator’.
A Reactive IP Emulator for Multi-Processor System-on-Chip Exploration
Shankar Mahadevan, Student Member, IEEE, Federico Angiolini,
Jens Sparsø, Luca Benini, Senior Member, IEEE, and Jan Madsen
Abstract
The design of Multi-Processor Systems-on-Chip (MP-
SoCs) emphasizes Intellectual Property (IP) based
communication-centric approaches. Therefore, for the op-
timization of the MPSoC interconnect, the designer must
develop traffic models that realistically capture the appli-
cation behaviour as executing on the IP core. In this paper,
we introduce a Reactive Intellectual Property Emulator
(RIPE) that enables an effective emulation of the IP
core behaviour in multiple (including bit- and cycle-true
simulation) environments. The RIPE is built as a multi-
threaded abstract instruction set processor and it can
generate reactive traffic. We compare the RIPE
models with cycle-true functional simulation of complex
application behaviour (task synchronization, multitasking,
input/output operations). Our results demonstrate high
accuracy and significant speedups. Further, via a case
study we show the potential use of the RIPE in a design
space exploration context.
I. Introduction
The primary design paradigm for Multi-Processor
Systems-on-Chip (MPSoCs) is the separation of the com-
munication and computation concerns, as this enables
Intellectual Property (IP) reuse and shorter design time.
However, to test and optimize the independently developed
IP cores, and assess their collective performance in a
MPSoC platform, one must understand the impact of
the communication fabric on the application executing
on the platform. Fabrics can span over a huge variety
of architectures and topologies, ranging from traditional
shared buses up to packet-switching Networks-on-Chip
(NoC) [10], [14]. To properly assess functionality and
performance, fabric designers build traffic models that
test the interconnect under the most realistic application
behaviour. To date, these traffic models can be grouped
into two primary classes: stochastic models and IP-based
models.
Shankar Mahadevan, Jens Sparsø and Jan Madsen are with the Technical
University of Denmark; Federico Angiolini and Luca Benini are with
the University of Bologna.
The stochastic models provide traffic similar to mathe-
matical distributions such as uniform, Poisson, etc. As seen
in [19] and [11], they have been used in the evaluation of
different interconnect architectures and features. However,
they do not capture the close correlations between differ-
ent events as would be expected in a realistic MPSoC
environment. For example, checks for a shared
resource done by polling generate different amounts of
traffic depending on the relative ordering of accesses to
the resource. Thus, the usefulness of stochastic models is
restricted to validating the correctness of the implementa-
tion of the interconnection backbone, and does not extend
to application-specific optimization.
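The polling example can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `polling_reads` and its parameters (request time, unlock time and polling period, all in cycles) are hypothetical and are not taken from any traffic model discussed here. It counts the read transactions a polling IP issues before it observes the resource as free.

```python
def polling_reads(request_time, unlock_time, poll_period):
    """Count the reads issued by an IP that polls a shared resource.

    The IP issues its first read at `request_time` and repeats it every
    `poll_period` cycles until a read happens at or after `unlock_time`,
    the instant at which the current owner releases the resource.
    """
    reads = 0
    t = request_time
    while True:
        reads += 1                 # one read transaction on the interconnect
        if t >= unlock_time:       # this read observes the resource as free
            return reads
        t += poll_period           # wait, then poll again


if __name__ == "__main__":
    for request_time in (0, 95, 120):
        print(request_time, polling_reads(request_time, unlock_time=100, poll_period=10))
```

With an unlock time of 100 cycles and a polling period of 10 cycles, a request at cycle 0 generates 11 reads, a request at cycle 95 generates 2, and a request at cycle 120 generates 1. The application is unchanged in all three cases; only the relative ordering of the accesses differs, which is the correlation that a fixed statistical injection rate does not express.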
The IP-based models come in several flavours. Some are
described at higher abstraction levels, such as Transaction-
Level Models (TLM), and some at lower abstractions,
such as Cycle-True Models (CTM). The IP-based TLMs
used in [18] and [24] are very useful in fast exploration
of the system fabric; however, the loss of accuracy due
to the highly abstracted description of IP models is an
impediment to thorough fabric optimization. The detailed
IP-based CTMs used in [20] and [16] provide an accurate
picture for such an optimization, but are time-consuming
to simulate, which disadvantages them for repeated use
with alternate fabric architectures and/or feature imple-
mentations. The primary drawback, however, is that in
both cases, the complete application, the operating system
(OS) and the architecture have to be described within the
model, in terms of an abstract system behaviour (TLMs) or
detailed instruction-set behaviour (CTMs). Since the MP-
SoC specifications and designs are susceptible to repeated
changes, this drawback is costly in terms of modeling and
validation time, and may impact time-to-market, which is
an ever-shrinking constraint.
Fig. 1. RIPE as IP Replacement
Fig. 2. RIPE as IP Mock-up
For the purposes of the interconnect designer, a valu-
able tool for exploration and optimization needs to meet
important criteria, as addressed in [9]. These include
repeatability across different fabric architectures, flexibility
for easy incorporation of changes in design specifications,
and scalability and simulation speedups compared to other
models. In this paper, we describe a Reactive IP Emulator
(RIPE) which addresses all of the above criteria, and
extends further by accurately capturing the communica-
tion behaviour that results from the multiple constraints
imposed by
• the application,
• the OS,
• the architecture dynamics.
The RIPE enables a versatile and effective emulation of
IP behaviour towards the MPSoC interconnect and other
IPs in multiple test environments (including bit- and cycle-
true). It is built as a multi-threaded instruction set archi-
tecture with OCP (Open Core Protocol) 2.0 [4] sockets
at its ports. The RIPE allows for easy programming of
sequences of communication transactions interleaved with
idle waits, and is also capable of sensing feedback from the
system. Thereby, it is able to capture the communication-
sensitive portion of IP execution behaviour, such as in
case of synchronization and interrupt events. The response
of the RIPE to such events is governed by the state of
the system resources (communication channels and shared
memory areas), and mimics the behaviour observed with
applications and OS executing on an IP core in a real
MPSoC system. This is the essence of reactiveness. The
main contributions of this work are (i) motivating its need,
(ii) deriving its requirements, (iii) validating these require-
ments, and (iv) demonstrating the impact of RIPE in a
co-exploration environment. Our RIPE approach has been
proposed for complete realistic emulation of the hardware
and the software layers which are stacked in an IP core,
and which eventually determine its behaviour at the pinout
boundary. This enables a complete decoupling of the
simulation of the IP cores from the underlying interconnect
fabric. The RIPE can be programmed to reproduce a range
of behaviours from polling to interrupt-triggered context
switches in presence of an OS. The requirements for such
reactive behaviours are explored in detail in subsequent
sections.
The RIPE device is designed for interconnect per-
formance tuning and matches multi-threaded application
requirements with a truly multi-threaded internal architec-
ture, as will be extensively shown in this paper. Some of
the RIPE concepts were originally introduced as a cycle-
true OCP-based Reactive Traffic Generator (RTG) in [21]
and [6]. The main objective there was to use a device
to accurately play prerecorded system traces back. As
illustrated in Figure 1, by swapping away IP cores for
RIPE blocks in the reference cycle-true system, subsequent
design space exploration of the interconnect can
be performed at the same level of accuracy. We expand
the scope of the RTG architecture in three ways:
• We support multi-threading in the architecture by
maintaining multiple program counters and register
files, in place of inflexible branching within a single
thread.
• To validate this new architecture, the off-line tool-
chain used to convert the system traces into RIPE
programs has been updated extensively.
• We demonstrate how a RIPE program manually writ-
ten by the designer can provide insight on the re-
lationship among the behaviour of the whole system
and of its components. For example, variable densities
of interrupt events can be investigated, or the impact
of cache write-back vs. write-through policies. As
illustrated in Figure 2, this expands the potential of
the RIPE to the modeling of design features that are
not yet fully implemented.
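The first extension, separate program counters and register files per thread, can be sketched as follows. This Python fragment is only an illustration under assumptions: the class names `ThreadContext` and `MultiThreadedEmulator`, the `switch_to` argument standing in for an interrupt, and the string "instructions" are hypothetical and do not describe the actual RIPE architecture or its context-switch mechanism.

```python
class ThreadContext:
    """Per-thread state: its own program, program counter and register file."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.regs = {}


class MultiThreadedEmulator:
    """Executes one instruction per step from the currently active thread."""

    def __init__(self, programs):
        self.threads = [ThreadContext(p) for p in programs]
        self.current = 0
        self.trace = []            # (thread index, instruction) in issue order

    def step(self, switch_to=None):
        # An interrupt is modeled as a request to activate another thread.
        # No state is saved or restored: each thread keeps its own pc and regs.
        if switch_to is not None:
            self.current = switch_to
        thread = self.threads[self.current]
        if thread.pc < len(thread.program):
            self.trace.append((self.current, thread.program[thread.pc]))
            thread.pc += 1


if __name__ == "__main__":
    emu = MultiThreadedEmulator([["A0", "A1", "A2"], ["ISR0", "ISR1"]])
    emu.step()                 # task issues A0
    emu.step(switch_to=1)      # interrupt: handler thread issues ISR0
    emu.step()                 # handler issues ISR1
    emu.step(switch_to=0)      # return: task resumes at A1
    emu.step()                 # task issues A2
    print(emu.trace)
```

In this sketch the interrupted thread resumes exactly where it left off because its program counter was never touched, which is the property that branching within a single thread cannot offer without extra bookkeeping.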
While still stating the suitability of RIPE for cycle-true
environments, we now additionally prove its usefulness as
a design space exploration tool when under less strict tim-
ing constraints. Additionally, we will show traffic profiling
charts that will further motivate and validate the RIPE
approach.
To validate our RIPE model and programming para-
digm, we test the infrastructure against the bit- and cycle-
true detailed MPARM model [20]. MPARM is a homo-
geneous multiprocessor simulation platform that supports
many MPSoC platform configurations and application
suites. As part of the validation, we use the MPARM soft-
ware toolchain to partition and compile different bench-
mark applications onto the various IPs. These application
partitions might conceptually be either routines execut-
ing on general purpose microprocessors, dedicated ASIC
blocks, DMA engines, or any other device.
The rest of the paper is organized as follows. Section II
introduces related work and motivation, and is followed
by a discussion of the requirements for IP emulation
in Section III. Section IV details the RIPE model and
presents a sample program for modeling application flow.
In Section V we discuss the different potential ways to
use the RIPE. Section VI describes the mechanism to
validate the RIPE. Section VII presents results of validation
and simulation for a range of complex benchmarks with
and without OS, while Section VIII shows a case study
where the RIPE is used as a tool for design space
exploration. Finally, Section IX provides conclusions.
II. Related Work and Motivation
The use of IP emulation devices such as traffic genera-
tors (TGs) is not new, and several approaches and models
have been proposed.
In [19], a stochastic TG model is used for fabric
exploration, where the IP behavior is statistically repre-
sented by means of uniform, Gaussian, or Poisson dis-
tributions. A similar approach in [30] uses random and
semi-deterministic distributions. The IP model used for
NoC optimization in [11] takes into account the nature of
MPSoC traffic such as real-time, short-data access, bursty,
etc., however the injection rate is governed by statistical
methods. In [29], an extra dimension of self-similarity is
added to the stochastic model which is argued to assist in
precise characterization of multimedia traffic by examining
the “similarities” in traffic traces at the macroblock-level.
Despite the refinements, the inherent probabilistic nature
of the statistical approaches makes them less accurate, as
each TG injects traffic in complete isolation from every
other. As surveyed in [9], such stochastic models are
therefore widely popular for analysis of macro-networks,
e.g. the Internet, that exhibit such behaviour, which is unlikely
in an MPSoC environment. The simplicity and simulation
speed of stochastic models may make them valuable during
preliminary stages of interconnect development, but, since
the characteristics (functionality and timing) of the IP cores
are not captured, such models are unreliable for optimizing
communication fabric features.
A modeling technique which adds functional accuracy
and causality is Transaction-Level Modeling (TLM), which
has been widely used for SoC design [12], [15], [18],
[22], [23], [25]. In TLM, Inter-Process Communication
(IPC) is realized via channels that implement abstract
blocking or non-blocking communication calls. Thus, it is
argued that TLM enables higher simulation speed than pin-
based interfaces via suppressing “uninteresting” details.
In [22], [23], TLM has been used for bus architecture ex-
ploration. The communication is modeled as read and write
transactions towards the bus. Depending on the required
accuracy of the simulation results, timing information such
as bus arbitration delay is annotated within the bus model.
In [23] an additional layer called “Cycle Count Accurate
at Transaction Boundary” (CCATB) is presented. Here, the
transactions are issued at the same cycle as that observed
in Bus-Cycle-Accurate (BCA) models. Intra-transaction
visibility is traded off for a simulation speed gain. An
average speedup of 1.55x is reported. While modeling
the entire system at TLM, both [22] and [23] present a
methodology for preserving accuracy with gain in simula-
tion speed. Such models are efficient in capturing regular
communication behaviour, but the fundamental problem
of capturing system unpredictability in the presence of
interrupts is not addressed.
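The idea of annotating timing inside a transaction-level bus model can be sketched in a few lines. The Python class below is a simplified illustration, not the models of [22] or [23]: the name `TlmBus`, the delay parameters and their default values are assumptions. Each blocking call completes a whole transaction and advances a cycle counter by the annotated arbitration and transfer delays, so timing is visible only at transaction boundaries.

```python
class TlmBus:
    """Transaction-level bus: blocking read/write calls with annotated delays."""

    def __init__(self, arbitration_delay=2, transfer_delay=1):
        self.arbitration_delay = arbitration_delay   # cycles to win the bus (assumed)
        self.transfer_delay = transfer_delay         # cycles to move the data (assumed)
        self.memory = {}
        self.now = 0                                 # current time in cycles

    def _advance(self):
        # Timing is applied once per transaction, not per signal or bus phase.
        self.now += self.arbitration_delay + self.transfer_delay
        return self.now

    def write(self, addr, value):
        """Blocking write; returns the cycle at which the transaction completes."""
        self.memory[addr] = value
        return self._advance()

    def read(self, addr):
        """Blocking read; returns (value, completion cycle)."""
        return self.memory.get(addr, 0), self._advance()


if __name__ == "__main__":
    bus = TlmBus(arbitration_delay=2, transfer_delay=1)
    print(bus.write(0x0, 7))   # completes at cycle 3
    print(bus.read(0x0))       # (7, 6)
```

Setting `arbitration_delay=0` gives a faster but less accurate view of the same transaction sequence, which mirrors the trade-off between simulation detail and speed described above. Nothing in this sketch reacts to interrupts, which is the limitation noted at the end of the preceding paragraph.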
In [24], a commercial TLM-based reactive workload
generation framework is presented that is somewhat sim-
ilar to our RIPE approach, wherein users can configure
traffic patterns for handling synchronization and inter-IP
events. Though limited to a single-threaded architecture, it
also claims to provide primitives for timing-dependent
behaviour, wherein the user can trigger actions which
do not depend on application flows but on simulation
time. Other commercial efforts also exist, including the
OpenVERA [28] language and toolchain that, in addition
to modeling concurrency and synchronization, also sup-
ports verification from abstract level to RTL. Our RIPE,
while not supporting some classes of timing-dependent
behaviour, supports multi-threading (required for interrupt-
driven OS-supported context switches) and traffic generation
at multiple levels of abstraction, including cycle- and bit-
true environments. More importantly, we have validated
our approach with a cycle-true reference system (details
provided in Section VI), with near 100% accuracy, a step
which the commercial approaches have yet to demonstrate,
thereby limiting the confidence in their usage.
In [20] (MPARM) and [16], complete cycle-true MP-
SoC systems including the full instruction set of the IP
cores and the OS are described. This consequently impacts
the simulation speed and the scalability of the system. Further,
the time required to investigate the performance impact
of relatively minor changes in systems modeled in such
a way is often inflated by the implementation time and
then by a relevant simulation time. This hampers the use
of such models for the iterative design space exploration
process. To overcome the speedup limitation of such
simulation-based approaches, an FPGA-based emulation
platform has been proposed in [17]. Here, the registers
in the traffic generator can be configured to generate
different traffic patterns. However, the configurations use
either the stochastic model or the trace-driven approach,
and the reactiveness capability that is needed for accurate
performance optimization is never mentioned. Further, the
requirement of a state-of-the-art FPGA board, as used in
the emulation, is not always possible to meet.
[Fig. 3. Typical polling synchronization timeline.]
Based on the above considerations, given the require-
ments of accuracy, repeatability, scalability, speed and
flexibility set in Section I, no true IP emulator model
that spans a range of abstraction levels and usage schemes
seems to be available for the MPSoC designers. Our RIPE
is meant to address this need and we will suggest a process
for its usage at multiple abstractions.
The emulation of the IP core behaviour is not simply
a matter of issuing communication transactions, e.g. by
replaying traces collected from a reference system, an
approach that we might call “cloning”. Such an approach
is clearly inadequate for co-exploration, for example when
taking into account the behaviour of a NoC. Here, the
variance of network latency results in unpredictable re-
sponse delays. Such transaction time variability, either due
to topological reasons or to congestion, should propagate
to subsequent transactions, which would also be delayed in
real systems. A simple example of such critical blocking
is execution resumption after a cache refill request.
This observation leads to the concept of “time-shifting”
behaviour: consecutive transactions are tied to each other,
and are issued at times which are a function of the delay
elapsed before receiving responses to previous transac-
tions. However, even this model fails when multi-master
systems come under scrutiny. The arbitration for resources
in such designs is timing-, and thus architecture-, depen-
dent. Therefore, very different transaction patterns may be
observed as a function of the chosen application model and
interconnection design. For example, consider Figure 3,
where two IP cores (IP#1 and IP#2) attempt to acquire
the semaphore lock and, in case the ownership attempt
fails, poll the location until success. It is clear that such
polling checks for a shared resource will generate different
amounts of traffic depending on the relative ordering of
accesses to the resource. Time-shifting of traces is not
going to be enough to reproduce such behaviour.
The picture is further complicated by the presence of
interrupts. While interrupts in themselves do not typically
imply an intensive load on the communication architecture,
interrupt handling, possibly followed by OS-driven task
rescheduling, can severely strain network resources with
activity peaks, which in turn may indirectly affect other
processors. This event-driven processor reaction must be
realistically modeled in order to accordingly optimize the
underlying interconnect fabric.
These observations motivate the need for an IP emula-
tion device that is reactive to the changes in the system
architecture and the application behaviour. It is only by
taking both the hardware and software into account that
a wide range of synchronization patterns, including OS-
based interrupt handling, can be accurately translated into
a realistic test load for an interconnect infrastructure under
development.
Our RIPE is significantly different from either a purely
behavioural encapsulation of application code into a simu-
lation device, which would be in analogy with TLM mod-
eling environment, or detailed instruction-set simulators,
which would be closer to deployable hardware environ-
ment. However, it spans the behaviour of an IP over this
spectrum of environments. The RIPE model we propose is
aimed at faithfully creating traffic patterns as they would be
generated by an IP running an application, not just by the
application; this includes e.g. accurate modeling of cache
refills and of latencies between accesses, allowing for
cycle-true simulations. We now look at the requirements
for modeling such reactiveness.
III. Reactive IP Emulator Requirements
Communication over a shared fabric can be categorized
according to several different criteria, such as explicit (for
example, data fetching) vs. implicit (for example, instruc-
tion cache refills), or computation-related (data processing)
vs. synchronization-related (exchange of signals among
processors to keep the status of the system consistent).
Another possible criterion is:
I. Processor-initiated communication towards an exclusively
owned slave peripheral (e.g. accesses to a
private memory),
II. Processor-initiated communication towards a shared
slave peripheral (e.g. accesses to a shared memory
or semaphore),
III. System-initiated communication towards the processor
(e.g. interrupts).
[Fig. 4. Communication with Private Memory.]
Figure 4 shows a simplistic model of Category I traffic,
i.e. a master accessing a private slave. For such traffic,
a trace containing the type and the timestamp of the
communication events can be captured at the IP ports,
and is subsequently sufficient to emulate the behaviour of
the master via non-preemptive sequential communication
transactions interleaved with an appropriate amount of idle
wait cycles. To elaborate, consider the first two master
transactions, a write (WR) and a read (RD). The time to
service the WR transaction, which is a posted write, is
just the network latency plus the slave WR access time.
The RD, which in our case uses blocking semantics, pays
an additional penalty because the response has to make
its way back to the master. From the emulation point of
view, this pattern is easily recordable: network latency and
slave access time are unimportant factors, and the essential
point to capture is just the delay between WR assertion and
RD assertion (wait time), and between RD response and
the following command. This is the essence of the “time-
shifting” technique discussed in Section II. In a subsequent
simulation with RIPEs replacing IP cores, these delays
will be modeled by explicit idle waits in the RIPE, while
the network latency will be dependent on the interconnect
model under simulation. In the next set of transactions,
where a RD closely follows a WR, the RD command reaches
the slave before the latter has finished servicing the WR, and
is thus stalled at the slave interface. This stalling behaviour
does not need to be explicitly captured in a RIPE model,
since, from a processor perspective, it simply appears to
be part of the slave response time.
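The time-shifting of Category I traffic described above can be sketched in a few lines. This is a minimal illustration, not the RIPE implementation: the `Transaction` record, its field names and the `latency` callback are assumptions introduced here, with `done` standing for the cycle at which the master may proceed (response for a blocking RD, command acceptance for a posted WR).

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    cmd: str    # "RD" (blocking) or "WR" (posted)
    issue: int  # cycle at which the command was asserted in the reference run
    done: int   # cycle at which the master could proceed again

def idle_waits(trace):
    """Idle cycles before each command: the gap between the completion of
    the previous transaction and the assertion of the next one."""
    waits, prev_done = [], 0
    for t in trace:
        waits.append(t.issue - prev_done)
        prev_done = t.done
    return waits

def replay(trace, latency):
    """Re-issue the same commands with the recorded idle waits, while the
    completion time of each one comes from latency(cmd), i.e. from the
    interconnect model under study (hypothetical callback)."""
    now, issued = 0, []
    for t, wait in zip(trace, idle_waits(trace)):
        now += wait                 # explicit idle wait
        issued.append((t.cmd, now))
        now += latency(t.cmd)       # architecture-dependent part
    return issued
```

For a reference trace with a WR at cycle 10, a RD at cycle 20 answered at cycle 35, and a WR at cycle 40, the recorded waits are 10, 8 and 5 cycles. Replaying them on a slower fabric whose RD round trip takes 25 cycles moves the last WR from cycle 40 to cycle 50, while the waits themselves are unchanged.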
Modeling requirements of Category I traffic can be
predicted or inferred given an algorithmic specification.
In [29] and [11] such inference is drawn to test the fabric
architecture. However, for categories II and III listed above,
it is almost impossible to predict traffic requirements
without detailed models of the underlying hardware (such
as cache replacement policies) and without simultaneously
tracking the status of each processor and shared resource of
the system. This requirement is the roadblock that limits the
applicability of most synthetic TG approaches,
and it is also the area on which RIPE focuses. So, in
describing the requirements for a reactive emulator, we will
not consider dataflow issues, but instead the much more
challenging synchronization traffic patterns. The capability
of handling synchronization and system-initiated traffic is
a first requirement for RIPE.
To understand the implications of these different MP-
SoC traffic categories, we looked at typical application
behaviour in MPSoCs. Specifically, the following ex-
amples from real-life were considered [31]: multimedia
data stream processing, time slicing mechanisms in OS
schedulers, and I/O device handling. Depending on the
underlying hardware architecture and on the application
requirements, a range of synchronization schemes, each
leading to different communication patterns, can be ob-
served in these examples. To derive the requirements for
our realistic IP emulation, we coded templates of these
applications. We leverage the previously introduced (in
Section I) MPARM simulation environment to execute
these templates. By transforming the information collected
during such execution into a RIPE program (a description
of the process will be provided in Section VI), we can
validate our approach, and compare the performance and
accuracy achievable with the RIPE execution engine.
The details of these template programs are described
next. The first example is about synchronization patterns
typical of multimedia data stream processing, where multi-
ple computational blocks are deployed in a pipelined fash-
ion and communicate according to the producer/consumer
paradigm. The latter examples are more strictly related to
interrupt handling in the presence of an underlying OS which
performs context switching.
A. poll
In the simplest synchronization case (“poll”), one or
more processors competing for a shared resource may
poll a semaphore, performing an unpredictable number
of accesses prior to lock acquisition and flow resumption.
[Fig. 5. poll application flow.]
[Fig. 6. Typical interrupt synchronization timeline.]
[Fig. 7. pipe application flow.]
For this case, a single task is mapped onto every system
core. Tasks are programmed to communicate with each
other in a point-to-point producer-consumer fashion; every
task acts both as a consumer (for an upstream task) and
as a producer (for a downstream task), therefore logical
pipelines can be achieved by instantiating multiple cores
and tasks. Synchronization is needed in every task to
check the availability of input data and of output space
before attempting data transfers. To guarantee data in-
tegrity, semaphores are provided to assess such availability.
For example, the consumer checks a semaphore before
accessing producer output. Figure 3 presented earlier illustrates
such a scenario. Here, two tasks, i.e. the producer on IP#1
and the consumer on IP#2, attempt to gain access to the same
hardware semaphore, which controls an area of shared
memory used for data exchange. IP#1 arrives first and
locks the resource; the attempt by IP#2 thus fails. If the
semaphore is found locked upon the first read, the IP reacts
with a continuous polling strategy, whereby IP#2 regularly
issues read events until eventually the semaphore is found
unlocked. Figure 5 represents the application flow of the
polling IPs. Since the transactions occur over a shared
network fabric, the unlock event (WR) issued by IP#1 and
the success of the next request (RD) event by IP#2 are
interdependent. IP#2 will be granted the semaphore without
additional polling events only if its RD event is issued at
least $t_{nwk,IP\#1} + t_{unlock,S} - t_{nwk,IP\#2}$ after the
unlocking WR by IP#1. Therefore, depending on network
properties, a variable number of transactions might be
observed at the OCP interfaces of IP#1 and IP#2.
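The dependence of the polling traffic on network latency can be illustrated with a small calculation. The sketch below is an assumption-laden model rather than anything taken from the reference system: the function name, its parameters and the `poll_gap` between a failed response and the next RD are hypothetical, and the network latency of IP#2 is assumed to be the same in both directions.

```python
def failed_polls(t_wr, t_nwk1, t_unlock, t_nwk2, first_rd, poll_gap):
    """Count the RD attempts by IP#2 that find the semaphore still locked.

    The unlocking WR issued by IP#1 at cycle t_wr takes effect at the
    semaphore at t_wr + t_nwk1 + t_unlock. A RD issued at cycle t_rd samples
    the semaphore at t_rd + t_nwk2. Reads are blocking, so the next RD is
    issued poll_gap cycles after the response (another t_nwk2) has returned.
    """
    step = 2 * t_nwk2 + poll_gap
    if step <= 0:
        raise ValueError("polling period must be positive")
    unlocked_at = t_wr + t_nwk1 + t_unlock
    t_rd, failures = first_rd, 0
    while t_rd + t_nwk2 < unlocked_at:   # RD arrives before the unlock
        failures += 1
        t_rd += step
    return failures
```

With `t_wr=100`, `t_unlock=2`, `t_nwk2=5`, `first_rd=80` and `poll_gap=4`, a latency of `t_nwk1=5` yields two failed polls, whereas `t_nwk1=20` yields three. The application is identical in both cases; only the interconnect differs, which is why a replayed trace cannot reproduce the traffic.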
B. pipe
An interrupt-based task synchronization scenario
(“pipe”) is illustrated in Figure 6. In terms of functionality,
this case is similar to poll, except for semaphore release
handling, which is now augmented by issuing interrupts. If
the semaphore is found locked upon the first read, polling
could be performed, at a heavy price in terms of energy
consumption, and possibly contributing to the saturation
of the system interconnect. Instead, in this scenario, we
implemented a mechanism which suspends the consumer
task and resumes it only when the producer has data ready.
The producer will notify this event by both unlocking
the semaphore and sending an interrupt. Figure 7 shows
the corresponding application flow within the IPs. Upon
interrupt delivery, the consumer re-evaluates the semaphore
value for fail-safe operation, and since this time it finds it
free, it goes on to process the available input data. The
producer follows a similar flow when attempting to push
data to the output. In this example, the task is interacting
with the OS of the IP cores to voluntarily suspend should
certain conditions be true (i.e. finding a semaphore locked).
Additionally, the task negotiates with the OS to be resumed
upon interrupt receipt. The task may also want to ignore an
interrupt in the following condition: it is possible that the
upstream producer, or the downstream consumer, notifies
availability of data or buffer space before the actual need
for such resources, because the current task is still busy
with previous internal processing.
C. multi
A task scheduling scenario (“multi”) is illustrated in
Figure 8(a). In this case, two tasks (Task A and Task B) run
on each IP; a variable number of system processors may be
present. No explicit communication is performed between
tasks, neither intra- nor inter-core. The context switching
between tasks is performed by the OS in response to an
external interrupt, which may typically be sent by a timer
device. The end of any task automatically triggers a context
switch to the outstanding suspended task. Any nested
interrupts arriving during the context switch are ignored, as
would be expected in any well-behaving system. The tasks
are not explicitly aware of any system synchronization
going on, as they are not notified upon the receipt of an
interrupt, and are just passively suspended and resumed
by the OS. Since tasks can be asymmetric, a difference in
OS scheduling might in turn translate into different traffic
workloads.
D. IO
An I/O-aware application (“IO”) is illustrated in Fig-
ure 8(b). A single task is running on every system proces-
sor. These tasks do not communicate with each other, and
perform independent computation. However, at random
times, a system I/O device sends an interrupt to the IP
cores to signal availability of data. In response to this
signal, the IP executes an interrupt handler routine, which
moves blocks of data across the system interconnect. When
such handling is finished, normal operation is resumed.
The interrupt handling is part of the functionality of an
I/O device driver, and can be programmed as such.
Our RIPE model emulates a processor running an
application. The application may or may not encompass
OS behaviour, and may or may not be composed of
multiple tasks per core. Apart from the poll scenario, in
all the envisioned applications the OS plays an important
role, with both multi and IO involving some form of
multiple tasks per core. The support for OS and multitask-
ing modeling represents the second set of requirements for
the RIPE.
The applications described above are timing-sensitive.
However, within the single task, the overall performed
computation does not change depending on the order of
arrival of external events, and the data dependencies can
be captured. Only the amount of computation between
each pair of events can vary. Should an environment
constraint not be satisfied, tasks always enter some form
of suspension, albeit in very different manners in each of
the examples. The different degree of awareness of OS
functionality in each of these templates is important be-
cause it impacts the ability to annotate execution traces, as
will be seen in Section VI. So, while an execution trace of
these benchmarks shows varying traffic patterns depending
on external timings, the major computation blocks are still
recognizable. Even though tasks with even more
timing-dependent behaviour do exist, modeling such
tasks requires an intra-task notion of context switching.
It is also worth stressing that, though not all
interrupt-driven behaviours are represented, the applications
we analyze here are representative of a vast class
of computation. The model we will propose can capture
all such dynamics with proper insight into the mechanics of
the applications and the OS.
[Fig. 8. Typical interrupt-triggered context switch timeline. (a) multi (b) IO]
The experimental results will prove that traces collected
at the IP-fabric interface are sufficient to accurately repro-
duce the IP core communication, providing an important
mechanism for RIPE validation. These traces should collect
sequences of communication transactions, comprising
requests, responses and interrupt events, separated by
time intervals with no communication, i.e. idle time. A
reference simulation of the entire system should produce
several traces, one per IP core interface.
IV. The RIPE Model
In this section, we motivate the choice of creating
the RIPE as an instruction set processor, then describe
its operation and implementation, which are capable of
reproducing the required IP core reactiveness behaviour.
Instruction                               Size (Words)  Description
Communication Instructions:
  Read(AddrReg)                           1             Read from an address
  Write(AddrReg, DataReg)                 1             Write to an address
  BurstRead(AddrReg, CountReg)            1             Burst read an address set
  BurstWrite(AddrReg, DataReg, CountReg)  1             Burst write an address set
Flow Control Instructions:
  If(arg1, arg2, operand)                 2             Branch on condition
  Jump(label)                             1             Branch direct
  Idle(counter)                           1             Wait for given no. of cycles
  SetRegister(reg, value)                 2             Set register (load immediate)
TABLE I. RIPE instruction set.
A. Motivation for the Instruction Set Architecture
The RIPE must generate traffic patterns according to
two different constraints. First, it must follow the directives
set by the designer, who wants to inject a certain type
of traffic in the system, typically shaped to emulate the
traffic requirements of some application. Second, it has to
respond dynamically to the external environment (conges-
tion, synchronization events) in the same way the applica-
tion that is being modeled would. To generate appropriate
communication transactions “on-the-fly” respecting both
requirements, an instruction set, supported by state regis-
ters and by a programming language, is a natural choice.
By introducing a programmable paradigm, the RIPE can
be used in association with manually written programs to
generate traffic patterns typical of IPs still in the design
phase, helping in the tuning of the communication per-
formance or understanding the causality relationship with
other IPs in the MPSoC. Hence, we choose an Instruction
Set Architecture (ISA) for the RIPE implementation. This
choice allows us to describe reactiveness characteristics of
a wide range of IP cores at different levels of abstraction.
Additionally, this choice allows future deployment as
a hardware device in test chips containing interconnect
prototypes. In [17], the potential of this type of archi-
tecture has been shown within an FPGA-based emulation
platform. The ISA approach, with a fixed device and user-
written programs, avoids time consuming operations such
as recompilation, in the case of behavioural models, or
resynthesis, in the case of a hardware flow. Such steps
would be required by a monolithic traffic generation device
to emulate and study different applications on the same
platform. In this paper, program execution will only be
shown within a simulation model.
From the analysis of communication requirements in
Section III, it can be postulated that three different RIPE
entities might be needed:
• A RIPE emulating an IP master (a processor). This
component must be able to issue conditional se-
quences of communication transactions separated by
idle wait periods. Further, it must be sensitive to
arrival of interrupt events and must support multiple
threads.
• A RIPE emulating a private memory. This component
must be able to respond to communication transac-
tions issued by a master. The RIPE just has to model
the access time but it does not have to provide a data
structure for storage. It simply responds to a read
transaction by providing a dummy value.
• A RIPE emulating a shared memory. This component
must contain a data structure modeling an actual
shared memory (since the values read by the masters
may affect the application flow, e.g. current values of
semaphores).
The second and third entities can be extremely simple
in design, as their logic basically involves a small state
machine to handle the communication protocol at the IP
interface and possibly a storage element for corresponding
memory accesses. In any case, for our tests within the
MPARM framework we could use the equivalent MPARM
blocks. Therefore, only the RIPE entity that emulates an
IP master is described next, and is the main focus of this
paper.
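The difference between the two slave entities listed above can be made concrete with a short sketch. The class and method names below are hypothetical and the protocol state machine is omitted; the only point illustrated is that a private memory needs a timing model alone, while a shared memory must also keep real contents because the values read steer the masters.

```python
class PrivateMemoryModel:
    """Models only the access time; every read returns a dummy value."""

    def __init__(self, access_cycles, dummy=0):
        self.access_cycles = access_cycles
        self.dummy = dummy

    def read(self, addr):
        return self.dummy, self.access_cycles   # (data, cycles spent)

    def write(self, addr, data):
        return self.access_cycles               # data is discarded


class SharedMemoryModel(PrivateMemoryModel):
    """Keeps actual storage, since the values read (e.g. semaphores)
    may affect the application flow of the masters."""

    def __init__(self, access_cycles):
        super().__init__(access_cycles)
        self.cells = {}

    def read(self, addr):
        return self.cells.get(addr, 0), self.access_cycles

    def write(self, addr, data):
        self.cells[addr] = data
        return self.access_cycles
```

A master that writes 1 to a semaphore address in the shared model reads 1 back, whereas the private model would answer with its dummy value after the same delay.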
B. Instruction Set Architecture
The RIPE is implemented in SystemC [2] as a non-
pipelined processor with a very simple instruction set, as
listed in Table I. The RIPE program that controls the device
behaviour contains code to model one or multiple tasks.
These tasks might be actual tasks running on the IP core
which is being modeled, or chunks of the OS layer, such
as its native interrupt handlers and scheduler. The RIPE is
capable of switching the execution flow among these tasks,
as discussed later. Via the OCP 2.0 [4] master transaction
interface, the RIPE is able to issue a sequence of commu-
nication transactions separated by idle wait periods, based
on the programmed flow control conditions. The choice
of the OCP protocol for the interface is motivated by the
availability of this interface on the interconnect side within
the MPARM reference system. Any other standard, such
as AXI (Advanced eXtensible Interface) [7], could also
be supported depending on the interface required by the
interconnection under study.
Special Register  Name          Usage
Interrupt Registers:
  2               IntrpMaskReg  Masks or unmasks interrupts
  3               TaskIDReg     Stores a task ID
  5               SWIntrpReg    Sends a software interrupt from within the program
Other Registers:
  4               RDReg         Stores the data value returned by the Read(AddrReg) instruction
TABLE II. RIPE Special Registers.
The RIPE has a Program Counter (PC) register, an
instruction memory and a register file for each task running
on the core, but no data memory. Collectively, this state
information drives the RIPE execution engine, whose state
machine is described in the next section. The instruction
set consists of a group of commands which issue OCP
transactions (arguments are taken from the register file)
and a group of flow instructions allowing the conditional
programming of sequences of transactions and idle waits.
Within the register file, most registers are general purpose,
and their number can be configured.
Some registers are designated as special purpose; for
example, since in specific flow control scenarios the
data returned by a read command must be available for
evaluation, RIPE provides in Register 4 the response to
the preceding read. Table II shows all designated spe-
cial purpose registers. Of the interrupt-related registers,
Register 2 is used to mask critical sections of the RIPE
program from interrupts. As seen in Section III, different
applications require different responses to interrupt events.
For example, in IO modeling, the main task is always
interruptible, while once in the interrupt handling rou-
tine, additional (nested) interrupts should be temporarily
skipped. In pipe modeling, the interrupt handling is more
specialized; interrupts are only enabled after the task has
suspended, while they are masked during normal operation.
Register 5 allows the RIPE program to assert “software
interrupts”, to which the RIPE model will react by loading
the program and register set of the next thread. Register 3
can be programmed to hold the task ID of the next task to
be loaded and run on the RIPE device out of the available
task pool. The usage of the special registers will be shown
in Section IV-D.
Software interrupts are managed internally by the RIPE
model. In contrast, hardware interrupts are routed through
external wires of the system fabric, and are available on the
sideband portion (SInterrupt) of the OCP interface.
C. ISA Implementation
To execute the instructions discussed above, the RIPE
model implements a simple non-pipelined engine where,
within a single cycle, the instruction is fetched, decoded
and executed. The RIPE can either initiate OCP transac-
tions or perform flow control operations, including setting
up register values.
The Set Register instruction executes the load of an
immediate 32-bit value, which is written to the speci-
fied register (SetRegister(reg, value) in Table I).
This opcode is two memory words long, as it has to
accommodate the immediate data. The class of instructions
relating to communication is designed to execute the OCP
transactions. The OCP transactions are initiated with the
address and data values that were set up in the register file
in the preceding cycles. These instructions are blocking,
i.e. the RIPE execution is suspended until completion of
the OCP handshake, which for a read will include the
latency of the response over the network. Currently, we
support the basic signals and the burst extension of the
OCP 2.0 specification. An extension to support out-of-
order transactions could be achieved by the implemen-
tation of an outstanding instruction buffer. The class of
instructions relating to flow control is used to realize the
reactive behaviour. The If and Jump instructions are used
to change the execution flow and the Idle instruction is
used to emulate the IP computation latency. The If opcode
is two words long, to accommodate its operands and branch
location.
A context switch among tasks in the task pool is
realized simply by referring to the corresponding set of
PC and register file. The explicit swapping operation, i.e.
storing the state of current thread, loading the state of the
next thread and eventually restoring the suspended thread,
which is described in [6], is a byproduct of the presence of
a single task memory in the device, and is no longer needed
since each task now has its own independent program
memory. The context switching is simultaneous with an
incoming interrupt signal, thus avoiding inconsistencies.
Upon interrupt notification, the PC, register file and pro-
gram instruction memory are updated to the task ID read
from the special-purpose Register 3.
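The execution engine and its context switch can be sketched as a small interpreter. This is an illustrative model under stated assumptions, not the SystemC implementation: the `Task` and `Ripe` classes, the `read`/`write` callbacks and the cycle accounting are hypothetical, registers are addressed by name instead of number, and `If` is reduced to an equality test with an explicit branch target because the exact encoding is not given in Table I.

```python
WORDS = {"SetRegister": 2, "If": 2}   # two-word opcodes; the rest take one word

class Task:
    def __init__(self, program, regs=None):
        self.program = program        # {pc: (opcode, *args)}
        self.pc = 0                   # every task keeps its own PC ...
        self.regs = {"IntrpMaskReg": 0, "TaskIDReg": 0,
                     "SWIntrpReg": 0, "RDReg": 0}
        self.regs.update(regs or {})  # ... and its own register file

class Ripe:
    def __init__(self, tasks, read, write):
        self.tasks, self.read, self.write = tasks, read, write
        self.current, self.cycle = 0, 0

    def step(self, hw_interrupt=False):
        task = self.tasks[self.current]
        if hw_interrupt and task.regs["IntrpMaskReg"] == 0:
            # Context switch: simply refer to another PC/register set.
            self.current = task.regs["TaskIDReg"]
            task = self.tasks[self.current]
        op, *args = task.program[task.pc]
        next_pc = task.pc + WORDS.get(op, 1)
        cost = 1                      # assumed: one cycle per instruction
        if op == "SetRegister":
            task.regs[args[0]] = args[1]
        elif op == "Idle":
            cost = args[0]            # assumed: Idle(n) lasts n cycles
        elif op == "Read":            # blocking until the response arrives
            task.regs["RDReg"], latency = self.read(task.regs[args[0]])
            cost += latency
        elif op == "Write":
            cost += self.write(task.regs[args[0]], task.regs[args[1]])
        elif op == "Jump":
            next_pc = args[0]
        elif op == "If":              # simplified: branch if register == value
            reg, value, target = args
            if task.regs[reg] == value:
                next_pc = target
        task.pc = next_pc
        self.cycle += cost
        if op == "SetRegister" and args[0] == "SWIntrpReg" and args[1] == 1:
            self.current = task.regs["TaskIDReg"]   # software interrupt
```

In this sketch a masked hardware interrupt is dropped, and a suspended task resumes at the PC where it left off, because switching only changes which task record the engine refers to.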
The aforementioned instructions must be combined in
a program and then transformed into a binary executable
format for use within the RIPE ISA. The program syntax
and the tool to generate RIPE executables are described
next.
D. Programming Language and Assembler
The programming language to code traffic patterns of
the RIPE is similar to an assembly language, though
additional semantics are provided to make it user-friendly.
It is best explained via the example shown in Figure 9,
where a program to model the IO application is sketched.
Statements starting with a semicolon (;) are inlined com-
ments.
The RIPE program starts with a header describing
the core and the task identifier: MASTER[<coreID>,
<taskID>]. All of the tasks running on any given IP
core are described within a single program, so that there
is one program per RIPE device. Recall that IO models
an application with a linear program flow, which can be
suspended by the OS to process IO interrupts. Therefore,
two tasks are described: task #0 (the main application) and
task #1 (the interrupt handler).
The next few statements express initialization of the
register file for this task. Unique labels should be used
for register names/tags. This allows correct initialization
and easy identification of the registers within the program.
The PC increases by either one or two locations per
instruction; this is because SetRegister and If, as
seen in Table I, require longer operands and therefore fill
two instruction slots. For task #0, the main body of the
RIPE program is a linear execution flow, composed of
sequences of reads and writes, interleaved
with register accesses (mostly, to set up transaction
addresses and data). Flow control instructions might be
inserted where appropriate, but are not needed in this
model. Note the initialization of interrupt-related registers
at the top of task #0; upon a hardware interrupt, the RIPE
swaps the context to the task having the ID provided in
TaskIDReg, i.e. to task #1 (the interrupt handler). Since
task #0 can be suspended by the OS to process I/O interrupts,
IntrpMaskReg is set as unmasked, allowing for such
suspension.
The OS-driven context switch traffic and the I/O handler
routine are programmed in task #1. Within the interrupt
routine (starting with label IntrptHandler), which is
the critical section of the flow, interrupts are disabled. At
the end of the flow, a software interrupt is triggered to
restore the normal program flow to task #0. Upon another
HW interrupt in the main task, the interrupt handler routine
will be executed again from PC 0. The flow therefore
mimics Figure 8(b).
An assembler was built to convert the human-readable
RIPE program into a binary for execution on the
RIPE device. There is a direct one-to-one correspondence
between program instructions and the binary. Within the
binary, the individual task sections are appended in order
of their task ID. A header with a small task lookup table
is prepended.
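The address assignment performed by such an assembler can be sketched as follows. This is not the authors' tool: the `layout` function is hypothetical, it handles only the statements between BEGIN and END, and it assumes that any line which is not of the form `Opcode(args)` is a label.

```python
import re

TWO_WORD = {"SetRegister", "If"}   # Table I: these opcodes occupy two words

def layout(lines):
    """Assign a PC to every instruction and record label addresses."""
    pc, placed, labels = 0, [], {}
    for raw in lines:
        text = raw.split(";", 1)[0].strip()   # drop inline comments
        if not text:
            continue
        match = re.fullmatch(r"(\w+)\((.*)\)", text)
        if match is None:                     # assumed: a bare word is a label
            labels[text] = pc
            continue
        opcode = match.group(1)
        args = [a.strip() for a in match.group(2).split(",") if a.strip()]
        placed.append((pc, opcode, args))
        pc += 2 if opcode in TWO_WORD else 1
    return placed, labels
```

Applied to the first statements of an interrupt handler (a label, two SetRegister instructions and a Read), the sketch places them at PC 0, 2 and 4, since each SetRegister fills two instruction slots.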
During setup, the RIPE device loads the binary, and
based on the information encoded at the start of the binary
file, it determines the number of tasks and the amount of
MASTER[1, 0] ; Regular task
; Special Registers
REGISTER IntrpMaskReg 0 ; Unmask Interrupts
REGISTER TaskIDReg 1 ; Next Task ID
; General Purpose Registers (GPRs)
REGISTER AddrReg 0xd0abcdef ; Initialize address GPR
REGISTER DataReg 0 ; Initialize data GPR
...
BEGIN ; Comments PC
; Normal application flow
Idle(10) ; Idle for 10 cycles 0
Read(AddrReg) ; 1
...
SetRegister(AddrReg, 0x10fedcab0) ; Setup an address 121
SetRegister(DataReg, 0x10abcdef0) ; Setup a data value 123
Write(AddrReg, DataReg) ; 125
...
END ; 1078
MASTER[1, 1] ; IO driver task
; Special Registers
REGISTER IntrpMaskReg 0 ; Unmask Interrupts
REGISTER SWIntrpReg 0 ; Disable SW Interrupts
REGISTER TaskIDReg 0 ; Next Task ID
; General Purpose Registers (GPRs)
REGISTER AddrReg 0 ; Initialize address GPR
REGISTER DataReg 0 ; Initialize data GPR
...
BEGIN ; Comments PC
; Interrupt Handling Routine
IntrptHandler
; OS Suspension Routine
SetRegister(IntrpMaskReg, 1) ; Mask Interrupts 0
SetRegister(AddrReg, 0x30bebeef) ; Setup an address 2
Read(AddrReg) ; 4
...
; IO Routine
SetRegister(AddrReg, 0x30beefcd) ; 39
SetRegister(DataReg, 0x10101010) ; 41
Write(AddrReg, DataReg) ; 43
Idle(121) ; 44
...
; OS Release Routine
...
SetRegister(SWIntrpReg, 1) ; Trigger SW Interrupt 106
SetRegister(SWIntrpReg, 0) ; Disable SW Interrupt 108
Jump(IntrptHandler) ; 110
; End Interrupt Handling
END ;
Fig. 9. RIPE Program for “IO” Example.
program memory and the register file size to be allocated
to each one.
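As an illustration only, the layout described above (a small task lookup table prepended to task sections stored in task-ID order) can be sketched in Python. The field widths, the little-endian encoding and the function names pack_ripe_binary and read_task_table are assumptions made for this sketch; the actual RIPE binary format is not specified here.

```python
import struct

# Hypothetical layout (an assumption, not the real RIPE format):
#   header : uint32 number of tasks
#   table  : per task (uint32 task_id, uint32 num_registers, uint32 num_words)
#   body   : task sections, appended in order of task ID


def pack_ripe_binary(tasks):
    """tasks maps task_id -> (num_registers, list of 32-bit instruction words)."""
    ordered = sorted(tasks.items())          # sections are stored by task ID
    header = struct.pack("<I", len(ordered))
    table = b""
    body = b""
    for task_id, (num_regs, words) in ordered:
        table += struct.pack("<III", task_id, num_regs, len(words))
        body += struct.pack(f"<{len(words)}I", *words)
    return header + table + body


def read_task_table(blob):
    """Recover the per-task sizes a loader would need at setup time."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return [struct.unpack_from("<III", blob, 4 + 12 * i) for i in range(count)]
```

A loader built along these lines could size the program memory and register file of each task from the table alone, before reading any instruction.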
V. Using RIPE Programs
Depending on IP model availability to the designer,
different ways exist to write RIPE programs which best
represent the desired type of traffic.
A. Trace Parsing
In this scenario, as is seen in Figure 1, the availability
of a pre-existing model for the IP under study is assumed.
Paper #7: A Reactive IP Emulator for MPSoC Exploration 117
MCmd WR MAddr 0x01bedfb0 MData 0x00015958 MBurstSingleReq 0 MBurstSeq INCR 0x4 MBurstLength 1 Time 6860265
SCmdAccept Time 6860295
SInterrupt SFlag 0x00000001 Time 6860310
MFlag Time 6860310
MCmd WR MAddr 0x010b48dc MData 0x00000008 MBurstSingleReq 0 MBurstSeq INCR 0x4 MBurstLength 1 Time 6860375
SCmdAccept Time 6860385
MCmd RD MAddr 0x0100acb0 MBurstSingleReq 1 MBurstSeq INCR 0x4 MBurstLength 4 Time 6860720
SCmdAccept Time 6860730
Resp Data 0xe5901000 Time 6860760
Resp Data 0xe2411001 Time 6860780
Resp Data 0xe5801000 Time 6860800
Resp Data 0xe14f0000 Time 6860820
MCmd WR MAddr 0x0102c040 MData 0x00000000 MBurstSingleReq 0 MBurstSeq INCR 0x4 MBurstLength 1 Time 6860830
SCmdAccept Time 6860840
Fig. 10. Trace syntax example.
In this case, the approach for RIPE program generation
goes through two steps. First, a reference simulation is
performed by using the available IP model, even if plugged
into a different MPSoC platform from the final target one.
In fact, since RIPE programs abstract from the transaction
latency factor, a very fast transaction-level model of the
interconnect can be used in this stage to speed simulation
up. An execution trace is collected. The trace is a very
straightforward log of events on the OCP pinout; entries
include requests, responses and interrupts, all of which
are annotated with timestamps. A sample trace snippet is
sketched in Figure 10.
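To make the trace format concrete, a minimal Python sketch of reading one trace entry is given here. It relies only on the field keywords visible in Figure 10 (MCmd, MAddr, MData, Data, MBurstLength, Time); the function name parse_trace_line and the dictionary keys are hypothetical, and the real off-line tool may parse the file differently.

```python
def parse_trace_line(line):
    """Turn one trace entry (Figure 10 syntax) into a small dictionary."""
    tokens = line.split()
    event = {
        "kind": tokens[0],                              # MCmd, Resp, SInterrupt, ...
        "time": int(tokens[tokens.index("Time") + 1]),  # timestamp as logged
    }
    if tokens[0] == "MCmd":
        event["cmd"] = tokens[1]                        # WR or RD
    # Hexadecimal fields: request address and request/response data.
    for keyword, key in (("MAddr", "addr"), ("MData", "data"), ("Data", "data")):
        if keyword in tokens:
            event[key] = int(tokens[tokens.index(keyword) + 1], 16)
    # Decimal field: number of beats in the burst.
    if "MBurstLength" in tokens:
        event["burst_length"] = int(tokens[tokens.index("MBurstLength") + 1])
    return event
```

Entries without address or data, such as SCmdAccept, simply yield the event kind and its timestamp.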
Second, the trace is parsed with an off-line tool. The
output of the tool is the desired RIPE program. The
resulting program is coded to behave exactly as the original
IP model in the native system, and to behave as the core
would do when plugged to a different interconnect. This
program is now ready to be used for cycle-accurate inter-
connect design space exploration with extremely realistic
test traffic.
This type of flow is useful whenever the pre-existing IP
model is not available, due to licensing or technical issues,
for the next co-exploration phase. In this case, the RIPE
can provide a quick, functional yet cycle-accurate port of
the IP model to an MPSoC interconnect.
The off-line parsing tool must of course have some
knowledge about the traced application in order to correctly
analyze and rearrange execution traces into RIPE
programs. While this effort is not trivial, it is feasible
and provides a path for validation of the presented RIPE
device in a complete cycle-accurate flow, as described in
Section VI.
B. Trace Parsing and Editing
In a related scenario, an IP model might be available,
but it may differ in some respects from the IP that will
eventually be deployed in the SoC device. In this scenario,
the RIPE may be used to approximate the IP, as seen in
Figure 2. The designer may then follow a route similar
to the one outlined above, but with an additional step of
editing the reference trace so that it more closely resembles
that of the target IP. Some examples of the editing steps
which are possible include:
• Removing or adding bus transactions to model a more
or less efficient cache subsystem
• Removing or adding bus transactions to model a
more or less comprehensive target Instruction Set
Architecture (ISA)
• Altering the spacing among bus transactions to reflect
different pipeline designs or timing properties
• Grouping or ungrouping bus accesses to reflect write-
back vs. write-through cache policies
It is certainly reasonable to expect that the alteration
time of the RIPE code will be substantially less than that
required to develop or refine the target IP model, thus
allowing for earlier exploration of the interconnect design
space.
In this scenario, overall cycle accuracy with respect to
the eventual system is of course not guaranteed. However,
the RIPE will still be able to react with cycle accuracy to
any optimization in the SoC interconnect. Provided that the
transaction patterns are kept close to the ones of the target
IP core, the approach will result in valuable guidelines.
C. Direct Development
Of course, RIPE programs can be written from scratch
without reference IP traces. In this case, the flexible RIPE
instruction set allows for a full-featured traffic generation
system. The availability of built-in flow control manage-
ment lets the designer implement the same synchronization
patterns which are present in real world applications (see
Section IV). Additionally, the application chunks enclosed
within synchronization points can quickly be rendered
by exploiting the flexible loop structures provided by
the RIPE ISA, thus providing periodic traffic generation
capabilities at least on par with those of traditional TG
118 The RIPE Modeling Environment
implementations as seen in [19], [11] and [17]. In the very
first stages of development, the RIPE can also be deployed
as a validation tool, to check the correct functionality of
the interconnect under the load of the supported transaction
types.
VI. Validation of RIPE
To test RIPE accuracy and viability, we set up a
validation flow in a cycle-true environment, following the
trace-based outline described in Section V-A. As a first step,
the user performs a reference simulation of the target
applications where all IP cores are simulated using bit-
and cycle-true models, to collect traces. Subsequently, the
traces are processed into RIPE programs. The following
sections describe these steps in detail.
A. Reference MPSoC System
To achieve validation, the RIPE model was integrated
into the MPARM [20] reference system. MPARM is a ho-
mogeneous multiprocessor instruction-set simulation (ISS)
platform with a configurable number of processors as IP
masters with private and shared memories, and semaphore
and interrupt devices. It also contains a port of RTEMS [3]
- a real-time OS. The IP cores can be plugged onto one
of several interconnect architectures, such as AMBA [8],
STBus [27] and ×pipes [13]. The use of the OCP v2.0
protocol at the interfaces between the IP cores and the
interconnect allows for easy exchange of native cores
with RIPE blocks (Figure 1). To record execution traces,
the OCP interface modules within the MPARM system
(the AMBA AHB bus master) were adapted to collect
traces of OCP requests, responses and interrupt events in
a predefined file format (.trc).
It is worth stressing that the complexity of the appli-
cations described in Section III is not trivial from the
modeling point of view. The amount of annotations that
can be extracted from the application and its traces reflects
the programmer’s degree of knowledge and access to
the application synchronization schemes, to the interrupt
routines and to the OS internals.
B. Trace to RIPE Program
The RIPE validation flow is illustrated in Figure 11.
During the reference simulation, traces are collected from
all OCP interfaces in the system. The address and (if
any) data fields of the transactions were also observed.
Trace entries may contain one of many transaction types:
single or burst read/write requests, assertion of hardware
interrupt, arrival of response, etc. Figure 12(a) shows an
example trace.
The next step is to convert the traces into corresponding
RIPE programs (.tgp). The off-line translator tool outputs
symbolic code; Figure 12(b) shows the RIPE program
derived from traces in Figure 12(a). We will explain the
translator operation in detail in Subsection VI-C. Finally, an
assembler is used to convert the symbolic RIPE program
into a binary image (.bin) which can be loaded into the
RIPE instruction memory and executed.
The off-line tool for trace to RIPE program conversion
is written to exploit the sophisticated way application
tasks can be described in RIPE programs and the multi-
tasked architecture described in Section IV. The automated
algorithm in the conversion flow is capable of detecting
and capturing many synchronization behaviours, without
the need for the designer to handle them manually, and
is explained next. Validation of the trace collection and
processing mechanisms can be achieved by collecting
traces with IP cores running on different interconnects, and
verifying that the resulting .tgp and .bin programs match.
The conversion process is fully automated and the time
taken for this process is discussed in Section VII.
C. Translator Operation
In this section we detail the working of the translator.
We use the system traces given in Figure 12(a) as an
example source for transformation into a RIPE program,
and the result is in Figure 12(b).
As discussed in Section III, some prior knowledge about
the IP core used in the reference simulation is required
to accurately program the RIPE device. Apart from the
sequence of transaction requests and responses, the
following information is needed for correct operation of the
translator:
• The global identifier of the IP core in the MPSoC
system
• The clock period of the IP core
• The addressing ranges representing semaphore (pol-
lable) resources
• The timestamp of interrupt events
• The timestamp of the return from an interrupt han-
dling routine
• The timestamp of a spontaneous control yield
The first three pieces of information are encoded in
the trace filename; the rest are explicitly or implicitly
(provided some knowledge of the application functions)
annotated within the trace file. For example, incoming
interrupts are detected on the OCP pinout and explicitly
recorded in the trace. On the other hand, returns from
interrupt handling routines must be located implicitly by
detecting known behaviour, such as a specific memory
access at the end of the handler or at the return point in
the main code. Based on the above information, we first
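[Figure 11: block diagram. An MPARM benchmark run with the Trace Collector yields a Trace (.trc); the Translator converts it into a RIPE Program File (.tgp); the Assembler converts that into a RIPE Binary (.bin), which is loaded into the RIPE Model.]
Fig. 11. Trace to RIPE Program Flow.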
; Simple RD/WR/WRNP
RD 0x 00000104 @55ns
Resp Data 0x088000f0 @ 75ns
WR 0x00000020  0x00000111 @90ns
RD 0x 00000031 @140ns
Resp Data 0x00002236 @165ns
..
..
; pollin g a semaphore!!
RD 0x 000000ff @210ns
Resp Data 0x00000000 @270ns
RD 0x 000000ff @285ns
Resp Data 0x00000000 @310ns
RD 0x 000000ff @305ns
Resp Data 0x00000001 @320ns
..
Network
latency
Next IP comm
transac tion interval
(a)
; Master Core
MASTER[<coreID>,<thrdID>]
; Initializations
..
REGISTER rdreg 0 ; holds value of RD
REGISTER tempreg 0
REGISTER addr 0x00000104
REGISTER data 0
..
BEGIN
Start
Idle(11) ; wait for first inst
Read(addr, rd)
SetRegister(addr, 0x00000020)
SetRegister(data, 0x00000111)
Idle(1)
Write(addr, data, wr)
SetRegister(addr, 0x00000031)
Idle(9)
Read(addr, rd)
..
..
; polling a semaphore location!!
SetRegister(addr, 0x000000ff)
SetRegister(tempreg, 0x00000001)
Semchk
Read(addr, rd)
If rdreg != tempreg then Semchk
..
Jump(Start) ; rewind
END
(b)
Fig. 12. (a) MPARM Trace (b) RIPE Program.
describe the insertion of the SetRegister instruction
within the RIPE program, which is critical to initiate the
correct OCP transactions and flow control behaviour; and
then we describe how the reactiveness is realized.
As seen in Figure 12(b), and described in Section IV-
D, the RIPE program starts with the typical core iden-
tifiers. For the illustrative example in Figure 12(a), let
the clock period be 5ns and the semaphore location be
0x000000ff. Register RDReg is defined as the name of
the special register where the value of read transactions is
stored (Table II).
At the beginning of the trace file, the first communi-
cation request, a read (RD), occurs at 55ns, meaning the
RIPE has to perform 11 (55/5) cycles of idle wait in
the first place. Therefore, an Idle wait is observed in
the RIPE program. When parsing this trace statement, the
translator collects the RD address and initializes one of the
registers marked as available in the register table (tagged as
addr on top of the program). The response is received at
75ns; the translator simply skips to this timestamp, since
response latencies are only dependent on the underlying
network and the IP core (and so the RIPE) is simply
blocked in the meanwhile. The next trace event of interest
is the write (WR) request at 90ns. This means three ((90-75)/5)
cycles have elapsed since the previous response was
received. New values have to be set up in the address and
data registers, which takes a cycle each (either for updating
the already used addr and data or for setting up a new
pair of registers). An ensuing Idle wait is added to fill the
gap. This represents the “time-shifted” behaviour discussed
in Section II: if the RIPE program is run on a different
interconnect where the read response latency is different,
the write request will be accordingly shifted backward or
forward from the 90ns timestamp.
Then, the following read request is translated into the
corresponding RD program call, which is issued after ten
cycles, one spent to set up the target address and nine
in idle waiting. Note that write transactions in the
OCP protocol can be posted, as we assume in this example;
hence the time gap (equivalent to some processing time
within the IP core that is being replaced) between the
previous write command and the current read is noted by
the translator in the RIPE program. The read is blocking
until a response is received, five cycles later.
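The timing arithmetic of this walk-through can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual translator: the function name translate and the event tuples are hypothetical, the one-operand Read and two-operand Write forms follow Figure 9, writes are assumed to be posted, and the registers of the first request are assumed to be preloaded by REGISTER declarations.

```python
def translate(events, period_ns):
    """Convert timestamped events into RIPE-like instructions.

    events are tuples ("RD", addr, t), ("WR", addr, data, t) or ("RESP", t),
    with t in nanoseconds. Idle time is measured from the last response, or
    from the last write command because writes are assumed to be posted.
    """
    program = []
    reference_ns = 0           # instant from which the IP's own delay is counted
    first_request = True       # its registers come from REGISTER declarations
    for event in events:
        kind, time_ns = event[0], event[-1]
        if kind == "RESP":
            reference_ns = time_ns   # latency itself is not encoded in the program
            continue
        gap_cycles = (time_ns - reference_ns) // period_ns
        setup = []
        if not first_request:
            setup.append(f"SetRegister(addr, 0x{event[1]:08x})")
            if kind == "WR":
                setup.append(f"SetRegister(data, 0x{event[2]:08x})")
        idle = gap_cycles - len(setup)   # each SetRegister takes one cycle
        if idle < 0:
            raise ValueError("no slack left: registers must be set up earlier")
        program += setup
        if idle:
            program.append(f"Idle({idle})")
        program.append("Read(addr)" if kind == "RD" else "Write(addr, data)")
        if kind == "WR":
            reference_ns = time_ns       # posted write: no response to wait for
        first_request = False
    return program
```

With the first entries of Figure 12(a) and a 5ns clock period, the sketch yields the Idle(11), Idle(1) and Idle(9) waits discussed above. Since only gaps measured from responses are encoded, a trace of the same core taken on an interconnect with different latencies would produce the same instruction list.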
Now, consider the trace entries from time 210ns to
320ns. By identifying the address as belonging to a
semaphore location and knowing the polling behaviour of
the original IP core, the translator inserts the Semchk label
and an If conditional statement. This statement checks
whether the read value is equal to “1”, which reflects an
unblocked semaphore. This loop effectively models the
semaphore polling behavior. The semaphore address and
expected unblock value are set up prior to the loop label to
avoid repeated initialization, thus allowing for continuous
polling at the maximum frequency rate for unlimited
periods. Idle waits can obviously be added in the loop
should the original IP core have a low-frequency polling
behaviour. All master devices attempting to access this
; multi trace for Core ID #3
RD 0x00000104 @15ns
Resp Data 0x088000f0 @45ns
WR 0x00000020 0x00000111 @95ns
...
RD 0x00000031 @120ns
Resp Data 0x00002236 @225ns
SInterrupt @365ns
RD 0x00000031 @440ns
Resp Data 0x00002236 @465ns
RD 0x0000beef @540ns
Resp Data 0x00002236 @565ns
...
WR 0x00000020 0x00000111 @390ns
SInterrupt @595ns
Burst RD MAddr 0x01009340 Length 4 @620ns
Resp Data 0x00027864 @710ns
Resp Data 0x00029994 @730ns
Resp Data 0xe52de004 @750ns
Resp Data 0xe59f0004 @770ns
...
...
...
(a)
MASTER[3, 0] ; Initializations Task A
; Special Registers
REGISTER IntrpMaskReg 0 ; Unmask Interrupts
REGISTER TaskIDReg 1 ; Next Task ID upon Interrupt
; General Purpose Registers (GPRs)
REGISTER AddrReg 0x00000104 ; Initialize GPR labeled AddrReg
REGISTER DataReg 0 ; Initialize GPR labeled DataReg
...
BEGIN ; Comments
Idle(3)
Read(AddrReg) ; RD @15ns
Idle(8)
SetRegister(AddrReg, 0x00000020)
SetRegister(DataReg, 0x00000111)
Write(AddrReg, DataReg) ; WR @95ns
...
SetRegister(AddrReg, 0x00000031)
Read(AddrReg) ; RD @120ns
Idle(15)
SetRegister(AddrReg, 0x01009340)
SetRegister(CountReg, 0x4)
BurstRead(AddrReg, CountReg) ; Burst RD @620ns
...
SetRegister(SWIntrpReg, 0x00000001) ; Trigger SW Interrupt
SetRegister(SWIntrpReg, 0x00000000) ; Disable SW Interrupt
END
(b)
MASTER[3, 1] ; Initializations Task B
; Special Registers
REGISTER IntrpMaskReg 0 ; Unmask Interrupts
REGISTER SWIntrpReg 0 ; Disable SW Interrupts
REGISTER TaskIDReg 0 ; Next Task ID upon Interrupt
; General Purpose Registers (GPRs)
REGISTER AddrReg 0x00000031 ; Initialize GPR labeled AddrReg
REGISTER DataReg 0 ; Initialize GPR labeled DataReg
...
BEGIN ; Comments
Idle(26)
Read(AddrReg) ; RD @440ns
Idle(74)
SetRegister(AddrReg, 0x0000beef)
Read(AddrReg) ; RD @540ns
...
SetRegister(AddrReg, 0x00000020) ; Set up address
SetRegister(DataReg, 0x00000111) ; Set up data
Write(AddrReg, DataReg) ; WR @390ns
...
SetRegister(SWIntrpReg, 0x00000001) ; Trigger SW Interrupt
SetRegister(SWIntrpReg, 0x00000000) ; Disable SW Interrupt
END ;
(c)
Fig. 13. RIPE Program for “multi” Example. (a) MPARM trace, (b) Task A, and (c) Task B.
semaphore incorporate the same routine in their RIPE
program, thus capturing the system dynamics.
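The way repeated polls of a semaphore collapse into a single loop can be sketched in Python. The sketch assumes that the semaphore address set is known in advance (as required of the translator) and that the unblocked value is 1; the function name emit_reads and the numbered Semchk labels are hypothetical, while the If statement copies the syntax of Figure 12(b).

```python
def emit_reads(reads, semaphore_addrs, unlock_value=1):
    """reads is a list of (address, response value) pairs taken from a trace."""
    program = []
    loops = 0
    i = 0
    while i < len(reads):
        addr = reads[i][0]
        program.append(f"SetRegister(addr, 0x{addr:08x})")
        if addr in semaphore_addrs:
            # However many polls the trace contains, emit one loop: the number
            # of iterations at run time depends on the actual interconnect.
            while i < len(reads) and reads[i][0] == addr:
                i += 1
            label = f"Semchk{loops}"
            loops += 1
            program += [
                f"SetRegister(tempreg, 0x{unlock_value:08x})",
                label,
                "Read(addr)",
                f"If rdreg != tempreg then {label}",
            ]
        else:
            program.append("Read(addr)")
            i += 1
    return program
```

Because the address and the expected value are set before the label, the loop body contains only the read and the test, which is what allows polling at the maximum rate.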
Within the translator, a register allocation algorithm
correctly sets up all the required data in registers before
the OCP or the flow-control instructions that need them
are scheduled for execution. It is possible that streams
of closely packed communication requests may leave few
or no interleaved idle cycles available for preparing their
address (and data, if any). The solution is to exploit the
slack (idle wait time) available further above in the trans-
action sequence for setting up register values for upcoming
instructions. The translator algorithm attempts to use such
slack as much as possible to prefetch register contents.
However, if packed streams are very long, the problem
may be further compounded by lack of free registers. In
this case, the only solution is to increase the size of the
register file. We expect the problem to occur with minimal
frequency, as two idle cycles (for writes) or even just one
(for reads) among transaction entries are enough to allow
for streams of arbitrary length. Otherwise, the maximum
length of streams will be directly limited by register file
size. This is of no importance in the context of a simulation
RIPE device (as in this paper), but would have an area
penalty in a hardware implementation. In the event of lack
of registers, the translator tool prompts the user to increase
the size of the register file in the RIPE architecture and to
attempt the translation again.
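The slack argument can be checked with a short Python sketch. It models only the timing side of the problem, under the simplifying assumption that enough free registers exist: register set-up for a transaction may be placed in any idle gap that precedes it, so a stream is schedulable when, at every point, the idle cycles seen so far cover the set-up cycles needed so far. The function name first_unschedulable is hypothetical, and the register-file limit described above is deliberately not modelled.

```python
from itertools import accumulate


def first_unschedulable(gaps, setups):
    """Return the index of the first transaction that cannot be prepared in time.

    gaps[i]   : idle cycles available before transaction i
    setups[i] : SetRegister cycles transaction i needs (1 for a read,
                2 for a write, 0 if its registers are already loaded)
    Returns None when the whole stream can be scheduled.
    """
    running = zip(accumulate(gaps), accumulate(setups))
    for index, (slack_so_far, needed_so_far) in enumerate(running):
        if needed_so_far > slack_so_far:
            return index
    return None
```

For the three requests of the Figure 12(a) example (gaps of 11, 3 and 10 cycles; set-up costs of 0, 2 and 1) the check passes, whereas a tightly packed stream such as gaps [4, 0, 0, 0] with costs [1, 1, 1, 2] fails at its last transaction.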
D. Handling Interrupt Reactiveness
As mentioned before in Section III, the amount of
annotations that can be extracted from a trace reflects
the degree of access the programmer has to the interrupt
routine and to the OS internals. Specific locations within
the trace file, such as interrupt handling routine entry and
exit points, have to be recognized by the translator tool to
optimally insert the corresponding code as a task into the
RIPE task pool.
The trace files are always annotated with the time of
occurrence of interrupt events. For the IO benchmark, the
interrupt handling routine is supposed to be accessible by
the programmer, as described in Section III; thus, a marker
(a dummy transaction to a known address) can be added
at the exit of the routine to tag it. The transactions within
these bounds are detected as interrupt handling code and
are encapsulated as such in the RIPE program. In Figure 9
we have seen the backbone of the IO RIPE program,
where interrupt response blocks are handled so as to mimic
Figure 8(b).
Using multi as an example, let us consider the interrupt-
triggered reactiveness in more detail. Here, the trace files
are annotated only with the time of occurrence of interrupt
events. Indeed, recall that in the multi benchmark the
interrupt handler is supposed to be completely out of the
programmer’s control, as it is tied to the OS scheduling
code. The IP core toggles between the two tasks upon these
interrupts. Additionally, control is never spontaneously
released by means of SW interrupts: the previously active
task is only resumed upon arrival of a HW interrupt. Thus,
the translator’s job is simply to capture the OCP transaction
stream between two successive interrupts (identified by
the SInterrupt tag in the trace) and append it to the
corresponding task program, knowing that the scheduling
pattern will be alternating. A minor inaccuracy in this
approach is that the code of the OS which manages the
rescheduling cannot be isolated by the translator, and
is instead captured as a part of the instructions of the
next task. Despite the above approximation, experimental
results show a negligible accuracy skew.
Figure 13 shows the trace (a) and the RIPE programs (b) and
(c) for a processor (in this case ID 3) performing two tasks
in the multi scenario. By default, in Figure 13(a), the set of
instructions up to the first HW interrupt (at 365ns) is identified
with task A, which is then coded into the corresponding
[Figure 14: flowchart spanning three tasks (Primary Task, OS Task, Idle Wait). Elements: Normal Computation Flow, Task Execution, “Semaphore locked?” (Yes/No), OS Routines, Task Suspend, Spontaneous Suspension, Idle Wait, “Interrupt?” (Yes/No), Resumption on Interrupt, Task Resume.]
Fig. 14. Application flow of pipe.
program in Figure 13(b). Upon the HW interrupt, the next
set of events are mapped to task B, which is then coded into
corresponding program in Figure 13(c). Upon encountering
the next interrupt (at 595ns), the translator toggles back to
coding task A and this operation continues to the end of the
trace. When appending subsequent execution blocks of the
same task, the translator automatically adjusts the relative
timing between transactions as if the task had executed
without interruption. At the end of each task listing, a
SW interrupt routine is inserted to yield control to any
other task running on the processor whose execution is still
incomplete. This matches what could be expected of
well-behaved OSes, where the end of one task prompts a
non-timer-triggered rescheduling to switch to other pending
tasks to finish the remaining portion of their instructions.
Any further HW interrupts from the timer device are
internally masked as meaningless during this final phase
of execution, since there is only one schedulable task
remaining. During execution, the RIPE ISS automatically
supports context switching, as described in Section IV:
upon an HW interrupt, the RIPE device simply loads the
next instruction from the task whose ID is found in the
TaskIDReg special register.
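The alternating assignment of trace blocks to tasks in the multi scenario can be sketched in Python. This is an illustration only: the function name split_by_interrupt is hypothetical, the trace lines use the simplified syntax of Figure 13(a), and the timing adjustment applied when blocks of the same task are appended is left out.

```python
def split_by_interrupt(trace_lines, num_tasks=2):
    """Assign each trace entry to a task, switching task at every SInterrupt."""
    tasks = [[] for _ in range(num_tasks)]
    current = 0                                  # the trace starts in task A
    for line in trace_lines:
        if line.startswith("SInterrupt"):
            current = (current + 1) % num_tasks  # alternating schedule
            continue                             # the marker itself is not emitted
        tasks[current].append(line)
    return tasks
```

Applied to the entries of Figure 13(a), the read at 15ns and the burst read at 620ns both land in task A, while the reads at 440ns and 540ns land in task B.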
In the pipe scenario, the task is explicitly interacting
with the OS internals, as described in Section III. Usually
this interaction can be achieved by OS API calls, without
direct access to the interrupt handler code, whose exit
point is therefore assumed to be not accessible to the
programmer. As a result, the only annotations of signif-
icance within the trace file are the synchronization points
(semaphore checks) and the interrupt arrival time. The
RIPE program thus mimics the flow shown in Figure 8(c),
first by reading the semaphore location, then choosing
to continue or suspend depending on the lock. Upon
resumption by HW interrupt, a final (re-)check of the
semaphore unlock is done to ensure safe task operation.
Figure 14 shows the equivalent flow. In the RIPE program,
this is realized via three tasks (dotted lines mark their
boundaries). The primary task represents the main applica-
tion flow. The interrupts are masked here, as the application
is insensitive to HW interrupts unless in suspension state.
If the semaphore is found locked, the flow is diverted to
load the OS routine which leads the processor to an idle
wait. The translator captures the chunk of trace after the
semaphore check in an independent OS task, which always
yields control to a third task consisting of an infinite loop
of idle wait instructions. The easily identifiable sequence
of transactions between the eventual arrival of the HW
interrupt and the semaphore re-check is the OS wake-up
routine to reschedule the suspended main program, and the
translator appends it as a part of the OS task. In the RIPE
program, HW interrupts are used to wake up from the
suspension state within OS routines, while SW interrupts
redirect the execution flow towards the main task. Note
that IntrpMaskReg is set to “masked” for the regular
program and OS execution, and is only unmasked within
the suspension task.
After performing the translation described in this Sec-
tion and after RIPE program assembling, a second set of
simulations can be run on a platform with RIPE and a vari-
ety of interconnect fabrics, thereby evaluating performance
of interconnect design alternatives.
VII. Validation Results
As discussed earlier, for validation we simulated the
different benchmarks within the MPARM framework, first
using the native ARM cores and then using the RIPE
model, and compared the resulting benchmark statistics.
We undertook this experiment for six benchmarks. Each
was tested with one to twelve (1P-12P) system processors
simultaneously plugged to the system interconnect, except
where the application needed at least two or three cores
for functional reasons. The aim was to ascertain the
accuracy of the RIPE approach when stressed by complex
transactions.
Four of the benchmarks are the applications described
in Section III. Two more applications were added as a
reference. Cacheloop is a dummy program, which contin-
uously performs cache fetches. As such, it is generating no
bus transactions, except for a few at boot and shutdown.
It is intended as a metric of the maximum simulation
time speedup achievable by replacement of IP cores with
another simulation device. Matrix is a benchmark where
the application involves one task per processor performing
some private computation. Since no inter-core synchro-
nization is used at all, modeling is very simple and could
be achieved also by traditional TG approaches. The only
source of uncertainty is due to the fact that all tasks
compete for access to the same interconnection resource,
which impacts transaction latency. This test is useful to
[Figure 15: the same benchmark is run on MPARM+AMBA and on MPARM+xPIPES, each run producing a Trace (.trc). The off-line toolchain (Translator, Assembler) turns each trace into a RIPE Program File (.tgp) and a RIPE Binary (.bin); the two program files and the two binaries are marked as equivalent, and the binaries are run on RIPE+AMBA and RIPE+xPIPES.]
Fig. 15. RIPE and MPARM Accuracy Test.
see if RIPE is correctly responding in a “time-shifting”
scenario, as discussed in Sections II and III.
For multi and IO, we devoted one of the system cores
to the generation of interrupts, emulating the role of a
timer or an IO device; this processor is not generating any
other traffic on the bus, and is just idling between interrupt
generation events. The pipe benchmark does not need this,
since interrupts are directly triggered by the same tasks
which perform the computation.
In the first experiment, we only aimed at validating
the trace collection and off-line processing environment.
Figure 15 outlines the process. We ran the same bench-
marks over two of the interconnects of MPARM, namely
AMBA AHB [1] and the ×pipes [26] NoC, noticing
very different execution times due to different latency
and scalability features. Execution traces reflected these
differences. However, after translation, a check across .tgp
programs showed no difference at all, because the network
latency factor is completely abstracted away in the RIPE
programs. As a consequence, a trace collected on one
interconnect could be used to generate a program to be run
on another; the resulting execution would match that of the
same benchmark natively run on the second interconnect.
This result strengthens the postulate of the feasibility of
an effective approach which decouples simulation of the
IP cores and of the underlying interconnect fabric.
Table III summarizes the results of simulations done
on the AMBA AHB interconnect with ARM processors
from MPARM and then with RIPEs. The different columns
relate to cumulative execution (Cmlt. Exec.) cycles of
the benchmarks, the number of single read (SR), single
write (SW) and burst read (BR) transactions observed on
                    RIPE                                  MPARM                                 Accuracy                       Speedup
Benchmark    #IPs Cmlt.Exec     SR      SW      BR  Sim Cmlt.Exec     SR      SW      BR  Sim     Exec      SR      SW      BR
                   (cycles)                         (s)  (cycles)                         (s)      (%)     (%)     (%)     (%)    (x)
SP Cacheloop    1   2500692      0      16      25    8   2500700      0      16      25   15   0.000%  0.000%  0.000%  0.000%   1.88
SP matrix       1   1324132      0   58751      92    5   1324138      0   58751      92    9   0.000%  0.000%  0.000%  0.000%   1.80
Cacheloop       2   2500916      0      32      51   10   2500908      0      32      51   26   0.000%  0.000%  0.000%  0.000%   2.60
                4   2501721      0      64     106   15   2501714      0      64     106   49   0.000%  0.000%  0.000%  0.000%   3.27
                6   2502565      0      96     156   22   2502558      0      96     156   67   0.000%  0.000%  0.000%  0.000%   3.05
                8   2503321      0     128     201   28   2503314      0     128     201   87   0.000%  0.000%  0.000%  0.000%   3.11
               10   2504137      0     160     251   35   2504130      0     160     251  117   0.000%  0.000%  0.000%  0.000%   3.34
               12   2504953      0     192     301   40   2504946      0     192     301  141   0.000%  0.000%  0.000%  0.000%   3.53
Matrix          2   1324711      0  117502     186    7   1324717      0  117502     186   16   0.000%  0.000%  0.000%  0.000%   2.29
                4   1326582      0  235004     374   12   1326588      0  235004     374   28   0.000%  0.000%  0.000%  0.000%   2.33
                6   1330971      0  352506     562   16   1330977      0  352506     562   39   0.000%  0.000%  0.000%  0.000%   2.44
                8   1421281      0  470008     750   22   1421272      0  470008     750   52   0.001%  0.000%  0.000%  0.000%   2.36
               10   1776352      0  587510     921   32   1776343      0  587510     921   77   0.001%  0.000%  0.000%  0.000%   2.41
               12   2131618      0  705012    1105   45   2131609      0  705012    1105  104   0.000%  0.000%  0.000%  0.000%   2.31
poll            2    881839   7176   71764     254    4    883977   7201   71764     254   10   0.242%  0.347%  0.000%  0.000%   2.50
                4    975267  18241  143596     508    8    976488  18183  143596     508   20   0.125%  0.319%  0.000%  0.000%   2.50
                6   1049145  31057  215460     762   12   1049965  31101  215460     762   30   0.078%  0.141%  0.000%  0.000%   2.50
                8   1139110  46044  287356    1016   17   1140199  46300  287356    1016   44   0.096%  0.553%  0.000%  0.000%   2.59
               10   1385053  71989  359284    1270   24   1385007  71966  359284    1270   62   0.003%  0.032%  0.000%  0.000%   2.58
               12   1678901  96756  431244    1524   36   1678804  96689  431244    1524   84   0.006%  0.069%  0.000%  0.000%   2.33
multi           2   1823882     14   85729   24764    9   1824135     14   85729   24764   19   0.014%  0.000%  0.000%  0.000%   2.11
                4   2224333     42  192745   52242   17   2225867     42  192745   52242   37   0.069%  0.000%  0.000%  0.000%   2.18
                6   2818936     70  299963   80158   30   2820912     70  299963   80158   60   0.070%  0.000%  0.000%  0.000%   2.00
                8   3482223     98  407707  109820   48   3482793     98  407707  109820   91   0.016%  0.000%  0.000%  0.000%   1.90
               10   4129205    126  515815  138427   64   4135736    126  515815  138427  136   0.158%  0.000%  0.000%  0.000%   2.13
               12   4800566    154  624107  167789   89   4801433    154  624107  167789  184   0.018%  0.000%  0.000%  0.000%   2.07
IO              2   1156047   2560   68494   18271    6   1158639   2560   68495   18271   12   0.224%  0.000%  0.001%  0.000%   2.00
                4   1446888   2560  145826   36966   11   1449109   2560  145827   36966   24   0.153%  0.000%  0.001%  0.000%   2.18
                6   1870491   2560  223166   55654   20   1872248   2560  223167   55654   39   0.094%  0.000%  0.000%  0.000%   1.95
                8   2325228   2560  300514   74435   31   2325625   2560  300515   74435   60   0.017%  0.000%  0.000%  0.000%   1.94
               10   2780595   2560  377947   93274   44   2781660   2560  377948   93274   95   0.038%  0.000%  0.000%  0.000%   2.16
               12   3241959   2560  455465  112037   62   3242080   2560  455466  112037  111   0.004%  0.000%  0.000%  0.000%   1.79
pipe            2    745386   2601   56004   16293    4    754998   2601   56004   16293    7   1.273%  0.000%  0.000%  0.000%   1.75
                4   1051512   5246  114118   33257    9   1055056   5247  114298   33313   16   0.336%  0.019%  0.157%  0.168%   1.78
                6   1430317   7888  171880   49895   16   1436149   7888  171880   49895   29   0.406%  0.000%  0.000%  0.000%   1.81
                8   1829005  10530  229675   66321   25   1833183  10530  229675   66321   44   0.228%  0.000%  0.000%  0.000%   1.76
               10   2240354  13172  287435   83114   37   2243537  13175  287975   83296   66   0.142%  0.023%  0.188%  0.218%   1.78
TABLE III. RIPE vs. ARM performance with AMBA.
[Figure 16: bar chart of Speedup (x), vertical axis from 0 to 3.5, with one bar per benchmark configuration: Cacheloop, Matrix, Poll, multi and IO at 2P, 4P, 6P, 8P, 10P and 12P, and pipe at 2P, 4P, 6P, 8P and 10P.]
Fig. 16. RIPE vs MPARM Speedup.
the bus. The simulation time (Sim Time) is reported in
seconds; footnote 1 describes the host machine. The column
“Accuracy” measures how faithfully RIPEs replace the IP
cores, based upon the difference in simulated cycles and
bus accesses, while the column “Speedup” describes the
improvement in simulation time.
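As a worked illustration of the two derived columns, the sketch below recomputes the cycle error and the speedup for the two-processor pipe row of Table III. It assumes that the error is taken relative to the ARM reference run and that the faster run (4 s) is the RIPE simulation; the function names are illustrative and do not come from the original tool flow.

```python
def relative_error(measured: float, reference: float) -> float:
    """Relative deviation of `measured` from `reference`, as a fraction."""
    return abs(measured - reference) / reference


def speedup(reference_time_s: float, replacement_time_s: float) -> float:
    """How many times faster the replacement simulation runs."""
    return reference_time_s / replacement_time_s


# pipe benchmark, 2 processors, values copied from Table III.
# Assumption: the faster run (4 s) is the RIPE simulation.
ripe_cycles, ripe_time_s = 745386, 4
arm_cycles, arm_time_s = 754998, 7

cycle_error_pct = 100 * relative_error(ripe_cycles, arm_cycles)
gain = speedup(arm_time_s, ripe_time_s)

print(f"cycle error: {cycle_error_pct:.3f}%")  # 1.273%
print(f"speedup: {gain:.2f}x")                 # 1.75x
```

Under these assumptions the computed values match the 1.273% and 1.75 entries of the same table row.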
The table shows that replacing ARM processors with
RIPEs yields excellent accuracy, over 99% in most cases,
resulting in a faithful reproduction of the original execution
flow and traffic pattern. The near-matching numbers of
read and write accesses validate the correctness of our
RIPE program translation (see Section VI). Inaccuracies
in execution time can be explained as follows. In poll, the
number of single reads is the primary source of inaccuracy.
This is due to the compounding of minimal timing
mismatches caused by the semaphore polling mechanism in
RIPE programs. In the real system, the first few semaphore
polls were found to occur at a slightly different rate
than subsequent ones, due to assembler-level and caching
effects. Eventually, polling occurs at periodic intervals.
This initial timing mismatch is not captured in the RIPE,
which performs all polling loops at the asymptotic rate.
This causes RIPE to be affected by a small timing skew,
which impacts subsequent simulation. As the results show,
this has negligible consequences on the application flow,
which is dominated by the interconnect delay.
Footnote 1: Benchmarks were taken on a Pentium 4® 2.26 GHz with 1 GB of RAM.
The absence of disk swapping effects was checked during simulation.
Especially for benchmarks with a short duration, time measurements were
taken by averaging over multiple runs, and care was taken to minimize
disk loading effects.
The inaccuracies in OS- and interrupt-related benchmarks
are due to minor issues in properly pinpointing
different sections of OS code in the execution trace, as
discussed before in Section VI. Nevertheless, the near-matching
statistics confirm the role of the RIPE as a powerful
design tool that mimics complex application behaviour in
place of a real IP core.
Scalability tests, performed by increasing the number
of processors attached to the bus, exhibit two main
trends, as seen in Figure 16. Cacheloop exhibits
a fundamentally monotonic trend, showing the advantage
of replacing a progressively increasing number of system
cores with a faster device model. Other benchmarks show
a fundamentally constant figure, or an increase with the
number of processors which gets capped at some point
(for example, Matrix). This seemingly strange behaviour
can be explained by recalling that the simulated system
also comprises the interconnect model and
some simulation support (simulation scheduler, statistics
collection, etc.). Therefore, the simulation time cannot be
decreased below a certain threshold. Further, an increase
in the number of processors also implies more traffic on
the interconnect, shifting the simulation load towards the
latter and hindering any speedup. At a certain point, the
fabric becomes completely saturated. In this condition, no
further speedup is achievable at all because both ARM
and RIPE execution time is dominated by idle waits for
bus responses, a situation where the ARM simulation
model can be as fast as any possible replacement. To
support this analysis, we observe that the lowest speedup
is achieved for pipe, which is also found to be the
benchmark with the highest bandwidth requirements (and
therefore the highest load on the interconnect model). We
would like to stress that, as Cacheloop demonstrates, this
decrease in simulation speedup is not a shortcoming of
our RIPE approach, but a direct consequence
of benchmark and system behavior. The IO and multi speedup
figures are slightly higher than those of pipe, partly thanks
to the presence of one basically idle processor devoted
only to interrupt generation. In absolute terms, a gain of
1.75x to 3.53x was observed when running the benchmark
code on RIPEs as opposed to ARM ISSs. This speedup
is due to the removal of the computation logic within
cores. It is noteworthy that even though speedup is not
the primary objective of RIPE, it compares favorably to
previous work in the area (a speedup of 1.55x is reported
in [23]), especially given the fact that it is achieved at the
cycle-true level of abstraction.
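The saturation argument above can be illustrated with a simple Amdahl-style model. This is only a sketch under stated assumptions: host simulation time is split into a per-core part, which shrinks when an ISS is replaced by a RIPE, and a shared part (interconnect model and simulation support), which does not. The function name and all numbers are hypothetical and are not measurements from Table III.

```python
def modeled_speedup(core_time_arm, core_time_ripe, shared_time):
    """Speedup when only the per-core share of host time shrinks.

    shared_time covers everything that is simulated identically in both
    setups: interconnect model, scheduler, statistics collection.
    """
    return (core_time_arm + shared_time) / (core_time_ripe + shared_time)


# Hypothetical host-time budgets in arbitrary units, not measured values.
scenarios = [("light traffic", 2), ("heavy traffic", 10), ("saturated bus", 100)]
for label, shared in scenarios:
    print(f"{label}: {modeled_speedup(10, 1, shared):.2f}x")
```

In this model the speedup is bounded above by the ratio of the per-core costs and tends towards 1 as the shared part dominates, which is the qualitative behaviour described for bandwidth-hungry benchmarks such as pipe.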
The time penalty for trace collection is small, and is
incurred only once. For example, when running the rela-
tively complex pipe benchmark on the AMBA interconnect
with four ARM processors, a benchmark run augmented
to collect reference traces takes 20 s, and subsequent
translation and elaboration requires an additional 12 s for
a 5.6 MB trace file. Only one such iteration is needed to
validate the RIPE model and for subsequent design space
exploration. Additionally, since processed RIPE programs
are identical regardless of the reference interconnect in
which raw traces were collected, such collection could be
performed on top of a transactional fabric model, further
reducing the impact of the reference simulation.
VIII. Case Study
To demonstrate the potential of the RIPE as a
co-exploration tool, we look in more detail at a variant of
the multi application, first discussed in Section III-C.
Specifically, we consider a five-processor bus-based system
with one RIPE configured to act like a timer device.
This core triggers the delivery of interrupts at regular
intervals to the other four RIPE devices, which as a
result switch between two tasks. The two tasks are tuned
to have very different bandwidth requirements: one task
performs matrix manipulations (MM) and relies heavily
on data caches to minimize memory transactions, while the
second task performs streams of writes (WS) to a memory
attached to the bus. The WS task is very demanding on
the interconnect and can easily saturate it, therefore hurting
overall system performance.
            Interval among interrupts   Notes
            to same core (ms)
Reference   2
Case I      1
Case II     2                           Processors receive interrupts
                                        staggered by a 0.5 ms offset
Case III    2                           Two processors receive an extra
                                        interrupt just after the boot
TABLE IV. Interrupt issue frequency for four
different multitasking patterns
In this case study, using the RIPE, we test the behaviour
of this system for different interrupt delivery policies and
study the resulting traffic profiles (Fig. 17-20). This type
of exploration may be useful to schedule bus accesses
for real-time tasks in critical systems. The traffic plots
show the profile of the bus traffic over time, expressed
as transferred data words over a time window of 2 µs.
This method of presentation is useful to note the load on
the bus over the complete execution period, without the
need for cumbersome investigation of correlation among
different processors via individual bus activity plots.
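A traffic profile of this kind can be obtained by binning bus transfers into fixed time windows. The sketch below is a minimal illustration, assuming a log of (timestamp in µs, transferred words) pairs; the log format and the function name are hypothetical and do not reflect the actual MPARM statistics output.

```python
from collections import Counter


def traffic_profile(transfers, window_us=2.0):
    """Sum transferred words per fixed-size time window.

    `transfers` is an iterable of (timestamp_us, words) pairs.
    Entry i of the result covers [i * window_us, (i + 1) * window_us).
    """
    per_window = Counter()
    for timestamp_us, words in transfers:
        per_window[int(timestamp_us // window_us)] += words
    if not per_window:
        return []
    # Windows without any transfer are reported as 0.
    return [per_window[i] for i in range(max(per_window) + 1)]


# Hypothetical log: (time in µs, words moved by that transaction).
log = [(0.1, 4), (1.5, 1), (2.2, 8), (6.9, 4), (7.0, 1)]
profile = traffic_profile(log)
print(profile)  # [5, 8, 0, 5]
```

Plotting such a list against the window start times gives a single bus-level curve, without needing one activity plot per processor.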
For these experiments, to achieve maximum realism, the
RIPE programs modeling the tasks on the four computa-
tion cores were created by translating MPARM execution
traces. However, they could have easily been written by
hand. In MPARM, interrupts are triggered by writing to a
specific address of a memory-mapped device; therefore, to
trigger the interrupts that should come from a timer device,
we wrote a small RIPE program issuing OCP writes at the
right times. In turn, this is achieved by parameterized idle
waits. Such a program was written in about a dozen lines of
RIPE code.
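The timer behaviour for the four patterns of Table IV can be sketched as a schedule of interrupt times per core. This is not RIPE code; it is a Python illustration that assumes the boot ends at 6000 µs and that the first periodic interrupt arrives one period after the boot. The function name, the core numbering and the time horizon are hypothetical.

```python
def interrupt_times(case, n_cores=4, boot_end_us=6000, horizon_us=12000):
    """Return {core: [interrupt times in µs]} for one Table IV pattern."""
    # Case I halves the interval; all other patterns use 2 ms.
    period_us = 1000 if case == "Case I" else 2000
    schedule = {}
    for core in range(n_cores):
        # Case II staggers the cores by 0.5 ms each.
        offset_us = 500 * core if case == "Case II" else 0
        first = boot_end_us + period_us + offset_us
        times = list(range(first, horizon_us + 1, period_us))
        if case == "Case III" and core < 2:
            # Two cores get one extra interrupt right after the boot.
            times.insert(0, boot_end_us)
        schedule[core] = times
    return schedule


for case in ("Reference", "Case I", "Case II", "Case III"):
    print(case, interrupt_times(case))
```

A timer program then only needs to wait until each listed time and issue one write to the interrupt address of the corresponding core.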
In all the plots, the bus activity observed until about the
6000 µs mark corresponds to the OS boot. The boot activity
is irregular, but on average quite intensive in terms of
required bandwidth, since all the processors are loading the
OS and application instructions from the memory across
the interconnect. After this mark, application code begins
to be executed. In Fig. 17, a straightforward scheduling
policy is used: a timer interrupt is sent to each core
simultaneously, therefore causing all of the cores to switch
between MM and WS at the same time. Since interrupts
arrive simultaneously, all cores are
in the same task group during any given time slice of
execution. As expected, the bus load shifts depending
on the task characteristics; the traffic profile exhibits a
clear alternating pattern between two disproportionate usage
values, with peaks above 130 and a floor of around 20
transactions per time window. The number of transitions
between these two limits and the width of each peak
correspond to the number of issued interrupt events and
the interval between them (see Table IV). The tail of the
plot represents shutdown code, and is not relevant.
[Figs. 17 to 20: bus usage (transferred words, axis from 0 to 140) plotted against time (µs, axis from 0 to 30000).]
Fig. 17. Reference traffic pattern
Fig. 18. Case I
Fig. 19. Case II
Fig. 20. Case III
Since excessive contention inflates the response latency
of the bus and therefore hurts performance, the traffic
profile must be reshaped to decrease congestion. As is
observed in Fig. 18, as compared to Fig. 17, doubling
the interrupt issue frequency does little to mitigate the bus
congestion issue; it only shifts the contention to a different
time slot. Execution time remains constant at about 28200
µs.
Let us now consider the impact on the bus activity
of staggering the interrupt events. In Fig. 19, we see the
impact of issuing interrupts to each processor at the same
frequency as in the reference case; however, the interrupts
sent to each processor are staggered with respect to the
interrupts sent to other cores by 25% of the original time
window. As a result, an interrupt is sent every 500 µs, but
two interrupts to the same processor are spaced 2000 µs
apart. The traffic profile is smoother; thanks to staggering,
MM tasks on some cores run in parallel to WS tasks on
other cores. Over time, the system shifts from running
four MM tasks to running four WS tasks and back, which
results in a sinusoidal-like trend with visible steps. Peak
congestion is only reached during a shorter fraction of the
time, therefore reducing the execution time to about 26000
µs.
To balance the traffic even better, the clear choice is
to always overlap two MM and two WS tasks. This is
achieved in Fig. 20, where two processors are forced to
perform a context switch just after the OS boot, and the
subsequent interrupt pattern is the same as in Fig. 17.
Thanks to much better traffic balancing, the bus never
saturates, providing good performance and decreasing the
execution time to 25200 µs.
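The effect of the three scheduling policies on bus load can be illustrated by counting how many cores run the bandwidth-hungry WS task at a given time. The sketch below is an idealized model, not the simulated system: time is measured from the end of the boot, every core is assumed to start in MM unless stated otherwise, and each interrupt is assumed to toggle the task instantly. All names are hypothetical.

```python
def running_ws(t_us, toggle_offsets_us, start_in_ws, period_us=2000):
    """Count cores running the write-stream (WS) task at time t_us.

    Core i switches task at toggle_offsets_us[i] + k * period_us (k >= 0);
    start_in_ws[i] says which task it runs before its first switch.
    """
    count = 0
    for offset, in_ws in zip(toggle_offsets_us, start_in_ws):
        switches = 0 if t_us < offset else (t_us - offset) // period_us + 1
        # An odd number of switches flips the initial task.
        if in_ws != (switches % 2 == 1):
            count += 1
    return count


policies = {
    "Reference": ([2000] * 4, [False] * 4),
    "Case II": ([2000, 2500, 3000, 3500], [False] * 4),
    "Case III": ([2000] * 4, [True, True, False, False]),
}
samples = range(0, 8000, 500)
for name, (offsets, start) in policies.items():
    print(name, [running_ws(t, offsets, start) for t in samples])
```

In this model the reference policy alternates between zero and four WS tasks, Case II steps gradually between the two extremes, and Case III always keeps exactly two WS tasks on the bus.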
Fig. 21. Performance of the four synchronization
patterns under test
In Fig. 21, the benchmark execution time and the
average communication latency for a write transaction on
the bus are plotted for the four configurations. As can be
seen, Case I exhibits basically identical performance to the
baseline, while Case II improves 18% on communication
latency (and thus 8% on execution time) and Case III
improves 24% on latency (and thus 11% on execution
time). Therefore, Case III is the best among the alternatives
under evaluation.
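The execution-time improvements quoted above follow from the approximate execution times reported for each case. A short check, using the rounded values from the text (about 28200 µs for the reference and Case I, about 26000 µs for Case II, and 25200 µs for Case III); the helper name is illustrative:

```python
def improvement_pct(baseline, value):
    """Percentage reduction of `value` with respect to `baseline`."""
    return 100 * (baseline - value) / baseline


# Approximate execution times in µs, as quoted in the text.
exec_time_us = {
    "Reference": 28200,
    "Case I": 28200,
    "Case II": 26000,
    "Case III": 25200,
}
baseline = exec_time_us["Reference"]
gains = {case: round(improvement_pct(baseline, t)) for case, t in exec_time_us.items()}
print(gains)  # {'Reference': 0, 'Case I': 0, 'Case II': 8, 'Case III': 11}
```

The rounded results agree with the 8% and 11% figures given for Case II and Case III.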
These experiments highlight that RIPE can be an ex-
tremely useful tool to explore communication bottlenecks
even without having the real IP cores and benchmarks
attached to the fabric. The flexibility guaranteed by the
interrupt handling support provides the designer with
additional degrees of freedom and accuracy, allowing a
realistic system exploration even in presence of complex
communication and synchronization patterns.
IX. Conclusions
In this paper, we identified the requirements to split the
design of computation and communication entities in an
MPSoC. Modeling requirements were derived from real-
life applications, and they represent complex scenarios
including an operating system layer and asynchronous
interrupt-based synchronization. The key piece of the puzzle
is reactiveness to external events
and state. We presented the RIPE device
and its programming interface to provide support for the
previously identified traffic generation functionality.
We have shown the usefulness of the RIPE device
within different co-exploration domains, either to replace
existing IP cores in new domains or to provide emulation
of IP cores that are under development or even yet to be
designed.
Experimental results show excellent accuracy figures
when validating the RIPE against a reference system,
and a respectable gain in simulation speed when taking
into account previous literature and the cycle-accurate
abstraction level. A case study is supplied to show the
usefulness of RIPE in a design space exploration context.
Future work may carry the current RIPE design to
silicon for on-chip traffic generation.
X. Acknowledgments
The authors from University of Bologna acknowledge
financial support by Semiconductor Research Corporation
(SRC) under contract 1188.
References
[1] The Advanced Microcontroller Bus Architecture (AMBA) home-
page. www.arm.com/products/solutions/AMBAHomePage.html.
[2] The SystemC discussion forum. Web Forum (www.systemc.org).
[3] The Real-Time Operating System for Multiprocessor Systems.
http://www.rtems.com.
[4] Open Core Protocol Specification, Release 2.0, 2003.
[5] IEEE, March 2005.
[6] F. Angiolini, S. Mahadevan, J. Madsen, L. Benini, and J. Sparsø.
Realistically rendering SoC traffic patterns with interrupt awareness.
In IFIP International Conference on Very Large Scale Integration
(VLSI-SoC), September 2005.
[7] ARM. AMBA AXI Protocol Specification, version 1.0.
www.arm.com, March 2004.
[8] ARM Holdings PLC. Advanced Microcontroller Bus Architecture
(AMBA) specification rev 2.0, 2001.
[9] S. Avallone, A. Pescape, and G. Ventre. Analysis and experimenta-
tion of internet traffic generator. In Proceedings of FTDCS, 2004.
[10] L. Benini and G. De Micheli. Networks on chips: A new SoC
paradigm. IEEE Computer, 35(1):70 – 78, January 2002.
[11] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS
architecture and design process for network on chip. In Journal of
Systems Architecture. Elsevier, 2004.
[12] L. Cai and D. Gajski. Transaction level modeling in system
level design. CECS technical report 03-10, Center for Embedded
Computer Systems, Information and Computer Science, University
of California, Irvine, March 2003.
[13] M. Dall’Osso, G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini.
xpipes: A latency insensitive parameterized Network-on-Chip ar-
chitecture for multi-processor SoCs. In Proceedings of the Inter-
national Conference on Computer Design (ICCD). IEEE Computer
Society, 2003.
[14] W. J. Dally and B. Towles. Route packets, not wires: On-chip
interconnection networks. In Proceedings of the 38th Design
Automation Conference, pages 684–689, June 2001.
[15] F. Fummi, P. Gallo, S. Martini, G. Perbellini, M. Poncino, and
F. Ricciato. A timing-accurate modeling and simulation environ-
ment for networked embedded systems. In Proceedings of the 42th
Design Automation Conference (DAC), pages 42–47, June 2003.
[16] F. Fummi, S. Martini, G. Perbellini, M. Poncino, F. Ricciato, and
M. Turolla. Heterogeneous co-simulation of networked embedded
systems. In Proceedings of Design, Automation and Testing in
Europe Conference 2004 (DATE). IEEE, February 2004.
[17] N. Genko, D. Atienza, G. De Micheli, L. Benini, J. M. Mendias,
R. Hermida, and F. Catthoor. A novel approach for network on chip
emulation. In International Symposium on Circuits and Systems,
pages 2365–2368. IEEE, 2005.
[18] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with
SystemC. Kluwer Academic Publishers, 2002.
[19] K. Lahiri, A. Raghunathan, and S. Dey. Evaluation of the traffic-
performance characteristics of System-on-Chip communication ar-
chitectures. In Proceedings of the 14th International Conference on
VLSI Design, pages 29–35, 2001.
[20] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, and R. Zafalon.
Analyzing on-chip communication in a MPSoC environment. In
Proceedings of the Design, Automation and Test in Europe Confer-
ence (DATE). IEEE, 2004.
[21] S. Mahadevan, F. Angiolini, M. Storgaard, R. G. Olsen, J. Sparsø,
and J. Madsen. A network traffic generator model for fast network-
on-chip simulation. In Proceedings of Design, Automation and
Testing in Europe Conference 2005 (DATE) [5].
[22] O. Ogawa, S. B. de Noyer, P. Chauvet, K. Shinohara, Y. Watanabe,
H. Niizuma, T. Sasaki, and Y. Takai. A practical approach for
bus architecture optimization at transaction level. In Proceedings
of Design, Automation and Testing in Europe Conference 2004
(DATE). IEEE, March 2003.
[23] S. Pasricha, N. Dutt, and M. Ben-Romdhane. Extending the trans-
action level modeling approach for fast communication architecture
exploration. In Proceedings of 38th Design Automation Conference
(DAC), pages 113–118. ACM, 2004.
[24] S. Schneider, U. Mueller, and D. Tiegelbekkers. A reactive
workload generation framework for simulation-based performance
engineering of system interconnects. In Modeling, Analysis and
Simulation of Computer and Telecommunication Systems (MAS-
COTS). IEEE, September 2005.
[25] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and
A. Sangiovanni-Vincentelli. Addressing the System-on-Chip inter-
connect woes through communication-based design. In Proceedings
of the 38th Design Automation Conference (DAC’01), pages 667 –
672, June 2001.
[26] S. Stergiou, F. Angiolini, S. Carta, L. Raffo, D. Bertozzi, and
G. De Micheli. ×pipes Lite: A synthesis oriented design library
for networks on chips. In Proceedings of Design, Automation and
Testing in Europe Conference 2005 (DATE) [5], pages 1188–1193.
[27] STMicroelectronics. The ST Bus. http://www.st.com/stonline/,
2004.
[28] Synopsys. OpenVERA Technology Backgrounder. White paper
available from http://www.open-vera.com/, 2001.
[29] G. V. Varatkar and R. Marculescu. On-chip traffic modeling and
synthesis for MPEG-2 video applications. In Transactions on Very
Large Scale Integration (VLSI) Systems, number 1, pages 108–119.
IEEE, January 2004.
[30] D. Wiklund, S. Sathe, and D. Liu. Network on chip simulations
for benchmarking. In Proceedings of the 4th IEEE International
Workshop on System-on-Chip for Real-Time Applications (IWSOC).
IEEE, 2004.
[31] W. Wolf. Computers as Components: Principles of Embedded
Computing System Design, chapter 3. Morgan Kaufmann, 2001.
Part III
Appendix
Chapter 6
Network-on-Chip Modeling
for System-Level
Multiprocessor Simulation
Published in the Proceedings of the 24th Real-Time Systems Symposium 2003.
Complete citation:
Jan Madsen, Shankar Mahadevan, Kashif Virk and Mercury Gonzalez,
“Network-on-Chip Modeling for System-Level Multiprocessor Simulation.” In
Proceedings of the 24th Real-Time Systems Symposium (RTSS), Cancun, Mexico.
IEEE, Dec. 2003: 265-274.
Paper #2: NoC Modeling for System-Level Multiprocessor Simulation
Chapter 7
A Network Traffic Generator
Model for Fast
Network-on-Chip Simulation
Published in the Proceedings of Design, Automation and Testing in Europe
Conference 2005.
Complete citation:
Shankar Mahadevan, Federico Angiolini, Michael Storgaard, Rasmus G. Olsen,
Jens Sparsø and Jan Madsen. “A Network Traffic Generator Model for Fast
Network-on-Chip Simulation.” In Proceedings of Design, Automation and
Testing in Europe Conference (DATE), Munich, Germany. IEEE, Mar. 2005:
780-785.
Paper #5: A Traffic Generator Model for Fast NoC Simulation
Chapter 8
Realistically Rendering SoC
Traffic Patterns with Interrupt
Awareness
Published in the IFIP Very Large Scale Integration Systems and their Designs
Conference 2005.
Complete citation:
Federico Angiolini, Shankar Mahadevan, Jan Madsen, Luca Benini and Jens
Sparsø. “Realistically Rendering SoC Traffic Patterns with Interrupt
Awareness.” IFIP Very Large Scale Integration Systems and their Designs Conference
(VLSI-SoC), Perth, Australia. IEEE, Oct. 2005: 211-216.
Paper #6: Rendering SoC Traffic Patterns with Interrupt Awareness
Bibliography
[1] A. Baghdadi and N-E. Zergainoh. Design Space Exploration for Hard-
ware/Software Codesign of Multiprocessor Systems. In Proceedings of the
11th International Workshop on Rapid System Prototyping (RSP), pages
8–13. IEEE, June 2000.
[2] Luca Benini and Giovanni De Micheli. Networks on chips: A new SoC
paradigm. IEEE Computer, 35(1):70–78, January 2002.
[3] A. Bobrek, J. J. Pieper, J. E. Nelson, J. M. Paul, and D. E. Thomas.
Modeling shared resource contention using a hybrid simulation/analytical
approach. In Proceedings of Design, Automation and Testing in Europe
Conference (DATE), pages 1144–1149. IEEE, Febuary 2004.
[4] Lukai Cai and Daniel Gajski. Transaction level modeling in system level
design. CECS technical report 03-10, Center for Embedded Computer Sys-
tems, Information and Computer Science, University of California, Irvine,
March 2003.
[5] Jon Connell. Arm system-level modeling. Available from ARM website
(http:// www.arm.com), June 2003.
[6] William J. Dally and Brian Towles. Route packets, not wires: On-chip
interconnection networks. In Proceedings of the 38th Design Automation
Conference (DAC), pages 684–689. IEEE, June 2001.
[7] Franco Fummi, Paolo Gallo, Stefano Martini, Giovanni Perbellini, Massimo
Poncino, and Fabio Ricciato. A timing-accurate modeling and simulation
environment for networked embedded systems. In Proceedings of the 42th
Design Automation Conference (DAC), pages 42–47, June 2003.
160 BIBLIOGRAPHY
[8] Franco Fummi, Stefano Martini, Giovanni Perbellini, Massimo Poncino,
Fabio Ricciato, and Maura Turolla. Heterogeneous co-simulation of net-
worked embedded systems. In Proceedings of Design, Automation and
Testing in Europe Conference (DATE). IEEE, Febuary 2004.
[9] Paolo Gai, Luca Abeni, and Giorgio Buttazzo. Multiprocessor DSP
Scheduling in System-on-a-chip Architectures. In Proceedings of the 14th
Euromicro Conference on Real-Time Systems (ECRTS), pages 231–238.
IEEE, June 2002.
[10] A. Gerstlauer, H. Yu, and D.D. Gajski. RTOS modeling for system level de-
sign. In Proceedings of Design, Automation and Test in Europe, DATE’03,
pages 130–135, March 2003.
[11] Thorsten Gro¨tker, Stan Liao, Grant Martin, and Stuart Swan. System
Design with SystemC. Kluwer Academic Publishers, 2002.
[12] R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst.
System level performance analysis - the SymTA/S approach. In IEE Pro-
ceedings - Computers and Digital Techniques, March 2005.
[13] Jon Jonsson. The Impact of Application and Architecture Properties on
Real-Time Multiprocessor Scheduling. PhD thesis, School of Electrical and
Computer Engineering, Chalmers University of Technology, Goteborg, Swe-
den, August 1997. Ph.D. Thesis No. 311.
[14] K. Lahiri, A. Raghunathan, and S. Dey. Design space exploration for opti-
mizing on-chip communication architectures. In IEEE Trans. on Computer-
Aided Design of Integrated Circuits and Systems, 2004.
[15] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, and R. Zafalon. Analyzing
on-chip communication in a MPSoC environment. In Proceedings of Design,
Automation and Testing in Europe Conference (DATE), pages 752–757.
IEEE, Febuary 2004.
[16] G. De Micheli, R. Ernst, and W. Wolf. Readings in Hardware/Software
Co-Design. Morgan Kaufmann, 2001. 1st edition.
[17] OCPIP. Open Core Protocol (OCP) Specification, Release 1.0, 2001.
[18] OCPIP. The importance of sockets in SoC design. White paper download-
able from http://www.ocpip.org, 2003.
[19] Osamu Ogawa, Sylvain Bayon de Noyer, Pascal Chauvet, Katsuya Shino-
hara, Yoshiharu Watanabe, Hiroshi Niizuma, Takayuki Sasaki, and Yuji
Takai. A practical approach for bus architecture optimization at transac-
tion level. In Proceedings of Design, Automation and Testing in Europe
Conference (DATE). IEEE, March 2003.
BIBLIOGRAPHY 161
[20] Sudeep Pasricha, Nikil Dutt, and Mohamed Ben-Romdhane. Extending the
transaction level modeling approach for fast communication architecture
exploration. In Proceedings of 38th Design Automation Conference (DAC),
pages 113–118. ACM, 2004.
[21] P. Pop, P. Eles, and Z. Peng. Analysis and optimization of heterogeneous
multiprocessor SoC. In IEE Proceedings - Computers and Digital Tech-
niques, March 2005.
[22] K. Richter, M. Jersak, and R. Ernst. A formal approach to mpsoc perfor-
mance verification. IEEE Computer, 36(4):60 – 67, April 2003.
[23] James A. Rowson and Alblerto Sangiovanni-Vincentelli. Interface-based de-
sign. In Proceedings of the 34th Design Automation Conference (DAC’97),
pages 178–183, June 1997.
[24] Marcus T. Schmitz, Bashir M. Al-Hashimi, and Petru Eles. System-Level
Design Techniques for Energy-Efficient Embedded Systems. Kluwer Acad-
emic Publishers, 2004.
[25] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and Al-
blerto Sangiovanni-Vincentelli. Addressing the system-on-chip interconnect
woes through communication-based design. pages 667 – 672.
[26] J. Sifakis. Modeling real-time systems - challenges and work directions. In
EMSOFT, Lecture Notes in Computer Science Vol. 2211, pages 373–389.
October 2001.
[27] Andreas Wieferink, Tim Kogel, Rainer Leupers, Gerd Ascheid, Hein-
rich Meyr, Gunnar Braun, and Achim Nohl. A system level
processor/communication co-exploration methodology for multi-processor
system-on-chip platforms. In Proceedings of Design, Automation and Test-
ing in Europe Conference (DATE), pages 1256–1261. IEEE Computer So-
ciety, Febuary 2004.
[28] Daniel Wiklund. Development and Performance Evaluation of Networks
on Chip. PhD thesis, Department of Electrical Engineering, Linkoping
University, 2005. Dissertation No. 932.
