A Simulation framework for hierarchical Network-on-Chip systems by San Pedro Martín, Javier de
Departament de Llenguatges i Sistemes Informàtics
UNIVERSITAT POLITÈCNICA DE CATALUNYA
Master in Computing
Master of Science Thesis
A simulation framework for hierarchical
Network-on-Chip systems
Student: Javier de San Pedro Martín
Advisors: Josep Carmona Vargas and Jordi Cortadella Fortuny
Barcelona, June 22, 2012

Abstract
Today, even the simplest laptop processor has at least four cores and a
graphics card containing tens of cores. It is not hard to find more
performance-oriented processors with hundreds of cores, and processors with
thousands of cores are expected in the not too distant future.
In these and future processors, the design of the interconnection network
between the cores and the memory subsystem is a key design aspect.
Simple topologies like buses or rings provide great efficiency, but do not
scale as well as meshes once the number of cores increases. We explore
the use of hierarchical network designs as an alternative, where different
topologies are stacked in a single network. The lowest layers use rings or
buses, taking advantage of locality, while other layers use meshes or more
complex topologies.
To fully explore these and other chip multiprocessor design aspects, we
build an interconnection network simulator that is capable of simulating
arbitrary hierarchies of multiple network topologies. We propose using
parameterizable automata as traffic sources, as a trade-off between full
processor simulation and simulation using purely random traffic. By altering
the high-level parameters of the automaton, changes in the processor workload
can be simulated, such as the expected average memory traffic, the locality
of the memory accesses, or the additional traffic caused by different cache
coherency protocols.
Acknowledgements
This work has been made possible thanks to a grant from Intel Corpora-
tion and CICYT project TIN2007-66523 (FORMALISM). It is a part of the
research project “Floorplanning and performance evaluation for on-die com-
munication fabrics”.
Additionally, I would like to thank both of my advisors, Josep Carmona
and Jordi Cortadella, for without their patience and support this work would
have never been completed. I would also like to express my gratitude to
Nikita Nikitin, who is also a participant of the same research project, for his
discussions and explanations.

Contents
1 Introduction 5
1.1 Design space exploration . . . . . . . . . . . . . . . . . . . . . 7
1.2 Goals of this project . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Chip multiprocessors 13
2.1 Processing cores . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Single-core performance improvements . . . . . . . . . 16
2.2 The memory hierarchy . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Cache memory . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Consistency and coherency . . . . . . . . . . . . . . . . 24
2.3 On-chip interconnects . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Crossbars . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Network-On-Chip systems . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Topology . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.2 Routing algorithms . . . . . . . . . . . . . . . . . . . . 31
2.4.3 Flow control . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.4 Router architecture . . . . . . . . . . . . . . . . . . . . 36
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Hierarchical network topologies 38
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Routing in a hierarchical topology . . . . . . . . . . . . . . . . 40
3.2.1 Network interfaces . . . . . . . . . . . . . . . . . . . . 41
3.3 Global mesh based topologies . . . . . . . . . . . . . . . . . . 41
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Simulation of a chip multiprocessor 44
4.1 Levels of detail . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Steady-state simulation . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Initialization bias . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 Output analysis and stopping criteria . . . . . . . . . . 49
4.3 Simulator design . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.1 Simulating cores . . . . . . . . . . . . . . . . . . . . . 53
4.3.2 Simulating the memory hierarchy . . . . . . . . . . . . 59
4.3.3 Simulating the interconnect . . . . . . . . . . . . . . . 60
4.3.4 Estimating power and area . . . . . . . . . . . . . . . . 61
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5 Results 63
5.1 Experimental conditions . . . . . . . . . . . . . . . . . . . . . 63
5.2 The case for a CMP . . . . . . . . . . . . . . . . . . . . . . . 63
5.3 Memory hierarchy design . . . . . . . . . . . . . . . . . . . . . 64
5.4 Hierarchical topologies . . . . . . . . . . . . . . . . . . . . . . 65
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6 Conclusions 69
6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Bibliography 71
Glossary 76
A Simulator user manual 77
A.1 Main simulation configuration . . . . . . . . . . . . . . . . . . 78
A.2 Hierarchical topology description file . . . . . . . . . . . . . . 79
B Analytical performance modeling 82
Chapter 1
Introduction
Computing performance has been steadily increasing ever since the dawn of
computing. Until the past decade, one of the most reliable ways to increase
the performance of a computer system while keeping the logical design intact
has been to increase the clock frequency, a strategy allowed by the constant
improvements to the design and manufacturing process of integrated circuits.
However, increasing the operating frequency causes a proportional increase
in the dynamic power consumption of a circuit [1], and therefore in dissipated
heat. This is an important budget drain on supercomputing installations, and
a critical issue in the increasingly mobile mainstream computing market.
In fact, Figure 1.1 shows that since 2005 the average power
consumption of all released mainstream processors has been about the same,
and the operating frequency has not experienced a large increase either.
Since Moore’s law is still in effect, transistor density is still increasing
every year, allowing for more complex designs that increase throughput, all
while keeping the same total chip area.
[Figure 1.1 plots omitted: clock frequency (0 to 4 GHz) and power consumption (0 to 140 W) plotted against year of release.]
Figure 1.1: Evolution of clock speed and power consumption of mainstream computer
processors from 1990 to 2010 [2].
Figure 1.2: A route between A and B is found despite no direct link between them.
One of the most promising strategies to improve performance is to use the
additional transistor budget to replicate functional blocks from an existing
processor design (adders, etc.), so that more than one instruction can be
run in parallel. This technique exploits the parallelism inherent to most
program instruction flows: if the result of an instruction does not influence
the result of a following instruction, then both instructions can be executed
in parallel. Out-of-order processors even manipulate the order of instructions
in the program flow so as to maximize this inherent parallelism.
Still, it is clear that this approach has its limits. While some algorithms
might be fully parallelizable, the average instruction level parallelism for a
single instruction flow has been found to be less than 7 [3]. Additionally, there
is a cost to the infrastructure necessary to reach such a level of parallelism
(e.g. issue queue, multi-port register files [4]).
As an alternative, a chip multiprocessor (CMP) employs the additional
transistor budget to put more than one instance of a processor on the same
chip. As in a multiprocessor system, each processor runs its own instruction
stream, so work can be parallelized more aggressively than by
exploiting instruction level parallelism alone, by running either several
independent applications or several threads of the same application. Unlike
in a multiprocessor system, though, the processors are all on a single chip and
thus can communicate at very high clock rates.
Traditionally, multiple cores on a single chip communicate using buses
with very high bandwidth. However, this imposes yet another performance
ceiling: only one transaction can happen at any given moment on a bus, which
quickly becomes the bottleneck as the number of elements that want to
communicate increases. Making a direct, private connection between each pair
of cores (a full crossbar) might be too costly.
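To give a feeling for why pairwise private connections become too costly, the following minimal sketch counts the links needed to connect every pair of cores directly and compares the result with a ring. The function names are hypothetical and only link counts are considered; wiring length, area and power are ignored.

```python
def pairwise_links(n: int) -> int:
    """Links needed to give every pair of n cores a private connection."""
    return n * (n - 1) // 2


def ring_links(n: int) -> int:
    """Links in a ring of n cores (assumes n >= 3)."""
    return n


for n in (4, 16, 64, 256):
    print(f"{n:4d} cores: {pairwise_links(n):6d} pairwise links, "
          f"{ring_links(n):4d} ring links")
```

The pairwise count grows quadratically with the number of cores, while the ring grows linearly, which is the motivation for sharing links.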
Instead, we can use fewer links but share them. In modern telephone
networks, there is no direct link from each phone service subscriber to every
other subscriber; rather, when someone places a call, a route between both
subscribers is found using the existing links, and it is reserved (no other
phone call can use those links) for the entire duration of the call (as
in Figure 1.2). This is called circuit switching (see Section 2.4.3).
[Figure 1.3 diagram omitted: processors, external memory controllers, graphic processing units and other components attached to a network on chip.]
Figure 1.3: High-level block diagram of a Network-on-Chip.
Similarly, we can talk about a network-on-chip (NoC): an on-chip network with
complex topologies and link management strategies that make more
efficient use of a limited set of links between the components of the chip
(Figure 1.3).
1.1 Design space exploration
Figure 1.4: Two alternative CMP designs. The one on the left promotes larger, more
complex cores and contains shared L3 cache. The one on the right contains a larger
number of simpler cores and no shared cache at all.
The obvious goal for a computer architect is to find the optimal design
according to a set of criteria. A list of the most common criteria is shown
in Table 1.1. A design is evaluated using each of the criteria in order to
determine its quality.
Criterion          Description
Performance        Increasing performance is quite often the main goal when
                   designing a CMP. Many factors affect the final performance
                   of a CMP: the throughput of the individual processors and
                   of the memory subsystem, the bandwidth and latency of the
                   interconnection, etc.
Power consumption  As seen in Figure 1.1, power consumption has recently
                   grown in importance, to the point of being the most
                   limiting factor in the construction of a CMP, both for
                   mainstream users and for supercomputing installations,
                   where every watt saved in power is another watt saved in
                   cooling, power supply inefficiencies, etc.
Physical size      The number and distribution of transistors in a design can
                   affect the die size, and very large die sizes negatively
                   affect the yield when manufacturing ICs.
Heat               Heat dissipation is another big problem. It is, however,
                   closely related to power consumption (heat emission
                   unavoidably increases with power consumption).
Cost               The cost of physically manufacturing the design.
Table 1.1: Most common goals when designing a CMP.
We call design space exploration the process of finding the candidate
optimal design to be evaluated.
The size of the design space to be explored is determined by the ranges
of all software options (e.g., making more efficient and power-conscious
algorithms) and hardware design options [5], so it is far too large for an
unguided complete search. Even with the hardware options only, the main
focus of this thesis, the designer’s freedom is still unmanageably large (a list
of the main options is in Table 1.2).
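As a rough illustration of how quickly the design space grows, the sketch below multiplies the number of alternatives for a handful of hardware parameters. All option counts and the evaluation time per design are invented for the example and are not taken from this work.

```python
from math import prod

# Hypothetical number of alternatives per design parameter (illustrative only).
options = {
    "core count": 8,
    "core type": 3,
    "cache size": 6,
    "topology": 5,
    "routing algorithm": 4,
    "flow control": 3,
}

# Every combination of options is a distinct candidate design.
design_points = prod(options.values())

# Assumed cost of evaluating one design: ten minutes of simulation.
seconds_per_evaluation = 600
total_days = design_points * seconds_per_evaluation / 86_400

print(f"{design_points} designs, about {total_days:.0f} days of evaluation")
```

Even with these modest ranges, an exhaustive search would take about two months of simulation, and each extra parameter multiplies that figure again.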
For the evaluation of a given CMP design, three main strategies are
available. Each of them trades off speed for accuracy, as seen in Figure 1.5. The
most accurate option is to build the real processor. Due to the high cost and
construction time required, this option is unavailable during the exploration
stage. However, as the most precise evaluation option, it is often required
for the final verification stages.
At the opposite side of the spectrum, we find the analytical methods.
These are the most efficient of the evaluation strategies and therefore gener-
ally fit for the exploration process. An analytical model usually expands on
queuing theory [6] or network calculus [7] in order to provide quick mathemat-
ical approximations to the worst-case characteristics of the interconnection
of a CMP. Its main disadvantages are its inherent inaccuracy and the relative
difficulty of development, especially when nondeterministic behavior is to be
modeled.
Parameter          Description
Processors         The number and type of processing cores. See Section 2.1.
Memory hierarchy   Size and organization of memories and caches. See
                   Section 2.2.
Manufacturing      The manufacturing technology plays a key role in the power
                   consumption and die size. This topic is, however, outside
                   the scope of this work.
Interconnection    A bad interconnection choice in a CMP can effectively ruin
                   bandwidth and therefore destroy performance. Other
                   interconnects can waste cost by providing much more
                   bandwidth than is needed between two nodes.
                   In this project we explore the simplest interconnection
                   types, and also networks-on-chip (which are themselves an
                   interconnection type). See Section 2.3.
Network topology   If the CMP uses a network-on-chip, how the channels and
                   nodes in the network are distributed. Some examples:
                   mesh, ring, butterfly.
                   A good topology provides the maximum bandwidth with a
                   minimum number of links. Additionally, some topologies
                   help avoid deadlocks. See Section 2.4.1.
Routing            The routing strategy defines how links are selected when
                   trying to find a route between A and B. See Section 2.4.2.
Flow control       We have already seen circuit switching in the
                   introduction, but there are also packet switching and
                   other alternatives.
                   Generally, flow control defines how links and other
                   resources are reserved as transactions happen on the
                   network. For more details, see Section 2.4.3.
Table 1.2: Some of the most common CMP design parameters.
[Figure 1.5 plot omitted: analytical modeling, simulation and hardware prototyping placed on speed and accuracy axes.]
Figure 1.5: Different design evaluation strategies.
Computer simulation presents itself as a flexible option: it can be made
as accurate as a hardware prototype (although at that point it would be orders
of magnitude slower than the hardware prototype), or it can be abstracted
so that it runs faster. A simulator is a computer program
that behaves like the system to be evaluated: when both simulator and the
system under test are provided the same inputs, ideally, the same outputs
are produced. Therefore, the designer is able to run, e.g., a benchmark on
the simulator and from it quickly measure the characteristics of the system
in a similar way the benchmark would be used in the real system. Since
there is no lower bound on the speed of the simulated system, a designer is
able to stop the simulator at will and explore the internals of the system at
a given moment, effectively giving an insight into the design that not even a
hardware prototype might be able to give.
Unfortunately, there might be an upper bound on the complexity of the
simulated system. It is not uncommon for a full simulator to be at least mil-
lions of times slower than the real system – the price to pay for such accuracy.
Most often, the simulated system is abstracted, or a mix of simulation and
analytical modeling is used. Because we are exploring not yet constructed
CMPs, with no applications designed yet for our system, it might be hard to
create a set of inputs to feed to the simulator. Alternatively, we might not
be interested in all of the internal details of the system. Different levels of
simulation will be explored in Section 4.1.
As the entire design space is considerably large and the evaluation
process is, as we have seen, costly, the entire design space cannot be evaluated.
Alternative exploration strategies have to be devised that avoid calling
a potentially costly evaluation process for every design, such as local search or heuristics.
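A local search of the kind mentioned above can be sketched as follows. This is a minimal, hypothetical example and not the exploration method used in this work: the `evaluate` function stands in for a costly simulation run, its cost formula is invented, and only two design parameters (core count and cache size) are explored.

```python
def evaluate(design):
    """Stand-in for a costly simulation run; lower is better (invented cost)."""
    cores, cache_kb = design
    return (cores - 24) ** 2 + (cache_kb / 64 - 8) ** 2


def neighbours(design, core_options, cache_options):
    """Designs that differ from `design` by one step in one parameter."""
    cores, cache_kb = design
    i, j = core_options.index(cores), cache_options.index(cache_kb)
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < len(core_options) and 0 <= nj < len(cache_options):
            yield core_options[ni], cache_options[nj]


def local_search(start, core_options, cache_options):
    """Greedy hill climbing; returns the best design found and the
    number of distinct designs that had to be evaluated."""
    evaluated = {}

    def cost(design):
        if design not in evaluated:      # never evaluate a design twice
            evaluated[design] = evaluate(design)
        return evaluated[design]

    current = start
    while True:
        best = min(neighbours(current, core_options, cache_options), key=cost)
        if cost(best) >= cost(current):  # no neighbour improves: local optimum
            return current, len(evaluated)
        current = best


core_options = [4, 8, 16, 24, 32, 48, 64]
cache_options = [64, 128, 256, 512, 1024, 2048]
best, evaluations = local_search((4, 64), core_options, cache_options)
print(best, evaluations, "of", len(core_options) * len(cache_options))
```

The search only evaluates the designs along its path and their neighbours, so it calls the evaluator far fewer times than an exhaustive sweep, at the risk of stopping in a local optimum when the cost surface is less regular than in this toy example.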
Figure 1.6: An example of a hierarchical topology: a mesh of rings. For more details, see
Section 3.3.
1.2 Goals of this project
In this thesis we want to explore the use of hierarchical topologies,
a concept that is not new in NoC design but that has so far been used
sparingly in mainstream designs.
A hierarchical topology is basically a combination of two or more standard
NoC topologies in a tiered layout. The upper layer (also called the global
interconnect) has few links and therefore reduced bandwidth. The lower
layers (or local interconnects) are much simpler topologies that have more
bandwidth at the price of higher cost and power requirements.
The motivation for the use of hierarchical topologies is to make better use
of the locality, both in memory accesses and interprocessor communication,
of most computer programs. Because of this locality, it is expected that most
of the traffic will remain on the local interconnects.
Additionally, these days one can easily find mainstream CMPs with tens
of cores, and designs with hundreds of cores are not unheard of. Traditionally
used flat topologies, such as a mesh, scale poorly in performance as the
number of cores increases. A hierarchical topology is an excellent way to
keep the sizes of the individual topologies at manageable levels.
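One contributor to the poor scaling of a flat mesh is the growth of the average distance between nodes. The sketch below computes, by brute force, the average number of hops between two distinct nodes of a k × k mesh, assuming shortest-path routing and uniformly distributed traffic (both are assumptions of this example, as is the function name).

```python
from itertools import product


def mesh_average_hops(k: int) -> float:
    """Average Manhattan distance between two distinct nodes of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes for (bx, by) in nodes)
    ordered_pairs = len(nodes) * (len(nodes) - 1)
    return total / ordered_pairs


for k in (2, 4, 8, 16):
    print(f"{k * k:4d} cores: {mesh_average_hops(k):.2f} hops on average")
```

The average distance keeps growing with the side of the mesh, so every packet occupies more links and routers as the core count increases, unless locality keeps most traffic inside small local networks.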
Within the project, we will explore, select and evaluate some promising
hierarchical topologies. For the evaluation, we will construct a CMP simula-
tor. However, and as we will have no real applications to run on computer
systems with hundreds of cores, we will use approximated stochastic task
models as inputs to this simulator. Lacking the memory behavior of real
applications, this will also force the use of an approximate model for the
memory subsystem.
Furthermore, the simulation of the interconnection network will be accurate
up to the clock cycle level. To better offset the loss of accuracy that
comes from using a stochastic model for the inputs of the simulator, we
will investigate the use of processor models based on Markov chains. Such
models will mimic slightly more accurately the behavior of both in-order and
out-of-order processors.
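To illustrate what a Markov-chain traffic source can look like, the sketch below models a core as a two-state chain: it either computes or waits for a memory reply. This is a hypothetical toy model, not the processor model developed in this work (see Section 4.3.1); the state names, the parameters `p_request` and `p_reply`, and the function names are all assumptions.

```python
import random


def stationary_compute_fraction(p_request: float, p_reply: float) -> float:
    """Long-run fraction of cycles spent computing for the two-state chain."""
    return p_reply / (p_request + p_reply)


def simulate(p_request: float, p_reply: float, cycles: int, seed: int = 0):
    """Run the chain for `cycles` steps.

    Returns (compute_cycles, requests_issued)."""
    rng = random.Random(seed)
    computing = True
    compute_cycles = requests = 0
    for _ in range(cycles):
        if computing:
            compute_cycles += 1
            if rng.random() < p_request:   # a memory request enters the network
                requests += 1
                computing = False
        elif rng.random() < p_reply:       # the reply arrives, the core resumes
            computing = True
    return compute_cycles, requests


compute_cycles, requests = simulate(p_request=0.1, p_reply=0.4, cycles=200_000)
print(compute_cycles / 200_000, stationary_compute_fraction(0.1, 0.4))
```

Changing `p_request` changes the memory traffic injected into the network without needing a trace or a real application, which is the kind of quick workload reconfiguration that probabilistic sources allow.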
1.3 Related work
The exploration of the CMP design space is a widely researched area [5, 8],
with an emphasis on the use of simulation tools: BookSim [9], GEMS/Garnet
[10, 11], Noxim [12], etc.
However, the simulator that we have built in this project is different in
that:
• It allows arbitrary hierarchical nesting of topologies, up to any depth.
• It does not use traces; rather, it uses probabilistic methods that enable
a quick reconfiguration of the simulated workloads (see Section 4.3.1).
Hierarchical network topologies have already been introduced in [13] and
[14]. In [14], specifically, the mesh of buses topology is studied using a custom
simulator. Apart from evaluating the same topology (in Section 3.3), this
work also evaluates mesh of rings and other topologies. Since they have
buses only at the local interconnect layers, a simple bus interface is enough
to handle the transition of packets between the local bus and the global mesh;
in this work we assume that there will be a much more generic network interface
that is able to handle such transitions when having arbitrary topologies in
each level.
Chapter 2
Chip multiprocessors
A CMP is just the integration of two or more nearly independent computer
processors into a single die.
Figure 2.1: On the left, a die photo of a desktop chip multiprocessor, Intel’s Sandy Bridge,
with 4 homogeneous cores and a graphics processor in the same die. On the right, Texas
Instruments’ OMAP3530, a mobile processor, with a general-purpose core, DSP and graphics
on the same die.
As discussed in Chapter 1, the motivation for building a CMP is to keep
on using the increased transistor budget available on a single die (the Sandy
Bridge die in Figure 2.1 has 1000 million transistors) for expanding
performance, now that raising the frequency is no longer an option.
Traditionally, this new budget has been utilized for improving the perfor-
mance of a single program flow. More specifically, by:
• Increasing cache size. We will show in Section 2.2 that there has been
an increasing gap between the performance of cores and that of the memory system.
Increasing the capacity of the on-chip caches will raise the hit ratio, and
therefore reduce the number of accesses to off-chip memory.
However, this comes at a great cost: larger caches usually have longer
latencies, and consume more power. On some designs, caches already
consume more than 50% of the entire power budget [15].
• Exploiting instruction level parallelism (ILP) more efficiently by adding
functional units (superscalar cores), partitioning long-latency
instructions into shorter stages so that more can be run at the
same time (pipelining), rearranging the instruction order to minimize
inter-instruction dependencies (out-of-order execution), or forcing the
programmer to make ILP explicit (very long instruction word (VLIW)
architectures) [16].
The actual amount of available ILP in most programs is limited [3].
Additionally, the infrastructure required to exploit ILP does not come
cheap [4]: each functional unit that can handle register read or write
operations will require an additional register file port; reading more
than one instruction or a very long instruction requires support for
wide instruction memory fetches; etc. These additions all translate to
increased complexity, die area and power usage.
The original idea of a CMP was to use very simple cores that did not
try to exploit ILP, so that the problems presented above were avoided [4].
With this design, increased performance comes from exploiting thread level
parallelism (TLP) (instead of ILP), where a program might be decomposed
into more than a single instruction flow, or more than one program run at the
same time. For example, assume a very large matrix multiplication program
(A × B = C): instead of a single program looping over each cell of C and
computing the result by looking at the corresponding row of A and
column of B, C can be partitioned in blocks, and new programs can be
written so that each computes only its own part of C. Then, each program
is an independent¹ instruction flow.
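The matrix multiplication decomposition can be sketched as follows. This is a minimal illustration with hypothetical function names; it partitions C into horizontal blocks of rows and hands each block to a separate task. Python threads are used only to show the decomposition, since in the standard interpreter they do not run such arithmetic truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a, b, row_range):
    """Compute the rows of C = A x B whose indices are in `row_range`."""
    cols = range(len(b[0]))
    inner = range(len(b))
    return [[sum(a[i][k] * b[k][j] for k in inner) for j in cols]
            for i in row_range]


def parallel_matmul(a, b, workers=2):
    """Split C into horizontal blocks and compute each block as its own task."""
    n = len(a)
    step = -(-n // workers)  # ceiling division: rows per block
    blocks = [range(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda rows: multiply_rows(a, b, rows), blocks))
    # Blocks come back in order, so concatenating them rebuilds C.
    return [row for part in parts for row in part]


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
print(parallel_matmul(a, b))
```

Each task reads all of B and its own rows of A but writes only its own block of C, so the tasks need no synchronization beyond sharing the read-only inputs.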
The disadvantage of TLP is that it is generally hard to split a single
program into several threads, although there are parallelized versions of some
popular algorithms [17]. The impact of this problem is so large that one
of the first commercially released general-purpose computing CMPs, the
IBM Power4 [18], used two fully out-of-order cores that were more
complex than the cores used in previous designs. Most of today’s programs are
single-threaded, and using simpler cores for the Power4 would have meant
that those programs would have run slower.
However, some algorithms are easily converted into TLP versions, or scale
virtually linearly in performance along with the number of cores (usually
called embarrassingly parallel). Additionally, a processor that exploits TLP
can usually run more than one program, instead of a single program with
multiple threads, and running more than one program is clearly a common
use case in both user and datacenter environments. Therefore, current
designs try to exploit both TLP and ILP, by either adding a reduced number
of more advanced cores, or mixing advanced with simpler cores. It is the
job of the CMP designer to find the appropriate balance between both
approaches. For example, the successor to the IBM Power4, the Power5 [19],
already started using simultaneous multi-threading (SMT), an alternative
methodology that also tries to exploit TLP.
¹ The programs are not truly independent in that they share memory. See Section 2.2.2.
In this chapter the main components of a CMP will be presented, to
understand some of the important design decisions in CMPs.
2.1 Processing cores
The cores are the components that execute the instruction streams. We
define the speed at which a processor executes instructions as the throughput.
This speed is the main performance metric of a CMP. Given the length of a
program (in instructions), we can obtain the time it will take to execute in
a given core by the following simple equation:
\[
\text{Program execution time} = \frac{\text{Number of instructions}}{\text{Throughput}} \times \text{Clock cycle period}
\]
In the multiprocessor world, the throughput of the entire CMP is usually
defined as the sum of the throughputs of all individual cores.
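The execution time relation above and the CMP throughput definition can be written down directly. The sketch below assumes that throughput is expressed in instructions per cycle and that the clock is given as a frequency, so the clock cycle period is its inverse; the function names and the numbers are illustrative.

```python
def execution_time(instructions: int, throughput_ipc: float,
                   clock_hz: float) -> float:
    """Seconds needed to run `instructions` at `throughput_ipc`
    instructions per cycle on a core clocked at `clock_hz`."""
    cycles = instructions / throughput_ipc
    return cycles / clock_hz  # the clock cycle period is 1 / clock_hz


def cmp_throughput(core_throughputs) -> float:
    """Throughput of the whole CMP as the sum of the per-core throughputs."""
    return sum(core_throughputs)


# One billion instructions at 0.8 instructions per cycle on a 2 GHz core.
print(execution_time(10**9, 0.8, 2e9))
# Four cores with different individual throughputs.
print(cmp_throughput([0.5, 0.5, 1.0, 2.0]))
```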
Cores have a design-time number of instructions per cycle (IPC). Replicating
functional units in a core is one of the many available techniques that
contribute to increasing the IPC. We will see the most popular methods in
Section 2.1.1, but others exist in the literature [20].
However, throughput is always less than the ideal IPC, and this is because
processors can in certain conditions stall. Stalls can be prompted by several
causes:
• A core might have special instructions with different IPC. For example,
most general-purpose cores have both integer arithmetic and floating
point arithmetic instructions, with floating point instructions being
more than twice as slow as integer instructions.
• An instruction might need to access resources outside of the core (memory,
input/output), and this resource might either be slower than the
core, or be already in use by another core.
• Data dependencies between instructions reduce the ILP of a program
and therefore diminish the ability to exploit parallelism.
• The core might be stalled intentionally (to temporarily slow down
processing and reduce power, etc.).
[Figure 2.2 diagram omitted: an unpipelined core that fetches, reads registers, executes and writes results in one long clock cycle, compared with a pipeline of fetch (F), read (R), execute (X) and write (W) stages that completes one instruction per short clock cycle.]
Figure 2.2: A 4-stage pipeline example
Good core designs try to minimize the causes of those stalls, hide them
or increase the base IPC, at the cost of complexity, area and power. In
the CMP world, we do not have to assume that only copies of the same core
design will be used. For example, the OMAP3530 die (Figure 2.1) contains
both a general-purpose core and an application-specific digital signal
processor (DSP). Another design option might be to mix cores that are all
general-purpose but have different performance levels or use different
methods from those in Section 2.1.1: a few out-of-order processors occupying a
large area of the chip for tasks with low TLP and high ILP, with an
additional set of many smaller in-order processors serving the rest of the tasks. Those
combinations are known as heterogeneous CMPs, and they add many degrees
of freedom to explore during the construction of a CMP.
2.1.1 Single-core performance improvements
Pipelining
In a traditional core design, every cycle a single instruction is executed,
including fetching the instruction from the instruction memory, reading the
operands from the register bank, plus calculating and writing the results back
to the register bank. Therefore, a single clock cycle period must be large
enough to at least accommodate sufficient time for all of those operations.
Hazard             Description                              Solutions
Structural         Two instructions in different stages     Stalling, replicating
                   try to access the same component         components
                   (e.g. register bank).
Data dependencies  An instruction in the read stage reads   Stalling, short-circuits,
                   an operand before a previous             out-of-order execution
                   instruction in the normal program
                   order has reached the write stage
                   (or vice versa).
Control            Instructions that in the program order   Stalling, branch
                   are positioned after a branch            prediction
                   instruction execute before the branch
                   instruction itself.
Table 2.1: Types of hazards in pipelined processors
Pipelining [21] a design divides it into several stages, separated by
registers capable of storing all the intermediate information that one stage
passes to the next. Once this is done, the clock
period only needs to be long enough to hold the slowest of
those stages, rather than all of them. See Figure 2.2 for an example.
The larger the number of stages, the shorter each of them is, and therefore the
more the clock frequency can be increased.
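The effect of pipelining on the clock period can be sketched numerically. In this minimal example the stage delays and the pipeline register overhead are hypothetical values chosen for illustration; the unpipelined period is the sum of all stage delays, while the pipelined period is set by the slowest stage plus the register overhead.

```python
def clock_periods(stage_delays_ns, register_overhead_ns=0.0):
    """Return (unpipelined, pipelined) clock periods for the given stage delays."""
    unpipelined = sum(stage_delays_ns)
    pipelined = max(stage_delays_ns) + register_overhead_ns
    return unpipelined, pipelined


# Hypothetical delays for the fetch, read, execute and write stages, in ns.
stages = [2.0, 1.0, 3.0, 1.0]
unpipelined, pipelined = clock_periods(stages, register_overhead_ns=0.5)
print(f"unpipelined: {unpipelined} ns, pipelined: {pipelined} ns, "
      f"frequency gain: {unpipelined / pipelined:.1f}x")
```

With four stages the frequency gain here is only 2x rather than 4x, because the stages are unbalanced and the registers add their own delay.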
To keep executing one instruction per cycle, each stage of the pipeline has
to be able to accept a new instruction every clock cycle; as many instructions
as there are stages might be running at any given moment on different stages of the
pipeline. Therefore additional logic is required to account for all the potential
hazards that might happen in such situations, as seen in Table 2.1.
When taken to the extreme, this technique is known as superpipelining
[16]. As an example, Intel’s NetBurst architecture [22], used in the Pentium
4 processor series, had more than 20 stages. Intel predicted this design to be
able to scale up to frequencies as high as 10 GHz, although such designs never
materialized because, as mentioned in Chapter 1, the power figures started
becoming unreasonable.
Therefore, pipelining is a simple way to raise the maximum
operating frequency of an existing core design, but it does not hide the power
problems associated with increasing the frequency.
Figure 2.3: A superscalar core example
Superscalar cores
We showed in the introduction of Chapter 2 that the average instruction
stream has some innate parallelism, known as instruction level parallelism
(ILP). For example, assume the following program that tries to compute
(r1 + r2)(r1 + r3):
r1 + r2 → r4
r1 + r3 → r5
r4 × r5 → r6
Listing 2.1: An example program with some ILP
With two integer arithmetic units, a core would be able to run the first
two additions in parallel. Notice however that even in the presence of three
integer arithmetic units, it would be incorrect to run the product in parallel
with the sums: the product uses the results of the additions as operands.
A superscalar architecture therefore exploits the available ILP by repli-
cating functional units. An example of a superscalar architecture where the
entire pipeline has been replicated is shown on Figure 2.3 – each pipeline
stage can handle two instructions simultaneously. Obviously, a superscalar
architecture requires the ability to fetch more than one instruction at the
same time, and will be limited by the available program ILP.
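The ILP available in Listing 2.1 can be computed mechanically from its data dependencies. The sketch below is a simplified illustration that assumes unit latency, unlimited functional units and only true (read-after-write) dependencies; the program encoding as (destination, sources) pairs and the function name are choices made for this example.

```python
def earliest_cycles(program):
    """Earliest issue cycle of each instruction, considering only true
    (read-after-write) dependencies, unit latency and unlimited units."""
    ready = {}    # register -> cycle in which its latest value is available
    cycles = []
    for dest, sources in program:
        cycle = max((ready.get(src, 0) for src in sources), default=0)
        cycles.append(cycle)
        ready[dest] = cycle + 1
    return cycles


listing_2_1 = [
    ("r4", ("r1", "r2")),   # r1 + r2 -> r4
    ("r5", ("r1", "r3")),   # r1 + r3 -> r5
    ("r6", ("r4", "r5")),   # r4 x r5 -> r6
]
cycles = earliest_cycles(listing_2_1)
ilp = len(listing_2_1) / (max(cycles) + 1)
print(cycles, ilp)
```

Both additions can issue in the first cycle and the product must wait for the second, so three instructions take two cycles and a third arithmetic unit would not help.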
Out-of-order execution
For the example program in Listing 2.1, any superscalar processor capable
of executing two arithmetic instructions at the same time would be able
to execute both additions in parallel. However, now assume the following
extension to the program in Listing 2.1:
r1 + r2 → r4
r1 + r3 → r5
r4 × r5 → r6
r2 + r2 → r7
r2 + r3 → r8
r7 × r8 → r9
Listing 2.2: Another sample program with additional ILP, only exploitable by an out-of-
order core
For this program, a superscalar core able to run 4 arithmetic instructions
in parallel should, as in the previous section, not run the first product
instruction alongside the first two additions. What about the third and fourth
additions? They could in theory be run alongside the first two additions: they
have no data dependencies on the results of those additions or of the product.
Yet running them alongside the first additions would mean executing them
before the product instruction, which would violate the program order,
even though the computations would still produce the same values. A core that
does this is called out-of-order, and such cores are able to exploit ILP much
more effectively than in-order cores.
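The difference between in-order and out-of-order issue on Listing 2.2 can be illustrated with a small scheduling sketch. It is a simplified model under stated assumptions (single-cycle instructions, a fixed issue width, only read-after-write dependencies, and in-order issue meaning that an instruction never issues before its predecessor); the function schedule and its parameters are hypothetical and do not describe any real core.

```python
# Program from Listing 2.2 as (destination, sources) pairs.
PROGRAM = [
    ("r4", ("r1", "r2")),
    ("r5", ("r1", "r3")),
    ("r6", ("r4", "r5")),
    ("r7", ("r2", "r2")),
    ("r8", ("r2", "r3")),
    ("r9", ("r7", "r8")),
]


def schedule(program, width, in_order):
    """Return the issue cycle of every instruction.

    width    -- maximum number of instructions issued per cycle
    in_order -- if True, an instruction may not issue before the
                instruction that precedes it in program order
    """
    ready = {}     # register -> first cycle its value is available
    used = {}      # cycle -> number of instructions already issued in it
    issue = []
    previous = 0   # issue cycle of the preceding instruction
    for dest, sources in program:
        cycle = max((ready.get(s, 0) for s in sources), default=0)
        if in_order:
            cycle = max(cycle, previous)
        while used.get(cycle, 0) >= width:   # no free issue slot: wait
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready[dest] = cycle + 1              # single-cycle latency assumed
        issue.append(cycle)
        previous = cycle
    return issue


if __name__ == "__main__":
    print("in-order:    ", schedule(PROGRAM, width=4, in_order=True))
    print("out-of-order:", schedule(PROGRAM, width=4, in_order=False))
```

In this model the in-order core needs three cycles, because the third and fourth additions wait behind the first product, whereas the out-of-order core finishes in two.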
Several techniques are widely known to allow for even more thorough
exploitation of available ILP. With register renaming (the Tomasulo algo-
rithm [23]), parallelism can be exploited even when the original instruction
flow causes fake dependencies between registers by reusing them for otherwise
unrelated computations, such as the one in Listing 2.3.
r1 + r2 → r4
r1 + r3 → r1
Listing 2.3: Sample program with ILP that can only be exploited with register renaming
or similar technique (otherwise second addition has to wait for completion of first, due to
use of r1 as output register)
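The effect of renaming on Listing 2.3 can be sketched as follows. This is a deliberately simplified renaming pass and not the full Tomasulo algorithm: every destination simply receives a fresh physical register, and sources are looked up in the current mapping. The function rename and the physical register names p0, p1, ... are hypothetical.

```python
# Program from Listing 2.3 as (destination, sources) pairs.
PROGRAM = [
    ("r4", ("r1", "r2")),  # r1 + r2 -> r4
    ("r1", ("r1", "r3")),  # r1 + r3 -> r1 (reuses r1 as output)
]


def rename(program, architectural):
    """Give every result a fresh physical register.

    architectural -- names of the registers visible to the program;
                     initially each one maps to itself.
    """
    mapping = {reg: reg for reg in architectural}
    renamed = []
    for index, (dest, sources) in enumerate(program):
        new_sources = tuple(mapping[s] for s in sources)  # read current names
        physical = f"p{index}"                            # fresh destination
        mapping[dest] = physical
        renamed.append((physical, new_sources))
    return renamed


if __name__ == "__main__":
    for dest, sources in rename(PROGRAM, ["r1", "r2", "r3", "r4"]):
        print(sources, "->", dest)
```

After renaming, the second addition writes p1 instead of r1, so it no longer conflicts with the first addition reading r1, and both can run in parallel.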
Very long instruction word
VLIW is another interesting technique that exploits ILP [24]. Instead of the
processor trying to find groups of instructions that can be run in parallel, it
is the work of the programmer to provide such groups. For that, a single in-
struction in a VLIW core might be actually composed of several instructions,
which the programmer knows have no data dependencies between them.
Therefore, VLIW does not require as much control complexity as
out-of-order execution, but shifts the work of discovering parallelism to the
programmer (or the compiler, if using a high-level language). As a
consequence, the compiler and the core architecture become more tightly coupled; the
compiler needs to know, for example, the number of functional units the core has.
VLIW is currently popular in embedded architectures. For example, the
DSP on the OMAP3530 that was shown in Figure 2.1 is a VLIW architecture,
with the ability to fetch 8 instructions in a single cycle [25].
Simultaneous multi-threading
Unlike all of the other strategies seen so far, SMT [26] exploits TLP instead
of ILP. Like a superscalar core, an SMT core contains replicated functional
units. Like a VLIW core, it is the job of the programmer to make the available
parallelism explicit. But in SMT the programmer does that by dividing the
single instruction flow into several (threads).
Each of these threads shares the same program and data memory,
but each has an independent program counter and registers. Therefore, if
one thread is stalled because of a pending memory access, or because of
lack of ILP in that thread, the processor continues using its pipeline with
instructions from another thread. Only when none of the threads is ready
to run will the SMT core stall. Thus, while we are not actually removing
any of the causes of stalls, we are effectively hiding them, allowing other
program flows to run on a core that would otherwise be idling.
The main disadvantage of SMT is that the programmer might have to
entirely rewrite the algorithm in order to create a version using threads and
TLP. But, as we argued in Chapter 1, this problem is also faced by the CMP
programmer. It is probably for this reason that combinations of SMT and
CMP [27] are becoming increasingly popular.
2.2 The memory hierarchy
Memory is the place where both program and data are stored. This means
that there is at least one memory access for each instruction (to fetch it), in
addition to the data memory accesses the instruction itself performs; memory
speed is therefore critical to the overall throughput of a computer processor.
Unfortunately, while the speed of cores has increased at a nearly
exponential rate over the years, the speed of memories has increased
at nowhere near the same rate, making the memory subsystem a common
bottleneck (see Figure 2.4). This is known as the memory wall problem [29].
It is not purely a speed problem. It is possible to make very fast memory
elements, such as the ones used in the register bank of a high-speed
core; however, it would be unreasonably expensive to provide very high-speed
large-capacity memories: to keep costs down, there has to be a balance
between capacity and speed. The exact ratios have varied over the years, as
new memory technologies appeared; they are also wildly different depending on
whether we are talking about volatile or non-volatile (permanent) storage.
Figure 2.4: The growing gap in the performance of processors and DRAM (assuming 50%
yearly growth for processors and 7% for DRAM [28]). Logarithmic vertical axis from 1 to 10000; horizontal axis from 1980 to 2000; series: processor performance, DRAM performance.
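The size of the gap plotted in Figure 2.4 follows directly from compounding the two growth rates stated in its caption. The sketch below only redoes that arithmetic; the function names are hypothetical and the 1980 baseline of 1 for both curves is an assumption taken from the plot axes.

```python
def relative_performance(years, yearly_growth):
    """Performance relative to year 0 after compounding yearly_growth."""
    return (1.0 + yearly_growth) ** years


def processor_dram_gap(years, cpu_growth=0.50, dram_growth=0.07):
    """Ratio between processor and DRAM performance after `years` years,
    using the growth rates assumed in Figure 2.4 (50% and 7% per year)."""
    return (relative_performance(years, cpu_growth)
            / relative_performance(years, dram_growth))


if __name__ == "__main__":
    for year in range(1980, 2001, 5):
        gap = processor_dram_gap(year - 1980)
        print(f"{year}: processors are {gap:9.1f}x ahead of DRAM")
```

With these rates the ratio grows by a factor of about 1.4 every year, which is why the two curves separate so quickly on a logarithmic axis.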
Fortunately, the average computer program presents very useful proper-
ties traditionally aggregated under the banner of the principle of locality :
• Temporal locality refers to the high probability that a given zone of
the memory is accessed again if it was accessed in the recent past.
• Spatial locality is the observation that if at a certain moment a point in
memory is accessed, it is very probable that the vicinity of that point
will be accessed in the near future.
For this reason, and to keep a reasonable speed-capacity ratio, it makes
sense to build a hierarchical memory system, composed of several layers of
different memory technologies: from the fastest, with the lowest capacities,
to the slowest, with the largest capacities. Most modern computers follow this
pattern; see an example in Table 2.2.
2.2.1 Cache memory
The fastest layer in the hierarchy is always the set of in-core registers, used
directly as the operands of the instructions being executed and as the storage
for their results. Usually the program itself explicitly moves data between those
registers and the main memory with load or store instructions.
Level                          Capacity          Access time
Core registers                 100 bytes         1 cycle
L1 cache                       Tens of KBs       2-4 cycles
L2 cache                       MBs               10 cycles
L3 cache                       Tens of MBs       Tens of cycles
Main memory                    GBs               Hundreds of cycles
Solid-state permanent storage  Thousands of GBs  Thousands of cycles
Mechanical permanent storage   TBs               Millions of cycles
Table 2.2: Approximate size and time scales for a recent computer memory hierarchy
Figure 2.5: Private vs shared caches (labels: Core, Private cache, Shared cache, (Globally) shared cache, Main memory; arrows: Speed, Capacity)
Between main memory and the registers, there are one or more
cache memories [30], which might have different capacities and use different
technologies. Each cache memory transparently mirrors a noncontiguous
portion of main memory, by automatically loading and maintaining up-to-date
copies of the most accessed zones of main memory. Therefore,
because most programs repeatedly access the same positions of memory (the
principle of locality), the memory wall problem is effectively alleviated
(though the bandwidth between the core and the cache itself has to be taken
into consideration).
Their physical location (inside the core, on the same die, outside it...) is
one CMP design decision. Caches on the same die as the core usually have
to use a manufacturing technology similar to that of the core itself, which might
make enlarging them prohibitively expensive. On the other hand, the latency
required to access them will be very low. Having one cache for all cores of the
CMP, individual caches per core, or a mixture of options is another important
design decision (see Figure 2.5 for an example).
Figure 2.6: Example of miss ratio vs cache size approximated using power law curve (vertical axis: miss ratio, 0% to 100%; horizontal axis: cache size, 0 KiB to 70 KiB)
The design goals of any cache memory system are [30]:
• Minimize the miss ratio: the probability that the target of a memory
access from a core is not in the cache.
• Minimize the access time.
• Minimize the miss delay when a memory access cannot be served from
the cache.
• Minimize the overheads caused by having to maintain consistency and
coherency (see Section 2.2.2).
The miss ratio, because of the principle of locality, can be approximated
by a power law curve [31], defined by the following formula:
$$M = M_0 C^{-\alpha}$$
where $C$ is the cache size, and $M_0$ and $\alpha$ are constants related to the
workload ($0 < \alpha < 1$, usually 0.5).
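A quick numerical reading of the power law may help: with α = 0.5, quadrupling the cache size halves the miss ratio. The sketch below evaluates the formula directly; the value chosen for M0 is a made-up placeholder, since the real constant depends on the workload.

```python
def miss_ratio(cache_size, m0, alpha=0.5):
    """Power law approximation M = M0 * C^(-alpha) of the miss ratio.

    cache_size -- cache size C (any unit, as long as m0 is fitted to it)
    m0, alpha  -- workload-dependent constants (0 < alpha < 1)
    """
    return m0 * cache_size ** (-alpha)


if __name__ == "__main__":
    M0 = 0.9  # hypothetical constant, chosen only for illustration
    for size_kib in (1, 4, 16, 64):
        print(f"{size_kib:3d} KiB -> miss ratio {miss_ratio(size_kib, M0):.3f}")
```

The diminishing returns are visible in the printed values: each further halving of the miss ratio costs four times more cache.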
There are many policies to select which regions of memory are
mirrored into the cache [30], as well as two different policies that determine
how writes are propagated to main memory so that it stays in sync with the
cached copy:
• Write through: all writes to memory update the cached copy first, then
immediately go to main memory.
• Write back: all writes to memory update only the cached copy; main
memory is updated only when the cached copy is discarded from the
cache (because a more frequently used region of memory is being brought into it
or because the consistency/coherency requirements demand it).
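The two write policies can be contrasted with a toy cache model. This is a minimal sketch under simplifying assumptions: one value per address, a tiny capacity, and first-in-first-out replacement (the text does not prescribe a replacement policy). The classes Memory and Cache are hypothetical and unrelated to the simulator described later.

```python
class Memory:
    """Main memory that counts how many writes reach it."""

    def __init__(self):
        self.data = {}
        self.writes = 0

    def write(self, addr, value):
        self.data[addr] = value
        self.writes += 1

    def read(self, addr):
        return self.data.get(addr, 0)


class Cache:
    def __init__(self, memory, capacity, write_back):
        self.memory = memory
        self.capacity = capacity
        self.write_back = write_back
        self.lines = {}  # addr -> (value, dirty); dicts keep insertion order

    def _make_room(self):
        if len(self.lines) >= self.capacity:
            addr = next(iter(self.lines))        # oldest line (FIFO, assumed)
            value, dirty = self.lines.pop(addr)
            if dirty:                            # write back only when discarded
                self.memory.write(addr, value)

    def write(self, addr, value):
        if addr not in self.lines:
            self._make_room()
        if self.write_back:
            self.lines[addr] = (value, True)     # memory is now stale
        else:
            self.lines[addr] = (value, False)
            self.memory.write(addr, value)       # write through immediately

    def read(self, addr):
        if addr not in self.lines:
            self._make_room()
            self.lines[addr] = (self.memory.read(addr), False)
        return self.lines[addr][0]


def memory_writes_after_repeated_store(write_back, repeats=10):
    """Store to the same address `repeats` times; count memory writes."""
    memory = Memory()
    cache = Cache(memory, capacity=2, write_back=write_back)
    for i in range(repeats):
        cache.write(0, i)
    return memory.writes


if __name__ == "__main__":
    print("write through:", memory_writes_after_repeated_store(False))
    print("write back:   ", memory_writes_after_repeated_store(True))
```

Repeated stores to one address generate one memory write each under write through, and none under write back until the line is evicted.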
2.2.2 Consistency and coherency
An important aspect to consider about the memory hierarchy in a CMP
scenario is how the cache memories are distributed. If there is more than
one cache in the system, such as in Figure 2.5, then it becomes important to
ensure that the resulting system is both consistent and coherent.
• A CMP is sequentially consistent [32] when the order in which mem-
ory operations from a processor appear to other processors respects
program order. An example of non-consistency can be seen in Fig-
ure 2.7. This could happen if for example a core does not wait for the
data written to main memory to be actually written before continuing
execution.
• In a coherent CMP system, when a core writes data to a memory location,
every other core that reads from that memory location retrieves
that data. Non-coherent behavior can be seen in Figure 2.8, which
could happen if a cache in write-back mode does not write an updated
value back to main memory before another core decides to retrieve the
same value from main memory (because it was missing from its cache).
To ensure that both constraints are maintained, CMP systems tradition-
ally use a cache coherence protocol. There are three main categories:
Initial conditions: a = false, b = false.
Core 1:
    a = true;
    b = true;
Core 2:
    until (a or b)
        wait();
    print a, b
Consistent result: Core 2 never prints “a is false, b is true”.
Figure 2.7: Example of a program where lack of consistency can be seen. In a non-
consistent CMP, the b assignment might be seen before the a assignment (because e.g. it
propagated faster over the network), printing “false, true” instead of “true, false” or “true,
true”.
Initial conditions: a = 0, b = false.
Core 1:
    a = 4;
    b = true;
Core 2:
    a = 1;
    until (b)
        wait();
    print a
Coherent result: Core 2 prints “a is 4”.
Figure 2.8: Example of a program where lack of coherence can be seen. In a non-coherent
CMP, Core 2 misses the a = 4 assignment because it decided to read it from its private
cache, which did not see the write made by Core 1.
• Protocols based on snooping assume that every processor can see every
transaction from every other processor in the entire CMP. These are
mostly used on interconnects where broadcasts are the only or the most
common kind of communication, such as a bus (see Section 2.3.1); their
usefulness is reduced in other topologies.
• Directory-based protocols have a central memory element which maintains
the list of which memory regions are mirrored and by which cache
elements. These have the problem that the directory has to be notified
of every change in each and every cache it monitors, potentially consuming
large amounts of bandwidth; in addition, this central memory consumes
area and power, and can itself become the bottleneck.
• Token-based protocols, a recent innovation [33], try to capture the
best features of both directory-based protocols and protocols based on
snooping.
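The stale read described in the caption of Figure 2.8, and the way a snooping scheme avoids it, can be reproduced with a toy model. The sketch below assumes a very simple write-invalidate scheme in which every write is visible to all peer caches; it is not any specific published protocol, and the class PrivateCache and the function run are hypothetical.

```python
class PrivateCache:
    """Private cache in front of a shared main memory (a plain dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}   # private copies: address -> value
        self.peers = []   # caches that snoop our writes (empty: no coherence)

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value            # write through, to keep this short
        for peer in self.peers:              # write-invalidate: drop stale copies
            peer.lines.pop(addr, None)


def run(coherent):
    """Core 2 caches a, Core 1 writes a = 4, Core 2 reads a again."""
    memory = {"a": 0}
    core1, core2 = PrivateCache(memory), PrivateCache(memory)
    if coherent:
        core1.peers.append(core2)
        core2.peers.append(core1)
    core2.read("a")        # Core 2 now holds a private copy of a = 0
    core1.write("a", 4)    # Core 1 updates a
    return core2.read("a")


if __name__ == "__main__":
    print("without coherence, Core 2 reads", run(coherent=False))
    print("with invalidation, Core 2 reads", run(coherent=True))
```

Without invalidations Core 2 keeps hitting its stale private copy; with them, the second read misses and fetches the new value from main memory.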
2.3 On-chip interconnects
On-chip interconnects connect the cores, the caches, and the on-chip
controllers of external peripherals (such as the main memory). In this section
we will see two basic interconnection types, then proceed to Networks-on-Chip
in Section 2.4.
Figure 2.9: A bus (left) and a multibus with 2 buses (right) between 4 elements
2.3.1 Buses
A bus [34] is the simplest of all interconnections. A single link runs between
all components of the CMP, as seen in Figure 2.9. This single link is shared
between all the components of the CMP: while two components are carrying out a
transmission, every other component must refrain from using the bus, or
risk interfering with the in-progress transmission. Therefore, buses
usually have arbiters that ensure that there is only one owner of the bus at
any moment, called the bus master. Alternatively, carrier sense/collision
detection schemes are used, where a component checks if the bus is currently
in use right before transmitting on it, retrying after a random amount
of time if it is indeed occupied.
The width of the link (literally, the number of wires) determines the num-
ber of bits that can traverse the bus simultaneously, and therefore, determines
the bandwidth of the bus.
Buses are very simple to implement and, for a low number of components,
extremely cheap. Additionally, they have the interesting property that every
transmission can be read by every component on the bus, which makes them
attractive for certain use cases (e.g. cache coherence, see Section 2.2.2). The
disadvantages of buses, though, become especially noticeable as the number
of components increases:
• Long link sizes: a single link has to reach each and every component
in the CMP. Such a long link might require too much power to be
reasonable, require lots of repeaters to maintain an adequate signal level
for all segments of the bus, or be outright impossible to produce in a
VLSI setting.
• Bandwidth is shared between all components in the network, and limited
by the clock frequency and link width.
• Long access times: messages written to a bus might take a significant
time to propagate over the entire link. For adequate operation, no other
message can be written to the bus until ample time has been given for
such propagation.
Figure 2.10: A crossbar between 4 elements. Note how a direct, private link exists between
every pair of the 4 components.
An extension to the bus concept, the multibus, is composed of multiple
parallel buses that can operate simultaneously and independently. Such an
extension obviously has increased bandwidth, but also increases the cost
linearly as the number of buses grows.
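As a back-of-the-envelope illustration of how the shared bandwidth of a bus or multibus divides among components, consider the sketch below. It assumes one transfer per bus per clock cycle and perfectly fair sharing, both of which are idealizations; the function names and the example figures (64 wires, 1 GHz, 16 components) are made up.

```python
def bus_bandwidth(width_bits, clock_hz, buses=1):
    """Peak bandwidth in bits per second of `buses` parallel buses,
    assuming each bus moves `width_bits` bits on every clock cycle."""
    return width_bits * clock_hz * buses


def share_per_component(width_bits, clock_hz, components, buses=1):
    """Bandwidth available to each component under ideal fair sharing."""
    return bus_bandwidth(width_bits, clock_hz, buses) / components


if __name__ == "__main__":
    for buses in (1, 4, 8):
        share = share_per_component(64, 1_000_000_000, 16, buses)
        print(f"{buses}-bus, 16 components: {share / 1e9:.0f} Gbit/s each")
```

The per-component share shrinks as components are added and grows only linearly with the number of buses, matching the linear cost noted above.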
2.3.2 Crossbars
In a crossbar, there exists a link between each pair of components (a
peer-to-peer link). Since these links are independent, a component that wants
to initiate a transaction can do so at any time as long as it does not
already have an active transaction with that same component. The link is
also private; thus, each pair of components has the maximum available link
bandwidth between them.
A crossbar might be seen as the complete opposite of a bus: instead of a
single shared link, a crossbar has no shared links at all. The main
disadvantage of crossbars is that, while they scale extremely well in terms of
bandwidth, the number of links required increases quadratically with the
number of components. Therefore, they become very expensive. It is for this
reason that, while crossbars have been used in commercial CMP designs, it is
hard to see designs with more than 32 components. However, attempts have been
made to construct crossbars of up to 128 components [35].
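The quadratic growth is easy to quantify: with one private link per pair of components, n components need n(n-1)/2 links. The sketch below counts undirected links only, which is an assumption (a design with separate wires per direction would double the numbers).

```python
def crossbar_links(components):
    """Number of peer-to-peer links when every pair of components
    has its own private link (undirected links assumed)."""
    return components * (components - 1) // 2


if __name__ == "__main__":
    for n in (4, 8, 32, 128):
        print(f"{n:3d} components -> {crossbar_links(n):5d} links")
```

Going from 32 to 128 components multiplies the number of components by 4 but the number of links by roughly 16.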
Figure 2.11: Meshes scale much better as the number of cores increases (vertical axis: combined throughput of all cores, 0 to 18; horizontal axis: number of cores, 0 to 70; series: Bus, 4-Bus, 8-Bus, Mesh)
2.4 Network-On-Chip systems
NoC systems have recently emerged as an alternative interconnection for
chip multiprocessors. We have seen that the traditionally used interconnects
have several problems as the number of components in the CMP grows.
Buses in particular, once the pinnacle of computer interconnections, have seen
their advantages disappear as the link technology has evolved more slowly
than the ever increasing demands on bandwidth and wire length. In a
bus, additionally, only a single pair of cores can talk at any time; envision
the performance of an interconnect with such a limitation serving hundreds
of cores (as in Figure 2.11). Adding more links (e.g. to construct either a
multibus or a full crossbar) requires more area and power and, additionally,
reduces the average utilization of each link, making them seem wasteful.
NoCs are an interesting alternative because they use short links and ad-
ditionally allow sharing those links [34]. We saw in Chapter 1 that circuit
switching allowed reducing the number of wires in telephone networks all
while maximizing the number of subscribers that could be connected; only
when a call was placed was a route between both ends of the call reserved un-
til the call ended. A more granular sharing paradigm that has been already
successfully used in computer networks for years is to use packet switch-
ing : instead of reserving channels between communicating computers for the
entire session, the messages transferred are divided into units of a fixed max-
imum size, packets, and channels are reserved only for the duration of the
transmission of a single packet. Afterwards, the channel might be used for a
packet from the same communication session, or from a conversation between
another completely unrelated pair of computers.
A typical NoC is composed of the following components (all of them can
be seen in Figure 2.12):
• The terminal nodes (or just nodes) are the components in the CMP
that are actually participating in the network, such as cores, caches,
peripherals, etc. Nodes generate (or inject) packets into the network,
expecting them to reach a certain destination; but unlike in a classical
interconnection, the route that those packets are to follow is not yet
decided. Once a packet reaches its destination node, it is ejected from
the network.
• Routers are a new architectural component introduced in NoCs. Usually,
it is the presence of routers that distinguishes a NoC from a
traditional interconnect. The router receives packets from the terminal
nodes and forwards them using any of the links available to the router
that best fulfills the goal – reaching the destination. That link might
not be a direct link to the destination node (either because that link
is currently in use, or because such link does not exist at all). In this
case, it uses that link to forward the packet to another router that will
try to repeat the same operation. Eventually, the packet will reach a
router that is directly connected to the terminal node and be ejected.
For a given packet, each router traversed is called a hop.
Each connection between a router and a link is called a port.
• Buffers temporarily store packets. A network might have one or more
buffers with the capacity to store up to a fixed number of packets.
Usually, each input port of a router has a buffer in order to temporarily
hold packets while the router is busy processing another packet.
• Links (or channels) are the physical wires. In a NoC, links are always
direct, peer-to-peer connections between either two routers or a router
and a terminal node (in which case they are known as injection or
ejection links, depending on the direction of traffic).
A link might be composed of more than one wire. The number of
parallel wires determines the bandwidth of the link. On the other hand,
the latency of a link is the time required for a bit of information to
traverse it, and is usually dependent on the length of the link.
Figure 2.12: A simple NoC connecting 3 nodes (two cores and a memory controller, each attached to a router)
Note that, since a terminal node is always attached to a router (see footnote 2), the
word node might be used to refer to both the terminal node and its attached
router.
Four parameters characterize a NoC: the topology, the routing algorithm,
the flow control algorithm, and the router design. These will be described in
the following sections.
2.4.1 Topology
The network topology describes the static arrangement of all the components
in the network: links, routers and nodes. In a sense, the topology of a
network is like a map of it. Of course, good topologies try to maximize the
bandwidth and minimize the latency, all within the available technology and
at reasonable costs.
To achieve low latencies, a topology must have reduced hop counts for
the most common packets. A good way to reduce the number of hops is
to increase the number of ports per router (concentration), so that we can
actually eliminate some routers entirely from the network and reduce the
number of routers that need to be traversed. However, increasing the number
of ports means that each router now has lots of additional wires going through
it. To alleviate this area problem, a designer might decide to reduce the
number of wires per link: such an action will reduce the available bandwidth.
Depending on the size of the average packets, packets might now need to be
sent in several cycles instead of one because there are just not enough wires to
send the entire packet in parallel. Therefore, latency is increased. Thus, as
usual, it is the job of the designer to balance all these parameters to obtain
the desired results.
Footnote 2: However, a router might be linked to more than one terminal node, or none at all.
Fortunately, there exists a quantitative method to estimate the bandwidth
of a topology: the bisection bandwidth [34]. This is defined as the minimum
bandwidth over all bisections of the network.
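For small topologies the definition can be evaluated by brute force: enumerate every way of splitting the nodes into two equal halves and keep the cut with the least total bandwidth. The sketch below does exactly that. It assumes unit-bandwidth links, an even number of nodes, and that every node is a router with its attached terminal; the helper names (bisection_bandwidth, ring, mesh) are hypothetical, and the exhaustive search is only practical for a few tens of nodes.

```python
from itertools import combinations


def bisection_bandwidth(nodes, links):
    """Minimum total bandwidth of the links cut by any equal-sized
    split of `nodes`. `links` maps (node_a, node_b) to a bandwidth."""
    nodes = list(nodes)
    half = len(nodes) // 2
    first, others = nodes[0], nodes[1:]
    best = None
    # Fixing `first` on one side avoids visiting each split twice.
    for rest in combinations(others, half - 1):
        side = set(rest) | {first}
        cut = sum(bw for (u, v), bw in links.items()
                  if (u in side) != (v in side))
        if best is None or cut < best:
            best = cut
    return best


def ring(n):
    """Ring of n nodes with unit-bandwidth links."""
    return {(i, (i + 1) % n): 1 for i in range(n)}


def mesh(k):
    """k x k 2D mesh with unit-bandwidth links."""
    links = {}
    for x in range(k):
        for y in range(k):
            if x + 1 < k:
                links[((x, y), (x + 1, y))] = 1
            if y + 1 < k:
                links[((x, y), (x, y + 1))] = 1
    return links


if __name__ == "__main__":
    print("8-node ring:", bisection_bandwidth(range(8), ring(8)))
    mesh_nodes = [(x, y) for x in range(4) for y in range(4)]
    print("4x4 mesh:   ", bisection_bandwidth(mesh_nodes, mesh(4)))
```

Any split of a ring cuts at least two links, whereas a 4x4 mesh cannot be halved without cutting at least four.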
There is a virtually infinite number of possible network topologies. The
most common ones are mesh-based, because meshes are regular and can
usually be easily laid out in a chip – see Table 2.3 for an enumeration. The
second most popular are often butterfly based (Table 2.4), because of their
high-concentration and low hop counts.
2.4.2 Routing algorithms
In a network, a route or path between two nodes is a sequence of links
$l_1, l_2, \ldots, l_n$ such that $l_1$ is an injection channel of the source node, $l_n$ is an ejection
channel for the destination node, and for every $i < n$, the router at the output of
link $l_i$ has an output port to link $l_{i+1}$.
The routing algorithm determines, for each packet, the route to be taken
between the source and the destination node. While aiming for the shortest
(minimal) route would result in the lowest hop count possible and thus the
least latency, a routing algorithm may decide to intentionally avoid choosing
the minimal route, in order to distribute the traffic between all the routes
available. Often, each router in the path chooses the next router.
Depending on how they select a path when there is more than one path
available, routing algorithms can be classified into:
• Deterministic routing algorithms always choose the same route given
the same source and destination nodes; therefore, they do not actually
do any traffic balancing at all. They are the simplest to implement,
making them quite common.
• Oblivious algorithms do choose different paths, but they do so without
taking into account the current load of the network, for example by
randomly choosing the next peer every time. However, randomness is usually
hard to come by in an environment as restricted as a CMP.
• Adaptive algorithms choose different paths depending on the state of
the network. For example, a router might select as output port the one
where the fewest packets have been sent in the previous cycles,
i.e. the one that will most probably be least congested. However,
keeping this information might be costly.
Of course, not every routing algorithm works with every topology. For
some topologies, such as butterflies, there is only one possible routing algorithm.
Mesh: Meshes are regular grids of n dimensions. An example
of a 2D one can be seen in the figure. Meshes are very
regular, which makes them easy to lay out physically
on a chip. Additionally, links are very short, and for
any given pair of nodes, there is a variety of paths
available in case one is congested.
Unfortunately, this comes at a price: for physically
distant nodes, meshes have large hop counts.
Torus: A torus is like a mesh except that additional
“wraparound” links have been added in each dimension.
This way, the maximum hop count is roughly
halved: packets that previously would have needed to
traverse the whole mesh (e.g. source and destination are
at opposite extremes) can now just use one of the
additional links to reach the destination in one hop.
As a disadvantage, those wraparound channels might
need to be as long as the chip itself.
Ring: A ring is actually a one-dimensional torus. By careful
placement of the components of the ring, long links
can be avoided, as in the figure. Unlike in meshes,
however, there is little path choice for a packet: it
traverses the ring either in one direction or in the other.
Rings are still very useful because of the reduced
number of links and the short link length.
3D Mesh: In the figure an example of a three-dimensional mesh can
be seen. Three-dimensional meshes have been used
especially where the packaging or the technology allows
them to fit better (e.g. 3D stacking).
Table 2.3: Common mesh-like network topologies
Butterfly: Butterfly networks are constructed by adding intermediate
stages between the components, containing
only a number of routers; each of the injection ports
connects to one router in stage 1, each of the output
ports of those routers connects to the input port of a
stage 2 router, and so on.
In the figure, a butterfly network with 1 stage can be
seen.
Unfortunately, a butterfly network has exactly one
possible route from each source node to each destination
node. Each possible route additionally always
has the same hop count: the number of stages.
Table 2.4: Other topologies
Figure 2.13: X-Y routing first routes in the x direction, then in the y direction
For meshes, there is a variety of algorithms [34]; however, for the complexity
reasons mentioned above, the most used one (on 2D meshes) is X-Y routing
(Figure 2.13): a packet is first forwarded to the left or the right (dimension
x) as necessary until the packet is in the same column as the destination
node; it is then routed up or down until it reaches the destination.
The generalization of X-Y routing to n-dimensional meshes, dimension
order routing, simply establishes an ordering of the n dimensions, so
that a packet is routed first in the first dimension, then in the second, and so on.
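X-Y routing is simple enough to be written in a few lines. The sketch below computes the sequence of routers visited on a 2D mesh, identifying each router by its (x, y) coordinates; that coordinate scheme and the function name xy_route are assumptions made for illustration.

```python
def xy_route(source, destination):
    """Routers visited from `source` to `destination` (both inclusive)
    on a 2D mesh, moving first along x and then along y."""
    x, y = source
    path = [(x, y)]
    while x != destination[0]:       # first correct the column
        x += 1 if destination[0] > x else -1
        path.append((x, y))
    while y != destination[1]:       # then correct the row
        y += 1 if destination[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(xy_route((3, 2), (1, 0)))
```

Because the route depends only on the source and the destination, this is a deterministic algorithm in the sense of Section 2.4.2, and the number of links traversed is always |Δx| + |Δy|, i.e. the route is minimal.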
2.4.3 Flow control
Flow control algorithms decide how the network elements (links, buffers,
routers) are allocated to packets as they traverse the network. For example,
in Chapter 1 circuit switching was mentioned, where when a call is placed,
the entire route between both ends of the call is reserved: no other call can
use the same wires. The same concept can be applied to a NoC: reserve
the entire path between the two nodes (including all links and routers) when
any of them wants to send a packet. This way, there is no potential for
collisions, because once a path has been allocated, the packet will always
successfully traverse it; no other packet will use the same path. Therefore,
circuit switching is an example of bufferless flow control because there is no
need for buffers in the network.
However, the addition of buffers offers a plethora of additional flow control
strategies that are much more granular. The simplest of these, called
store-and-forward, does not reserve any links. Instead, it depends
on the presence of buffers at each input port that temporarily store a packet
if a router is not able to forward it to the next router at that moment.
To successfully forward a packet, a router must first ensure that all of these
conditions apply:
• The link to be used is free.
• There is enough free capacity for the entire packet in the buffer at the
other end of the link.
If the high-level packets have to be fragmented (because the width of a
link is smaller than the size of a packet), they are usually segmented into flow
control digits or flits (see Figure 2.14). In these cases, store-and-forward
has a great disadvantage in that the entire packet, including all flits, has
to be received and stored in the buffer before being forwarded, which increases
transmission latency.
Figure 2.14: In a NoC context, packets are usually decomposed into flits (a header flit with destination, source and packet type fields, data flits #1, #2, ..., and a tail flit)
A minor improvement in this case is to increase the granularity of the
store-and-forward: in cut-through flow control, once the first flit of a packet
is received, an attempt to forward it to the next router in the path is made
instead of waiting for the entire packet. However, and like store-and-forward,
this is only done if there is enough space in the remote buffer for the entire
packet.
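The latency difference between store-and-forward and cut-through can be estimated with an idealized model. The sketch assumes that each link transfers one flit per cycle, that there is no contention, and that routers add no delay of their own; these assumptions and the function names are not taken from the thesis.

```python
def store_and_forward_cycles(hops, flits):
    """Every hop waits for the whole packet before forwarding it,
    so the packet pays its full length on each link."""
    return hops * flits


def cut_through_cycles(hops, flits):
    """The header flit needs one cycle per hop; the remaining flits
    follow it in a pipeline, one per cycle."""
    return hops + flits - 1


if __name__ == "__main__":
    for hops, flits in ((1, 5), (4, 5), (8, 16)):
        saf = store_and_forward_cycles(hops, flits)
        ct = cut_through_cycles(hops, flits)
        print(f"{hops} hops, {flits} flits: "
              f"store-and-forward {saf}, cut-through {ct}")
```

In this model the two schemes coincide for a single hop or a single-flit packet, and the advantage of cut-through grows with both the hop count and the packet length.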
Wormhole flow control is an extension to cut-through. When a header
flit arrives at a buffer (see Figure 2.14), the router ensures that the link
to use is free and that there is enough space in the next buffer for at least
one flit. However, once the allocation has been successful and the header flit
is forwarded, the allocation is not undone: the space in the buffer is kept
reserved and will be used by the next flit. In fact, all the required resources
for the traversal of the rest of the flits have been acquired, so, as in circuit
switching, the packet will now have no more collisions while traversing the
network. In a sense, a virtual circuit has been allocated. This virtual circuit
will be torn down once the last flit in the packet, the tail flit, passes.
In an additional flow control method, called virtual channel (VC) flow
control, physical links are multiplexed into n virtual channels by dividing the
existing buffer into n equally-sized segments. Packets are marked with a
virtual channel number either by the source or according to the input
port they arrive from. If a packet that is currently holding a link cannot be
forwarded for any reason, and therefore the link would remain idle for that
cycle, a packet marked with a different virtual channel identifier can instead
use that link, even if both packets have the same route.
Virtual channel flow control is popularly known as the Swiss-Army knife
of interconnection networks, as it can solve a variety of problems, including
some kinds of deadlocks [34].
All of these methods assume that a router can receive the status of the
buffers from the routers connected to its output ports. Simple signaling like
acknowledgment/no-acknowledgment signals can do this; if more precise
reporting of the amount of free space is required, a credit-based system can be
used: receiving routers send credits to sending routers to signal that a new
slot in the buffer is free.
Figure 2.15: A virtual-channel router high-level block diagram (input ports
with one buffer per VC, routing logic, VC allocation, switch allocation,
crossbar switch, output ports, credits in and out)
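The credit-based signaling described above can be illustrated with a minimal sketch. The class and method names are hypothetical, and the model is deliberately simplified: credits return instantly, whereas a real link would add a delay.

```python
from collections import deque


class CreditLink:
    """One link with credit-based flow control (illustrative model only)."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # free downstream slots known to the sender
        self.downstream = deque()     # stands in for the receiving router's buffer

    def try_send(self, flit):
        """Sender side: forward a flit only if a credit is available."""
        if self.credits == 0:
            return False              # no free slot downstream, so stall
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def consume(self):
        """Receiver side: free one slot and send a credit back."""
        flit = self.downstream.popleft()
        self.credits += 1
        return flit


link = CreditLink(buffer_slots=2)
results = [link.try_send(f) for f in ("head", "body", "tail")]
# The third flit stalls until the receiver frees a slot.
link.consume()
retry = link.try_send("tail")
```

With two buffer slots, the third flit is refused until the receiver has consumed one flit and returned a credit.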
2.4.4 Router architecture
Routers are on the critical path of the interconnection network; their
processing time influences the speed of the entire interconnect, and yet they
cannot be too complex since, depending on the topology, there might be a large
number of routers in a many-core CMP.
One of the basic components of the router is the crossbar switch. It allows
arbitrarily connecting each input port to any output port of the router. As
noted in Section 2.1.1, this means that the complexity of a router
strongly depends on the number of ports it has.
In this section we will briefly describe the architecture of a router
(Figure 2.15) by following the different stages of a flit as it is processed
inside it. Real router designs might perform all of these stages in a single
clock cycle or, as a performance improvement, be pipelined (as seen in
Section 2.1.1).
1. When a flit arrives at a router via an input port, it is stored in the
corresponding buffer for that input port and virtual channel (if any).
2. The next stage is route calculation; the routing algorithm (Section 2.4.2)
is applied to determine which one of the output ports should be used
to forward this flit. For the simplest routing algorithms, such as X-Y,
this stage can be done by simple combinational logic.
3. Given the output port, a virtual channel must now be allocated (VC
allocation) if the flow control algorithm demands it (Section 2.4.3).
4. After this, a channel now needs to reserve the crossbar switch (switch
allocation) and configure it to connect the flit’s input port to the
corresponding output port.
5. The flit now needs to traverse the crossbar switch.
6. Credit might be sent now, if the flow control algorithm demands it.
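To illustrate why the route calculation stage (step 2) can be very simple, the following sketch shows X-Y (dimension-order) routing on a mesh as a handful of comparisons. The port names, the coordinate convention (y grows towards "NORTH") and the helper trace_path are assumptions made for this example only.

```python
# Offsets applied when leaving a router through each (hypothetical) port.
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}


def xy_route(current, destination):
    """Return the output port for a flit at `current` heading to `destination`.

    Both arguments are (x, y) mesh coordinates. The X dimension is always
    corrected first; only then is the Y dimension corrected.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"   # the flit has arrived: eject it to the local component


def trace_path(source, destination):
    """Follow xy_route hop by hop and return the list of visited routers."""
    path = [source]
    node = source
    while True:
        port = xy_route(node, destination)
        if port == "LOCAL":
            return path
        node = (node[0] + STEP[port][0], node[1] + STEP[port][1])
        path.append(node)
```

For instance, trace_path((0, 0), (2, 1)) first moves along X and then along Y.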
2.5 Conclusions
In this chapter chip multiprocessors have been described, and the main com-
ponents of a CMP have been introduced:
• Cores, which execute instructions. Several ways to enhance the per-
formance of a single core have been shown, presenting the trade-off
between the basic IPC of a core and its power and area.
• The memory hierarchy, which exploits the principle of locality to minimize
the number of accesses to slow external memory. The distribution
and sizing of cache memories is one important parameter of the de-
sign space exploration, and presents another important area and power
trade-off.
• The interconnection, of which we have described the most recent ad-
vances – Networks-on-Chip – and design parameters such as the topol-
ogy.
Chapter 3
Hierarchical network topologies
In this chapter we discuss the network topologies that we are interested in
exploring: hierarchical network topologies [13, 14].
We define a hierarchical network as a combination of more than one clas-
sical topology (a mesh, ring, bus, . . . ) in a hybrid fashion. Using the first
topology, we construct a network that will not interconnect the leaf compo-
nents of the CMP (cores, cache memories, peripherals); rather, each node in
this network is actually a subnetwork that follows the second topology, and
in each of these subnetworks, each node is an additional subnetwork following
the third topology, and so on until reaching the last level, where the nodes
are the actual CMP components.
Usually, the top network is called the global interconnect, while the
leaf subnetworks, those at the bottom of the hierarchy that interconnect
the actual CMP components, are often called local interconnects.
3.1 Motivation
Hierarchical network topologies are interesting because
we assume that most of the traffic generated by one core will target other
components in its vicinity (communication locality).
In a CMP without any cache memory at all, the only traffic present would
be:
1. Requests to read from memory (from the cores to the main memory
controllers), and their replies (containing the actual memory data, from
the memory controllers to the cores).
2. Requests to write to memory (containing data, from the cores to
the main memory controllers), and their replies (from the memory
controllers to the cores).
3. Communication between cores (e.g. interprocessor interrupts).
Figure 3.1: Like the memory hierarchy, a hierarchical topology tries to exploit
the principle of locality
When compared to the traffic generated by (1) and (2), (3) is negligible.
Thus, most of the traffic is actually memory traffic, the kind of traffic that
is reduced by the presence of a healthy memory hierarchy (see Section 2.2).
As we said in that section, one of the design aspects of a memory hierarchy
is controlling the placement of cache memories. The memory hierarchy is
constructed so that most of the traffic goes to the most local levels (L1 or
L2), which we expect to be placed near the cores. There will still be traffic
to the main memory controllers and the global cache levels, which corresponds
to misses in the local cache levels. However, due to the principle of locality,
this kind of traffic is low when compared to the local traffic.
Figure 3.2: A hierarchical network topology: a mesh of buses
Figure 3.3: Routing in a hierarchical network: each header flit contains the
full sequence of addresses for each hop the packet must go through. NI nodes
are network interfaces, while R nodes are routers.
3.2 Routing in a hierarchical topology
As explained in Section 2.4.2, routing algorithms are based on packets having
a destination in the header flit. In a hierarchical topology, this destination
can be in the same subnetwork as the source (local), or in a different
subnetwork at the same or a different hierarchy level. In the former case,
routing is done exactly as explained in Section 2.4.2. In the latter case, when
a packet traverses different hierarchy levels, it should include all the
information necessary to be routed to the destination, which in this case
implies the traversal of other subnetworks that are in the path between the
source and the target nodes.
In practice, the header flit is modified to contain the sequence of hops that
need to be taken in order to reach the destination, as seen in Figure 3.3. In
the case of a local destination, the sequence length is one and is the address of
the target. However, in the case of a non-local destination, every intermediate
hop in the sequence corresponds to the address of a network interface, a new
architectural element in a hierarchical topology which handles the routing
of packets between subnetworks. The following section describes this new
component of a hierarchical topology.
3.2.1 Network interfaces
A network interface connects two different subnetworks in a hierarchical
topology. Unlike a router, a network interface only has two ports.
Additionally, since the hierarchy is a tree-like structure, the relation
between these two subnetworks is parent-child, i.e., one subnetwork belongs to
a higher level than the other. This means that the characteristics of the
networks connected via a network interface may be different, e.g., the topology
or the link width. In particular, since subnetworks connected by a
network interface may have different link widths, flits cannot be forwarded as
is from one subnetwork to the other: when receiving a packet from one of
the subnetworks, the network interface waits until all flits have been received
(store-and-forward, see Section 2.4.3). Then it may need to adapt the packet
to the characteristics of the target subnetwork (fragment/defragment).
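The fragment/defragment step can be sketched as follows. This is only an illustration under strong assumptions: the payload is modeled as a string of bits, header contents are ignored, a short last flit is not padded, and the function names are hypothetical.

```python
def to_flits(payload_bits, link_width):
    """Split a payload (a string of '0'/'1') into flits of `link_width` bits."""
    return [payload_bits[i:i + link_width]
            for i in range(0, len(payload_bits), link_width)]


def adapt_packet(flits, target_width):
    """Re-packetize a fully received packet for a link of another width.

    The network interface works in store-and-forward mode, so all flits of
    the packet are available before the first output flit is produced.
    """
    payload = "".join(flits)          # defragment: rebuild the whole payload
    return to_flits(payload, target_width)   # fragment for the target link


narrow = adapt_packet(["1010", "1100"], target_width=2)
wide = adapt_packet(narrow, target_width=8)
```

Going from 4-bit to 2-bit links doubles the number of flits, while going to an 8-bit link merges them into a single flit.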
Moreover, a network interface keeps the routing information correct, i.e., it
removes its own address from the sequence of hops, so that the immediate
destination is at the front of the sequence. As, traditionally, the source
address is also stored in the header flit, a network interface might also set
the source address of the packet to its own address.
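The header handling described above can be sketched as follows. The HeaderFlit type, the function name and the addresses in the example are hypothetical; in particular, we assume that a network interface has one address in each of the two subnetworks it connects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeaderFlit:
    hops: List[int]   # remaining addresses, immediate destination first
    source: int       # source address, valid in the current subnetwork


def cross_network_interface(header, own_address, address_in_next):
    """Return the header as rewritten by a network interface.

    `own_address` is the interface's address in the subnetwork the packet
    arrives from; `address_in_next` is its address in the subnetwork the
    packet is forwarded to (both are assumptions of this sketch).
    """
    if not header.hops or header.hops[0] != own_address:
        raise ValueError("packet is not addressed to this network interface")
    # Drop our own address so the next hop is at the front of the sequence,
    # and present ourselves as the source inside the next subnetwork.
    return HeaderFlit(hops=header.hops[1:], source=address_in_next)


header = HeaderFlit(hops=[5, 7, 2], source=1)
in_global = cross_network_interface(header, own_address=5, address_in_next=3)
in_target = cross_network_interface(in_global, own_address=7, address_in_next=0)
```

After crossing two network interfaces, only the address of the final target remains in the sequence.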
3.3 Global mesh based topologies
Meshes are usually used as the global interconnect because it makes sense
from a layout point of view to divide a CMP into tiles, in a grid layout, and
then connect each of these using a mesh [13]. Experimentally, when compared to
other topologies like the torus, meshes perform better in terms of bandwidth
and power due to being able to handle more traffic before reaching
saturation [14].
In the following subsections we review some promising interconnects that
use a mesh as the global topology.
Mesh of meshes
In this topology every level of the hierarchy is a mesh. An example of this
topology can be seen in Figure 3.4. While meshes are attractive when used
as a global topology due to their regularity, short wires and high bandwidth,
at the local level ad-hoc topologies like a ring, a bus, or others might
perform better in terms of power, especially when the number of components is
small. Exploring whether this is the case is one of the contributions of this
work.
Mesh of buses
This is a two-level hierarchy where the global interconnect is a mesh, and
local interconnects are buses (or multibuses). An example of this topology
can be seen in Figure 3.2. This topology has already been described in [14].
The reasons to use buses at the local level are low latency, low power and
high bandwidth, as long as the number of components is small. As explained
in Section 2.3.1, however, the bus is a resource that is arbitrated, which
increases the complexity of the logic at the network interface.
As well as a mesh of buses, we can consider a mesh of multibuses: this
is a trivial extension, also described in Section 2.3.1, which increases the
bandwidth of the bus at the cost of power. Multibuses having few parallel
links may represent an interesting middle point between buses and other
more power-hungry topologies.
Mesh of rings
This is a two-level hierarchy where the global interconnect is a mesh, and
local interconnects are rings. An example of this topology can be seen in
Figure 3.5. As with the previous topology, this one also has high bandwidth
and relatively low power, without having some of the inherent problems a
bus has regarding scalability.
3.4 Conclusions
In this chapter, the concept of hierarchical network topologies has been
developed. Hierarchical topologies can provide increased benefits while keeping
costs low in a way that no flat topology could achieve, assuming that
traffic between components in a CMP has significant spatial locality.
Additionally, several example topologies have been presented along
with the features that make them interesting for exploration.
Figure 3.4: A hierarchical network topology: a two-level mesh of meshes
Figure 3.5: A hierarchical network topology: a two-level mesh of rings
Chapter 4
Simulation of a chip multiprocessor
Simulation is one of the most common options selected when evaluating
new CMP designs. While analytical modeling techniques are usually
preferred for the exploration of a large design space, there are situations
where the abstract mathematical models are just too inaccurate for the needs of
the designer. The concept of simulation, on the other hand, is very flexible.
It allows a designer to experiment with systems that would be impractical
to physically construct.
However, care must be taken: the simulator of a complex CMP system
is at least as complex as the system itself. Additionally, it is very easy to
produce data that looks logically perfect but has nothing in
common with the real-world system that is to be simulated. In general,
simulation is traditionally compared to a surgeon’s scalpel: in the right hands
it can accomplish tremendous good, but it must be used with great care [36].
There are three basic questions that have to be answered in order to ensure
that the simulation will actually produce data that is comparable to the real
system:
1. Is the simulation model accurately representing the real system?
2. Is the input data sufficiently representative of real-life data?
3. What is the confidence level of the simulation results?
In this chapter, we will describe how we are simulating the hierarchical
CMP systems that we have described in Chapters 2 and 3. In addition, we
will answer the above questions. We will first describe in high-level terms
how the simulation is done and compare it to several strategies that
have different levels of accuracy (in Section 4.1), trying to give a meaningful
answer to question 1. In Section 4.2, a few methods to assess the confidence
in a result set produced by a simulator will be described, answering question
3. In Section 4.3.1, we will describe the input data to the simulation
of a CMP model: the workloads, or applications to be run. The remaining
sections of this chapter give an insight into the models used in the
simulation.
4.1 Levels of detail
We mentioned in Section 1.1 that simulation is an evaluation method that
can generally be made as precise as required, at the cost of speed. This is
usually accomplished by selecting the level of detail at which a simulation
is to be done [34]: a lower, more detailed level implies complex models
which might even model the underlying physical processes, while a higher,
more abstract level uses simpler models, or even replaces some models with
random processes of a similar characterization.
A simulator does not have to use a single level of detail for the entire
model; in a complex system, maybe only the parts under study need to be
simulated with the most accurate detail while the other components can be
simulated in a more lightweight way. In the world of CMPs, each of the
parts described in Chapter 2 is simulated with a different model; some
of the most common levels of detail for simulating cores and interconnects
are shown in Table 4.1 and Table 4.2. Apart from the additional precision,
increased levels of detail might provide additional information that is not
easy to obtain when using a more abstracted model.
Table 4.1: A summary of common core simulation levels of detail, sorted from
the least detailed to the most.
  Behavioral level: The simulation centers only on the external behavior of
    the core (how many memory requests, instructions executed per second,
    etc.), usually using a random distribution, or maybe by continuously
    replaying a finite recording of the real behavior of a processor (a trace).
  Instruction level: The simulator reads a source program and produces the
    correct result for each instruction. This way, one can check which results
    programs written in the instruction set of the simulated core will produce.
  Cycle level: Models the timings of each instruction in the instruction set
    and keeps a (virtualized) time counter, so that all simulated operations
    are guaranteed to be executed at the virtualized time at which they would
    be executed on the real processor. A program’s execution time can be
    measured using this level of detail.
  Register transfer level: Models the data flow between the individual
    registers inside the core.
  Gate level: Models the contents and activity of each logic gate; usually too
    low level and only used when debugging hardware implementation concerns.
Table 4.2: A summary of common CMP interconnection simulation levels of detail,
sorted from the least detailed to the most.
  Protocol level: Only the behavioral aspects of the interconnect are
    simulated, maybe using a distance-based approximation to estimate the
    traversal time of a packet.
  Flit level: Tracks the movement of each flit inside the network, modeling
    the behavior of routers, buffers, switches, links, etc. Can provide
    accurate flit latencies.
  Hardware level: Models physical hardware details, such as in the lower
    levels of Table 4.1.
4.2 Steady-state simulation
Depending on what exactly we are measuring in the system under simulation,
we can divide simulations into two types:
• Terminating simulations (also called finite-horizon): the system only
needs to be studied for a finite amount of time, i.e. from a specific event
(e.g. power on of the system) to another specific event (e.g. once the
program that is currently being executed is over). Hence, a terminating
simulation has a very well defined set of initial conditions (the status
of the system when the simulation is to be started), as well as a clear
stopping condition.
After such a simulation, the resulting data is accurate only for the
conditions during that time period and nothing else. As an example,
in the context of a core simulated until a specified program finished,
data that can be obtained includes: the total runtime of the program,
the final result, the number of times any component from the system
was utilized during the program execution, total energy consumption,
etc.
• On the other hand, for a steady-state (alternatively, infinite-horizon)
simulation, it is desirable to study the system for an infinite period
of time. We assume that the long-run conditions of the system under
study will be mostly unchanging, and we want to obtain average data
of the system running under those conditions.
For example, consider the average number of web pages served by a web server:
there is no easy way to determine when the simulation should be started
or ended so that we get a meaningful number; on the other hand, the
average rate of incoming web requests, as well as the rate at which
the server serves requests, can be considered unchanging.
Both simulation types are interesting in a CMP context. Even in a design
exploration context, a terminating simulation is sometimes used: a bench-
mark program execution is simulated on all the candidate designs, and the
designs finishing the benchmark in the least amount of time are selected as
best designs. However, as we will argue in Section 4.3.1, we cannot easily
produce such benchmarks in the kind of many-core designs we are exploring
in this work. Additionally, a steady-state simulation can often be performed
faster than a terminating simulation, as the system only needs to be sim-
ulated until the desired conditions are reached for a sufficient time. This
makes steady-state simulation the best choice for the exploration of a large
design space.
However, a steady-state simulation has some problems that must be
considered, namely detecting when the steady state has been reached and
calculating when to stop the simulation. These are detailed in the following
subsections.
4.2.1 Initialization bias
In a steady-state simulation, as well as in a terminating one, startup
conditions have to be provided. Ideally, the selection of a set of initial
conditions should not have an effect on the output of the simulator, because we
are interested in the steady-state conditions. However, the initial conditions
are usually chosen arbitrarily. For example, we might choose the “empty” state
(no packets) as the initial conditions for an interconnection network, and it
might take an unusually long time for the participants in the network to inject
enough traffic to reach the desired steady state of the network. During this
time the network status is uncharacteristic of the steady state. The
initialization bias problem thus appears when this warm-up time is long enough
for the initial conditions to have a significant impact on the final output of
the simulation.
Figure 4.1: An example of initialization bias; the warm-up period is marked in
red. (Plot of average core performance against total cycles simulated.)
An example of the initialization bias problem can be seen in Figure 4.1.
It is a very simple simulation of a core in order to estimate its performance.
However, it is initialized with all its caches empty. In the first half of the
simulation the measured performance varies wildly as the caches get filled.
This causes the total average to be lower than the actual steady-state average,
where the memory hierarchy is functioning correctly.
There are many methods to deal with initialization bias:
• Make the initial conditions similar to the steady-state conditions. Un-
fortunately, these are exactly the conditions that we are trying to mea-
sure!
• Simulate for a long enough time so that the effects of the initialization
bias are overwhelmed by steady-state data. As an example, if we were to
simulate the system in Figure 4.1 for 20,000 cycles instead of 200, the
bias in the first 100 cycles would become negligible. However, it is usually
desirable to minimize the total runtime of the simulation, not to lengthen it.
• Remove the offending data when calculating the average. This is often
called the removal of the warm-up time. Still, this leaves the problem
of how to determine the cutting point; that is, how to actually calculate
the required warm-up time to remove.
Generally, knowledge of the simulated system is enough to provide a
reasonable warm-up time (e.g. for a network, the time required for a
handful of packets to traverse it). Alternatively, it can be estimated by
simple manual observation of several plots of the evolution of the
measured variable (popularly known as Welch’s method). Additional
statistical methods for the automated calculation of the warm-up
period exist [37, 38], but those may require computationally expensive
calculations.
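The last two options in the list above can be illustrated with a small sketch. The trace is synthetic and deliberately simplified (a constant 1.0 during a 100-cycle warm-up and a constant 3.0 afterwards); the values and function names are assumptions chosen only for this example.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of measurements."""
    return sum(values) / len(values)


def steady_state_mean(samples, warmup):
    """Average `samples` after discarding the first `warmup` measurements."""
    if not 0 <= warmup < len(samples):
        raise ValueError("warm-up removal must leave at least one sample")
    return mean(samples[warmup:])


# Synthetic trace: low performance while the caches fill, then stable.
short_trace = [1.0] * 100 + [3.0] * 100
biased = mean(short_trace)                         # includes the warm-up
corrected = steady_state_mean(short_trace, 100)    # warm-up removed

# Simulating 20,000 cycles instead of 200 hides most of the bias.
long_trace = [1.0] * 100 + [3.0] * 19900
diluted = mean(long_trace)
```

Here the biased average is 2.0, the average with the warm-up removed is 3.0, and the long run yields 2.99 at the cost of simulating one hundred times more cycles.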
4.2.2 Output analysis and stopping criteria
Additional problems arise even once the simulation reaches the steady-state,
after an appropriate warm-up period. Firstly, we need to know how to convert
the measurements taken from the simulator so that a meaningful simulation
output can be provided.
For example, a very low-level core simulator might produce a new
temperature measurement for every cycle of simulation ($X_1, X_2, \ldots, X_n$).
An interconnection network simulator will produce, every time a flit arrives at
its destination, the latency for that flit. Usually, what is desired as the
output of the simulation is actually a function of all of those measurements,
such as the arithmetic mean (average temperature, average latency):
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \]
This points us to the second problem, and to the original question 3 from
the introduction of this chapter: how well does the output data represent the
underlying system? Can statistical tests be used with the output from the
simulation? This is especially interesting when randomness is used as input
to the simulator. Intuitively, random input data would produce truly random
(independent) results. However, even when using random data as simulator input,
the sequences of measurements taken from the simulation are highly
autocorrelated; in particular, $X_{n+1}$ will be strongly correlated with
$X_n$, as the state of the simulated system is kept from one measurement to the
next: if the system was congested in measurement n, it will very likely be
congested in measurement n + 1 too. Therefore, estimating the variance of
$\bar{X}_n$ using the classical methods, which assume independent samples, is
meaningless.
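The autocorrelation of simulation output can be observed even in a very small model. The sketch below does not use our simulator; as a stand-in, it assumes a single-server queue with exponential arrival and service times and records the waiting time of each customer. All names and parameter values are illustrative assumptions.

```python
import random


def lag1_autocorrelation(xs):
    """Sample correlation between each measurement and the next one."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def queue_waiting_times(n, arrival_rate, service_rate, seed):
    """Waiting times of n consecutive customers in a single-server queue."""
    rng = random.Random(seed)
    wait, waits = 0.0, []
    for _ in range(n):
        waits.append(wait)
        service = rng.expovariate(service_rate)
        gap = rng.expovariate(arrival_rate)
        # The next customer waits for whatever work is left when it arrives.
        wait = max(0.0, wait + service - gap)
    return waits


# Random inputs, but the output inherits the state of the queue.
waits = queue_waiting_times(5000, arrival_rate=0.8, service_rate=1.0, seed=1)
r_sim = lag1_autocorrelation(waits)

# For comparison: genuinely independent samples.
rng = random.Random(2)
r_iid = lag1_autocorrelation([rng.expovariate(1.0) for _ in range(5000)])
```

Although both sequences are driven by the same kind of random input, the waiting times are strongly correlated from one measurement to the next, while the independent samples are not.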
A statistical indicator that can work in these scenarios is required. There
are many methods to handle this problem [38], but we will focus on the
two best-known ones: the independent replications and the batch means
methods.
Finally, there is the problem of how many samples need to be taken from
the simulation, so that the simulation can be stopped as early as possible.
Fortunately, the solution to this problem is closely related to the previous
one: we can simply run the simulation until the statistical indicator currently
in use reaches a satisfactory accuracy threshold, for example, until the
confidence interval, which we will see later in this section, shrinks below a
certain length.
Independent replications
Assume that we want to estimate a parameter µ of a system using simulation.
The independent replications method runs k simulations (each of length n)
of the system, each of them fully independent of the others, using different
random number generator seeds. Now, we have a sequence of measurements
for each run r: $X_{r,1}, X_{r,2}, \ldots, X_{r,n}$. We can calculate the
average for each independent run:
\[ \bar{X}_r = \frac{1}{n} \sum_{i=1}^{n} X_{r,i} \]
Notice that, since the runs are independent of each other, all $\bar{X}_r$
averages are truly independent random variables. Therefore, we can now
estimate µ using the average of all independent runs:
\[ \bar{X} = \frac{1}{k} \sum_{r=1}^{k} \bar{X}_r \]
We can also estimate the variance and standard deviation of the run averages
using the classical sample variance formula:
\[ \sigma_X^2 = \frac{1}{k-1} \sum_{r=1}^{k} (\bar{X} - \bar{X}_r)^2 \]
Figure 4.2: Using large batches reduces correlation between one batch and the
next. The temporarily low performance situation is completely inside the first
batch, as is the subsequent burst inside the second batch.
Note the importance of the selection of the parameter k: it is the only
means of controlling the sample size. A sample size that is too small will
inevitably cause large variances.
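The independent replications method can be sketched in a few lines. The function run_replication is a hypothetical stand-in for one complete simulation run (here it simply draws normally distributed measurements); only the way the runs are combined follows the formulas of this section.

```python
import random
import statistics


def run_replication(seed, n):
    """Hypothetical stand-in for one simulation run of n measurements."""
    rng = random.Random(seed)         # a different seed for every run
    return [rng.gauss(10.0, 2.0) for _ in range(n)]


def independent_replications(k, n, base_seed=0):
    """Return the estimate of the mean and the sample variance of run means."""
    run_means = [statistics.fmean(run_replication(base_seed + r, n))
                 for r in range(k)]
    estimate = statistics.fmean(run_means)
    variance = statistics.variance(run_means)   # divides by k - 1
    return estimate, variance


estimate, variance = independent_replications(k=30, n=200)
```

Since each run uses its own seed, the k run means can be treated as independent samples, which is what makes the classical variance formula applicable.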
Batch means
One problem with the independent replications method is its main
requirement of k independent replications. For each of them, the cost of
simulation setup, warm-up time, and results collection is incurred, so the
total simulation time is longer than if we were to just run a single simulation
that is k times longer.
In the batch means method that is exactly what is done: assume that, in
order to measure a parameter µ, a simulation of length n has been run. We
split the measurements $X_1, X_2, \ldots, X_n$ into k batches, each of a fixed
size b, so that n = kb. Therefore, batch number r will contain the sequence of
measurements $X_{b(r-1)+1}, X_{b(r-1)+2}, \ldots, X_{br}$, and the batch mean
for batch r will be:
\[ \bar{B}_r = \frac{1}{b} \sum_{i=1}^{b} X_{b(r-1)+i} \]
And the mean of all the batch means:
\[ \bar{B} = \frac{1}{k} \sum_{r=1}^{k} \bar{B}_r \]
Notice that the above mean is the same as the arithmetic mean
of the original sequence of measurements $X_1, X_2, \ldots, X_n$. However, with
the batch means method, and assuming that b is sufficiently large, the
different batch means $\bar{B}_r$ should be approximately uncorrelated. This is
because, even though the measurements are not truly independent, the large
batch size implies that the batch mean has a reduced variance and partially
hides this correlation.
For example, in an interconnection network simulation, a large batch size
means that most temporary congestion produced in one batch will
have disappeared by the start of the next batch, as in Figure 4.2. Of course,
the batches are not truly independent; in-flight packets at the end of one
batch will remain so at the start of the next batch, and with a small enough
b, these effects will be significant.
Therefore, the choice of the batch size b is important. A small b will
cause batches to not be truly independent, casting doubt on the
validity of the statistical indicators obtained. A large b will, if n is fixed,
create very few batches, resulting in a very small number of samples k
for the calculation of the variance and thus an unnecessarily large variance.
If we are instead determining the runtime of the simulation (n) from
the quality indicators obtained by the batch means method, then a large
b will cause unnecessarily long runtimes. It is usually recommended that b is
selected so that k is somewhere around 30 batches [34, 38]. A good value for b
might also be estimated, like the warm-up time, using knowledge of the system
under simulation.
Assuming that the batch means are approximately independent, we can again
calculate the variance using the sample variance:
\[ \sigma_B^2 = \frac{1}{k-1} \sum_{r=1}^{k} (\bar{B} - \bar{B}_r)^2 \]
A very popular extension to the batch means method is the use of overlapping
batch means, which slightly decreases the measured variance even with
the same number of batches [38].
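The (non-overlapping) batch means computation can be sketched as follows. The function names are hypothetical, and we assume that measurements left over when n is not a multiple of b are simply discarded.

```python
import statistics


def batch_means(samples, b):
    """Split `samples` into k = len(samples) // b batches; return their means."""
    k = len(samples) // b
    if k < 2:
        raise ValueError("at least two batches are needed for a variance")
    return [statistics.fmean(samples[r * b:(r + 1) * b]) for r in range(k)]


def batch_means_estimate(samples, b):
    """Return the mean of the batch means and their sample variance."""
    means = batch_means(samples, b)
    return statistics.fmean(means), statistics.variance(means)


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
means = batch_means(samples, b=2)
estimate, variance = batch_means_estimate(samples, b=2)
```

With n = 8 and b = 2 there are k = 4 batches with means 1.5, 3.5, 5.5 and 7.5; their mean, 4.5, equals the mean of the original sequence, as noted above.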
Confidence intervals
If we have the variance of an estimate $\bar{X}$ of the parameter µ of a system
(for example, using any of the previous methods), there is a very useful
statistical tool that gives a quantitative answer to the question of how well
the output data represents the underlying system: a confidence interval.
A confidence interval (u, v) indicates that µ, the real value that we are
estimating, will fall, with a certain confidence level 100(1−α)%, between the
two ends of the interval; that is:
\[ 1 - \alpha = \Pr(u < \mu < v) \]
Obviously, α is a probability, and thus between 0 and 1. For example, if
we want to ensure with 95% confidence level that µ will be between u and v,
then α = 0.05.
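A confidence interval can be constructed from the sample mean $\bar{X}$ and
standard deviation $\sigma_X$ (calculated for example using any of the two
previous methods):
\[ \left( \bar{X} - \frac{\sigma_X \, t_{n-1,\alpha/2}}{\sqrt{n}}, \; \bar{X} + \frac{\sigma_X \, t_{n-1,\alpha/2}}{\sqrt{n}} \right) \]
Here n denotes the number of samples from which $\bar{X}$ and $\sigma_X$ were
computed (the number of replications or batches, called k in the previous
methods), and $t_{n-1,\alpha/2}$ is the upper α/2 critical value of Student’s
t-distribution with n − 1 degrees of freedom. Alternatively, the normal
distribution can be used if we know that the number of samples is larger than
30, due to the Central Limit Theorem.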
The length (distance between the lower and upper bound) of a
confidence interval is also traditionally used as an accuracy indicator. A
popular stopping criterion, for example, is to stop when the length of the
calculated confidence interval decreases below a certain threshold.
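The confidence interval and the stopping criterion above can be sketched as follows. To stay within the Python standard library, this sketch uses the normal approximation mentioned above instead of Student's t-distribution, so it assumes more than 30 samples (replication or batch means). The function names and the example data are hypothetical.

```python
import math
import statistics


def confidence_interval(sample_means, alpha=0.05):
    """Normal-approximation confidence interval for the mean of the samples."""
    k = len(sample_means)
    centre = statistics.fmean(sample_means)
    sigma = statistics.stdev(sample_means)              # uses k - 1
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    half = z * sigma / math.sqrt(k)
    return centre - half, centre + half


def should_stop(sample_means, max_length, alpha=0.05):
    """Stopping criterion: is the interval shorter than `max_length`?"""
    low, high = confidence_interval(sample_means, alpha)
    return (high - low) < max_length


# Forty made-up batch means spread around 10.0.
means = [10.0 + 0.1 * ((i % 5) - 2) for i in range(40)]
low, high = confidence_interval(means)
```

A simulator could evaluate should_stop after every new batch and terminate as soon as it returns true; the threshold max_length is a choice left to the designer.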
4.3 Simulator design
As we discussed in the introduction of this chapter, a simulator for a
component is usually as complex as the simulated component itself. In this
section, we will describe how the different components of a CMP (and
specifically, a CMP using a hierarchical topology) are simulated.
4.3.1 Simulating cores
A core is usually the most complex circuit in a CMP. Therefore, it is also
usually the most complex object in the simulation.
From an external point of view, a core is a CMP component that:
1. Is connected with other components and may inject traffic such as
memory reads, writes, or peripheral and interprocessor communication.
2. Executes instructions.
While it is intuitive and expected that a processor will eventually stop
executing instructions if its memory requests are no longer served, exactly how
and when it will do so is one of the aspects that need to be decided
by the core model. Note that the results of the execution of a program are
usually either transferred to a peripheral (e.g. a video graphics array or a
printer), or stored in memory, and therefore fall under point (1).
We already showed in Table 4.1 that there are several levels of detail
at which a core can be simulated. A full-system simulator that runs real
programs or even operating systems would need to include several parts:
first, an instruction decoder able to read (and run) a program written for
the simulated machine; second, knowledge of how much time each instruction
takes to execute; and third, knowledge of which traffic each instruction
generates and which incoming traffic each has to wait for. Such a complex
simulator would faithfully simulate the behavior of the real core, and would
generate the same traffic the real core would generate when running the same
program. However, unless simulating very slow cores, this full-system
simulator would be orders of magnitude slower than the real core.
Additionally, as we saw in Section 4.2, since the simulator is executing an
arbitrary program or benchmark, the steady state may never be reached. On the
other hand, if the programs that are being run terminate, the simulation will
be of the terminating type. This level of detail is often required if the
objective of the simulation is to study in depth the performance
characteristics of the processor, or its behavior for certain programs.
This can be simplified if we are mostly interested in the external behavior
of the core, either because we are only interested in exploring the
performance of other components in the CMP or, as in this work, because
simulating every processor with such a level of detail would simply be
infeasible. One simplification is the use of traces: a real program is run on
the real core, and every external transaction (memory requests, peripheral
accesses, ...) is logged. This list of accesses, the trace, can then be
replayed to simulate the external behavior of the processor as many times as
required. The trace might be expensive to create, but this cost needs to be
paid only once. The main problem with this method is that if we cannot
physically construct the CMP that we want to simulate, it becomes impossible
to extract the traces from a
running system. For this case, alternative strategies have to be envisioned,
such as extrapolating from the traces obtained in a very similar system, or
using a statistical model based on the expected statistical characteristics of
the core and the program that is to run on it.
Traditionally, a stochastic process is used for the statistical
approximation. This kind of process models the evolution of a system over
time by assigning a certain probability to each possible outcome at every
time instant. That is, a function x(t, s) is defined which assigns a
probability to every time instant t and every possible outcome s at that time
instant. One of the simplest models, the Bernoulli process, models a process
with two outcomes (an event happens, or it does not) using fixed
probabilities p and 1 − p for the two outcomes. These probabilities do not
change as t advances. Bernoulli processes can be used, for example, to model
the injections of a core if we assume that the core’s injection behavior does
not change over time, or that it does change but need not be modeled because
of low accuracy requirements. There are many additional types of stochastic
process [38] that are used to model different real-life situations.
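To make the Bernoulli model concrete, the following minimal sketch generates an injection pattern for one core. The function name `bernoulli_injections` and the values p = 0.2 and 10,000 cycles are hypothetical choices for illustration only.

```python
import random


def bernoulli_injections(p, cycles, rng):
    """Return one boolean per cycle: True where the core injects a packet.

    Every cycle is an independent trial with the same probability p,
    which is what makes this a Bernoulli process.
    """
    return [rng.random() < p for _ in range(cycles)]


rng = random.Random(1234)                       # fixed seed for repeatability
trace = bernoulli_injections(p=0.2, cycles=10_000, rng=rng)
injection_rate = sum(trace) / len(trace)        # approaches p for long runs
```

Because p never changes, such a source cannot react to the state of the network; this is the limitation that motivates the richer model discussed next.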
Markov chains
For this work, it was decided that the most promising approach would be
to model the behavior of the cores by using a stochastic process: Markov
chains [38, 39]. This decision was made on the rationale that full-system
simulation would be excruciatingly slow for an exploration task and that, on
the other hand, traces would be difficult to obtain for a system that is not
yet built. A simpler stochastic process would fail to model some of the
characteristics of the kinds of processors that we are interested in modeling
in this work. For example, a simple in-order processor will stop when waiting
for a memory request to be served (it is self-throttling; see Section 2.1).
Specifically, a Markov chain defines a state machine with states
1, 2, . . . , n so that at any given time instant t the machine is in exactly
one state $X_t$. The Markov chain additionally defines a probability
distribution function that determines, if at time instant t the machine is in
state $X_t = i$, the probability that at time instant t + 1 the machine will
be in any possible state j [38]. That is, the Markov chain defines this
function:
$\Pr(X_{t+1} = j \mid X_t = i)$ for every $j = 1, 2, \ldots, n$
Note how the probability function does not depend on past states, but only
on the current state.
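The transition function above can be stored as a matrix whose row i holds the probabilities of moving from state i to each state j. The following is a minimal sketch under that assumption; the two-state chain and its probabilities are hypothetical, and states are numbered from 0 rather than 1 to match Python indexing.

```python
import random


def next_state(current, transition, rng):
    """Sample X_{t+1} given X_t = current, using row `current` of the matrix."""
    row = transition[current]
    return rng.choices(range(len(row)), weights=row, k=1)[0]


# Hypothetical chain: state 0 = running, state 1 = stalled.
P = [
    [0.8, 0.2],   # from running: keep running, or stall
    [0.5, 0.5],   # from stalled: resume, or stay stalled
]

rng = random.Random(7)
state = 0
visits = [0, 0]
for _ in range(20_000):
    visits[state] += 1
    state = next_state(state, P, rng)   # depends only on the current state
```

For this particular matrix the long-run fraction of time spent in state 0 is 0.5 / (0.2 + 0.5) = 5/7, which the visit counts approach as the number of steps grows.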
[Figure 4.3: state diagram. From “Running”, a load/store instruction
(probability = MPI) leads to “Memory reference”; otherwise a non-memory
instruction is executed. From “Memory reference”: L1 hit (Pr = pL1), stall
for LL1 cycles; L1 miss + L2 hit (Pr = pL2), stall for LL2 cycles; L1 miss +
L2 miss + L3 hit (Pr = pL3), inject L3 request packet and stall while waiting
for the reply; L1 miss + L2 miss + L3 miss (Pr = pMC), inject request to
memory and stall while waiting for the reply.]
Figure 4.3: An example Markov chain for an in-order core simulation.
As an example, we will now use a Markov chain to model a very simple
in-order core that executes an instruction every cycle. We will not
distinguish between read and write operations because they are similar.
The complete state machine for this simple case can be seen in Figure 4.3.
In the initial state, “running”, the processor is not generating any
traffic but executing instructions, and therefore increasing throughput. In
every simulation cycle, there is a certain probability that the state machine
advances to the “memory reference” state. We call this probability memory
references per instruction (MPI). Correspondingly, there is the complementary
probability 1 − MPI that the processor continues in the “running” state.
Once the memory reference state is reached, there is a new set of
probabilities for each of the newly reachable states: a hit in L1, a miss in
L1 but a hit in L2, a miss in L2 but a hit in L3, and no hit at all, which
means an access to main memory. These probabilities are provided as inputs to
the Markov chain, as seen in Table 4.3. When accessing L1 or L2, the processor
does not generate any traffic but just stalls for a fixed number of cycles.
When L3 or main memory is accessed, the Markov chain reaches a state where a
packet is injected into the network and the processor stalls until the reply
is received. From
Parameter  Description
MPI        Average number of memory references per instruction (a workload
           parameter).
pL1        Average proportion of hits inside L1 (the hit ratio). This can be
           modeled based on the size and a power law approximation
           (Section 2.2.1).
pL2        Average hit ratio inside L2.
pL3        Average hit ratio inside L3.
LL1        Average L1 latency on a hit (until data is served).
LL2        Average L2 hit latency.
Table 4.3: Input parameters for the Markov chain.
[Figure 4.4: plot of access probability (0% to 100%) against the distance
between core and cache node (0 to 7).]
Figure 4.4: To calculate the targets for memory request traffic, a power law distribution
is used.
those states, the chain goes back to the “running” state with 100% probability.
Listing 4.1 shows pseudocode for how the “running” state is implemented.
Note that, unlike in the model presented here, a core in a real system does
not know whether a memory request will be a hit or a miss before launching
the actual request, and the same holds for the requests to each lower level
of the hierarchy. For example, a core will not be able to know whether
accessing main memory is needed until a request has been sent to L1, L2, and
L3. In our model this is known before the request is sent, for additional
simplicity and because the traffic involved is often negligible, but this
might not be the case depending on the cache coherency protocols used
(Section 2.2.2).
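The decision taken in the “running” state (Listing 4.1) can also be written as a small runnable function. This is only a sketch that mirrors the listing: the function name `running_step` and the returned action tuples are hypothetical, and, as in the listing, pL1, pL2 and pL3 are treated as the probabilities of being served by each level, so that the remainder goes to main memory.

```python
import random


def running_step(mpi, p_l1, p_l2, p_l3, lat_l1, lat_l2, rng):
    """Simulate one cycle in the "running" state; return (action, value)."""
    if rng.random() >= mpi:
        return ("execute", 1)       # non-memory instruction: count it
    r = rng.random()                # decide which level serves the request
    if r < p_l1:
        return ("stall", lat_l1)    # L1 hit: stall for LL1 cycles
    if r < p_l1 + p_l2:
        return ("stall", lat_l2)    # L1 miss, L2 hit: stall for LL2 cycles
    if r < p_l1 + p_l2 + p_l3:
        return ("inject", "L3")     # request packet sent to an L3 cache node
    return ("inject", "MC")         # request packet sent to a memory controller
```

Note that the outcome is drawn before any request is sent, which is exactly the simplification discussed in the previous paragraph.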
One additional aspect needs to be resolved for memory accesses that will
go outside of the core, that is, those to L3 or the main memory controllers: a
given CMP design might have more than one L3 cache node or main memory
controller node, and a core must inject traffic to a specific one. We discussed
    is_memory_access ← RandomBoolean(ProbTRUE = MPI)
    if is_memory_access:
        r ← RandomUniformReal(0...1)
        if r < pL1:
            Duration ← LL1
            StallFor(Duration)
        else if r < pL1 + pL2:
            Duration ← LL2
            StallFor(Duration)
        else if r < pL1 + pL2 + pL3:
            Destination ← SelectRandomL3CacheNodeNear(Me)
            InjectPacket(MEMORY_REQUEST, Destination)
        else:
            Destination ← SelectRandomMemoryControllerNear(Me)
            InjectPacket(MEMORY_REQUEST, Destination)
    else:
        ExecutedInstructionsCount ← ExecutedInstructionsCount + 1
Listing 4.1: Pseudocode for the basic algorithm of the core simulation in the Running
state
in Section 3.1 that the average distance for injected traffic shows a
power-law-like distribution. For this reason, in this work we model cores as
randomly selecting a memory node, but giving higher probability to the nodes
that are logically nearest (Figure 4.4).
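A destination selection of this kind can be sketched as a weighted random choice. The weighting function below, (1 + d) raised to a negative exponent, is an assumption made for illustration (the shift by one avoids a division by zero at distance 0); the exponent, the node names and the function name `select_node_near` are likewise hypothetical and are not taken from Figure 4.4.

```python
import random


def select_node_near(distances, exponent, rng):
    """Pick a node with probability proportional to (1 + distance) ** -exponent."""
    nodes = list(distances)
    weights = [(1 + distances[node]) ** -exponent for node in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


# Hypothetical logical distances (in hops) from one core to three L3 nodes.
distances = {"L3_a": 0, "L3_b": 2, "L3_c": 6}

rng = random.Random(42)
picks = [select_node_near(distances, 2.0, rng) for _ in range(10_000)]
```

Every node keeps a non-zero probability of being selected, but nearer nodes are chosen far more often, which is the locality effect the model intends to capture.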
Out-of-order cores can be modeled by extending the Markov chain in
Figure 4.3 so that the “stall” states can also transition to the memory
request state, with a probability that depends on the MPI, the ILP, and the
current number of pending memory requests. As the number of pending memory
requests increases, so does the probability of entering a “complete stall”,
where no new memory requests are launched or instructions executed. How
many pending memory requests can coexist without entering the
complete stall is both a workload and a core parameter [40].
4.3.2 Simulating the memory hierarchy
Simulating the memory hierarchy includes simulating the L3 cache nodes, the
main memory controllers, and the cache coherency and consistency protocols.
Generally, a memory component can be thought of, from an external point
of view, as a system that listens to incoming memory requests and replies
to them after a certain delay. In the real world, this delay depends on a
very large number of physical parameters as well as on the incoming requests
themselves, as memories are complex elements [41].
However, as in the previous case, this work is not centered on exploring
the inner workings of the memory technologies, but rather the high-level
aspects of the memory hierarchy, such as the number, size and placement of
cache memories. Therefore, full simulation of the memory subsystem, while
possible, was deemed unnecessary.
The delay can be modeled as a stochastic process that depends on a much
lower number of parameters, such as the number of requests the component is
serving at a given time instant, plus a constant base delay that depends on
the construction of the memory chip. Within the scope of this work, however,
a simpler approximation will be used: we will model this delay using a
function that depends only on the size of the cache, as seen in Figure 4.5.
This function is an approximation of real-world memory behavior: larger
capacity implies a longer access time. While this simple model will fail to
account for some of the finer details of memory response time behavior,
it does not incur any performance penalty.
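A size-only latency function of this kind could look like the following sketch. The logarithmic shape follows the description of Figure 4.5, but the function name `hit_latency_cycles` and the constants `base` and `scale` are hypothetical and have not been fitted to the figure.

```python
import math


def hit_latency_cycles(size_mb, base=2.0, scale=2.0):
    """Hypothetical model: latency grows with the logarithm of the cache size."""
    if size_mb < 0:
        raise ValueError("cache size must be non-negative")
    # log2(1 + size) keeps the latency finite and equal to `base` at size 0.
    return base + scale * math.log2(1.0 + size_mb)
```

With these placeholder constants, a 1 MB cache takes 4 cycles and a 3 MB cache takes 6 cycles, so each additional cycle of latency requires a progressively larger increase in capacity.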
[Figure 4.5: plot of hit access latency (in cycles, 0 to 12) against cache
size (0 MB to 14 MB).]
Figure 4.5: The evolution of the cache hit latency is logarithmically proportional to the
cache size.
4.3.3 Simulating the interconnect
For the simulation of the interconnection fabrics, we have selected
BookSim [9]. BookSim is an existing network-on-chip simulator that was created
by the Stanford University Concurrent VLSI Architecture group. Originally
designed based on the techniques proposed in [34], BookSim has since evolved
to support a variety of network topologies, routing algorithms and flow
control configurations. The simulator is freely available under a permissive
license, open for anyone to modify and improve.
Within this project, the BookSim core was heavily modified to enable
support for several additional features that we required in order to explore
the design spaces proposed in this work:
• The use of the core and memory hierarchy simulations proposed in
Section 4.3.1 and Section 4.3.2, replacing BookSim’s existing purely
stochastic traffic simulation.
• The arbitrary combination of topologies in a hierarchical configuration
(Chapter 3).
• The batch means and confidence interval stopping criteria as defined
in Section 4.2.2.
• Simulating buses and multibuses as defined in Section 2.3.1.
BookSim is a steady-state cycle-accurate flit-level simulator (Table 4.2).
This means it accurately models the passage of flits over links, routers and
buffers. In every cycle, CMP components inject packets as determined by
their simulation models. The injected packets are divided into flits depending
on the interconnection settings. After this, flits advance at a speed
configured by the link delay. Routers also proceed through each of the
internal stages described in Section 2.4.4. When a head flit arrives at a
router port, the routing algorithm function is executed to determine the
output port for that packet. Ejected packets are sent to the corresponding
CMP component simulation model.
4.3.4 Estimating power and area
By measuring the instructions executed in the processor simulation, the
throughput of the CMP can be measured. However, there are other very
important attributes of a CMP design we would like to evaluate: power and
area. More detailed simulations could produce very accurate estimates of
power and chip area usage. However, those simulations usually have
prohibitive performance requirements for most explorations, and are usually
left to the final evaluation steps. Specifically, finding the exact area of a
CMP configuration involves finding the optimal placement of all of the
components within it (the floorplan), a computationally very complex problem.
For exploration, the real values can be approximated based on the design
parameters that are deemed most influential. For example, the ORION 2 [42]
power and area model for NoCs approximates the area of the router using a
formula whose inputs include the number of input and output ports of the
router.
Power can be modeled as the sum of static power and dynamic power. Static
power is independent of the activity of the component, and therefore depends
only on design parameters. Dynamic power, on the other hand, depends on the
activity of the component, for example, the number of read and write accesses
in a memory.
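The static/dynamic split can be illustrated with a very small sketch. It assumes that dynamic power is obtained from a fixed energy per access; this is a simplification for illustration, not the formula used by the models cited in this section, and the function and parameter names are hypothetical.

```python
def component_power(static_w, energy_per_access_j, accesses, elapsed_s):
    """Total power in watts: static power plus activity-dependent power."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    # Dynamic power is the energy spent on accesses divided by the elapsed time.
    dynamic_w = energy_per_access_j * accesses / elapsed_s
    return static_w + dynamic_w
```

An idle component (zero accesses) still consumes its static power, while the dynamic term scales with the activity measured during the simulation.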
Within this work, we have used existing models for the estimation of the area
and power of a given CMP configuration. For memories and caches, the
CACTI 5 [43] model is used. For all NoC components, such as links,
routers and buffers, the ORION 2 [42] model is used. For the cores
themselves, a much simpler linear approximation based on the known parameters
of the Intel Atom processor is used, as the level of detail employed for the
simulation of processors is too low for a more accurate approximation.
4.4 Conclusions
There are many possible levels of detail available when simulating a CMP.
Increasing the level of detail means more complexity and less speed, but more
accuracy. In this chapter, the concept of simulation has been explained,
and different levels of simulation detail have been explored for each of the
components of a CMP. The different levels of detail have been evaluated from
the point of view of architectural exploration, selecting the most appropriate
ones for this work. Those are:
• Cores: a stochastic model is used, based on Markov chains. It is
configurable via the IPC, the MPI, and the access probabilities of each
cache level.
• Memories: a simple reactive model is used; memories reply after a delay
that depends on their capacity.
• Interconnection: the interconnection is simulated with a flit-accurate
level of detail.
Chapter 5
Results
In this chapter, different CMP design spaces will be proposed and explored
using the tools developed in this work.
5.1 Experimental conditions
For evaluating the proposed CMP designs, we have used an experimental setup
consisting of the modified BookSim described in Section 4.3.3 as well as the
probabilistic core and memory system models described in Sections 4.3.1 and
4.3.2. In this simulation the user has to provide the average workload
parameters (average number of memory accesses, miss ratios) and the CMP
design (number and ideal IPC of processors, distribution of caches and memory
controllers, and configuration of the interconnect).
5.2 The case for a CMP
We begin by doing a simple simulation to show how a CMP with many simple
cores can have a performance similar to that of a CMP with few but very
powerful cores. For the example, we will compare two cores with the
specifications seen in Table 5.1, which are extrapolated from existing
45nm processors [2].
Using a rough approximation of the total area, we can fit 4 of the
larger cores in a 240 mm² chip. No shared L3 memory will be assumed for
this test. In the same 240 mm² chip, we can fit around 12 of the smaller
cores. Using ideal IPC alone, the chip with smaller cores would execute up
to 12 × 1.5 = 18 instructions per cycle, while the chip with larger cores
would be able to execute up to 4 × 3 = 12 instructions per cycle. However,
simulation shows
            Small core   Large core
IPC         1.5          3
L1          64 KB        64 KB
L2          256 KB       1024 KB
Core size   20 mm²       60 mm²
Table 5.1: Specifications of the core types to compare
[Figure 5.1: bar chart of throughput (0 to 2.5) for the large cores CMP and
the small cores CMP.]
Figure 5.1: Results of the test proposed in Section 5.2
(Figure 5.1) that the CMP composed of the 12 small cores provides 25%
more throughput than the CMP with the 4 large cores.
This result might seem counterintuitive: despite the higher ideal IPC of each
of its cores, the CMP with larger processors turns out to be slower. The fact
that the small cores CMP has more total L1 cache than the large cores one
certainly contributes to this difference, although the main cause is that
stalls are less problematic on the CMP with many cores. While a single core
stalling reduces throughput by 25% in the large cores CMP, such a situation
only reduces the small cores CMP throughput by about 8%. Additionally, our
workload model assumes that each core runs tasks that are mostly independent,
in that one task stalling does not increase the chances of another task
stalling; in a sense, we are assuming that our tasks have infinite TLP.
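The arithmetic behind this comparison can be reproduced with a few lines. This sketch uses only the figures from Table 5.1 and the 240 mm² chip area, and, like the rough approximation above, it ignores the area of the interconnect; the function name `ideal_chip` is hypothetical.

```python
def ideal_chip(chip_area_mm2, core_area_mm2, core_ipc):
    """Return (core count, ideal chip IPC, throughput share of a single core)."""
    count = chip_area_mm2 // core_area_mm2   # how many cores fit in the chip
    ideal_ipc = count * core_ipc             # all cores running without stalls
    share = 1.0 / count                      # throughput lost if one core stalls
    return count, ideal_ipc, share


small = ideal_chip(240, 20, 1.5)   # 12 cores, 18 instructions per cycle
large = ideal_chip(240, 60, 3.0)   # 4 cores, 12 instructions per cycle
```

A single stalled core costs one twelfth (about 8%) of the ideal throughput in the small cores CMP, but one quarter (25%) in the large cores CMP.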
5.3 Memory hierarchy design
One important area of the exploration space is the design of the memory
hierarchy subsystem. This includes the selection of the appropriate size of
caches, the distribution of memories between levels, etc. In
Figure 5.2, it can be seen how the characterization of the workload affects the
[Figure 5.2: 3D plot of throughput [IPC] (0 to 200) against L3 total cache
size [MB] (4 to 256) and workload average MPI [MemRefs/Instruction]
(0.05 to 0.5).]
Figure 5.2: Evolution of the CMP throughput as the workload memory characterization
and L3 cache sizes change.
demands of the CMP on the memory subsystem. The z axis, the workload
MPI, determines how often a core will generate a memory request. The x axis
represents the total L3 cache size, and the y axis the throughput of the entire
CMP as measured by simulation. Larger MPI values characterize workloads
that have more frequent memory accesses and are therefore bound by the
performance of the memory system. On the other hand, smaller MPI values
characterize applications that do not issue as many memory accesses. Such
applications can reach very high throughput even on systems with
underprovisioned memory hierarchies.
It can also be seen in Figure 5.2 that even for memory-bound applications
there is a point (L3 cache size > 128 MB) beyond which the benefit of adding
more cache is negligible. This is because the miss ratio is calculated using
a power law (Section 2.2.1), and such large cache sizes are situated at the
far right of the curve. Since adding more cache increases both power
consumption and area usage, the design space exploration tool can be
configured to find the optimal point given area and power constraints.
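The flattening of the curve can be illustrated with a generic power-law miss ratio. The reference point (a 50% miss ratio at 4 MB) and the exponent 0.5 below are hypothetical values chosen for illustration; they are not the parameters of Section 2.2.1 or of the experiment in Figure 5.2.

```python
def miss_ratio(size_mb, ref_size_mb=4.0, ref_miss=0.5, alpha=0.5):
    """Hypothetical power law: the miss ratio scales as size ** -alpha."""
    if size_mb <= 0:
        raise ValueError("cache size must be positive")
    return min(1.0, ref_miss * (size_mb / ref_size_mb) ** -alpha)


gain_small = miss_ratio(4) - miss_ratio(8)       # benefit of doubling 4 MB
gain_large = miss_ratio(128) - miss_ratio(256)   # benefit of doubling 128 MB
```

With these placeholder values, doubling a 4 MB cache removes about 15 percentage points of misses, while doubling a 128 MB cache removes less than 3, even though the second step costs far more area and power.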
5.4 Hierarchical topologies
In this test, the topologies presented in Chapter 3 will be compared against
each other, using simulation (Figure 5.3). Two different workload
characterizations will be used: the first one will use MPI = 0.2 while the
second one will use MPI = 0.3. The results with MPI = 0.2 will be more
representative of workloads with lower memory requirements, while those with
MPI = 0.3 will represent tasks where the speed of the memory hierarchy,
and therefore of the interconnection, is critical. The topologies compared are
summarized in Table 5.2. Apart from exploring the different topologies, we
will explore different levels of concentration, by altering the balance between
the sizes of the global and local interconnects.
Configuration     Topology                  Global interconnect size   Local interconnect size
Flat Mesh         Mesh                      8x8                        –
HMesh (4/clus)    Mesh of meshes            4x4                        2x2
Bus (4/clus)      Mesh of buses             4x4                        4
URing (4/clus)    Mesh of rings (unidir.)   4x4                        4
BRing (4/clus)    Mesh of rings (bidir.)    4x4                        4
HMesh (16/clus)   Mesh of meshes            2x2                        4x4
Bus (16/clus)     Mesh of buses             2x2                        16
URing (16/clus)   Mesh of rings (unidir.)   2x2                        16
BRing (16/clus)   Mesh of rings (bidir.)    2x2                        16
Table 5.2: Topologies compared in Section 5.4
Apart from throughput, we have also measured the power consumption
of the different interconnect topologies. The results are in Figure 5.4. It can
be seen that the mesh of meshes consumes an especially high amount of power.
Power consumption is strongly correlated with the number of routers having
large numbers of ports, something that meshes have (at a minimum, 5 ports).
Since the throughput benefit of the mesh of meshes is minimal, this does not
seem like an interesting topology. It can be seen, though, how well the mesh
scales, as it keeps the best performance when we increase the number of
components in the local interconnect to 16, beating all other topologies.
For the other topologies there is a very apparent trade-off between energy
and performance: bidirectional rings are the best performing, but also the
most power consuming of the topologies, while buses are the least power
consuming ones, but clearly do not scale well even at 16 components per bus.
[Figure 5.3: bar chart of throughput (0 to 35) for each topology in Table 5.2,
with one series for MPI = 0.2 and one for MPI = 0.3.]
Figure 5.3: Results of the test proposed in Section 5.4
[Figure 5.4: bar chart of interconnect power [W] (0 to 180) for each topology
in Table 5.2.]
Figure 5.4: Power results from Section 5.4
5.5 Conclusions
In this chapter, we have used the simulator developed in this work to compare
a few different CMP designs. We have first found that, with our current
workload model, CMPs composed of many small cores are preferable to CMPs
composed of few large cores. The increases in throughput gained by having
improved memory subsystems have also been estimated, although the increases
get progressively smaller as the total cache memory capacity surpasses a
certain threshold. There is a trade-off between this capacity and the area
and power used by the memory.
Additionally, we have explored the hierarchical topologies that were
proposed in Chapter 3. The results of comparing their performance show
that all hierarchical topologies have much better performance than the
reference 8x8 flat mesh. Hierarchical meshes have a very low performance per
watt ratio when compared to the other topologies due to the large number of
routers, even when compared to the flat mesh. The other three hierarchical
topologies all use less power than the flat mesh. The mesh of bidirectional
rings is the topology with the greatest raw performance, but it also uses
the most power of those three. The mesh of buses uses the least power, but it
also has the worst performance. Unidirectional rings are somewhere in between,
presenting another power/performance trade-off.
While increasing the size of the local interconnect reduces performance
in all cases, power usage is also reduced. Therefore, large clusters might be
interesting from the point of view of power usage, although smaller clusters
are to be preferred when performance is the goal.
Chapter 6
Conclusions
In this work, we have introduced the concepts of chip multiprocessor and
design space exploration. We have thereafter designed and implemented a
simulation tool that is capable of helping a CMP designer explore the very
large design space of all possible CMP configurations. Neither CMP design
exploration nor the use of simulation to tackle this exploration space is a
new topic. However, the simulator that we have implemented in this work is
both fast and flexible. It is nowhere near as fast as an analytical model, but
it can readily be used as a second-stage evaluation tool for CMP designs that
have been preselected by a faster but less accurate first-stage exploration.
Being flexible, the tool can also be adapted to handle very complex CMP
designs and topologies, such as the hierarchical topologies that were
described in Chapter 3.
Additionally, we have used the simulator to explore a few sample design
spaces. For example, it has been used to verify the benefits of hierarchical
networks versus traditional flat topologies (from 25% up to even 50%), and,
among those, the interest of buses as low-power local interconnects for
small-sized clusters.
6.1 Future work
There are several potential directions in which this work could be enhanced:
• Heterogeneous workloads: The simulation tool is perfectly capable of
simulating CMPs with irregular topologies, for example, a hierarchical
topology where some leaves are different from the others, even having
different types of processors in each of the leaves. However, the current
workload model, which assumes that we have an infinite stream
of identical programs that can be fully parallelized without penalty,
[Figure 6.1: block diagrams of two floorplans built from cores (Core #1 to
Core #4), L3 cache memory and routers.]
Figure 6.1: An example of a non-optimal and an optimal floorplan
is not enough to characterize a workload that might exploit heteroge-
neous configurations. A strategy to model those workloads needs to be
researched if the design space is to include heterogeneous CMPs.
• Trace-based core simulation: Despite its inherent slowness, the increased
accuracy that comes from using traces might make this an interesting project.
Difficulties will arise in that it is hard to find traces for systems with
hundreds of cores, so they will have to be extrapolated somehow from
currently existing systems.
• Supporting additional cache coherency and consistency models: It is
the author’s impression that the impact of the traffic generated by
synchronization messages will not be important. Nevertheless, our simplistic
models might fail to account for the full impact of this traffic. For
completeness, more accurate cache models could be plugged in so that this
traffic could be quantified accurately.
• Floorplanning: In order to accurately quantify the area of a given
CMP design, an optimal floorplan (that is, the placement of all components
of a CMP that minimizes the area) needs to be produced (Figure 6.1).
Floorplanning is also usually the next step in traditional CMP design, right
after exploration. In fact, we already intend to start research on this
topic.
Most of the tasks detailed in this section have already been started
or will be researched as part of the ongoing Intel project.
Bibliography
[1] Neil Weste and Kamran Eshraghian. Principles of CMOS VLSI Design
- A System Perspective. Second Edition. Addison-Wesley Publishing
Company, 1993.
[2] Stanford CPU DB. http://cpudb.stanford.edu/.
[3] David W. Wall. Limits of instruction-level parallelism. In David A.
Patterson, editor, Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and Operating Systems,
pages 176–188. ACM Press, 1991.
[4] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Kenneth G. Wil-
son, and Kunyung Chang. The case for a single-chip multiprocessor.
In Proceedings of the Seventh International Conference on Architectural
Support for Programming Languages and Operating Systems, pages 2–
11. ACM Press, 1996.
[5] Torsten Kempf, Gerd Ascheid, and Rainer Leupers. Multiprocessor Sys-
tems on Chip: Design Space Exploration. Springer, Berlin, February
2011.
[6] L. Kleinrock. Queueing systems, vol. I: Theory. Wiley, New York, 1975.
[7] J.Y. Le Boudec and P. Thiran. Network calculus: a theory of determin-
istic queuing systems for the internet. Springer-Verlag, 2001.
[8] Matteo Monchiero, Ramon Canal, and Antonio González.
Power/performance/thermal design-space exploration for multicore architectures.
IEEE Trans. Parallel Distrib. Syst., 19(5):666–681, 2008.
[9] Booksim 2.0. nocs.stanford.edu/booksim.html.
[10] Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, Michael R.
Marty, Min Xu, Alaa R. Alameldeen, Kevin E. Moore, Mark D. Hill, and
David A. Wood. Multifacet’s general execution-driven multiprocessor
simulator (GEMS) toolset. SIGARCH Computer Architecture News,
33(4):92–99, 2005.
[11] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K. Jha. GAR-
NET: A detailed on-chip network model inside a full-system simulator.
In Proceedings of the 2009 IEEE International Symposium on Perfor-
mance Analysis of Systems and Software, pages 33–42. IEEE, 2009.
[12] Noxim: Network-on-chip simulator. http://sourceforge.net/
projects/noxim.
[13] James D. Balfour and William J. Dally. Design tradeoffs for tiled CMP
on-chip networks. In Gregory K. Egan and Yoichi Muraoka, editors,
Proceedings of the 20th Annual International Conference on Supercom-
puting, pages 187–198. ACM, 2006.
[14] Reetuparna Das, Soumya Eachempati, Asit K. Mishra, Narayanan Vi-
jaykrishnan, and Chita R. Das. Design and evaluation of a hierarchical
on-chip interconnect for next-generation CMPs. In Proceedings of the
15th International Conference on High-Performance Computer Archi-
tecture, pages 175–186. IEEE Computer Society, 2009.
[15] Norman P. Jouppi et al. A 300-MHz 115-W 32-b bipolar ECL microprocessor.
IEEE Journal of Solid-state Circuits, 28(11):1152–1166, November
1993.
[16] Norman P. Jouppi and David W. Wall. Available instruction-level par-
allelism for superscalar and superpipelined machines. In Joel S. Emer,
editor, Proceedings of the Third International Conference on Architec-
tural Support for Programming Languages and Operating Systems, pages
272–282. ACM Press, 1989.
[17] Richard M. Karp and Vijaya Ramachandran. A survey of parallel al-
gorithms for shared-memory machines. Technical Report UCB/CSD-
88-408, EECS Department, University of California, Berkeley, March
1988.
[18] J.M. Tendler, J.S. Dodson, J.S. Fields, H. Le, and B. Sinharoy. POWER4
system microarchitecture. IBM Journal of Research and Development,
46(1):5–25, 2002.
[19] Ronald N. Kalla, Balaram Sinharoy, and Joel M. Tendler. IBM Power5
chip: A dual-core multithreaded processor. IEEE Micro, 24(2):40–47,
2004.
[20] Jean-Luc Gaudiot, Jung-Yup Kang, and Won Woo Ro. Techniques to
improve performance beyond pipelining: Superpipelining, superscalar,
and VLIW. Advances in Computers, 63:2–35, 2005.
[21] D. Bhandarkar. RISC architecture trends. In Proceedings of the 5th Ad-
vanced Computer Technology, Reliable Systems and Applications Annual
European Computer Conference, pages 345–352. IEEE, 1991.
[22] G. Hinton, D. Sager, M. Upton, D. Boggs, et al. The microarchitecture
of the Pentium 4 processor. Intel Technology Journal, 2001.
[23] R. M. Tomasulo. An efficient algorithm for exploiting multiple arith-
metic units. IBM Journal of Research and Development, 11(1):25–33,
January 1967.
[24] Joseph A. Fisher. Very long instruction word architectures and the ELI-512.
In Harold W. Lawson Jr., Tilak Agerwala, Hans H. Heilborn, Hideo
Aiso, Lars-Erik Thorelli, Jean-Loup Baer, and Mario Tokoro, editors,
Proceedings of the 10th Annual International Symposium on Computer
Architecture, pages 140–150. ACM, 1983.
[25] Sanjive Agarwala et al. A 600-MHz VLIW DSP. IEEE Journal of
Solid-State Circuits, 37(11):1532–1544, 2002.
[26] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous
multithreading: Maximizing on-chip parallelism. In David A. Patterson,
editor, ISCA, pages 392–403. ACM, June 1995.
[27] Lawrence Spracklen and Santosh G. Abraham. Chip multithreading:
Opportunities and challenges. In Proceedings of the 11th International
Conference on High-Performance Computer Architecture, pages 248–
252. IEEE Computer Society, 2005.
[28] David A. Patterson and John L. Hennessy. Computer Organization and
Design: The Hardware/Software Interface. Morgan Kaufmann Publish-
ers Inc., San Francisco, CA, USA, 4th edition, 2008.
[29] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: impli-
cations of the obvious. SIGARCH Comput. Archit. News, 23(1):20–24,
March 1995.
[30] Alan Jay Smith. Cache memories. ACM Comput. Surv., 14(3):473–530,
1982.
[31] C. K. Chow. Determination of cache’s capacity and its matching storage
hierarchy. IEEE Trans. Comput., 25(2):157–164, February 1976.
[32] Leslie Lamport. How to make a multiprocessor computer that correctly
executes multiprocess programs. IEEE Trans. Computers, 28(9):690–
691, 1979.
[33] Milo M. K. Martin, Mark D. Hill, and David A. Wood. Token coherence:
Decoupling performance and correctness. In Allan Gottlieb and Kai
Li, editors, Proceedings of the 30th Annual International Symposium on
Computer Architecture, pages 182–193. IEEE Computer Society, 2003.
[34] William Dally and Brian Towles. Principles and Practices of Intercon-
nection Networks. Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA, 2003.
[35] Giorgos Passas, Manolis Katevenis, and Dionisios N. Pnevmatikatos.
Crossbar NoCs are scalable beyond 100 nodes. IEEE Trans. on CAD of
Integrated Circuits and Systems, 31(4):573–585, 2012.
[36] Robert E. Shannon. Tests for the verification and validation of computer
simulation models. In Proceedings of the 13th conference on Winter
simulation - Volume 2, WSC ’81, pages 573–577, Piscataway, NJ, USA,
1981. IEEE Press.
[37] Jennifer R. Linton and Catherine M. Harmonosky. A comparison of
selective initialization bias elimination methods. In Jane L. Snowdon
and John M. Charnes, editors, Winter Simulation Conference, pages
1951–1957. ACM, 2002.
[38] Jerry Banks. Handbook of Simulation: Principles, Methodology, Ad-
vances, Applications, and Practice. Wiley-Interscience, September 1998.
[39] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. Markov chains and
Monte Carlo in Practice: Interdisciplinary Statistics. Chapman & Hall/CRC,
1996.
[40] Yuan Chou, Brian Fahs, and Santosh G. Abraham. Microarchitecture
optimizations for exploiting memory-level parallelism. In ISCA, pages
76–89. IEEE Computer Society, 2004.
[41] David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Kathleen
Baynes, Aamer Jaleel, and Bruce Jacob. DRAMsim: a memory system
simulator. SIGARCH Comput. Archit. News, 33(4):100–107, November
2005.
[42] Andrew B. Kahng, Bin Li, Li-Shiuan Peh, and Kambiz Samadi. ORION
2.0: A power-area simulator for interconnection networks. IEEE Trans.
VLSI Syst., 20(1):191–196, 2012.
[43] S. Thoziyoor, N. Muralimanohar, J.H. Ahn, and N.P. Jouppi. CACTI
5.1. HP Labs, Palo Alto, Tech. Rep. HPL-2008-20, 2008.
Glossary
CMP chip multiprocessor.
DSP digital signal processor.
flit flow control digit.
ILP instruction level parallelism.
IPC instructions per cycle.
MPI memory references per instruction.
NoC network-on-chip.
SMT simultaneous multi-threading.
throughput number of executed instructions per second, a common performance
measure unit.
TLP thread level parallelism.
VC virtual channel.
VLIW very long instruction word.
Appendix A
Simulator user manual
The simulator that was built for the evaluation of the hierarchical topologies
is based, as mentioned in Section 4.3.3, on BookSim, an existing NoC
simulator. To avoid duplication of work, most of the frontend of the simulator
behaves very similarly to the original BookSim.
The simulator can be invoked by:
$ booksim config.conf
config.conf must be a text configuration file located in the current working
directory. This is the main simulator configuration file, and it follows the
same syntax as the original BookSim configuration file [9]. Some of the
most common options are explained in Section A.1. However, the CMP to be
explored is described in a completely different way, using an additional
configuration file described in Section A.2. The name of this additional
configuration file is specified in config.conf as follows:
cmp_config = config.cmp
As a shortcut, parameters from the main simulation configuration file can
be overridden when invoking BookSim. For example:
$ booksim config.conf cmp_config=mesh.cmp
or
$ booksim config.conf cmp_config=mesh.cmp sample_period=10
While a simulation is in progress, the simulator prints, after each batch
completes, the current batch number, the estimated throughput, the average
latency of all memory requests in the CMP, and the sum of the injection
rates of all the cores:
Samples = 9,
Average throughput = 6.54344,
Average latency = 16.8002,
Injection rate = 1.24233
Once min_samples batches have been run, the simulator starts calculating
the current 95% confidence interval using the batch means method:
Current 95% Confidence interval for Throughput =
[6.44315, 6.56202]
Half-size = 0.0594353
(Acceptable limit = 0.0650259)
The simulation ends automatically when the stopping criterion is met.
A.1 Main simulation configuration
Flow control options
num_vcs = 1;
Changes the number of virtual channels in use (Section 2.4.3).
vc_buf_size = 8;
The buffer size, in flits. If virtual channels are in use, this is the number of
flits per virtual channel.
Simulation conditions options
sample_period = 10000;
The batch size (Section 4.2.2).
warmup_periods = 2;
The number of cycles (in multiples of the batch size) to discard from the
beginning of the simulation (to avoid initialization bias, Section 4.2.1).
stopping_thres = 0.01;
The maximum relative half-length of the confidence interval, expressed as a
fraction of the interval center, at which the simulation stops. If this value
is x and the confidence interval is (c − l, c + l) (centered at c with
half-length l), the simulation stops as soon as l < xc holds.
min_samples = 10;
The minimal number of batches a simulation (not including warm-up) must
run, even if the stopping threshold is reached.
max_samples = 300;
The absolute maximum number of batches (not including warm-up) the simu-
lation will be run. If the stopping criterion is not reached before this number,
the simulation ends with an error condition.
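The interplay of stopping_thres, min_samples and max_samples can be sketched
in Python as follows. This is a minimal illustration and not the simulator's
code: the function names are hypothetical, and a normal quantile is used as a
stand-in because the exact quantile used by the simulator is not stated here.

```python
import math
import statistics


def confidence_interval(batch_means, confidence=0.95):
    """Return (center, half_length) of the interval built from batch means.

    Assumption: a normal quantile is used for simplicity.
    """
    n = len(batch_means)
    center = statistics.fmean(batch_means)
    std_err = statistics.stdev(batch_means) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return center, z * std_err


def meets_threshold(center, half_length, stopping_thres):
    """Stopping rule l < x * c from the description of stopping_thres."""
    return half_length < stopping_thres * center


def should_stop(batch_means, stopping_thres=0.01, min_samples=10, max_samples=300):
    """Decide whether to stop after the batches collected so far."""
    n = len(batch_means)
    if n < max(min_samples, 2):  # a variance estimate needs two batches
        return False
    center, half_length = confidence_interval(batch_means)
    if meets_threshold(center, half_length, stopping_thres):
        return True
    if n >= max_samples:
        raise RuntimeError("stopping criterion not reached within max_samples")
    return False
```

With the sample output shown earlier, the interval [6.44315, 6.56202] has
center 6.502585 and half-length 0.0594353. The acceptable limit 0.0650259 is
consistent with a threshold of 0.01 times the center, so
meets_threshold(6.502585, 0.0594353, 0.01) holds.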
sim_type = node;
topology = cmp;
routing_function = multiple;
These options are required for the use of the simulator extensions described
in Section 4.3.3.
cmp_ring_subconfig = cmp_ring.conf;
Additional configuration files can be specified for certain topologies. These
files follow the same syntax as the main configuration file, and can override
certain parameters for levels in the hierarchy that use such topology.
A.2 Hierarchical topology description file
The hierarchical topology description file is an additional text file that
describes the CMP design to be evaluated. Its syntax differs from that of
the main configuration file. Here is an example of a 4-core CMP using a
2×2 2D mesh topology.
PARAM L3LatencyDef=11
PARAM MemReplySize=3
PARAM NiDelay=1
MESH Col=2 Row=2 LinkDelay=1 RouterDelay=3
PROC L1Size=0.064 L1Lat=3 L1AccProb=0.7
L2Size=0.256 L2Lat=3 L2AccProb=0.0166667
L3AccProb=0.263233 MMAccProb=0.0201 Ipc=1
Mpi=0.5
PROC L1Size=0.064 L1Lat=3 L1AccProb=0.7
L2Size=0.256 L2Lat=3 L2AccProb=0.0166667
L3AccProb=0.263233 MMAccProb=0.0201 Ipc=1
Mpi=0.5
PROC L1Size=0.064 L1Lat=3 L1AccProb=0.7
L2Size=0.256 L2Lat=3 L2AccProb=0.0166667
L3AccProb=0.263233 MMAccProb=0.0201 Ipc=1
Mpi=0.5
MEM Size=15.497
MEMCTRL Location=North Latency=100
Initially, global configuration parameters are listed, each using a PARAM
statement. Then, the top-level component of the network is described (in
this case, the 2x2 mesh, MESH). After the top-level component, each of the
subcomponents is described.
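As an illustration of this structure, the following Python sketch splits a
description file into statements. It is not the simulator's parser: it assumes
that every statement starts with a keyword, that all remaining tokens have the
form Key=Value, and that line breaks are not significant. The function name and
the returned data layout are hypothetical.

```python
def parse_statements(text):
    """Split a topology description into (keyword, params) pairs.

    A token without '=' opens a new statement; a token with '=' is stored
    as a parameter of the most recently opened statement.
    """
    statements = []
    for token in text.split():
        if "=" in token:
            if not statements:
                raise ValueError("parameter before any keyword: " + token)
            key, value = token.split("=", 1)
            statements[-1][1][key] = value
        else:
            statements.append((token, {}))
    return statements


example = """
PARAM NiDelay=1
MESH Col=2 Row=2 LinkDelay=1 RouterDelay=3
MEMCTRL Location=North Latency=100
"""
for keyword, params in parse_statements(example):
    print(keyword, params)
```

Values are kept as strings; converting them to numbers is left to the consumer
of each component type.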
Types of components
These are all of the component types that can be specified:
MESH Two-dimensional mesh
BUS Bus or multi-bus
BRING Bi-directional ring
URING Uni-directional ring
PROC A core
MEM A shared L3 cache memory node
MEMCTRL A memory controller
NI A network interface node
Global parameters
These parameters are global to the entire CMP. They are specified at the
top of the configuration file.
L3LatencyDef Access latency for all L3 memory nodes.
MemReplySize The size of memory reply packets (in flits).
NiDelay Processing time of the network interfaces (in cycles).
Parameters for MESH
Col, Row Size of the mesh.
LinkDelay Delay (in cycles) to traverse any link in the mesh.
RouterDelay Flit processing time for any router in the network.
Parameters for BUS
CompCnt Size of the bus (number of components).
Buses 1 in case of a bus. The number of parallel links in
the case of a multi-bus.
AccessTime Delay (in cycles) for a packet to traverse the bus.
Parameters for URING, BRING
CompCnt Size of the ring (number of components).
LinkDelay Delay (in cycles) to traverse any link in the ring.
RouterDelay Flit processing time for any router in the ring.
Parameters for PROC
L1Lat Hit latency for the internal private L1.
L2Lat Hit latency for the internal private L2.
Ipc Ideal instructions per cycle.
Mpi Memory references per instruction.
L1AccProb Probability of an L1 hit.
L2AccProb Probability of an L1 miss and L2 hit.
L3AccProb Probability of an L1 miss, L2 miss and L3 hit.
MMAccProb Probability of an L1 miss, L2 miss and L3 miss.
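Because the four access probabilities describe mutually exclusive outcomes of a
memory reference, they are expected to add up to one. A quick Python check with
the values of the example in this section (the helper name and the tolerance
are hypothetical):

```python
import math


def access_probabilities_are_consistent(l1, l2, l3, mm, tolerance=1e-5):
    """Check that L1AccProb + L2AccProb + L3AccProb + MMAccProb is close to 1."""
    return math.isclose(l1 + l2 + l3 + mm, 1.0, abs_tol=tolerance)


# Values taken from the PROC statements of the example description file.
print(access_probabilities_are_consistent(0.7, 0.0166667, 0.263233, 0.0201))
```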
Parameters for MEM
Latency Explicit hit latency for this L3 memory node
(overriding global parameter)
Parameters for MEMCTRL
Location In a mesh, a memory controller can be placed
along one of the sides. This option indicates
which side (West, East, North, South).
Latency The latency of external memory requests.
Appendix B
Analytical performance
modeling
It was mentioned in Section 1.1 that analytical modeling is a faster, but less
accurate, design exploration methodology than simulation. The simulator
developed in this project has already been used to validate an analytical
model that is also oriented towards the design space exploration of CMPs.
This analytical model is around 1,000 times faster, but 10% less accurate,
than the simulator developed in this work. The two tools are complementary:
the simulator fits in the exploration process by producing more accurate
performance evaluations of the configurations that have been preselected by
the analytical model.
A paper describing this project, which was submitted to the 2012 International
Symposium on Networks-on-Chip, is attached here.
Analytical Performance Modeling
of Hierarchical Interconnect Fabrics
Nikita Nikitin, Javier de San Pedro, Josep Carmona and Jordi Cortadella
Universitat Politècnica de Catalunya
Barcelona, Spain
Abstract—The continuous scaling of nanoelectronics is in-
creasing the complexity of chip multiprocessors (CMPs) and
exacerbating the memory wall problem. As CMPs become
more complex, the memory subsystem is organized into more
hierarchical structures to better exploit locality. During the
exploration and design of CMP architectures, it is essential to
efficiently analyze their performance. However, performance is
highly determined by the latency of the memory subsystem,
which in turn has a cyclic dependency with the memory traffic
generated by the cores. This paper proposes a scalable analytical
method to estimate the performance of highly parallel CMPs
(hundreds of cores) with hierarchical interconnect fabrics. The
method can use customizable probabilistic models and solves the
cyclic dependencies by using a fixed-point strategy. The technique
is shown to be a very accurate and efficient strategy when
compared to the results obtained by simulation.
I. INTRODUCTION
The continuous shrinking of CMOS technology has enabled
the integration of multiple cores and distributed memory in
one chip. Parallelism has also been one of the paradigms to
make computations more power efficient. In the last few years,
multicore systems have evolved from having few cores [1], [2]
to single-chip processors with tens or hundreds of computing
units [3]–[5].
Tiled CMPs are an effective approach to architect general-
purpose processors under the intense time-to-market pres-
sure [6], [7]. The replication of tiles provides a rapid way of
floorplanning many computing units in one chip and commu-
nicating them with scalable interconnect fabrics. Figure 1(a)
shows an example of a CMP with 16 tiles, each one including
a computing core (C), two levels of private on-chip caches
(L1, L2), and a router (R) that communicates with the on-chip
interconnection network (a mesh). Two memory controllers
(MC) provide access to the off-chip memory.
To exploit the locality of memory references, hierarchical
interconnects have been proposed [7], [8]. Several cores can be
grouped into one cluster to share the on-chip cache, accessible
through a local interconnect (e.g., bus, crossbar, ring, etc).
Hierarchy increases the intra-cluster hit-ratio and reduces the
traffic in the top-level interconnect. Figure 1(b) shows an
implementation of a CMP with 4 clusters. Each cluster has
two cores with private caches, a shared cache (L3) with tag
directory (DIR), a local interconnect (IC), a router and a
network interface (NI).
Fig. 1: CMP layouts: (a) flat, (b) hierarchical.
Given the vast space of design parameters, CMP designers
are faced with the complex problem of selecting the best
architecture subject to a set of constraints. Many design options
must be explored, such as the variety of core implementations,
interconnect types, topologies, cache hierarchies and memory
management policies. Moreover, the number of configurations
increases drastically as the technology advances, allowing
more cores and memory to fit into the chip area.
Evaluating the performance of a CMP architecture is essential
to make the correct decisions during design. Unfortunately,
simulation imposes a prohibitive computational cost
when the space of design points grows significantly. In this
scenario, analytical modeling becomes an effective alternative
for rapidly pruning the design space during early exploration
and selecting a small set of promising configurations. Along
this line, several analytical models for CMP exploration have
been recently proposed (e.g., [9], [10]).
The fundamental problem in evaluating the performance of
a CMP is the calculation of the latency for memory requests,
given the parameters of the interconnect fabrics and memory
hierarchy. A key phenomenon that is underestimated by the
existing models is the contention effect of the interconnection
fabric. This paper will show that contention has a major
significance in the analysis of CMP performance. Ignoring
contention leads to optimistic latency and throughput measure-
ments, and may overestimate the architectures with saturated
interconnects. As a result, exploration may select inefficient
architectures and, even worse, discard the promising ones.
Accounting for contention is particularly important when
exploring hierarchical CMPs. Interconnects at different levels
of hierarchy may deliver different throughput characteristics
(e.g., a bus at the cluster level and a mesh at the top level). It
is then essential to verify that the required bandwidth between
the cores and the memory is delivered at all levels. This
verification discards architectures in which one level of the
interconnect is saturated, while another remains underutilized
and consumes resources unnecessarily.
The purpose of this work is to emphasize the importance
and provide analytical methods for modeling contention in
CMP exploration frameworks. The contributions of the paper
can be summarized as follows. First, we formulate the cyclic
dependency between the latency and the rate of memory
requests as a system of non-linear equations that models the
contention in the CMP interconnect. Second, we propose three
methods to resolve this model: using a general-purpose solver,
a fixed-point iteration and a bisection method. The last two
methods show significant runtime savings, trading-off accu-
racy and convergence. More importantly, these methods can
be parametrized with any black-box analytical model for the
latency. This makes our strategy flexible to incorporate novel
models for on-chip interconnects. Third, we experimentally
show the application of the methods for CMP exploration and
confirm the importance of evaluating contention.
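To make the cyclic dependency concrete, here is a minimal Python sketch of a
damped fixed-point iteration. It is a generic illustration and not the method
detailed in Section V: the latency function is a toy single-queue stand-in for
a black-box latency model, the request rate uses the form MPI / (1/IPC0 +
MPI · L) derived in Section IV, and all names and numeric values are
assumptions.

```python
def toy_latency(rate, cores=4, static=10.0, service_rate=1.0):
    """Toy black-box latency model: static latency plus one shared queue."""
    load = cores * rate
    if load >= service_rate:
        raise ValueError("interconnect saturated")
    return static + 1.0 / (service_rate - load)


def solve_request_rate(latency_model, ipc0=2.0, mpi=0.5,
                       damping=0.5, tol=1e-9, max_iter=1000):
    """Find a rate such that rate == MPI / (1/IPC0 + MPI * latency(rate))."""
    rate = 0.0
    for _ in range(max_iter):
        target = mpi / (1.0 / ipc0 + mpi * latency_model(rate))
        new_rate = rate + damping * (target - rate)  # damped update
        if abs(new_rate - rate) < tol:
            return new_rate
        rate = new_rate
    raise RuntimeError("fixed-point iteration did not converge")


rate = solve_request_rate(toy_latency)
print(rate, toy_latency(rate))
```

Because the latency model is passed in as a function, any other model can be
substituted without changing the iteration itself.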
The next section describes a simple example to emphasize the
importance of contention in performance evaluation of hier-
archical CMPs. Section III reports related work. The models
for CMP throughput, memory latency, traffic and their inter-
dependency are discussed in Section IV. In Section V, we
propose three methods to resolve this dependency. The exper-
imental evaluation of the methods is described in Section VI.
II. THE IMPORTANCE OF CONTENTION: AN EXAMPLE
Consider a CMP with 48 cores and 16 shared on-chip
cache modules. Figure 2 presents three (of the many) possible
architectures with such parameters. One of the architectures
has an 8×8 structure of regular tiles connected with a mesh
(Figure 2(a)). The cores and caches are shown as light and
dark squares, respectively. Solid lines represent the mesh links.
To take advantage of the locality of memory accesses,
several cores and caches can be grouped in a cluster and
communicate via the local interconnect. For instance, clusters
with bus interconnects were shown to notably improve the
average communication latency [8]. This fact encourages the
exploration of hierarchical interconnects. Figure 2(b) describes
the CMP organization with 16 clusters, each one having three
cores, one cache and a shared bus. The clusters communicate
via the top-level 4×4 mesh. Another option is to increase the
cluster size up to 16 components (12 cores and 4 caches) and
decrease the dimensions of the top-level mesh (Figure 2(c)).
One of the problems of architectural exploration is to select
the configuration with the best performance. We first estimated
the throughput of each configuration using only the static (hop-
count) latency of the network, i.e., assuming no contention.
In this experiment we assumed the ideal throughput of cores
(under the assumption of zero-latency memory) to be 2.0
IPC (instructions per cycle), and the number of memory
references per instruction to be 0.5. The values of static
latency (in cycles) and the estimated throughput (in IPC) are
displayed in the columns L_est and θ_est of Table I. Hierarchical
architectures show a higher performance due to the exploited
locality: configuration (c) has the largest size of local cache
(per cluster), hence the increased local hit ratio. Therefore, (c)
shows the highest estimated throughput.
However, this conclusion is incorrect when network contention
is taken into account. Simulation reveals rather distinct
performance numbers, reported in the L_sim and θ_sim columns
of Table I. For configurations (a) and (b), the estimated
throughput with no contention is close to the one reported
by simulation. However, the performance of configuration (c)
drops by about 40%. In fact, simulation concludes that (c) is
the worst in terms of performance.
Fig. 2: Possible architectures for a 48-core CMP: (a) 8×8 mesh,
(b) 4×4 mesh with bus clusters, (c) 2×2 mesh with bus clusters.
TABLE I: Performance of architectures in Figure 2.
Architecture   L_est   θ_est   L_sim   θ_sim
(a)            11.17   8.23    11.26   8.16
(b)            10.12   9.04    10.40   8.81
(c)             9.95   9.19    16.69   5.58
The reason for this significant discrepancy is fabric
contention, which occurs because memory requests compete
for the shared resources of the interconnect.
This results in longer latencies, decreasing the overall
performance of the system. In this example, configuration (b), which
incorporates hierarchy to some extent, is the one with the
highest throughput and represents the best architectural
trade-off between cache locality and communication parallelism.
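The comparison can be reproduced with a few lines of Python that use only the
throughput values of Table I. The variable and function names are hypothetical.

```python
# (estimated throughput, simulated throughput) in IPC, copied from Table I.
table = {"a": (8.23, 8.16), "b": (9.04, 8.81), "c": (9.19, 5.58)}


def relative_drop(estimated, simulated):
    """Fraction of the contention-free estimate that is lost in simulation."""
    return (estimated - simulated) / estimated


for name, (est, sim) in table.items():
    print(f"({name}) drop = {relative_drop(est, sim):.1%}")

# Configuration with the highest simulated throughput.
best = max(table, key=lambda name: table[name][1])
print("best simulated configuration:", best)
```

The drop stays below 3% for (a) and (b) and is about 39% for (c), so the
ranking by simulated throughput places (b) first.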
III. RELATED WORK
The topic of CMP design space exploration has been widely
studied in recent years. Many simulation-based frameworks
(e.g. [11], [12]) appeared to extensively investigate the param-
eters of core architectures, memory hierarchies and emphasize
the importance of their joint optimization for improving the
power, performance and thermal characteristics.
Analytical models aim at replacing costly simulations and
provide instead a quick insight on the architectures. How-
ever, the modeling techniques in the literature significantly
underestimate the performance degradation caused by the con-
tention in the communication fabric. The model in [9] studies
the trade-off between the number of cores and the on-chip
memory size for throughput optimization. The latency model
used includes a contention penalty with linear dependency
on the number of cores. Still, apart from being inaccurate,
this approximation does not allow comparing interconnects
with various parameters and topologies. In [10] the authors
introduce an energy-performance analytical model for CMP
architectures; however, they only consider bus interconnects
with a simplified contention model. The work in [13] analyzes
finite cache penalties in memory hierarchies, but the intercon-
nects are also restricted to buses.
In this paper we propose a generic method for analyti-
cal modeling of contention in hierarchical interconnect fab-
rics. The advantages of hierarchical topologies for many-
core CMPs have been demonstrated in [7], [8]. Our method
can be parametrized by an arbitrary latency model for on-
chip interconnect. This paper discusses the application of the
latency model in [14] due to its flexibility. Other models can be
considered, such as [15]. It introduces an accurate model for
heterogeneous NoCs that can be useful for modeling variable
number of virtual channels and link capacities on the different
levels of the hierarchical interconnect. In [16], an approach
similar to [14] is proposed, offering an accurate backpressure
analysis at the cost of the model efficiency. To overcome the
limitations of queueing approaches, alternative latency models
(e.g. using non-stationary traffic analysis [17]) can be used.
IV. CMP PERFORMANCE MODELING
This section introduces the models for the evaluation of
CMP performance. First, we explain the assumptions and
input parameters of the model. Next, the equation to model
static latency is presented. This equation is then extended to
consider the contention component of communication. Finally,
the throughput model is discussed and the formula for memory
request rate is derived. The section concludes by emphasizing
the cyclic dependency between memory traffic and latency.
A. Assumptions and input parameters of the analytical model
In this paper we focus on systems with two-level hierarchi-
cal interconnect fabrics. However, the approach can be applied
for an arbitrary number of hierarchical levels, including the
particular case of flat interconnects. Several components are
grouped into a cluster: cores, components of the memory sub-
system and the local interconnect. The top-level interconnect
provides communication between the clusters and access to
the off-chip memory (Figure 1(b)).
The system has in total N cores, each one with two
user-defined parameters. IPC0 is the ideal core throughput,
i.e., the amount of instructions executed by the core in one
cycle, assuming zero-latency memory. MPI is the average
number of memory references generated per instruction. These
parameters characterize both the core and the workload.
Without loss of generality, we assume that the memory sub-
system has four hierarchy levels. Every core has a private L1
cache and possibly, a private L2 cache of larger size but higher
latency. The clusters incorporate modules of a distributed L3
cache, shared by all cores. The off-chip memory is accessible
via a set of memory controllers. The latencies of the caches
and the off-chip memory are provided as parameters.
We use the term memory flow to denote a feasible com-
munication between a core and a component of the memory
subsystem. For example, each core may access its own L1 or
L2 caches, or any of the L3 modules or MCs. The set of all
possible memory flows for core c is denoted as F (c).
Every flow $f \in F(c)$ is realizable with probability $p_f$, that
defines the probability for c to request data from a certain
memory component. These probabilities can be user-defined
or calculated with some analytical model. In our work we
calculate these probabilities using a model of cache miss
behavior based on a power law that represents the dependency
between miss ratio and cache size. This model was proven to
be a good approximation [18]:
$$ \mathrm{Miss}(S) = \kappa S^{-\alpha}, \tag{1} $$
where S is the cache size, and κ, α are the model parameters.
Since L3 is a distributed cache, its access latency depends
on the cluster where the requested data is stored. In this work
we assume the probability to find the data in a particular
cluster to be inversely proportional to the distance between
the requesting core and the cluster. However, our method can
be parameterized with any other model for distributed cache.
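A minimal Python sketch of the inverse-distance assumption follows. How a
distance of zero (data in the requesting core's own cluster) is handled is not
stated here, so the sketch simply requires strictly positive distances; the
function name and the example distances are hypothetical.

```python
def l3_cluster_probabilities(distances):
    """Normalize weights 1/distance into probabilities over the clusters.

    distances: mapping cluster id -> strictly positive distance from the core.
    """
    if any(d <= 0 for d in distances.values()):
        raise ValueError("distances must be strictly positive")
    weights = {cluster: 1.0 / d for cluster, d in distances.items()}
    total = sum(weights.values())
    return {cluster: w / total for cluster, w in weights.items()}


probs = l3_cluster_probabilities({"A": 1, "B": 2, "C": 4})
print(probs)
```

A cluster at twice the distance receives half the probability, and the
probabilities sum to one by construction.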
B. Static latency
In this section we describe how to calculate the average
static latency of memory accesses for a core c in the presence
of memory hierarchy. Given the probability $p_f$ for each flow
$f \in F(c)$ and its latency $L_f$, the static latency $L^{st}_c$ is:
$$ L^{st}_c = \sum_{f \in F(c)} p_f L_f. \tag{2} $$
Since requests to L3 and MC are sent via the communication
fabric, its delay must also be considered. This delay represents
the latency of the on-chip network traversal and is defined
using the routing function $R : f \to \pi(f)$, that for any flow
f returns its routing path $\pi(f)$. In this work we consider the
XY-routing function [19]; however, any deterministic or even
adaptive routing can be used, specifying the probabilities for
certain paths. The total latency to access an L3 instance is
the sum of the network traversal latency along the path $\pi(f)$
and the L3 latency. The total latency of the off-chip memory
accesses is calculated likewise.
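Equation (2) amounts to a probability-weighted sum, as the following Python
sketch shows. The flow list is purely illustrative: the probabilities, the
cache latencies and the network traversal latencies are assumed values, and
the function name is hypothetical.

```python
def static_latency(flows):
    """Equation (2): sum of p_f * L_f over the flows of one core.

    flows: iterable of (probability, latency_in_cycles) pairs.
    """
    return sum(p * latency for p, latency in flows)


# Assumed flows of one core. For L3 and MC the latency is the component
# latency plus an assumed network traversal latency along the routing path.
flows = [
    (0.70, 3),         # private L1
    (0.10, 6),         # private L2
    (0.18, 11 + 8),    # one L3 instance: L3 latency + network traversal
    (0.02, 100 + 12),  # one memory controller: memory latency + traversal
]
print(static_latency(flows))
```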
The flow probabilities $p_f$ are obtained using the dependency
of miss ratio on the cache size, Miss(S), given by (1). Assuming
the sizes $S_{L1}$, $S_{L2}$ of the two low-level caches, the
probabilities to access them are:
$$ p_{L1} = 1 - \mathrm{Miss}(S_{L1}), $$
$$ p_{L2} = (1 - p_{L1})(1 - \mathrm{Miss}(S_{L2})). $$
As L3 is shared, the miss ratio is defined by the effective
L3 size, $S^{eff}_{L3}$, seen by each core [20]. To estimate $S^{eff}_{L3}$ we use
the concept of the average number of cores sharing each line,
as proposed by [20]. The probability to access L3 is then:
$$ p_{L3} = (1 - p_{L1})(1 - p_{L2})(1 - \mathrm{Miss}(S^{eff}_{L3})). $$
Finally, $p_{L3}$ should be multiplied by the probability to find
the data in a particular L3 instance, that is an input parameter.
A similar strategy is used to calculate the probabilities of flows
to every memory controller.
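The expressions above can be transcribed into Python as follows. The sketch
copies each formula term by term from the text; the function names and the
values chosen for κ, α and the cache sizes are hypothetical and serve only to
exercise the code.

```python
def miss_ratio(size, kappa, alpha):
    """Power-law miss model of equation (1): Miss(S) = kappa * S**(-alpha)."""
    return kappa * size ** (-alpha)


def flow_probabilities(s_l1, s_l2, s_l3_eff, kappa, alpha):
    """Return (p_L1, p_L2, p_L3), transcribed from the expressions in the text."""
    p_l1 = 1.0 - miss_ratio(s_l1, kappa, alpha)
    p_l2 = (1.0 - p_l1) * (1.0 - miss_ratio(s_l2, kappa, alpha))
    p_l3 = (1.0 - p_l1) * (1.0 - p_l2) * (1.0 - miss_ratio(s_l3_eff, kappa, alpha))
    return p_l1, p_l2, p_l3


# Assumed parameters: kappa = 0.5, alpha = 1, sizes in arbitrary units.
print(flow_probabilities(1.0, 2.0, 5.0, kappa=0.5, alpha=1.0))
```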
C. Queueing model for the on-chip interconnect
Equation (2) describes the static latency of memory accesses.
Another important part of the communication delay is
the dynamic or contention latency [19]. Contention happens
in the interconnect fabrics when several packets compete for
the same shared resource, such as a bus or an NoC link.
This results in additional delays experienced by packets in
the buffers distributed over the on-chip interconnect. One of
the approaches to estimate the contention delays is to model
the CMP as a system of queues and apply queueing theory to
calculate the buffer delays.
Fig. 3: Queueing representation for (a) mesh NoC and (b) bus-based cluster.
Figure 3(a) shows the queueing representation of the top-
level mesh interconnect. The mesh routers (R) have up to five
input-buffered ports to store incoming flits while the router
is busy. The primary ports of the routers are connected to
the clusters of devices (CL), which in case of a flat CMP
organization may consist of one device (e.g. a core with private
caches in Figure 1(a)). Figure 3(b) presents the queueing
model of a cluster, corresponding to one tile of the hierarchical
CMP depicted in Figure 1(b). The cluster consists of five
devices, communicating via a shared bus: two cores with
private caches, an instance of an L3 shared cache, a directory
and a network interface. Every device has a buffer to store
the requests to the bus. To distribute the off-chip memory
traffic uniformly over the mesh and avoid high contention
of certain routers, we assume that memory controllers have
multiple connections to the mesh, as shown in Figure 3(a).
D. Total latency
The average total latency for core c, $L_c$, is obtained by
adding the queue delays along the communication paths,
denoted as $w_q$, to the static latency. Hence, given the paths
$\pi(f)$ for every flow f, we extend equation (2) accordingly:
$$ L_c = L^{st}_c + \sum_{f \in F(c)} p_f \sum_{q \in \pi(f)} w_q. \tag{3} $$
To find the values for $w_q$, an analytical model for the on-chip
interconnects can be used. In this work we apply the model
from [14], which permits calculating the delays for a variety
of topologies. Given the vector of injection rates into the
interconnect, $\bar{\lambda} \in \mathbb{R}^N$, the model proposes to express queue
delays in the form of a system of equations with a matrix W:
$$ \bar{w}_q = W(\bar{\lambda}). \tag{4} $$
The exact form of the matrix W is given by the expressions
(5) and (18) in [14]. What remains is to compute the
injection rates $\bar{\lambda}$, which is covered in the next sections.
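Equation (3) can be sketched in Python as shown below. The queue delays would
come from the model of [14]; here they are passed in as a plain dictionary
with assumed values, and the queue identifiers, paths and probabilities are
hypothetical.

```python
def total_latency(static_latency, flows, queue_delay):
    """Equation (3): static latency plus probability-weighted queue delays.

    flows: iterable of (probability, path), where path lists the queue ids
           traversed by the flow (empty for private cache accesses).
    queue_delay: mapping queue id -> average waiting time w_q in cycles.
    """
    contention = sum(
        p * sum(queue_delay[q] for q in path) for p, path in flows
    )
    return static_latency + contention


# Assumed queue delays and paths for one core.
queue_delay = {"bus": 2.0, "r0": 0.5, "r1": 0.25, "mc": 4.0}
flows = [
    (0.70, []),                    # L1: no interconnect queues
    (0.10, []),                    # L2: no interconnect queues
    (0.18, ["bus", "r0", "r1"]),   # remote L3 instance
    (0.02, ["bus", "r0", "mc"]),   # memory controller
]
print(total_latency(8.36, flows, queue_delay))
```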
E. Throughput model
The throughput of a CMP and the traffic of its interconnect
are closely related. To derive the exact dependencies, we start
with the performance model for a single core, given in [21].
For a core with the average rate of accesses to remote memory
(RemRate), and the cost of an access (RemCost), the average
number of cycles for executing an instruction, CPI, is:
CPI = CPI0 + RemRate · RemCost, (5)
where CPI0 = 1/IPC0 is the ideal CPI, derived under the
assumption of zero-latency memory. For a single-threaded in-
order core, the cost of a remote access is the average latency,
given by (3), and the remote rate is given by the MPI value.
As throughput is typically measured in IPC, the reciprocal of
CPI, from (5) we obtain:
\theta_c = \frac{1}{CPI} = \frac{1}{\frac{1}{IPC_0} + MPI \cdot L_c}. (6)
The throughput of a CMP, θcmp, is then calculated as the total
performance of individual cores:
\theta_{cmp} = \sum_{c} \theta_c. (7)
The rate of memory accesses, λc, is the probability for a core
to issue a remote memory request per cycle. λc is proportional
to the core throughput and the MPI:
\lambda_c = \theta_c \cdot MPI = \frac{MPI}{\frac{1}{IPC_0} + MPI \cdot L_c}. (8)
This equation can be extended for the case of more complex
out-of-order and multithreaded cores. The difference in model-
ing an out-of-order core is that the remote memory access does
not force the core to stall, hence the effective remote latency
Lc decreases [21]. Techniques discussing throughput modeling
for out-of-order implementations can be found in [22]. A
multithreaded core can be modeled as a group of single-
threaded cores. The latency for each thread remains Lc, but
the total memory access rate becomes \lambda_c^{mt} = M \lambda_c, where M
is the number of threads.
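The throughput relations (6), (7) and (8), together with the multithreaded scaling by M, can be written down directly. The sketch below is only a restatement of those formulas; the function names and the sample values are hypothetical.

```python
def core_throughput(ipc0, mpi, latency):
    """Equation (6): IPC of a single-threaded in-order core."""
    return 1.0 / (1.0 / ipc0 + mpi * latency)


def request_rate(ipc0, mpi, latency, threads=1):
    """Equation (8), multiplied by the thread count M for a multithreaded core."""
    return threads * mpi * core_throughput(ipc0, mpi, latency)


def cmp_throughput(cores):
    """Equation (7): cores is an iterable of (ipc0, mpi, latency) tuples."""
    return sum(core_throughput(ipc0, mpi, lat) for ipc0, mpi, lat in cores)


# Hypothetical core: IPC0 = 2, one miss per 100 instructions, 50-cycle latency.
theta = core_throughput(2.0, 0.01, 50.0)            # 1 / (0.5 + 0.5)
lam = request_rate(2.0, 0.01, 50.0)                 # requests per cycle
lam_mt = request_rate(2.0, 0.01, 50.0, threads=4)   # four threads
theta_cmp = cmp_throughput([(2.0, 0.01, 50.0)] * 4)
```

Note how a longer latency lowers both the throughput and the request rate, which is the source of the cyclic dependency discussed next.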
F. The cyclic dependency between memory latency and traffic
In order to calculate the buffer delays, equation (4) requires
the injection rates at every input (source) of the interconnect,
while equation (8) gives the rates of request generation per
core. Note that the injection rates in a flat interconnect are
directly defined by the core rates: for a CMP with N cores,
λ¯ = {λ1, .., λN}. In case of a hierarchical interconnect fabric,
the core rates will correspond to the injection rates at the
sources of the cluster-level interconnects, such as the bus in
Figure 3(b). The injection rates to the top-level mesh can be
calculated, given the fraction of inter-cluster traffic. The latter
is defined by the probabilities of access to the L3 and the off-
chip memory, discussed in section IV-B. Below we directly
consider the dependency of memory latency on the core rates.
From (3), (4) and (8) we observe the following system of
dependencies:
\forall c = 1, \dots, N: \quad \begin{cases} L_c = L(\lambda_1, \dots, \lambda_N) \\ \lambda_c = \lambda(L_c) \end{cases} \quad (9)
This result is quite intuitive: the latency of the memory
requests traversing the interconnect depends on the rate of sent
requests, due to the network contention. In turn, the request
rate is determined by the latency, as no new memory requests
are issued if the execution of cores stalls due to the absence
of data. System (9) emphasizes the cyclic dependency between
the latency and rate of memory requests. In the following
section we describe the methods to resolve this dependency.
V. ANALYTICAL METHODS FOR LATENCY ESTIMATION
In this section we propose three methods that can be used
to resolve the dependency (9). Apart from the straightforward
way to solve the system of equations, the fixed-point iteration
and bisection methods are discussed. The benefit of the fixed-
point method is that it delivers the exact solution in case of
convergence and can be applied to arbitrary configurations.
The bisection always converges for our problem, but finds an
approximate solution. However, it was found to be a good ap-
proximation for tiled homogeneous CMPs (see Section V-C).
A. Solving the system of nonlinear equations
Given an analytical model for the interconnect latency
as a closed form, the straightforward way to find Lc is to
solve the system of nonlinear equations. We apply the model
from [14], which offers a convenient definition of queue delays
via the injection rates in the closed form (4). Hence, the
dependencies (3), (4) and (8) create a system of equations
with respect to the vectors of variables L¯, λ¯, and w¯q .
The system will always contain nonlinear equations because
of (8). The solution to a nonlinear system can be found using
a solver, such as MATLAB [23]. Although this is hard to prove
analytically, we conjecture that the system has a unique
solution. We checked this empirically by running MATLAB with different
initial vectors and observing convergence to the same solution.
While this method is straightforward, it has two important
drawbacks. First, it only works for the analytical models
that provide closed-form equations for latency. Second, unless
properly tuned, the methods for solving general systems of
nonlinear equations may exhibit poor performance. Since our
objective is to apply the method for exploration of a large
design space, this limitation is critical. Nevertheless, this
method is useful for validating the techniques described below.
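For intuition only, the closed-form approach can be illustrated on a toy case that is not the model of [14]: a single core whose latency grows linearly with its own request rate, L(λ) = L_st + k·λ, where the slope k is a hypothetical contention coefficient. Substituting into (8) gives the quadratic MPI·k·λ² + (1/IPC0 + MPI·L_st)·λ − MPI = 0, whose positive root is the solution.

```python
import math


def solve_single_core(ipc0, mpi, l_static, k):
    """Solve lambda = MPI / (1/IPC0 + MPI * (L_st + k * lambda)) in closed form.

    Assumes a hypothetical linear contention model with slope k > 0.
    Returns (request rate, latency).
    """
    a = mpi * k
    b = 1.0 / ipc0 + mpi * l_static
    c = -mpi
    lam = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # positive root
    return lam, l_static + k * lam


lam, lat = solve_single_core(ipc0=2.0, mpi=0.02, l_static=20.0, k=100.0)
# Residual of equation (8) at the solution; it should be numerically zero.
residual = lam - 0.02 / (1.0 / 2.0 + 0.02 * lat)
```

Realistic systems have thousands of coupled unknowns (see Table II), which is why a general-purpose solver is needed there.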
B. Fixed-point iteration
The algorithm proposed in this section is a popular numeri-
cal method for solving the systems of nonlinear equations [24].
While the theoretical speed of convergence of this method is
relatively slow, it performs well in practice due to its low cost
for a single iteration. Given a system of equations in the form:
x¯ = F (x¯), (10)
where x¯ is the vector of unknowns and F is the system matrix,
and an initial guess x¯0, the following iterative procedure can
be used to find a solution x¯∗ (fixed point) of the system:
\bar{x}_{n+1} = F(\bar{x}_n), \quad n \ge 0.
In our setting, x¯ is composed of the variables {L¯, λ¯,
w¯q} and matrix F is defined by the right-hand terms of the
equations (3), (4) and (8). For the initial guess, x¯0, we use
static latencies (2) and compute other values using the same
equations: L¯0 = L¯st, λ¯0 = λ(L¯0), w¯q,0 = W (λ¯0).
The benefit of the proposed method is that it does not require
closed-form analytical expressions for latencies. Furthermore,
any black-box model for the dependency of NoC latency on
injection rate can be used. The method hence maintains the
modular structure of hierarchical interconnects and permits
plugging independent models for different topologies, such
as bus (cluster-level) and mesh (top-level). This makes the
approach a valid tool for future interconnect modeling.
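A minimal sketch of this iteration is given below. It treats the interconnect as a black-box function `latency_model(rates)` returning one latency per core; that interface, the stopping rule and the toy linear contention model in the usage example are assumptions of this sketch, not the authors' implementation.

```python
def fixed_point(ipc0, mpi, l_static, latency_model, tol=1e-9, max_iter=1000):
    """Alternate between equation (8) and a black-box latency model.

    ipc0, mpi, l_static -- per-core lists
    latency_model       -- callable mapping a list of rates to a list of latencies
    Returns (latencies, rates); raises RuntimeError if it does not converge.
    """
    lat = list(l_static)  # initial guess: static latencies
    for _ in range(max_iter):
        # Equation (8): request rate of every core for the current latencies.
        rates = [m / (1.0 / i + m * l) for i, m, l in zip(ipc0, mpi, lat)]
        new_lat = latency_model(rates)
        if max(abs(a - b) for a, b in zip(new_lat, lat)) < tol:
            return new_lat, rates
        lat = new_lat
    raise RuntimeError("fixed-point iteration did not converge")


# Toy model (hypothetical): two cores share one resource with linear contention.
def toy_model(rates):
    return [20.0 + 50.0 * sum(rates)] * len(rates)


lat, rates = fixed_point([2.0, 2.0], [0.02, 0.02], [20.0, 20.0], toy_model)
```

Because `latency_model` is only ever called, never inspected, a bus model and a mesh model could be composed inside it without changing the loop.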
As a numerical method, fixed-point iteration is subject
to convergence issues. For a system in the form (10), the
sufficient condition for convergence is [24]:
\sum_i \left| \frac{\partial F}{\partial x_i} \right| < 1.
In our case, this requires the latency to grow slowly with the
injection rate, and vice versa. This condition holds for the
communication fabrics that perform far from their saturation
throughput (for instance, see chapter 23 in [19]). Although this
condition is quite strong, it is not necessary for convergence.
In practice we observe that for the majority of configurations
the iterative procedure converges.
A second issue of the fixed-point iteration is due to the
analytical models based on queueing theory: the queueing
models work under the assumption that the system is in
steady state [25]. This means that for any router with service
time T and total arrival rate λ at its inputs, the following
condition must hold: λT < 1. In other words, there should
be no packet accumulation in the input queues of the router.
Unfortunately, this requirement may not be satisfied by the
initial solution. From (8) we know that the latency Lc and the
memory access rate λc are inversely related. Since the static
latency is taken as the initial value of Lc, the latency may be highly
underestimated for configurations with high contention. As
a result, the initial value of λc will be overestimated and may
violate the steady-state condition.
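The steady-state requirement λT < 1 is simple to test before a queueing model is evaluated. The helper below is a hypothetical sketch of such a check; the router with four inputs and a two-cycle service time is made up for illustration.

```python
def violates_steady_state(arrival_rates, service_time):
    """True if the summed arrival rate times the service time reaches 1."""
    return sum(arrival_rates) * service_time >= 1.0


# Hypothetical router with four inputs and a service time of 2 cycles.
ok_case = violates_steady_state([0.10, 0.10, 0.10, 0.10], 2.0)   # utilization 0.8
bad_case = violates_steady_state([0.15, 0.15, 0.15, 0.15], 2.0)  # utilization 1.2
```

In the second case the rates are too optimistic, which is the situation that can arise when they are derived from the static latency alone.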
To handle this situation, as well as the configurations
for which the fixed-point iteration diverges, we propose a
method based on a bisection search over λc, which finds a reasonable
and fast approximation to the solution.
C. Bisection search for traffic rate
The advantage of the bisection method is that it always
converges for our model (due to the intermediate value
theorem [24]). Since every core generates traffic at a certain rate,
λc, multidimensional bisection [26] can be applied to find the
exact rates. However, a good approximation to the exact rates
can be obtained with the less complex unidimensional
bisection: in simulation we observed that tiled CMPs with
uniform clusters tend to have traffic rates that change
proportionally to their estimates obtained with the static latency.
Hence, we initialize the vector of injection rates λ¯ with the
values estimated with the static latency, and on every bisection step
adjust all rates in the same proportion.
To introduce the bisection more formally, let us rewrite
equation (8) by isolating Lc, and using the star symbol to
distinguish from the latency in (3):
L_c^*(\lambda_c) = \frac{1}{\lambda_c} - \frac{1}{MPI \cdot IPC_0}. (11)
From (3) and (11) we define the average latencies L(λ¯) and
L∗(λ¯) as the functions of the vector λ¯:
L(\bar{\lambda}) = \frac{1}{N} \sum_{c=1}^{N} L_c(\bar{\lambda}), \qquad L^*(\bar{\lambda}) = \frac{1}{N} \sum_{c=1}^{N} L_c^*(\lambda_c).
Finally we introduce the latency difference function, F(λ¯):
F (λ¯) = L(λ¯)− L∗(λ¯).
Figure 4 shows the typical behavior of these functions,
emphasizing the cyclic dependency (9). To depict a 2D view
of this behavior, we plot L(λ¯) and L∗(λ¯) as a function of the
average rate \Lambda = \frac{1}{N} \sum_{c=1}^{N} \lambda_c. The curve L∗(λ¯) shows that
the average rate of memory requests increases as the latency
decreases. In turn, L(λ¯) shows that the average latency of
requests grows with the injection rate. The real values for
latency and traffic are defined by the intersection point A of
these curves, that can be found as a root of F (λ¯). Hence,
we use the bisection as a root-finding method, that does not
require the exact knowledge of the function F (λ¯) and can be
used with any black-box analytical model for latency.
Bisection searches for λ¯ that satisfies the condition
|F(λ¯)| < ε, where ε is the solution tolerance. The initial range
for λ¯ is bounded by the traffic obtained with the static latency:
λ¯_min = 0¯, λ¯_max = λ¯(L_c^{st}). Assuming proportional
variation of the individual components of λ¯, all components
are updated simultaneously. For any pair of consecutive
iterations i and i + 1, either λ¯_min^{i+1} = λ¯^i when F(λ¯^i) < 0, or
λ¯_max^{i+1} = λ¯^i when F(λ¯^i) > 0. The iteration continues until
the required tolerance for F(λ¯) or λ¯ is met [24].
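The procedure can be sketched as a one-dimensional search over a common scale factor applied to the rates obtained with the static latency. This is an illustrative reading of the description above, not the authors' code: the `latency_model` interface, the scale-factor formulation and the toy contention model are assumptions, and the black box is expected to return finite latencies for every rate vector it is given.

```python
def bisection(ipc0, mpi, l_static, latency_model, eps=1e-6, max_iter=200):
    """Find rates where the average L (network) equals the average L* (eq. 11)."""
    n = len(ipc0)
    # Upper bound of the search range: rates obtained with the static latency.
    lam_max = [m / (1.0 / i + m * l) for i, m, l in zip(ipc0, mpi, l_static)]
    lo, hi = 0.0, 1.0  # common scale factor applied to every component
    for _ in range(max_iter):
        s = 0.5 * (lo + hi)
        rates = [s * r for r in lam_max]
        l_net = sum(latency_model(rates)) / n                   # average L
        l_core = sum(1.0 / r - 1.0 / (m * i)
                     for r, m, i in zip(rates, mpi, ipc0)) / n  # average L*
        f = l_net - l_core
        if abs(f) < eps:
            break
        if f < 0:
            lo = s  # rates too low: move the lower bound up
        else:
            hi = s  # rates too high: move the upper bound down
    return rates, l_net


# Toy model (hypothetical): two cores share one resource with linear contention.
def toy_model(rates):
    return [20.0 + 50.0 * sum(rates)] * len(rates)


rates, latency = bisection([2.0, 2.0], [0.02, 0.02], [20.0, 20.0], toy_model)
```

For this homogeneous toy case the proportional update is exact; for heterogeneous cores it yields the approximation discussed in the text.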
VI. EXPERIMENTAL RESULTS
In this section we describe several experiments used to
validate the proposed analytical method for efficiency and
quality. Validation is performed with respect to simulation.
Section VI-A describes our simulation environment. Further
sections focus on the experiments for the analytical model.
A. Simulation environment
To verify the analytical model we have developed a flit-level
simulator for hierarchical CMP interconnects on top of Book-
Sim 2.0 [19]. In contrast to the analytical model, the simulator
performs flit-level modeling of contention in the interconnect
fabric and cycle-accurate calculation of throughput.
To simulate a hierarchical CMP, three enhancements were
made to BookSim. First, BookSim's purely probabilistic traffic
injection patterns were replaced with state machine models
for cores, caches and memory controllers. The cores inject
memory requests based on the average workload parameters
and stall waiting for the replies. Memories accept requests
from cores and send replies after a predefined latency. As
another extension, support for hierarchical topologies was
added. This enables simulation of multi-level interconnect
fabrics with arbitrary depth. Finally, we implemented bus and
multibus topologies.
Fig. 4: Behavior of the latency functions L(λ¯) and L∗(λ¯).
[Plot: L, average latency (cycles), against Λ, average injection rate (flits/cycle); the curves L(λ) and L*(λ) intersect at point A.]
Each simulation was run long enough to obtain a 2% relative
error (the same value used for the analytical method) at
a 95% confidence level. The 95% confidence interval is
calculated using the popular batch means method [27].
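A stopping rule of this kind can be sketched as follows. The sketch splits the observations into equal batches and builds a confidence interval from the batch means; it uses a normal quantile instead of Student's t for brevity, and the batch count of 20 is an arbitrary assumption, so it should be read as an approximation of the method in [27] rather than the simulator's actual code.

```python
from statistics import NormalDist, mean, stdev


def batch_means_ci(samples, num_batches=20, confidence=0.95):
    """Return (mean, half_width) of a batch-means confidence interval."""
    size = len(samples) // num_batches
    if size < 1:
        raise ValueError("need at least one sample per batch")
    batches = [mean(samples[i * size:(i + 1) * size]) for i in range(num_batches)]
    # Normal approximation (an assumption); Student's t is more accurate
    # when the number of batches is small.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(batches) / num_batches ** 0.5
    return mean(batches), half_width


def precise_enough(samples, rel_error=0.02):
    """True once the half-width is within rel_error of the estimated mean."""
    m, h = batch_means_ci(samples)
    return h <= rel_error * abs(m)


# Synthetic, perfectly periodic latency samples (illustration only).
samples = [10.0 + (i % 7) * 0.1 for i in range(7000)]
estimate, half_width = batch_means_ci(samples)
done = precise_enough(samples)
```

A simulation loop would keep collecting samples until `precise_enough` returns true.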
B. Efficiency of the analytical methods
In this section we compare the efficiency of the three ana-
lytical methods for resolving the cyclic dependency presented
in Section V: solver (MATLAB), fixed-point iteration (FP) and
bisection (BS). We generated a set of CMPs having flat mesh
topologies with various dimensions and contention degrees.
We selected flat interconnects to demonstrate that
even for rather simple architectures the resulting system of
equations is hard to tackle in a straightforward way.
The test cases and the results of modeling are summarized
in Table II. The first three columns display the test name, mesh
dimension and the ratio of contention latency, with respect to
the average total latency. The fourth column represents the
number of variables and equations in the obtained system.
The fifth column shows the time required to find a solution
using the general nonlinear solver provided by MATLAB. The
last two columns show the time consumed by the FP and BS
methods. For each test case the three methods converged to
the same solution, within the given tolerance region of 2%.
Test cases T1 to T5 show how the MATLAB runtime
grows with the mesh size. The solutions to T5-T7 could
not be found within an hour. Clearly, this straightforward
method is not acceptable for efficient exploration of CMPs.
The purpose of test cases T6 and T7 is to compare the FP
and BS methods. Configuration T6 has higher contention than
T5. As a result, the FP method takes more time to converge than
BS. Configuration T7 has even more contention, resulting in
a violation of the steady-state assumption for the queueing
model (see Section V-B). Hence, FP cannot be used in this
case, and BS is the only option.
We observed that FP typically outperforms BS, when the
contention component of latency is moderate, i.e. does not
exceed about 30-40% of the total latency. Hence we choose
to run FP first and use BS only when the former method fails.
TABLE II: Performance comparison of analytical methods.
Test | Mesh  | Cont. lat. | Num. of var./eqn. | Runtime (sec): MATLAB | Fixed-point | Bisection
T1   | 2×2   | 5%         | 236               | 0.023                 | 0.001       | 0.001
T2   | 4×4   | 13%        | 1224              | 1.412                 | 0.001       | 0.002
T3   | 6×6   | 8%         | 3108              | 30.831                | 0.002       | 0.003
T4   | 8×8   | 12%        | 6128              | 408.539               | 0.006       | 0.010
T5   | 10×10 | 23%        | 10620             | Timeout (1hr)         | 0.010       | 0.012
T6   | 10×10 | 46%        | 10620             | Timeout (1hr)         | 0.022       | 0.015
T7   | 10×10 | 55%        | 10620             | Timeout (1hr)         | NA          | 0.016
C. Architectural exploration for CMPs
To validate the quality of the analytical methods in performance
estimation of hierarchical architectures, we carry out an
experiment for CMP design space exploration. Our framework
reads a setup file with the parameters for cores, memories
and workloads, generates a multitude of architectures, and for
every architecture obtains the throughput, using both modeling
and simulation. With this experiment we demonstrate that
the modeling selects a very similar set of best-throughput
architectures as simulation, but in much shorter time.
Table III shows the exploration setup. The estimates of
chip and component area were taken from the Niagara 2
processor [2]. We scaled the core area and memory density
down to 16 nm to allow hundreds of cores to fit into the chip
area. The IPC0 and MPI of the cores and the miss ratio dependency
on cache size were estimated from the benchmarks in [28].
The number of cores and cache sizes were varied to explore
the trade-off between the computing units and the on-chip
memory. All architectures were generated with the mesh-of-buses
topology. Exploring the mesh dimensions trades off
the number of clusters against the number of processors per cluster.
Given these parameters, our framework generates 1062
feasible configurations. The simulation of all the configurations
took 324 minutes, while performance modeling was done in
just 16.8 seconds, delivering more than a 1000x speedup. The
best architecture by simulation is configuration #937, with
a throughput of 30.81 IPC. It consists of a 6×6 mesh (36
clusters, 5 cores per cluster), for a total of 180 cores with 64Kb
L1 and 256Kb L2 private caches and a 68Mb shared L3 cache.
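As a quick check of the figures quoted in this paragraph, the speedup and the core count follow from simple arithmetic:

```latex
\frac{324\ \text{min} \times 60\ \text{s/min}}{16.8\ \text{s}}
  = \frac{19440}{16.8} \approx 1157,
\qquad
36\ \text{clusters} \times 5\ \text{cores per cluster} = 180\ \text{cores}.
```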
In Figure 5 we sorted the configurations by throughput
along the horizontal axis, as estimated by simulation. One can
see that the modeling estimate closely follows the simulation
curve. The analytical model for latency underestimates the
contention, and therefore deviation increases with its degree.
Configurations with similar throughput may have various
contention degrees, hence the noisy behavior of the modeling
curve. The error in throughput is up to 19%, with the
average value being 10%, which corresponds to the error
reported by the latency model [14].
Fig. 5: Throughput comparison for modeling and simulation.
[Plot: throughput (IPC) of the configurations sorted in descending order of throughput; series: Modeling, Simulation.]
TABLE III: Parameters of the exploration.
Parameter                           | Value
Chip area                           | 350 mm^2
Core area                           | 1.25 mm^2
Core IPC0                           | 2.0
Core MPI                            | 0.5
L1 size                             | 64, 128 Kb
L2 size                             | 64 Kb to 3 Mb
Memory density                      | 1 mm^2 / Mb
Mesh dimensions                     | 2×2 to 16×16
Miss ratio dependency on cache size | 0.05 · CacheSize^(−0.4)
Cache latency dependency on size    | 5.0 · CacheSize^(0.5) cycles
Off-chip memory latency             | 100 cycles
However, what really matters for exploration are the relative,
rather than the absolute values of throughput. Indeed, when
exploring the huge design space we would like to effectively
prune suboptimal architectures and leave a moderate subset of
promising solutions. These configurations can be simulated
further to select the best one. Hence, we are interested in
comparing the order of configurations by throughput,
as delivered by modeling and simulation. Here our
technique demonstrates very accurate results: Figure 6 shows
the comparison for the best-throughput order. To keep the
picture readable, we consider only the 50 best configurations;
however, the same tendency holds for
the whole set. The horizontal axis specifies the number N
of best configurations chosen by simulation. The vertical axis
indicates the minimum number of best configurations chosen
by modeling that includes all the N best ones by simulation.
For example, the point with coordinates (1; 2) means that the
best configuration by simulation (#937) takes second place in
modeling. Furthermore, the throughput of #937 is 30.81 IPC,
while the throughput of the best configuration by modeling
(#940) is 30.80 IPC, which is within our modeling tolerance.
The rightmost point on the plot (50; 64) means that the 64 best
configurations by modeling include all 50 best by simulation.
This is a very accurate result for the analytical model,
when comparing more than one thousand configurations.
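The vertical-axis quantity of Figure 6 can be computed as sketched below: for a given N, it is the worst modeling rank among the N best configurations by simulation. The function name and the three-configuration data set are hypothetical and serve only to show the calculation.

```python
def min_model_prefix(sim_throughput, model_throughput, n):
    """Smallest K such that the top-K configurations by modeling
    contain all of the top-n configurations by simulation."""
    by_sim = sorted(sim_throughput, key=sim_throughput.get, reverse=True)
    by_model = sorted(model_throughput, key=model_throughput.get, reverse=True)
    rank = {cfg: pos + 1 for pos, cfg in enumerate(by_model)}  # 1-based ranks
    return max(rank[cfg] for cfg in by_sim[:n])


# Made-up throughputs (IPC) for three configurations "a", "b" and "c".
sim = {"a": 30.0, "b": 29.0, "c": 25.0}
model = {"a": 29.4, "b": 29.5, "c": 26.0}
k1 = min_model_prefix(sim, model, 1)  # "a" is best by simulation, 2nd by modeling
k2 = min_model_prefix(sim, model, 2)
k3 = min_model_prefix(sim, model, 3)
```

An ideal model would return K = N for every N, which corresponds to the diagonal in the plot.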
We also demonstrate that approximation by the static latency
delivers a poor ordering. It biases the exploration towards
configurations with large clusters, because the long
contention latency in the buses is not considered. The point
(1; 33) in Figure 6 means that the best configuration (#937)
is in the 33rd position when contention is not considered.
For comparison, we also checked a configuration similar
to #937, having the same number of cores and cache size,
but exploiting less locality by using a 12×15 mesh with one
core and one cache module per cluster. The throughput of this
configuration is 24.23 IPC, which is 21% less. This shows the
importance of exploring hierarchical fabrics that effectively
exploit the locality of memory accesses.
Fig. 6: Order comparison for modeling and simulation.
[Plot: best configurations by analysis that include N, against the number of best configurations by simulation (N); series: modeling by static latency, modeling by total latency, ideal modeling (simulation); labeled points: (1; 33), (6; 61), (1; 2), (6; 13), (50; 64).]
D. Scalability of the modeling
To investigate the scalability of the analytical model, we
generated several CMPs with mesh-of-buses topology and
similar structure. Each cluster contains four components (three
cores and one cache module) and a bus interconnect. The
top-level mesh dimensions are varied from 2×2 to 16×16,
producing CMPs with 16 to 1024 components. For each
test case we executed both fixed-point and bisection and
compared the average runtime value of these two methods
with simulation. Figure 7 shows the results of comparison.
Our probabilistic simulator demonstrates very good
performance. Simulation of the 16-component CMP takes just
2.5 seconds, and of the 1024-component CMP about 600
seconds. However, modeling still brings an improvement of two to three orders
of magnitude in efficiency. For the 16- and 1024-component
CMPs modeling took only 0.002 seconds and 3.3
seconds, respectively. In one second our method handles a
CMP with nearly 700 components. This result demonstrates the high
scalability of the proposed method and its ability to efficiently
explore architectures with many hundreds of cores.
VII. CONCLUSIONS
Analytical models for CMP performance are crucial for making
architectural exploration possible. This paper shows that
such models need to incorporate the contention factor in order
to adequately estimate performance. We have presented three
analytical methods to model the contention of hierarchical
interconnects, by resolving the cyclic dependency between the
memory latency and traffic. The validity and efficiency of the
model were demonstrated through extensive simulation and with an
example of architectural exploration.
Fig. 7: Performance comparison of modeling and simulation.
[Plot: runtime (seconds, logarithmic scale) against the number of components in the CMP; series: Simulation, Modeling; annotation: ≈700-component CMP in 1 second.]
VIII. ACKNOWLEDGMENT
This research has been funded by a grant from Intel
Corporation, project CICYT TIN2007-66523, and FPI grant
BES-2008-004612.
REFERENCES
[1] D. Pham et al., “Overview of the architecture, circuit design, and
physical implementation of a first-generation cell processor,” Solid-State
Circuits, vol. 41, pp. 179–196, 2006.
[2] U. Nawathe et al., “An 8-core 64-thread 64b power-efficient SPARC
SoC,” in Solid-State Circuits, feb. 2007, pp. 108 –590.
[3] S. Bell et al., “TILE64 - processor: A 64-core SoC with mesh intercon-
nect,” in Solid-State Circuits, feb. 2008, pp. 88 –598.
[4] S. Vangal et al., “An 80-tile 1.28TFLOPS network-on-chip in 65nm
CMOS,” in Solid-State Circuits, feb. 2007, pp. 98 –589.
[5] J. Owens et al., “GPU computing,” Proceedings of the IEEE, vol. 96,
pp. 879 –899, may 2008.
[6] M. Taylor et al., “The Raw microprocessor: a computational fabric for
software circuits and general-purpose programs,” Micro, IEEE, vol. 22,
no. 2, pp. 25 – 35, mar/apr 2002.
[7] J. Balfour and W. J. Dally, “Design tradeoffs for tiled CMP on-chip
networks,” in Proc. Intl. Conf. on Supercomputing, 2006, pp. 187–198.
[8] R. Das, S. Eachempati, A. Mishra, V. Narayanan, and C. Das, “Design
and evaluation of a hierarchical on-chip interconnect for next-generation
CMPs,” in High Performance Comp. Arch., feb. 2009, pp. 175 –186.
[9] T. Oh, H. Lee, K. Lee, and S. Cho, “An analytical model to study optimal
area breakdown between cores and caches in a chip multiprocessor,” in
ISVLSI ’09, may 2009, pp. 181 –186.
[10] A. Cassidy, K. Yu, H. Zhou, and A. Andreou, “A high-level analytical
model for application specific CMP design exploration,” in Design,
Automation Test in Europe, march 2011, pp. 1 –6.
[11] M. Monchiero, R. Canal, and A. Gonzalez, “Power/performance/thermal
design-space exploration for multicore architectures,” Parallel and Dis-
tributed Systems, vol. 19, no. 5, pp. 666 –681, may 2008.
[12] Y. Li, B. Lee, D. Brooks, Z. Hu, and K. Skadron, “CMP design
space exploration subject to physical constraints,” in High-Performance
Computer Architecture, feb. 2006, pp. 17 – 28.
[13] R. E. Matick, T. J. Heller, and M. Ignatowski, “Analytical analysis
of finite cache penalty and cycles per instruction of a multiprocessor
memory hierarchy using miss rates and queuing theory,” IBM J. Res.
Dev., vol. 45, pp. 819–842, November 2001.
[14] U. Ogras, P. Bogdan, and R. Marculescu, “An analytical approach
for network-on-chip performance analysis,” Computer-Aided Design of
Integrated Circuits and Systems, vol. 29, pp. 2001 –2013, dec. 2010.
[15] Y. Ben-Itzhak, I. Cidon, and A. Kolodny, “Delay analysis of wormhole
based heterogeneous NoC,” in NOCS’11, may 2011, pp. 161 –168.
[16] S. Foroutan, Y. Thonnart, R. Hersemeule, and A. Jerraya, “An analyt-
ical method for evaluating network-on-chip performance,” in Design,
Automation Test in Europe, march 2010, pp. 1629 –1632.
[17] P. Bogdan and R. Marculescu, “Non-stationary traffic analysis and its
implications on multicore platform design,” Computer-Aided Design of
Integrated Circuits and Systems, vol. 30, pp. 508 –519, april 2011.
[18] A. Hartstein, V. Srinivasan, T. Puzak, and P. Emma, “On the nature of
cache miss behavior: is it square root of 2,” Journal of Instruction-Level
Parallelism, vol. 10, 2008.
[19] W. Dally and B. Towles, Principles and Practices of Interconnection
Networks. Morgan Kaufmann Publishers, Inc., 2003.
[20] A. R. Alameldeen, “Using compression to improve chip multiprocessor
performance,” Ph.D. dissertation, 2006.
[21] J. L. Hennessy and D. A. Patterson, Computer Architecture, 4th Edition:
A Quantitative Approach. Morgan Kaufmann Publishers Inc., 2006.
[22] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic
performance model for superscalar out-of-order processors,” ACM Trans.
Comput. Syst., vol. 27, pp. 1–37, May 2009.
[23] “MATLAB,” http://www.mathworks.com.
[24] R. Burden and D. Faires, Numerical Analysis. Brooks Cole, 2010.
[25] L. Kleinrock, Queueing Systems, Volume 1. Wiley-Interscience, 1975.
[26] G. Wood, “The bisection method in higher dimensions,” Math. Program.,
vol. 55, June 1992.
[27] G. S. Fishman, “Grouping observations in digital simulation,” Manage-
ment Science, vol. 24, pp. 510–521, 1978.
[28] K. Olukotun, Chip Multiprocessor Architecture: Techniques to Improve
Throughput and Latency. Morgan and Claypool Publishers, 2007.
