Structured Parallel Programming and Cache Coherence in Multicore Architectures by LAMETTI, SILVIA
Università di Pisa
Dipartimento di Informatica
Dottorato di Ricerca in Informatica
Ph.D. Thesis
Structured Parallel Programming
and Cache Coherence
in Multicore Architectures
Silvia Lametti
Supervisor
Marco Vanneschi
December 11, 2015
Dipartimento di Informatica, Largo B. Pontecorvo, 3, I-56127 Pisa, Italy
SETTORE SCIENTIFICO DISCIPLINARE INF/01

To Mom and Dad

Abstract
Multicore processors have become the building blocks of today’s
high-performance computing platforms. The advent of massively parallel
single-chip microprocessors further emphasizes the gap that exists between
parallel architectures and parallel programming maturity. Our research group,
starting from its experience with distributed and shared memory multiprocessors,
was one of the first to propose a Structured Parallel Programming approach to
bridge this gap. In this scenario, one of the biggest problems is that an
application’s performance is often affected by the sharing pattern of data and
its impact on Cache Coherence. Current multicore platforms rely on hardware or
automatic cache coherence techniques that allow programmers to develop programs
without taking the problem into account. It is well known that standard
coherence protocols are inefficient for certain data communication patterns, and
these inefficiencies will be amplified by the increasing number of cores and
the complexity of memory hierarchies.
Following a structured parallelism approach, our methodology to attack these
problems is based on two interrelated issues: structured parallelism paradigms and
cost models (or performance models).
Evaluating the performance of a program, although widely studied, is still an
open problem in the research community and, notably, specific cost models to
describe multicores are missing. For this reason, in this thesis we define an
abstract model for cache coherent architectures that captures the essential
elements and the qualitative behavior of multicore-based systems. Furthermore,
we show how this abstract model, combined with well-known performance modelling
techniques such as analytical modelling (e.g., queueing models and stochastic
process algebras) or simulation, provides an application- and
architecture-dependent cost model to predict the performance of structured
parallel applications.
Starting from the behavior and performance predictability of structured
parallelism schemes, in this thesis we address the issue of cache coherence in
multicore architectures following an algorithm-dependent approach: a particular
kind of software cache coherence solution characterized by explicit cache
management strategies that are specific to the algorithm to be executed.
Notably, we ensure parallel correctness by exploiting architecture-specific
mechanisms and by defining proper data structures in order to “emulate” cache
coherence solutions efficiently for each computation. Algorithm-dependent cache
coherence can be efficiently implemented at the support level of structured
parallelism paradigms, in a way that is fully transparent to the application
programmer. Moreover, by using the cost model, we study and compare different
algorithm-dependent implementations, such as those based on automatic cache
coherence against an original, non-automatic and lock-free solution based on
interprocessor communications. With the latter implementation we are able, in
some cases, to reduce the number of memory accesses, cache transfers and
synchronizations, and to increase computation parallelism with respect to the
use of automatic cache coherence.
Current architectures do not usually allow automatic cache coherence to be
disabled. However, the emergence of many-core architectures has somewhat changed
the scenario: some architectures, such as the Tilera TilePro64, allow the
automatic cache coherence facilities to be controlled and disabled. For this
reason, in this thesis we finally apply our methodology to the TilePro64
platform in order to provide a further validation of the results obtained with
our cost model.
Of course, the world looks different to you now:
So, don’t forget the bigger picture.
What Exactly Is a Doctorate?
Matt Might

Acknowledgments
First of all, I would like to thank my supervisor Marco Vanneschi for his
patience and for teaching me his research approach and methodology. I will never
cease to be amazed by his teaching skills.
I am grateful to Prof. Marco Danelutto, Prof. Massimo Coppola, Dr. Massimo
Torquati, Dr. Carlo Bertolli, Dr. Massimiliano Meneghin, Dr. Gabriele Mencagli,
Dr. Daniele Buono and Dr. Tiziano De Matteis, whose contributions and advice
are spread throughout the whole thesis.
I would like to thank my best friends: Chiara and Marta, who have shared the
last 28 years or so with me, and Glo and Patu, who have been like a family to me.
Finally, I would like to lovingly thank Andrea, Simba and my family: mom, dad,
grandma, Tatta and Aurora. They have always supported and encouraged me to face
all the challenges of life serenely and courageously. Most of what I have
achieved would not have been possible without them by my side.
Contents
I Introduction
1 Introduction
1.1 The Cache Coherence Problem in the ManyCore Era
1.2 Parallel Paradigms and Cache Coherence
1.3 Our Starting Point
1.4 Contribution of the Thesis
1.5 Outline of the Thesis
1.6 Current Publications by the author
2 Background
2.1 CMP Architectures
2.2 Cache Coherence
2.2.1 Automatic Cache Coherence
2.2.2 Software or non-Automatic Cache Coherence
2.3 Existing Evaluations of Cache Coherence Solutions
2.4 Parallel Programming on CMPs
2.4.1 Programming Languages
2.4.2 Libraries
2.5 Summary
3 Our Methodology: Programming and Cost Models
3.1 Structured Parallel Programming
3.1.1 Stream Parallelism
3.1.2 Data Parallelism
3.2 Parallelization Methodology and Cost Model
3.2.1 Performance Modeling with Queueing Networks
3.2.2 Performance evaluations of modules and graph computations
3.2.3 Parallelism forms and cost models
3.2.4 Evaluating the model parameters
3.3 Summary
II Applying Our Methodology to the Cache Coherence Problem
4 Modelling Cache Coherent Architectures
4.1 SM Organization in CMP-based Architectures
4.1.1 Single-CMP
4.1.2 Multiple-CMP
4.2 An Abstract Model for CC-Architectures
4.2.1 A Hierarchy-based Classification
4.3 Base Memory Access Latencies
4.3.1 Reading Operations
4.3.2 Writing Operations
4.3.3 Reading and Writing Operations in Multiple-CMP Architectures
4.4 Benchmarks for Reading and Writing Latencies
4.5 Summary
5 Parallel Paradigms and Cache Coherence
5.1 Recognizing CC Patterns in Parallel Paradigms
5.1.1 Farm
5.1.2 Data-Parallel: Map
5.1.3 Data-Parallel with Stencil
5.2 Synchronizations in Shared Memory Systems
5.2.1 Mutual Exclusion
5.2.2 Global (barrier) event synchronizations
5.2.3 Memory Ordering and Memory Consistency Models
5.3 Lock-Free Data Structure for Comm. Mechanisms
5.4 Summary
6 Cost Models for CMP-based Architectures
6.1 Cost Model for Under Load Memory Latencies
6.1.1 Model Resolution
6.1.2 Complexity vs Approximation of the model
6.1.3 Memory access latency
6.1.4 On-chip cache-to-cache transfers
6.2 On parallel program mapping and under-load evaluation
6.3 PEPA: Process Algebra for Quantitative Analysis
6.3.1 PEPA Language
6.3.2 PEPA for graphs
6.3.3 PEPA for under load memory latency
6.3.4 On the resolution of PEPA models
6.4 Summary
III Evaluation of the Proposed Methodology
7 A Structured Parallelism Approach to CC
7.1 Optimizations for Parallel Paradigms RTS
7.1.1 Flexible home node selection
7.1.2 Home-flush techniques
7.1.3 Cooperation mechanisms among cores through inter-processor communications
7.2 Communication run-time support
7.2.1 The Rdy-Ack Communication Model
7.2.2 Rdy-Ack Based on Shared Memory Synchronizations
7.2.3 Rdy-Ack Based on Inter-processor Communications
7.3 Asymmetric Rdy-Ack Communications
7.4 Implementation and Evaluation on TilePro64
7.5 Summary
8 Conclusions
Bibliography
List of Figures
2.1 Examples of current CMP-based architectures
2.2 A first comparison of automatic and non-automatic techniques
3.1 Pipeline parallelism form
3.2 Farm parallelism form
3.3 Data-parallel with stencil parallelism form
3.4 Map parallelism form
3.5 MapReduce parallel paradigms
3.6 The “compilation workflow” in our programming environment
3.7 An example module graph
3.8 A computation module modeled as a queueing system
3.9 An example module graph and its queueing network representation
3.10 A generic acyclic graph computation Σ
3.11 Stream-oriented pipeline modeling of task farm
3.12 An example of an abstract architecture with n Processing Elements connected to the main memory through the interconnection structure
4.1 SMP vs NUMA characterization of multiprocessor architectures
4.2 Main memory subsystem in single-CMP architectures
4.3 SMP vs NUMA characterization of multiple-CMP architectures
4.4 Abstract model for a cache coherent architecture with N processing elements
4.5 Global Control (GC) implementation in a hierarchy-based classification of cache coherent CMP-based architectures
5.1 Farm implementation strategies
5.2 Pseudo-code for stencil computation in shared-variable implementation
5.3 Pseudo-code for stencil computation in message-passing implementation
5.4 Average latency times for the FastFlow queues on Tilera TilePro64 and Intel processors varying the buffer size
6.1 Client-server models with request-reply behaviour used to model multiprocessor systems
6.2 RQ/RQ0 for main memory with TM = 30τ
6.3 RQ/RQ0 for on-chip cache-to-cache transfers
6.4 Structured Operational Semantic of PEPA
6.5 Co-operating modules graph example
6.6 Comparison with PEPA of RQ of each server: low-p mapping strategy (pW[i] = 2) versus standard mapping strategy (pIN = n+1) in Single-CMP with GC implemented at the shared cache level
6.7 Comparison with PEPA of RQ of each server: low-p mapping strategy (pW[i] = 2) versus standard mapping strategy (pIN = n+1) in single-CMP, single-MINF, without a shared level cache, with GC distributed among the PrCs
6.8 Comparison with PEPA of RQ/RQ0 of each server: low-p mapping strategy (pW[i] = 2) versus standard mapping strategy (pIN = n+1) in Single-CMP with GC implemented at the shared cache level
6.9 Comparison with PEPA of RQ/RQ0 of each server: low-p mapping strategy (pW[i] = 2) versus standard mapping strategy (pIN = n+1) in single-CMP, single-MINF, without a shared level cache, with GC distributed among the PrCs
6.10 Comparison with PEPA of RQ of each server: low-p mapping strategy (pW[i] = 2) versus standard mapping strategy (pIN = n+1) in a single-CMP, multiple-MINF, with a shared level cache that acts as GC
7.1 Abstract definition of the send-receive operations in the rdy-ack communication model
7.2 Algorithms of the send-receive operations in the rdy-ack communication model based on shared memory synchronizations
7.3 Algorithm of the zero-copy receive in the rdy-ack communication model based on shared memory synchronizations and an example of its use
7.4 Rdy-Ack communication data structures for asynchrony degree k > 1
7.5 Rdy-Ack channel structure for the message-passing solution based on interprocessor communications
7.6 Pseudo-code of send, receive and set ack operations for Rdy-Ack channel structure for the message-passing solution based on interprocessor communications
7.7 Pseudo-code of wait, notify and interrupt-handler operations for Rdy-Ack channel structure for the message-passing solution based on interprocessor communications
7.8 Rdy-Ack channel structure for the pointer-passing solution based on interprocessor communications
7.9 Tilera TilePro64 automatic cache coherence protocols
7.10 FastFlow stream matrix multiplication (Ai × Bi) using different cache coherence strategies: Hash Home Node (HHN), No Home Node (NHN) and Fixed Home Node (FHN)
7.11 Rdy-Ack implementation based on Shared Memory synchronizations with home-flush optimization on Tilera TilePro64
7.12 Asymmetric Rdy-Ack implementation based on Shared Memory synchronizations with home-flush optimization on Tilera TilePro64
7.13 Rdy-Ack implementation based on interprocessor communications on Tilera TilePro64
7.14 Rdy-Ack symmetric communication latency for passing-pointer solution evaluated with the ping-pong micro-benchmarks
7.15 Rdy-Ack many-to-one communication latency for passing-pointer solution evaluated with the ping-pong micro-benchmarks
7.16 A 3-stage pipeline computation example
7.17 Speedup of the farm computation executed on Tilera TilePro64 with the various run-time supports: integer values matrices with N=32/64/128
7.18 Speedup of the farm computation executed on Tilera TilePro64 with the various run-time supports: float matrices with N=32/64/128
List of Tables
4.1 GC and Network Latencies
4.2 Reading Operations Latencies
4.3 Writing Operations Latencies
4.4 Memory read latencies for core 0, depending on the cache line state, for Intel SandyBridge processors. Times in clock cycles.
4.5 Memory read latencies for core 0, on the Tilera TilePro64 architecture, with or without cache coherence. Times in clock cycles.
4.6 Memory write latencies for core 0, depending on the cache line state, for Intel SandyBridge processors. Times in clock cycles.
4.7 Memory write latencies for core 0, depending on the cache line state, for Tilera TilePro64 processors. Times in clock cycles.
6.1 PEPA steady-state resolution of a multiple-source computation graph
7.1 Reading and writing activities on the VTG fields
7.2 Reading and Writing Operations Latencies in Tilera TilePro64
List of Algorithms
5.1 Lock and unlock implementation with the test-and-set instruction
5.2 Lock and unlock implementation with LL/SC instructions
5.3 Global barrier algorithm with mutual exclusion
5.4 Use of a memory fence instruction inside the unlock operation to ensure the correct memory ordering (Total Store Ordering)
5.5 Locking queue implementation
5.6 ⊥-based lock-free queue implementation
7.1 Pseudo-code of Stage 2 in a 3-stage pipeline computation
Part I
Introduction

CHAPTER 1
Introduction
Continuous advances in microprocessor design have been reflected in
high-performance computing platforms that increasingly rely on multicores as a
basic building block. Off-the-shelf multicores, also called Chip MultiProcessors
(CMPs) in the research world, are at the moment commonly built with a relatively
small number of cores (4 to 12). However, the trend in hardware technology is
clear: “Moore’s law” is expected to apply to the number of cores, and
researchers expect architectures with 128 to 1024 cores on a single chip in a
few years [67, 96], to the point that the term many-cores has been introduced to
indicate the large number of cores per chip of some solutions. All this is
accompanied by a corresponding evolution of on-chip interconnection networks, by
increasingly complex memory hierarchies and by more scalable cache coherence
solutions. Some examples of current highly parallel multicore platforms are the
Tilera TilePro64 [19] processor (64 cores), the IBM PowerEN [50] (16 4-way
Simultaneous Multithreading cores per processor, for a maximum of 64 cores and
256 threads in 4-chip configurations), the AMD Opteron [34] with 48 cores in a
single machine, the IBM Power7 [64] servers, with up to 32 8-core 4-way SMT
processors per machine, for a total of 1024 threads, and the Intel Xeon Phi, an
accelerator composed of 60 4-way SMT cores [62].
This “revolution” only further emphasizes the gap that exists between parallel
architectures and parallel programming maturity. “The aggressive goal of the
parallel revolution is to make it as easy to write programs that are as
efficient, portable, and correct (and that scale as the number of cores per
microprocessor increases biennially) as it has been to write programs for
sequential computers.” [16].
The historical difficulties that characterize parallel programming continue to
influence the choices related to the development of parallel programming
environments: the “productivity-performance-programmability” trade-off is still
characterized by the contrast between the search for high performance through
low-level tools and the use of high-level languages for easy programmability.
Applications should be designed at the highest level by means of formalisms and
tools that are fully independent of the machine architecture and of the
mechanisms at the process level.
Our research group, starting from its experience with distributed and shared
memory multiprocessors [109, 11, 10, 21], proposes a structured parallelism
approach to bridge this gap. Structured parallelism paradigms aim to provide
standard and effective rules for composing parallel computations in a
machine-independent manner. Cost models are defined for performance evaluation
and prediction, and are a fundamental tool for reducing the complexity of
parallel software design.
General results for multiprocessor architectures are therefore valid for CMPs.
At the same time, however, there are some important features and capabilities
that are not available in multiprocessors and must be further investigated. The
number and complexity of cores, the interconnection network among cores and
towards the outer memory, cache hierarchies and cache coherence solutions make
each CMP different. As regards parallel programming, this forces the programmer
to write specific low-level code to reach good performance. Currently, the most
common situation consists of developing parallel applications directly at the
process level through programming languages or extensions of existing
sequential languages (e.g., OpenMP [38], Cilk [71]) or through libraries (e.g.,
MPI [102], Intel TBB [92], Skandium [72], FastFlow [15]). These approaches are
independent of the underlying architecture, but do not have sufficient
expressive power to support high-level development of complex applications and
“performance portability”. In addition, a cost model is still missing: there is
no way to predict the performance of a program until it is run on a specific
platform. Different parallelizations of a program therefore cannot be compared
in a formal or generalizable way, but only by their execution times.
1.1 The Cache Coherence Problem in the ManyCore Era
The cache coherence problem arises from the possibility that more than one cache
of the system may hold a copy of the same memory location. If different
processors transfer the same data into their caches, it is necessary to ensure
that the copies remain consistent with each other and with the copy in the
memory hierarchy. Current solutions prevent incoherence through hardware
mechanisms that ensure that each cache holds the current value of a memory
location. These mechanisms (commonly called cache coherence protocols) rely on
hardware or automatic cache coherence techniques that allow programmers to
develop programs without taking this problem into account. In fact, no explicit
coherence operations need to be inserted in the program.
In a generic cache coherence protocol each line in a cache has an associated
state that represents the availability and use of that cache line inside the
system. From a logical point of view, the state is global (i.e., the same for
each cache of the architecture). A read/write operation on a cache line changes
its state, and may prompt some communications between the caches to ensure that
none of them holds a stale value. To do so, two main techniques can be used:
• invalidation: when a line is modified in a private cache, all the other
caches remove the old value (if present);
• update: each modification of a cache line is communicated (broadcast) to
all other caches.
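The difference between the two techniques can be illustrated with a minimal sketch. It is a toy model, not a description of any real protocol: it assumes write-through caches, tracks no per-line states, and sends update messages only to caches that already hold a copy (a simplification of the broadcast described above). The class and function names are hypothetical.

```python
class ToyCoherentCaches:
    """Toy model of private caches kept coherent by invalidation or update.

    Assumptions (illustrative only): write-through memory, one value per
    address, no cache line states, no capacity limits.
    """

    def __init__(self, n_cores, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.policy = policy
        self.memory = {}
        self.caches = [dict() for _ in range(n_cores)]
        self.coherence_messages = 0
        self.read_misses = 0

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:
            # Miss: memory is always current because writes are write-through.
            self.read_misses += 1
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, core, addr, value):
        self.memory[addr] = value
        self.caches[core][addr] = value
        for other, cache in enumerate(self.caches):
            if other == core or addr not in cache:
                continue
            if self.policy == "invalidate":
                del cache[addr]        # the other copy is removed
            else:
                cache[addr] = value    # the other copy is refreshed
            self.coherence_messages += 1


def run_producer_consumer(policy, n_items=10):
    """Core 0 writes a value, core 1 reads it, n_items times."""
    system = ToyCoherentCaches(2, policy)
    values = []
    for item in range(1, n_items + 1):
        system.write(0, "x", item)
        values.append(system.read(1, "x"))
    return values, system.coherence_messages, system.read_misses
```

In this toy producer-consumer run both policies deliver the latest value to the reader and exchange the same number of coherence messages, but with invalidation every read by the consumer is a miss, whereas with update only the first one is.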
For general shared memory programming models, automatic cache coherence
protocols have been shown to offer the best performance [86]. When the problem
of cache coherence was first raised, several studies introduced and compared
hardware-based and software-based cache coherence protocols. Several works
highlight how cache coherence may be ensured at the software level to obtain
performance improvements [3]. In fact, it is well known that standard coherence
protocols are inefficient for certain data communication patterns (e.g.,
producer-consumer), and these inefficiencies will be amplified by the
increasing number of cores and the complexity of memory hierarchies.
This makes it necessary to study cache coherence mechanisms once again, paying
particular attention to the disadvantages of automatic techniques compared with
what we call an algorithm-dependent approach: a particular kind of software (or
non-automatic) cache coherence solution characterized by explicit cache
management strategies that are specific to the algorithm to be executed.
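As a rough illustration of what explicit cache management means, the following sketch models two cores with private write-back caches and no automatic coherence. The producer must explicitly flush the data it wrote, and the consumer must explicitly discard its old copy before reading. This is a toy model under stated assumptions, not the run-time support developed in the thesis: all names (NonCoherentSystem, flush, self_invalidate) are hypothetical, and the producer and consumer steps run sequentially, so the synchronization a real implementation would need is not modeled.

```python
class NonCoherentSystem:
    """Toy write-back caches with NO automatic coherence (illustrative only)."""

    def __init__(self, n_cores):
        self.memory = {}
        self.caches = [dict() for _ in range(n_cores)]

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:
            cache[addr] = self.memory.get(addr, 0)  # miss: load from memory
        return cache[addr]

    def write(self, core, addr, value):
        # Write-back: only the local copy changes; memory is not updated.
        self.caches[core][addr] = value

    def flush(self, core, addr):
        # Explicit operation: copy the local value back to memory.
        if addr in self.caches[core]:
            self.memory[addr] = self.caches[core][addr]

    def self_invalidate(self, core, addr):
        # Explicit operation: drop the local copy so the next read reloads it.
        self.caches[core].pop(addr, None)


def producer_consumer(explicit_management):
    """Core 0 produces three items in 'buf'; core 1 consumes them."""
    system = NonCoherentSystem(2)
    seen = []
    for item in range(1, 4):
        system.write(0, "buf", item)
        if explicit_management:
            system.flush(0, "buf")            # the algorithm knows 'buf' is shared
            system.self_invalidate(1, "buf")  # and that the consumer's copy is old
        seen.append(system.read(1, "buf"))
    return seen


print(producer_consumer(False))  # [0, 0, 0]: the consumer only sees stale data
print(producer_consumer(True))   # [1, 2, 3]
```

The point of the sketch is that the coherence actions are placed exactly where the algorithm needs them, here one flush and one invalidation per item, rather than being triggered automatically on every access.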
1.2 Parallel Paradigms and Cache Coherence
Understanding the impact of automatic cache coherence solutions on parallel
program performance is too complex if studied independently of the
characteristics of the parallel program. Following a structured parallelism
approach, our methodology to attack these problems is based on two interrelated
issues: structured parallelism paradigms and cost models (or performance
models).
Performance prediction of a program, although widely studied, is still an open
problem in the research community. Cost models in the world of parallel
programming are usually proposed to study algorithms asymptotically, like the
PRAM [47] and BSP [106] models. A first step towards a more “detailed” model is
LogP [36], with its successive enhancements. However, all these models are kept
as simple as possible to let programmers easily compare algorithms. We are not
interested in this kind of model: we are looking for a more realistic model that
takes into account every important property of the parallel architecture and of
the parallel program. Unfortunately, as of today there is no way to precisely
estimate the completion time of a general program on current architectures,
mainly because of their complexity and dynamicity. The idea of specific
performance models for parallel patterns is not really new, as it was introduced
in P3L [17]. Those models, however, treated the sequential code as a “black
box” with specific, immutable characteristics when modeling the performance of
the implementations.
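To make concrete how simple the models mentioned above are, the standard textbook cost expressions of BSP and LogP can be recalled. The notation below is the usual one from the literature and is not taken from this thesis.

```latex
% BSP: cost of one superstep, where w_i is the local work of processor i,
% h the maximum number of words sent or received by any processor,
% g the communication cost per word and l the barrier synchronization cost.
T_{\text{superstep}} = \max_i w_i + h \cdot g + l

% LogP: end-to-end time of a single small message, where L is the network
% latency and o the send/receive overhead paid by each of the two processors.
T_{\text{msg}} = o + L + o = L + 2o
```

Neither expression contains terms for the memory hierarchy or for cache coherence actions, which is the kind of detail that the cost model sought here has to capture.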
We extend the original concepts by defining a performance cost model in
association with an abstract architecture, or abstract model, for cache
coherent architectures. The abstract model is a simplified view of the concrete
architecture that describes the essential performance properties and abstracts
from all the others. It aims to discard the details that are specific to
individual concrete architectures and to emphasize the most important and
general ones (e.g., the actions required by cache coherence protocols to keep
data consistent).
A cost model associated with the abstract architecture has to sum up the
features of the concrete architecture (e.g., how the cache coherence state
affects memory and cache access latencies), of the inter-process communication
run-time support (e.g., in order to evaluate the communication performance of
an algorithm-dependent solution) and the impact of the parallel application
(e.g., showing the possible correlation between parallel paradigms and cache
coherence). Further, we strongly advocate that a cost model should be easy to
use and conceptually simple to understand.
The aim is to use cost models to optimize parallel applications and to study and
compare the different implementations of cache coherence solutions for each
pattern.
1.3 Our Starting Point
Our research group has quite a long history in structured parallel programming,
starting with the P3L skeleton language in 1992 and culminating with ASSIST
[109] in recent years. However, we never really focused our effort on multicore
and shared memory architectures in general. Our experiments with FastFlow [15]
demonstrated the need for, and the possibility of, multicore-specific
optimizations in a skeleton-based library. A skeleton library, however, does not
allow us to fully exploit the benefits of structured parallel programming,
because it does not (entirely) allow code restructuring and transformations. The
long-term project of our research group is ASSISTANT [21], the extension and
adaptation of ASSIST for the current world of parallel computing, composed of
multicores, pervasive grids and clouds. Many of the principles introduced in
ASSIST are inherited and extended, in order to provide a significant leap
forward in the world of multicore-oriented parallel programming. Respecting the
basic ASSIST principles, a parallel program will be described as a generic graph
of stream-connected parallel modules. Each module will consist of a parallel
pattern, and the programmer will be able to write the algorithm code in the most
widely used sequential languages (C/C++, Java, Matlab, and so on).
A first step toward multicore technologies was taken with Daniele Buono’s PhD
thesis [23], which started targeting multicore architectures and showed the
feasibility of the cost model approach by defining an architectural model for a
specific many-core architecture (the Tilera TilePro64) and applying it to
well-known parallel pattern implementations in order to evaluate the specific
memory-related optimizations introduced in that thesis.
1.4 Contribution of the Thesis
In this thesis we start from the experience gained so far in our research group
and approach the cache coherence problem with our methodology. Notably, we
define an architectural cost model for cache coherent CMP-based architectures
and apply it to evaluate specific cache coherence-related optimizations for
well-known parallel paradigms.
The fundamental contributions are the following:
• The definition of an abstract model for cache coherent CMP-based
architectures, which summarizes the characteristics of the state of the art in
multicore technologies (e.g., memory and cache hierarchy) in relation to
possible cache coherence solutions. The model provides a first result toward
the evaluation of the impact of automatic cache coherence on parallel program
performance, by analytically defining the base memory and cache access
latencies of reading and writing operations in terms of the coherence protocol
adopted.
• An extensive study of parallel paradigm implementations, focused on the
identification of specific cache coherence patterns, in order to evaluate how
and when coherence protocols are effectively used and with which effects on
parallel program performance. In this way, we are able to define specific
optimizations by exploiting the knowledge of the structure of each parallelism
scheme, such as the interactions between the parallel modules and their data
access patterns.
• A queueing network-based model for cache coherent CMP-based architectures
that, starting from the abstract model of the architecture, shows how standard
automatic cache coherence affects the under-load memory and cache access
latencies. Notably, by combining the abstract model with well-known
performance modelling techniques, such as queueing models and stochastic
process algebras (i.e., PEPA [55]), we provide an application- and
architecture-dependent cost model to predict the performance of structured
parallel applications. This cost model is fundamental in the definition of the
run-time support of parallel paradigms, showing for example how a specific
mapping strategy can improve performance by minimizing the under-load
latencies.
• An optimized run-time support for structured parallel applications and a
demonstration of the use of the cost model to compare different solutions based
on automatic or non-automatic cache coherence, lock-free and based on
interprocessor communications. Notably, we show that with the non-automatic
and lock-free solutions based on interprocessor communications we are able
to reduce the number of memory accesses, cache transfers and
synchronizations, and to increase computation parallelism with respect to the
use of the automatic cache coherence alternative. Finally, the implementations of these
solutions on Tilera TilePro64 processors confirm the results estimated by the
cost model.
1.5 Outline of the Thesis
The thesis is organized in three main parts:
Part I: Introduction in which we introduce the reader to the state of the art
of cache coherence in CMP-based architectures and to our methodology based on
structured parallel programming and performance models, establishing the basis to
understand the second part. Notably, we have:
• Chapter 2 reviews the current state of cache coherence solutions and
of parallel programming for multicores. We describe the features of current
architectures and the evolution trend that is likely to be followed. Then, we give
a brief overview of hardware-based and software-based cache coherence solutions
and of the existing evaluations of these approaches. Finally, we introduce the
current tools for programming CMPs.
• Chapter 3 introduces the methodology adopted in the second part of this
thesis. We introduce the reader to Structured Parallel Programming. After
that, we present the conceptual framework of ASSIST and its pervasive
evolution ASSISTANT, and the general approach to performance models.
Part II: Applying Our Methodology to the Cache Coherence Problem
where we define an application- and architecture-dependent cost model to predict
structured parallel applications performances and to compare automatic vs non-
automatic cache coherence solutions. Notably, we have:
• Chapter 4, in which we define an abstract model for cache coherent CMP-
based architectures in order to analytically define the base memory access latency
in terms of the cache coherence protocol. We also show benchmark
results that validate the abstract model.
• Chapter 5 analyzes different parallel paradigm implementations (i.e.,
shared-memory vs message-passing) to recognize possible cache coherence
patterns, in order to study the possible optimizations for the support
of each parallelism form.
• Chapter 6, starting from the abstract model defined in Chapter 4 and the
analysis of Chapter 5, defines an analytic cost model for under load memory
latencies based on a queuing model and discusses the effects of specific parallel
program mappings with different cache coherence solutions. Finally, a brief
introduction to the PEPA process algebra shows the reader how this tool
can be useful for the performance evaluation of computation graphs and of under
load memory latency.
Part III: Evaluation of the Proposed Methodology in which we show the
results obtained from the experiments used to evaluate the considerations made in
the previous part of the thesis. Notably, we have:
• Chapter 7 proposes different implementations for the support of parallel
paradigms, in order to compare automatic and non-automatic solutions with
respect to a lock-free solution based on interprocessor communications. The
comparison is done by using the cost model defined in the previous part of the
thesis and by some experiments executed on the Tilera TilePro64 processor,
which constitutes an interesting example of a chip multiprocessor, given its 64
cores and the use of innovative solutions for the interconnection network and
the cache coherence mechanisms.
• Chapter 8 presents the conclusions of the thesis and some reflections on the
results obtained.
1.6 Current Publications by the author
The following represents the publications that I worked on during my Ph.D. research:
• Carlo Bertolli, Daniele Buono, Alessio Pascucci, Silvia Lametti, Gabriele Men-
cagli, Massimiliano Meneghin, and Marco Vanneschi. A programming model
for high-performance adaptive applications on pervasive mobile grids. Pro-
ceedings of the 21st IASTED International Conference on Parallel and Dis-
tributed Computing and Systems, Cambridge, USA, 2009.
• Daniele Buono, Marco Danelutto and Silvia Lametti. Map, reduce and
mapreduce, the skeleton way. Procedia Computer Science, Volume 1, Issue 1, ICCS
2010, May 2010, Pages 2089-2097, ISSN 1877-0509,
DOI: 10.1016/j.procs.2010.04.234.
• Daniele Buono, Marco Danelutto, Silvia Lametti and Massimo Torquati. Par-
allel Patterns for General Purpose Many-Core. Parallel, Distributed and
Network-Based Processing (PDP), 2013 21st Euromicro International Con-
ference on, Pages 131-139, ISSN 1066-6192, DOI: 10.1109/PDP.2013.27.
CHAPTER 2
Background
This chapter presents the current state of the art in multicore computing, both from
the hardware and the software point of view, with special attention to the cache
coherence problem.
We first give an overview of the architectures that we consider most interesting and
promising in the industrial and research landscape.
After that, we summarize the state of the art of cache coherence solutions,
in terms of automatic and non-automatic mechanisms, together with some results on
the evaluation and comparison of these solutions.
The chapter concludes with a brief overview of the parallel programming tools both
for multiprocessor and CMP-based architectures.
2.1 CMP Architectures
In this section, we present a brief state of the art of CMPs from the hardware
perspective. While processors with a few (4 to 12) cores are common today, this
number is projected to grow. Because of this rapid evolution and of the consequent
open issues, we present a small set of architectures in order to understand which
design choices are common across the various features of CMPs and are likely to
characterize future architectures.
We consider the following four examples, all represented in Figure 2.1, as
representative for our study:
• AMD Opteron 6100 [34]
• IBM PowerEN [50]
• Intel SCC [73]
• Tilera TilePro64 [19]
Figure 2.1: Examples of current CMP-based architectures: (a) the AMD Opteron 6100, (b) the IBM PowerEN, (c) the Intel SCC, (d) the Tilera TilePro64
We focus on the following characteristics, which are summarized in Table 2.1.
| Characteristics | AMD Opteron 6100 | IBM PowerEN | Intel SCC | Tilera TilePro |
|---|---|---|---|---|
| Number of processors | 2 six-core processors | 16 cores | 24 dual-core tiles | 100 cores |
| Processor architecture: word size | 64-bit words | 64-bit words | 32-bit words | 64-bit words |
| Processor architecture: execution | out-of-order execution | in-order execution | in-order execution | in-order execution |
| Core complexity | 3-way superscalar; 6 execution units (floating point and SIMD) | 4-way superscalar; 2 execution units | 2-way superscalar; floating point execution unit | 3-way superscalar; 3 execution units |
| Instruction Level Parallelism | no hardware multithreading | 4-way SMT | no hardware multithreading | VLIW instruction set |
| Memory bandwidth and organization | UMA with a single memory interface directly connected to the L3 | UMA with 2 memory interfaces connected by a bus, possible hierarchical composition (NUMA of SMPs) | NUMA with 4 memory interfaces | NUMA with 4 memory interfaces |
| Caches: L1 | separated L1 for data and instructions | separated L1 for data and instructions | separated L1 for data and instructions | separated L1 for data and instructions |
| Caches: L2 | unified L2 for each core | 4 L2 caches each shared by 4 cores | unified L2 for each core | unified L2 for each core with directory |
| Caches: L3 | shared L3 with directory | | | |
| Cache Coherence | snoopy-based inside the chip (among L2 and L3) and directory-based among different chips | snoopy-based | no hardware cache coherence | directory-based |
| Interconnection network | crossbar among L2 caches and L3, partial crossbar for multiprocessor configurations | crossbar among groups of 4 cores and the L2, bus connection between all L2 and the memory, crossbar for multiprocessor configurations | 4 by 6 two-dimensional mesh | 10 by 10 two-dimensional mesh |
| Atomic operations and synchronizations | | “hardware” passive wait for threads | message-passing | possibility of inter-core communication provided by the mesh |
Table 2.1: CMPs characterization
Number of processors In CMPs this is a constant value, which
depends on the chip size and on how all components are organized and balanced
inside the chip. The lower complexity of the Intel SCC and Tilera TilePro64 cores
allows a larger number of cores inside a single chip. We expect two parallel lines
of development in the future: complex cores with a relatively low parallelism and
simpler cores with a higher parallelism.
Processor architecture The core complexity is relative to the domain of each
processor: general purpose servers, like the AMD Opteron, have to maintain good
performance on sequential code thus leading producers to maintain high complexity
as it was in high performance uniprocessors. When the target is different, simple
cores obviously allow more cores per chip and lower power consumption.
An important aspect is how Instruction Level Parallelism (ILP) is extracted. A
clean and elegant solution has been offered by the so-called Very Long Instruction
Word (VLIW) model, which makes intensive use of compile-time optimizations. In
VLIW terminology, a long instruction is the implementation of a stream element.
The main feature of a VLIW long instruction is that it is composed of n independent
instructions, of which at most one may be a branch instruction provided that the
target instruction is the first of a distinct long instruction. Another adopted solution
is Simultaneous MultiThreading (SMT) that permits multiple independent threads
of execution. Hardware multithreading, especially in the form of SMT, is having
more and more success and the Opteron is a representative example also because of
the combination of SMT with out-of-order execution.
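The two constraints on a VLIW long instruction stated above (mutually independent instructions, at most one branch) can be sketched as a small check. This is only an illustration: the Op encoding and the function name are hypothetical, independence is approximated conservatively through register overlaps, and the condition on the branch target is not modelled because it depends on the program layout.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    """One elementary instruction (hypothetical encoding for this sketch)."""
    dest: Optional[str]        # register written, or None
    sources: Tuple[str, ...]   # registers read
    is_branch: bool = False


def is_valid_long_instruction(ops: List[Op]) -> bool:
    """Check mutual independence and the 'at most one branch' rule."""
    if sum(op.is_branch for op in ops) > 1:
        return False
    for i, a in enumerate(ops):
        for b in ops[i + 1:]:
            # Conservative assumption: any register overlap involving a
            # written register makes the two instructions dependent.
            if a.dest is not None and (a.dest in b.sources or a.dest == b.dest):
                return False
            if b.dest is not None and b.dest in a.sources:
                return False
    return True


bundle = [Op("r1", ("r2", "r3")), Op("r4", ("r5",)), Op(None, ("r6",), is_branch=True)]
print(is_valid_long_instruction(bundle))
```

A real VLIW compiler performs this kind of analysis at compile time, which is why the model relies so heavily on compile-time optimizations.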
Memory and Cache hierarchy A very important aspect of CMPs is their memory
hierarchy. A first level cache per core is the norm; in fact, it is simply regarded as
part of the core. The main challenge ahead is scaling to chips with thousands of
cores. Thus there is a need to minimize the number of misses that go out of a core.
Furthermore, another open issue is how to keep good memory bandwidth and latency.
Because of the pin-count problem, in CMPs it is impossible to give each core
its own interface to memory, as in multiprocessors. In all the examples the memory
interfaces are considerably fewer than the processors, and this will remain
the trend for future architectures. Regarding the memory organization, we usually
have SMP architectures, but with complex interconnections and multiple interfaces
the architectures usually become NUMA.
Interconnection Network The interconnection network is another important
characteristic because it defines how cores exchange data among each other and
the cost of each communication. Simple interconnections like crossbar and bus are
typically used in CMPs with a small number of cores and in hierarchical
configurations of the network. A crossbar is an all-to-all connection that keeps the latency
constant and minimizes conflicts, but it can be applied to a limited number of nodes
because of the number of links (quadratic in the number of nodes). Bus and rings
(uni-dimensional mesh) are characterized by a latency proportional to the number
of nodes, so are not adopted in highly parallel CMPs. The current complexity of
the cores is probably the reason for which fat tree and butterfly interconnections are
still missing. The meshes used by Tilera TilePro64 and Intel SCC are simpler and
take less space. However we expect that, as the number of cores will increase, low
latency interconnections like mesh, fat tree and butterfly will be also implemented
on-chip [97].
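The scaling arguments above can be made concrete with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, the crossbar cost is counted as n x n crosspoints, the linear structure is taken in its worst case, and only square two-dimensional meshes are handled.

```python
import math


def crossbar_links(n: int) -> int:
    """Crosspoints of an n x n crossbar: quadratic in the number of nodes."""
    return n * n


def linear_worst_hops(n: int) -> int:
    """Worst-case distance on a one-dimensional structure: linear in n."""
    return n - 1


def mesh_worst_hops(n: int) -> int:
    """Worst-case distance on a square k x k mesh with n = k * k nodes."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("this sketch only handles square meshes")
    return 2 * (k - 1)


for n in (4, 16, 64):
    print(n, crossbar_links(n), linear_worst_hops(n), mesh_worst_hops(n))
```

For 64 nodes the crossbar needs 4096 crosspoints, while the worst-case distance is 63 hops on a linear structure and 14 hops on an 8 by 8 mesh.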
Atomic operations and synchronizations Current CMPs usually support
atomic memory operations for core synchronization. Dedicated inter-core
synchronization mechanisms could be much faster and allow fine-grain parallelism.
Many works in the literature study this problem [105]. PowerEN uses optimized
synchronization mechanisms that stop the execution of a thread until a specified
memory location is written by other threads. Tilera TilePro64 instead directly uses
the mesh to exchange data, without accessing memory. It appears that message
passing can give a significant latency reduction in inter-core communication, and
both Tilera TilePro64 (with instructions between registers) and Intel SCC provide
this solution.
2.2 Cache Coherence
The cache coherence problem arises from the possibility that more than one cache of
the system may maintain a copy of the same memory location. If different processors
transfer into their cache the same cache line, it is necessary to ensure that copies
remain consistent with each other and against the copy in the memory hierarchy.
Cache coherence schemes include protocols and policies that prevent the existence
of copies of writable data in more than one cache at the same time.
2.2.1 Automatic Cache Coherence
Off-the-shelf systems usually implement the cache coherence techniques entirely at
the hardware level. Two main techniques, called automatic cache coherence tech-
niques, are used:
• invalidation, in which when a line is modified on a private cache, all the other
caches remove (if present) the old value;
• update, in which each modification of a cache line is communicated
(broadcast) to all other caches.
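The difference between the two techniques can be sketched with caches modelled as plain dictionaries. This is a toy model under stated assumptions: the function names are hypothetical, each cache maps a line identifier to a value, and in the update case caches that do not hold the line simply ignore the broadcast.

```python
def write_invalidate(caches, writer, line, value):
    """The writer updates its copy; every other cache drops its stale copy."""
    for cid, cache in caches.items():
        if cid != writer:
            cache.pop(line, None)
    caches[writer][line] = value


def write_update(caches, writer, line, value):
    """The writer updates its copy; the new value reaches the other holders."""
    caches[writer][line] = value
    for cid, cache in caches.items():
        if cid != writer and line in cache:
            cache[line] = value


inv = {0: {"A": 1}, 1: {"A": 1}, 2: {}}
write_invalidate(inv, 0, "A", 2)
print(inv)

upd = {0: {"A": 1}, 1: {"A": 1}, 2: {}}
write_update(upd, 0, "A", 2)
print(upd)
```

With invalidation, cache 1 loses the line and will miss on its next access; with update, cache 1 keeps a valid copy at the price of one message per write.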
Cache Coherence Protocols
In all the systems which use the automatic techniques, a proper protocol must exist
in order to perform the required actions atomically.
In a generic cache coherence protocol each line in a cache has a state associated
with it, along with the tag and data, which indicates the disposition of the line.
The cache policy is defined by the cache block state transition diagram, which is a
finite state machine specifying how the state of a line changes. While only cache
lines that are actually in cache have state information, logically, all blocks that are
not resident in the cache can be viewed as being in either a special “not present”
state or in the “invalid” state.
The MSI protocol The MSI protocol is a basic invalidation-based protocol for
write-back caches.
The protocol uses the three states (i.e. Modified, Shared and Invalid) required for
any write-back cache in order to distinguish valid cache lines that are unmodified
from those that are modified (dirty). Before a shared or invalid line can be written
and placed in the modified state, all the other potential copies must be invalidated.
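As an illustration, the state transition diagram of MSI can be written as a lookup table. The sketch below follows the usual textbook formulation for a bus-based system; the event and bus action names (PrRd, PrWr, BusRd, BusRdX, Flush) are conventional labels assumed here, not taken from this thesis.

```python
# (current state, event) -> (next state, bus action issued by this cache)
MSI = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),   # other copies must be invalidated
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),   # dirty data is written back
    ("M", "BusRdX"): ("I", "Flush"),
}


def access(states, cid, op):
    """Apply a processor event on cache cid; the other caches snoop the bus."""
    states[cid], bus = MSI[(states[cid], op)]
    if bus is not None:
        for other in range(len(states)):
            if other != cid:
                states[other], _ = MSI[(states[other], bus)]
    return bus


states = ["I", "I"]          # state of one line in two private caches
trace = []
for cid, op in [(0, "PrRd"), (1, "PrRd"), (0, "PrWr"), (1, "PrRd")]:
    bus = access(states, cid, op)
    trace.append((bus, list(states)))
    print(cid, op, bus, states)
```

The third access shows the rule stated above: before cache 0 can move the line to the modified state, the copy in cache 1 is invalidated.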
The MESI protocol The MESI protocol (known also as Illinois protocol due to
its development at the University of Illinois at Urbana-Champaign [87]) is a widely
used cache coherence protocol. It is the most common protocol that supports
write-back caches.
The MESI protocol adds an Exclusive state to reduce the traffic caused by writes
of cache lines that only exist in one cache. This new state indicates an intermediate
level of binding between shared and modified:
• unlike the shared state, the cache can perform a write and move to the modified
state without further requests;
• it does not imply ownership (memory has a valid copy), so unlike the modified
state, the cache need not reply upon observing a request for the block from
another cache.
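The benefit of the Exclusive state shows up when a processor reads and then writes data that no other cache holds. The sketch below is a simplified, hypothetical encoding of the processor-side and snoop-side transitions; the bus action names and the others_have_copy flag (standing for the shared signal on the bus) are assumptions of this example.

```python
def mesi_processor_event(state, event, others_have_copy):
    """Return (next state, bus action) for a processor read or write."""
    if event == "PrRd":
        if state == "I":
            return ("S" if others_have_copy else "E", "BusRd")
        return (state, None)          # hit in S, E or M
    if event == "PrWr":
        if state in ("M", "E"):
            return ("M", None)        # E -> M needs no further request
        return ("M", "BusRdX")        # from S or I, invalidate other copies
    raise ValueError(event)


def mesi_snoop(state, bus_action):
    """Return (next state, reply) when another cache's request is observed."""
    if state == "I":
        return ("I", None)
    reply = "Flush" if state == "M" else None   # only M must supply data
    return ("S" if bus_action == "BusRd" else "I", reply)


state = "I"
state, first = mesi_processor_event(state, "PrRd", others_have_copy=False)
state, second = mesi_processor_event(state, "PrWr", others_have_copy=False)
print(first, second, state)
```

The read costs one bus transaction and the following write costs none, whereas under MSI the same sequence would issue a second transaction for the write.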
The MO(E)SI protocol The MOSI protocol is another extension of the basic
MSI cache coherence protocol. It adds the Owned state. A cache line in the owned
state holds the most recent, correct copy of the data. The MOESI protocol, in-
troduced in [103], encompasses all of the possible states commonly used in other
protocols.
The Dragon protocol The Dragon protocol is a basic update-based protocol for
write-back caches. This protocol was first proposed by researchers at Xerox PARC
for their Dragon multiprocessor system [74].
The Dragon protocol consists of four states (i.e. Exclusive-clean, Shared-clean,
Shared-modified and Modified) which are comparable to MESI’s states.
Multilevel Cache Hierarchies As seen in Section 2.1 many systems use private
first (L1) and second level (L2) caches. Multilevel cache hierarchies would seem to
complicate coherence. Notably, we distinguish between inclusive and victim 2-level caches:
• with inclusive cache: cache lines in L1 are a proper subset of cache lines
currently allocated in L2,
• while, in victim cache: it is possible that a cache line in L1 is not currently
allocated in L2.
L1 and L2 are independent processing units, thus independent cache controllers ex-
ist for each hierarchy level. However, the distinction inclusive vs victim is relevant
mainly for the actual form of cache coherence control: in victim cache, coherence
control is actually decentralized between L1 and L2, while in inclusive cache, coher-
ence control can actually be implemented in L2 only.
In the inclusive case, automatic solutions also need to preserve the inclusion prop-
erty, which requires the following:
• if a memory block is in the L1 cache, then it must also be present in the L2
cache; in other words, the contents of the L1 cache must be a subset of the
contents of the L2 cache;
• if the cache line is in an owned state (e.g., modified in MESI, owned in MOSI,
shared-modified in Dragon) in the L1 cache, then it must also be marked
modified in the L2 cache.
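The two requirements can be expressed as a small invariant check. This is a sketch under assumptions: each cache level is modelled as a dictionary from line address to a state label, and the labels "M", "O" and "Sm" are hypothetical names for the modified, owned and shared-modified states.

```python
OWNED_STATES = {"M", "O", "Sm"}   # modified (MESI), owned (MOSI), shared-modified (Dragon)


def inclusion_holds(l1: dict, l2: dict) -> bool:
    """Check the inclusion property between an L1 and its L2."""
    for line, state in l1.items():
        if line not in l2:
            return False          # every L1 line must also be in L2
        if state in OWNED_STATES and l2[line] != "M":
            return False          # owned in L1 implies modified in L2
    return True


print(inclusion_holds({0x40: "S"}, {0x40: "S", 0x80: "E"}))
print(inclusion_holds({0x40: "M"}, {0x40: "S"}))
```

The first call satisfies both requirements; the second violates the one on owned lines, because the L2 entry is not marked modified.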
Keeping the inclusion property brings performance advantages, even in systems
where a level of the cache hierarchy is shared (as shown in [51]), because, as noted
above, it avoids unnecessary communications.
Some techniques used to maintain the inclusion property are presented in [18].
Snoopy vs Directory
Automatic cache coherence techniques allow programmers to develop programs with-
out taking into account the cache coherence problem. In fact no explicit coherence
operations must be inserted in the program.
Two main classes of architectural solutions have been developed for automatic
caching:
• Snoopy-based, in which the cache coherence protocol exploits a real
centralization point (e.g., a single bus) and the associated snooping and broadcast
operations;
• Directory-based, which implements cache coherence protocols using shared
data (e.g., in main memory); this solution is adopted in highly parallel systems,
with powerful interconnection networks.
In Snoopy-based systems each of the devices connected to the interconnection network
can observe every network transaction, e.g., every read or write request. When a
processor issues a request to its cache, the cache controller examines the state of the
cache and takes suitable action, which may include generating network transactions
to access memory or other caches. Coherence is maintained by having all cache
controllers “snoop” on the network and monitor the transactions from other nodes.
An interesting issue that arises with the use of automatic techniques based on snoop-
ing is the cache-to-cache sharing : when more than one cache has a valid copy of the
line, it is necessary to have a selection algorithm to determine which of these should
provide the data. The MESIF protocol was developed by Intel [46] to solve this
problem. To do this the protocol uses a new state, Forward, which indicates that the
cache should act as a designated responder for any requests for the corresponding
cache line.
In highly parallel systems interconnection structures are used to allow greater
scalability than what can be achieved with linear latency networks. This choice is
also reflected in the decision to integrate automatic cache coherence mechanisms
that scale better than the solutions based on Snoopy bus or similar interconnection
networks.
Scalable cache coherence is typically based on the concept of a directory. Since
the state of a cache line in the caches can no longer be determined implicitly by
placing a request on a shared bus and having it snooped by the cache controllers,
the idea is to maintain this state explicitly in a place, called directory. This way,
memory operations can be sent only to the set of nodes that are actually interested.
In the following, we will use the term node to represent a processing element (pro-
cessor or core), its private cache subsystem and cache controller(s). Notably, in a
directory-based solution we can apply the following definitions to a shared cache line:
• the home node: is the node in whose main memory the cache line is allocated,
or more generally, is the node in charge of controlling a given partition of cache
lines;
• the local (or requestor) node: is the node that issues a request for the cache
line;
• the owner node: is the node that holds the valid copy of the line in its local
cache, and must supply the data when needed.
Depending on where the directory is maintained, we can distinguish two main
schemes:
• memory-based schemes, that store the directory information about all cached
copies at the home node of the cache line;
• cache-based schemes, where the information about cached copies is not all
contained at the home but is distributed among the copies themselves; the
home simply contains a pointer to one cached copy of the block; each cached
copy then contains a pointer to the node that has the next cached copy of the
block, in a distributed linked list organization.
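The two organizations can be contrasted with a minimal data structure sketch. Class and method names are hypothetical, and the cache-based variant assumes that a new sharer is inserted at the head of the list; the point is only where the sharing information lives and how an invalidation finds its targets.

```python
class MemoryBasedDirectory:
    """The home node keeps the full set of sharers for each line."""

    def __init__(self):
        self.sharers = {}                      # line -> set of node ids

    def add_sharer(self, line, node):
        self.sharers.setdefault(line, set()).add(node)

    def invalidation_targets(self, line, writer):
        return sorted(self.sharers.get(line, set()) - {writer})


class CacheBasedDirectory:
    """The home keeps only a head pointer; the copies form a linked list."""

    def __init__(self):
        self.head = {}                         # line -> first sharer
        self.next = {}                         # (line, node) -> next sharer

    def add_sharer(self, line, node):
        if (line, node) in self.next:
            return                             # already in the list
        self.next[(line, node)] = self.head.get(line)
        self.head[line] = node

    def invalidation_targets(self, line, writer):
        targets, node = [], self.head.get(line)
        while node is not None:                # one hop per cached copy
            if node != writer:
                targets.append(node)
            node = self.next[(line, node)]
        return sorted(targets)


results = []
for directory in (MemoryBasedDirectory(), CacheBasedDirectory()):
    for node in (1, 2, 3):
        directory.add_sharer(0x100, node)
    results.append(directory.invalidation_targets(0x100, writer=2))
print(results)
```

Both variants return the same targets; the memory-based one reads them in a single access at the home, while the cache-based one must traverse the list across nodes.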
The directory-based approach minimizes traffic in the interconnection networks;
however, keeping and accessing the directory has a cost in terms of space (the
directory must reside in a fast memory) and of increased memory access latencies.
Some works in the literature have discussed alternative automatic solutions: in
[44] the authors propose an MSI implementation within the network, while in [94]
the authors propose a new directory scheme and a cache coherence scheme based
on it for a mesh interconnection. However, the cost in terms of space is simply
moved from the memory or the caches to the network switches, without a significant
improvement from the performance point of view.
2.2.2 Software or non-Automatic Cache Coherence
Software managed coherence represents an alternative design point that places the
burden of maintaining coherence on compilers, libraries, and runtime systems.
Software and hybrid cache coherence Software-directed cache coherence on
shared memory multiprocessors has been proposed as an alternative to automatic
cache coherence. Such schemes are based on self-invalidations and forced write-backs
of the private caches at synchronization points (some require a write-through cache
instead of forced write-backs). They rely on explicitly marked synchronization
points (often simply the boundaries of parallel loops). Additionally, to filter out
unnecessary invalidations, these schemes rely on the programmer [98] or the compiler
[29, 30] to identify what data is involved in the communication and, thus, must be
written back and invalidated. Unfortunately, these techniques require sophisticated
analysis, and are applicable only to very regular array-based or loop-based computations
where static analysis can identify the communicated data, or when the programmer
can clearly specify sections and their data.
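The mechanism of forced write-backs and self-invalidations at synchronization points can be sketched with a toy private cache. Everything here is hypothetical (class name, methods, the dictionary used as shared memory), and the ordering of the two sync_point calls stands in for a real barrier.

```python
class SoftwareManagedCache:
    """Private write-back cache without hardware coherence (illustrative)."""

    def __init__(self, memory):
        self.memory = memory      # shared dict: address -> value
        self.lines = {}           # locally cached copies
        self.dirty = set()

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def sync_point(self, shared_addrs):
        """Forced write-back of dirty shared data, then self-invalidation."""
        for addr in shared_addrs:
            if addr in self.dirty:
                self.memory[addr] = self.lines[addr]
                self.dirty.discard(addr)
            self.lines.pop(addr, None)


memory = {"x": 0}
producer, consumer = SoftwareManagedCache(memory), SoftwareManagedCache(memory)
before = consumer.read("x")       # consumer caches the old value
producer.write("x", 42)
stale = consumer.read("x")        # still the old value: no coherence yet
producer.sync_point(["x"])        # producer side of the synchronization
consumer.sync_point(["x"])        # consumer side of the synchronization
after = consumer.read("x")
print(before, stale, after)
```

The shared_addrs argument plays the role of the information supplied by the programmer or the compiler: only the data involved in the communication is written back and invalidated.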
Other works [28] have proposed additional hardware support to assist with the ac-
tual invalidation, e.g., using per-word tags updated by software to identify cache
lines that do not have to be reloaded after a synchronization point. However, they
require complex program analysis to statically manage the update of the tags, ob-
taining good results only for programs with simple control flow and regular data
accesses.
Hybrid self invalidation Self-invalidation is a technique used to invalidate cache
lines locally and it has been proposed as a mechanism to reduce the amount of
coherence transactions. Compiler-directed schemes [84] still rely on an automatic
coherence mechanism to ensure correctness, hence without obtaining a real
improvement on performance with respect to “pure” automatic solutions.
Software distributed shared memory Several works have focused on software
and hybrid distributed shared memory systems [80, 57]. These approaches are
mainly targeted at clusters of workstations; they not only deal with cache coherence
but also provide a shared memory environment.
Software-only directory protocol This approach has evolved along two main
directions: either a separate protocol processor is used to execute the software
handlers that emulate the cache coherence protocol, or the handlers are executed on
the compute processor. The first direction is represented by the Stanford FLASH [53]
and the Wisconsin Typhoon [93]. Several design efforts aim at executing the software
handlers on the compute processor, e.g., the MIT Alewife [5] and the Cooperative
Shared Memory [54]. Both approaches have a bad effect on performance in highly
parallel systems [2], mainly because of the limited possibility of optimizing for the
specific algorithm.
2.3 Existing Evaluations of Cache Coherence Solutions
The most common parallel performance models [47, 106, 36] for multiprocessors
are built for parallel algorithm designers, who are not interested in particular ar-
chitectures, but look for algorithms that perform well in general. The models are
therefore based on an asymptotic prediction of the performance, exactly as the com-
plexity order is analyzed in sequential algorithms. The idea of allowing a simple and
machine-independent study of the algorithm is indeed in contrast with our idea of a
detailed and machine-dependent prediction, able to compare different cache coher-
ence solutions.
Several works in the literature have tried to evaluate the overhead introduced on
read and write operations by automatic cache coherence solutions [43, 37, 6].
In general the problem was addressed by adopting several simplifications on the
workload model, and analytically deriving the coherence overhead by using tools
such as Markov chains [42, 90], Generalized Timed Petri Nets [110] or queueing
networks [112].
However, modeling the behavior of cache coherence analytically is still an open
research problem, because of the strong influence of the program (in particular its
data access pattern) on the coherence traffic.
Figure 2.2: A first comparison of automatic and non-automatic techniques (ratio of SW efficiency to HW efficiency plotted against fMR, with one curve per value of cons)
First results on the comparison of automatic and non-automatic techniques have shown
that the software schemes are comparable to directory-based protocols for a wide
class of computations [4]. In this work, the authors classify shared objects into classes:
for example, mostly-read objects represent data written very infrequently and may
be read more than once by multiple processors before a write by some processor,
while frequently read-written objects represent data written frequently and also read
by multiple processors between writes. Figure 2.2 shows the ratio between the processor
efficiency of a software scheme and that of an automatic scheme, where fMR represents
the fraction of references to mostly-read objects. There are separate curves
to represent the effect of conservative analysis in the software scheme. Automatic
cache coherence significantly outperforms non-automatic solutions for the mostly-read
class of data. However, if memory access conflicts can be detected accurately at
compile time (> 0.9), the non-automatic scheme is competitive with the automatic
scheme for most cases. The performance of classical software solutions is limited
by the need to use compile-time information to predict run-time behaviour, forcing
these approaches to be conservative. On the other hand, for well structured programs
(many scientific programs fall under this class) non-automatic schemes are
comparable (within 10%) and in some cases better than automatic schemes.
2.4 Parallel Programming on CMPs
CMPs inherit their ideas from Shared Memory Multiprocessors and, for this reason
and because many of these processors are also available in multiprocessor configurations,
they can be programmed using all the tools developed in the past for SMP architectures.
However, as seen in Section 2.1, CMPs have specific features that have to be
exploited.
As a consequence, a lot of new parallel programming environments emerged. How-
ever, in our opinion, many of these are very similar to previous tools and they still
do not exploit all the possibilities offered by these new architectures.
These approaches are independent of the underlying architecture, but they lack the
advantages of structured parallel programming [109]: they do not have sufficient
expressive power to support the high-level development of complex applications,
nor do they provide “performance portability”.
The way typically adopted to obtain an efficient solution for a specific CMP is
to drop to a lower level of implementation, writing code strongly tied
to that specific platform. However, performance predictions
are not achievable in this way, because a detailed cost model of the architecture is still
unavailable.
In this section we briefly analyze the most important tools available for shared mem-
ory multiprocessing and some specific for CMPs. We will first present programming
languages and then libraries.
2.4.1 Programming Languages
Parallel programming languages are usually extensions of sequential languages like
C and Java; only some of them are actually new programming languages. Compared
with shared memory and message passing libraries, they make the parallel program
somewhat easier to define. However, the programmer still has to define all
the execution flows and the cooperation among them. For these reasons we consider
these solutions a “low level” approach to parallel programming.
OpenMP [38] is probably the best-known tool for shared memory parallel programming.
It provides an extension for different sequential languages: C, C++
and Fortran. With OpenMP the main program runs sequentially and, at specified
points, the code is executed in parallel. The several languages supported offer
programmers good portability among different architectures. However, each
supported language has a different compiler, and different compilers may implement the same
program in different ways on the same architecture, making it impossible to predict
performance. This is an example of a language that does not exploit any specific
feature of CMPs.
Cilk [71] is a task-parallel extension of C designed for multithreaded parallel
computing, recently commercialized by Intel as Cilk Plus (a C and C++ extension).
As in OpenMP, the programmer uses a few keywords to
define the parallel parts of the code and the synchronization points. It has the same
disadvantages as OpenMP, principally no exploitation of CMP-specific characteristics.
Berkeley Unified Parallel C [45] has an important characteristic that distinguishes
it from the previous languages: it uses a Partitioned Global Address Space, so
it can also be ported to distributed memory systems. An advantage over previous
solutions is related to the run-time support, which is being optimized for CMPs. It
also provides some collective operations, but it is still a low-level shared memory
language from the point of view of performance portability.
Erlang [111] was mainly targeted at distributed systems because it is a
message-passing parallel programming language. However, a shared memory
implementation has recently been developed. This too is a low-level solution, in which the
programmer has to define the entire parallel application.
Go Programming Language [33] is a recently introduced object-oriented lan-
guage that exhibits a C-like syntax and greatly focuses on concurrency. Parallelism
is automatically achieved by using “goroutines”, functions specifically marked to
be executed concurrently. Goroutines can exchange data by using asynchronous
channels; thus, Go closely resembles a low-level message-passing parallel language.
2.4.2 Libraries
There are many different libraries for parallel programming of CMPs and, in order to
have a simple classification, we have chosen representative examples that summarize
typical choices adopted. The various libraries described below are listed from low
level to high level parallel programming.
Posix threads [26], usually referred to as Pthreads, is one of the lowest level
libraries for parallel programming of shared memory systems. It gives the
programmer access to OS threads and basic mechanisms for synchronization. There are
no distributed memory versions of the library. The definition of the parallel
application is entirely delegated to the programmer, and there are no optimizations
for CMPs.
MPI [102], the Message Passing Interface, allows the programmer to define a set of processes, each with
a local environment, and provides primitives for send and receive and
for some collective communications. A shared memory implementation is also
available, as are specific implementations for high performance interconnection networks.
MPI offers good portability among different platforms; however, we consider it a low
level library which could be used as a sort of run-time support for a high-level
parallel programming environment.
Intel Libraries [92, 83, 73]: Intel Threading Building Blocks (TBB) and Intel
Array Building Blocks (ArBB) are libraries provided for Intel multiprocessors, while
Intel RCCE is the most recent Intel library, provided for the Intel SCC. TBB is
a stream-parallel library, while ArBB is specialized for data-parallel programming.
However, neither has optimizations for CMPs. RCCE tries to fill this gap, offering
a message-passing programming model with collective communications. Because
there is no cache coherence among cores in SCC, this library is an interesting example
which shows how message passing can be an effective way to avoid having to provide
cache coherency in CMPs.
FastFlow [15] is a CMP library created by our research group. It provides a lock-
free and wait-free communication channel that can be used directly by the program-
mer. Moreover, the library implements on top of this level a generic master-worker
skeleton, which can be used to define stream-parallel and data-flow applications.
We consider it an interesting approach to CMP parallel programming. However, no
cost models are used, and performance tuning is still a concern of the programmer.
Skandium [72] is a skeleton-based high-level parallel library specifically targeted
to CMPs. It offers several task and data-parallel skeletons. However, the library
is based on Java threads, which make no distinction between multiprocessor and
multicore architectures. Furthermore, performance can partially be limited by the
use of the Java language.
2.5 Summary
In this chapter we introduced the reader to the world of chip multiprocessors. We
presented the current state of the art in CMPs and the most likely future trend, in
which the number of processors per chip will increase to the point that programming
these chips will become even more difficult.
We introduced the reader to the cache coherence problem, describing off-the-shelf
automatic solutions and comparing them with the software, or non-automatic, approach. The latter
solutions have been proposed by the research community to alleviate the performance
degradation of automatic cache coherence under specific data access patterns. However,
comparing and evaluating the effect of cache coherence solutions is still an open
research problem.
Given the level of detail of certain works, it was not really feasible to introduce and
describe all of them here. We therefore decided, for the sake of readability, to keep
this chapter at an introductory level.
Finally, we presented the most important programming tools used for generic mul-
tiprocessors and CMP-based multiprocessors. Notably, we highlighted the most
important problem in software development, which is the absence of complete envi-
ronments especially targeted at multicores, able to fully exploit these architectures
without manual intervention of the programmer.
CHAPTER 3
Our Methodology: Programming and Cost Models
In this chapter we summarize the basic features characterizing the methodology
proposed by our research group and used in this thesis. Our idea starts from the
description of a parallel application as a directed graph whose nodes are
cooperating (parallel) modules. By “solving the computation graph” we are able to
understand which modules represent bottlenecks for the application performance.
The next step consists in providing a functionally equivalent parallel computation
for each bottleneck, without modifying the semantics and the logical interfaces with
respect to the sequential version. In order to evaluate and to compare some alter-
native versions of the parallel transformation, we use proper performance metrics
according to a cost model. This performance model is parametrically dependent on
the application characteristics and on the target parallel architecture(s).
Over the last decade a significant research effort has been invested in studying and
developing new programming models and frameworks for parallel computations. A
big challenge has been the definition of approaches which render parallel program-
ming easy to use, improving the reuse of existing components to create different
and more complex systems and providing performance portability without requiring
intensive interventions of the programmer to tune the performance of each applica-
tion. Portable parallel applications should be able to be used on different computing
platforms without modifying the program source code, and the porting phase should
also be able to exploit in the best way possible the physical aspects of the underlying
architecture.
As is well established by the scientific community [41, 108, 101], a high-level
approach is the only solution to performance portability of parallel applications.
Structured parallel programming is probably the most interesting class of high-level
parallel models. This approach allows the programmer to define the parallel pro-
gram having in mind only an abstract high-level view of the application, while all
the most critical implementation choices (e.g., the parallelism degree, task granu-
larity, process/data mapping on corresponding processing elements) are left to the
programming tools and run-time support.
In the first part of this chapter, we introduce the general concepts of a structured
programming model. Then, we describe our methodology, which has its roots in the
ASSIST programming language [109] developed by our research group some years
ago, comparing it with other classic structured programming models.
Notably, we introduce the workflow of our parallel compiler, highlighting the open
research points that will be partially addressed in this thesis.
3.1 Structured Parallel Programming
Structured parallel programming is probably the most powerful class of high-level
parallel models. It started with the concept of algorithmic skeletons defined by Cole
[32] and has been successfully applied in a range of parallel environments, from
clusters [39] and shared memory machines [72] to grid [13], cloud and
pervasive environments [21]. Two of the most important points of structured parallel
programming are the ability to automatically create different parallel implemen-
tations starting from the high-level description, and the parametric nature of the
produced code, that is able to run with different parallelism degrees. These points
are the basic building blocks to ensure performance portability on the various
architectures. Structured parallel programming also allows composability: a parallel code
can be mixed with others, such that an application can be described as a collection
of parallel kernels, instead of a single, large parallel code. Composability also allows
reuse: the “kernel” code can be reused inside different programs with no modifica-
tions.
Our research group's history in structured parallel programming is quite long,
starting with the P3L skeleton language in 1992 [17], and culminating with ASSIST
in recent years. These projects proposed many interesting developments for paral-
lel programming: parallel code restructuring [9], to better exploit the composition
of parallel kernels; efficient fault tolerance [20], and dynamic reconfigurations up
to self-adaptive programs [77], i.e. programs that are able to exploit performance
portability dynamically, at run-time, to better fit dynamic environments such as
grids or clouds. Finally, we also extended the concept of High Performance
Computing to Grid computing [13] and lately to Pervasive Grids. We never, however,
really focused our efforts on multicore and shared memory architectures in general.
Our experiments with FastFlow [12] demonstrated the need, and the possibility, of
multicore-specific optimizations.
The structured parallel programming methodology is based on the concept of
parallelism forms, also called parallelism paradigms or parallel patterns. Parallel
paradigms are schemes of parallel computations that recur in the realization of many
real-life algorithms and applications. They exhibit the following features:
• they are characterized by constraints in the parallel computation structure;
• they have a precise semantics;
• their behavior can be predicted through a suitable performance model;
• they can be composed to form complex graph computations.
We can characterize two broad categories of parallel paradigms: stream parallelism
and data parallelism.
3.1.1 Stream Parallelism
Parallelism forms belonging to this class are able to improve the throughput of a
computation when a large sequence (possibly of unlimited length)
of input elements is available (i.e. a stream-based computation). For example,
a specific computation may be applied to a stream of images or video frames
represented as matrices. The existence of a large sequence of input elements is a
necessary precondition for applying these parallelization techniques; conversely,
no performance improvement can be obtained for a single input element or a
limited set of them. Parallelism schemes that follow this assumption are
the task-farm and the pipeline.
Pipeline In a pipeline computation the sequential code is divided into multiple
pieces executed concurrently. Applying the pipeline paradigm requires
some knowledge of the form of the sequential computation: the sequential
computation must be expressed (or rewritten) as the composition of n functions:
$F(x) = F_n(F_{n-1}(\ldots F_2(F_1(x)) \ldots))$
In this case, a pipeline parallelization consists of a set of (at most) n entities
$\{S_1, \ldots, S_n\}$, called stages, each executing one (or more) of the n functions. Each
stage $S_i$ receives each element of the stream and computes its function $F_i$ on it. The
output of each stage is sent to the next one, respecting the function ordering (i.e.,
the output of $F_i$ is sent to $F_{i+1}$), so that the output of the last stage (i.e., $F_n$)
corresponds to $F(x)$, as depicted in Figure 3.1.
[Diagram: stages S1, ..., Si, ..., Sn connected in sequence, computing F1, ..., Fi, ..., Fn]
Figure 3.1: Pipeline parallelism form
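The pipeline form can be illustrated with a minimal sketch, written here in Python under our own assumptions: each stage is a thread, consecutive stages are connected by FIFO queues, and the names `run_pipeline`, `stage` and `_END` are hypothetical helpers introduced only for this example. It is not the implementation used by any of the environments discussed in this thesis.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream (illustrative choice)


def run_pipeline(functions, stream):
    """Apply functions[0], functions[1], ... to each stream element, one stage per thread."""
    # queues[i] feeds stage i; the last queue collects the final results.
    queues = [queue.Queue() for _ in range(len(functions) + 1)]

    def stage(f, q_in, q_out):
        while True:
            item = q_in.get()
            if item is _END:
                q_out.put(_END)  # propagate termination to the next stage
                return
            q_out.put(f(item))

    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(functions)
    ]
    for t in threads:
        t.start()
    for x in stream:
        queues[0].put(x)
    queues[0].put(_END)

    results = []
    while True:
        item = queues[-1].get()
        if item is _END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# F(x) = F3(F2(F1(x))) with three illustrative stage functions
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs = run_pipeline(stages, range(5))
```

With these three functions the stream 0, 1, 2, 3, 4 produces -1, 1, 3, 5, 7 in input order, since every stage is a single thread reading from a FIFO queue. In CPython the threads mainly illustrate the structure: for CPU-bound stages the interpreter lock limits the actual speedup.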
Task-Farm Task-farm is a stream-parallel scheme based on the replication of a
pure function among a set of identical workers, without knowing the internal struc-
ture of the function itself. Figure 3.2 shows the farm internal structure as a com-
putation graph. An emitter module, according to a certain scheduling strategy,
distributes each input stream value to a worker. The general objective is to balance
the workers’ loads, in order to best exploit their processing capabilities. A
possible scheduling strategy is round-robin, i.e. circular distribution. However,
this strategy cannot ensure load balancing if the calculation time of a worker
has a significantly high variance, especially when it depends on the
input data values. In that case an on-demand approach is much more effective:
a task is sent to a worker only when that worker is available to accept
a new input task. The collector module is essentially an output interface which is
responsible for transmitting onto the output streams the results which are received
nondeterministically from the workers, possibly (but not necessarily) applying an
ordering strategy. From the performance viewpoint a task-farm scheme has the main
advantage of reducing the mean service time of the computation.
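[Diagram: emitter E (service time TE) distributing to workers W1, ..., Wi, ..., Wn (each computing F, service time TW), which send their results to collector C (service time TC)]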
Figure 3.2: Farm parallelism form
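As an illustration, the following Python sketch shows one possible way to obtain on-demand scheduling: the emitter writes into a single shared queue, and each worker takes a new task only when it is free. The collector restores the input order, which is one of the optional ordering strategies mentioned above. The names `run_farm` and `_END` are hypothetical, and the code is a sketch under these assumptions rather than the structure of any specific library.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream (illustrative choice)


def run_farm(f, stream, n_workers):
    """Apply the pure function f to every stream element using n_workers workers."""
    tasks = queue.Queue()    # emitter -> workers (shared queue: on-demand scheduling)
    results = queue.Queue()  # workers -> collector

    def worker():
        while True:
            item = tasks.get()  # a worker asks for a task only when it is free
            if item is _END:
                results.put(_END)
                return
            index, value = item
            results.put((index, f(value)))

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()

    # Emitter: tag each element with its position, then send one END per worker.
    for index, value in enumerate(stream):
        tasks.put((index, value))
    for _ in workers:
        tasks.put(_END)

    # Collector: results arrive in nondeterministic order; reorder them by index.
    collected, finished = [], 0
    while finished < n_workers:
        item = results.get()
        if item is _END:
            finished += 1
        else:
            collected.append(item)
    for w in workers:
        w.join()
    return [y for _, y in sorted(collected, key=lambda pair: pair[0])]
```

A call such as `run_farm(lambda x: x * x, range(10), 3)` returns the squares in input order, whatever the interleaving of the workers. A round-robin emitter would instead need one queue per worker, filled in circular order.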
3.1.2 Data Parallelism
Data-parallel computation is characterized by partitioning (and/or replication) of
data structures and function replication, so that distinct, functionally identical mod-
ules (workers) are able to apply the same operations to distinct data partitions in
parallel. In this scheme, an input module provides the distribution of each input
element among the set of workers according to proper collective communications:
scatter for sending distinct partitions of the input to distinct workers, or multicast
for sending the same input. Collection of the worker results is achieved by the
output module exploiting a gather operation: the collector receives worker results and
builds a unique data structure, such as a vector or a matrix of elements.
The data-parallel paradigm is able to reduce the computation latency for a single
input element but, in the case of a large sequence of input tasks, it can also improve
the throughput of the computation by reducing the mean service time.
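A minimal Python sketch of the scatter, compute and gather phases is given below. The partitioning into contiguous blocks of near-equal size and the helper names `scatter` and `data_parallel` are our own assumptions for the example; the thread pool from the standard library stands in for the set of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, n_workers):
    """Split data into n_workers contiguous partitions of near-equal size."""
    base, extra = divmod(len(data), n_workers)
    parts, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # the first `extra` workers get one more element
        parts.append(data[start:start + size])
        start += size
    return parts


def data_parallel(f, data, n_workers):
    """Scatter data, apply f element-wise in each worker, gather the results."""
    parts = scatter(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker applies the same function to its own partition.
        partial = list(pool.map(lambda part: [f(x) for x in part], parts))
    # Gather: rebuild a single data structure, preserving the partition order.
    return [y for part in partial for y in part]
```

For ten elements and three workers, `scatter` yields partitions of 4, 3 and 3 elements. Since the workers here never read each other's partitions, this sketch corresponds to the simplest case, without data dependencies among workers.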
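[Diagram: input module IN (TIN) distributing X[N] to workers W1, ..., Wn (TW), output module OUT (TOUT) collecting Y[N]; elements Xi, Xj, Xk, Xm are shown in the workers, with Yi = F(Xi, Xj, ..., Xk)]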
Figure 3.3: Data-parallel with stencil parallelism form
In a data-parallel scheme each worker applies a sequential computation to its own
data. In order to apply this function, a worker may need to access data contained
in other workers' partitions, according to the particular data dependencies imposed
by the computation semantics.
In this case we speak of stencil-based computations (Figure 3.3), where a stencil
is a data dependence pattern implemented by information exchanges between
different workers. A stencil can be:
• Static fixed, if the dependencies are defined at compile-time and remain the
same throughout the computation;
• Static variable, if the dependencies are defined at compile-time but change
during the computation;
• Dynamic, if the dependencies are defined at run-time, depending on data struc-
ture values.
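To make the static fixed case concrete, the sketch below performs one step of a one-dimensional three-point stencil, y[i] = (x[i-1] + x[i] + x[i+1]) / 3. The stencil itself, the block partitioning and the name `stencil_step` are illustrative assumptions. The workers are emulated sequentially, so the "exchange" of border elements is represented by copying a halo around each partition.

```python
def stencil_step(x, n_workers):
    """One step of a three-point averaging stencil on the list x; the two boundary elements are kept."""
    n = len(x)
    y = list(x)
    for w in range(n_workers):  # each iteration plays the role of one worker
        lo, hi = w * n // n_workers, (w + 1) * n // n_workers
        # Local copy of the partition plus the halo elements owned by the neighbours.
        halo_lo, halo_hi = max(lo - 1, 0), min(hi + 1, n)
        local = x[halo_lo:halo_hi]
        # The worker updates only the elements it owns, skipping the global boundaries.
        for i in range(max(lo, 1), min(hi, n - 1)):
            j = i - halo_lo
            y[i] = (local[j - 1] + local[j] + local[j + 1]) / 3
    return y
```

The dependencies (left and right neighbour) are known before execution and never change, so each worker needs exactly one element from each adjacent partition at every step. The result does not depend on the number of workers, which is what makes the parallelism degree a free parameter.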
A very special, but sometimes possible, data-parallel scheme is the so-called
map, in which workers are fully independent, that is each of them operates on its
own local data only, without any communication during the execution, as shown in
Figure 3.4.
[Diagram: input module IN (TIN) distributing X[N] = X1, X2, X3, ..., XN to workers W1, ..., Wn (TW), output module OUT (TOUT) collecting Y[N] = Y1, Y2, Y3, ..., YN, with Yi = F(Xi)]
Figure 3.4: Map parallelism form
Another interesting scheme is the reduce pattern, which is applicable every time
we have a computation of the form:
$y = x_1 \oplus x_2 \oplus \ldots \oplus x_k$
where the result is a single value obtained by applying a function $\oplus$ to all the
elements of the input data structure. To ensure correctness, $\oplus$ has to satisfy the
associative property.
This computation can be easily parallelized by partitioning the data structure among nw workers
(parametric but limited by k), each performing a “local” reduce on its partition,
followed by a “global” reduce of the results of each worker. The global reduce can
be executed in several ways (even in parallel) by one (or more) of the workers.
Notable examples that naturally fit this paradigm are many vector- or matrix-based
operations such as the maximum and minimum of a vector, the dot product and
many others.
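The two phases can be sketched in Python as follows. The function name `parallel_reduce` and the contiguous partitioning are assumptions made for the example, and the global reduce is executed sequentially here, which is only one of the possible choices.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def parallel_reduce(op, data, n_workers):
    """Reduce a non-empty sequence with an associative binary operator op."""
    n_workers = min(n_workers, len(data))  # the parallelism degree is limited by k
    bounds = [(w * len(data) // n_workers, (w + 1) * len(data) // n_workers)
              for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Local reduce: each worker combines the elements of its own partition.
        partial = list(pool.map(lambda b: reduce(op, data[b[0]:b[1]]), bounds))
    # Global reduce: combine the partial results, in partition order.
    return reduce(op, partial)
```

Because partitions are contiguous and the partial results are combined in order, only associativity is required: `parallel_reduce(operator.add, ["a", "b", "c", "d", "e"], 2)` yields "abcde" even though string concatenation is not commutative.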
[Diagram: input module IN (TIN) distributing X[N] = X1, X2, X3, ..., XN to workers W1, ..., Wn (TW), output module OUT (TOUT) producing the single value y = F(X[N])]
Figure 3.5: MapReduce parallel paradigms
Finally, many algorithms can be defined as a composition of two steps, in which
first a function is applied to all the elements, and then the results are merged
by using some reduction function. We can straightforwardly represent this class of
algorithms as the composition of a map and a reduce pattern, as shown in Figure
3.5.
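As a small worked example of this composition, the dot product of two vectors can be written as a map (element-wise product) followed by a reduce (sum). In the Python sketch below the workers are emulated sequentially, and `map_reduce` and `dot` are hypothetical names chosen for the illustration.

```python
from functools import reduce
import operator


def map_reduce(f, op, data, n_workers):
    """Apply f to every element of a non-empty list, then merge with the associative op."""
    n_workers = max(1, min(n_workers, len(data)))
    bounds = [(w * len(data) // n_workers, (w + 1) * len(data) // n_workers)
              for w in range(n_workers)]
    # Each (emulated) worker maps f over its partition and reduces it locally.
    partial = [reduce(op, (f(x) for x in data[lo:hi])) for lo, hi in bounds]
    # Global reduce of the partial results.
    return reduce(op, partial)


def dot(u, v, n_workers=4):
    """Dot product as map (pairwise product) composed with reduce (sum)."""
    return map_reduce(lambda pair: pair[0] * pair[1], operator.add, list(zip(u, v)), n_workers)
```

For instance, `dot([1, 2, 3], [4, 5, 6])` evaluates 4 + 10 + 18 = 32. Fusing the map with the local reduce inside each worker avoids materializing the intermediate vector, which is one possible design choice among others.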
3.2 Parallelization Methodology and Cost Model
The algorithmic skeletons defined by Cole [32] represent the first approach to struc-
tured parallel programming. He proposed a quite small set of skeletons (Fixed
Degree Divide & Conquer, Iterative Combination, Cluster and Task Queue) ob-
tained both by the isolation of particular algorithmic techniques, and by an analysis
of patterns that could perform well on the initial target machine (a Transputer).
Starting from his idea, however, many researchers focused on finding general yet effective
patterns that could be promoted to skeletons. Among others, P3L provided
pipeline, task farm, map and reduce, plus geometric, loop and tree as data-parallel
with stencils [17]; SKELib [40] offered only stream-based skeletons (farm and pipe),
while Lithium [8] supported pipe, map, farm and reduce. Once stabilized, the set
of skeletons used basically remained the same over the years: Skandium [72], one of
the newest skeleton frameworks, implements seq, pipe, farm, for, while, map, d&c and
fork, introducing no new patterns with respect to the first works.
All these systems employ the very same concepts introduced by Cole: the user just
writes a skeletal specification, such that a program is basically a composition of
skeletons. The majority of environments define three kinds of skeletons: data paral-
lel, task parallel and sequential skeletons. Sequential skeletons encapsulate functions
written in a sequential language and are not considered for parallel execution. The
others provide typical task and data parallel patterns.
The initial specification provided by the programmer may then be subjected to a
cost-driven transformation process with the aim of improving the performance of
the parallel program. Such transformation is done by means of semantic-preserving
rewriting rules. A rich set of rewriting rules and cost models for various skeletons
has been developed in the past [14, 49, 100].
Despite several advantages of skeletons, a strong evolution of structured parallel
programming beyond such models is needed.
In addition to the capability of expressing some typical parallel schemes, we need a
larger degree of flexibility in expressing parallel and distributed program structures:
we can neither afford to produce a skeleton for every data-parallel pattern, nor force the
programmer to write applications by choosing from a small set of well studied patterns.
Although very interesting, pattern composability is still limited, which becomes a
constraint when describing large, complex applications. Finally, we recognize that parallel
patterns cannot efficiently capture every parallel application: dynamic stencils, for
example, cannot be modeled by a skeleton. We therefore need to allow some kind of cooperation
with different parallel environments, so that skeleton-based patterns can cooperate
with pre-existing, or manually optimized, parallel code.
ASSIST An interesting and effective approach to overcome the limitations of
skeleton environments has been introduced by our research group with ASSIST [109]
(A Software development System based upon Integrated Skeleton Technology). In
ASSIST an application is described by a generic graph of modules connected by
streams. This alone allows some basic stream-parallel paradigms such as pipelining,
but at the same time permits very complex behaviors and loops among the modules
that compose the application. Parallelism is also available inside the nodes, because
each module represents a parallel pattern.
ASSIST employs a novel approach to data parallelism by describing the parallel
application (and its stencil) at the minimum partitioning level. This approach, called
“Virtual Processor” (VP), generalizes the class of data-parallel algorithms and
allows the programmer to describe with a single formalism a generic data-parallel
algorithm with a static stencil. Lastly, a module is not forced to be implemented
as a parallel pattern: the programmer may provide a specific, hand-made
implementation of a parallel module. This effectively solves the cases in which a parallel
paradigm cannot be applied.
The main idea of the VP approach is to describe the application by using a set of
VPs, i.e. virtual entities that, like processors, own a partition of the data structure,
execute the calculation on it and exchange data with others.
Notably, the programmer defines the stencil at the minimum partitioning level by
using this abstraction, while the parallel environment is in charge of analyzing it
to determine if it represents one of the basic, well studied paradigms, or a new,
“unknown” stencil. In any case, a proper worker partitioning must be established,
so that the Virtual Processors become Real Processors, perhaps with a different
stencil, and perform the computation. The input data are partitioned so that each
VP “owns” a single element and is in charge of computing it by following the
owner-computes rule.
This way of describing data-parallel algorithms is indeed very powerful, because it
explicitly defines the stencil at the element level. From this, an intelligent compiler
can apply optimizations like stencil transformation [78], and optimize the stencil
with respect to the execution environment.
ASSISTANT The long-term project of our research group is ASSISTANT, the
extension and adaptation of ASSIST for the current world of parallel computing,
composed of multicores, pervasive grids and clouds. Many of the principles intro-
duced in ASSIST are inherited and extended, in order to provide a significant leap
forward in the world of CMP-oriented parallel programming.
Respecting the basic ASSIST principles, a parallel program will be described as
a generic graph of stream-connected parallel modules. Each module will consist
of one of the previously mentioned parallel patterns, or of a VP-based
description in the case of a data-parallel module. Programming models based on libraries
are considered unsuitable for achieving the desired level of programmability and
performance portability: our environment will need an intelligent source-to-source
parallel compiler, able to analyze the module-based description to determine the
possible parallel implementations, evaluate them for the target machine and, finally,
produce the source code of a low-level parallel program.
[Workflow diagram. Boxes: Application Specification (source computation expressed as a graph or workflow); Cost Model of the source computation, bottleneck detection; Parallelization of bottlenecks according to one or more parallel paradigms, selection of a parallel solution; Encoding, possibly reusing existing sequential code; Cost Model; Compiler and run-time support libraries linking; Parametric and restructurable parallel object code; Mapping, loading and deployment; Execution for autonomic application and systems: run-time feedback, cost model of reconfiguration; Dynamic program restructuring: new parallel solution, possibly with different parallel paradigms and/or parallelism degree; Monitoring and context-aware information, new bottleneck detection]
Figure 3.6: The “compilation workflow” in our programming environment
Our experience in parallel programming also indicated that there are many cases
in which performance portability is not completely achievable at compile-time: the
cost model may not be detailed enough to accurately fit the combination of application
implementation and target architecture, or some model parameters may be unpredictable
(because of both the architecture and the algorithm) so that mere compiler-based
performance portability becomes ineffective. To handle all these important cases, it
is also mandatory to support adaptivity, by means of efficient run-time reconfigura-
tions, in addition to static optimizations [77].
In addition, to better allow performance portability and adaptivity, we believe it
is necessary to allow the programmer to explicitly define different patterns for each
module. In this way, if multiple parallel patterns, with different performance char-
acteristics, are known by the programmer, we further increase the possibilities of
our compiler. This approach, which has been introduced in our works with per-
vasive grids, remains consistent with the programming model. The compiler will
then use its cost model to select (and optimize) the best among the whole set of
implementations.
Figure 3.6 represents the specific “compilation workflow” that we have in mind
for ASSISTANT. For the reasons described above, the importance
of the cost model is evident: it affects basically every step of the workflow.
Notably, starting from the computation graph, the cost model is used to recognize which
modules of the application graph represent bottlenecks. In order to eliminate, or
at least to reduce the effects of, the bottlenecks, we try to parallelize each bottleneck
module according to some parallelism paradigm. The result is a computation
graph, functionally equivalent to the initial one, in which some nodes are
transformed through the internal structured parallelization defined by the selected
parallelism form.
[Diagram: modules M1 (T1 = 10), M2 (T2 = 30), M3 (T3 = 50) and M4 (T4 = 70), connected by arcs labelled with the probabilities 0.6, 0.4, 0.7 and 0.3]
Figure 3.7: An example module graph
In the second step the compiler determines, for each parallelized module, the best
parallel paradigm and its implementation, considering both the user-provided and
the automatically derived transformations.
At the end of this step the bottleneck detection algorithm is run again, considering
the expected implementation of each module and their specific cost model. If new
bottlenecks are found, the steps for the bottleneck detection and their parallelization
are executed iteratively to further refine the parallel implementation.
The compiler then generates the low-level parallel source code, to be compiled using
a generic compiler. The resulting application, however, is also enriched with mon-
itoring tools and other possible parallel implementations, so that, by continuously
monitoring and applying the cost model, the program self-adapts to better match
the running environment and guarantee the best possible performance.
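The iterative refinement can be illustrated on a linear pipeline with a deliberately simple, hypothetical cost model, which is not the one developed in this thesis. We assume that a stage is a bottleneck when its service time exceeds the interarrival time of its input, that a stage cannot emit results faster than it receives or serves them, and that parallelizing a stage with degree n divides its ideal service time by n. The names `find_bottlenecks` and `refine` are introduced only for this Python sketch.

```python
import math
from fractions import Fraction


def find_bottlenecks(ts, t_arrival):
    """Return (stage index, interarrival time) for every stage slower than its input."""
    found, t_in = [], t_arrival
    for i, t in enumerate(ts):
        if t > t_in:
            found.append((i, t_in))
        t_in = max(t_in, t)  # interdeparture time of stage i feeds stage i + 1
    return found


def refine(service_times, t_arrival):
    """Iterate bottleneck detection and parallelization; return (degrees, rounds)."""
    base = [Fraction(t) for t in service_times]  # exact arithmetic avoids rounding effects
    ts, degrees, rounds = list(base), [1] * len(base), 0
    t_arrival = Fraction(t_arrival)
    while True:
        found = find_bottlenecks(ts, t_arrival)
        if not found:
            return degrees, rounds
        rounds += 1
        for i, t_in in found:
            degrees[i] = math.ceil(base[i] / t_in)  # assumed parallelism degree
            ts[i] = base[i] / degrees[i]            # assumed ideal parallel service time
```

With service times 30, 25, 50 and an interarrival time of 10, the first round finds the first and third stages. Once the first stage is parallelized, the second stage receives elements faster and becomes a new bottleneck, and the third stage must be refined again. After three rounds the degrees are 3, 3 and 5, which shows why detection has to be repeated after each parallelization.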
3.2.1 Performance Modeling with Queueing Networks
As already introduced, in our programming environment a parallel program is
expressed as a graph whose nodes are co-operating (parallel) modules. Consider the
example of graph computation shown in Figure 3.7. It is a directed acyclic graph
(DAG) in which nodes correspond to modules and arcs to interactions, possibly
by means of streams or single values. Moreover, each module can be described
by a parallel pattern, and is therefore able to exploit stream or data
parallelism, depending on the chosen implementation. Performance metrics are
associated to nodes (notably, internal calculation time and/or ideal service time) and
to arcs (e.g., asynchrony degree, communication latency, in some cases a probability
of utilization).
[Diagram: a queue feeding a server S, annotated with TA, TS and TP]
Figure 3.8: A computation module modeled as a queueing system
The methodology proposed in [77] aims to model the performance completely,
at any level, analyzing both the internal behavior of a single module and the
performance of the entire computation graph. It provides a performance modeling
approach expressed in terms of fundamental results in the area of Queueing Theory
and Queueing Networks. In this way we will be able to formalize important issues
related to:
• how to evaluate the performance of a graph computation starting from the
knowledge of the performance of each module;
• how to evaluate the effective performance of a module based on the ideal
performance behavior of all the modules of the computation graph;
• how to detect bottlenecks in a computation graph, that is modules that seri-
ously limit the performance of the entire application.
In this section we will just introduce the concept, needed to intuitively under-
stand the ideas and how the model works; the interested reader can refer to [77] for
more specific details.
The basic idea consists in modeling the performance of a module M (either sequential
or internally parallel) by abstracting its behavior as a queueing system, as shown
in Figure 3.8. This scheme is a logical one, not necessarily corresponding to the
real structure of the computation. From the performance evaluation viewpoint, the
logical scheme reduces the analysis complexity and makes it possible to obtain an
approximate evaluation, which is quite acceptable provided that the mathematical
and stochastic assumptions are validated.
A queueing system can be analytically defined by the following characteristics:
• Service discipline: if not otherwise stated, the FIFO discipline is assumed.
• Queue size, which is the number of elements available for storing the clients'
requests.
• The probability distribution of the random variable service time tS, which
represents the ideal time needed to serve a customer, that is, the time elapsed
between the beginning of the executions on two consecutive stream elements.
We denote with TS the mean value and with σS the variance of this random variable.
• The probability distribution of the random variable interarrival time tA, which
indicates the time interval between two consecutive arrivals of requests, with
mean value TA and variance σA.
• The probability distribution of the random variable inter-departure time tP,
which represents the time between two successive result departures from the
module, with mean value TP and variance σP.
Figure 3.9: An example module graph and its queueing network representation [figure labels: modules M1 to M4 with T1 = 10, T2 = 30, T3 = 50, T4 = 70; arc probabilities 0.6, 0.4, 0.7, 0.3]
The queue utilization factor, or equivalently the server utilization factor, is defined
as
\[
\rho = \frac{T_S}{T_A}
\]
It is a very meaningful parameter for performance evaluation, expressing a global,
average measure of the congestion degree, or traffic intensity, of the requests to the
server. When ρ > 1, the server represents a bottleneck with respect to the client(s)
requests.
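As a minimal sketch of this definition, the following Python fragment computes ρ and applies the bottleneck test ρ > 1. The function names and the numeric values are illustrative assumptions, not part of the methodology in [77].

```python
def utilization_factor(t_s: float, t_a: float) -> float:
    """Return rho = T_S / T_A (mean service time over mean interarrival time)."""
    if t_a <= 0:
        raise ValueError("the mean interarrival time must be positive")
    return t_s / t_a


def is_bottleneck(t_s: float, t_a: float) -> bool:
    """A server is a bottleneck for its clients when rho > 1."""
    return utilization_factor(t_s, t_a) > 1.0


if __name__ == "__main__":
    # Hypothetical mean times, in arbitrary time units.
    for t_s, t_a in [(30.0, 50.0), (50.0, 50.0), (70.0, 50.0)]:
        print(t_s, t_a, utilization_factor(t_s, t_a), is_bottleneck(t_s, t_a))
```

Only mean values are used here; the variances of the distributions listed above do not enter the bottleneck test.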
Each computation module can be abstracted as a queueing system and the compu-
tation graph can be described as a network of queues [70], where the departures of
some nodes establish the arrivals of others. Figure 3.9 shows the example module
graph used before and its queueing network representation.
From the network topology viewpoint, queueing networks can be categorized into
two broad classes, namely open queueing networks and closed queueing networks.
In an open queueing network a possibly infinite number of requests are generated
by source nodes, go through several nodes or even revisit a particular node more
than once, and finally leave the system. On the other hand, in a closed queueing
network requests neither arrive at nor depart from the system, but a fixed number
of requests continuously circulate through the nodes of the network. In our case, the
graph of modules depicts an open queueing network, given the presence of infinite
streams. For the sake of simplicity, our approach will be limited to acyclic computation
graphs, where each task follows a certain path, passing through each module at
most once. With this simplification, we are able to analyze the performance of this
kind of graph in a completely independent way with respect to the internal behavior
of each computation module, which may implement any parallelism paradigm. The
only parameter required is the average value of the ideal service time of each module.
The case of client-server computations with request-reply behavior will instead be
studied in Chapter 6 to model multiprocessor systems.
Acyclic graph computations
The evaluation methodology, derived from common queueing theory, consists of two
interrelated phases: transient and steady-state analysis. As said before, when ρ > 1,
the server represents a bottleneck with respect to the clients' requests. There is a
transient period during which more requests of the same client can be delivered to
the queue where they are buffered, so the client behavior is relatively independent
of the server one. However, in the steady-state behavior the mean number of
queued elements tends to grow indefinitely. Thus, if ρ > 1, on average the
server is not able to satisfy the client requests.
In real systems and computations, because the queue size is of finite length, when
ρ > 1 in the steady-state behavior on the average the client is temporarily blocked
each time it tries to send a new request, thus in the steady-state behavior the client
request rate adapts to the server service rate. For this reason the situation ρ > 1
is a transient one. However, for our purposes, it is meaningful, because it is the
condition we have to check in order to discover the possible existence of a bottleneck
starting from the definition of the computation modules.
When ρ < 1, the server is not a bottleneck, and the distinction between the tran-
sient phase and the steady-state phase has no substantial impact on the average
performance measures.
When the server is a bottleneck, instead, a non-null transient phase exists before
reaching the steady-state behavior. Once the behavior is stable, the mean interarrival
time becomes equal to the mean service time of the server; thus the mean service
time of any client is increased with respect to its initial value, i.e. with respect to
the ideal service time.
For performance evaluation of acyclic graph computations, the condition ρ = 1
denotes a “limit situation” in which, on the average, the clients are not delayed,
although considerable fluctuations around the mean values exist. From a mathematical
point of view, in an acyclic graph computation, the condition ρ = 1 − δ,
with δ > 0 arbitrarily small, is sufficient for a steady-state behavior without
bottlenecks.
In other words, in acyclic graph computations, ρ values less than one correspond to
the elimination of the bottleneck and, if close to one, to the optimal server utilization. A
further parallelization of the server, thus a further TS reduction, is not beneficial for
the client and implies lower server efficiency.
In the design of parallel computations expressed by acyclic graphs, where possible,
we try to eliminate all the bottlenecks by imposing utilization factor values less than
one and very close to one, so achieving the best server efficiency.
3.2.2 Performance evaluations of modules and graph computations
Once a bottleneck is found, the module will be parallelized according to the
“compilation workflow” presented in this section. In order to do that, we need to evaluate
the ideal and effective bandwidths of acyclic graph computations, and, by using these
evaluations, we can determine the optimal degree of parallelism.
The following results are used in the evaluation of performances for sub-classes of
graph computations. The first theorem evaluates the interarrival time of the requests
generated from a client to a set of servers.
Theorem 3.2.1 (Interarrival time during the transient phase). Assume a graph
composed of a client node C and a set of server nodes S1, ..., Sn. Let pi be the
probability that a request from C is directed to Si, where
\[
\sum_{i=1}^{n} p_i = 1
\]
During the transient phase, the interarrival time TAi to each server Si is given by:
\[
T_{A_i} = \frac{T_C}{p_i}
\]
where TC is the interdeparture time of C towards any server.
The next theorem evaluates the interarrival time of the requests generated from
a set of clients to a single server.
Theorem 3.2.2 (Total interarrival time during the transient phase). If a server
node S has n multiple clients each one with an initial inter-departure time TPi to S,
during the transient phase the total inter-arrival time to S is given by:
\[
T_A = \frac{1}{\sum_{i=1}^{n} \frac{1}{T_{P_i}}}
\]
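The two theorems translate directly into code. The sketch below is only an illustration under assumed names (`interarrival_to_servers`, `total_interarrival`) and hypothetical values; it covers the transient-phase formulas only.

```python
from typing import List, Sequence


def interarrival_to_servers(t_c: float, probabilities: Sequence[float]) -> List[float]:
    """Theorem 3.2.1: T_Ai = T_C / p_i for each server S_i."""
    if any(p <= 0 for p in probabilities):
        raise ValueError("each routing probability must be positive")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("routing probabilities must sum to 1")
    return [t_c / p for p in probabilities]


def total_interarrival(interdeparture_times: Sequence[float]) -> float:
    """Theorem 3.2.2: T_A = 1 / sum_i (1 / T_Pi), i.e. the arrival rates add up."""
    return 1.0 / sum(1.0 / t_p for t_p in interdeparture_times)


if __name__ == "__main__":
    # Hypothetical client with T_C = 10 routing requests to two servers.
    print(interarrival_to_servers(10.0, [0.5, 0.5]))
    # Hypothetical server fed by two clients, each with T_P = 20.
    print(total_interarrival([20.0, 20.0]))
```

The second function simply sums the request rates of the clients, which is why the result is never larger than the smallest TPi.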
These theorems are valid in the transient behavior of the computation. Thus,
they are valid also in the steady-state behavior only if there are no bottlenecks,
as discussed in the previous section. Otherwise, all interarrival and interdeparture
times must be re-evaluated in order to have a correct evaluation of the steady-state
behavior performances. The interested reader can consult [77] for a complete
analysis and proof.
Figure 3.10: A generic acyclic graph computation Σ [figure labels: Σ, Σ1, TA]
Let us now consider a generic acyclic graph computation Σ, consisting of a module,
or subsystem, Σ1 having one or more input streams with interarrival time TA, as
depicted in Figure 3.10.
Ideal and effective service time of Σ1 The ideal service time of Σ1 (TΣ−id)
is evaluated by considering it as an “isolated” system. That is, the ideal service
time does not depend on the interarrival rate 1/TA. The inverse of the ideal service
time measures the offered bandwidth of Σ1, while the interarrival rate represents the
bandwidth requested by Σ to Σ1.
If Σ1 is a single module, it is characterized by an average internal calculation time
Tcalc, which gives a first idea of the ideal service time. The ideal service time is
also evaluated as a function of the latency of communications Lcom performed by
Σ1 towards the external world. For example, we can have
\[
T_{\Sigma-id} = T_S = \max(T_{calc}, L_{com})
\]
if the calculation and communication phases can be overlapped, or
\[
T_{\Sigma-id} = T_S = T_{calc} + L_{com}
\]
otherwise.
If Σ1 is an n-parallelization (n is the parallelism degree) of a sequential subsystem
with ideal service time T, the ideal service time of Σ1 is given by
\[
T_{\Sigma-id} = T_S = \frac{T}{n}
\]
Because Σ1 belongs to a system Σ, its effective service time is given by its
interdeparture time, which can be evaluated according to the definition of ρ as follows:
\[
T_P =
\begin{cases}
T_A & \rho < 1 \\
T_S & \rho \geq 1
\end{cases}
\]
or, equivalently,
\[
T_P = \max(T_A, T_S)
\]
The efficiency of Σ1 is then evaluated considering the complex system as follows
\[
\varepsilon_{\Sigma_1} = \frac{T_{\Sigma-id}}{T_{\Sigma}} = \frac{T_S}{\max(T_A, T_S)}
\]
Therefore, we have that the relative efficiency of a module/subsystem, belonging to
a more complex system, is equal to its utilization factor if it is not a bottleneck,
otherwise it is equal to one:
\[
\varepsilon_{\Sigma_1} =
\begin{cases}
\rho_{\Sigma_1} & \rho_{\Sigma_1} < 1 \\
1 & \rho_{\Sigma_1} \geq 1
\end{cases}
\]
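The definitions of ideal service time, interdeparture time and relative efficiency given above can be sketched as follows. The function names, the `overlapped` flag and the default `n=1` are illustrative assumptions.

```python
def ideal_service_time(t_calc: float, l_com: float,
                       overlapped: bool = True, n: int = 1) -> float:
    """Ideal service time of a module.

    max(T_calc, L_com) if calculation and communication overlap,
    T_calc + L_com otherwise; the result is divided by n for an
    n-parallelization of the sequential module.
    """
    t_seq = max(t_calc, l_com) if overlapped else t_calc + l_com
    return t_seq / n


def interdeparture_time(t_a: float, t_s: float) -> float:
    """Effective service time T_P = max(T_A, T_S)."""
    return max(t_a, t_s)


def relative_efficiency(t_a: float, t_s: float) -> float:
    """Efficiency T_S / max(T_A, T_S): rho if rho < 1, otherwise 1."""
    return t_s / max(t_a, t_s)


if __name__ == "__main__":
    # Hypothetical module: T_calc = 8, L_com = 2, interarrival time T_A = 10.
    t_s = ideal_service_time(8.0, 2.0, overlapped=False)
    print(t_s, interdeparture_time(10.0, t_s), relative_efficiency(10.0, t_s))
```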
Thus, the evaluation of the whole system Σ can be derived as follows:
\[
T_{\Sigma} = T_{\Sigma_1} = T_P \qquad T_{\Sigma-id} = T_A
\]
Therefore, the relative efficiency of the entire system is:
\[
\varepsilon = \frac{T_{\Sigma-id}}{T_{\Sigma}} = \frac{T_A}{T_P}
\]
This means that the system is able to achieve the ideal bandwidth, and the maximum
efficiency, if it does not contain bottlenecks, that is TP = TA, otherwise TP > TA.
These results are applicable to any acyclic system where a module/subsystem Σin
generates the stream with interdeparture time TA.
Optimal parallelism degree Suppose that Σ1 represents a bottleneck in an
acyclic graph computation. The optimal parallelism degree for its parallelization
can be evaluated as
\[
n_{opt} = \left\lceil \frac{T_S}{T_A} \right\rceil
\]
Actually, this value represents the potentially optimal parallelism degree, which
means that the bottleneck can be eliminated, provided that a parallelism paradigm
exists able to actually exploit this parallelism degree.
If the optimal parallelization is feasible, then the effective service time of Σ1 is:
\[
T_S^{n_{opt}} = T_A
\]
and the relative efficiency:
\[
\varepsilon_{\Sigma_1} = \rho_{\Sigma_1} = \frac{T_S / n_{opt}}{T_A}
= \frac{T_S}{T_A \left\lceil \frac{T_S}{T_A} \right\rceil}
\]
Otherwise, the parallelized Σ1 is still a bottleneck. If n0 < nopt is the best parallelism
degree achievable, the effective service time is
\[
T_S^{n_0} = \frac{T_S}{n_0} > T_A
\]
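A small sketch of the optimal parallelism degree and of the resulting effective service time and efficiency is given below. The function names and the numeric values are assumptions made for illustration; the lower bound of one on the degree is also an assumption, added so that a module that is not a bottleneck keeps its sequential form.

```python
import math


def optimal_parallelism_degree(t_s: float, t_a: float) -> int:
    """n_opt = ceil(T_S / T_A), with a lower bound of 1 (assumption)."""
    return max(1, math.ceil(t_s / t_a))


def effective_service_time(t_s: float, t_a: float, n: int) -> float:
    """Interdeparture time of the n-parallelized module: max(T_A, T_S / n)."""
    return max(t_a, t_s / n)


def parallel_efficiency(t_s: float, t_a: float, n: int) -> float:
    """Relative efficiency (T_S / n) / max(T_A, T_S / n)."""
    return (t_s / n) / effective_service_time(t_s, t_a, n)


if __name__ == "__main__":
    t_s, t_a = 70.0, 20.0  # hypothetical bottleneck: rho = 3.5
    n_opt = optimal_parallelism_degree(t_s, t_a)
    for n in (2, n_opt):
        print(n, effective_service_time(t_s, t_a, n), parallel_efficiency(t_s, t_a, n))
```

With a degree lower than the optimal one the module remains a bottleneck and its efficiency stays at one, while with the optimal degree the efficiency drops slightly below one because of the ceiling.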
Among the reasons why the optimal parallelization may not be feasible,
a trivial one is the insufficient number of available processing nodes in the
target architecture.
In any case, nopt depends on Tcalc and Lcom. These values are initially evaluated on a
single processing node of the target architecture. Using n processing nodes, some
parameters affecting Tcalc and Lcom might assume different values with respect to
the “sequential” case of n = 1.
In general, a parallel program uses additional functionalities and data structures
with respect to the sequential module, e.g., for synchronization and cooperation
between the modules that implement the parallelism paradigm.
Moreover, communication latencies and memory access latencies depend on the number
of active processing nodes and their processing times, as we will analyze in depth
in Chapter 6.
Therefore, in principle the parallelism degree should be re-evaluated in terms of the
“parallel-version” values of Tcalc and Lcom.
3.2.3 Parallelism forms and cost models
As introduced in Section 3.1, each parallelism form is characterized by a specific
semantics and its behavior, in terms of data partitioning or replication and function
replication, is well-defined.
Moreover, each parallel paradigm is characterized by a specific cost model.
Pipeline Consider the pipeline computation
\[
F(x) = F_n(F_{n-1}(\ldots F_2(F_1(x)) \ldots))
\]
As in the previous section, let Tcalc be the average calculation time of the whole function
F(x):
\[
T_{calc} = \sum_{i=1}^{n} T_{F_i}
\]
In the most general case, stages might be unbalanced in terms of internal calcula-
tion time and/or of communication latency. Therefore, one stage is the (heaviest)
bottleneck for the whole computation. The cost model is derived by applying the
general theory of the previous section.
Let b be the index of the bottleneck stage, and let Tb = Tmax denote its ideal
and effective service time. If Ti−id denotes the ideal service time of stage i, with
i = 1, ..., n, in the steady-state behavior all stages have effective service time and
efficiency equal to:
\[
T_i = T_{max} \qquad i = 1, \ldots, n
\]
and
\[
\varepsilon_i = \frac{T_{i-id}}{T_{max}} =
\begin{cases}
\rho_i & i \neq b \\
1 & i = b
\end{cases}
\]
Figure 3.11: Stream-oriented pipeline modeling of task farm [figure labels: Stage 1, Stage 2, Stage 3; E, W1, ..., Wi, ..., Wn, C; TE, TW, TC]
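If the stream is generated by the first stage, for the whole computation we have:
\[
T^{n}_{\Sigma-id} = T_{1-id} \qquad T^{n}_{\Sigma} = T_{max} \qquad
\varepsilon^{n}_{\Sigma} = \frac{T_{1-id}}{T_{max}}
\]
If, instead, the stream is generated externally with interarrival time TA:
\[
T^{n}_{\Sigma-id} = T_{max} \qquad T^{n}_{\Sigma} = \max(T_{max}, T_A)
\]
\[
\varepsilon^{n}_{\Sigma} =
\begin{cases}
1 & T_{max} \geq T_A \\
\rho_{\Sigma} = \frac{T_{max}}{T_A} & T_{max} < T_A
\end{cases}
\]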
Assuming that all stages have the same Lcom, the latency of the pipeline computation
is
\[
L^{n} = \sum_{i=1}^{n} (L_i + L_{com})
\]
which is greater than the sequential module latency.
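The pipeline cost model can be collected into a single function, sketched below. The function name, the dictionary keys and the example values are illustrative assumptions; stage indices are 0-based in the code, and the first stage is taken as the stream generator when no external interarrival time is supplied.

```python
from typing import Optional, Sequence


def pipeline_cost(ideal_times: Sequence[float],
                  stage_latencies: Sequence[float],
                  l_com: float,
                  t_a: Optional[float] = None) -> dict:
    """Steady-state cost model of an n-stage pipeline."""
    t_max = max(ideal_times)                  # service time of the bottleneck stage
    b = list(ideal_times).index(t_max)        # 0-based index of the bottleneck stage
    stage_eff = [t / t_max for t in ideal_times]
    if t_a is None:
        # Stream generated by the first stage.
        t_ideal = ideal_times[0]
        t_eff = t_max
    else:
        # Stream generated externally with interarrival time T_A.
        t_ideal = t_max
        t_eff = max(t_max, t_a)
    return {
        "bottleneck_index": b,
        "stage_efficiencies": stage_eff,
        "ideal_service_time": t_ideal,
        "service_time": t_eff,
        "efficiency": t_ideal / t_eff,
        "latency": sum(l_i + l_com for l_i in stage_latencies),
    }


if __name__ == "__main__":
    # Hypothetical three-stage pipeline with unbalanced stages.
    print(pipeline_cost([10.0, 30.0, 20.0], [10.0, 30.0, 20.0], l_com=2.0))
    print(pipeline_cost([10.0, 30.0, 20.0], [10.0, 30.0, 20.0], l_com=2.0, t_a=60.0))
```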
Task Farm In the task farm paradigm a pipeline effect exists among emitter, set
of workers, and collector, as summarized in Figure 3.11.
If Tcalc is the calculation time of function F and Tcom−W is the (non-overlapped)
latency of communications performed by a worker, the service time of each worker
is
\[
T_W = T_{calc} + T_{com-W}
\]
Let TE and TC be the emitter and collector service times, and TA be the interarrival
time to farm Σ.
The optimal number of workers is given by the general theory:
\[
n_{opt} = \left\lceil \frac{T_{calc}}{T_A} \right\rceil
\]
This is always true if the emitter is not a bottleneck, that is
\[
T_A \geq T_E
\]
which is verified in the large majority of cases, e.g., when TE = Lcom. The farm is not
able to exploit the optimal parallelism degree when:
\[
T_A < L_{com}
\]
Therefore, the emitter interdeparture time is
\[
T_{P_E} = \max(T_A, T_E)
\]
Since the workers are load balanced, the probability that an input stream element
is sent to any worker is constant and equal to 1/n. Therefore, by Theorem 3.2.1, we have:
\[
T_{A_i} = n \cdot \max(T_A, T_E) \qquad i = 1, \ldots, n
\]
and
\[
T_{P_i} = \max(T_W, n \cdot \max(T_A, T_E)) \qquad i = 1, \ldots, n
\]
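By Theorem 3.2.2, the collector interarrival rate is given by:
\[
\frac{1}{T_{A_C}} = \sum_{i=1}^{n} \frac{1}{T_{P_i}}
= \min\left(\frac{n}{T_W}, \frac{n}{n \cdot \max(T_A, T_E)}\right)
= \min\left(\frac{n}{T_W}, \frac{1}{\max(T_A, T_E)}\right)
\]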
which also represents the effective farm bandwidth, if the collector is not a bottleneck.
Thus, the farm service time is:
\[
T_{\Sigma} = \max\left(\frac{T_W}{n}, \max(T_A, T_E)\right)
\]
The best number of workers is the n value that maximizes the bandwidth, which, in
the most frequent case in which the emitter is not a bottleneck, is:
\[
n_{opt} = \left\lceil \frac{T_W}{T_A} \right\rceil
\]
As in the pipeline, the latency is greater than the sequential latency:
\[
L_{farm} \sim T_{calc} + L_E + L_{com-W} + L_C
\]
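The farm formulas can be checked numerically with the sketch below, which assumes, as in the text, load-balanced workers and a collector that is not a bottleneck. Function names and values are illustrative assumptions.

```python
import math


def farm_service_time(t_w: float, n: int, t_a: float, t_e: float) -> float:
    """Farm service time: max(T_W / n, max(T_A, T_E))."""
    return max(t_w / n, max(t_a, t_e))


def collector_interarrival_rate(t_w: float, n: int, t_a: float, t_e: float) -> float:
    """Sum of the n identical worker rates 1 / T_Pi (Theorem 3.2.2)."""
    t_pi = max(t_w, n * max(t_a, t_e))  # interdeparture time of each worker
    return n / t_pi


def farm_optimal_workers(t_w: float, t_a: float) -> int:
    """n_opt = ceil(T_W / T_A), valid when the emitter is not a bottleneck."""
    return max(1, math.ceil(t_w / t_a))


if __name__ == "__main__":
    t_w, t_a, t_e = 100.0, 10.0, 2.0  # hypothetical service and interarrival times
    n_opt = farm_optimal_workers(t_w, t_a)
    for n in (5, n_opt, 2 * n_opt):
        print(n, farm_service_time(t_w, n, t_a, t_e))
```

Adding workers beyond the optimal number leaves the service time at max(TA, TE), which is consistent with the earlier remark that further parallelization only lowers efficiency.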
Data-parallel More complex with respect to the task farm case is the definition
of performance models for data-parallel schemes. As we have seen in the example
provided in Section 3.1, for data-parallel computations many forms exist and, for
each form, some variants are possible, e.g. with or without replicated data, with
or without data communications between workers, and so on. In this section, we
provide a general description of a performance modeling for data-parallel programs,
which needs to be instantiated to real cases.
We consider data-parallel programs in which a composite input state (vectors and/or
matrices) is partitioned/replicated among a set of workers which apply a function
F on each element of its assigned partition for a certain number of iterations. The
function evaluation is a sequential computation that can feature statically known
data dependencies: for instance the evaluation of F on the i-th element of an array
can depend on the values of the nearest neighbors i−1 and i+1. Such dependencies
can vary between different iterations of a data-parallel program (variable stencil)
or they can be the same for all iterations (fixed stencil). In our model at each
iteration i, all data dependencies are related to the element values computed at the
end of the previous iteration i− 1. In some cases, at the end of the computation, a
dedicated process performs the gathering of the local results of each worker, filling
an output data-structure.
Three different phases of a data-parallel program emerge from the above description: (i)
the distribution of the input data-structures; (ii) the execution phase composed of
a set of iterations performed by each worker in parallel; (iii) the collection of worker
results. The performance model of data-parallel programs is defined in terms of the
service time and the computation latency of these three phases.
We can evaluate the latency of a data-parallel computation as
\[
L_{dp} = L_{IN} + \sum_{i=1}^{n_{it}} T_{iter}(i) + L_{OUT}
\]
where LIN is the computation latency for completing the distribution of the input
data-structures of a task executed by the INPUT module, and LOUT is the latency
for completing the results collection executed by the OUTPUT module. The middle
term of the equation (Titer(i)) indicates the computation time of each worker, where
nit represents the number of iterations executed by each worker.
For each worker, the computation time per iteration consists of a calculation phase,
in which the sequential computation is applied to all the elements of its partition (of
dimension g), and a communication phase, in which a portion of the local data
is transmitted to other workers according to the data-dependencies imposed by the
computation semantics. Therefore we have:
\[
T_{iter}(i) = g \, T_F + T_{comm}(i)
\]
where TF indicates the calculation time per element and Tcomm(i) indicates the
communication time required for exchanging data with other workers of the computation.
If a data-parallel scheme operates in a stream-based computation, its ideal service
time is given, as in the farm case, by the following expression:
\[
T_{dp} = \max\left(T_{IN}, \sum_{i=1}^{n_{it}} T_{iter}(i), T_{OUT}\right)
\]
where, for each stage, the service time of the corresponding module is used.
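The data-parallel latency and service time can be sketched as follows. The function names are assumptions, and the per-iteration communication times Tcomm(i) are passed as a list whose length plays the role of nit; the numeric values are hypothetical.

```python
from typing import Sequence


def iteration_time(g: int, t_f: float, t_comm: float) -> float:
    """T_iter(i) = g * T_F + T_comm(i) for a partition of g elements."""
    return g * t_f + t_comm


def execution_time(g: int, t_f: float, comm_times: Sequence[float]) -> float:
    """Sum of T_iter(i) over the n_it = len(comm_times) iterations."""
    return sum(iteration_time(g, t_f, t_comm) for t_comm in comm_times)


def data_parallel_latency(l_in: float, l_out: float, g: int, t_f: float,
                          comm_times: Sequence[float]) -> float:
    """L_dp = L_IN + sum_i T_iter(i) + L_OUT."""
    return l_in + execution_time(g, t_f, comm_times) + l_out


def data_parallel_service_time(t_in: float, t_out: float, g: int, t_f: float,
                               comm_times: Sequence[float]) -> float:
    """T_dp = max(T_IN, sum_i T_iter(i), T_OUT) for stream-based computations."""
    return max(t_in, execution_time(g, t_f, comm_times), t_out)


if __name__ == "__main__":
    # Hypothetical worker: 100 elements, 0.5 time units each, 3 iterations.
    comm = [2.0, 2.0, 4.0]
    print(data_parallel_latency(5.0, 3.0, 100, 0.5, comm))
    print(data_parallel_service_time(5.0, 3.0, 100, 0.5, comm))
```

A variable stencil corresponds to different entries in the list of communication times, while a fixed stencil corresponds to a list of identical values.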
3.2.4 Evaluating the model parameters
In this section we derived a set of equations that model the service time and the
latency of the various parallelism forms and that are architecture independent. The
impact of the underlying architecture is captured by a set of parameters that need
to be estimated, using information derived from the specific target architecture and
from the algorithm.
In general, to evaluate these parameters we can use any of the three main techniques
of performance evaluation: measurement, simulation and analytical modeling. All
these techniques play equally important roles in performance studies, because each
one has its own advantages and disadvantages, that basically consist in a proper mix
of precision and cost. We cannot really say that one technique is always better than
another, as it usually depends on what we are evaluating. When a single technique
cannot be effectively applied, a mix of the three is used to evaluate the performance
of the system.
Modeling communication latencies The communication latency Lcom is a
key parameter in our cost models. It measures the latency for completing the
communication primitives used by the modules of the application to perform interactions
(i.e., synchronizations and communications). For example, in the message-passing
implementation model it measures the latency of the communication primitives send and
receive.
Lcom captures a large number of characteristics of the underlying architecture, like
• shared vs distributed memory,
• interconnection structures,
• memory hierarchies, caching strategies (e.g., cache coherence protocol
interactions),
• features of CPUs, coprocessors, and other processing units,
• process scheduling,
and so on.
In order to take into account all fundamental issues in the cost model definition, we
utilize a typical concept in computer science: abstract architecture.
Figure 3.12: An example of an abstract architecture with n Processing Elements
connected to the main memory through the interconnection structure
An abstract architecture is a simplified view of the concrete target architecture, able
to describe the essential performance properties and to abstract from all the others
that are not relevant. It aims to throw away details belonging to different concrete
architectures and to emphasize the most important and general ones. An abstract
architecture for shared memory architectures could be the one in Figure 3.12, wherein
there exist as many processing nodes (PEs) as processes, connected to the main memory
through the interconnection structure.
A cost model is associated to the abstract architecture. This cost model has to
sum up all the features of the concrete architecture, the interprocess communication
run-time support and the impact of the parallel application. Further, we strongly
advocate that a cost model should be easy to use and conceptually simple to
understand.
For example, let us consider the typical module behavior consisting of the phases
“receive − compute − send”. For each input stream element, Tcalc is the compute
phase service time, while Tcom evaluates the delay introduced by receive and send
communication primitives. In the simplest design approach, the three phases are
executed in a strictly sequential order. Thus, Tcom is just the latency of the com-
munication phase (Lcom), which corresponds to the sum of the send and receive
latencies:
\[
T_{com} = L_{com} = T_{send} + T_{receive}
\]
Depending on the interprocess communication run-time scheme, for example in a
message-passing run-time support, we are able to define the latency of the send and
receive primitives as:
\[
T_{send}(M) = T_{setup-send} + M \cdot T_{transm-send}
\]
\[
T_{receive}(M) = T_{setup-receive} + M \cdot T_{transm-receive}
\]
where M is the message length, Tsetup is the latency of all run-time support actions
except the message copy (sender-receiver synchronization, management of the buffer of
“target variables”, low-level scheduling), and Ttransm is the latency for copying
one word of the message.
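The linear latency model of the send and receive primitives can be sketched as below. The function names and all numeric parameters are hypothetical; real values of Tsetup and Ttransm have to be measured or derived for the specific run-time support.

```python
def primitive_latency(t_setup: float, t_transm: float, m: int) -> float:
    """Latency of one primitive: T_setup + M * T_transm for a message of M words."""
    return t_setup + m * t_transm


def communication_latency(m: int,
                          t_setup_send: float, t_transm_send: float,
                          t_setup_receive: float, t_transm_receive: float) -> float:
    """L_com = T_send(M) + T_receive(M) for strictly sequential phases."""
    return (primitive_latency(t_setup_send, t_transm_send, m)
            + primitive_latency(t_setup_receive, t_transm_receive, m))


def sequential_module_service_time(t_calc: float, l_com: float) -> float:
    """Service time of a receive-compute-send module without overlapping."""
    return t_calc + l_com


if __name__ == "__main__":
    # Hypothetical parameters, expressed in clock cycles.
    l_com = communication_latency(100, 50.0, 1.0, 40.0, 2.0)
    print(l_com, sequential_module_service_time(1000.0, l_com))
```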
Evaluating the sequential time The evaluation of Tcalc is the only part that
requires the knowledge of the application algorithm. Because of isolation, we only
need to estimate the execution time of the sequential source code on the target
architecture. This is surely an easier problem than evaluating the performance
of a parallel application. Several works aimed at solving this problem exist in the
literature [58, 61], so that we can safely consider this a solved, or at least solvable,
problem that does not require further study in this thesis. For the sake
of simplicity, in this thesis we use a very simple methodology based on the actual
execution of the sequential program on our target machine. Notably, we use the
following execution time model:
\[
\begin{aligned}
T_{calc} &= \text{CPU execution clock cycles} + \text{CPU execution clock cycles} * \text{Memory access clock cycles} \\
         &= \text{CPU execution clock cycles} + L_{misses} * \text{Memory Latency}
\end{aligned}
\]
This model essentially separates the CPU time from the memory hierarchy latencies.
Despite its simplicity, it is actually used in many performance evaluation works and
is believed to model the behavior of a program with sufficient precision.
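A minimal sketch of the second form of this model is shown below. It assumes that Lmisses denotes a count of cache misses and that all quantities are expressed in clock cycles; the function name, the parameter names and the numbers are illustrative assumptions.

```python
def sequential_time_cycles(cpu_cycles: int, misses: int,
                           memory_latency_cycles: int) -> int:
    """T_calc = CPU execution clock cycles + misses * memory latency (in cycles)."""
    return cpu_cycles + misses * memory_latency_cycles


if __name__ == "__main__":
    # Hypothetical profile of a sequential run: 1M CPU cycles and 2000 misses.
    base = sequential_time_cycles(1_000_000, 2_000, 150)
    # Same run evaluated with a higher (hypothetical) memory latency.
    loaded = sequential_time_cycles(1_000_000, 2_000, 400)
    print(base, loaded)
```

Evaluating the same profile with two different memory latencies shows how the memory term alone changes the estimate of Tcalc.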
Base and under-load memory access latencies It is important to notice that
for the evaluation of both communication latencies and sequential time we need to
use two different values for the memory latency:
• the under-load memory access latency, which represents the access latency in
a parallel program, with multiple PEs issuing memory requests concurrently,
and which can be predicted by means of specific architecture-based models, as we
will study in Chapter 6;
• the base memory latency, that is the memory latency without any congestion
effect, which depends only on the specific architecture; notably, in Chapter 4, we
will provide a complete definition of base memory latency starting from the
definition of an abstract architecture able to model different cache coherence
solutions.
3.3 Summary
In this chapter we summarized the basic features characterizing the methodology
proposed by our research group and used in this thesis. Starting from the definition
of a parallel application as a graph of co-operating (parallel) modules,
we “solve the computation graph” in order to understand which modules represent
bottlenecks for the application performance. The next step consists in providing a
functionally equivalent parallel computation for each bottleneck, without modifying
the semantics and the logical interfaces with respect to the sequential version. In
order to evaluate and compare alternative versions of the parallel transformation,
we use proper performance metrics according to a cost model. This
performance model is parametrically dependent on the application characteristics
and on the target parallel architecture(s), and is derived starting from the abstract
description of the architecture.
Part II
Applying Our Methodology to the
Cache Coherence Problem

CHAPTER 4
Modelling Cache Coherent Architectures
Off-the-shelf architectures offer different solutions to deal with the cache coherence
problem. As we saw in Section 2.2, we can distinguish between invalidation-based
and update-based coherency protocols and also between snoopy-based and directory-
based implementations. Moreover, each platform defines a specific set of actions
performed in order to maintain the consistency of data, namely the coherency pro-
tocol.
However, we can analyze the most important aspects of a cache coherent system by
reasoning on an abstract model for these architectures.
The model will help us to understand when and how the coherency protocol is
applied and give us a first idea on the differences in terms of performance of the
different implementations used in current architectures.
4.1 Shared Memory Organization in CMP-based Architectures
In a generic multiprocessor architecture, we recognize the memory organization
according to the relative “distance” between the Processing Elements (PEs) and the
Shared Memory (M), which characterizes the shared memory access latency.
Traditionally, we distinguish two main organizations:
• Uniform Memory Access (UMA), often called Symmetric MultiProcessor (SMP),
where PEs are “equidistant” from the shared memory (from any memory module),
which means the base memory access latency is equal for any PE;
• Non Uniform Memory Access (NUMA), where for each PE we distinguish
between “local” and “remote” memory, which means the base memory access
latency depends on the specific PE and on the specific memory module that
interact.
Figure 4.1: SMP vs NUMA characterization of multiprocessor architectures. (a) SMP; (b) NUMA
Figure 4.1 shows typical schemes of the two organizations. Figure 4.1a represents
an SMP with n PEs and m main memory modules. In the NUMA scheme, shown in
Figure 4.1b, each PEi has a local memory Mi, and the shared memory is the union
of all the local memories. Each PE can address its own local memory and any other
memory module. The remote accesses utilize the interconnection network, while
the local accesses exploit the dedicated links. This leads to a base latency for local
memory accesses (PEi − Mi) which is lower than the remote memory access latency
(PEi − Mj, with i ≠ j). For this reason, the NUMA organization is typically exploited
by allocating all the private information of the process(es) mapped on PEi in Mi, in order
to limit remote accesses to a subset of shared information.
With the advent of CMP-based architectures, this SMP vs NUMA classification
needs to be further specialized in order to better model and exploit both single-
CMP and multiple-CMP systems. In the following, the term core is a synonym for
Processing Element (PE), which is the basic unit of parallelism at the architectural
level.
4.1.1 Single-CMP
In a single-CMP architecture, the internal CMP organization is characterized not only
by the main memory organization but also by the cache subsystem. In fact, when a cache
level is shared, the internal organization is clearly an SMP. For example, the Intel
SandyBridge and the AMD Opteron processors [95, 34] have a third level cache
shared between the various PEs, with the same base cache access latency for any
PE-L3 pair.
Figure 4.2: Main memory subsystem in single-CMP architectures. (a) Main memory subsystem in single-CMP architectures; (b) NUMA-SMP characterization of single-CMP architectures
NUMA-SMP architecture Regarding the main memory, in current CMPs there
is no internal main memory subsystem. A notable exception was the IBM Cell
[65, 63]. As shown in Figure 4.2a, the main memory is typically off-chip, realized by
distinct subsystems, and it is interconnected through a memory controller or External
Memory Interface (MINF). The increasing memory bandwidth requirements of
CMPs led to the inclusion of the memory controller inside the processor chip, and
later even to an increase in the number of controllers per chip: the IBM Power7
and the Tilera TilePro64 both contain four MINFs on-chip.
Depending on the connectivity of PEs to the MINF and on the process code and data
allocation, a single-MINF CMP is again either an SMP or a NUMA organization.
For a multiple-MINF CMP, each MINF can be, at least logically,
associated to a partition of PEs. As shown in Figure 4.2b, each core is physically
interconnected to all MINFs, but it is nearer to one of them. In this case, if all
private information of each core partition plus some shared information is allocated
on the associated shared memory partition, the architecture can be logically
used as a NUMA. Otherwise, when no specific allocation strategy is planned, the
NUMA characterization is related to the specific interconnection network, for which
the distance might depend on the communicating PE.
Figure 4.3: SMP vs NUMA characterization of multiple-CMP architectures. (a) Multiple-CMP SMP. (b) Multiple-CMP NUMA.
4.1.2 Multiple-CMP
The SMP vs NUMA characterization is also meaningful for multiple-CMP architectures. Figure 4.3 shows how the SMP/NUMA distinction applies to the CMPs, according to the external interconnection and to the memory allocation strategy. EXT is the interface subsystem between the CMP MINFs and the external interconnect.
The NUMA architecture is actually a NUMA-SMP architecture, since a CMP with its local external memory is itself an SMP or a NUMA-SMP, as discussed in the single-CMP section. For example, the multiprocessor configurations of both the Intel SandyBridge and the AMD Opteron processors fall into this characterization because of their external interconnections.
4.2 An Abstract Model for Cache Coherent Architectures
In this section we define an abstract model for a generic cache coherent CMP-based system. Figure 4.4 shows the abstract model for a generic system with a minimal memory hierarchy, composed of a shared main memory (M) and a private cache (C) for each processing element (PE).
Figure 4.4: Abstract model for a cache coherent architecture with N processing elements (shared main memory M, Global Control of cache coherence GC, PEs each with a processor P, a private cache C and a Local Cache Control LC, and Cache-to-Cache communications C2C).
The generic PEi contains information about the state (e.g., exclusive or shared) of all cache lines currently allocated in Ci. For each memory operation, the cache management actions depend on the local state of the referred line. However, the local knowledge, managed by the Local cache Control (LC), is not sufficient for achieving cache coherence. To ensure it, global knowledge of the current system-wide situation of cache allocation and coherency states is needed. Logically, the global state is centralized so that it is always up to date: in the abstract model this global knowledge is contained in, and managed by, a centralized Global Control module (GC). The actual implementation of GC will be centralized, decentralized (statically partitioned), or distributed (dynamically partitioned/replicated), depending on the specific system architecture. The model also includes an abstract Cache-to-Cache (C2C) interconnection facility, through which cache lines can be exchanged between PEs according to the protocol actions. In the actual architecture, this facility can be implemented by means of one or more interconnection networks and the shared main memory M.
For example, in snoopy-based low-parallelism multiprocessors, the abstract centralized GC module is actually implemented in a centralized manner. The snoopy bus arbitration logic and the (several) snooping messages strictly correspond to the existence of a centralized unit. The cache line transfers (Cache-to-Cache communications) are implemented through the same bus. In directory-based medium-to-high parallelism architectures, instead, the abstract centralized GC module is decentralized by partitioning the global state. In memory-based schemes, each node contains the global state directory entries corresponding to all the blocks in its local memory (home node). In cache-based schemes, the information about cached copies is distributed among the copies themselves, and the home node simply contains a pointer to one cached copy of the block; each cached copy then contains a pointer to the node that has the next cached copy of the block, in a distributed linked list organization.
All these solutions have been improved to solve the cache coherence problem also in highly parallel CMP-based multiprocessors; for this reason, in the following we extend our model to deal with these new solutions and systems.
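As a minimal sketch of the memory-based directory scheme described above, the following Python fragment models the decentralized GC as one directory per node, with the home node chosen by a static partition of the line addresses. The names `home_node` and `Directory`, as well as the modulo-based partition, are assumptions made only for illustration and do not describe any specific architecture.

```python
from dataclasses import dataclass, field


def home_node(line_addr: int, n_nodes: int) -> int:
    """Static partition (assumed): line addresses are interleaved over the nodes."""
    return line_addr % n_nodes


@dataclass
class Directory:
    """Global state entries for the lines whose home is this node."""
    # line address -> set of PEs currently holding a copy
    sharers: dict = field(default_factory=dict)

    def record_read(self, line_addr: int, pe: int) -> None:
        """Register `pe` as a holder of the line."""
        self.sharers.setdefault(line_addr, set()).add(pe)

    def record_write(self, line_addr: int, pe: int) -> set:
        """Make `pe` the only holder; return the PEs that must be invalidated."""
        to_invalidate = self.sharers.get(line_addr, set()) - {pe}
        self.sharers[line_addr] = {pe}
        return to_invalidate


n_nodes = 4
gc = [Directory() for _ in range(n_nodes)]  # decentralized GC: one directory per node

line = 10
h = home_node(line, n_nodes)          # node holding the directory entry for `line`
gc[h].record_read(line, pe=0)
gc[h].record_read(line, pe=3)
invalidated = gc[h].record_write(line, pe=0)  # PE 0 writes: PE 3 must be invalidated
```

A cache-based scheme would replace the `sharers` set with a single pointer in the home node and one "next copy" pointer per cached copy.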
4.2.1 A Hierarchy-based Classification
In order to deal with the complex memory hierarchies employed in CMP-based architectures and the possible use of different interconnection networks, we can extend our abstract model in the following way. First, it is reasonable to assume that each PE has one or more levels of private cache (PrC). Without loss of generality, we still keep the Local Control (LC) associated with each PE, independently of the use of inclusive or victim caches. In both cases, LC corresponds to the set of units (e.g., cache controller(s)) used in that PE to implement the adopted cache coherence solution.
From the point of view of the implementation and performance of cache coherence solutions, the factor with the highest impact is the presence or absence of one or more modules and/or levels of shared cache (ShC).
Therefore, we can distinguish between CMPs with at least one shared cache level and architectures where the first shared level in the memory hierarchy directly corresponds to the main memory. The model describes a system with S modules of shared cache, where shared data are distributed among the various modules. The PE-ShC Interconnection is used by each processing element to access each module of the shared cache level. For example, the Intel Sandy Bridge processor [95] has an on-chip third-level cache with a number of modules equal to the number of cores, that is, S = N. In this system, the three logical interconnection structures (i.e., memory to ShC (M-ShC), PE-ShC and C2C) actually correspond to the same (ring) interconnection network. In the same way, the M-PE Interconnection, which connects the PEs with the main memory, can also be used for the cache-to-cache communications.
A further classification concerns the implementation of the Global Control (GC). As we said before, the actual implementation of GC will be centralized, decentralized (statically partitioned), or distributed (dynamically partitioned/replicated). In all these cases, we can logically associate the GC with a specific level of the memory hierarchy. Figure 4.5 shows the GC implementation when a shared level cache is present (on the left side) or not (on the right side). We can distinguish the cases in which the GC is implemented at:
• the main memory level (Figures AM1a and AM2a),
• the shared level cache (Figure AM1b),
• the private cache level (Figures AM1c and AM2b).
Multiple-CMP model A typical solution used in multiple-CMP architectures
is to provide a cache coherence protocol hierarchy. CMP internal caches are kept
coherent by a certain protocol, called the inner protocol. Coherence across CMPs
is maintained by another, and possibly different, protocol, called the outer protocol.
Figure 4.5: Global Control (GC) implementation in a hierarchy-based classification of cache coherent CMP-based architectures. Panels: [AM1a], [AM1b], [AM1c] (with a shared level cache); [AM2a], [AM2b] (without a shared level cache).
Each level of the cache coherence protocol hierarchy can also adopt a different architectural solution to implement the respective protocol.
To represent these architectures with our abstract model, we use a two-level hierarchy for the GC. Each level maintains the global information of the specific protocol, outer or inner: GCout and GCin, respectively. GCout is typically associated with the main memory level (as in CC-NUMA systems). For example, the AMD Opteron processor uses a snoopy-based solution as inner protocol, implemented on the L2 caches, which are connected to the shared L3 by a crossbar. The outer protocol for NUMA configurations is directory-based. The L3 cache is directly connected to the MINF and maintains the directory of the cache lines present in the entire CMP. In this case, GCout is maintained at the main memory level, while GCin is associated with each CMP at the shared cache level.
The Intel SandyBridge processor uses a similar solution, but there are two alternatives for the outer protocol in NUMA configurations: either a sort of snoopy-based solution with multicast communications or a directory-based solution can be used on the crossbar interconnection. The abstract model is the same; what changes are the communications performed to maintain the consistency of the data, which are analyzed in the following sections.
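The two-level GC hierarchy can be illustrated with a small sketch. The following Python function decides which protocol level resolves a read miss: the inner GC of the requesting CMP is consulted first, and only if it has no information is the outer GC involved. The function name, the dictionary-based layout of `gc_in` and `gc_out`, and the returned labels are hypothetical and serve only to illustrate the lookup order.

```python
def resolve_read(line, cmp_id, gc_in, gc_out):
    """Return which protocol level resolves a read miss issued inside CMP `cmp_id`.

    gc_in[cmp_id] : set of lines tracked by the inner protocol of that CMP (assumed layout)
    gc_out        : dict mapping a line to the set of CMPs caching it (assumed layout)
    """
    if line in gc_in[cmp_id]:
        return "inner"              # GCin has the information: no external traffic
    holders = gc_out.get(line, set()) - {cmp_id}
    if holders:
        return "outer: remote CMP"  # a cache of another CMP has to be involved
    return "outer: home memory"     # GCout alone suffices: data come from main memory


gc_in = {0: {0x10}, 1: {0x20}}
gc_out = {0x10: {0}, 0x20: {1}}

local = resolve_read(0x10, 0, gc_in, gc_out)
remote = resolve_read(0x20, 0, gc_in, gc_out)
memory = resolve_read(0x30, 0, gc_in, gc_out)
```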
4.3 Base Memory Access Latencies
The abstract model introduced in the previous section allows us to reason about the basic actions of a cache coherency protocol in order to evaluate the base memory and cache access latencies in CMP-based architectures.
We distinguish between reading (Load instructions) and writing (Store instructions) operations on cache lines. In both cases, depending on the local and the global state of the referred cache line, we analyze the actions required by a generic coherency protocol.
Moreover, we refer to the different abstract models shown in Figure 4.5 in the following way:
[AM1a] system with a shared level cache and GC associated to the main memory;
[AM1b] system with a shared level cache and GC associated to the shared level cache;
[AM1c] system with a shared level cache and GC associated to the private level cache;
[AM2a] system without a shared level cache and GC associated to the main memory;
[AM2b] system without a shared level cache and GC associated to the private level
cache.
For simplicity, we start with single-CMP architectures (when not specified, GC represents GCin) and we consider a system with S = 1 in the case [AM1b].
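The five abstract models listed above can be encoded compactly. The following Python sketch records, for each model, whether a shared cache level exists and which hierarchy level hosts the GC; the names `GCLevel`, `MODELS` and `request_target` are assumptions introduced for illustration.

```python
from enum import Enum


class GCLevel(Enum):
    MAIN_MEMORY = "M"
    SHARED_CACHE = "ShC"
    PRIVATE_CACHE = "PrC"


# model -> (has a shared cache level, hierarchy level hosting the GC)
MODELS = {
    "AM1a": (True, GCLevel.MAIN_MEMORY),
    "AM1b": (True, GCLevel.SHARED_CACHE),
    "AM1c": (True, GCLevel.PRIVATE_CACHE),
    "AM2a": (False, GCLevel.MAIN_MEMORY),
    "AM2b": (False, GCLevel.PRIVATE_CACHE),
}


def request_target(model: str) -> str:
    """Level that receives a request needing the global state (the level hosting the GC)."""
    return MODELS[model][1].value
```

For instance, `request_target("AM1b")` returns `"ShC"`, while for [AM1c] and [AM2b] the request goes to the private cache of the home PE.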
4.3.1 Reading Operations
We now consider what happens when PEi executes a load instruction:
1. Pi requests the data from PrCi
2. PrCi checks the local state (LC)
   〈a〉 [LOCAL READ] the local state is sufficient to satisfy the request (e.g., the data is present in the private cache in shared state)
      3. PrCi replies to Pi sending the required data
   〈b〉 the global state (managed by GC) has to be examined
      3. PEi sends a read_req request to
         [AM1a] M through the interconnection network(s) (i.e., PE-ShC and M-ShC), and read_req is forwarded to GC and M
         [AM1b] ShC through the PE-ShC interconnection network, and read_req is forwarded to GC and ShC
         [AM1c],[AM2b] PrCh (note that i = h is possible, if i is the home node for the requested cache line) through the C2C interconnection network, and read_req is forwarded to GCh and PrCh
         [AM2a] M through the M-PE interconnection network, and read_req is forwarded to GC and M
      4. GC checks the global state
         〈a〉 the global state is sufficient to satisfy the request (i.e., the data is up to date in the corresponding hierarchy level)
            5. the data are sent back to PEi through the interconnection network(s)
               ([AM1a] the data are typically sent in parallel also to ShC, to keep the shared level up to date)
            6. PEi updates the local state and the PrCi, which forwards the data to Pi
         〈b〉 a third entity has to be involved (e.g., PrCj)
            5. GC forwards read_req through the interconnection network(s)
               ([AM1a] the third entity could be the ShC or the PrCj and, depending on the actual implementation of the interconnection network(s), the required communications can be done in parallel)
            6. the global state is compared to the actual state of the cache line
               〈a〉 the global state is up to date
                  7. the third entity notifies GC through the interconnection network(s)
                  8. the data are sent to PEi through the interconnection network(s) by
                     [AM1a],[AM2a] M
                     [AM1b] ShC
                     [AM1c],[AM2b] PrCh
                  9. PEi updates the local state and the PrCi, which forwards the data to Pi
               〈b〉 the global state is not up to date or, anyway, the third party involved is responsible for sending the data
                  7. the data are sent to PEi directly by
                     [AM1a] ShC through the PE-ShC interconnection network, or PrCj through the C2C interconnection network
                     [AM2a] PrCj through the C2C interconnection network
                     [AM1b],[AM1c],[AM2b] PrCj through the C2C interconnection network, or M through the interconnection network(s)
                     and, in parallel, the third party updates its local state and notifies GC through the interconnection network(s)
                  8. PEi updates the local state and the PrCi, which forwards the data to Pi
With this analysis we can estimate the base access latency of a cache line read in the various cases. Note that we did not assume a specific coherency protocol: the above and the following considerations remain valid for both invalidation-based and update-based protocols.
Table 4.1 simply defines the cost of accessing data in the memory or cache level corresponding to the GC (T_GC) and the cost of the global state lookup (T_lookup-GC).
Table 4.1: GC and Network Latencies
Operation                        [AM1a], [AM2a]   [AM1b]          [AM1c], [AM2b]
GC Data Access (T_GC)            T_M              T_ShC           T_PrC
GC State Lookup (T_lookup-GC)    T_M-lookup       T_ShC-lookup    T_PrC-lookup
In Table 4.2, each line corresponds to one of the possible situations described above. Respectively, they estimate the latencies of conditions 2〈a〉, 2〈b〉 4〈a〉, 2〈b〉 4〈b〉 6〈a〉 and 2〈b〉 4〈b〉 6〈b〉. In the first column, L_read(LCstate, GCstate, Tstate) represents the read latency where the local state returned by LC is LCstate, the global state in GC is GCstate, and the state of the third party possibly involved is Tstate, with LCstate, GCstate, Tstate ∈ {M, E, S, I} ∪ {−}. For simplicity, we use the MESI [87] states to represent the coherency states: (M)odified, (E)xclusive, (S)hared and (I)nvalid or not present. With − (not specified) we represent any of the possible states, when the state is not relevant for the latency.
Regarding the various latencies, we use T_M, T_ShC and T_PrC to represent the read access time of M, ShC and a generic PrC, respectively. T_LC is the time needed to check the local state. Finally, we assume a system with an interconnection network that implements all the logical interconnection networks used in the abstract models. The network latency is represented by L_net(n_words), where n_words is the number of words in the message (plain L_net denotes request or notify messages of one word). For example, L_net(σ) represents the communication latency of a cache line of σ words.
Table 4.2: Reading Operations Latencies
L_read(LCstate, GCstate, Tstate)    Cache Block Read Base Latencies
L_read(M/E/S, −, −)                 T_PrC
L_read(I, M/E/S, −)                 T_LC + L_net + T_GC + L_net(σ) + T_PrC
L_read(I, −, E)                     T_LC + L_net + T_lookup-GC + 2 L_net + T_LC + T_GC + L_net(σ) + T_PrC
L_read(I, −, M)                     T_LC + L_net + T_lookup-GC + L_net + T_M/T_ShC/T_PrC + L_net(σ) + T_PrC
As we said before, we consider a generic coherency protocol in order to be able to model as many systems as possible. Obviously, actual implementations represent subsets of the abstract models, coherency protocols and latencies discussed above. For example, the Intel Sandy Bridge [95] is represented by the abstract model [AM1b], and all the examined situations may occur. By contrast, the Tilera TilePro64 corresponds to the [AM2b] model and, due to the writing policy adopted, condition 6〈a〉 never occurs. Moreover, the only third entity that could be involved (condition 6〈b〉) is the main memory, which sends the data to PEi through PEh, adding a delay of T_PrC to the latency.
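The four rows of Table 4.2 can be turned into a small cost function. The following Python sketch evaluates the base read latency from the three states; the `Params` container, the function name and, above all, the numeric values used in the example are hypothetical and are not measurements of any architecture.

```python
from dataclasses import dataclass


@dataclass
class Params:
    t_lc: float         # T_LC: local state check
    t_prc: float        # T_PrC: private cache access
    t_gc: float         # T_GC: data access at the level hosting the GC (Table 4.1)
    t_lookup_gc: float  # T_lookup-GC: global state lookup (Table 4.1)
    t_third: float      # T_M, T_ShC or T_PrC of the third party supplying the data
    l_net: float        # L_net: one-word request or notify message
    l_net_line: float   # L_net(sigma): transfer of a cache line


def read_latency(lc: str, gc: str, third: str, p: Params) -> float:
    """Base read latency following the four rows of Table 4.2 ('-' = not specified)."""
    if lc in ("M", "E", "S"):        # condition 2<a>: local read
        return p.t_prc
    if gc in ("M", "E", "S"):        # condition 2<b> 4<a>: the GC level supplies the data
        return p.t_lc + p.l_net + p.t_gc + p.l_net_line + p.t_prc
    if third == "E":                 # condition 2<b> 4<b> 6<a>
        return (p.t_lc + p.l_net + p.t_lookup_gc + 2 * p.l_net + p.t_lc
                + p.t_gc + p.l_net_line + p.t_prc)
    if third == "M":                 # condition 2<b> 4<b> 6<b>: the third party supplies the data
        return (p.t_lc + p.l_net + p.t_lookup_gc + p.l_net + p.t_third
                + p.l_net_line + p.t_prc)
    raise ValueError("state combination not covered by Table 4.2")


# Purely hypothetical costs, in clock cycles.
p = Params(t_lc=1, t_prc=2, t_gc=10, t_lookup_gc=3, t_third=2, l_net=5, l_net_line=8)
```

With these assumed costs, `read_latency("I", "-", "E", p)` is larger than `read_latency("I", "-", "M", p)`, because in the first case the data are still supplied by the GC level after the round trip to the third entity.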
4.3.2 Writing Operations
Estimating the latency of writing operations is also important for two main reasons:
• with a write-allocate policy, when the data is not present in the PE, actions similar to those required for reading operations have to be carried out before the actual write can be performed;
• even though some operations may often be considered to have negligible latency (e.g., because they are executed in parallel and overlapped with others), these actions may have to be serialized to ensure the correctness of a parallel program.
For simplicity, we now assume an invalidation-based coherency protocol. Similar considerations can be made for update-based protocols, with the obvious adjustments (e.g., a write request on shared copies involves update requests to the sharers instead of invalidation requests). Therefore, consider what happens when PEi executes a store instruction:
1. Pi asks PrCi to write the data
2. PrCi checks the local state (LC)
   〈a〉 the local state is sufficient to satisfy the request (e.g., the corresponding cache line is present in the private cache in exclusive or modified state)
      3. PrCi replies to Pi sending the outcome
   〈b〉 the global state (managed by GC) has to be examined
      3. PEi sends an rfo_req request when the data is not present in PrCi, or a write_req request otherwise, to
         [AM1a] M through the interconnection network(s) (i.e., PE-ShC and M-ShC), and the request is forwarded to GC and M
         [AM1b] ShC through the PE-ShC interconnection network, and the request is forwarded to GC and ShC
         [AM1c],[AM2b] PrCh (note that i = h is possible, if i is the home node for the requested cache line) through the C2C interconnection network, and the request is forwarded to GCh and PrCh
         [AM2a] M through the M-PE interconnection network, and the request is forwarded to GC and M
      4. GC checks the global state
         〈a〉 the global state is sufficient to satisfy the rfo_req request (i.e., the data is up to date in the corresponding hierarchy level and it is not present in any of the other higher levels of the hierarchy)
            5. the data are sent back to PEi through the interconnection network(s)
               ([AM1a] the data are typically sent in parallel also to ShC, to keep the shared level up to date)
            6. PEi updates the local state and the PrCi, which performs the actual write and sends the outcome to Pi
         〈b〉 a third entity has to be involved to satisfy the rfo_req request (e.g., PrCj)
            5. GC forwards rfo_req through the interconnection network(s) in order to invalidate that copy
               ([AM1a] the third entity could be the ShC or the PrCj and, depending on the actual implementation of the interconnection network(s), the required communications can be done in parallel)
            6. the global state is compared to the actual state of the cache line
               〈a〉 the global state is up to date
                  7. the third entity invalidates its copy and notifies GC through the interconnection network(s)
                  8. the data are sent to PEi through the interconnection network(s) by
                     [AM1a],[AM2a] M
                     [AM1b] ShC
                     [AM1c],[AM2b] PrCh
                  9. PEi updates the local state and the PrCi, which performs the actual write and sends the outcome to Pi
               〈b〉 the global state is not up to date or, anyway, the third party involved is responsible for sending the data
                  7. the data are sent to PEi directly by
                     [AM1a] ShC through the PE-ShC interconnection network, or PrCj through the C2C interconnection network
                     [AM2a] PrCj through the C2C interconnection network
                     [AM1b],[AM1c],[AM2b] PrCj through the C2C interconnection network, or M through the interconnection network(s)
                     and, in parallel, the third party invalidates its copy and notifies GC through the interconnection network(s)
                  8. PEi updates the local state and the PrCi, which performs the actual write and sends the outcome to Pi
         〈c〉 some other entities have to be involved to satisfy an rfo_req or a write_req request (e.g., PrCj and PrCk)
            5. GC forwards the request through the interconnection network(s) in order to invalidate those copies
            6a. PrCj and PrCk invalidate their copies and notify GC through the interconnection network(s)
            (6b.) for rfo_req the data are sent back to PEi through the interconnection network(s)
            7. PEi updates the local state and the PrCi, which performs the actual write and sends the outcome to Pi
With this analysis we can estimate the base access latency of a writing operation in the various cases. As in the reading case, in Table 4.3 each line corresponds to one of the possible situations described above. Respectively, they estimate the latencies of conditions 2〈a〉, 2〈b〉 4〈a〉, 2〈b〉 4〈b〉 6〈a〉, 2〈b〉 4〈b〉 6〈b〉 and 2〈b〉 4〈c〉.
Regarding the various latencies, we also use L_inv(n_sh) to represent the latency of n_sh invalidations. For example, the invalidation of a single private cache copy can be estimated as L_inv(1) ≈ 2 · L_net(1) + T_PrC. When n_sh > 1, invalidation latencies can overlap with each other. In some cases, we use square brackets to mean that the enclosed term can be partially or totally overlapped.
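Since the degree of overlap among invalidations is not fixed by the model, a simple way to reason about L_inv(n_sh) is to bound it. The following Python sketch computes L_inv(1) as stated above and then a lower and an upper bound for n_sh ≥ 1 copies: full overlap versus full serialization. These two bounds, the function names and the numeric values are our assumptions for illustration; they are not part of the cost model itself.

```python
def l_inv_one(l_net: float, t_prc: float) -> float:
    """L_inv(1) ~ 2 * L_net(1) + T_PrC: request, invalidation in the private cache, notification."""
    return 2 * l_net + t_prc


def l_inv_bounds(n_sh: int, l_net: float, t_prc: float) -> tuple:
    """(lower, upper) bounds for n_sh >= 1 invalidations.

    lower: all invalidations fully overlapped (assumed best case)
    upper: all invalidations fully serialized (assumed worst case)
    """
    if n_sh < 1:
        raise ValueError("n_sh must be at least 1")
    one = l_inv_one(l_net, t_prc)
    return one, n_sh * one


# Hypothetical costs in clock cycles: L_net = 5, T_PrC = 2.
single = l_inv_one(5, 2)
lower, upper = l_inv_bounds(4, 5, 2)
```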
In this case we focused on invalidation-based coherency protocols, and the Intel Sandy Bridge still remains a good example of the abstract model [AM1b]. By contrast, the Tilera TilePro64 implements a hybrid approach in which PrCi always updates PrCh on write operations, while PrCh remains responsible for the invalidations. Referring to the above analysis and to Table 4.3, as in the reading case, condition 6〈a〉 never occurs, and the only third entity that could be involved (condition 6〈b〉) is the main memory, which sends the data to PEi through PEh, adding a delay of T_PrC to the latency. Finally, condition 2〈a〉 changes in the following way:
2. PrCi checks the local state (LC)
   〈a〉 the local state is sufficient to satisfy the request (e.g., the corresponding cache line is present only in the private cache and in the home PE)
      3. PEi sends an update_req request to PEh through the C2C interconnection network
      4. PEh updates PrCh and sends an acknowledgment to PEi through the C2C interconnection network
      5. PrCi replies to Pi sending the outcome
Therefore, the corresponding latency changes in the following way:
\[
L_{write}(E, -, -) = L_{net} + T_{PrC} + L_{net} + T_{PrC}
\]
Table 4.3: Writing Operations Latencies
L_write(LCstate, GCstate, Tstate)   Write-Allocate Base Latencies
L_write(M/E, −, −)                  T_PrC
L_write(I, M/E, −)                  T_LC + L_net + T_GC + L_net(σ) + T_PrC
L_write(I, −, E)                    T_LC + L_net + T_lookup-GC + 2 L_net + T_LC + T_GC + L_net(σ) + T_PrC
L_write(I, −, M)                    T_LC + L_net + T_lookup-GC + L_net + T_M/T_ShC/T_PrC + L_net(σ) + T_PrC
L_write(I, S(n_sh), −)              T_LC + L_net + T_lookup-GC + L_net(σ) + [L_inv(n_sh)] + T_PrC
L_write(S, S(n_sh), −)              T_LC + L_net + T_lookup-GC + L_inv(n_sh) + [L_net] + T_PrC
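The rows of Table 4.3 can be evaluated in the same way as the read latencies. In the following Python sketch the bracketed terms are scaled by `1 - hidden`, where `hidden` is the fraction of the term that is overlapped; this parameter, the function name and the numeric values are assumptions made for illustration only.

```python
def write_latency(lc, gc, third, *, t_lc, t_prc, t_gc, t_lookup_gc, t_third,
                  l_net, l_net_line, l_inv=0.0, hidden=0.0):
    """Base write-allocate latency following Table 4.3 ('-' = not specified).

    hidden: overlapped fraction (0..1) of the bracketed term of the last two rows.
    """
    visible = 1.0 - hidden
    if lc in ("M", "E"):                      # local write
        return t_prc
    to_gc = t_lc + l_net                      # local check plus request to the GC level
    if lc == "S" and gc == "S":               # write_req: only invalidations are needed
        return to_gc + t_lookup_gc + l_inv + visible * l_net + t_prc
    if lc == "I" and gc in ("M", "E"):        # rfo_req satisfied by the GC level
        return to_gc + t_gc + l_net_line + t_prc
    if lc == "I" and gc == "S":               # rfo_req with shared copies to invalidate
        return to_gc + t_lookup_gc + l_net_line + visible * l_inv + t_prc
    if lc == "I" and third == "E":            # third entity involved, data from the GC level
        return to_gc + t_lookup_gc + 2 * l_net + t_lc + t_gc + l_net_line + t_prc
    if lc == "I" and third == "M":            # third entity supplies the data
        return to_gc + t_lookup_gc + l_net + t_third + l_net_line + t_prc
    raise ValueError("state combination not covered by Table 4.3")


# Purely hypothetical costs, in clock cycles.
cost = dict(t_lc=1, t_prc=2, t_gc=10, t_lookup_gc=3, t_third=2,
            l_net=5, l_net_line=8, l_inv=12)
```

For example, `write_latency("I", "S", "-", **cost)` counts the invalidations in full, while passing `hidden=1.0` models the case in which they are completely overlapped with the line transfer.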
4.3.3 Reading and Writing Operations in Multiple-CMP Architectures
In multiple-CMP architectures, we need to consider the case in which the outer
protocol has to be involved in the reading and writing operations.
For the reading latency we have an additional case:
2. PrCi checks the local state (LC)
   〈b〉 the global state (managed by GC, which acts as GCin) has to be examined
      4. GC checks the global state
         〈c〉 the global state has no information: the outer protocol has to be involved
            5. GC forwards read_req through the interconnection network(s) to GCout_h, which can be
               - Mi, if i is also the home node for the cache line (no communications are needed in [AM1a])
               - Mh, with h ≠ i
            6. GCout_h checks the global state of the outer protocol
               〈a〉 the global state of the outer protocol is sufficient to satisfy the request (i.e., the data is up to date in the corresponding hierarchy level)
                  7. the data are sent back to PEi through the interconnection network(s)
                  8. PEi updates the local state and the PrCi, which forwards the data to Pi
               〈b〉 a third entity (e.g., PrCj of CMPh) has to be involved
                  7. GCout_h forwards read_req through the interconnection network(s)
                  8. the data are sent back to PEi through the interconnection network(s)
                  9. PEi updates the local state and the PrCi, which forwards the data to Pi
Therefore, the corresponding reading latencies for conditions 2〈b〉 4〈c〉 6〈a〉 and 2〈b〉 4〈c〉 6〈b〉 are respectively:
\[
\begin{aligned}
L_{read\text{-}rem}(I, M/E/S, -) = {} & T_{LC} + L_{net} + T_{lookup\text{-}GC} \\
& + P(i \neq h)\, T_{ext\text{-}net} + T_{GCout_h} \\
& + P(i \neq h)\, T_{ext\text{-}net}(\sigma) + L_{net}(\sigma) + T_{PrC}
\end{aligned}
\tag{4.1}
\]
and
\[
\begin{aligned}
L_{read\text{-}rem}(I, -, M/E) = {} & T_{LC} + L_{net} + T_{lookup\text{-}GC} \\
& + P(i \neq h)\, T_{ext\text{-}net} + T_{lookup\text{-}GCout_h} \\
& + P(h \neq j)\, T_{ext\text{-}net} + T_{M}/T_{ShC}/T_{PrC} \\
& + P(i \neq h \ \mathrm{OR}\ h \neq j)\, T_{ext\text{-}net}(\sigma) \\
& + L_{net}(\sigma) + T_{PrC}
\end{aligned}
\tag{4.2}
\]
where the GCstate argument of L_read-rem(LCstate, GCstate, Tstate) now represents the union of the global state information maintained by GCin and GCout. Moreover, P(i ≠ h) is the probability that the home node is not the requestor node, and P(h ≠ j) is the probability that the home node is not the owner node.
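Functions (4.1) and (4.2) are expected values over the placement of the home and owner nodes, and can be evaluated numerically. In the following Python sketch, P(i ≠ h OR h ≠ j) is derived from the two single probabilities under an independence assumption that is ours, not the model's; the dictionary keys, the function names and the numeric costs are likewise hypothetical.

```python
def p_either(p_ih: float, p_hj: float) -> float:
    """P(i != h OR h != j), assuming the two events are independent (our assumption)."""
    return 1.0 - (1.0 - p_ih) * (1.0 - p_hj)


def read_rem_home(p_ih: float, c: dict) -> float:
    """Function (4.1): the data are supplied by the level hosting GCout_h."""
    return (c["t_lc"] + c["l_net"] + c["t_lookup_gc"]
            + p_ih * c["t_ext"] + c["t_gcout"]
            + p_ih * c["t_ext_line"] + c["l_net_line"] + c["t_prc"])


def read_rem_owner(p_ih: float, p_hj: float, c: dict) -> float:
    """Function (4.2): the data are supplied by a third entity (the owner)."""
    return (c["t_lc"] + c["l_net"] + c["t_lookup_gc"]
            + p_ih * c["t_ext"] + c["t_lookup_gcout"]
            + p_hj * c["t_ext"] + c["t_third"]
            + p_either(p_ih, p_hj) * c["t_ext_line"]
            + c["l_net_line"] + c["t_prc"])


# Purely hypothetical costs, in clock cycles.
c = dict(t_lc=1, l_net=5, t_lookup_gc=3, t_prc=2, l_net_line=8,
         t_ext=40, t_ext_line=50,        # T_ext-net and T_ext-net(sigma)
         t_gcout=60, t_lookup_gcout=6,   # data access and lookup at GCout_h
         t_third=2)                      # T_M, T_ShC or T_PrC of the owner
```

With p_ih = 0 the external network terms vanish and (4.1) reduces to an access to the local home memory; with p_ih = 1 every request pays both the external request and the external line transfer.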
For the writing latency with the write-allocate policy, the same operations are required in case 2〈b〉, and the corresponding writing latencies change as before (functions 4.1 and 4.2).
Moreover, in order to invalidate all the possible copies, condition 2〈b〉 4〈c〉 changes in the following way:
〈c〉 some other entities have to be involved to satisfy an rfo_req or a write_req request (e.g., PrCj of CMPi and PrCk of CMPj)
   5. GC forwards the request to GCout_h through the interconnection network(s) and, in parallel, also forwards the request to the possible local copies (e.g., PrCj)
   6. GCout_h checks the global state of the outer protocol and, if needed, forwards the request in order to invalidate the copies
      (the request is forwarded through the interconnection network(s) to the specific CMP(s) if j ≠ k)
   7a. PrCj and PrCk invalidate their copies and notify GC and GCout_h through the interconnection network(s)
   (7b.) for rfo_req the data are sent back to PEi through the interconnection network(s)
   8. PEi updates the local state and the PrCi, which performs the actual write and sends the outcome to Pi
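Therefore, the corresponding writing latency changes in the following way, in the case of rfo_req:
\[
\begin{aligned}
L_{write\text{-}rem}(I, S, -) = {} & T_{LC} + L_{net} + T_{lookup\text{-}GC} \\
& + P(i \neq h)\, T_{ext\text{-}net} + T_{lookup\text{-}GCout_h} \\
& + P(h \neq j)\, T_{ext\text{-}net} + T_{M}/T_{ShC}/T_{PrC} \\
& + P(i \neq h \ \mathrm{OR}\ h \neq j)\, T_{ext\text{-}net}(\sigma) \\
& + L_{net}(\sigma) + [L_{inv}(n_{sh})] + T_{PrC}
\end{aligned}
\tag{4.3}
\]
while, in the case of write_req, the writing latency is:
\[
\begin{aligned}
L_{write\text{-}rem}(S, S, -) = {} & T_{LC} + L_{net} + T_{lookup\text{-}GC} \\
& + P(i \neq h)\, T_{ext\text{-}net} + T_{lookup\text{-}GCout_h} \\
& + P(h \neq j)\, T_{ext\text{-}net} + L_{inv}(n_{sh}) \\
& + P(i \neq h \ \mathrm{OR}\ h \neq j)\, T_{ext\text{-}net} \\
& + [L_{net}] + T_{PrC}
\end{aligned}
\tag{4.4}
\]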
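Functions (4.3) and (4.4) can be evaluated in the same spirit. The following Python sketch takes the three probabilities as explicit inputs and scales the bracketed term by `1 - hidden`, the overlapped fraction; the parameter names, the `hidden` parameter and the numeric costs are assumptions introduced for illustration.

```python
def write_rem_rfo(p_ih, p_hj, p_any, c, hidden=0.0):
    """Function (4.3), rfo_req; `hidden` is the overlapped fraction of [L_inv(n_sh)]."""
    return (c["t_lc"] + c["l_net"] + c["t_lookup_gc"]
            + p_ih * c["t_ext"] + c["t_lookup_gcout"]
            + p_hj * c["t_ext"] + c["t_third"]
            + p_any * c["t_ext_line"]
            + c["l_net_line"] + (1.0 - hidden) * c["l_inv"] + c["t_prc"])


def write_rem_upgrade(p_ih, p_hj, p_any, c, hidden=0.0):
    """Function (4.4), write_req; `hidden` is the overlapped fraction of [L_net]."""
    return (c["t_lc"] + c["l_net"] + c["t_lookup_gc"]
            + p_ih * c["t_ext"] + c["t_lookup_gcout"]
            + p_hj * c["t_ext"] + c["l_inv"]
            + p_any * c["t_ext"]
            + (1.0 - hidden) * c["l_net"] + c["t_prc"])


# Purely hypothetical costs, in clock cycles.
c = dict(t_lc=1, l_net=5, t_lookup_gc=3, t_prc=2, l_net_line=8,
         t_ext=40, t_ext_line=50,   # T_ext-net and T_ext-net(sigma)
         t_lookup_gcout=6,          # lookup at GCout_h
         t_third=2,                 # T_M, T_ShC or T_PrC of the entity supplying the data
         l_inv=12)                  # L_inv(n_sh)

rfo_local = write_rem_rfo(0.0, 0.0, 0.0, c)
rfo_remote = write_rem_rfo(1.0, 1.0, 1.0, c)
upg_local = write_rem_upgrade(0.0, 0.0, 0.0, c)
upg_remote = write_rem_upgrade(1.0, 1.0, 1.0, c)
```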
4.4 Benchmarks for Reading and Writing Latencies
An interesting study is reported in [52] where, by means of specific benchmarks, the authors measured the latency of read operations in two current multicore architectures (an Intel Nehalem and an AMD Shanghai processor). This work shows that load/store latency and bandwidth are affected by the cache coherence state of the line, as discussed at an abstract level in the previous section.
We decided to adapt the benchmark to the Intel SandyBridge and to the Tilera TilePro64 architectures, in order to have a quantitative evaluation of the access latencies derived in the previous part of this chapter. Of course, benchmark results do not strictly match the base access latencies, due to the use of multiple PEs in the systems, as we will discuss in Chapter 6.
Tables 4.4 and 4.5 show the benchmark results for memory read latencies for the Intel SandyBridge and the Tilera TilePro64 processors, respectively.
For the first architecture, we can easily notice that, in general, accessing a cache line in the shared state costs as much as reading private data (i.e., from the private level caches); the modified and exclusive states, on the other hand, increase the access latency. It should also be noted that in these measurements we always pay the cost of snooping which, although limited, could be avoided in an incoherent approach.
Source State L1 L2 L3 M
Local M/E/S 6 12
185
Modified 108 102
Same Chip Exclusive 90 39
Shared
Modified 304-312
280
Other Chip Exclusive 195-205
Shared 195
Table 4.4: Memory read latencies for core 0, depending on the cache line state, for Intel SandyBridge processors. Times in clock cycles.
For the Tilera TilePro64 platform, we of course expect an increased overhead, as we deal with complex interconnection networks and protocols. In this case we have a limited number of possibilities: if automatic cache coherence is enabled, because of the write-through mechanism between the PrC and the home node, a cache line will always be either shared or invalid. The real difference is therefore only whether cache coherence is enabled or not. We can notice that, with cache coherence enabled, the memory access time increases from 120 clock cycles to between 160 and 204 clock cycles (Table 4.5). The latency depends on the distance between the home node and the requestor node, but this is an overhead that goes from 33% up to 70% for each memory load.
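The quoted overhead range can be checked directly from the measured values: 120 clock cycles for an incoherent memory access and, from Table 4.5, 160 to 204 clock cycles for a coherent one. A minimal Python check (the variable names are ours):

```python
incoherent = 120          # memory access without cache coherence (clock cycles)
coherent = (160, 204)     # memory access with cache coherence, nearest and farthest home node

# Relative overhead of coherence for each memory load, in percent.
overheads = [round(100 * (c - incoherent) / incoherent) for c in coherent]
```

The two resulting values are 33 and 70, matching the range stated in the text.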
Of course, disabling cache coherence also has a cost: with automatic cache coherence, if another cache has the required value, we pay from 40 to 70 clock cycles, depending on the distance; in incoherent mode we have no way of knowing whether another cache has a copy, and must always go to memory, paying 120 clock cycles. In fact, the ability to perform cache-to-cache transfers is always presented as one of the many benefits of automatic cache coherence mechanisms; however, it should be noted that cache-to-cache transfers mostly depend on the program and on the cache sizes: for example, if the working sets of the processors do not fit in cache, the probability of exploiting cache-to-cache transfers is very limited. It should also be noted that, on this architecture, enabling automatic cache coherence in fact lowers the amount of second-level cache (GC) space available to each core, thus possibly further increasing performance losses.
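To make this trade-off concrete, consider a deliberately simplified model (our assumption, not a result of the benchmark): a load that misses in the local caches is served by another cache with probability p and by memory otherwise. Coherence pays off on average only when p · T_c2c + (1 − p) · T_mem,coh ≤ T_mem,inc, that is, when p ≥ (T_mem,coh − T_mem,inc) / (T_mem,coh − T_c2c). The following Python sketch evaluates this threshold for the extreme values reported above.

```python
def break_even_probability(t_c2c: float, t_mem_coherent: float, t_mem_incoherent: float) -> float:
    """Minimum cache-to-cache hit probability for which coherence is not slower on average.

    Simplified two-outcome model (an assumption): each local miss is served either
    by another cache (t_c2c) or by memory (t_mem_coherent).
    """
    return (t_mem_coherent - t_mem_incoherent) / (t_mem_coherent - t_c2c)


# Measured extremes in clock cycles: nearest and farthest home node.
p_near = break_even_probability(t_c2c=40, t_mem_coherent=160, t_mem_incoherent=120)
p_far = break_even_probability(t_c2c=70, t_mem_coherent=204, t_mem_incoherent=120)
```

Under this model, roughly one third of the local misses must be served cache-to-cache in the most favourable case, and more than 60% in the least favourable one, before coherence becomes convenient for loads.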
Source L1 L2 M
Local
2 8 120
Incoherent
Coherent
HOME
40-70 160-204
Table 4.5: Memory read latencies for core 0, on the Tilera TilePro64 architecture,
with or without cache coherence. Times in clock cycles.
In [91], a similar benchmark has been executed on the Intel Xeon Phi architecture [62].
Finally, we adapted the benchmark to writing operations in order to measure the synchronous implementation of the required actions. As we will see in the next chapter, in order to solve memory ordering problems in architectures with weak memory consistency models, the run-time support (and thus the cost model) needs to consider the use and the performance impact of the synchronous semantics of writing operations.
Tables 4.6 and 4.7 show the benchmark results for the memory write latencies for the Intel SandyBridge and the Tilera TilePro64 processors, respectively.
Source State L1 L2 L3 M
Modified
9 15 39
185
Local Exclusive
Shared 80 83
Modified 102 97 44
Same Chip Exclusive 86
Shared 83-95
Modified 228 225 190
280
Other Chip Exclusive 213-215
Shared 233-285
Table 4.6: Memory write latencies for core 0, depending on the cache line state, for
Intel SandyBridge processors. Times in clock cycles
Notably, for the Intel SandyBridge architecture, considerations analogous to those made for read latencies apply to the write cases. Very important is the impact of the invalidations required by the cache coherence protocols, which is a function of the number of copies shared among the PEs. Fortunately, we will see in the next chapter that, with Structured Parallel Programming and proper process/thread mapping strategies, we are able to minimize this number.
Similar results are obtained for the Tilera TilePro64 platform, distinguishing the cases with and without the write-allocate policy.
4.5 Summary
In this chapter we study the impact of different automatic cache coherence solutions
by reasoning on an abstract model which is able to capture all the essential infor-
mation of CMP-based architectures and the specific cache coherence protocols.
Source L1 L2
M
wRead w/oRead
Local
Exclusive 5-28
174-219 57-86
Shared 12-292
HOME
Exclusive 45-73
Shared 52-332
Local = HOME
Exclusive 2 7
226-274 121-147
Shared 7-285
Table 4.7: Memory write latencies for core 0, depending on the cache line state, for
Tilera TilePro64 processors. Times in clock cycles
Notably, we provide a simple and effective way to evaluate the read and write mem-
ory operation latencies according to the specific cache coherence solution adopted by
different abstract architecture models, by studying the impact on both single-CMP
and multiple-CMP systems.
We finally used benchmark results for the evaluation of read and synchronous write
latencies, in order to have a first validation of our cost model.
Of course, as discussed in the Introduction of the thesis, these base latencies
constitute the first step in the evaluation of the under-load memory latencies, which
depend on the execution of the parallel program on the target architecture. This
evaluation is studied in the following two chapters.
CHAPTER 5
Parallel Paradigms and Cache Coherence
Automatic cache coherence solutions allow programmers to develop programs without
taking into account the coherency problem. In fact, no explicit coherency operations
need to be inserted in the program by the application programmer. We are
interested in evaluating the cost of these solutions and in understanding how to
improve the performance of parallel programs with different approaches to the cache
coherence problem. We know, from the previous chapter, that for each abstract model we
have specific read/write latencies that are strictly correlated to the coherency state
of data. In this chapter, we use these base latencies as a first approximation of the
under-load memory latencies, which will be derived according to the methodology
presented in the next chapter. Even though they are only base latencies, in
this chapter we reason about how they affect the performance of the parallelism
forms defined by a structured parallel programming environment. Notably, in the
first part of this chapter (Section 5.1) we analyze different implementations of the
farm and data-parallel paradigms in order to evaluate the cost of the various
operations executed on the run-time support data structures.
In Section 5.2, in the same way, we use the latencies derived in the previous
chapter to evaluate the cost of the cooperation and synchronization mechanisms
used in the implementations of parallel paradigms.
Finally, in Section 5.3 we analyze the impact of cache coherence (always in terms
of the base memory and cache access latencies) on the performance of parallel
programs which use lock-free data structures for communications between modules.
5.1 Recognizing Cache Coherence Patterns in Parallel Paradigms
We know, from the previous chapter, that for each abstract model of cache coherent
architectures we have specific read/write latencies that are strictly correlated to the
coherency state of data. Therefore, we can study well-known parallel pattern
implementations and identify specific cache coherence patterns. In fact, knowing the
structure of the parallel application and the mechanisms used in its implementation
(e.g., shared memory or message passing model), we are able to
• identify the data structures that must be kept coherent
• use the latencies derived from the abstract models and apply them to the
various parallel pattern implementations.
We now consider farm and data-parallel computations. For the time being, we
focus on the computation data, postponing the study of the synchronization data.
In particular, for each parallel pattern we compare three different implementation
models, depending on the way in which interactions between modules are expressed.
The first one is a shared-variable environment in which processes cooperate by
sharing data structures. Shared objects are used both for computation data and for
synchronization mechanisms (e.g., locks and condition variables). The second one
also uses shared memory mechanisms, but in this case synchronizations are
implemented by passing pointers between producer and consumer modules.
The messages exchanged by the various entities are actually references to the data,
thus this is not a local-environment model as in the message-passing approach. The
third one is message passing: here we map modules onto processes and interactions
between modules onto communication channels used to send and receive the data.
Therefore, shared data belong only to the run-time support data structures.
In the following, we use the term PrCp as a synonym for the private cache level of
the PE where the process module p is allocated.
5.1.1 Farm
Farm computations can be implemented according to several strategies, that can be
mapped to each of the implementation models discussed above. In Figure 5.1 we
show the most used strategies:
Master-Slave (M-S) Stages 1 and 3 of the farm are both mapped onto the
same master module, which is responsible for: (1) collecting input elements; (2) selecting
a worker for scheduling; (3) delivering the input element to the selected worker; (4)
selecting a worker to collect a result; (5) collecting a result from the selected worker;
(6) delivering the collected result to the output stream.
Figure 5.1: Farm implementation strategies: (a) Master-Slave; (b) Emitter-Worker-Collector
Emitter-Worker-Collector (E-W-C) Each stage of the farm is mapped onto a
different module: an emitter (E), onto which we map stage 1; a set of workers (W),
onto which we map stage 2; and a collector (C), onto which we map the last stage, stage 3.
Consider now the following farm computation, in which the function F is applied
to the i-th element of the input stream (in_task) and produces the i-th element of
the output stream (out_res):
    while (true) {
        out_res = F(in_task, ...);
        in_task = getNextTask(...);
    }
To be as generic as possible, consider that in_task and out_res could be small,
e.g. a small set of integers, or bigger, like matrices of real numbers. In addition,
the function F may need other data to compute out_res.
Starting from this scenario, we can study the possible implementations.
Shared-variable implementation of Farm computations A shared-variable
environment is often linked to an M-S strategy in which streams are implemented as
shared queues, on which the Master (M) and the Slaves {S0, ..., SN−1} execute get and
put operations atomically.
For each in_task, M performs the put operation and, more specifically, the following
actions:
1. it refers to the next available element tqi of the task queue used to distribute the
input elements to the workers
2. it modifies tqi by adding the address of the in_task
These actions correspond to specific cache coherency states and transitions:
1. the cache line that includes tqi is held in PrCM
2. the coherency state of that cache line changes (if necessary), for example, to
Modified ; hence, PrCM holds the most recent copy of the data
Observation 1. After performing action 2, M no longer needs to use the data
just modified. It will modify tqi again only after performing tqsize − 1 further put
operations.
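The put and get actions described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the run-time support analysed in the thesis: the names TQ_SIZE, task_queue, tq_put and tq_get are hypothetical, the queue stores only the addresses of the tasks, and the mutual exclusion that makes put and get atomic is omitted. The index returned by tq_put shows that the same slot is rewritten only after TQ_SIZE − 1 other puts.

```c
#include <stddef.h>

#define TQ_SIZE 4                /* hypothetical queue length (tq_size) */

/* Shared task queue: each slot holds only the address of an in_task. */
typedef struct {
    void  *slot[TQ_SIZE];
    size_t head;                 /* next slot written by the Master */
    size_t tail;                 /* next slot read by a Slave       */
    size_t count;                /* number of filled slots          */
} task_queue;

/* put: the Master writes slot tq_i, so the cache line holding it is
 * expected to become Modified in PrC_M. Returns the index written,
 * or -1 if the queue is full. Locking is deliberately left out. */
static int tq_put(task_queue *q, void *in_task)
{
    if (q->count == TQ_SIZE)
        return -1;
    size_t i = q->head;
    q->slot[i] = in_task;
    q->head = (i + 1) % TQ_SIZE;
    q->count++;
    return (int)i;
}

/* get: a Slave reads the slot last written by the Master; in the worst
 * case that line is still Modified in PrC_M. Returns NULL if empty. */
static void *tq_get(task_queue *q)
{
    if (q->count == 0)
        return NULL;
    void *task = q->slot[q->tail];
    q->tail = (q->tail + 1) % TQ_SIZE;
    q->count--;
    return task;
}
```

Storing addresses rather than task bodies keeps each slot small, which is exactly why the slot read by a Slave is likely to sit in a line that the Master has just written.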
When Si performs the get operation on the task queue, it actually reads the modified
tqi and then applies F to in_task.
Since tqi is small (it holds only the address of in_task), that read operation could
refer to a cache line which is still in PrCM. As we saw in Chapter 4, this operation
incurs the highest latency cost for Si.
Cost Model 5.1.1 (Shared-variable Farm). In the worst case, the cost of a get
operation executed by a process Si to access the next in_task is
$$T_{get} = T_{read}(I,-,M) + T_{synch}$$
where $T_{synch}$ is the cost of the synchronization mechanisms used in the get
operation.
Moreover, some read operations on in_task during the computation of F
may incur the same cost. In fact, M could itself produce the input stream or,
in turn, receive the data from another module and possibly modify them. This
can leave part of, or the entire, in_task data structure still in PrCM.
Even when M uses in_task in a read-only way, PrCM may have to
be involved in the read operations (Tread(I,−,E)) required by F. Therefore, the
following observation is valid.
Observation 2. Read operations related to in_task may involve the use of coherency
protocols even if these are not actually necessary.
If M is also responsible for collecting the results from Si, then all the
considerations made on the put and get operations can be dually applied to out_res and
the corresponding shared queue of results. Here we know that the output is
produced by each Si, so when M uses the results the corresponding read operations
incur the worst-case cost of reading the modified data from PrCi (Tread(I,−,M)).
Pointer-passing implementation of Farm computations In a pointer-passing
solution, an explicit Emitter process holds the input stream queue and transmits
the pointer (reference) to a queue element to a Worker through a short message,
so that the Worker can address the input data structure without copying it and
without additional synchronizations. The dual solution is applied to the
Worker-Collector cooperation.
For each in_task, E puts the message (msg) that contains the input data reference,
while the selected Wi obtains it with the get operation. In the same way, for each
out_res, Wi puts the message that contains the output data reference and C obtains
it with the get operation. Therefore, all the considerations made above on tqi
can be applied to msg (and dually to the output data and the results queue), with the
same consequences for the coherency actions and costs.
Message-passing implementation of Farm computations In message-passing
solutions, modules (E/M, Si/Wi and C) are mapped onto processes which work
in a local environment, and their interactions are implemented as send and receive
operations on communication channels. Thus, in_task and out_res are passed as
messages on communication channels.
Consider an E-W-C strategy and an optimized implementation of the message-passing
primitives (e.g., a “zero-copy” communication, which reduces the number of
message copies to one [107]). For each in_task, E sends the message, which in this
case is exactly the in_task data structure, to Wi through the task_ch communication
channel. With the send primitive, E performs the following actions:
1. it uses the next available target variable vtgi, that is, the next available position
in the buffer of the corresponding communication channel (i.e., the channel
used by E to distribute tasks to Wi)
2. it copies in_task into vtgi
Therefore, as in the previous cases for tqi, these actions correspond to
specific cache coherency states and transitions:
1. the cache line(s) that include(s) vtgi is (are) held in PrCE
2. the coherency state of that (those) cache line(s) is, for example, Modified;
hence, PrCE holds the most recent copy of vtgi
Observation 1 is still valid and can be applied to vtgi.
When Wi uses the in_task obtained with the receive primitive, it actually reads the
modified vtgi, and the following observation is valid.
Observation 3. In message passing solutions, Wi works only on vtgi, applying F
directly to it. Hence, coherency operations have to be applied only on vtgi which is
the only data shared between E and Wi.
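As an illustration of the single-copy send and of the worker operating directly on the target variable, the following C sketch models a channel as a small buffer of target variables. It is a simplified, single-threaded model under assumptions: the names channel, ch_send, ch_receive, CH_SLOTS and TASK_WORDS are hypothetical, synchronization is omitted, and the slot is released already at receive time, whereas a real run-time support would release it only after F has finished with it.

```c
#include <stddef.h>
#include <string.h>

#define CH_SLOTS   2             /* hypothetical number of target variables */
#define TASK_WORDS 16            /* hypothetical task size, in ints         */

typedef struct {
    int    vtg[CH_SLOTS][TASK_WORDS];  /* target variables (channel buffer) */
    size_t next_send;                  /* slot E fills next                 */
    size_t next_recv;                  /* slot W_i consumes next            */
    size_t full;                       /* number of filled slots            */
} channel;

/* send: E copies in_task into vtg_i (the only copy of the message).
 * The lines written are expected to end up Modified in PrC_E.
 * Returns -1 where a real send would block on a full channel. */
static int ch_send(channel *ch, const int *in_task)
{
    if (ch->full == CH_SLOTS)
        return -1;
    memcpy(ch->vtg[ch->next_send], in_task, sizeof ch->vtg[0]);
    ch->next_send = (ch->next_send + 1) % CH_SLOTS;
    ch->full++;
    return 0;
}

/* receive: W_i obtains a pointer to vtg_i and applies F directly to it,
 * so the coherence misses are paid while F reads the target variable. */
static const int *ch_receive(channel *ch)
{
    if (ch->full == 0)
        return NULL;
    const int *v = ch->vtg[ch->next_recv];
    ch->next_recv = (ch->next_recv + 1) % CH_SLOTS;
    ch->full--;
    return v;
}
```

Because the worker only ever touches the slot returned by ch_receive, the target variable is the only data shared between the two processes in this model.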
Therefore, the read latency is paid during the F computation and, even in this
case, the read operation could refer to cache line(s) which is(are) still in PrCE.
Cost Model 5.1.2 (Message-passing Farm). In the worst case, the cost for a
process Wi to access the next in_task is
$$T_{vtg} = T_{read}(I,-,M) \cdot N_{lines}$$
where $N_{lines}$ is the number of cache lines used for vtg.
(Note that in this case the synchronization cost $T_{synch}$ is paid during the receive
operation.)
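A worked instance of this cost model can be sketched in C. The helper names n_lines and t_vtg are hypothetical, and the sketch assumes that the target variable starts on a cache-line boundary; the latency passed in is whatever value of Tread(I,−,M) applies to the architecture at hand.

```c
#include <stddef.h>

/* Number of cache lines touched by `bytes` bytes, assuming the buffer
 * starts on a line boundary (an assumption of this sketch). */
static size_t n_lines(size_t bytes, size_t line_size)
{
    return (bytes + line_size - 1) / line_size;
}

/* Cost Model 5.1.2: T_vtg = T_read(I,-,M) * N_lines, in clock cycles. */
static unsigned long t_vtg(unsigned long t_read_i_m,
                           size_t bytes, size_t line_size)
{
    return t_read_i_m * (unsigned long)n_lines(bytes, line_size);
}
```

For example, with a purely illustrative latency of 100 cycles, a 256-byte task on 64-byte lines gives t_vtg(100, 256, 64), i.e. four worst-case line reads.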
All the considerations made on the E-Wi interaction can be dually applied to
out_res and the corresponding Wi-C communication.
5.1.2 Data-Parallel: Map
The previous analysis can be extended to data-parallel computations on streams
and/or on single data values. As with farm computations, data-parallel computations
can be implemented according to several strategies. We consider an IN-W-OUT
implementation (IN and OUT are simply more generic names for the E and C modules),
which is able to describe most of the functionalities required in this type of computation.
In data-parallel computations, IN distributes each input element among the set of
workers according to proper collective communications (e.g., scatter or multicast),
while OUT is responsible for collecting the worker results with a gather or reduce
operation.
In map computations, workers are fully independent, that is, each of them applies a
sequential computation to its own local data only.
Consider now the two following map computations, in which the function F is
applied to all the elements of the input A and produces, respectively, a new (large)
data structure output B or a more synthetic and smaller result x.
The distinction between the two types of computations is useful only to distin-
guish the size of the data exchanged between Wi and OUT.
In both cases the considerations made for farm computations remain valid. In fact,
from the point of view of the cache coherence we have the same “producer-consumer”
behaviour between IN-Wi and Wi-OUT.
    int A[N], B[N], x;
    for (i = 0 .. N-1) {
        B[i] = F(A[i]);
    }

    for (i = 0 .. N-1) {
        x = F(A[i]);
    }
Shared-variable and pointer-passing implementations of Map computations
When IN performs a scatter distribution, each Wi reads the (modified) input
task (i.e., the initial address of the appropriate partition of the data structure)
when it performs a get on the shared queue.
When IN performs a multicast distribution, instead, it merely puts the same address
into all the shared queues.
Regarding the coherency operations related to the Wi-OUT interaction, as above,
the smaller the output, the greater the chance that both the address and the
data itself are still in PrCWi. Therefore, Cost Model 5.1.1 is also valid for map
computations.
Message-passing implementation of Map computations When IN performs
a scatter distribution, each Wi reads the (modified) target variable, which contains
its partition of the input task, during the F computation.
When IN performs a multicast distribution, instead, it sends the same message to
each worker.
As for farm computations, the same considerations on the target variable also
apply to the Wi-OUT interaction, and Cost Model 5.1.2 is also valid for map
computations.
5.1.3 Data-Parallel with Stencil
In order to apply its function, a worker may need to access data contained in other
worker partitions, according to the particular data dependencies imposed by the
computation semantics. In this case we speak about stencil-based computations,
where a stencil is a data dependence pattern implemented by information exchanges
between different workers.
We consider the case in which the function application is iterated for a given
(possibly unknown) number of steps. A worker can communicate with other workers
to obtain their previous results for its local computation. For example, we can
functionally describe the k-th application of a given function F on the i-th position
of an array A in the following way:
$$A^{k}[i][j] = F(A^{k-1}[i][j],\ A^{k-1}[i-1][j],\ A^{k-1}[i][j-1],\ A^{k-1}[i+1][j],\ A^{k-1}[i][j+1])$$
In this example, to compute the k-th value of A[i][j] we need the values of the
previous iteration (k−1) of the same element and of its “neighbors”. If the latter
values are assigned to another worker, then:
• Wi and Wi+1 share their border values in a shared-variable implementation;
• a communication is required between Wi and Wi+1 in message-passing and
pointer-passing implementations.
Shared-variable implementation of Stencil-based computations As shown
in the pseudo-code of Figure 5.2, at each step of the computation Wi uses two data
structures: one (Ain) containing the results computed at the previous step and the other
(Aout) used to store the results of the current step. In order to keep the data
consistent across the various steps, a synchronization among workers is required. It
may be implemented by locking mechanisms or by global barriers.
    int Ain[N][N], Aout[N][N];
    ...
    while (!done) {
        for (i = myMin .. myMax) {
            for (j = 0 .. N-1) {
                Aout[i][j] = F(Ain[i][j], Ain[i-1][j], Ain[i][j-1], Ain[i+1][j], Ain[i][j+1]);
            }
        }
        done = computeEndCondition(...);
        if (!done)
            swap(Ain, Aout);
        BARRIER();
    }
    ...
Figure 5.2: Pseudo-code for stencil computation in shared-variable implementation
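One step of the pseudo-code above can be written as compilable C along the following lines. This is a sketch under assumptions rather than the thesis implementation: GRID_N, F, stencil_step and swap_grids are hypothetical names, F is an arbitrary placeholder (the sum of the five values), boundary cells are simply copied because they lack a full neighbourhood, and the barrier between steps is left out.

```c
#define GRID_N 4   /* hypothetical grid size */

/* Placeholder F: sum of the element and its four neighbours. */
static int F(int c, int up, int left, int down, int right)
{
    return c + up + left + down + right;
}

/* One step over rows [my_min, my_max] of the worker's partition.
 * Rows my_min-1 and my_max+1 may belong to other workers: reading
 * them is where the coherence misses on modified data occur. */
static void stencil_step(int in[GRID_N][GRID_N], int out[GRID_N][GRID_N],
                         int my_min, int my_max)
{
    for (int i = my_min; i <= my_max; i++) {
        for (int j = 0; j < GRID_N; j++) {
            if (i == 0 || i == GRID_N - 1 || j == 0 || j == GRID_N - 1)
                out[i][j] = in[i][j];      /* boundary cell: copied */
            else
                out[i][j] = F(in[i][j], in[i - 1][j], in[i][j - 1],
                              in[i + 1][j], in[i][j + 1]);
        }
    }
}

/* Swap the roles of the two grids by exchanging pointers, not data. */
static void swap_grids(int (**x)[GRID_N], int (**y)[GRID_N])
{
    int (*tmp)[GRID_N] = *x;
    *x = *y;
    *y = tmp;
}
```

Swapping pointers instead of copying the grids means that the lines written as output in one step are exactly the lines read as input in the next, which is the producer-consumer pattern discussed in the text.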
Consider now a stencil computation from the point of view of cache coherence,
remaining as generic as possible with respect to the kind of stencil computation.
At step k, Wi uses Ain elements to compute Aout elements. However, the Ain elements
were actually modified at step k−1, possibly by another worker. This
means that Wi acts as a consumer that reads the required data by accessing the
partition of the producer worker Wj. Therefore, at each step Wi performs a certain
number (depending on the stencil structure) of read operations on data that were
modified (and thus invalidated in the other caches) by Wj during the previous step.
Depending on the stencil and on the organization of the computation (e.g., priority
to local value calculation), those data may still be held in PrCj.
Cost Model 5.1.3 (Shared-variable Stencil). In the worst case, the cost of the read
operations on the stencil data by a process Wi is
$$T_{stencil} = T_{read}(I,-,M) \cdot N_{stencil-lines}$$
where $N_{stencil-lines}$ is the number of cache lines used for the stencil data.
Message-passing and pointer-passing implementations of Stencil-based
computations In the message-passing approach, stencil-based computations are
typically structured in two phases. As shown in the pseudo-code of Figure 5.3, each
worker first performs the communications required by the stencil and then applies
the function to its partition. Here, communications also act as synchronizations
across the various computation steps.
    int Ain[N][N], Aout[N][N];
    ...
    while (!done) {

        /* exchange border rows with the previous (myId-1) and next (myId+1) worker */
        if (myId != 0)    then send(Ain[1][*], N, myId-1);
        if (myId != nw-1) then send(Ain[g][*], N, myId+1);
        if (myId != 0)    then receive(Ain[0][*], N, myId-1);
        if (myId != nw-1) then receive(Ain[g+1][*], N, myId+1);

        for (i = 1 .. g-1) {
            for (j = 0 .. N-1) {
                Aout[i][j] = F(Ain[i][j], Ain[i-1][j], Ain[i][j-1], Ain[i+1][j], Ain[i][j+1]);
            }
        }
        done = computeEndCondition(...);
        if (!done) then
            swap(Ain, Aout);
    }
    ...
Figure 5.3: Pseudo-code for stencil computation in message-passing implementation
From the point of view of cache coherence, message passing still confines the
coherency operations for the stencil data to the corresponding target variables.
Therefore, we can evaluate the cost of the stencil communications as follows.
Cost Model 5.1.4 (Message-Passing Stencil). In the worst case, the cost of the read
operations on the stencil data by a process Wi is
$$T_{stencil} = T_{vtg} \cdot N_{stencil-vtg}$$
where $N_{stencil-vtg}$ is the number of receive communications required by the stencil.
If communication among workers is implemented by pointer passing, then the
solution is similar to classical message passing. However, as we saw for the E-Wi
and Wi-C interactions, cache coherency operations may be required both for
the references to the stencil data and for the stencil data itself (with a possible
additional cost of Tread(I,−,M/E)).
5.2 Synchronization Issues in Shared Memory Systems
As we saw in the previous section, parallel paradigm implementations need proper
mechanisms for processor/thread cooperation and synchronization. Synchronization
operations typically rely on some atomic read-modify-write hardware primitives, in
which the value of a memory location is read, modified and written back atomically,
without intervening accesses to the location by others.
The focus of this section is on how synchronization operations are implemented in
cache coherent systems, paying attention to memory ordering issues. In particular,
we describe the implementation of mutual exclusion through lock-unlock pairs and
of global event synchronization through barriers. We also consider the consequences
of the use of lock-free approaches, based on the classic work of Lamport [69]. These
solutions provide concurrent algorithms on specific data structures which solve
mutual exclusion without locking mechanisms.
5.2.1 Mutual Exclusion
Two threads in the same address space, or two processes in shared memory archi-
tectures, accessing common resources must synchronize their behavior in order to
avoid wrong or unpredictable behavior. The period of exclusive access is referred
to as a critical section, which is enforced through mutual exclusion implying that
only one entity (thread or process) at a time is able to execute this critical section
of code.
Mutual exclusion is ensured by enclosing the indivisible sequence of actions between
two synchronization operations, called lock and unlock, which are implemented using
a wide range of algorithms.
CAS-based locking
A typical solution to the lock problem is to have a single shared variable v, where
the lock is acquired by using an atomic read-modify-write instruction such
as test_and_set or compare_and_swap. Using test_and_set on v, the value in the
memory location M which stores v is read into a specified register of the PE that
performs the instruction, and the constant 1 is atomically stored into the location M
if the value read is 0. As shown in the pseudo-assembly code of Listing 5.1,
the lock implementation keeps trying to acquire the lock using test&set instructions,
until one of them returns zero, indicating that the lock was free when tested.
The unlock construct simply sets the location associated with the lock to 0, indicating
that the lock is now free and enabling a subsequent lock operation by any process
to succeed. More sophisticated variants of such atomic instructions exist, e.g.
swap-based or fetch&op instructions.
    lock:   t&s register, location   // copy location to register and,
                                     // if 0, set location to 1
            bnz register, lock       // compare old value with 0;
                                     // if not 0, try again
            ret

    unlock: st location, #0          // write 0 to location
            ret
Listing 5.1: Lock and unlock implementation with the test-and-set instruction
From the point of view of cache coherence, every attempt to check whether the lock
is free to be acquired, whether successful or not, generates a write operation (to write
the value 1) to the cache line of the lock variable. Since this line is typically held in
another cache, e.g. PrCj because Pj wrote it when executing its own test_and_set,
each write sends an invalidation request through the interconnection network to
invalidate the previous owner of the block. Moreover, when Pi, which is executing
the critical section, performs the unlock operation, it refers to a location that is
modified in another private cache, e.g. PrCj. Therefore, the write operation incurs
the worst-case cost.
We can estimate the cost of lock and unlock operations in terms of the read and
write operation latencies from Chapter 4.
$$T^{max}_{LOCK} = T_{write}(I,-,M)$$
$$T^{max}_{UNLOCK} = T_{write}(I,-,M)$$
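The test-and-set lock of Listing 5.1 can be sketched in portable C11 with the standard atomic_flag type. The names tas_lock, tas_try_lock, tas_lock_acquire and tas_unlock are hypothetical; the point of the sketch is that every attempt, successful or not, is an atomic read-modify-write and therefore a write to the shared cache line.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag held; } tas_lock;   /* clear = free, set = taken */

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One test_and_set attempt. It always writes the line, even when the
 * lock turns out to be taken, so it always triggers invalidations. */
static bool tas_try_lock(tas_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* lock: keep issuing test_and_set until the old value was "free". */
static void tas_lock_acquire(tas_lock *l)
{
    while (!tas_try_lock(l))
        ;   /* busy-wait: each iteration is a write to the shared line */
}

/* unlock: a plain store of the "free" value with release ordering. */
static void tas_unlock(tas_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock is declared as `tas_lock l = TAS_LOCK_INIT;` and the critical section is enclosed between tas_lock_acquire(&l) and tas_unlock(&l).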
Both lock and unlock operations can take advantage of the reuse of the same
location: the lock operation when consecutive lock/unlock operations are performed
by the same process, and the unlock operation when no other process has executed a
lock operation during the execution of the critical section. We denote by pFREE the
probability of finding the lock free during the lock operation. Therefore, the unlock
operation and consecutive lock operations have the following costs.
$$T^{min}_{LOCK} = p_{FREE} \cdot T_{write}(M,-,-) + (1 - p_{FREE}) \cdot T_{write}(I,-,M)$$
$$T^{min}_{UNLOCK} = T_{write}(M,-,-)$$
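The expected lock cost above is a weighted average of a local write hit and a write to a line modified elsewhere, which the following C helper makes explicit. The function name t_min_lock and its parameter names are hypothetical, and the latencies used in any call are placeholders to be replaced by the measured values of Twrite(M,−,−) and Twrite(I,−,M) for the target architecture.

```c
/* T^min_LOCK = p_FREE * T_write(M,-,-) + (1 - p_FREE) * T_write(I,-,M)
 * p_free          probability of finding the lock free (0.0 .. 1.0)
 * t_write_local   latency of a write hit on a locally Modified line
 * t_write_remote  latency of a write to a line Modified in another cache */
static double t_min_lock(double p_free,
                         double t_write_local, double t_write_remote)
{
    return p_free * t_write_local + (1.0 - p_free) * t_write_remote;
}
```

With illustrative latencies of 10 and 100 cycles, a lock that is free half of the time costs t_min_lock(0.5, 10.0, 100.0) cycles on average, while p_free = 0 degenerates to the worst-case cost.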
Two simple variants of this algorithm are typically used, for two main reasons: (1)
to reduce the frequency with which test_and_set instructions are issued, and (2) to
reduce the invalidations and misses during the busy-wait.
In the first case, the basic idea is to insert a delay after an unsuccessful attempt
to acquire the lock. This solution is called lock with backoff. In the second case, a
test-and-test&set lock solution is adopted. Busy-waiting is done by repeatedly reading
the value of the lock variable with a standard load, not a test_and_set, until it
turns from 1 (locked) to 0 (unlocked). On a cache coherent system, the reads can be
performed in the private cache by all processors, since each obtains a cached copy
of the lock variable the first time it reads it. When the lock is released, the cached
copies of all waiting processes (e.g., Pj and Pk) are invalidated, and the next read of
the variable generates a miss. Only when Pj finds that the lock has been made
available does it issue a test_and_set instruction to actually try to
acquire the lock.
With the test-and-test&set solution, the costs of the lock and unlock operations,
considering the reuse of the lock variable location, become as follows.
$$T^{max}_{LOCK} = T_{read}(I,-,M) + p_{FREE} \cdot T_{write}(S,-,-)$$
$$T^{min}_{LOCK} = p_{FREE} \cdot (T_{read}(M,-,-) + T_{write}(M,-,-)) + (1 - p_{FREE}) \cdot T_{read}(I,-,M)$$
$$T^{max}_{UNLOCK} = T_{write}(I,-,M)$$
$$T^{min}_{UNLOCK} = T_{write}(M,-,-)$$
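A test-and-test&set lock can be sketched in C11 as follows. The names ttas_lock, ttas_try_lock, ttas_lock_acquire and ttas_unlock are hypothetical, and the atomic exchange stands in for the test_and_set instruction; the busy-wait uses a plain atomic load, so that waiting processes spin on their cached copy and issue the atomic write only when the lock appears free.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int v; } ttas_lock;      /* 0 = free, 1 = taken */

static bool ttas_try_lock(ttas_lock *l)
{
    /* Test: a standard load, served by the private cache while the
     * line is held in Shared state. */
    if (atomic_load_explicit(&l->v, memory_order_relaxed) != 0)
        return false;
    /* Test&set: only now issue the atomic write. The exchange returns
     * the old value, so 0 means the lock was acquired. */
    return atomic_exchange_explicit(&l->v, 1, memory_order_acquire) == 0;
}

static void ttas_lock_acquire(ttas_lock *l)
{
    while (!ttas_try_lock(l))
        ;   /* spin, mostly on the local cached copy */
}

static void ttas_unlock(ttas_lock *l)
{
    /* The release write invalidates the copies of all the spinners. */
    atomic_store_explicit(&l->v, 0, memory_order_release);
}
```

Compared with a plain test-and-set loop, the failed attempts here are reads, which is why the cost formulas above replace a write miss with a read miss in the contended case.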
Improved instruction sets
Several instruction sets provide a pair of instructions called load-locked or
load-linked (LL) and store-conditional (SC) to implement atomic operations, instead of
an atomic read-modify-write instruction like test&set. These instructions make it
possible to avoid the invalidations generated by failed attempts to complete the
read-modify-write phase. The LL instruction loads the synchronization variable into a register. It
may be followed by arbitrary instructions that manipulate the value in the register.
The last instruction of the sequence is the SC instruction, which writes the register
back to the synchronization variable location if and only if no other processor has
written to that location since this processor completed its LL. Therefore, if the SC
succeeds, it means that the LL-SC pair has read, perhaps modified, and written
back the variable atomically. If the SC detects that a write has occurred to the
variable, it fails and does not write the value back or generate any invalidations.
This means that the atomic operation on the variable has failed and must be retried
starting from the LL. Using LL-SC to implement atomic operations, lock and unlock
primitives can be written as shown in Listing 5.2.
    lock:   ll reg1, location        // load-linked the location
                                     // to reg1
            bnz reg1, lock           // if location was locked
                                     // (not 0), try again
            sc location, reg2        // store reg2 conditionally
                                     // into location
            beqz lock                // if SC failed, start again
            ret

    unlock: st location, #0          // write 0 to location
            ret
Listing 5.2: Lock and unlock implementation with LL/SC instructions
PowerPC architectures use this solution to support atomic operations [88]. In
particular, the mechanism is implemented by the cache coherency protocol and uses
reservation information attached to each cache line. When a lwarx (the LL
instruction in PowerPC’s instruction set) is issued, the reservation for the current thread is
set together with the load operation. In order to have the right to perform a
store operation to a cache line, the issuing cache must hold the data exclusively,
that is, no other cache contains the data. If the cache line is not exclusively held,
a request (e.g., rfo_req) is sent. Finally, atomic operations are realized by looping
until the stwcx. (the SC instruction in PowerPC’s instruction set) succeeds, so that
failed attempts do not generate invalidations.
Relying on this implementation, we can evaluate the cost of lock and unlock
operations based on LL/SC instructions as follows.
$$T^{max}_{LOCK} = T_{write}(I,-,M) + p_{FREE} \cdot T_{write}(M,-,-)$$
$$T^{min}_{LOCK} = p_{FREE} \cdot 2\,T_{write}(M,-,-) + (1 - p_{FREE}) \cdot T_{write}(I,-,M)$$
$$T^{max}_{UNLOCK} = T_{write}(I,-,E)$$
$$T^{min}_{UNLOCK} = T_{write}(M,-,-)$$
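C has no direct LL/SC primitives, but the retry structure of Listing 5.2 can be approximated with the C11 weak compare-and-exchange, which is allowed to fail spuriously and which compilers for LL/SC architectures typically implement with a load-linked/store-conditional pair. The following is therefore an approximation under that assumption, with hypothetical names llsc_lock, llsc_lock_acquire and llsc_unlock.

```c
#include <stdatomic.h>

typedef struct { atomic_int v; } llsc_lock;      /* 0 = free, 1 = taken */

static void llsc_lock_acquire(llsc_lock *l)
{
    int expected = 0;
    /* Retry while the location is locked (observed value != 0) or the
     * conditional store fails, mirroring the bnz/beqz pair of Listing 5.2. */
    while (!atomic_compare_exchange_weak_explicit(&l->v, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;   /* a failed CAS overwrote expected: reset it */
}

static void llsc_unlock(llsc_lock *l)
{
    atomic_store_explicit(&l->v, 0, memory_order_release);
}
```

On architectures without LL/SC the same source still works, but the weak compare-and-exchange is then usually a single atomic instruction, so the coherence behaviour differs from the one modelled above.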
Atomic operations on x86 processors are implemented using a lock prefix: the
lock prefix can be applied to a number of instructions and has the effect of locking
the system bus (sometimes only the local cache in recent architectures) to ensure
exclusive access to the shared resource. In recent versions of these architectures, the
MONITOR and MWAIT instructions have been introduced [60]. The MONITOR instruction
watches a given memory location for the occurrence of special events, and
MWAIT waits for such an event or for a generic interrupt. One possible use is the
monitoring of store events: a lock operation can be implemented by using these two
instructions. The drawback of the MONITOR/MWAIT instructions is that they must
be used in kernel space, with the consequent use of costly system calls to perform
these operations.
Reducing invalidations
Some other advanced lock algorithms have been proposed both to provide fairness
and to reduce the traffic caused by invalidations [37, 76]. In fact, LL-SC is not fair and,
when Pi succeeds in its SC instruction, all other cached copies are invalidated and
all the other processors incur read operations of modified data in PrCi.
The ticket-lock algorithm tries to minimize the traffic due to the SC instruction. In
this solution, a process wanting to acquire the lock takes a number and atomically
increments it at the same time, by means of an atomic fetch_and_add instruction. Then it
busy-waits on a global now-serving number until this number equals the number
it obtained. The corresponding unlock operation simply increments the now-serving
number. Ticket-lock implementations still have the problem that all processes spin
on the same now-serving variable. When that variable is updated by Pi on release, as
above for the SC instruction, all other cached copies are invalidated and all the other
processors incur read operations of modified data in PrCi.
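The ticket lock just described can be sketched in C11 as follows; ticket_lock, ticket_acquire and ticket_release are hypothetical names, and atomic_fetch_add plays the role of the fetch_and_add instruction. Only the lock holder ever writes now_serving, so the release can use a plain load followed by a store.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* taken and incremented with fetch_and_add */
    atomic_uint now_serving;   /* every waiting process spins on this      */
} ticket_lock;

/* lock: take a ticket, then busy-wait until it is being served.
 * Returns the ticket number, which also shows the FIFO order. */
static unsigned ticket_acquire(ticket_lock *l)
{
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1u,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;   /* spin on the shared now_serving variable */
    return my;
}

/* unlock: increment now_serving. This single write invalidates the
 * cached copies of all the spinning processes. */
static void ticket_release(ticket_lock *l)
{
    unsigned cur = atomic_load_explicit(&l->now_serving,
                                        memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1u, memory_order_release);
}
```

Both counters must start from the same value (for instance zero, set with atomic_init) so that the first ticket is served immediately.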
An alternative is provided by array-based lock algorithms, which eliminate
this extra read traffic by having every process spin on a distinct location. In this
case, the traffic is reduced because only the processor Pj that was spinning on a
specific location has its cache line invalidated by the unlock executed by Pi. Pj still
incurs a read operation of modified data in PrCi.
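A minimal array-based lock can be sketched in C11 as follows. All names (array_lock, padded_flag, array_acquire, array_release, MAX_PROCS, LINE_BYTES) are hypothetical, a 64-byte cache line is assumed, and at most MAX_PROCS processes may contend at the same time. Each flag is aligned to its own cache line so that a release invalidates only the line on which the next process is spinning.

```c
#include <stdatomic.h>

#define MAX_PROCS  4     /* hypothetical maximum number of contenders */
#define LINE_BYTES 64    /* assumed cache line size, in bytes         */

/* One flag per cache line: spinning processes never share a line. */
typedef struct {
    _Alignas(LINE_BYTES) atomic_int can_enter;
} padded_flag;

typedef struct {
    padded_flag slot[MAX_PROCS];
    atomic_uint next_slot;       /* slot assigned to the next arrival */
} array_lock;

static void array_lock_init(array_lock *l)
{
    for (int i = 0; i < MAX_PROCS; i++)
        atomic_init(&l->slot[i].can_enter, i == 0);  /* slot 0 starts open */
    atomic_init(&l->next_slot, 0u);
}

/* lock: take a private slot and spin on it. Returns the slot index,
 * which must be passed to array_release. */
static unsigned array_acquire(array_lock *l)
{
    unsigned my = atomic_fetch_add(&l->next_slot, 1u) % MAX_PROCS;
    while (!atomic_load_explicit(&l->slot[my].can_enter,
                                 memory_order_acquire))
        ;   /* spin on a location no other process is watching */
    atomic_store_explicit(&l->slot[my].can_enter, 0, memory_order_relaxed);
    return my;
}

/* unlock: open the next slot, invalidating only that one line. */
static void array_release(array_lock *l, unsigned my)
{
    atomic_store_explicit(&l->slot[(my + 1) % MAX_PROCS].can_enter, 1,
                          memory_order_release);
}
```

The price of the reduced traffic is memory: one full cache line per potential contender instead of a single shared word.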
5.2.2 Global (barrier) event synchronizations
Global synchronizations are typically required in the run-time support for parallel
applications, especially in some initialization phases. In some cases, a global
synchronization between the entities of a parallel program has to be performed in order to
ensure the correctness of the application. For example, in a shared-variable
implementation of a data-parallel application with stencil, a global coordination among
the workers is required across the various steps of the computation. For this reason,
an evaluation of the cost of this kind of operation becomes necessary.
Algorithms for barrier synchronizations are typically implemented using shared vari-
ables (e.g., counters and flags) and manipulating their values in mutual exclusion.
A simple barrier among p processes or threads is the so-called centralized barrier, in
which a single counter and a single flag are used. The shared counter maintains the
number of processes that have arrived at the barrier, and is therefore incremented
by every arriving process. These increments must be mutually exclusive. After in-
crementing the counter, the process checks to see if the counter equals p, i.e. if it
is the last process to have arrived. If not, it busy waits on the flag associated with
the barrier, otherwise it writes the flag to release the waiting processes.
The pseudo-code in Listing 5.3 shows how this algorithm can be implemented. In
particular, it presents the sense reversal version of the algorithm [37], in which a
private variable (i.e., local sense) is used to prevent a process from entering a new
instance of a barrier before all processes have exited the previous instance of the
same barrier.
struct barrier_type {
    int counter;
    struct lock_type lock;
    int global_sense = 0;
} bar;

BARRIER(bar, p) {
    local_sense = !local_sense;
    lock(bar.lock);
    mycount = ++bar.counter;            /* mycount is a private variable */
    unlock(bar.lock);
    if (mycount == p) {                 /* last to arrive */
        bar.counter = 0;                /* reset counter for next barrier */
        bar.global_sense = local_sense; /* release waiting processes */
    } else {
        while (bar.global_sense != local_sense) {}; /* busy wait for release */
    }
}
Listing 5.3: Global barrier algorithm with mutual exclusion
The lock/unlock protecting the increment of the counter can be replaced more
efficiently by a simple LL-SC or atomic increment operation. Relying on this
implementation, we can evaluate the latency of a barrier operation executed by p
processes or threads as follows.
$$T_{BARRIER}(p) = (p-1)\,T_{write}(I,-,M) + T_{write}(S,-,-)[p-1]$$
The first p − 1 executions of the barrier cause several transfers of the cache line
containing the barrier_type data structure, due to the atomic increment of the
counter. Finally, the last execution causes the invalidation of the other p − 1 copies
of the cache line holding the global_sense flag.
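The variant of the centralized barrier in which the lock/unlock pair is replaced by an atomic increment can be sketched in C++ as follows. This is only an illustration under our own naming (SenseBarrier, wait), assuming that each thread owns a private local_sense flag initialised to false.

```cpp
#include <atomic>

// Illustrative sense-reversal barrier using an atomic increment instead of a lock.
struct SenseBarrier {
    std::atomic<int>  counter{0};          // number of processes arrived so far
    std::atomic<bool> global_sense{false}; // flag on which the waiters spin
    const int p;                           // number of participants

    explicit SenseBarrier(int participants) : p(participants) {}

    // local_sense is private to the calling thread and must start as false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        // fetch_add returns the previous value, hence the "+ 1".
        if (counter.fetch_add(1, std::memory_order_acq_rel) + 1 == p) {
            counter.store(0, std::memory_order_relaxed);   // reset for the next instance
            // Release the waiters: this write invalidates their p - 1 cached copies.
            global_sense.store(local_sense, std::memory_order_release);
        } else {
            // Busy-wait until the last arriving process flips the global sense.
            while (global_sense.load(std::memory_order_acquire) != local_sense) { }
        }
    }
};
```

The p − 1 increments that precede the last one correspond to the first term of the latency above, and the final store on global_sense corresponds to the second term.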
Several alternative implementations have been proposed for the barrier algorithm,
e.g. the tree-based barrier [76]. In particular, in order to minimize the contention
on the centralized flag, these solutions use different flags and organize the
processes in subgroups. In terms of the latency of the barrier operation, the
consequence is a smaller number of invalidations on the same flag during
the release phase.
5.2.3 Memory Ordering and Memory Consistency Models
By memory ordering we mean the order in which memory operations (reads and
writes) are performed. Memory ordering might differ from the order
specified in the program (program order) for several reasons. The compiler might
change memory ordering as a result of static optimizations, or an out-of-order processor
might change it as a result of dynamic optimizations. In these cases,
mechanisms such as the compiler barrier 1 can be used to prevent the reordering of read
and write operations in order to ensure the correctness of algorithms.
Moreover, because of the nondeterministic behavior of shared memory systems, problems
arise with the order in which memory operations, which are executed by a PE
on shared locations, become visible to the other PEs.
Notably, in this case we need to consider the Memory Consistency Model adopted
by the target architecture [85, 99]. The interested reader can refer to [75] for a
detailed characterization of the various models.
For our purposes we distinguish between two main classes of main memory consis-
tency strategies:
• Total Store Ordering (TSO), in which
– Loads are ordered with respect to earlier Loads
– Stores are ordered with respect to earlier Loads and Stores
Thus, Loads can bypass earlier Stores but cannot bypass earlier Loads. Stores
cannot bypass earlier Loads or Stores, ensuring that store operations are visible
to the system in program order. Examples of this model include x86 TSO [85]
and SPARC TSO.
• Weak Store Ordering (WSO), which does not guarantee such orderings (i.e., two
distinct writes to different memory locations may be executed out of program
order). Examples of this model include Tilera [19], Power [99] and ARM [31]
processors.
Locking mechanisms and memory ordering
To complete the evaluation of mutual exclusion mechanisms, consider the execution
of lock/unlock operations in systems with a WSO memory consistency model.
Notably, write operations executed by Pi inside the critical section could become visible
to a generic Pj after the unlock’s write operation. Thus, Pj might use an inconsistent
value of the critical section data.
For this reason, to ensure the correct memory ordering, a memory fence or write
memory barrier instruction has to be inserted in the unlock operation before the
store instruction.
unlock: mf                  // ensure a TSO
        st location, #0     // write 0 to location
        ret
Listing 5.4: Use of a memory fence instruction inside the unlock operation to ensure
the correct memory ordering (Total Store Ordering)
Following a structured parallelism approach, where a clear level-structuring is
defined, all memory ordering problems are confined inside the run-time support of
process primitives (e.g., in the send-receive run-time support). In this way, no other
instances of this problem can affect the parallel program design at the process level.
1 modifications of the ordering of memory operations done at compile-time on sequential codes
From the cost model point of view, atomic read-modify-write instructions also
implicitly incorporate (behave as) a memory fence, both on systems with TSO and
on systems with WSO. An interesting evaluation of the cost of the cas operation on
TSO processors is discussed in [7]. Moreover, an explicit memory fence instruction is
also required in WSO systems after each atomic read-modify-write implemented through
an LL/SC pair.
This means that we need to consider synchronous writing semantics in the
locking latency evaluation. That is, none of the latencies related to the operations
required by the coherency protocol (e.g., invalidation latencies) can be overlapped
with others or considered negligible. The result is a latency that grows
with the number of invalidations/updates necessary, which can be considerable,
especially in the case of global event synchronizations.
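The fence placement of Listing 5.4 can be expressed portably with the C++ memory model. The fragment below is a sketch under our own naming (SpinLock, location): std::atomic_thread_fence with release ordering plays the role of the mf instruction, and the atomic exchange in lock() is the read-modify-write operation that behaves as a fence on the acquiring side.

```cpp
#include <atomic>

// Illustrative spin lock; 0 means free and 1 means taken.
struct SpinLock {
    std::atomic<int> location{0};

    void lock() {
        // Atomic read-modify-write: acquire ordering keeps the accesses of the
        // critical section from being performed before the lock is taken.
        while (location.exchange(1, std::memory_order_acquire) != 0) { }
    }

    void unlock() {
        // Counterpart of "mf" in Listing 5.4: the writes of the critical section
        // are ordered before the store that frees the lock.
        std::atomic_thread_fence(std::memory_order_release);
        location.store(0, std::memory_order_relaxed);
    }
};
```

On TSO machines such as x86 a release fence is typically only a constraint on the compiler, whereas on WSO machines it is expected to translate into an actual fence instruction, which is where the additional latency discussed above comes from.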
5.3 Lock-free Data Structure for Communication Mechanisms
In the run-time support of stream-based parallel computations, it is essential to
adopt a communication mechanism between the various modules with the
smallest possible overhead. Typically, shared queues or similar data structures are
the basic building block in the shared-variable, message-passing and pointer-passing
implementation models. For these reasons, concurrent lock-free queues have
been widely studied in the literature [68, 81, 48, 15, 79, 104].
A shared data structure is lock-free if its operations do not require mutual exclusion
over multiple instructions. If the operations on the data structure guarantee that
some process will complete its operation in a finite amount of time, even if other pro-
cesses halt, the data structure is non-blocking. If the data structure operations can
guarantee that every process will complete its operation in a finite amount of time,
then the data structure is wait-free. Therefore, wait-free protocols are a subclass of
lock-free protocols characterized by stronger properties: roughly speaking lock-free
algorithms are based on retries, while wait-free algorithms guarantee termination in
a finite number of steps.
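The difference between retry-based and bounded-step operations can be illustrated on a single shared counter. The two functions below are hypothetical examples written for this purpose: add_saturating retries a compare-and-swap until it succeeds, while add_plain relies on a single fetch-and-add.

```cpp
#include <atomic>

// Retry-based update: a failed compare-and-swap means that some other thread has
// modified the counter in the meantime, so the system as a whole makes progress,
// but this particular caller has no bound on the number of retries.
int add_saturating(std::atomic<int>& v, int delta, int cap) {
    int old = v.load(std::memory_order_relaxed);
    int desired;
    do {
        desired = (old + delta > cap) ? cap : old + delta;
    } while (!v.compare_exchange_weak(old, desired,
                                      std::memory_order_acq_rel,
                                      std::memory_order_relaxed));
    return desired;
}

// Single atomic fetch-and-add: the caller completes in a bounded number of steps
// on targets where the operation maps to one hardware instruction.
int add_plain(std::atomic<int>& v, int delta) {
    return v.fetch_add(delta, std::memory_order_acq_rel) + delta;
}
```

On LL/SC architectures the fetch-and-add is itself usually compiled into a retry loop, so the bounded-step property of add_plain depends on the target and should be read as an assumption of this sketch.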
enqueue(data) {
    lock(queue);
    if (NEXT(tail) == head) {   /* the queue is full */
        unlock(queue);
        return false;
    }
    buffer[tail] = data;        /* copy the data into the queue */
    tail = NEXT(tail);          /* change the insertion index */
    unlock(queue);
    return true;
}

dequeue(data) {
    lock(queue);
    if (tail == head) {         /* the queue is empty */
        unlock(queue);
        return false;
    }
    data = buffer[head];        /* extract the next data */
    head = NEXT(head);          /* change the extraction index */
    unlock(queue);
    return true;
}
Listing 5.5: Locking queue implementation
Consider a simple lock-based queue and the corresponding enqueue and dequeue
operations, as shown in Listing 5.5.
Locking queues can have an algorithmic source of overhead. In fact, lock and unlock
operations strongly couple the producer and consumer. Even when the consumer is
reading an earlier enqueued element, the producer cannot enqueue an element into
a different buffer position.
Lamport proved that, under sequential consistency [69], the locks can be removed
in the single-producer/single-consumer case, resulting in a concurrent wait-free
queue. Thus, Lamport’s queue requires no explicit synchronization between the
producer and consumer, decoupling the two at the algorithmic level.
However, there still exists an implicit synchronization between the producer and
consumer as the control data (i.e., head and tail) are still shared. In fact, head and
tail are used to implicitly indicate full and empty queue conditions.
From the point of view of cache coherence, this means that the enqueue and dequeue
operations read the constantly modified head and tail indexes (in addition to the
buffer data).
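A sketch of Lamport's single-producer/single-consumer queue in C++ is given below. The class name and interface are our own assumptions. The default (sequentially consistent) ordering of std::atomic is used on purpose, to mirror the sequential consistency hypothesis under which the algorithm is correct. The comments mark the reads of the index owned by the other side, which are the reads of modified control data discussed above.

```cpp
#include <atomic>
#include <cstddef>

// Illustrative Lamport SPSC queue; it holds at most N - 1 elements.
template <typename T, std::size_t N>
class LamportQueue {
    T buffer_[N];
    std::atomic<std::size_t> head_{0};  // written by the consumer, read by the producer
    std::atomic<std::size_t> tail_{0};  // written by the producer, read by the consumer

    static std::size_t next(std::size_t i) { return (i + 1) % N; }

public:
    bool enqueue(const T& data) {
        const std::size_t t = tail_.load();
        if (next(t) == head_.load()) return false;  // full: reads the consumer's index
        buffer_[t] = data;                          // copy the data into the queue
        tail_.store(next(t));                       // publish the new insertion index
        return true;
    }

    bool dequeue(T& data) {
        const std::size_t h = head_.load();
        if (h == tail_.load()) return false;        // empty: reads the producer's index
        data = buffer_[h];                          // extract the next data
        head_.store(next(h));                       // publish the new extraction index
        return true;
    }
};
```

No lock is taken, yet every operation still touches both indexes, so the cache lines holding head and tail keep moving between the producer and the consumer caches.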
Relying on the Lamport queue implementation, the latency of the enqueue and dequeue
operations can be evaluated as follows.
$$T^{max}_{enqueue} = T^{max}_{dequeue} = T_{sync} + T_{data} = T_{read}(I,GC,-) + T_{read}(I,-,M) + T_{write}(M,-,-) + T_{data}$$
In particular, $T_{sync}$ represents the latency related to the control data operations,
which in a lock-based implementation also includes the latency of the locking operations,
while $T_{data}$ is strictly related to the type of data exchanged, as we saw in Section
5.1.
A first optimization consists in having both head and tail indexes in the same cache
line, resulting in the following evaluation.
$$T^{max}_{enqueue} = T^{max}_{dequeue} = T_{sync} + T_{data} = T_{read}(I,-,M) + T_{read}(S,-,-) + T_{write}(M,-,-) + T_{data}$$
Note that we evaluate the worst-case latencies. Both enqueue and dequeue
operations can take advantage of the reuse of the control data when consecutive
enqueue or, respectively, consecutive dequeue operations are performed. In this
best-case scenario, the latency related to the control data operations can be evaluated
as follows (the cache line state in the second term depends on the number of cache
lines used for the head and tail indexes, as discussed above).
$$T_{sync} = T_{read}(M,-,-) + T_{read}(S/M,-,-) + T_{write}(M,-,-)$$
If the sequential consistency requirement is relaxed, Lamport’s algorithm fails.
For example, when write-to-write relaxation is allowed (i.e., two distinct writes to
different memory locations may be executed out of program order), the consumer
may incur a read of stale data. In fact, the update of the tail index, modified by
the producer, can be seen by the consumer before the producer's write to the tail
position of the buffer.
A few simple modifications to Lamport’s algorithm have been proposed for
pointer-passing implementation models to allow correct execution even under weakly
ordered memory consistency models [48, 15]. This solution is shown in Listing 5.6.
In particular, an empty position in the buffer is represented using a known value,
called bottom (⊥), that cannot be used as a buffer element. In this way, the
consistency problem of Lamport’s algorithm cannot occur provided that the generic store
buffer[tail]=data is seen in its entirety by a processor, or not at all, i.e., a single
memory store operation is executed atomically.
 1 enqueue(data) {
 2     if (buffer[tail] != ⊥) {   /* the queue is full */
 3         return false;
 4     }
 5     buffer[tail] = data;       /* copy the data into the queue */
 6     tail = NEXT(tail);         /* change the insertion index */
 7     return true;
 8 }
 9
10 dequeue(data) {
11     data = buffer[head];       /* extract the next data */
12     if (data == ⊥) {           /* the queue is empty */
13         return false;
14     }
15     buffer[head] = ⊥;
16     head = NEXT(head);         /* change the extraction index */
17     return true;
18 }
Listing 5.6: ⊥-based lock-free queue implementation
From the point of view of cache coherence, using the ⊥ value also solves the sharing
problem between producer and consumer about the head and tail indexes. In fact,
the head and tail indexes are always in the local cache of the consumer and the
producer respectively, without incurring read operations of modified control data.
Therefore, the latency of the enqueue and dequeue operations can be evaluated as
follows.
$$T^{max}_{enqueue} = T^{max}_{dequeue} = T_{sync} + T_{data} = T_{read}(I,GC,-) + T_{write}(E,-,-) + T_{data}$$
In the best-case scenario, the latency related to the control data operations can
be evaluated as follows.
$$T_{sync} = T_{read}(M,-,-) + T_{write}(M,-,-)$$
These evaluations rely on implementation techniques used to avoid false sharing.
In fact, this problem can arise because the cache coherence protocol works at cache
line granularity. For example, there is actually an implicit sharing of the control
data of the queue if tail and head reside in the same cache line. In order to avoid this
situation, a proper amount of padding is required to force the two indexes to reside
in different cache lines.
As we said, in these solutions the communication buffer is used to transfer
references to data. In WSO systems, the enqueue algorithm may need to be slightly
modified to introduce a memory fence instruction before the reference is written
into the queue’s buffer (Listing 5.6, line 5). Without a memory fence, the write to the
queue’s buffer could be visible to the consumer before the referenced data has been
committed in memory, potentially resulting in a read of stale data.
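The ⊥-based queue, together with the padding and the fence discussed above, can be sketched in C++ as follows. The sketch is written under several assumptions of ours: nullptr plays the role of ⊥, the cache line is 64 bytes (kCacheLine), the queue carries untyped references (void*), and the names are hypothetical. It is not the FastFlow implementation.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Illustrative bottom-based SPSC queue of references; nullptr stands for the bottom value.
template <std::size_t N>
class BottomQueue {
    std::atomic<void*> buffer_[N]{};
    alignas(kCacheLine) std::size_t tail_ = 0;  // private to the producer
    alignas(kCacheLine) std::size_t head_ = 0;  // private to the consumer

public:
    BottomQueue() {
        for (auto& slot : buffer_) slot.store(nullptr, std::memory_order_relaxed);
    }

    bool enqueue(void* data) {  // data must not be nullptr
        if (buffer_[tail_].load(std::memory_order_acquire) != nullptr)
            return false;       // the queue is full
        // Write memory barrier: the referenced data is made visible before
        // the reference itself is published in the buffer.
        std::atomic_thread_fence(std::memory_order_release);
        buffer_[tail_].store(data, std::memory_order_relaxed);
        tail_ = (tail_ + 1) % N;
        return true;
    }

    bool dequeue(void*& data) {
        data = buffer_[head_].load(std::memory_order_acquire);
        if (data == nullptr) return false;  // the queue is empty
        buffer_[head_].store(nullptr, std::memory_order_release);  // mark the slot as free
        head_ = (head_ + 1) % N;
        return true;
    }
};
```

The alignas specifiers keep tail_ and head_ on different cache lines, so each index stays in the cache of its only user, and the only lines exchanged between the two sides are those of the buffer.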
Multiple Producer and/or Multiple Consumer Queues
The next step consists in providing one-to-many (SPMC), many-to-one (MPSC),
and many-to-many (MPMC) communication mechanisms. SPMC, MPSC, and
MPMC queues can be realized in several different ways, for example using locks,
or in a lock-free way in order to avoid lock overhead.
Lamport’s queues have been extensively studied and extended [79, 104], especially
focusing on improving the performance of the more general but more difficult MPMC
variants.
However, none of these queues can be directly implemented in a lock-free way without
using at least one atomic read-modify-write operation, which is typically used to
guarantee the correct serialization of updates from either many producers or many
consumers. As we saw in Section 5.2.1, these atomic operations implicitly behave
as a memory fence instruction, which can result in considerable synchronization
latency due to the required coherency protocol operations (i.e., cache invalidations or
updates).
In addition, an inherent problem in these MP and/or MC communications is the
so-called ABA problem [79]. The ABA problem occurs when a location is read twice
by a process (or thread) P1, has the same value for both reads, and the fact that
the “value is the same” is taken to mean that “nothing has changed” between the two reads.
If, between the two reads, another process (or thread) P2 changes the value, does
other work, and then changes the value back, P1 might conclude that “nothing
has changed” even though P2 did work that violates that assumption. Handling the
ABA problem requires particular mechanisms and/or strategies to correctly ensure
that all parties (producers and/or consumers) agree on the order of transactions:
dual-lock queues in lock-based solutions and deferred reclamation/hazard pointers
or two bottom values in lock-free alternatives [79, 104].
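To make the ABA scenario concrete, the following C++ sketch pairs a value with a version counter in a single 64-bit word, so that a compare-and-swap can tell "unchanged" from "changed and restored". This is the common version-counter (tagged word) technique, which is not one of the mechanisms listed above; it is used here only because it fits in a few lines. All the names are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Pack a 32-bit value and a 32-bit version counter into one 64-bit word.
inline std::uint64_t pack(std::uint32_t value, std::uint32_t version) {
    return (static_cast<std::uint64_t>(version) << 32) | value;
}
inline std::uint32_t value_of(std::uint64_t w)   { return static_cast<std::uint32_t>(w); }
inline std::uint32_t version_of(std::uint64_t w) { return static_cast<std::uint32_t>(w >> 32); }

// Every successful update increments the version, so the sequence A -> B -> A
// leaves a word that differs from the one read before the sequence started.
inline bool versioned_cas(std::atomic<std::uint64_t>& cell,
                          std::uint64_t expected, std::uint32_t new_value) {
    return cell.compare_exchange_strong(
        expected, pack(new_value, version_of(expected) + 1));
}
```

With a plain word, a compare-and-swap issued by P1 after P2 has written B and then A again succeeds; with the versioned word, the snapshot taken by P1 carries an old version and the same compare-and-swap fails, forcing P1 to re-read. The version counter can wrap around, so the protection is probabilistic rather than absolute.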
These solutions build MP queues and MC queues as passive entities on which
processes or threads concurrently synchronize to access data. A different approach
[15] is based on the use of an active entity (i.e., a process or thread) that acts as
an arbiter for the synchronizations among producers or consumers. Consider a
structured parallel application and the various implementation strategies that we
have discussed in Section 5.1: the emitter and the collector modules (or the master
module in the MS strategy) can actually be the active entity. According to their
role, the emitter and the collector perform enqueue or dequeue operations on one or
more lock-free SPSC queues, while MPMC queues can be implemented by combining
the emitter and collector functionalities, with the consequent cost of an additional
memory copy. Therefore, this solution avoids the use of atomic read-modify-write
operations and does not suffer from the ABA problem, since MPMC queues are
built by explicitly linearising correct SPSC queues using emitter and collector entities.
The exchange of data references in the buffer queues, used to minimize the cost of
copying data, requires a memory fence instruction in weak memory consistency
models (as discussed above). In our work on the porting of FastFlow to the Tilera
TilePro64 processor [24], which is an example of a WSO system, we compared the
average queue latency on the Tilera TilePro64 and on an x86 processor for bounded
(SPSC) and unbounded (uSPSC) FastFlow queues. As shown in Figure 5.4, the use
of the memory fence instruction comes at a cost.
[Figure 5.4: plot of the average queue latency in clock cycles (y-axis, 0 to 140) against the
number of elements (x-axis: 8, 64, 1k, 8k) for the SPSC and uSPSC queues, on
TilePro64 @ 866 MHz and Nehalem E7 @ 2.0 GHz]
Figure 5.4: Average latency times for the FastFlow queues on Tilera TilePro64 and
Intel processors varying the buffer size
5.4 Summary
In this Chapter we analyzed the impact of automatic cache coherence solutions on
the performance of parallel programs defined by well-known parallelism paradigms.
By using, as a first approximation, the base memory latencies derived in the previous
chapter, we reasoned about how the performance of the parallelism forms defined by a
structured parallel programming environment is affected by them. Notably, knowing
the semantics, as well as the structure of the implementation of each parallelism
form, we are able to derive important properties such as a “producer-consumer” cache
coherence pattern which is recurrent in all of them.
These properties lead easily to the definition of a cost model for the various paradigms,
both from the point of view of the computation data and of the synchronization
mechanisms.
Notably, we analyzed the state of the art of the synchronization techniques and
communication mechanisms typically used in the implementation of parallel programs,
from the point of view of their cache coherence impact.
This Chapter allowed us to start from an evaluation of cache coherence mechanisms
in terms of read and write latencies and arrive at an understanding of how the
interactions between the modules of parallel paradigms relate (in terms of performance) to
the coherency state of, and the operations executed on, both computation and
synchronization data structures.
The next step to complete the evaluation of the impact of cache coherence on
parallel programs’ performance requires using the results obtained so far (i.e., the costs
of the operations executed by the various modules of the parallel program and how
modules obtain the data through the cache coherence protocol) with a cost model
which takes into account the under-load behavior of a parallel program.
CHAPTER 6
Cost Models for CMP-based Architectures
In this chapter, we finally discuss how to evaluate the under-load memory latencies
in order to complete the evaluation of the performance of parallel applications
executed on CMP-based systems.
In the first part (Section 6.1), we use the results of Queueing Theory for the
client-server model. The work presented in [22] was probably the first to introduce the
main idea of modeling the Processor-Memory subsystem of a shared memory
architecture, considering the memories as servers and the processors as clients. Notably,
we apply the extension of the work proposed in [107].
The results obtained are used to reason about the effect of under-load latencies on
parallel program performance.
In Section 6.2 we discuss the impact of specific choices (e.g., parallel process mapping
and cache coherence optimizations) in the run-time support of parallelism forms.
In the second part of this chapter (Section 6.3), we combine the base memory and
cache latencies defined in Chapter 4 with an interesting performance evaluation
tool: Performance Evaluation Process Algebra (PEPA). This process algebra
represents an alternative approach to study different aspects of the methodology used in
this thesis. Notably, we are able to describe a complete abstract architecture (i.e.,
with different levels of servers for each level in the memory hierarchy) and derive a
complete cost model for the various abstract models used to model automatic and
non-automatic cache coherence solutions.
[Figure 6.1: diagram of clients C1 ... Cn sending requests to the queue of a server S;
labels shown: T1 ... Tn, Ta, Ts, queue]
Figure 6.1: Client-server models with request-reply behaviour used to model
multiprocessor systems
6.1 Cost Model for Under Load Memory Latencies
To evaluate a multiprocessor system we can use a queuing model with m distinct
and independent servers, each of which is a memory macro-module (see footnote 1),
and p clients, where p is the average number of PEs sharing the same memory
macro-module. Logically, the interconnection network path from each of the p nodes
to the shared macro-module, including the switch nodes it traverses, belongs to the
server. For a complete treatment of the evaluation of network latencies, the interested
reader can consult [89, 107].
An example of this scheme is shown in Figure 6.1, where n is the number of PEs
in the system and p ≤ n. Ti represents the average response time of the module i,
while Ta is the total interarrival time to the queue. Notable cases of this interac-
tion pattern are some client-server parallel applications as well as processor-memory
systems.
The main goal in a client-server system with request-reply behaviour is to esti-
mate the average response time RQ of S.
In client-server models with request-reply behaviour, in order to reduce the
response time, both the service time (utilization factor) and the latency of the server
are critical. Thus, we need techniques able to improve both the bandwidth and
the latency of the shared memory macro-modules and of the network. A modular
memory organization is adopted, often relying on cache line interleaving to increase
the bandwidth. This solution applies to any shared memory support, for example
to shared caches too. In fact, some CMP shared caches have a modular interleaved
structure [95].
Footnote 1: We call memory macro-module a memory subsystem consisting of a memory
interface and some memory modules.
Let p be the average number of processing nodes sharing the same memory/cache
module. This number depends on how data structures are shared between
processors or threads. For example, if a synchronization variable is shared between
k processors or threads, then p could be equal to k. It is very important that p is
as low as possible in order to minimize the congestion overhead at the server.
In an SMP architecture, in which statistically the memory accesses are uniformly
distributed over the m macro-modules, p can be estimated as the mean of the binomial
distribution, i.e. p = n/m.
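The estimate p = n/m can be justified with a one-line calculation, sketched here under the stated uniformity assumption. The random variable X is introduced only for illustration and is not part of the original notation: if each of the n PEs addresses a given macro-module independently with probability 1/m, then the number X of PEs addressing that macro-module is binomially distributed.

```latex
\[
X \sim \mathrm{Bin}\!\left(n, \tfrac{1}{m}\right)
\quad\Longrightarrow\quad
p = \mathrm{E}[X] = n \cdot \frac{1}{m} = \frac{n}{m}
\]
```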
In NUMA architectures, the uniform distribution does not hold, and the p value
depends on specific characteristics of the parallel program and its mapping onto PEs.
The p value is greater than one when different processing nodes access the same data,
different data belonging to the same cache line, or different cache lines allocated
on the same memory macro-module. In the next section we will discuss optimality
issues in NUMA architecture mapping.
The under-load memory access latency is given by the server response time. Let
RQ0 be the base latency, i.e. the latency evaluated in the absence of contention (e.g.,
as evaluated in Chapter 4). In NUMA architectures it is meaningful to distinguish
between remote memory access latency (RQ−rem) and local memory access latency
(RQ−loc). Assuming that, in case of conflict between a remote request and a local
request, the latter is served with higher priority, we can write:
RQ−loc = RQ0(1 + ρ)
where ρ is the server utilization factor, which results from solving the queuing model
and expresses a global, average measure of the congestion degree of the requests
to the server. The analysis can also be used for cache-to-cache transfers: the server
is modeled by a remote PrC and by the interconnect path, either inside the same
CMP or between distinct CMPs.
6.1.1 Model Resolution
We use the following system of equations as the resolution technique for the
client-server system:
\[
\begin{cases}
T_{cl} = T_P + R_Q \\
R_Q = W_Q(\rho) + R_{Q_0} \\
\rho = \dfrac{T_S}{T_A} \\
T_A = \dfrac{T_{cl}}{p} \\
\rho < 1
\end{cases}
\]
Assume that all clients are identical (i.e., T1 = ... = Tn). The first equation
describes the behavior of each client, which generates the next request only when the
result of the previous one has been received. The behaviour of a client is cyclic: think
periods (the client ideal service time TP) alternate with wait periods (depending on
the average response time RQ of S), leading to a certain client average interdeparture
time Tcl.
Once we know Tcl, we can determine the server average interarrival time TA: by
resorting to Theorem 3.2.2 introduced in Chapter 3, we have that TA = Tcl/p.
The utilization factor of the system is given by ρ = TS/TA.
Finally, the under-load memory access latency RQ is simply given by the average
waiting time WQ plus a constant known in advance, which is the base latency RQ0.
The expression of WQ depends on the type of Q. Notably, we use an M/D/1 queue
[66], where the symbol M represents an exponential interarrival time distribution,
while the symbol D represents a deterministic service time distribution, in a system
with a single server (1). The service discipline is FIFO and it is assumed that
the queue size is infinite. The deterministic distribution of service time is a good
approximation for a memory subsystem.
For this queue, we get the following fundamental result:
\[ W_Q = T_S \, \frac{\rho}{2(1-\rho)} \]
Therefore, solving the system with respect to RQ leads to a second degree equation
in ρ. The two solutions ρ1 and ρ2 are always such that ρ1 < 1 and ρ2 > 1, thus the
solution of the model must be subjected to the constraint ρ < 1.
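As a minimal sketch of this resolution procedure, the system can be reduced to a quadratic in ρ and solved numerically. Substituting TA = Tcl/p and the M/D/1 expression of WQ into ρ = TS/TA gives (2a − TS)ρ² − 2(a + p·TS)ρ + 2p·TS = 0, with a = TP + RQ0. The function and parameter names (`solve_under_load`, `t_p`, `t_s`, `r_q0`) are illustrative and do not come from the thesis, and the numeric values in the usage lines are arbitrary.

```python
import math


def solve_under_load(t_p, t_s, r_q0, p):
    """Solve the client-server system with an M/D/1 server.

    t_p  : client think time T_P
    t_s  : server service time T_S
    r_q0 : base (contention-free) latency R_Q0
    p    : number of clients sharing the server
    Returns (rho, r_q): utilization factor and under-load latency.
    """
    a = t_p + r_q0
    # Coefficients of (2a - T_S) rho^2 - 2 (a + p T_S) rho + 2 p T_S = 0
    qa = 2.0 * a - t_s
    qb = -2.0 * (a + p * t_s)
    qc = 2.0 * p * t_s
    if abs(qa) < 1e-12:
        # Degenerate case: the equation becomes linear in rho.
        roots = [-qc / qb]
    else:
        sq = math.sqrt(qb * qb - 4.0 * qa * qc)
        roots = [(-qb - sq) / (2.0 * qa), (-qb + sq) / (2.0 * qa)]
    # Keep the only root that satisfies the constraint 0 < rho < 1.
    rho = next(r for r in roots if 0.0 < r < 1.0)
    w_q = t_s * rho / (2.0 * (1.0 - rho))  # M/D/1 mean waiting time
    return rho, w_q + r_q0


if __name__ == "__main__":
    rho, r_q = solve_under_load(t_p=100.0, t_s=30.0, r_q0=214.0, p=4)
    print(f"rho = {rho:.3f}, R_Q = {r_q:.1f}, R_Q/R_Q0 = {r_q / 214.0:.3f}")
```

The returned pair can be checked against the system itself: ρ·(TP + RQ) must equal p·TS.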
6.1.2 Complexity vs Approximation of the model
The cost model for a shared memory multiprocessor implies a complex evaluation
because of the rather large number of variables, architectural variants and, perhaps
most important, many different situations related to the parallel application and its
mapping (as we will discuss in Section 6.2).
Our goal is to define a method which is characterized by an acceptable complexity
and, at the same time, is able to capture the essential elements and the qualitative
behavior. This implies an approximate approach based on some assumptions. The
most meaningful assumptions are:
• All conflicts are concentrated on the memory macro-modules or caches only,
i.e. conflicts on the network switch nodes and links have a negligible impact.
Clearly, this assumption is a simplification; however, the CMP internal
interconnect has a very low latency, which helps to minimize the effect of
network contention;
• let TP be the mean time between two consecutive accesses by the same PE
to a certain memory macro-module (during this time, the processor executes
instructions operating on registers or private caches). We assume that TP is
the mean value of an exponentially distributed random variable. Actually this
distribution depends on the parallel application characteristics, and might be
different from the exponential one. However, for our purposes, the interarrival
time distribution is approximated as an exponential one because of the inde-
pendent behavior of the various PEs. In other words, the combination of p
requests can be approximately characterized by a random behavior. Thus, for
the WQ evaluation a good approximation is represented by the M/D/1 queue.
• In reality, clients are not identical: though the majority of the modules of a
structured parallel paradigm are identical (workers), “service” modules (e.g.,
emitter and collector) are present; moreover, a parallel computation might
consist of a graph composition. The identical-client assumption is thus a
worst-case approximation (e.g., service modules typically cause less memory
contention).
• The server service time TS and base latency have to be estimated for the
specific architecture. In this context, another modeling problem arises. Queuing
Theory analytical results are valid for single sequential servers (or, when
multiple servers are considered, they are merely independent sequential servers).
In other words, no theoretical results exist for parallel servers. The consequence
is that, for an analytical approach, we are forced to exploit formulas for WQ
(the mean waiting time in queue) which in Queuing Theory have been derived
for servers having equal service time and latency. Of course, we cannot
accept this assumption for a parallel server subsystem with a pipelined
request-reply behavior: this is the reason why the classical relation RQ = WQ + TS
is extended to RQ = WQ + LS, where LS represents the server latency (i.e.,
RQ0 in the system of equations presented in the previous Section).
With these assumptions, the approximation error is about 20% when the server
utilization factor ρ is high (close to one), while it is much lower for servers with
(low-)medium ρ. That is, the method is sufficiently accurate in all the cases of
main interest. Moreover, the approximation of RQ is always by excess (RQ is
overestimated), so it can be reliably used in the cost model for parallel program
design and implementation [107].
Figure 6.2: RQ/RQ0 for main memory with TM = 30τ. Panels: (a) RQ0 = 214τ;
(b) RQ0 = 82τ
6.1.3 Memory access latency
In Figures 6.2 and 6.3, the ratio between memory access latency RQ and base mem-
ory access latency RQ0 for reading a cache line is shown as a function of the most
relevant parameters p and TP , for some typical combinations of TS and RQ0 . The ef-
fect of p shows the importance of the so-called “low-p mapping” of parallel programs:
with low-p mappings the under-load latency is very close to the base latency.
As expected, the effect of TP is substantial for fine-grained computations (the
utilization factor increases), while for coarse-grained computations the impact on
memory conflicts tends to become negligible, so the under-load latency RQ tends
to the base one for large TP values. Figure 6.2 represents typical cases with
external main memory. Depending on the network latency, we see that p < 4 is
acceptable for medium-grained (fine-grained) computations; most important, p ≤ 2 is
acceptable even for very fine-grained computations. This confirms the concept that
low contention can be achieved with good process mappings, even in the presence
of relatively fine-grained computations.
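The qualitative trends discussed above can be reproduced with a small parameter sweep. The sketch below finds ρ by bisection on the model equations of Section 6.1.1 (a numerical alternative to the closed-form quadratic) and prints RQ/RQ0 for a few values of p and TP. It assumes TS = TM = 30τ and RQ0 = 214τ, as in the caption of Figure 6.2(a). The TP values are arbitrary, the function name `under_load_ratio` is illustrative, and the output is not claimed to match the plotted curves.

```python
def under_load_ratio(t_p, t_s, r_q0, p, iters=100):
    """Return R_Q / R_Q0 for the M/D/1 client-server model (bisection on rho)."""

    def residual(rho):
        # rho * T_cl - p * T_S, with T_cl = T_P + R_Q0 + W_Q(rho).
        # This is negative at rho = 0 and increases monotonically with rho.
        w_q = t_s * rho / (2.0 * (1.0 - rho))
        return rho * (t_p + r_q0 + w_q) - p * t_s

    lo, hi = 0.0, 1.0 - 1e-12  # stay strictly below rho = 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)
    w_q = t_s * rho / (2.0 * (1.0 - rho))
    return (r_q0 + w_q) / r_q0


if __name__ == "__main__":
    T_S, R_Q0 = 30.0, 214.0  # assumed: T_S = T_M, as in Figure 6.2(a)
    for t_p in (50.0, 200.0, 1000.0):  # arbitrary computation grains
        row = ", ".join(
            f"p={p}: {under_load_ratio(t_p, T_S, R_Q0, p):.3f}"
            for p in (1, 2, 4, 8)
        )
        print(f"T_P={t_p:6.0f}  {row}")
```

In this sketch the ratio grows with p and shrinks as TP grows, which is the behaviour the text attributes to low-p mappings and coarse-grained computations.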
6.1.4 On-chip cache-to-cache transfers
For on-chip cache-to-cache transfers the base latency is lower than in the memory
access case: thus, degradation is negligible for any computation grain when p <
10, while contention is appreciable only for higher p values and very fine-grained
computations. This result, shown in Figure 6.3, justifies the enormous potential of
the CMP technology and the trend towards very highly parallel CMPs.
Figure 6.3: RQ/RQ0 for on-chip cache-to-cache transfers. Panels: (a) TPrC = 16τ,
RQ0 = 61τ; (b) TPrC = 8τ, RQ0 = 42τ
If the value of RQ/RQ0 is substantially greater than one, the parallel program
bandwidth is lowered. To evaluate the contention effect on bandwidth, we must
re-evaluate the cost model.
If RQ/RQ0 > 1, all the server latencies are multiplied by RQ/RQ0 with
respect to the base latency evaluation.
In other words, the actual optimal parallelism degree is evaluated as a function of RQ:
\[ n_{opt}(R_Q) = \left\lceil \frac{T_{id}(R_Q)}{T_A(R_Q)} \right\rceil \]
6.2 On parallel program mapping and under-load
evaluation
In this Section we study how to map processes belonging to the run-time support of
structured parallel paradigms, in order to optimize the under-load latency of shared
data structures. The main goal of process mapping is to keep the p value as low as
possible: we speak about low-p mapping strategies. An efficient exploitation of the
architecture is allowed by data structures shared by a relatively low number p of
PEs, even if PEs are relatively distant. In the following, we will speak generically
of channel descriptors to denote the shared data structures used to implement the
communications.
NUMA mapping In a pipeline process structure each channel descriptor is shared
by two PEs (onto which the corresponding stages are mapped), thus p = 2. One
of these two PEs has the channel descriptor in its local memory. Therefore the
data structure is in the node onto which either the destination stage or the source
stage is mapped. Each memory macro-module is shared only by the local processor
and by the processor onto which the neighbor stage is mapped. The achieved p
value represents a very interesting result, leading to negligible contention, and so
the under-load latency is very close to the base one.
The best mapping for a farm program consists in using symmetric (one-to-one)
channels and allocating, for any worker, the input channel data structure (from
emitter to worker) and the two output channel descriptors (from worker to emitter,
used to implement load-balancing distribution, and from worker to collector) in the
local memory of the PE onto which the worker itself is mapped. The emitter and
collector PEs share all the worker local memories, while each worker PE accesses its
own local memory only. In this way we have p = 3 (each worker local memory is
accessed by the emitter PE, the collector PE and the worker PE itself), which again
is a very good result for sharply reducing the contention effects. If the communications
from workers to emitter and to collector are expressed by asymmetric channels, channel
descriptors are forced to be allocated in the emitter local memory and in the collector
local memory; then p = n + 1, i.e. all the worker nodes share, and are in conflict on,
the emitter and collector local memories. More in general, the strategy “always
allocate the channel descriptor in the local memory of the destination process node”
is far from the optimal one even with symmetric channels (in the example p = n + 1).
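The two farm mappings can be compared with a small counting exercise. The sketch below is illustrative only: node names ("E", "C", "W0", ...), the channel names and the helper functions are hypothetical, and each channel descriptor is simply modelled as a pair (home node, set of PEs accessing it). The p value of a node is taken as the number of distinct PEs that access data structures allocated in its local memory.

```python
from collections import defaultdict


def sharing_degree(allocation):
    """Map each home node to the number of distinct PEs accessing its memory."""
    accessors = defaultdict(set)
    for home, pes in allocation:
        accessors[home].update(pes)
    return {home: len(pes) for home, pes in accessors.items()}


def farm_allocation(n, worker_local):
    """Channel descriptors of a farm with n workers, emitter "E", collector "C".

    worker_local=True  : all three descriptors live in the worker's memory.
    worker_local=False : each descriptor lives in the destination's memory.
    """
    allocation = []
    for i in range(n):
        w = f"W{i}"
        channels = [
            ({"E", w}, w),    # task:      emitter -> worker
            ({w, "E"}, "E"),  # available: worker  -> emitter
            ({w, "C"}, "C"),  # result:    worker  -> collector
        ]
        for pes, destination in channels:
            home = w if worker_local else destination
            allocation.append((home, pes))
    return allocation


if __name__ == "__main__":
    n = 8
    for worker_local in (True, False):
        degrees = sharing_degree(farm_allocation(n, worker_local))
        print(f"worker_local={worker_local}: max p = {max(degrees.values())}")
```

With descriptors kept in the worker memories, every memory is shared by at most three PEs, whereas destination-side allocation concentrates n + 1 PEs on the emitter and collector memories.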
Low-p mapping strategies, exploiting symmetric channels, can also be applied
to the implementation of collective communications in any data parallel paradigm
(scatter, gather, multicast), as well as to collective operations (reduce), and to stencil
communications. For example, for a reduce operation, allocating the n channel
descriptors and target variables in the reduce node corresponds to the maximum
contention: p = n. However, it can be implemented much better: the n channel
descriptors and target variables are allocated in the worker nodes. This corresponds
to the minimum contention: p = 2. Similar considerations are valid for stencil-based
computations.
In conclusion, for NUMA architectures we have verified another very important
feature of structured parallel paradigms: owing to the detailed knowledge of the
process patterns for their run-time support, we are able to identify some efficient
implementation strategies in terms of communication forms and related low-p map-
pings.
SMP and NUMA-SMP Low-p mappings are mainly achieved with a large number
of interleaved macro-modules. For this reason, the trend in multiple-CMP
architectures is towards NUMA or, more precisely, NUMA-SMP (as discussed in Chapter
4). However, in CMP-based NUMA-SMP architectures, the local memory
organization is critical: a single macro-module is quite inadequate.
Shared cache and cache coherence The low-p mapping strategies, studied for
NUMA architectures, must be taken into account in the design of cache coherence
strategies.
In fact, the common association between cache coherence and SMP architectures is
misleading. Independently of the multiprocessor architecture class (and of the possible
existence of shared caches in CMPs), the basic semantics of cache coherence is close
to the NUMA idea: cache lines are allocated and controlled locally to PEs, and
cache-to-cache transfers and invalidation (or update) communications occur between
PrCs. In other words, from the under-load modeling point of view, caches behave as
shared local memories of a NUMA machine. Of course, the directory-based approach
itself is typically NUMA-oriented.
The above consideration is also important from the performance evaluation
viewpoint: automatic cache coherence protocols might be a potential source of
performance degradation, because they might prevent or hinder low-p mapping strategies.
For example, home nodes might become bottlenecks, or external communications
(from distinct PEs) for cache coherence actions, like C2C transfers or
invalidation/update requests, might increase the request traffic in input to PrC servers,
in addition to the internal requests of the same PE.
As usual, these problems are too complex if studied independently of the parallel
program’s characteristics. However, the set of possible mappings is narrowed by the
implementation of structured parallel paradigms and their run-time supports.
Notably, our solution relies on proper NUMA mappings, in particular on an exclusive
process mapping approach and on a specific run-time support that aims to provide a
low-p mapping.
For example, in the farm paradigm, a low-p mapping is implemented if the local
memory of a worker node contains all the worker channel descriptors: the input
channel task (Emitter to Worker) and output channels available and result (respec-
tively, Worker to Emitter and Worker to Collector), as well as the target variables
of Worker, Emitter and Collector.
Thus the worker node is the home node of the shared blocks comprising all these
channel descriptors and target variables.
However, if the task channel run-time support is implemented according to the ba-
sic automatic cache coherence solutions, the C2C read requests and invalidation
requests from workers’ nodes to Emitter node re-introduce a high-p contention on
the Emitter PrC. That is, low-p mappings would require that workers do not per-
form requests to the Emitter node.
This is possible through a proper combination of basic cache coherence optimization
strategies in the implementation of interprocess communication, as we will discuss in
the next Chapter. The core of the solution is the following: the communication channel
task is implemented according to the home-flush strategy, where the send, executed by
the Emitter, modifies the channel descriptor and the associated target variable through
synchronous flush operations, which invalidate the blocks in the Emitter cache locally,
thus avoiding both the C2C read requests and the invalidation communications from the
worker node in the receive/compute phases. Again, the only servers are the workers. On
the other hand, since the worker is the home node of the blocks involved, the modified
channel descriptor and local variable are automatically copied into the worker PrC by
the send run-time support. These considerations apply perfectly to data-parallel
programs too. This kind of design optimization is possible for multiprocessor
systems which provide flexible mechanisms for cache coherence strategies, like home
node selection and flush operations with local de-allocation.
6.3 PEPA: Process Algebra for Quantitative Analysis
Performance Evaluation Process Algebra (PEPA) [55] is a high-level description lan-
guage for Markov processes which belongs to the class of Stochastic Process Algebras
(SPA). Among the wide class of SPAs, we choose PEPA because it is simple but
at the same time it has sufficient expressiveness for our purposes. The simplicity
comes from the structure of the language: PEPA has only a few elements and a
formal interpretation of all expressions can be provided by a structured operational
semantics.
In this section we just introduce the minimal set of PEPA features strictly necessary
to model client-server with request-reply behavior. The interested reader can refer
to [55] for a detailed description.
First, in Section 6.3.2, we briefly describe how PEPA can be used to perform a
performance analysis of graph computations, including graphs with multiple sources,
in order to detect bottlenecks in the computation. As introduced in the first part
of this thesis, this is a fundamental part of our methodology.
Finally, we focus on the possibility of describing different kinds of CMP-based
architectures with a PEPA program. Notably, we start from a simple processor-memory
architecture and move up to CMPs with a complex cache hierarchy and cache coherence
protocols. This allows us to compare the observations and the analysis done until now
about parallel programs’ performance with the results obtained from a high-level
description of parallel applications executed on different kinds of platforms.
6.3.1 PEPA Language
Figure 6.4: Structured Operational Semantics of PEPA (rules for Prefix, Choice,
Cooperation, Hiding and Constant). For a shared activity of type α ∈ L, the rate of
the cooperation is R = (r1/rα(E)) · (r2/rα(F)) · min(rα(E), rα(F)), where rα(E)
and rα(F) are the apparent rates of α in E and F
A PEPA system is described as a composition of components that undertake
actions. Components correspond to identifiable parts in the system. For instance, in
our context, clients and servers will be the components of the systems. A
component may be atomic or may itself be composed of components. The language is
indeed compositional in the sense that new components may be formed through the
cooperation of other ones. Each component can perform a finite set of actions. An
action has a duration (or delay) which is a random variable with an exponential
distribution. Consequently, the rate of the action is given by the parameter of the
exponential distribution. For example, the expression
\[ P \stackrel{def}{=} (\alpha, r).Q \]
represents the definition of a new component P which can undertake an action α
at rate r to evolve into another component Q (defined somewhere else). Since the
durations of all the actions of the system are exponentially distributed, it is intuitive
to say that the stochastic behaviour of the model is governed by an underlying CTMC
(continuous-time Markov chain).
The syntax of the PEPA language is formally defined by the following grammar:
\[
\begin{aligned}
S &::= (\alpha, r).S \;\mid\; S + S \;\mid\; C_S \\
P &::= P \bowtie_L P \;\mid\; P/L \;\mid\; C
\end{aligned}
\]
S denotes a sequential component and P denotes a model component which
executes in parallel. C and CS stand for constants that denote either a sequential or
a model component. The effect of this syntactic separation is that only components
which are cooperations of sequential components can be built, which has been
proved in [55] to be a necessary condition for building ergodic Markov processes, i.e.
processes amenable to steady-state analysis.
The structured operational semantics is shown in Figure 6.4. Below, an intuitive
description of the most used PEPA operators is provided. The interested reader can
refer to [55] for a detailed description.
• Prefix ((α, r).P ) This is the basic mechanism to express a sequential behavior
in PEPA. As already said, a component performs an action at rate r behaving
subsequently as P.
• Choice (P+Q) This operator represents a component that may behave either
as P or as Q. Assume that α and β are the actions that enable respectively
P and Q, characterized by their own rate. The idea behind the Choice oper-
ator is that once an action has been completed, the other is discarded. For
instance, if the first action to be completed is β then the component moves to
Q, “forgetting” the other branch.
• Cooperation (P ⋈_L Q) This operator denotes the cooperation between P and
Q over L. L is the cooperation set, which contains those activities on which the
components are forced to synchronize. The rate of a shared activity has
to be altered to reflect the slower component in the cooperation (see how in
Figure 6.4). It is important to notice that, for actions not in L, components
proceed independently and concurrently with their enabled activities. Actually,
cooperation is a multi-way synchronization, since more than two components
are allowed to jointly perform actions of the same type. When concurrent
components do not have to synchronize, the cooperation set L is empty; in
these cases we will use the abbreviation P ‖ Q to denote P and Q running in
parallel. We will also use a simple syntactic shorthand to denote an expression
like (P ‖ P ‖ ... ‖ P) as P[N], with N the number of times that P is replicated.
Finally, we point out that there can be situations in which two components do
synchronize, but the rate of the shared activity is determined by only one of
the components in the cooperation. In this case the other component is defined
as passive. The rate of the activity for the passive component is denoted
with the symbol ⊤.
Figure 6.5: Co-operating modules graph example (modules M1, ..., M8 with service
times T1 = 25, T2 = 30, T3 = 40, T4 = 25, T5 = 150, T6 = 15, T7 = 27, T8 = 120;
arcs are labelled with the branch probabilities 0.6, 0.4, 0.3, 0.7, 0.5 and 0.5)
6.3.2 PEPA for graphs
An interesting application of the PEPA language for our purposes is the analysis of
the steady-state behavior of a computation graph, especially in the case of graphs
with multiple sources, which are not covered by the methodology presented in
[77] and discussed in the Introduction part of this thesis.
For example, consider the computation graph in Figure 6.5. We can approximate the
result of the steady-state analysis using PEPA, with the results shown in Table 6.1.
For each node of the graph, the table shows the service time (in the second column)
and the result of the steady-state analysis in terms of the effective interdeparture time
of each module and its utilization factor.
Therefore, PEPA can be a useful and simple tool to perform the analysis of
the graph computation in the bottleneck detection phase described in Section 3.2.
node   service time   interdeparture time   utilization factor (%)
M1     25             127.52                19.60
M2     30             123.04                24.38
M3     40             78.63                 50.87
M4     25             307.60                8.13
M5     150            262.09                57.23
M6     15             112.33                13.35
M7     27             120.97                22.32
M8     120            129.83                92.43
Table 6.1: PEPA steady-state resolution of a multiple-source computation graph
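The columns of Table 6.1 can be cross-checked with simple arithmetic: the utilization factor of a module is its service time divided by its effective interdeparture time, expressed here as a percentage. The short sketch below recomputes that column from the other two and reports the most utilized module, which is the natural bottleneck candidate. The helper names are illustrative, and the data are copied from the table.

```python
# Rows of Table 6.1: (node, service time, interdeparture time, utilization %)
TABLE_6_1 = [
    ("M1", 25, 127.52, 19.60),
    ("M2", 30, 123.04, 24.38),
    ("M3", 40, 78.63, 50.87),
    ("M4", 25, 307.60, 8.13),
    ("M5", 150, 262.09, 57.23),
    ("M6", 15, 112.33, 13.35),
    ("M7", 27, 120.97, 22.32),
    ("M8", 120, 129.83, 92.43),
]


def utilization_percent(service_time, interdeparture_time):
    """Utilization factor as a percentage: 100 * T_service / T_interdeparture."""
    return 100.0 * service_time / interdeparture_time


def max_table_mismatch(rows):
    """Largest gap between the recomputed and the tabulated utilization."""
    return max(abs(utilization_percent(ts, td) - u) for _, ts, td, u in rows)


def most_utilized(rows):
    """Name of the node with the highest tabulated utilization factor."""
    return max(rows, key=lambda row: row[3])[0]


if __name__ == "__main__":
    for node, ts, td, u in TABLE_6_1:
        print(f"{node}: recomputed {utilization_percent(ts, td):6.2f}%, table {u:6.2f}%")
    print("most utilized node:", most_utilized(TABLE_6_1))
```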
6.3.3 PEPA for under load memory latency
Another interesting application of this tool is that a PEPA program for the classi-
cal client-server model with request-reply behaviour can be instantiated to model a
processors-memory system just knowing the following parameters: Tp, Ts, p, and
the latencies relative to the message request and response sent through the net-
work interconnection, respectively Treq and Tresp. The resulting PEPA program
is shown below.
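\[
\begin{aligned}
Client_{think} &\stackrel{def}{=} (request, r_{request}).Client_{wait} \\
Client_{wait} &\stackrel{def}{=} (reply, \top).Client_{think} \\
Server &\stackrel{def}{=} (request, \top).Server + (reply, r_{reply}).Server \\
&Client_{think}[p] \bowtie_{\{reply,\, request\}} Server
\end{aligned}
\]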
Each client models a process (on a processing node) that operates forever in a
simple loop, completing in sequence the two phases think and wait. As already
stated, the length of the think phase is TP . At the end, a request action is executed
and the client waits for a reply, i.e. it starts the wait phase. The request action is a
shared action between the clients and the server. It models the situation in which a
client sends a request and the server receives it. The length of the wait phase of the
client is RQ. For this reason, the time needed to complete the reply action (phase
wait) is initially unspecified. In fact, it will be imposed in another PEPA expression
through the cooperation with another component. Therefore, Client components
see reply as a pure synchronization operation.
The server modeling the memory macro-module can either accept a request from
one of the p clients (action request) or send them a reply. The time to complete
a request action is obviously unspecified because it depends on the clients. The action
reply is shared to model the fact that a client can go back to the think phase as soon
as the server has handled its request. Finally, the last expression instantiates a
client-server model with p clients running in parallel that try to synchronize themselves
with the server through the cooperation set containing both the shared actions
request and reply.
It is useful to highlight that even simpler solutions could be formalized: for instance,
the synchronization on the action request is not strictly necessary. However, we
decided to keep it for two reasons. First, it helps to understand the semantics of the
whole system (the “request-reply behaviour”). Second, it will be necessary anyway
in further extensions of this basic model.
Model resolution Once found, the steady-state information is exploited to derive
the average response time RQserver of the server. In particular, this information
includes:
• the average population in each state of the underlying CTMC
• the throughput of the actions
In our client-server model we are interested in the average number of clients that
reside in state Clientwait (pwait) and in the throughput of the action reply (λreply).
Indeed, by applying Little's Law [66], we can extract the average time that a client
stays in the state Clientwait, which actually corresponds to RQserver:
\[ R_{Q_{server}} = \frac{p_{wait}}{\lambda_{reply}} \]
It is extremely important to notice that RQserver is not the under-load memory
access latency: it is the average time spent by a request at the server. However,
to find RQ it is enough to take into account the base latency of the network, as
in the following equation:
\[ R_Q = T_{req} + R_{Q_{server}} + T_{resp} \qquad (6.1) \]
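For the basic model with identical clients, the steady-state quantities can also be computed directly, without a PEPA tool. This is a sketch under assumptions that are not stated explicitly in the text: rrequest = 1/TP and rreply = 1/TS, and an underlying CTMC that reduces to a birth-death chain on the number k of waiting clients (from state k, request fires at rate (p − k)·rrequest and reply at rate rreply). Function names and the numeric values are illustrative. Because every PEPA activity is exponentially distributed, the server here has exponential service times, so the numbers are not expected to coincide with those of the M/D/1 model of Section 6.1.1.

```python
def client_server_steady_state(p, t_p, t_s):
    """Return (p_wait, lambda_reply) for p identical clients and one server."""
    r_request = 1.0 / t_p  # assumed: rate of the think phase
    r_reply = 1.0 / t_s    # assumed: service rate of the server
    # Birth-death balance: pi[k+1] * r_reply = pi[k] * (p - k) * r_request
    weights = [1.0]
    for k in range(p):
        weights.append(weights[-1] * (p - k) * r_request / r_reply)
    total = sum(weights)
    pi = [w / total for w in weights]
    p_wait = sum(k * prob for k, prob in enumerate(pi))  # mean waiting clients
    lambda_reply = r_reply * (1.0 - pi[0])               # throughput of reply
    return p_wait, lambda_reply


def under_load_latency(p, t_p, t_s, t_req, t_resp):
    """Equation (6.1): R_Q = T_req + R_Qserver + T_resp, via Little's Law."""
    p_wait, lambda_reply = client_server_steady_state(p, t_p, t_s)
    r_q_server = p_wait / lambda_reply
    return t_req + r_q_server + t_resp


if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        r_q = under_load_latency(p, t_p=100.0, t_s=30.0, t_req=10.0, t_resp=10.0)
        print(f"p={p}: R_Q = {r_q:.1f}")
```

With a single client there is no queueing, so RQserver reduces to the service time TS, which gives a quick sanity check of the sketch.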
Heterogeneous Clients in PEPA
In order to model different types of client, the PEPA program can be expressed as
follows, where there are pi clients of type Clienti for i = 1, ..., C:
\[
\begin{aligned}
Client1_{think} &\stackrel{def}{=} (request, r_{request_1}).Client1_{wait} \\
Client1_{wait} &\stackrel{def}{=} (reply, \top).Client1_{think} \\
&\;\;\vdots \\
ClientC_{think} &\stackrel{def}{=} (request, r_{request_C}).ClientC_{wait} \\
ClientC_{wait} &\stackrel{def}{=} (reply, \top).ClientC_{think} \\
Server &\stackrel{def}{=} (request, \top).Server + (reply, r_{reply}).Server \\
&\left( Client1_{think}[p_1] \parallel \dots \parallel ClientC_{think}[p_C] \right) \bowtie_{\{reply,\, request\}} Server
\end{aligned}
\]
Thanks to the compositional approach of PEPA, we can directly reuse the same
server component and the definition of a generic client already given before.
Basically, a generic client has the same behaviour as before, which implies that it
is unnecessary to add further operations apart from the already used request and
reply. As a consequence of this structured approach, the cooperation set in the
last expression also remains the same. Of course, a change occurs in the number of
client definitions. In fact, we want to apply the theory seen above in order to recognize
C classes of processes. According to the theory, we do not want to have a definition per
client: in order to keep the resolution complexity low, there must be a number
of client definitions equal to C. Every definition has its own request rate, which is
peculiar to that class. This rate is the inverse of the TP characterizing the
class, and it is found according to the techniques explained in the previous
section, i.e. by profiling in the easiest cases or by using the explicit phases technique.
The last expression of the program defines the overall system, in which clients of
C classes run in parallel and synchronize themselves with the server. Obviously, each
class of clients specifies the number of clients belonging to that class.
Model resolution Following the procedure of the previous section, we have to
find RQserver in order to evaluate the under-load memory access latency. Since there are
several wait states, i.e. one per client definition, we sum the average numbers
of clients staying in the wait states of all the classes:
\[ R_{Q_{server}} = \sum_{i=1}^{C} \frac{p_{wait_i}}{\lambda_{reply}} = \frac{\sum_{i=1}^{C} p_{wait_i}}{\lambda_{reply}} \tag{6.2} \]
where $p_{wait_i}$ is the average number of clients in the state $Client i_{wait}$
in the steady-state condition of the system. It is then sufficient to add the
base network latencies for the request and for the reply to obtain the under-load
memory access latency, as in Equation 6.1.
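To make Equation 6.2 concrete, the sketch below builds the CTMC of a C-class client-server system explicitly and iterates it to its steady state. This is only an illustration under stated assumptions: the reply activity is assumed to be shared among waiting clients in proportion to how many clients of each class are waiting, the solver is a plain uniformisation loop rather than one of the solvers of the PEPA tools, and the function name and the rates are hypothetical.

```python
from itertools import product


def rq_server_multiclass(pops, req_rates, r_reply, iters=20000, tol=1e-13):
    """Steady state of the C-class client-server CTMC, then Equation 6.2.

    pops[i] clients of class i issue requests at rate req_rates[i] each.
    Assumption: the single reply activity (rate r_reply) is shared among
    waiting clients in proportion to how many of each class are waiting.
    Returns (p_wait_per_class, lambda_reply, rq_server).
    """
    states = list(product(*(range(n + 1) for n in pops)))
    index = {s: k for k, s in enumerate(states)}

    # Outgoing transitions of every state: list of (target_index, rate).
    trans = []
    for s in states:
        out = []
        waiting = sum(s)
        for i, j in enumerate(s):
            if j < pops[i]:                       # a thinking client requests
                t = s[:i] + (j + 1,) + s[i + 1:]
                out.append((index[t], (pops[i] - j) * req_rates[i]))
            if j > 0:                             # a waiting client is served
                t = s[:i] + (j - 1,) + s[i + 1:]
                out.append((index[t], r_reply * j / waiting))
        trans.append(out)

    # Uniformisation: iterate pi <- pi * (I + Q / big) until it stops changing.
    big = 1.05 * max(sum(r for _, r in out) for out in trans)
    pi = [1.0 / len(states)] * len(states)
    for _ in range(iters):
        nxt = [0.0] * len(states)
        for k, out in enumerate(trans):
            stay = 1.0
            for t, r in out:
                nxt[t] += pi[k] * r / big
                stay -= r / big
            nxt[k] += pi[k] * stay
        done = max(abs(a - b) for a, b in zip(pi, nxt)) < tol
        pi = nxt
        if done:
            break

    p_wait = [sum(pi[k] * s[i] for k, s in enumerate(states))
              for i in range(len(pops))]
    lambda_reply = r_reply * sum(pi[k] for k, s in enumerate(states) if sum(s))
    return p_wait, lambda_reply, sum(p_wait) / lambda_reply


# Hypothetical example: two classes of 4 clients with T_P = 100 and T_P = 50.
print(rq_server_multiclass((4, 4), (1 / 100, 1 / 50), 1 / 10))
```

Two classes with identical rates give the same result as a single class with the summed population, which is the symmetry that allows one definition per class instead of one per client.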
Hierarchical Shared Memory
In order to model the impact of various memory and cache hierarchy levels, the
model can be extended as follows. For instance, suppose that c is the number of requests
satisfied by PrC, while m is the number of requests satisfied by M. Let $p_c$ be the
probability of satisfying a request in PrC and $p_m$ the probability of satisfying a request
in M. Then:
\[ p_c = \frac{c}{c+m} \qquad p_m = \frac{m}{c+m} \]
The idea is to model processing nodes as clients able to generate requests toward
either PrC or M. In other words, clients can choose between
two different actions, requestC and requestM.
We know from [55] that we can model this situation with a component engaging in
an action (sending a request at rate 1/Tp, i.e. with mean duration Tp) which may have two
different possible outcomes (in our case, the request is sent to PrC
or to M). The client component that performs this single action
is represented by two separate activities (requestC and requestM), whose
rates are adjusted to capture the probabilities of
the different outcomes.
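\[ r_C = \frac{p_c}{T_p} \qquad r_M = \frac{p_m}{T_p} \]
\[
\begin{aligned}
P &\stackrel{def}{=} (request_C, r_C).P_{waitC} + (request_M, r_M).P_{waitM} \\
P_{waitC} &\stackrel{def}{=} (reply_C, \top).P \\
P_{waitM} &\stackrel{def}{=} (reply_M, \top).P \\
Cache &\stackrel{def}{=} (request_C, \top).(reply_C, r_{cache}).Cache + (request_M, \top).(forward, r_{forward}).Cache \\
Memory &\stackrel{def}{=} (forward, \top).Memory + (reply_M, r_{memory}).Memory
\end{aligned}
\]
\[
P[n] \underset{\{request_C,\,request_M,\,reply_C\}}{\bowtie} Cache[c] \underset{\{forward,\,reply_M\}}{\bowtie} Memory[m]
\]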
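The rate adjustment can be checked numerically with a few lines of Python. The sketch relies on a standard property of competing exponential activities: the total rate is the sum of the individual rates, and each outcome is selected with probability proportional to its rate. The function name and the profile values are hypothetical.

```python
def split_rates(c, m, t_p):
    """Split the request rate 1/t_p between the PrC and M outcomes.

    c, m: requests satisfied by PrC and by M; t_p: mean time between requests.
    Returns (r_C, r_M).
    """
    p_c = c / (c + m)
    p_m = m / (c + m)
    return p_c / t_p, p_m / t_p


# Hypothetical profile: 950 requests served by PrC, 50 by M, T_p = 20 cycles.
r_c, r_m = split_rates(950, 50, 20.0)
print(r_c, r_m, r_c + r_m)   # the two rates add up to 1 / T_p
```

The client therefore still issues requests with mean period T_p overall, and a fraction r_C / (r_C + r_M) = p_c of them goes to PrC.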
Model resolution The way to evaluate the under-load memory access latency in the
steady-state condition of the system is basically the same as in the previous cases
for both versions. Again, we use Little's law [66] to find the so-called RQserver.
The difference lies in having more waiting states and more incoming rates to those
states. We can easily adjust Formula 6.2 in this way:
\[ R_{Q_{server}} = \sum_{i=1}^{w} \frac{p_{wait_i}}{\lambda_{reply_i}} \tag{6.3} \]
where $p_{wait_i}$ is the average number of clients in the state $P_{wait_i}$ in the
steady-state condition of the system and w is the number of waiting states. Finally,
as usual, we have to add the impact of the interconnection structures to obtain the
under-load memory access latency. In this case it is important to recall
that several interconnection structures can be involved in hierarchical shared memory
architectures. The best solution is again to consider the interconnection structures as
logically belonging to the server subsystem, with the difference that the base network
latencies Treq and Tresp are evaluated by applying the definition of mean value:
\[ T_{req} = T_{req_{PC}} \cdot p_c + T_{req_{CM}} \cdot p_m \qquad T_{resp} = T_{resp_{PC}} \cdot p_c + T_{resp_{CM}} \cdot p_m \]
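The following Python sketch combines Equation 6.3 with the mean network latencies and Equation 6.1 for a two-level hierarchy. It is a minimal illustration: the function name, the dictionary keys "PC" (processor to cache) and "CM" (cache to memory) and all the numbers are assumptions, and the steady-state quantities are taken as given inputs (for instance as exported by a PEPA solver).

```python
def rq_hierarchical(p_wait, lam_reply, p_c, p_m, t_req, t_resp):
    """Under-load latency for a two-level hierarchy.

    p_wait[i], lam_reply[i]: mean population and reply throughput of waiting
    state i (Equation 6.3). t_req, t_resp: dicts with keys "PC" and "CM"
    holding the base network latencies of the two interconnection structures.
    """
    rq_server = sum(n / lam for n, lam in zip(p_wait, lam_reply))
    mean_req = t_req["PC"] * p_c + t_req["CM"] * p_m
    mean_resp = t_resp["PC"] * p_c + t_resp["CM"] * p_m
    return mean_req + rq_server + mean_resp       # Equation 6.1


# Hypothetical steady-state values for the two waiting states (PrC and M).
rq = rq_hierarchical(p_wait=[0.5, 0.2], lam_reply=[0.05, 0.004],
                     p_c=0.9, p_m=0.1,
                     t_req={"PC": 2.0, "CM": 10.0},
                     t_resp={"PC": 2.0, "CM": 10.0})
print(rq)
```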
Shared level caches and cache coherence
Finally we can extend the previous models in order to measure the impact of shared
level caches and the interactions between the various clients and servers due to the
cache coherence protocols.
We developed PEPA models corresponding to some of the abstract models defined
in Section 4.2:
1. a single-CMP, with a shared level cache that acts as GC;
2. a single-CMP, single-MINF, without a shared level cache, with GC distributed
among the PrCs;
3. a single-CMP, multiple-MINF, with a shared level cache that acts as GC.
In particular, we show the results of the PEPA models used to evaluate the effect of the
low-p mapping strategy. To study this problem, for each model we defined:
• for standard mapping (p=n):
– a client INPUT that represents the emitter of a farm or the input module
of a data-parallel paradigm;
– a set of clients W[n] that represent the worker modules, which perform
requests to PrC, ShC (when present), M and to the remote INPUT
private cache PrCINPUT;
• for low-p mapping:
– a client INPUT that represents the emitter of a farm or the input module
of a data-parallel paradigm, and which performs requests to the
various remote W[i] private caches PrCi;
– a set of clients W[n] that represent the worker modules, which perform
requests to PrC, ShC (when present) and M.
The following PEPA model represents the case of a single-CMP, with a shared
level cache that acts as GC with standard mapping (p=n):
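\[
\begin{aligned}
W &\stackrel{def}{=} (req\_Prc, r\_PrC).WPrC\_wait + (req\_ShC, r\_ShC).WShC\_wait \\
&\quad + (req\_rem, r\_rem).WRem\_wait + (req\_M, r\_M).WM\_wait \\
WPrC\_wait &\stackrel{def}{=} (reply\_PrC, \top).W \\
WShC\_wait &\stackrel{def}{=} (reply\_ShC, \top).W \\
WRem\_wait &\stackrel{def}{=} (reply\_rem, \top).W \\
WM\_wait &\stackrel{def}{=} (reply\_M, \top).W \\
PrC &\stackrel{def}{=} (req\_Prc, \top).(reply\_PrC, 1.0/t_{PrC}).PrC \\
&\quad + (req\_ShC, \top).(req_{ShC}, 1.0/t_{sw}).PrC \\
&\quad + (reply_{ShC}, \top).(reply\_ShC, 1.0/t_{PrC}).PrC \\
&\quad + (req\_rem, \top).(req_{rem}, 1.0/t_{sw}).PrC \\
&\quad + (reply_{rem}, \top).(reply\_rem, 1.0/t_{PrC}).PrC \\
&\quad + (req\_M, \top).(req_{M}, 1.0/t_{sw}).PrC \\
&\quad + (reply_{M}, \top).(reply\_M, 1.0/t_{PrC}).PrC \\
GC &\stackrel{def}{=} (req_{rem}, \top).(l2\_forward, 1.0/t_{lookup}).GC \\
ShC &\stackrel{def}{=} (req_{ShC}, \top).ShC_{serve} + (req_{M}, \top).ShC_{lookup} \\
ShC_{serve} &\stackrel{def}{=} (reply_{ShC}, 1.0/t_{ShC}).ShC \\
ShC_{lookup} &\stackrel{def}{=} (m\_access, 1.0/t_{lookup}).ShC \\
INPUT &\stackrel{def}{=} (local\_stuff, r\_in\_loc).INPUT\_local + (l2\_forward, r\_in\_rem).INPUT\_remote \\
INPUT\_local &\stackrel{def}{=} (local, 1.0/t_{PrC}).INPUT \\
INPUT\_remote &\stackrel{def}{=} (reply_{rem}, 1.0/t_{Rem}).INPUT \\
M &\stackrel{def}{=} (m\_access, \top).M_{serve} \\
M_{serve} &\stackrel{def}{=} (reply_{M}, 1.0/t_{m}).M
\end{aligned}
\]
\[
W[n] \underset{\dots}{\bowtie} PrC[n] \underset{\dots}{\bowtie} INPUT \underset{\{l2\_forward\}}{\bowtie} GC \parallel ShC \underset{\{m\_access\}}{\bowtie} M
\]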
Figures 6.6 and 6.7 show the evaluation of RQ for each server in the hierarchy:
• PrC, the private level cache of W;
• ShC, the shared level cache;
• INPUT, the private level cache of INPUT accessed by W;
• W(INPUT), the private level cache of W accessed by INPUT;
• M, the main memory.
Figure 6.6: Comparison with PEPA of RQ of each server: low-p mapping strategy
(pW[i] = 2) versus standard mapping strategy (pIN = n + 1) in Single-CMP with
GC implemented at the shared cache level. Panels: (a) pIN = n + 1; (b) pW[i] = 2.
Figure 6.6 shows the case of a single-CMP with a shared level cache that acts as
GC, while Figure 6.7 shows the case of a single-CMP, single-MINF, without a shared
level cache, with GC distributed among the PrCs. In both cases, the panels show:
(a) a standard mapping strategy, where each W[i] accesses the INPUT private
level cache (pIN = n + 1);
(b) a low-p mapping strategy, where each W[i] accesses its private level cache and
INPUT accesses the various W[i] private level caches (pW[i] = 2).
Figures 6.8 and 6.9 show the corresponding values of RQ/RQ0.
Finally, we measure the impact of multiple MINFs on chip in a single-CMP with
a shared level cache that acts as GC. The results justify the high value of RQ/RQ0,
especially in the corresponding model with mM = 1. The corresponding results are
shown in Figure 6.10.
All these results confirm those anticipated in the previous section of this
Chapter, and we can use them as a good validation, especially because here we
are able to take into account the whole system, with different levels of servers that
represent each level of the memory/cache hierarchy.
Figure 6.7: Comparison with PEPA of RQ of each server: low-p mapping strategy
(pW[i] = 2) versus standard mapping strategy (pIN = n + 1) in single-CMP, single-MINF,
without a shared level cache, with GC distributed among the PrCs.
Panels: (a) pIN = n + 1; (b) pW[i] = 2.
Figure 6.8: Comparison with PEPA of RQ/RQ0 of each server: low-p mapping
strategy (pW[i] = 2) versus standard mapping strategy (pIN = n + 1) in Single-CMP
with GC implemented at the shared cache level.
Panels: (a) pIN = n + 1; (b) pW[i] = 2.
Figure 6.9: Comparison with PEPA of RQ/RQ0 of each server: low-p mapping
strategy (pW[i] = 2) versus standard mapping strategy (pIN = n + 1) in single-CMP,
single-MINF, without a shared level cache, with GC distributed among the PrCs.
Panels: (a) pIN = n + 1; (b) pW[i] = 2.
Figure 6.10: Comparison with PEPA of RQ of each server: low-p mapping strategy
(pW[i] = 2) versus standard mapping strategy (pIN = n + 1) in a single-CMP, multiple-MINF,
with a shared level cache that acts as GC.
Panels: (a) pIN = n + 1; (b) pW[i] = 2.
6.3.4 On the resolution of PEPA models
Solving a PEPA model means solving the underlying ergodic CTMC, i.e. computing
its steady-state distribution. We wrote and solved PEPA models using the Eclipse plug-in for
PEPA [1]. This tool provides many different numerical resolution techniques to
solve the model. Different techniques can be employed depending on the size of the
resulting CTMC: if the number of states is huge (hundreds of thousands), iterative
yet approximate techniques are preferred. However, the models that we treat are
extremely small, thus the steady state has been computed directly with a
standard algorithm (i.e., a direct solver or a conjugate gradient solver for calculating the
steady-state probability distribution). In other cases, e.g. when the number of
clients grows significantly, a phenomenon known as state space explosion may arise.
However, thanks to the natural structure of our models, we may take full advantage
of both state-reduction and fluid-approximation techniques [55]. Briefly, these
techniques aim to solve the state space explosion by exploiting potential symmetries
in the CTMC. The presence of symmetries can be informally deduced by looking at
the PEPA expressions: for instance, in our model the set of homogeneous clients
(Client[p]) induces replicated sub-Markov chains in the underlying CTMC. These
replicated subsystems are exploited to restructure the CTMC itself and reduce
the state space size.
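The size of the saving offered by this symmetry can be illustrated with a small counting experiment in Python. The sketch only compares the number of states of an explicit representation (one think/wait flag per client) with the aggregated one (number of waiting clients); it is not the reduction algorithm implemented by the PEPA tools, and the function name is hypothetical.

```python
from itertools import product


def state_counts(p):
    """Compare explicit and aggregated state spaces for p identical clients.

    Each client is either thinking (0) or waiting (1). The aggregated state
    only records how many clients are waiting.
    """
    explicit = list(product((0, 1), repeat=p))          # 2**p states
    aggregated = {sum(state) for state in explicit}     # p + 1 states
    return len(explicit), len(aggregated)


for p in (2, 4, 8, 16):
    print(p, state_counts(p))
```

The explicit state space grows exponentially with the number of clients, whereas the aggregated one grows linearly, which is why replicated homogeneous components are cheap to analyse.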
6.4 Summary
In this chapter, we dealt with the problem of evaluating the under-load memory
latencies in order to complete the evaluation of the performance of parallel
applications executed on CMP-based systems.
In the first part, we used the results of Queueing Theory for a client-server model.
Notably, we applied the work proposed in [107] to model under-load memory and
cache access latencies in cache coherent architectures.
The results obtained are used to reason about the effect of under-load latencies on
parallel program performance. Notably, we discussed the impact of specific choices
(e.g., parallel process mapping and cache coherence optimizations) in the run-time
support of parallelism forms.
Finally, we used a performance evaluation tool, Performance Evaluation
Process Algebra (PEPA), in order to evaluate the base memory and cache
latencies defined in Chapter 4, by describing a complete abstract architecture (i.e.,
with different levels of servers for each level in the memory hierarchy) and deriving a
complete cost model for various abstract models. These evaluations confirm
the impact of the mapping strategies on the performance of parallel
programs.
Of course, alternative approaches, based on simulation and/or experimental
evaluation, are helpful during some design and evaluation phases: for example, a good
queueing network simulator of parallel architectures is described in [23].
According to the results obtained, we are able to study the performance of alternative
run-time support solutions with different approaches to cache coherence, notably
evaluating the advantages of specific optimizations (i.e., low-p mapping strategies).
Part III
Evaluation of the Proposed Methodology

CHAPTER 7
A Structured Parallelism Approach to Cache Coherence
All the considerations made in the second part of the thesis are now used to provide
an optimized run-time support for structured parallel applications. Our approach
aims to design a run-time support for advanced CMP-based multiprocessors. Notably,
this chapter gathers our research group's efforts in the implementation of run-time
support for CMP-based architectures. Many of the concepts presented here have
already been published [24, 25, 107] and were further refined during this thesis. In particular,
this thesis represents a contribution to the design and study of the run-time support
of parallel paradigms, with optimizations strictly related to the cache coherence problem
and its impact on the performance of parallel applications.
In the final part of this Chapter we discuss the implementation of these optimizations
on the Tilera TilePro64 processor. This architecture represents
a good candidate for the evaluation of our solutions, due to the possibility
of implementing or emulating our ideas.
Notably, with this architecture we are able to achieve an improvement of about 50%
with respect to the use of the default cache coherence solution.
7.1 Optimizations for Parallel Paradigms Run-time Support
In this Section we analyze some relevant optimization techniques that can be used in
the development of the run-time support of parallel paradigms. The write operation
implementation is a key issue from the performance point of view, in particular for
the contention effects on memory modules or caches. As introduced in the previous
Chapter, because of the cache coherence protocols, PrC and ShC may be regarded
as servers in the multiprocessor queueing model: the read/write operations imply
client-server relationships between PEs and caches (e.g., for C2C cache line trans-
fers, write synchronous notifications, and invalidations with acknowledgment). In
order to minimize the server utilization factor, the protocol interactions should be
designed accurately.
In this Section we describe and evaluate optimization techniques which are present
in, or are an equivalent model of, many systems. In the following we refer to
directory-based, invalidation-based architectures, though the discussed techniques
are valid more generally. Notably, in the following we refer to PEhome, or simply the
home node, for a specific cache line as the PE in whose local main memory the line
is allocated. If the architecture is not strictly NUMA, it is the PE in charge of
controlling a given partition of blocks (PEhome implements the GC). The home node is
able to serve a cache line request directly via cache-to-cache transfers when the line
is currently present (thus, valid) in its local cache, or by transferring the requested
line from the (local) main memory if the cache line is not modified in another PE.
7.1.1 Flexible home node selection
A first general problem is the global synchronization implied by invalidation (or
updating).
In any cache coherence solution, if nsh > 1 denotes the number of copies to be
invalidated, the write operation could have a cost which grows proportionally with
this number.
As already discussed in Chapter 5, the invalidation notifications and acknowledgments
are performed in parallel; however, they contribute to increasing contention,
and in the case of synchronous writes (e.g., for solving the memory ordering problem)
the corresponding cost cannot be entirely overlapped. Thus, minimizing
the number of current copies of the same block is a must in parallel program design.
Often, according to the specific problem semantics and/or to the design strategy, it
is possible to recognize a very limited number of copies (one copy at most) to be
invalidated: this is a powerful feature of some structured parallel paradigms. For
example, in farm or data-parallel computations the input and output channel data
structures used by each worker are shared only between the worker itself and the
emitter or the collector, limiting the number of copies to be invalidated to one.
Notably, proper strategies for the home node selection can be useful, provided that
flexible mechanisms are available for this purpose.
7.1.2 Home-flush techniques
In almost all write operations (i.e., except when it is possible to perform the write
locally), the home node is informed or involved by the cache coherence protocol. When
the home node does not coincide with the requestor node that performs the write
operation, an alternative implementation of the write operation can be provided.
We call this technique home-flush, because it is characterized by the de-allocation
of the referred cache line from the requestor node PrCr, while the whole line is sent, or
flushed, to the home node. The home node uses this data to update the local main
memory when necessary (PEr and PEhome have distinct local main memories) and
the PrCh through a cache-to-cache write request communication.
Thus, we assume the presence of a special instruction flush with the above semantics.
A synchronous version of the flush instruction can be provided in order to easily
solve any memory ordering problems, as discussed in Chapter 5.
The use of this operation reduces home node latency and contention by avoiding
the involvement of PEr (which no longer holds a copy of the cache line) in subsequent
read/write operations executed on the same cache line by the home node itself or by
other PEs. In other words, the advantage of block flush is not only the latency
saving (though relevant): more importantly, contention on PrC is reduced, as well as
the global synchronization implied by invalidation.
Block flush can be used as an alternative mechanism to invalidation in write execution.
When provided, this mechanism represents an important optimization. Similar
solutions have been studied in the literature [27, 56], evaluating the advantages of
this technique in producer-consumer patterns.
In some architectures the flush mechanism, including the de-allocation effect, is
associated with an entire data structure instead of a single cache line, which can be
even more powerful. For example, the Tilera TilePro64 provides this instruction for
the management of software cache coherence, flushing the data to the main memory.
While a pure flush mechanism between PrCs is not explicitly provided, it can be
emulated with the write-through semantics of the write operation between a requestor
node and the home node. In other cases [95] the term flush is used in a different way,
meaning that the whole content of one or more cache levels is copied into the main
memory: this mechanism has quite different semantics and is not of interest for
the ensuing discussion.
We can evaluate the cost of this mechanism by considering the use of the cache-to-cache
facilities in the abstract model presented in Chapter 4.
A cache-to-cache write request c2c_write_req is composed of σ+1 words sent from
PEr to PEhome through the interconnection network; PEhome sends back an
acknowledgment ack_c2c_answer of a few words (1 or 2) through the interconnection
network after the data are written in its PrChome and, when necessary, written back
to the local main memory. This write-back can of course be done in parallel, so that
its cost is possibly overlapped. Therefore, we can evaluate the cost as follows:
\[ L_{writeC2C}(M,-,-) = T_{net}(\sigma) + T_{PrC} + [T_M] + T_{net} \]
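The cost formula can be turned into a small Python helper for quick what-if estimates. This is a sketch under explicit assumptions: the network latency is passed in as a callable of the message length in words (its signature and the linear model used in the example are hypothetical), the acknowledgment is assumed to be one word long, and the bracketed memory term is charged only when a write-back is needed and cannot be overlapped.

```python
def l_write_c2c(t_net, sigma, t_prc, t_m=0.0, write_back=False, overlapped=True):
    """Latency of a home-flush write following the formula above.

    t_net: callable giving the network latency for a message of w words
    (hypothetical signature); sigma: cache line size in words.
    The bracketed T_M term is charged only if the home node must write the
    line back to memory and that write is not overlapped.
    """
    memory_term = t_m if (write_back and not overlapped) else 0.0
    ack_words = 1                       # the text allows 1 or 2 words
    return t_net(sigma) + t_prc + memory_term + t_net(ack_words)


# Hypothetical linear network model: 10 cycles of base latency plus 1 per word.
net = lambda words: 10 + words
print(l_write_c2c(net, sigma=8, t_prc=2))
print(l_write_c2c(net, sigma=8, t_prc=2, t_m=50, write_back=True, overlapped=False))
```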
7.1.3 Cooperation mechanisms among cores through inter-processor communications
An interesting aspect of modern CMP architectures, notably (but not only) network
processors, is the availability of very specific architectural structures that can be
exploited to speed up inter-core cooperation, such as user-accessible
on-chip core-to-core interconnection networks. These networks (referred to as
Messaging Networks) are used to exchange messages containing packet descriptors, i.e.
the packet headers and the initial address of the packet in memory. The
transmission of packet descriptors is performed over the messaging network by
skipping all the shared memory and cache hierarchy levels, thus exploiting the on-chip
interconnection to limit the memory contention by sharply reducing the
communication latency. Bus, Ring or Mesh networks are available on modern architectures
like Broadcom XLP [82], Cavium [59] and Tilera TilePro64 [19].
These mechanisms can be used for our purposes to implement lightweight
cooperation mechanisms among cores through inter-processor communications. In
particular, this kind of communication is performed asynchronously with respect to the
execution of read and write operations. We can consider that the basic
communication corresponds to sending a message composed of a few words i (e.g., 1 to 4),
which is transmitted to the destination PE in an interrupt message.
Therefore, we can evaluate the latency of an inter-processor communication of i
words in terms of the network latency, as follows:
\[ L_{IP} = T_{net}(i) \]
In modern CMP architectures, this latency is comparable to the PrC access latency
for single-CMP architectures with a low-latency on-chip interconnection network.
7.2 Communication run-time support
Standard lock-based run-time supports are based on symmetric mutual exclusion
of shared data structures, as discussed in Chapter 5. This approach is a generic
one, valid for any architecture with multiprogrammed mapping and classical low-
level scheduling with passive waiting. It is typically oriented to the execution of
concurrent jobs, not necessarily parallel or highly-parallel, including sequential or
concurrent applications and concurrent operating system services. The use of sym-
metric lock-based techniques is a limitation for low-latency communications, though
the design exploits some notable optimizations, notably user-space implementation
and communication overlapping. This scheme can be implemented with exclusive
mapping too, replacing the low-level scheduling sections with busy waiting synchro-
nization.
Our approach is an optimized, inherently lock-free version, entirely based on
asymmetric notify-based Rdy-Ack synchronization for exclusive mapping processes. We
aim to design an optimized run-time support for advanced CMP-based multiprocessors
with exclusive mapping. The exclusive mapping approach is oriented to single
highly parallel programs, in particular structured parallel programs, which exploit
the whole PE set. The target is low-latency interprocess communication, possibly
associated with communication overlapping. In this way, we are able to apply the
optimizations introduced in the previous Section (which derive from the knowledge
of the structure of parallel paradigms) to the run-time support of the parallel
application. In this Section we define and evaluate this latter approach. Notably, the
optimization techniques introduced in the previous Section are exploited to define
an algorithm-dependent solution to the cache coherence problem in current
CMP-based architectures.
7.2.1 The Rdy-Ack Communication Model
We start with a very basic mechanism, that we call Rdy-Ack communication, used
by PEs to synchronize and exchange messages. This mechanism provides a point-
to-point communication between two partners, sender (S) and receiver (R), with a
buffer of one position (vtg). S and R exploit two primitives send and receive that
implement the communications as summarized in Figure 7.1.
Sender:
send (msg) {
    wait until ack is present
    copy msg in vtg
    reset ack
    signal rdy to the receiver
}
Receiver:
receive (data) {
    wait until rdy is present
    copy vtg in data
    reset rdy
    signal ack to the sender
}
Figure 7.1: Abstract definition of the send-receive operations in the rdy-ack
communication model. The sender and the receiver share the buffer vtg and the two
events rdy and ack.
The pseudo-code of the primitives uses two boolean events:
• the ready (RDY) event, which specifies the presence of a new message, and
• the acknowledgment (ACK) event, which represents the reception of the last
transmitted message.
With the signal operation, the corresponding event is set to true, while the reset
one sets the event to false. To ensure correctness, the RDY and the ACK events are
initialized to false and true respectively.
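The abstract send and receive of Figure 7.1 can be rendered almost literally in Python with two event objects. The sketch below is only an illustration of the protocol: it uses blocking waits from the standard threading module instead of the busy waiting assumed by the run-time support, and the class and function names are hypothetical.

```python
import threading


class RdyAckChannel:
    """One-position channel following the abstract send/receive of Figure 7.1."""

    def __init__(self):
        self.vtg = None
        self.rdy = threading.Event()      # RDY starts false
        self.ack = threading.Event()      # ACK starts true
        self.ack.set()

    def send(self, msg):
        self.ack.wait()                   # wait until ack is present
        self.vtg = msg                    # copy msg in vtg
        self.ack.clear()                  # reset ack
        self.rdy.set()                    # signal rdy to the receiver

    def receive(self):
        self.rdy.wait()                   # wait until rdy is present
        data = self.vtg                   # copy vtg in data
        self.rdy.clear()                  # reset rdy
        self.ack.set()                    # signal ack to the sender
        return data


def demo(n=1000):
    """Send 0..n-1 from the calling thread and collect them in another one."""
    ch, received = RdyAckChannel(), []
    consumer = threading.Thread(
        target=lambda: received.extend(ch.receive() for _ in range(n)))
    consumer.start()
    for i in range(n):
        ch.send(i)
    consumer.join()
    return received
```

Calling demo() returns the messages in sending order: each event is reset by the partner that waited on it before the other event is signaled, so the two partners strictly alternate on the buffer.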
As discussed in Chapter 5, in a message-passing implementation model, S and R
use the send and receive operations to exchange messages, while in a passing-pointer
approach the message copied into vtg can be a memory pointer to a shared data
structure, so that data structures are exchanged by reference. In the following we refer in
both cases to vtg as the exchanged unit and, when necessary, we specify whether vtg
represents a set of words (a copy of the entire message) or a single machine
word representing a memory pointer.
7.2.2 Rdy-Ack Based on Shared Memory Synchronizations
To understand the principle which underlies this solution, we start with a first
implementation of the rdy-ack communication model, which consists of a symmetric
communication mechanism based on shared memory variables.
We define a VTG data structure in order to associate with the target variable vtg the
corresponding RDY and ACK events, which are implemented by two boolean flags
(initialized to 0 and 1 respectively). Figure 7.2 shows the VTG data structure and the
send and receive operations on a rdy-ack communication based on shared memory
variables (ra_sm), where waiting for an event is implemented by a while-loop on
the corresponding flag.
VTG data structure: VALUE, ACK, RDY
1  send (ra_sm, msg) {
2   while(ra_sm->ack == 0);
3   <copy msg_value in vtg_value>
4   ra_sm->ack = 0;
5   ra_sm->rdy = 1;
6  }
1  receive (ra_sm, data) {
2   while(ra_sm->rdy == 0);
3   <copy vtg_value in data>
4   ra_sm->rdy = 0;
5   ra_sm->ack = 1;
6  }
Figure 7.2: Algorithms of the send-receive operations in the rdy-ack communication
model based on shared memory synchronizations
Correctness of the Send-Receive Algorithms
Let us consider an abstract multi-processor architecture M respecting the Sequential
Consistency memory model (Section 5.2.3). Accordingly, load/store instructions of
the same processor are executed in the program order and they can be interleaved
with instructions of different processors in any sequential order. In this case, the
following proposition is valid.
Proposition 7.2.1. The send and receive algorithms executed on M implement
a lock-free single-producer single-consumer shared buffer of one position.
Proof. Initially (RDY, ACK) = (0,1). This means that the sender can proceed by
executing the send, while the receiver is possibly waiting on line 2 of the receive.
The sender copies msg in the vtg_value field and sets the flags such that (0,1) →
(1,0). Now the receiver is the only one of the two partners that can execute the
communication primitive. It reads the message, copies it in a private variable
data, and sets the flags such that (1,0) → (0,1), going back to the initial condition.
It is worth noting that line 3 in the send must be executed only after ACK is equal
to 1 (otherwise the new message can overwrite a previous and possibly unreceived
message), and line 5 after line 3 (RDY must be set to 1 only after the store of the
message in vtg is visible to the receiver). Similarly, line 3 in the receive must be
executed only if RDY is equal to 1, and line 5 after line 3 (saving the message
in the private variable before it is overwritten by the sender).
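The argument of the proof can also be checked mechanically for a sequentially consistent machine by enumerating every interleaving of the two algorithms of Figure 7.2. The Python sketch below is such an exhaustive check under simplifying assumptions that are not part of the thesis: each numbered line is treated as one atomic step, messages are numbered modulo a small constant, and the invariant tested is that the receiver always reads exactly the message it expects.

```python
from collections import deque

N = 4  # messages are numbered modulo N


def successors(state):
    """All states reachable by one atomic step of the sender or the receiver."""
    ps, pr, ack, rdy, vtg, nxt, exp = state
    out = []
    # Sender, lines 2-5 of send in Figure 7.2.
    if ps == 0 and ack == 1:
        out.append((1, pr, ack, rdy, vtg, nxt, exp))          # ack test passed
    elif ps == 1:
        out.append((2, pr, ack, rdy, nxt, nxt, exp))          # vtg = msg
    elif ps == 2:
        out.append((3, pr, 0, rdy, vtg, nxt, exp))            # ack = 0
    elif ps == 3:
        out.append((0, pr, ack, 1, vtg, (nxt + 1) % N, exp))  # rdy = 1
    # Receiver, lines 2-5 of receive in Figure 7.2.
    if pr == 0 and rdy == 1:
        out.append((ps, 1, ack, rdy, vtg, nxt, exp))          # rdy test passed
    elif pr == 1:
        if vtg != exp:
            raise AssertionError(f"lost or duplicated message in {state}")
        out.append((ps, 2, ack, rdy, vtg, nxt, (exp + 1) % N))  # data = vtg
    elif pr == 2:
        out.append((ps, 3, ack, 0, vtg, nxt, exp))            # rdy = 0
    elif pr == 3:
        out.append((ps, 0, 1, rdy, vtg, nxt, exp))            # ack = 1
    return out


def explore():
    """Breadth-first search over every interleaving; returns the state count."""
    start = (0, 0, 1, 0, None, 0, 0)      # (RDY, ACK) = (0, 1), empty buffer
    seen, queue = {start}, deque([start])
    while queue:
        for nxt_state in successors(queue.popleft()):
            if nxt_state not in seen:
                seen.add(nxt_state)
                queue.append(nxt_state)
    return len(seen)


print("reachable states:", explore())
```

The search terminates without raising, which mirrors the proof: while the sender is between lines 3 and 5 the receiver is still spinning on RDY, and while the receiver is between lines 3 and 5 the sender is spinning on ACK.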
When the architecture adopts a weak memory consistency model, we can ensure
correctness by forcing the right order of the operations (as discussed in Chapter 5).
Notably, in the message-passing implementation we need to ensure the atomicity of the
copies, while in the passing-pointer solution the send algorithm has to avoid the reading
of stale data.
Zero-copy Receive in Message-Passing Solutions
In order to avoid the copy (unless the message is very short) and to use the
vtg_value directly in the receiver computation phase (after the receive execution),
we provide an alternative zero-copy version of the receive.
The semantics must be equivalent to the basic algorithm with copy: thus the ACK
cannot be signaled until the process has finished using (or actually copying, if
convenient) the vtg_value. Otherwise the sender could modify the vtg_value during
its utilization. The solution consists in providing a simple additional set_ack
primitive, whose effect is to set ACK = 1. The use and the algorithms of the zero-copy
version used by the receiver are summarized in Figure 7.3.
In the following, we use this version of the receive operation in the case of
message-passing implementations.
send (ra_sm, msg) {
 while(ra_sm->ack == 0);
 <copy msg_value in vtg_value>
 ra_sm->ack = 0;
 ra_sm->rdy = 1;
}
receive (ra_sm, data_ref) {
 while(ra_sm->rdy == 0);
 <copy vtg_ref in data_ref>
 ra_sm->rdy = 0;
}
<Compute phase>
set_ack (ra_sm) {
 ra_sm->ack = 1;
}
Figure 7.3: Algorithm of the zero-copy receive in the rdy-ack communication model
based on shared memory synchronizations and an example of its use
Communication with any asynchrony degree
Let us now extend the basic rdy-ack solution to communications with more than
one buffer position. We denote by k ≥ 1 the asynchrony degree, which represents
the maximum number of messages that a sender can send without waiting for the
first sent message to be received. In the basic implementation we have k = 1. A
higher asynchrony degree can be obtained by using k instances of VTG.
The sender and the receiver each have a private array CH of k memory pointers to the
target variable instances and a corresponding private index (initialized to zero).
The VTG instances are the only data structures shared between the sender and the
receiver, and they are used in a round-robin fashion, with index denoting the
next VTG to use. Figure 7.4 shows the data structures used in this solution.
Figure 7.4: Rdy-Ack communication data structures for asynchrony degree k > 1.
The sender and the receiver each hold a private array CH[0..k-1] and a private index;
the shared instances VTG[0], ..., VTG[k-1] each contain VALUE, ACK and RDY.
Each time the sender wants to transmit a new message, the VTG referred to by the sender's
CH[index] is selected, the send primitive is executed on it (according to the
same pseudo-code of Figure 7.2) and the sender index is incremented by 1 modulo
k. Symmetric actions are performed on the receiver’s side using the private CH and
index variables.
This algorithm guarantees the correct synchronization. That is, the sender (receiver) is
blocked on the current VTG if it contains ACK = 0 (RDY = 0), or it completes the
primitive if ACK = 1 (RDY = 1).
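The round-robin use of k VTG instances can be sketched in Python as follows. The code is a functional illustration only: it relies on the interpreter executing attribute reads and writes in program order (so no memory fences appear), it yields the processor inside the busy-waiting loops, and the class names VTG and Endpoint are hypothetical.

```python
import threading
import time


class VTG:
    """One shared buffer position with its two flags."""

    def __init__(self):
        self.value, self.rdy, self.ack = None, 0, 1


class Endpoint:
    """Private view of one partner: CH (references to the VTGs) and index."""

    def __init__(self, vtgs):
        self.ch, self.index = list(vtgs), 0

    def _next(self):
        vtg = self.ch[self.index]
        self.index = (self.index + 1) % len(self.ch)   # round-robin
        return vtg

    def send(self, msg):
        vtg = self._next()
        while vtg.ack == 0:
            time.sleep(0)                  # busy waiting, yielding the CPU
        vtg.value = msg
        vtg.ack = 0
        vtg.rdy = 1

    def receive(self):
        vtg = self._next()
        while vtg.rdy == 0:
            time.sleep(0)
        data = vtg.value
        vtg.rdy = 0
        vtg.ack = 1
        return data


def demo(k=4, n=500):
    """Transfer 0..n-1 through a channel with asynchrony degree k."""
    shared = [VTG() for _ in range(k)]
    sender, receiver = Endpoint(shared), Endpoint(shared)
    got = []
    t = threading.Thread(
        target=lambda: got.extend(receiver.receive() for _ in range(n)))
    t.start()
    for i in range(n):
        sender.send(i)
    t.join()
    return got
```

Only the VTG objects are shared; each Endpoint keeps its own CH and index, so the sender can run up to k messages ahead of the receiver before blocking on a VTG whose ACK is still 0.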
A first notable advantage with respect to standard lock-based solutions is achieved
in terms of cache exploitation: the CH arrays are private data structures which represent a
clear opportunity for reuse, and the first cache line of VTG contains all the needed
synchronization information. Notably, each CH resides permanently in the PrC of
the sender and the receiver, with a consequently very low access overhead with
respect to the case of k = 1.
Let us study the detailed implementation of rdy-ack send and receive on a
single VTG (k = 1). As introduced in Section 6.2, we have two possible approaches
in the implementation of run-time support based on the use of automatic cache
coherence:
1. the use of the basic invalidation semantics, which does not guarantee low-p
mappings, or
2. exploiting the home-flush technique, which aims to achieve low-p mappings at
least for structured parallel paradigms.
Solution 1 is feasible in almost any architecture with standard automatic cache
coherence. Solution 2 is feasible when the architecture provides mechanisms for
home node selection and for flush with de-allocation as introduced in Section 7.1.
Implementation and cost model for automatic cache coherence
Consider again the pseudo-code of send and zero-copy receive for k = 1.
In this case, we can apply the same considerations made in Chapter 5 to derive a
cost model for this rdy-ack implementation based on automatic cache coherence.
Notably, here VTG encapsulates both data and synchronization information.
Let PEsender and PEreceiver be the respective processing nodes. Though it is likely
that home node PEhome coincides with one of them, no specific strategy nor opti-
mization is assumed in this first run-time support solution.
Consider the send operation. The first read of the ACK value causes the read of
a modified value from PEreceiver. If, at the first test, it is ACK = 1, then only one
cache line transfer occurs (exploiting in-cache retry), otherwise an additional block
transfer is paid (due to the invalidation caused by the receiver).
In a message-passing implementation, we estimate the send latency Tsend as the sum
of the setup phase Tsetup and the copy phase Nlines Ttransm, where Nlines is the
number of cache lines used for the msg/vtg value. The setup phase includes the cache
line transfer(s) for the ACK value and the write on the first cache line of VTG for
modifying RDY and ACK. Therefore, we can estimate the setup cost as
$$T_{setup} \sim (1 + p_{wait})\, L_{read}(I, -, M) + L_{write}(S, S(1), -)$$
where pwait denotes the probability of the waiting condition.
The message copy involves write operations on modified values which, in the worst
case, are still in PrCreceiver. We can assume that the message value (msg) is present
in PrCsender or, in the worst case (i.e., for long messages), in ShC when present or
in M, with the additional transfer latency overhead. Therefore, the copy cost can be
estimated as
$$T_{transm} \sim L_{write}(I, -, M)$$
while, in the case of an additional cache line transfer, we have
$$T_{transm} \sim L_{read}(I, *, -) + L_{write}(I, -, M)$$
In the case of short messages, notably when the vtg value is in the same cache line
as the RDY and ACK values, the setup phase already includes the read latency of the
copy phase relative to the vtg value. So in this best-case scenario we have
$$T_{transm} \sim 0$$
Therefore, we have the following send latency in message-passing implementation
in the medium case
$$
\begin{aligned}
T^{MP}_{send} &= T_{setup} + N_{lines}\, T_{transm} \\
&\sim (1 + p_{wait})\, L_{read}(I, -, M) + L_{write}(S, S(1), -) + N_{lines}\, L_{write}(I, -, M)
\end{aligned}
\tag{7.1}
$$
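As a worked illustration of formula (7.1), the following sketch evaluates the send latency for given base latencies. The function and parameter names are hypothetical, and the numeric latencies in the usage lines are placeholders rather than measurements of any architecture; as stated later in the text, real values must be evaluated under load for a specific target.

```python
def t_setup(p_wait: float, l_read_mod: float, l_write_shared: float) -> float:
    """Setup cost: (1 + p_wait) * L_read(I,-,M) + L_write(S,S(1),-)."""
    return (1 + p_wait) * l_read_mod + l_write_shared


def t_send_mp(p_wait: float, n_lines: int, l_read_mod: float,
              l_write_shared: float, l_write_inv: float) -> float:
    """Send latency of (7.1): setup plus n_lines cache-line writes L_write(I,-,M).

    For short messages held in the first VTG cache line the copy term is
    negligible, which corresponds to calling this function with n_lines = 0.
    """
    return t_setup(p_wait, l_read_mod, l_write_shared) + n_lines * l_write_inv


if __name__ == "__main__":
    # Placeholder latencies in clock cycles (assumed values, for illustration only).
    print(t_send_mp(p_wait=0.5, n_lines=4, l_read_mod=100,
                    l_write_shared=20, l_write_inv=50))
    print(t_send_mp(p_wait=0.5, n_lines=0, l_read_mod=100,
                    l_write_shared=20, l_write_inv=50))
```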
In a passing-pointer implementation, VTG is represented with few words (e.g., RDY,
ACK and memory pointer), therefore we can assume the same send latency as the
message-passing implementation in the case of short messages. As discussed in
Chapter 5, passing-pointer implementation in WSO systems requires a memory
barrier before the reference is copied into the vtg value. For this reason, an additional
(possibly partially overlapped) latency is paid during the send operation.
$$
\begin{aligned}
T^{PP}_{send} &= [N_{lines}\, T_{write}(I, M/E, -)] + T_{setup} \\
&\sim [N_{lines}\, T_{write}(I, M/E, -)] + (1 + p_{wait})\, L_{read}(I, -, M) + L_{write}(S, S(1), -)
\end{aligned}
\tag{7.2}
$$
Concerning the receive operation, the same considerations made for the send and
the ACK value can be applied to the RDY value. Therefore, Treceive can be
estimated as
$$T_{receive} \sim T_{setup}$$
This evaluation is valid both for message-passing and passing-pointer
implementations.
When RDY = 1 the msg/vtg value is available for the compute phase, during which
it is read when needed, as discussed in Chapter 5.
Notably, in a message-passing solution, when set_ack is executed, we can assume
that the first VTG cache line is still in PrCreceiver, adding a negligible cost
of Lwrite(M,−,−) to the receive latency.
All latencies must be evaluated under-load and for a specific target architecture.
As discussed in Chapter 6, in this solution with the basic invalidation semantics, the
ratio RQ/RQ0 might be substantially greater than one in some parallel programs.
Implementation and cost model for automatic cache coherence: exploit-
ing the home-flush technique
As discussed in Section 7.1, we can provide an optimized version of send and
receive run-time support by using the home-flush technique, in order to use a
low-p mapping strategy.
In parallel paradigms, we can have either a communication between a home node
sender and non-home node receiver or a communication between a non-home sender
and a home node receiver. Notably, we can say that a channel with home node
sender has always a non-home node receiver: this is general for any parallel pro-
gram (e.g. for a worker-worker channel in a stencil-based data parallel program, just
one worker is home node for the channel descriptor). On the other hand, a channel
with a non-home sender has a home node receiver: this is not general (e.g. the two
partners might both be non-home nodes), but it is very likely.
Send/Receive for the home node. For the home node, the run-time support of the
send and receive operations is the same as for the basic invalidation solution.
The corresponding receive-set_ack (receive in pointer-passing implementation),
executed by a non-home receiver, reads the first modified cache line of VTG from the
home node and updates RDY and ACK through a flush. As described in Section
7.1, the flush operation executed by a non-home receiver node:
1. de-allocates the cache line from the receiver cache,
2. copies the block into the PEsender local memory and into PrCsender through a
cache-to-cache write request communication.
According to 2, in the home send operation, the read operations on the ACK value are
performed locally on PrCsender. Moreover, write operations are also executed on
PrCsender and, according to 1, do not invalidate the receiver cache line(s). Therefore,
we have
$$T_{setup} \sim (1 + p_{wait})\, L_{read}(E, -, -)$$
for the wait condition.
The copy phase also has a very low latency: writes are executed locally and do not
perform invalidations, resulting in
$$T_{transm} \sim L_{write}(E, -, -)$$
Therefore, we can evaluate the send operation latency in the message-passing solution
as
$$
\begin{aligned}
T^{MP}_{h-send} &= T_{setup} + N_{lines}\, T_{transm} \\
&\sim (1 + p_{wait})\, L_{read}(E, -, -) + N_{lines}\, L_{write}(E, -, -)
\end{aligned}
\tag{7.3}
$$
For the pointer-passing solution, instead, we have
$$
\begin{aligned}
T^{PP}_{h-send} &= T_{setup} \\
&\sim (1 + p_{wait})\, L_{read}(E, -, -)
\end{aligned}
\tag{7.4}
$$
In the message-passing solution, the non-home receiver will read the modified VTG
from PrCsender in the compute phase. Once used, the corresponding cache line(s) must
be de-allocated from PrCreceiver. No copy is transmitted to the sender and the sender
does not have to invalidate such blocks. For the pointer-passing solution, the same
considerations apply to the msg value, which is in the worst case still in PrCsender.
The setup latency is also valid for the receive operation, with similar considerations
for the RDY value, both in message-passing and pointer-passing solutions, resulting
in
$$
\begin{aligned}
T^{MP}_{h-receive} = T^{PP}_{h-receive} &= T_{setup} \\
&\sim (1 + p_{wait})\, L_{read}(E, -, -)
\end{aligned}
\tag{7.5}
$$
Send/Receive for the non-home node. When a non-home sender (receiver)
reads ACK (RDY) in VTG, if it finds ACK = 0 (RDY = 0) the corresponding cache
line must be de-allocated and the request is repeated until ACK = 1 (RDY =
1). That is, the wait condition corresponds to read operations on the PrCreceiver
(PrCsender). This avoids unnecessary invalidations by the home node during the
synchronization phase, in order to minimize the contention on non-home nodes.
However, this feature is paid for with repeated read operations on PEhome. This
drawback can be partially alleviated by using a periodic retry technique. The setup
cost of the send operation is
$$T_{setup-send} \sim (1 + p_{wait})\, L_{read}(I, -, M)$$
Concerning the send operation, in the message-passing implementation the message
copy is executed by Nlines flush operations. Therefore, we have
$$T_{transm} \sim L_{writeC2C}(M, -, -)$$
For short messages, only the first VTG cache line is used, resulting in
$$T_{transm} \sim 0$$
Therefore, in the general case we have
$$
\begin{aligned}
T^{MP}_{nh-send} &= T_{setup-send} + N_{lines}\, T_{transm} \\
&\sim (1 + p_{wait})\, L_{read}(I, -, M) + N_{lines}\, L_{writeC2C}(M, -, -)
\end{aligned}
\tag{7.6}
$$
In a passing-pointer implementation, the same operation can be applied to the
msg value, which, if modified, can be flushed to the PrCreceiver, resulting in a send
latency almost equivalent to that of the message-passing solution.
$$
\begin{aligned}
T^{PP}_{nh-send} &= T_{setup-send} + N_{lines}\, T_{transm} \\
&\sim (1 + p_{wait})\, L_{read}(I, -, M) + N_{lines}\, L_{writeC2C}(M, -, -)
\end{aligned}
\tag{7.7}
$$
Regarding the receive operation, as said before, in the receive-set_ack (receive
in pointer-passing implementation) operation the receiver reads the first modified
cache line of VTG from the home node and updates RDY and ACK through a flush.
Therefore, we have
$$
\begin{aligned}
T^{MP}_{nh-receive} = T^{PP}_{nh-receive} &= T_{setup-receive} \\
&\sim (1 + p_{wait})\, L_{read}(I, -, M) + L_{writeC2C}(M, -, -)
\end{aligned}
\tag{7.8}
$$
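To make the asymmetry between home and non-home nodes concrete, the sketch below evaluates formulas (7.3), (7.6) and (7.8) side by side. Function names and all latency values are hypothetical placeholders chosen only to show the structure of the comparison; they are not measurements.

```python
def t_home_send_mp(p_wait: float, n_lines: int,
                   l_read_local: float, l_write_local: float) -> float:
    """(7.3): home sender, reads and writes served by the local PrC."""
    return (1 + p_wait) * l_read_local + n_lines * l_write_local


def t_nonhome_send_mp(p_wait: float, n_lines: int,
                      l_read_remote: float, l_write_c2c: float) -> float:
    """(7.6): non-home sender, remote read of ACK plus n_lines flushes (C2C writes)."""
    return (1 + p_wait) * l_read_remote + n_lines * l_write_c2c


def t_nonhome_receive(p_wait: float, l_read_remote: float,
                      l_write_c2c: float) -> float:
    """(7.8): non-home receiver, remote read of RDY plus one flush of the first line."""
    return (1 + p_wait) * l_read_remote + l_write_c2c


if __name__ == "__main__":
    # Assumed latencies in clock cycles: local PrC access 10/5, remote read 100, C2C write 60.
    p_wait, n_lines = 0.5, 4
    print("home send     :", t_home_send_mp(p_wait, n_lines, 10, 5))
    print("non-home send :", t_nonhome_send_mp(p_wait, n_lines, 100, 60))
    print("non-home recv :", t_nonhome_receive(p_wait, 100, 60))
```

Under these assumed values the home node pays only local accesses, while the non-home node pays the remote read and the flushes; this is the base-latency price of the reduced contention discussed next.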
Considerations about the under-load latencies
A first analysis shows that all solutions have comparable setup latency, except the
home version, which performs operations on the local PrC. The same holds for the
message copy latency in the message-passing solutions, where the home version still
shows the minimum latency.
In message-passing solutions, the setup latency is negligible for long messages
(impact on send) and for relatively coarse-grained calculations (impact on receive),
and the message copy latency is negligible for short messages.
As we said, the home-flush solution is characterized by low-p mapping, but the reduced
contention delays are partially paid for with the flush latencies of non-home nodes for
message copy (message flush in pointer-passing solution) and during the receive-set_ack
(receive in pointer-passing solution) and compute phases. However, in structured
parallel programs, non-home nodes often perform send/receive through nondeterministic
commands (e.g. farm emitter/collector), so busy waiting is not applied
to a single VTG, which partially compensates the latency penalty.
Moreover, in the basic invalidation solution, the compute phase finds the VTG value
(and possibly the msg value in a pointer-passing solution) in PrCsender (unless it is
very large). This base latency saving is paid with greater contention, which must
be carefully evaluated case by case.
Finally, in the basic invalidation solution, the careful choice of home node (if pro-
vided) can lead to further latency reduction (some of which has been evaluated in
the home-flush solution): reads and writes executed by the home node are not affected
by the cache coherence overhead.
7.2.3 Rdy-Ack Based on Inter-processor Communications
Let us now introduce an alternative rdy-ack solution, based on interprocessor com-
munications.
In this solution synchronization is greatly simplified and it is more efficient, since it
is implemented by interprocessor communications and private data structures only.
That is, no shared variables are used for RDY and ACK values. In a message-
passing solution shared memory is used for target variable values only: any cache
coherence approach can be used, and the home-flush technique proves quite natural.
In a pointer-passing solution also the reference copy can be implemented by
interprocessor communications and any cache coherence approach can be applied to
the msg value. For this reason, this Rdy-Ack solution is suitable for non-automatic
cache coherence too.
Consider the basic case of a symmetric communication channel with asynchrony
degree k = 1 in an exclusive mapping architecture.
In the message-passing solution, only the vtg value is shared (msg value in the
passing-pointer solution), without any additional shared information for RDY and
ACK which are implemented through interprocessor communications and private
data structures.
All the RDY-ACK synchronization is done according to a wait-notify scheme. In
the message-passing solution, vtg is write-only for the sender, while it is used by
the receiver in the compute phase only. The send message copy is performed by
explicit flush of each vtg value cache line. As discussed in Chapter 5, in order to
guarantee memory ordering in the sequence message copy, notify(RDY), either all
flushes are synchronous or a Memory Barrier is inserted before notify(RDY).
In the pointer-passing solution, the notify ready actually corresponds to both the
ready event and the send of the msg reference to the receiver. Also in this case,
explicit flush of each msg value cache line can be performed before the notify(RDY)
ensuring the right memory ordering when necessary.
Handling multiple communications
Let us now extend this initial case to the possibility of having a generic number
of communication channels per process/thread and any asynchrony degree for each
channel.
Consider a parallel program and the set of the communication channels used by each
module. We can associate a unique identifier to each channel CH1, ..., CHnch. For
each process, a private Channel Table TAB_CH, indexed by CH identifiers, is provided.
Notably, TAB_CH[CHi] is the pointer to the corresponding channel structure, which
is represented in Figure 7.5 for the message-passing solution.
For each channel CHi, in addition to the ki shared VTG instances and to the private
pointer structures CH for the sender and the receiver, two private data structures
EVENT_ACK and EVENT_RDY are provided for sender and receiver respectively. Each of
these event structures has ki entries with binary values in {0, 1}. Each entry plays
the role of the ACK or RDY flag of the corresponding VTG instance in the wait-notify scheme.
As in the shared-memory Rdy-Ack solution, VTG instances are used in a round-robin
fashion, and for each instance the same wait-notify synchronization technique of the
basic case is applied. The complete definition of the send and receive-set_ack
operations for the Rdy-Ack communication based on interprocessor communications is
shown in Figure 7.6, where index is the current index value of CH. The definition
and implementation of the wait and notify operations are described in the following
and summarized in the pseudo-code in Figure 7.7. Each interprocessor communication
consists in communicating, through the notify operation, the pair (CHi, index) in
order to perform the corresponding synchronization on the (RDYindex, ACKindex) of
TAB_CH[CHi].
Figure 7.5: Rdy-Ack channel structure for the message-passing solution based on
interprocessor communications
(Diagram: the SENDER and the RECEIVER each hold a private CH structure with entries
0, 1, ..., k-1 and an index; the sender holds the private ACK events 0, 1, ..., k-1 and
the receiver the private RDY events 0, 1, ..., k-1; the SHARED area holds VTG[0], ...,
VTG[k-1], each containing only VALUE; the INTERPROCESSOR COMMUNICATION FACILITY
carries the ch_ID and index of each notification.)
send (ra_io, msg) {
 wait(ra_io, ack, index);        // block until VTG[index] has been acknowledged
 <copy msg_value in vtg_value>
 notify(ra_io, rdy, index);      // signal RDY for VTG[index] to the receiver
 index = (index + 1) % k;        // next VTG, round-robin
}
receive (ra_io, data_ref) {
 wait(ra_io, rdy, index);        // block until VTG[index] holds a new message
 <copy vtg_ref in data_ref>      // zero-copy: only the reference is returned
}
<Compute phase>
set_ack (ra_io) {
 notify(ra_io, ack, index);      // VTG[index] can be reused by the sender
 index = (index + 1) % k;        // next VTG, round-robin
}
Figure 7.6: Pseudo-code of send, receive and set_ack operations for Rdy-Ack
channel structure for the message-passing solution based on interprocessor
communications
wait(ra_io, event, index) {
 if (ra_io->event[index] == 1)
  ra_io->event[index] = 0;           // event already buffered: consume it
 else {
  // receive interprocessor communications (filling ch_id, ev, id)
  // until the awaited event arrives
  while (receive_IP_comm(ch_id, ev, id)) {
   if (ev == event && ch_id == ra_io->ch_id && id == index)
    break;                           // awaited event: done
   tab_ch[ch_id]->ev[id] = 1;        // event of another channel or index: buffer it
  }
 }
}
notify(ra_io, event, index) {
 send_IP_comm(ra_io->PE, ra_io->ch_id, event, index);
}
interrupt(ch_id, event, index) {
 tab_ch[ch_id]->event[index] = 1;    // record the event for a later wait
}
Figure 7.7: Pseudo-code of wait, notify and interrupt-handler operations for
Rdy-Ack channel structure for the message-passing solution based on interprocessor
communications
Figure 7.8: Rdy-Ack channel structure for the pointer-passing solution based on
interprocessor communications
(Diagram: the SENDER and the RECEIVER with their private index, CH entries 0, 1, ...,
k-1, ACK events and RDY events; the INTERPROCESSOR COMMUNICATION FACILITY carries
ch_ID, index and msg_ref.)
The reception of this message is treated as an interrupt: a run-time support
interrupt-handler is called each time a new message arrives. The message is inspected
to determine the corresponding communication channel and the corresponding event
is set (i.e., TAB_CH[CHi]→EVENT[index] = 1).
The wait operation checks the corresponding event and, if RDY or ACK is not 1,
waits for new interprocessor communications, also buffering other possible events
received for other channels used by the same process. For the pointer-passing solution,
we can use similar data structures and algorithms, as summarized in Figure 7.8. In
this case, no VTG is shared; when a new interprocessor communication is received
by the receiver, the communicated message pointer is copied into a local private buffer
associated with the CH of the receiver in the corresponding TAB_CH[CHi].
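The buffering behaviour of wait can be illustrated with a small single-threaded model. Everything here is a hypothetical stand-in: a `deque` plays the role of the interprocessor communication facility, `Process`, `notify` and `wait` are illustrative names, each process keeps both event arrays for simplicity, and `wait` returns False at the point where a real run-time support would keep busy-waiting.

```python
from collections import deque


class Process:
    """A process with a private channel table and an inbox of notifications."""

    def __init__(self):
        self.inbox = deque()   # stand-in for the interprocessor communication facility
        self.tab_ch = {}       # ch_id -> {"ack": [...], "rdy": [...]} private event flags

    def add_channel(self, ch_id, k, ack_init=1):
        # ACK events start at 1 (all VTGs free), RDY events at 0 (assumed initial state).
        self.tab_ch[ch_id] = {"ack": [ack_init] * k, "rdy": [0] * k}


def notify(dest, ch_id, event, index):
    """Asynchronous notification: enqueue (ch_id, event, index) for the partner."""
    dest.inbox.append((ch_id, event, index))


def wait(proc, ch_id, event, index):
    """Consume the awaited event, buffering events received for other channels."""
    flags = proc.tab_ch[ch_id][event]
    if flags[index] == 1:                    # already notified and buffered
        flags[index] = 0
        return True
    while proc.inbox:
        m_ch, m_ev, m_idx = proc.inbox.popleft()
        if (m_ch, m_ev, m_idx) == (ch_id, event, index):
            return True                      # awaited event, consumed directly
        proc.tab_ch[m_ch][m_ev][m_idx] = 1   # different event: buffer it
    return False                             # a real run-time would keep waiting here


if __name__ == "__main__":
    receiver = Process()
    receiver.add_channel(1, k=2)
    receiver.add_channel(2, k=2)
    notify(receiver, 2, "rdy", 0)            # arrives first, but for another channel
    notify(receiver, 1, "rdy", 0)
    print(wait(receiver, 1, "rdy", 0), receiver.tab_ch[2]["rdy"])
```

The notification for channel 2 is not lost while the process waits on channel 1: it is recorded in the private event table and found immediately by a later wait on that channel.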
Cost model for rdy-ack based on interprocessor communications
In the message-passing solution the setup phase corresponds to the execution of the
wait and notify operations. All private data structures are reused in the local
PrC, possibly in the higher level.
The most likely situation in the wait execution is that the event (RDY/ACK) has
already been set in CH; otherwise, if receive_IP_comm is executed, it is likely that
the received message is consistent with the awaited condition. The reception of a
different event is rare, and its probability drops rapidly with the number of loop
iterations.
The notify operation has a negligible impact too, because the latency of the
asynchronous interprocessor communication is overlapped.
Concerning VTG, it is used in write-only mode in the send operation, so we can
force a non-allocation policy in order to avoid the read operation latency. VTG is not
accessed at all in the receive operation; it is used only in the compute phase.
Finally, in the setup phase we need to consider the contribution of the
interrupt-handler, which, as discussed in Section 7.1, in architectures with an efficient
interprocessor communication system, can be comparable to the PrC access latency.
Therefore we have for both send and receive-set_ack operations
$$T_{setup} \sim L_{read}(M/E/S, -, -)$$
In old-style machines which implement interprocessor communications in kernel
mode, the interrupt-handler has a substantial cost, resulting in a larger Tsetup.
In this case, this rdy-ack approach may not prove convenient.
With a non-allocation policy, and depending on the automatic or non-automatic
implementation, the message copy phase is paid either during the send operation, in
the case of the home-flush performed by the sender, or during the compute phase, by
reading from the GC, if no cache-to-cache write requests are used. Again, we can
assume that the message is present in PrCsender. Therefore, for solutions
which adopt the home-flush technique we have the following Ttransm latency
$$T_{transm} \sim L_{writeC2C}(M, -, -)$$
while, for non-automatic cache coherence systems in which cache-to-cache write
requests are not supported, we have
$$T_{transm} \sim L_{write}(I, M, -)$$
which is paid during the compute phase.
Analogous considerations can be made for the pointer-passing solution. Notably,
the cache-to-cache write requests can be exploited when possible for the msg value.
Therefore, in the general case we have
$$
\begin{aligned}
T^{IPcomm}_{send} &= T_{setup} + N_{lines}\, T_{transm} \\
&\sim L_{read}(M/E/S, -, -) + N_{lines}\, L_{writeC2C}(M, -, -)
\end{aligned}
\tag{7.9}
$$
$$T^{IPcomm}_{receive} = T_{setup} \sim L_{read}(M/E/S, -, -) \tag{7.10}$$
In general, for each specific architecture and parallel program, we are able to es-
timate Tsetup and Ttransm accurately. This rdy-ack solution shows a very simple
and efficient way of implementing synchronization mechanisms and exemplifies the
potential simplifications and optimizations achievable with the non-automatic, or
flush-based automatic, cache coherence solutions. Potentially, this run-time sup-
port is characterized by a base communication latency which is comparable to, or
even lower than, the shared memory rdy-ack version, provided that the interpro-
cessor communication system is implemented in an efficient way (like in modern
CMP-based architectures). For this reason, this approach is particularly suitable
for fine-grained computations too.
Considerations about the under-load latencies
Avoiding shared variables for the RDY and ACK values during the send and
receive operations, and applying optimizations (e.g., flush strategies) to
the accesses to vtg and msg values, prevents some critical situations of contention.
Therefore, this solution is able to achieve low-p mappings using the general principles
of Chapter 6 and the optimizations of Section 7.1.
On the other hand, we need to study the impact of RDY and ACK notifications.
All PEs exchange interprocessor messages in order to perform synchronizations. For
example, in a farm the Emitter and Collector receive a number of notifications
which is equal to the number nw of workers (analogously for data parallel INPUT
and OUTPUT modules). We cannot model this situation using the classical client-
server model with request-reply behaviour. In fact, RDY/ACK notifications are
asynchronous. In this case the system is modeled as an acyclic graph (Section 6),
so some nodes might potentially become bottlenecks.
The critical parameter is now the service time of nodes for serving asynchronous
notification requests, which is determined by the interrupt-handler operation
service time (or the equivalent computation in the wait operation).
As we already said, this service time is very low in modern CMP-based systems. As
a consequence, the notification interarrival time to the most stressed nodes, which is
equal to the stream interarrival time (for the Emitter/Input) or to the ideal service
time of the whole parallel paradigm, is actually greater than the notification service
time. This results in a utilization factor of the corresponding module ρ < 1.
Otherwise, the parallel program would in no way be able to achieve the ideal
bandwidth, independently of the notification implementation and evaluation issue.
A final condition is that the interprocessor communication implementation also
provides a sufficient asynchrony degree/buffering capability, in order not to add an
overhead in the queuing delay of the corresponding modules.
In conclusion, asynchronous notifications do not affect system performance on
condition that the architecture provides a low-overhead interprocessor communication
mechanism with a sufficiently large asynchrony degree.
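The utilization condition discussed above can be checked with a one-line computation. The sketch below assumes the usual definition of the utilization factor as the ratio between the notification service time and the notification interarrival time; the function names and the numeric values are illustrative assumptions, not measurements.

```python
def utilization_factor(t_notify_service: float, t_interarrival: float) -> float:
    """rho = notification service time / notification interarrival time."""
    if t_interarrival <= 0:
        raise ValueError("interarrival time must be positive")
    return t_notify_service / t_interarrival


def notifications_are_bottleneck(t_notify_service: float, t_interarrival: float) -> bool:
    """True when the node cannot keep up with the notification stream (rho >= 1)."""
    return utilization_factor(t_notify_service, t_interarrival) >= 1.0


if __name__ == "__main__":
    # Assumed values in clock cycles: a cheap user-level handler versus a costly one,
    # both for a stream interarrival time of 1000 cycles.
    print(utilization_factor(50, 1000), notifications_are_bottleneck(50, 1000))
    print(utilization_factor(2000, 1000), notifications_are_bottleneck(2000, 1000))
```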
7.3 Asymmetric Rdy-Ack Communications
Non-determinism in asymmetric communications is made easier to design, and effi-
cient, in the Rdy-Ack model with exclusive process mapping and busy waiting.
Support for many-to-one communications can be simply achieved with the receiver
process/thread testing the selected channels RDYs in a round-robin fashion until
RDY = 1 is met. Let CH1, ..., CHm be the channel set used for the asymmetric
communication. If CHi is the most recently used channel, the round-robin scan
starts from CH(i+1)mod(m).
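The round-robin scan for many-to-one communications can be sketched as follows. The function name `select_ready` is hypothetical, channels are numbered from 0 rather than from 1, and the function performs a single pass over the m channels, whereas the real receiver would repeat the scan (busy waiting) until some RDY = 1 is met.

```python
from typing import List, Optional


def select_ready(rdy: List[int], last_used: int) -> Optional[int]:
    """Return the first channel with RDY = 1, scanning from (last_used + 1) mod m.

    One full pass over the m channels is made, ending with last_used itself;
    None is returned when no channel is ready.
    """
    m = len(rdy)
    for step in range(1, m + 1):
        i = (last_used + step) % m
        if rdy[i] == 1:
            return i
    return None


if __name__ == "__main__":
    rdy_flags = [1, 0, 1, 0]
    # Channel 0 was the most recently used one, so the scan starts from channel 1.
    print(select_ready(rdy_flags, last_used=0))
```

Starting from the channel after the most recently used one prevents a single fast sender from starving the others; the same scan applied to the ACK flags gives the one-to-many case.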
With the exclusive process/thread mapping, we avoid any global synchronization of
senders, which confirms the truly lock-free nature of this rdy-ack model. With
multiprogrammed mapping, instead, lock-based global synchronization or alternative
approaches based on CAS instructions are necessary [79, 104] for a correct execution
of process low-level scheduling.
Therefore, the latency overhead is relatively small, and it is further reduced if proper
data structures are used to allow testing several RDYs simultaneously.
One-to-many communications can also be efficiently supported and
the implementation is straightforward: ACKs are tested (simultaneously and) in a
round-robin fashion.
For example, an on-demand farm Emitter tests the task channels' ACKs in a round-
robin fashion until ACK = 1 is met. The selected channel is a good candidate for
a load-balanced distribution of the next task. The non-determinism implemented
with this solution avoids explicit availability communications from the workers to the
Emitter. In this way we reduce the communication traffic and, notably, we are able to
reduce contention also in automatic cache coherence approaches without the home-flush
optimization.
One-to-many and many-to-one communications are used for data distribution
and data collection, respectively, in farm and data parallel paradigms. In both
cases, the flush-based communication run-time support is even better exploited. As
introduced in Section 7.2.2, several channels are scanned, thus no retry is performed
on the same channel by non-home node sender (receiver), respectively Emitter and
Input process (Collector and Output process) in farm and data parallel paradigms.
Therefore, we can assume the same costs of the send and receive operations also for
asymmetric communications, with an additional overhead due to testing
multiple RDY/ACK values.
7.4 Implementation and Evaluation on Tilera TilePro64
In this Section we present our experience in the implementation of the rdy-ack
communications as inter-thread interaction mechanisms on the Tilera TilePro64.
Although it is a domain-specific parallel architecture, this architecture represents a
notable example of how advanced architectural structures, such as user-accessible
on-chip interconnection networks and configurable cache coherence protocols, are of
great importance to design lightweight cooperation mechanisms enabling efficient
parallel implementations.
As introduced in Chapter 1, the Tilera TilePro64 is equipped with 64 identical
PEs (called tiles) interconnected by an on-chip network named iMesh. Each link
consists of two 32-bit-wide unidirectional physical links carrying the traffic in both
directions. The iMesh network is composed of five independent 2D meshes each one
carrying a different kind of traffic. Notably, the User Dynamic Network (UDN) sup-
ports the explicit transfer of small messages (up to 128 32-bit words) among tiles
under application programmer control. Each tile has five UDN hardware queues
connected directly to the processor registers. Special assembler instructions are pro-
vided to perform the enqueue/dequeue and the transmission over UDN. The UDN
serves user-land processes or threads, providing a flexible and low latency coopera-
tion mechanism.
The Tilera TilePro64 provides also a flexible cache subsystem named Dynamic
Distributed Cache (DDC) which implements the automatic cache coherence pro-
tocols according to the abstract model [AM2b] described in Chapter 4.
Figure 7.9 summarizes the actions performed by the automatic cache coherence
in read and write operations. This architecture allows fine control of the cache
coherence mechanism, offering the following features:
1. flexible home node selection
2. write-through C2C between the requestor node and the home node
3. disabling of the automatic cache coherence with explicit flush and de-allocation
mechanisms
All these characteristics make Tilera TilePro64 an interesting candidate for the
comparison of the performance of structured parallel applications with the different
run-time supports presented in this Chapter.
Figure 7.9: Tilera TilePro64 automatic cache coherence protocols
(Diagrams: (a) read operation between a Local Node and the home Node, with steps
1 load (x), 2 x, 3 x; (b) write operation among a Local Node, the home Node and a
shared Node, with steps 1 store (x[0]=1), 2 x[0]=1, 3 invalidate (x), 4 inv (x),
5 ack inv (x), 6 ack. Labels in both panels: Chip, Tile, Sw, P, L2+dir, L3, Mem int,
I/O network int.)
First Results on Tilera TilePro64
Before analyzing the implementation of the rdy-ack solutions studied in the previous
Sections, we report the first experiences of our research group in the implementation
of run-time supports for parallel patterns on the Tilera TilePro64 architecture.
In [24], we discussed the porting of the FastFlow [15] framework to this architecture.
Notably, FastFlow is a passing-pointer solution which provides programmers
with predefined and customizable stream parallel paradigms such as task farms and
pipelines. In our porting, we deal both with the implementation of stand-alone
applications and applications executed by using the Tilera TilePro64 as a software-
accelerator. We obtained very interesting results, related to the cache coherence
problems, by encapsulating at the skeleton level three alternative cache coherence
allocation strategies for the task exchanged between the modules of a farm skeleton:
• Hash Home Node (HHN), which corresponds to the default cache coherence
protocol defined by the architecture: a hash function is used to uniformly
distribute home nodes among all the caches. HHN guarantees a uniform usage
of all the caches, although it may increase the network traffic and reduce the
effective amount of cache usable per tile with high parallelism degrees.
• No Home Node (NHN), which disables the automatic cache coherence, result-
ing in incoherent memory pages. Coherency is ensured only when a task is
passed to a different concurrent entity. When the work on the local task is
finished and before sending the task to another concurrent entity, the Fast-
Flow run-time automatically and transparently adds memory flush operations
to enforce cache coherence.
• Fixed Home Node (FHN), which specifically selects, for each task, a PE that
becomes its home node. This strategy aims to remove most of the performance
overhead of the DDC mechanism. This characterization is actually possible for
the farm paradigm, where each task is entirely processed by a single thread.
Although theoretically very promising, the main problem with this policy is
that the destination thread for a specific task is usually
defined late at runtime. This means that, in a pointer passing environment
such as FastFlow, it is usually necessary to copy the task to a new memory
area after the worker is elected, to select the proper home node (thus voiding
all the benefits of pointer passing).
We executed a matrix multiplication A[N][N]xB[N][N] written in FastFlow exploiting
the farm skeleton on a stream of 3200 matrices. Figure 7.10 shows the results for
two test cases: one using matrices of integers with N=64 and the other with N=128.
For each one we tested the three cache coherence strategies supported in the farm
paradigm. With high parallelism degrees we can actually see very different results
depending on the strategy used.
Figure 7.10: FastFlow stream matrix multiplication (AixBi) using different cache
coherence strategies: Hash Home Node (HHN), No Home Node (NHN) and Fixed
Home Node (FHN)
(Plots: (a) 64x64 integer matrices, (b) 128x128 integer matrices; each panel shows the
speedup, from 0 to 60, against nw, from 1 to 53, for HHN, NHN, FHN and the ideal case.)
In the case of 64x64 matrices, each matrix takes 16KB of space, so that the entire
working set of each worker for each task is 48KB (two input matrices plus an output
one). This means that the working set is small enough to fit in the L2 cache of one tile
and therefore the number of memory transfers is minimized. We expect extremely
high scalability, which is in fact verified with the FHN and NHN strategies.
In contrast, the standard cache coherence protocol works very well up to ∼ 20 nodes,
then suddenly stops scaling. This is because, with a large parallelism degree, the L2
of a home node must also contain and manage an updated copy of the cache lines it is
responsible for. Thus, the cache available for each tile is less than required. In this
case, the working set of the algorithm does not fit in the cache and the performance of
the sequential code executed by the workers suddenly decreases. On the other hand,
the NHN implementation is indeed very good, as it is able to obtain results aligned
with the best option for automatic cache coherence.
With larger matrices we expect the working set not to fit in the caches under any of
the policies. Still, it represents an interesting experiment, as we are stressing the
memory and can therefore see whether the coherency protocol helps or aggravates
the situation.
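The working-set arithmetic behind the two test cases can be checked with a short sketch. The helper names are hypothetical, and the cache capacity is left as a parameter because the value to compare against is an assumption of the caller rather than a figure taken from the experiments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes occupied by one n x n matrix of 32-bit integers. */
static size_t matrix_bytes(size_t n)
{
    return n * n * sizeof(int32_t);
}

/* Working set of one worker per task: two input matrices plus one output. */
static size_t working_set_bytes(size_t n)
{
    return 3 * matrix_bytes(n);
}

/* True if the per-task working set fits in a cache of cache_bytes bytes.
 * The capacity is a parameter: any concrete value is an assumption. */
static bool fits_in_cache(size_t n, size_t cache_bytes)
{
    return working_set_bytes(n) <= cache_bytes;
}
```

For n = 64 this gives 16KB per matrix and 48KB per task, the values used in the text; for n = 128 it gives 64KB per matrix and 192KB per task, which no longer fits in an assumed 64KB cache.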
In this test it is possible to observe the benefits of automatic cache coherence
when using the HHN policy: when running sequential programs, or parallel ones with
a small parallelism degree, the sum of all L2 caches may be large enough
to contain the working set of the application, so that the performance can be much
better than with the other two strategies. This also means that the speedup (calculated
with respect to a standard sequential version that uses the HHN strategy) is indeed
an unfavorable metric for the other two modes.
The incoherent policy works surprisingly well: by removing the cache coherency
protocol we reduce the amount of memory requests, or at least the amount of traffic
on the network, ending with far better results with respect to any implementation
that exploits automatic cache coherence.
Rdy-Ack implementations
The results obtained with FastFlow were very promising. The communication mech-
anisms between the modules of the parallel paradigms are ⊥-based, lock-free and wait-
free, where synchronization data and the exchanged value are strictly coupled. This
restriction limits the space of possible optimizations offered by current
CMP-based architectures. On the other hand, with a rdy-ack communication
model we are able to provide:
• specific and possibly various home selection strategies for each data structure
used by the run-time support
• specific home selection strategies for each parallel paradigm
• different synchronization mechanisms according to the grain of the parallel
computation
Moreover, with the rdy-ack solutions we provide alternative message-passing and
pointer-passing solutions, so that the better implementation model can be chosen
according to the performance parameters of the specific application.
Rdy-Ack based on Shared Memory Synchronizations The API provided
by the Tilera Multicore Library (TMC) [35] offers an explicit home node selection
with which the user can choose the PE which will be the home node for a specific
cache line or data structure. As discussed, the cache coherence protocols in this
architecture define the write operations in terms of write through communications
between a generic PrC and the PrChome for the corresponding cache line.
We can exploit these mechanisms to emulate the home-flush techniques in the im-
plementation of the rdy-ack run-time support. Notably, consider the activities per-
formed by the two partners on a rdy-ack communication channel on the different
fields of VTG, which are shown in Table 7.1.
            RDY          ACK          VALUE
Sender      WRITE-ONLY   READ-WRITE   WRITE-ONLY
Receiver    READ-WRITE   WRITE-ONLY   READ-ONLY
Table 7.1: Reading and writing activities on the VTG fields
Our solution consists in partitioning the fields of the VTG across different cache lines:
• one cache line contains only the ACK flag, and its home node is PEsender
• one or more cache lines (depending on the message-passing vs passing-pointer
solution) contain the RDY flag and the vtg value, and are homed on PEreceiver
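The partitioning above can be illustrated with a layout sketch for the passing-pointer case. The type name `vtg_slot`, the helper `line_of` and the 64-byte line size are assumptions made for illustration; the home assignment itself is platform-specific (on the TilePro64 it would go through the TMC library) and is deliberately not shown.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* One VTG slot for the passing-pointer variant (hypothetical layout). */
typedef struct {
    /* Line 0: ACK only; meant to be homed on the sender's PE. */
    alignas(CACHE_LINE) volatile uint32_t ack;
    /* Line 1: RDY plus the value; meant to be homed on the receiver's PE. */
    alignas(CACHE_LINE) volatile uint32_t rdy;
    void *value;
} vtg_slot;

/* Index of the cache line (relative to the slot start) holding an offset. */
static size_t line_of(size_t offset)
{
    return offset / CACHE_LINE;
}
```

The alignment specifiers guarantee that ACK never shares a line with RDY or the value, so each side writes only to lines homed on its partner. If a platform assigns homes at a coarser granularity than a cache line, the two parts would have to live in separately allocated regions instead of a single struct.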
Figure 7.11: Rdy-Ack implementation based on Shared Memory synchronizations
with home-flush optimization on Tilera TilePro64
Figure 7.11 summarizes this solution. During the send operation, the RDY flag
and the vtg value are transmitted directly from PEsender to PEreceiver, which is the
home node of that cache line(s) owning the updated copy. The opposite behavior
occurs during the execution of the receive, when the new value of the ACK flag is
transmitted directly from PEreceiver to PEsender.
As in the home-flush techniques, we want to avoid invalidation messages. Since on
this architecture de-allocation is not possible without the side effect of also
de-allocating the line from the PrC of the home node, we rely instead on the
no-allocation policy. Notably, the write operations on the write-only fields of VTG
can be performed with the no-allocation policy available on this architecture.
In this way, the sender and the receiver do not need
to transfer their write-only part of VTG into their PrC.
The run-time support for asymmetric communications (i.e., many-to-one and
one-to-many) is straightforward, according to the definition of the asymmetric rdy-
ack communication described in Section 7.3. Figure 7.12 summarizes the data struc-
tures used in both types of asymmetric communications.
Finally, send and receive operations are defined in order to ensure the correct-
ness of the protocol, by means of memory fence instructions, since Tilera TilePro64
adopts a weak memory consistency model.
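The role of the fences can be sketched with a simplified, single-slot (k = 1) passing-pointer channel. This is a minimal illustration, not the run-time support described here: it uses portable C11 atomics (release/acquire ordering) in place of the TilePro64 fence instructions, the names `ra_channel`, `ra_try_send` and `ra_try_receive` are hypothetical, and the operations are written as non-blocking attempts so that they can be exercised from a single thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_bool rdy; /* set by the sender, reset by the receiver */
    atomic_bool ack; /* set by the receiver, reset by the sender */
    void *value;     /* written by the sender, read by the receiver */
} ra_channel;

/* The channel starts empty: ACK is set, so the sender may write. */
static void ra_init(ra_channel *ch)
{
    atomic_init(&ch->rdy, false);
    atomic_init(&ch->ack, true);
    ch->value = NULL;
}

/* Returns false if the previous message has not been acknowledged yet. */
static bool ra_try_send(ra_channel *ch, void *msg)
{
    if (!atomic_load_explicit(&ch->ack, memory_order_acquire))
        return false;
    atomic_store_explicit(&ch->ack, false, memory_order_relaxed);
    ch->value = msg;
    /* Release: the value must be visible before RDY is observed as set. */
    atomic_store_explicit(&ch->rdy, true, memory_order_release);
    return true;
}

/* Returns false if no message is ready. */
static bool ra_try_receive(ra_channel *ch, void **msg)
{
    if (!atomic_load_explicit(&ch->rdy, memory_order_acquire))
        return false;
    *msg = ch->value;
    atomic_store_explicit(&ch->rdy, false, memory_order_relaxed);
    /* Release: the read of the value must complete before ACK is set. */
    atomic_store_explicit(&ch->ack, true, memory_order_release);
    return true;
}
```

The access pattern matches Table 7.1: the sender only writes RDY and the value and reads and writes ACK, while the receiver reads and writes RDY, only writes ACK and only reads the value. A blocking send or receive would simply retry the corresponding attempt until it succeeds.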
Rdy-Ack based on Interprocessor Communications In addition to the fa-
cilities exploited in the shared memory case, this implementation relies on the UDN
on-chip network for the interprocessor communications. Every tile can transmit a
message composed of one header word and the payload by specifying the destina-
tion tile and a tag associated with the message which is used to forward the message
to the corresponding UDN hardware queue of the five available. As described in
Section 7.2.3, we transmit a payload with 2 or 3 words depending on the message-
passing/pointer-passing solution.
Figure 7.12: Asymmetric Rdy-Ack implementation based on Shared Memory syn-
chronizations with home-flush optimization on Tilera TilePro64
Figure 7.13: Rdy-Ack implementation based on interprocessor communications on
Tilera TilePro64: (a) Message-passing, (b) Passing-pointer
Figure 7.13 summarizes the data structures used in both the message-passing and the
passing-pointer solutions. The run-time support for asymmetric communications (i.e.,
many-to-one and one-to-many) is straightforward, according to the definition of the
asymmetric rdy-ack communication described in Section 7.3, with data structures
similar to those of the previous shared memory case.
As in the previous case, send and receive operations are defined in order to
ensure the correctness of the protocol, by means of memory fence instructions, since
Tilera TilePro64 adopts a weak memory consistency model.
Finally, since the buffering space of a UDN queue is 128 32-bit words, we exploit
all the UDN queues for the interprocessor communications in order to provide a
proper asynchrony degree for an entire parallel program.
Figure 7.14: Rdy-Ack symmetric communication latency for passing-pointer solution
evaluated with the ping-pong micro-benchmarks
First results A first approach to the implementation of the rdy-ack communica-
tion model on Tilera TilePro64 has been presented by our research group in [25].
As stated, it was a first step toward the evaluation of the communication model pre-
sented in this Chapter. Notably, this work studies the implementation of rdy-ack
communications based on shared memory and interprocessor communications only
for passing-pointer solutions and for one-to-one and many-to-one communications.
We briefly report some important results that can be useful to the evaluation of the
rdy-ack implementations provided in the following part of this Section.
A set of micro-benchmarks is used to study the communication latency of differ-
ent implementations using the shared memory and the UDN supports. The latency
benchmarks are carried out using a ping-pong scheme. A sender transmits a one-
word message to a receiver executed on a different tile and waits for a reply. The
receiver receives the message and sends back a reply to the
sender. The benchmark consists of many iterations I. The execution time of a pair
of send/receive operations (named Texchange) is measured as the completion time
of the benchmark TC divided by the number of iterations (i.e., Texchange = TC/I).
The communication latency to execute a single communication operation can be
estimated by Lcom ∼ Texchange/2. Figure 7.14 summarizes the results for symmetric
communications, comparing the rdy-ack based on shared memory synchronizations
using automatic cache coherence (ch_sym_sm) and the emulation of the home-flush
technique (ch_sym_cache) with the solution based on the UDN interpro-
cessor communications (ch_sym_udn).
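The two estimates used by the ping-pong benchmark can be written down directly. The function names are hypothetical and the time unit is whatever unit TC is measured in; this is only a sketch of the arithmetic, not the benchmark code.

```c
/* Average time of one send/receive pair: Texchange = TC / I. */
static double t_exchange(double tc, unsigned long iterations)
{
    return tc / (double)iterations;
}

/* Estimated latency of a single communication: Lcom ~ Texchange / 2. */
static double l_com(double tc, unsigned long iterations)
{
    return t_exchange(tc, iterations) / 2.0;
}
```

For instance, an illustrative completion time of 1000 time units over 10 iterations corresponds to Texchange = 100 and Lcom ∼ 50 time units.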
These experiments confirm the evaluation made with the cost model in the pre-
vious Section. The home-flush techniques, even if emulated on this architecture,
improve the latency of the shared memory support. Notably, for write-only cache
lines it is more convenient to use the non-home run-time support for the sender,
while for read-write data used by a single thread the home run-time support is the
best solution to reduce the automatic cache coherence overhead. This optimiza-
tion leads to a 50-65% improvement with respect to the standard automatic cache
coherence solution. More important are the results obtained using the UDN on-chip
network. The corresponding solution considerably outperforms the best shared
memory variant: its communication latency is one third of that of the best shared
memory implementation. For each solution, the communication
latency is influenced by the distance (1/8/14 hops in the figure) between PEsender
and PEreceiver. The longer the distance, the greater the latency, except for the stan-
dard automatic cache coherence solution, in which the effect of the distance between
PEsender and PEreceiver is alleviated by the allocation of PEhome.
Figure 7.15: Rdy-Ack many-to-one communication latency for passing-pointer solu-
tion evaluated with the ping-pong micro-benchmarks: (a) Asymmetric communication
between 2 threads, (b) Overhead of the non-deterministic selection
To evaluate the many-to-one communication mechanism, the same ping-pong bench-
mark is used to measure the additional overhead of the asymmetric mechanism in
communications between two threads. Another benchmark is used to study the
overhead of the non-deterministic selection by changing the number of senders.
Notably, one sender sends messages to the receiver following the ping-pong scheme,
while the other senders are idle. Figure 7.15a summarizes the results for asymmet-
ric communications, where (ch_sym_sm) represents the results for the rdy-ack shared
memory solution with the emulation of the home-flush technique.
In the average case the gain of using the UDN support compared with the shared
memory version is more than 50%. As discussed in [25], with the UDN support
the many-to-one mechanism incurs a penalty of 20% in communication la-
tency with respect to the one-to-one case. The reason is that in the asymmetric
communications (as discussed previously) the UDN message is composed of two 32-
bit words (ch_id, msg_ref), while in the symmetric case the first word is not necessary
because the communication is between a static pair of PEs.
Figure 7.15b shows the important result that the latency offered by the many-to-one
communication does not depend on the number of senders with the UDN support.
In contrast, the shared memory solution has a cost proportional to the number of
senders.
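A plausible way to see why a shared-memory many-to-one receive costs time proportional to the number of senders is that the receiver has to inspect the RDY flags of the input channels one after another. The sketch below assumes a round-robin scan starting from a `next` index; the function name and the `probes` counter are hypothetical, and this is not the run-time support evaluated above.

```c
#include <stdbool.h>

/* Scan n channels round-robin starting from index next and return the
 * first one whose RDY flag is set, or -1 if none is ready.
 * *probes is incremented once per flag inspected. */
static int select_ready(const bool *rdy, int n, int next, int *probes)
{
    for (int k = 0; k < n; k++) {
        int c = (next + k) % n;
        (*probes)++;
        if (rdy[c])
            return c;
    }
    return -1;
}
```

In the worst case all n flags are inspected before a ready channel is found, so the selection cost grows linearly with the number of senders, whereas a hardware queue delivers the identifier of the sending channel together with the message.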
Figure 7.16: A 3-stage pipeline computation example
Implementation of parallel applications Consider the 3-stage pipeline com-
putation represented in Figure 7.16, in which:
• the first stage (S1) represents a module which encapsulates M matrices of
integer values A[N][N] and generates a stream of these matrices. Let Ai be the
generic stream element.
• the second stage (S2) is defined with the pseudo-code in Listing 7.1
• the third stage (S3) receives the results from the previous stage and stores all
the received matrices of integer values B[N][N] into other M matrices.
Although this computation is very generic, an increasing number of emerging ap-
plications use this pattern for different purposes, such as network traffic and sensor
data processing, e-business transaction monitoring and real-time analysis of data
streams from social media.
    int A[N][N], B[N][N];
    ra_comm input_stream, output_stream;
    while (true) {
        receive(input_stream, A);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                B[i][j] = F(A[i][j], ...);
        send(output_stream, B);
    }
Listing 7.1: Pseudo-code of Stage 2 in a 3-stage pipeline computation
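The computational kernel of Listing 7.1 can be made concrete with a small sketch. The communication calls are left out, the matrices are stored in row-major order, and F is assumed to be a unary function of the input element only (the listing elides its remaining arguments); `stage2_compute` and `twice` are hypothetical names used purely for illustration.

```c
#include <stddef.h>

/* Apply the unary function f to every element of the n x n matrix a
 * (row-major) and store the results in b. */
static void stage2_compute(const int *a, int *b, size_t n, int (*f)(int))
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            b[i * n + j] = f(a[i * n + j]);
}

/* Hypothetical stand-in for F. */
static int twice(int x)
{
    return 2 * x;
}
```

Each task performs N^2 applications of F, so the calculation time of the stage grows quadratically with N for a fixed cost of F.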
Depending on the average service time of function F ($T_F$) and on the parameters of the architecture and of the run-time support (i.e., $T_{send}$ and $T_{receive}$), the second stage S2 can be the bottleneck of the 3-stage pipeline. As studied in the previous chapters, in order to eliminate the S2 bottleneck, we can evaluate the optimal parallelism degree of S2 as follows:
\[ n_{opt} = \left\lceil \frac{T_{S2-id}}{T_{A-S2}} \right\rceil \]
where
\[ T_{S2-id} = T_{receive} + T_{calc} + T_{send} \simeq T_{receive} + N^2 \cdot T_F + T_{send} \]
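As a worked illustration of the two formulas above, the following C sketch computes $T_{S2-id}$ and $n_{opt}$ with integer arithmetic. The function names are hypothetical, all times are assumed to be expressed in clock cycles, and $T_{A-S2}$ is read here as the interarrival time at S2.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal service time of S2: T_receive + N^2 * T_F + T_send (clock cycles). */
static uint64_t ideal_service_time(uint64_t t_receive, uint64_t t_f,
                                   uint64_t t_send, uint64_t n) {
    return t_receive + n * n * t_f + t_send;
}

/* Optimal parallelism degree: ceil(T_S2-id / T_A-S2),
   computed with an integer ceiling division. */
static uint64_t optimal_parallelism_degree(uint64_t t_s2_id, uint64_t t_a_s2) {
    assert(t_a_s2 > 0);
    return (t_s2_id + t_a_s2 - 1) / t_a_s2;
}
```

For example, with the purely illustrative values $T_{receive} = 200$, $T_F = 10$, $T_{send} = 300$ and N = 32, the ideal service time is 200 + 1024 · 10 + 300 = 10740 clock cycles; with an interarrival time of 1000 clock cycles this gives $n_{opt} = \lceil 10.74 \rceil = 11$. These numbers are not measurements from this thesis.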
We study the performance of the computation and its parallelization with the farm
paradigm.
In order to compare the different run-time support solutions, we studied different versions of this computation, changing the following parameters:
• N = 32, 64, 128, in order to study the computation with different $T_{calc}$;
• with the same N values, we study the effect of a coarser grain computation; notably, the Tilera TilePro64 does not have floating-point units and floating-point operations are emulated in software; thus, by using single-precision floating-point numbers as data type, we study the effect of a coarser grain computation with the same number of cache misses.
We report in Table 7.2 the most important latencies studied for the corresponding abstract model [AM2b], together with the values obtained with the benchmarks presented in Chapter 4. We use the write latencies without overlapping in order to account for the impact of the memory fence instructions used in the run-time support to solve the memory ordering problem in this WSO architecture.
Table 7.2: Reading and Writing Operations Latencies in Tilera TilePro64

| Read/write operation and CC state of the cache block | Base latency | Benchmark results (clock cycles) |
|---|---|---|
| $L_{read}(M/E/S, -, -)$ | $T_{PrC}$ | 2-8 |
| $L_{read}(I, M/E/S, -)$ | $T_{LC} + L_{net} + T_{GC} + L_{net}(\sigma) + T_{PrC}$ | 40-70 |
| $L_{read}(I, -, M)$ | $T_{LC} + L_{net} + T_{lookup-GC} + L_{net} + T_M + L_{net}(\sigma) + T_{PrC}$ | 160-204 (non-home); 120 (home) |
| $L_{write}(E, -, -)$ | $L_{net} + T_{GC} + L_{net} + T_{PrC}$ | 5-28 (non-home); 2-7 (home) |
| $L_{write}(I, M/E, -)$ | $T_{LC} + L_{net} + T_{GC} + L_{net}(\sigma) + T_{PrC}$ | 45-73 (non-home); 2-7 (home) |
| $L_{write}(I, S(n_{sh}), -)$ | $T_{LC} + L_{net} + T_{lookup-GC} + L_{net}(\sigma) + [L_{inv}(n_{sh})] + T_{PrC}$ | 52-332 (non-home) |
| $L_{write}(S, S(n_{sh}), -)$ | $T_{LC} + L_{net} + T_{lookup-GC} + L_{inv}(n_{sh}) + [L_{net}] + T_{PrC}$ | 12-292 (non-home); 7-285 (home) |
| $L_{write-C2C}(M, -, -)$ | $L_{net}(\sigma) + T_{PrC} + L_{net}$ | 30-60 |

Using these base latencies we are able to evaluate $T_{send}$ and $T_{receive}$ for the various run-time support solutions, considering σ = 16, as studied in Section 7.2. Notably, we have
• for rdy-ack based on shared memory synchronization with automatic cache coherence (ra_sm), varying the $PE_{home}$ allocation, in the best-case scenario where $p_{wait} = 0$
\[ T^{MP}_{send} \simeq (172\text{--}224)\,\tau + N^2 \cdot (3\text{--}5)\,\tau \]
\[ T^{PP}_{send} \simeq [N^2 \cdot (2\text{--}4)\,\tau] + (172\text{--}224)\,\tau \]
\[ T^{MP}_{receive} \simeq T^{PP}_{receive} = (172\text{--}224)\,\tau \]
• for rdy-ack based on shared memory synchronization with the emulation of the home-flush technique (ra_sm_home)
\[ T^{MP}_{h-send} \simeq T^{PP}_{h-send} = (2\text{--}8)\,\tau + N^2 \cdot (0\text{--}1)\,\tau \]
\[ T^{MP}_{h-receive} \simeq T^{PP}_{h-receive} = (2\text{--}8)\,\tau \]
\[ T^{MP}_{nh-send} \simeq T^{PP}_{nh-send} = (160\text{--}204)\,\tau + N^2 \cdot (0\text{--}1)\,\tau \]
\[ T^{MP}_{nh-receive} \simeq T^{PP}_{nh-receive} = (160\text{--}204)\,\tau + N^2 \cdot (0\text{--}1)\,\tau \]
• for rdy-ack based on interprocessor communications (ra_ip), we used the results in [25] to have a good estimation of the $L_{IP}$ latency
\[ T^{IP}_{send} \simeq (50\text{--}129)\,\tau \]
\[ T^{IP}_{receive} \simeq (20\text{--}69)\,\tau \]
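To make the comparison concrete, the send latencies above can be evaluated for a given N. The sketch below reads each term written as a-b τ as a range [a, b] of clock cycles, an interpretation suggested by the ranges reported in Table 7.2; the type `range_t` and all function names are hypothetical, and only three of the message-passing send expressions are shown.

```c
#include <assert.h>
#include <stdint.h>

/* A latency range [lo, hi] in clock cycles (tau). */
typedef struct { uint64_t lo, hi; } range_t;

/* base + N^2 * per_elem, applied to both ends of the range. */
static range_t affine(range_t base, range_t per_elem, uint64_t n) {
    range_t r = { base.lo + n * n * per_elem.lo,
                  base.hi + n * n * per_elem.hi };
    assert(r.lo <= r.hi);
    return r;
}

/* ra_sm, message-passing send: (172-224) + N^2 * (3-5). */
static range_t t_send_ra_sm_mp(uint64_t n) {
    return affine((range_t){172, 224}, (range_t){3, 5}, n);
}

/* ra_sm_home, send in the home case: (2-8) + N^2 * (0-1). */
static range_t t_send_ra_sm_home_h(uint64_t n) {
    return affine((range_t){2, 8}, (range_t){0, 1}, n);
}

/* ra_ip send: (50-129), with no N-dependent term in the expression. */
static range_t t_send_ra_ip(void) {
    return (range_t){50, 129};
}
```

For N = 32 this reading gives 3244-5344 clock cycles for ra_sm and 2-1032 for the home case of ra_sm_home, against 50-129 for ra_ip, which is in line with the advantage that the text reports for fine-grained computations.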
Of course, we need to consider the corresponding under-load latencies in the various cases. Notably, for the ra_sm solution the architecture is modeled as a NUMA, independently of the global organization of nodes and external memories. The NUMA-equivalent characterization derives from the client-server interactions among PEs in performing the cache coherence protocol actions. This solution is not characterized by the low-p mapping strategies: p and, consequently, $R_Q/R_{Q0}$ grow with the parallelism degree ($n_w$) of the farm parallelization (Chapter 6). Notably, $p_E = n_w + 1$, while for the initial pipeline computation p is at most 4 if $PE_{home}$ does not coincide with $PE_{S1}$, $PE_{S2}$, or $PE_{S3}$.
We initially compare the results obtained with integer matrices, varying the dimension N. Figure 7.17 shows the results for the message-passing and passing-pointer solutions with the various run-time supports, for N = 32/64/128.
In general, the finer the grain of the computation, the greater the gain from using the UDN support for interprocessor communications. Notably, for N = 32 with the message-passing implementation, ra_ip achieves an improvement of about 50% with respect to the ra_sm_home solution.
We can also notice the peculiar behavior of the ra_sm solution, especially for N = 64/128, in both implementations. This behavior is due to the problem discussed previously: with a large parallelism degree, the default hashing home strategy used by the automatic cache coherence solution reduces the L2 cache capacity available to each node. Of course, the bigger the matrices, the smaller the parallelism degree at which this phenomenon starts. For N = 64, the problem is somewhat alleviated in the message-passing solution, probably because the working set of each node is composed of the msg value and the vtg value, requiring more accesses to the main memory, which in some cases can be nearer than the home node selected by the default mechanism.
With ra_sm_home this problem is clearly solved by selecting the proper home node. In this case, we can see that the passing-pointer solution does not offer appreciable improvements with respect to the message-passing solution, due to the copy required to select the home node at run time, as discussed before.
Another important consideration concerns the differences between the message-passing and the passing-pointer solutions in the ra_ip case. In the passing-pointer implementation the messages exchanged through the UDN network require an additional word for the message reference with respect to the message-passing solution. This has the side effect of reducing the effective asynchrony degree of the communications at high parallelism degrees. Although all the UDN queues are exploited, we have to use an asynchrony degree k = 1 due to the number of communication channels used by the parallel application. This problem was anticipated and its effect discussed in Section 7.2.3.
As expected, the speedup increases with bigger problem sizes, for which the communication overhead, which influences the emitter’s $T_A$, is a smaller portion of the overall execution time. However, it is still far from the ideal one. The reason is that the parallel efficiency is limited by the available memory bandwidth. In fact, despite the exploitation of the four on-chip memory controllers (MINFs) and of the corresponding memory macro-modules to store the matrices, the memory bandwidth is not sufficient to sustain a high number of working nodes.
To demonstrate this fact, we show in Figure 7.18 the results obtained with coarser grain computations, using single-precision floating-point numbers as data type. In this case, the MINFs are subjected to a lower pressure from the nodes, and the parallelization achieves a better speedup with N = 32 and a near-optimal speedup with N = 64/128. The advantage of using the UDN for interprocessor communications is remarkable with small matrices, while it decreases with larger problem sizes, until it reaches the same performance as the ra_sm_home solution.
These results confirm our ideas about rdy-ack communications based on interprocessor communications: by exploiting such an architectural feature we achieve scalable parallelizations of fine-grained problems.
Moreover, the home-flush technique, even if emulated on this architecture, offers significant advantages and a more predictable behavior with respect to the default automatic cache coherence solution.
7.5 Summary
All the considerations made in the second part of the thesis have been used in this chapter to provide an optimized run-time support for structured parallel applications on advanced CMP-based multiprocessors. With this chapter we contribute to the design and study of the run-time support of parallel paradigms, with optimizations strictly related to the cache coherence problem and to its impact on the performance of parallel applications.
Notably, a complete set of run-time supports has been provided and evaluated, including message-passing and passing-pointer solutions, as well as solutions based on automatic cache coherence or on specific optimizations guided by the cost models derived in the previous chapters. In the final part of this chapter we discussed the implementation of these optimizations on the Tilera TilePro64 processor. This architecture represents a good candidate for the evaluation of our solutions, due to the possibility of implementing or emulating our ideas.
Notably, with this architecture we are able to achieve an improvement of about 50% with respect to the use of the default cache coherence solution.
Figure 7.17: Speedup of the farm computation executed on Tilera TilePro64 with the various run-time supports: integer-valued matrices with N = 32/64/128. Panels: (a) message-passing, N = 32; (b) passing-pointer, N = 32; (c) message-passing, N = 64; (d) passing-pointer, N = 64; (e) message-passing, N = 128; (f) passing-pointer, N = 128.
Figure 7.18: Speedup of the farm computation executed on Tilera TilePro64 with the various run-time supports: float matrices with N = 32/64/128. Panels: (a) message-passing, N = 32; (b) passing-pointer, N = 32; (c) message-passing, N = 64; (d) passing-pointer, N = 64; (e) message-passing, N = 128; (f) passing-pointer, N = 128.
CHAPTER 8
Conclusions
In this thesis we studied performance models and optimizations for CMP-based architectures, with particular attention to the cache coherence problem.
In particular, we focused on the performance prediction of parallel patterns, in order to evaluate the impact of off-the-shelf automatic cache coherence solutions.
To achieve this result, we developed an abstract model for cache-coherent CMPs: a simplified view of a concrete target architecture that describes the essential performance properties and abstracts from all the others. The model provides a first result toward the evaluation of the impact of automatic cache coherence on parallel program performance, by analytically defining the base memory and cache access latencies of read and write operations in terms of the coherency protocol adopted.
Starting from this model and by using performance evaluation techniques, such as queueing networks and process algebras (i.e., PEPA), we are able to estimate the average response time of the various levels of the memory hierarchy for parallel applications defined through well-known parallelism paradigms. This cost model is fundamental in the definition of the run-time support of parallel paradigms, showing for example how a specific mapping strategy can improve performance by minimizing the under-load latencies.
Moreover, the results obtained with the resulting cost model allow us to compare the impact of different cache coherence solutions, e.g., automatic vs algorithm-dependent solutions.
Notably, the latter are designed to provide an optimized run-time support for structured parallel applications. These optimizations result from a combined analysis of the cost model of the specific architecture and of the parallel application implementations.
A Rdy-Ack run-time support for current CMP-based architectures is presented, and its cost model is defined for the message-passing and passing-pointer implementation models, based on standard automatic cache coherence, on optimized cache coherence solutions that use the home-flush technique, and on interconnection communications for synchronization.
These latter solutions represent a lock-free run-time support which provides better performance than solutions based on standard automatic cache coherence, especially for fine-grained parallel computations. Notably, we are able to reduce the number of memory accesses, cache transfers and synchronizations, and to increase computation parallelism with respect to the automatic cache coherence alternative.
Finally, the Tilera TilePro64 architecture has been used to validate the results obtained for the general case with the cost model. This architecture represents a good candidate for the evaluation of our solutions, due to the possibility of implementing or emulating all our ideas in terms of cache coherence optimizations.
Notably, with this architecture, in some cases, we are able to achieve an improvement of about 50% with respect to the use of the default cache coherence solution.
This thesis has to be considered a small, yet very important, part of our long-term project. In particular, the use of performance models is pervasive in our approach, as they are used both at compile time and at run time: first to select the best implementation, and then to drive the adaptation policies [77] of a dynamic run-time system.
With this thesis we demonstrated how a specific problem like cache coherence can be studied and modeled in order to capture its effects on parallel programs’ performance. From both the performance modeling and the optimization points of view, many other aspects may be addressed in the future. In particular, we believe that a further study of the trends in hardware technologies and of the definition and implementation of new parallel paradigms is required.
Bibliography
[1] The PEPA Eclipse plugin. http://www.dcs.ed.ac.uk/pepa/.
[2] A comparative evaluation of hardware-only and software-only directory proto-
cols in shared-memory multiprocessors. J. Syst. Archit., 50:537–561, Septem-
ber 2004.
[3] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, and Mary K. Vernon. Compar-
ison of hardware and software cache coherence schemes. SIGARCH Comput.
Archit. News, 19(3):298–308, 1991.
[4] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, and Mary K. Vernon. Compar-
ison of hardware and software cache coherence schemes. SIGARCH Comput.
Archit. News, 19:298–308, April 1991.
[5] Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David
Kranz, John Kubiatowicz, Beng-Hong Lim, Kenneth Mackenzie, and Donald
Yeung. The MIT Alewife machine: architecture and performance. SIGARCH
Comput. Archit. News, 23:2–13, May 1995.
[6] Anant Agarwal, Richard Simoni, John Hennessy, and Mark Horowitz. An
evaluation of directory schemes for cache coherence. In Proceedings of the
15th Annual International Symposium on Computer Architecture, pages 280–
289, 1988.
[7] Samy Al Bahra. Nonblocking algorithms and scalable multicore programming.
Commun. ACM, 56(7):50–61, July 2013.
[8] M. Aldinucci, M. Danelutto, and P. Teti. An advanced environment supporting structured parallel programming in Java. Future Gener. Comput. Syst.,
19(5):611–626, July 2003.
[9] Marco Aldinucci, Sonia Campa, Marco Danelutto, Peter Kilpatrick, and Massimo Torquati. Targeting distributed systems in FastFlow. In Ioannis Caragiannis, Michael Alexander, Rosa Maria Badia, Mario Cannataro, Alexandru Costan, Marco Danelutto, Frédéric Desprez, Bettina Krammer, Julio Sahuquillo, Stephen L. Scott, and Josef Weidendorfer, editors, Euro-Par 2012: Parallel Processing Workshops, volume 7640 of Lecture Notes in Computer Science, pages 47–56. Springer Berlin Heidelberg, 2013.
[10] Marco Aldinucci, Sonia Campa, Marco Danelutto, and Marco Vanneschi. Behavioural skeletons in GCM: Autonomic management of grid components. pages 54–63, 2008.
[11] Marco Aldinucci, Massimo Coppola, Marco Danelutto, Marco Vanneschi, and
Corrado Zoccolo. ASSIST as a research framework for high-performance grid
programming environments. pages 230–256, 2005.
[12] Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, Massimiliano Meneghin, and Massimo Torquati. An efficient unbounded lock-free queue for multi-core systems. In Christos Kaklamanis, Theodore Papatheodorou, and Paul G. Spirakis, editors, Euro-Par 2012 Parallel Processing, volume 7484 of Lecture Notes in Computer Science, pages 662–673. Springer Berlin Heidelberg, 2012.
[13] Marco Aldinucci, Marco Danelutto, Marco Vanneschi, and Corrado Zoccolo.
ASSIST as a research framework for high-performance grid programming environments, 2004.
[14] Marco Aldinucci, Sergei Gorlatch, Christian Lengauer, and Susanna Pelagatti. Towards parallel programming by transformation: The FAN skeleton framework, 2001.
[15] Marco Aldinucci, Massimo Torquati, and Massimiliano Meneghin. FastFlow:
Efficient parallel streaming applications on multi-core. CoRR, abs/0909.1187,
2009.
[16] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt
Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen,
John Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel
computing landscape. Commun. ACM, 52:56–67, October 2009.
[17] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi. P3L: A structured high-level parallel language, and its structured support. Concurrency: Practice and Experience, 7(3):225–255, 1995.
[18] J.-L. Baer and W.-H. Wang. On the inclusion properties for multi-level cache
hierarchies. In ISCA ’88: Proceedings of the 15th Annual International Sym-
posium on Computer architecture, pages 73–80, Los Alamitos, CA, USA, 1988.
IEEE Computer Society Press.
[19] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay,
M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey,
D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 - processor: A 64-core SoC with mesh
interconnect. In Solid-State Circuits Conference, 2008. ISSCC 2008. Digest
of Technical Papers. IEEE International, pages 88–598, Feb 2008.
[20] C. Bertolli. Fault Tolerance for High-Performance Applications Using Struc-
tured Parallelism Models. PhD thesis, Department of Computer Science, Uni-
versity of Pisa, 2008.
[21] Carlo Bertolli, Daniele Buono, Gabriele Mencagli, and Marco Vanneschi.
Expressing adaptivity and context awareness in the ASSISTANT programming
model. 23:32–47, 2010.
[22] D.P. Bhandarkar. Analysis of memory interference in multiprocessors. Com-
puters, IEEE Transactions on, C-24(9):897–908, Sept 1975.
[23] D. Buono. Optimizations and Cost Models for multi-core architectures: an
approach based on parallel paradigms. PhD thesis, Department of Computer
Science, University of Pisa, 2014.
[24] D. Buono, M. Danelutto, S. Lametti, and M. Torquati. Parallel patterns
for general purpose many-core. In Parallel, Distributed and Network-Based
Processing (PDP), 2013 21st Euromicro International Conference on, pages
131–139, Feb 2013.
[25] D. Buono and G. Mencagli. Run-time mechanisms for fine-grained parallelism on network processors: The TilePro64 experience. In High Performance Computing & Simulation (HPCS), 2014 International Conference on, pages 55–64, July 2014.
[26] David R. Butenhof. Programming with POSIX threads. Addison-Wesley Long-
man Publishing Co., Inc., Boston, MA, USA, 1997.
[27] Liqun Cheng, J.B. Carter, and Donglai Dai. An adaptive cache coherence pro-
tocol optimized for producer-consumer sharing. In High Performance Com-
puter Architecture, 2007. HPCA 2007. IEEE 13th International Symposium
on, pages 328–339, Feb 2007.
[28] Hoichi Cheong. Life span strategy - a compiler-based approach to cache coher-
ence. In Proceedings of the 6th international conference on Supercomputing,
ICS ’92, pages 139–148, New York, NY, USA, 1992. ACM.
[29] Hoichi Cheong and Alexander V. Veidenbaum. Compiler-directed cache management in multiprocessors. Computer, 23:39–47, June 1990.
[30] Lynn Choi and Pen-Chung Yew. A compiler-directed cache coherence scheme
with improved intertask locality. In Proceedings of the 1994 conference on
Supercomputing, Supercomputing ’94, pages 773–782, Los Alamitos, CA, USA,
1994. IEEE Computer Society Press.
[31] Nathan Chong and Samin Ishtiaq. Reasoning about the ARM weakly consistent memory model. In Proceedings of the 2008 ACM SIGPLAN Workshop
on Memory Systems Performance and Correctness: Held in Conjunction with
the Thirteenth International Conference on Architectural Support for Program-
ming Languages and Operating Systems (ASPLOS ’08), MSPC ’08, pages 16–
19, New York, NY, USA, 2008. ACM.
[32] Murray Cole. Bringing skeletons out of the closet: A pragmatic manifesto for
skeletal parallel programming. Parallel Comput., 30(3):389–406, March 2004.
[33] Golang.org Community. The Go programming language website, 2014.
http://golang.org.
[34] Pat Conway, Nathan Kalyanasundharam, Gregg Donley, Kevin Lepak, and
Bill Hughes. Cache hierarchy and memory subsystem of the AMD Opteron
processor. IEEE Micro, 30(2):16–29, March 2010.
[35] Tilera Corporation. Tile Processor User Architecture Manual, 2011.
http://www.tilera.com/scm/docs/UG101-User-Architecture-Reference.pdf.
[36] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik
Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken.
LogP: towards a realistic model of parallel computation. SIGPLAN Not.,
28:1–12, July 1993.
[37] David Culler, J.P. Singh, and Anoop Gupta. Parallel Computer Architecture:
A Hardware/Software Approach. Morgan Kaufmann, 1st edition, 1998. The
Morgan Kaufmann Series in Computer Architecture and Design.
[38] Leonardo Dagum and Ramesh Menon. OpenMP: An industry-standard API
for shared-memory programming. Computing in Science and Engineering,
5:46–55, 1998.
[39] M. Danelutto. Efficient support for skeletons on workstation clusters. Parallel
Processing Letters, 11(1):41–56, 2001.
[40] M. Danelutto and M. Stigliani. SKElib: parallel programming with skeletons in C. In A. Bode, T. Ludwig, W. Karl, and R. Wismüller, editors, Euro-Par 2000 Parallel Processing, number 1900, pages 1175–1184, August/September 2000.
[41] John Darlington, Yi-ke Guo, Hing Wing To, and Jin Yang. Parallel skeletons
for structured composition. SIGPLAN Not., 30(8):19–28, August 1995.
[42] Michel Dubois and Faye A. Briggs. Effects of cache coherency in multiproces-
sors. Computers, IEEE Transactions on, C-31(11):1083–1099, Nov 1982.
[43] S. J. Eggers and R. H. Katz. A characterization of sharing in parallel programs
and its application to coherency protocol evaluation. In Proceedings of the 15th
Annual International Symposium on Computer architecture, ISCA ’88, pages
373–382, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.
[44] Noel Eisley, Li-Shiuan Peh, and Li Shang. In-network cache coherence. In Pro-
ceedings of the 39th Annual IEEE/ACM International Symposium on Microar-
chitecture, MICRO 39, pages 321–332, Washington, DC, USA, 2006. IEEE
Computer Society.
[45] Tarek El-Ghazawi, William Carlson, Thomas Sterling, and Katherine Yelick.
UPC: Distributed Shared-Memory Programming. Wiley-Interscience, 2003.
[46] Bin-feng Qian and Li-min Yan. The research of the inclusive cache used in multi-core processor. In Electronic Packaging Technology & High Density Packaging, 2008. ICEPT-HDP 2008. International Conference on, pages 1–4, 2008.
[47] Steven Fortune and James Wyllie. Parallelism in random access machines.
In Proceedings of the tenth annual ACM symposium on Theory of computing,
STOC ’78, pages 114–118, New York, NY, USA, 1978. ACM.
[48] John Giacomoni, Tipp Moseley, and Manish Vachharajani. FastForward for
efficient pipeline parallelism: A cache-optimized concurrent lock-free queue. In
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Prac-
tice of Parallel Programming, PPoPP ’08, pages 43–52, New York, NY, USA,
2008. ACM.
[49] S. Gorlatch, C. Wedler, and C. Lengauer. Optimization rules for program-
ming with collective operations. In Parallel Processing, 1999. 13th Interna-
tional and 10th Symposium on Parallel and Distributed Processing, 1999. 1999
IPPS/SPDP. Proceedings, pages 492–499, Apr 1999.
[50] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. Introduction to the wire-speed processor and architecture. Technical report, IBM J. Res. Dev., 2010.
[51] Daniel Hackenberg, Daniel Molka, and Wolfgang E. Nagel. Comparing cache
architectures and coherency protocols on x86-64 multicore SMP systems. In
MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Sympo-
sium on Microarchitecture, pages 413–422, New York, NY, USA, 2009. ACM.
[52] Daniel Hackenberg, Daniel Molka, and Wolfgang E. Nagel. Comparing cache
architectures and coherency protocols on x86-64 multicore SMP systems. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 413–422, New York, NY, USA, 2009. ACM.
[53] Mark Heinrich, Jeffrey Kuskin, David Ofelt, John Heinlein, Joel Baxter,
Jaswinder Pal Singh, Richard Simoni, Kourosh Gharachorloo, David Nakahira,
Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The
performance impact of flexibility in the Stanford FLASH multiprocessor. SIGOPS
Oper. Syst. Rev., 28:274–285, November 1994.
[54] Mark D. Hill, James R. Larus, Steven K. Reinhardt, and David A. Wood. Co-
operative shared memory: software and hardware for scalable multiprocessor.
SIGPLAN Not., 27:262–273, September 1992.
[55] Jane Hillston. A Compositional Approach to Performance Modelling. Cam-
bridge University Press, New York, NY, USA, 1996.
[56] Henry Hoffmann, David Wentzlaff, and Anant Agarwal. Remote store pro-
gramming: A memory model for embedded multicore. In Proceedings of the
5th International Conference on High Performance Embedded Architectures
and Compilers, HiPEAC’10, pages 3–17, Berlin, Heidelberg, 2010. Springer-
Verlag.
[57] Weiwu Hu, Weisong Shi, and Zhimin Tang. JIAJIA: A software DSM system
based on a new cache coherence protocol. In Peter Sloot, Marian Bubak,
Alfons Hoekstra, and Bob Hertzberger, editors, High-Performance Computing
and Networking, volume 1593 of Lecture Notes in Computer Science, pages
461–472. Springer Berlin / Heidelberg, 1999. 10.1007/BFb0100607.
[58] Roman Iakymchuk and Paolo Bientinesi. Modeling performance through
memory-stalls. SIGMETRICS Perform. Eval. Rev., 40(2):86–91, October
2012.
[59] Cavium Inc. Octeon, 2010. http://www.cavium.com/OCTEON_MIPS64.html.
[60] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual, vol. 3A: System Programming Guide, Part 1. Number 253668-024US. Aug 2007.
[61] Engin Ipek, Bronis R. de Supinski, Martin Schulz, and Sally A. McKee. An
approach to performance prediction for parallel applications. In Proceedings
of the 11th International Euro-Par Conference on Parallel Processing, Euro-
Par’05, pages 196–205, Berlin, Heidelberg, 2005. Springer-Verlag.
[62] J. Jeffers and J. Reinders. Intel Xeon Phi Coprocessor High Performance Programming. Newnes, 2013.
[63] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and
D. Shippy. Introduction to the Cell multiprocessor. IBM J. Res. Dev.,
49(4/5):589–604, July 2005.
[64] Ron Kalla, Balaram Sinharoy, William J. Starke, and Michael Floyd. POWER7: IBM’s next-generation server processor. IEEE Micro, 30(2):7–15, March 2010.
[65] M. Kistler, M. Perrone, and F. Petrini. Cell multiprocessor communication
network: Built for speed. Micro, IEEE, 26(3):10–23, May 2006.
[66] Leonard Kleinrock. Queueing Systems, volume I: Theory. Wiley Interscience,
1975. (Published in Russian, 1979. Published in Japanese, 1979. Published in
Hungarian, 1979. Published in Italian 1992.).
[67] George Kurian, Jason E. Miller, James Psota, Jonathan Eastep, Jifeng Liu,
Jurgen Michel, Lionel C. Kimerling, and Anant Agarwal. Atac: A 1000-
core cache-coherent processor with on-chip optical network. In Proceedings of
the 19th International Conference on Parallel Architectures and Compilation
Techniques, PACT ’10, pages 477–488, New York, NY, USA, 2010. ACM.
[68] Edya Ladan-Mozes and Nir Shavit. An optimistic approach to lock-free FIFO
queues. In Rachid Guerraoui, editor, Distributed Computing, volume 3274 of
Lecture Notes in Computer Science, pages 117–131. Springer Berlin Heidel-
berg, 2004.
[69] Leslie Lamport. Specifying concurrent program modules. ACM Trans. Pro-
gram. Lang. Syst., 5(2):190–222, April 1983.
[70] Edward D. Lazowska, John Zahorjan, G. Scott Graham, and Kenneth C.
Sevcik. Quantitative System Performance: Computer System Analysis Using
Queueing Network Models. Prentice-Hall, Inc., Upper Saddle River, NJ, USA,
1984.
[71] Stephen Lewin-Berlin. Exploiting multicore systems with Cilk. In Proceedings
of the 4th International Workshop on Parallel and Symbolic Computation,
PASCO ’10, pages 18–19, New York, NY, USA, 2010. ACM.
[72] M. Leyton and J.M. Piquer. Skandium: Multi-core programming with al-
gorithmic skeletons. In Parallel, Distributed and Network-Based Processing
(PDP), 2010 18th Euromicro International Conference on, pages 289–296,
Feb 2010.
[73] T.G. Mattson, R.F. Van der Wijngaart, M. Riepen, T. Lehnig, P. Brett,
W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe.
The 48-core SCC processor: the programmer’s view. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–11, Nov 2010.
[74] Edward M. McCreight. The Dragon Computer System: An Early Overview.
Technical report, Xerox Corporation, Palo Alto Research Center, Palo Alto,
Ca., 94304, December 7, 1984.
[75] Paul E. McKenney. Selecting locking primitives for parallel programming.
Commun. ACM, 39(10):75–82, October 1996.
[76] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable syn-
chronization on shared-memory multiprocessors. ACM Trans. Comput. Syst.,
9(1):21–65, February 1991.
[77] G. Mencagli. A Control-Theoretic Methodology for Adaptive Structured Par-
allel Computations. PhD thesis, Department of Computer Science, University
of Pisa, 2012.
[78] M. Meneghin. An Optimization Theory for Structured Stencil-based Parallel
Applications. PhD thesis, Department of Computer Science, University of
Pisa, 2010.
[79] Maged M. Michael and Michael L. Scott. Non-blocking algorithms and
preemption-safe locking on multiprogrammed shared memory multiprocessors.
Journal of Parallel and Distributed Computing, 51:1–26,
1998.
[80] A. Moga and M. Dubois. A comparative evaluation of hybrid distributed
shared-memory systems. J. Syst. Archit., 55:43–52, January 2009.
[81] Mark Moir, Daniel Nussbaum, Ori Shalev, and Nir Shavit. Using elimination to implement scalable and lock-free FIFO queues. In Proceedings of the
Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Ar-
chitectures, SPAA ’05, pages 253–262, New York, NY, USA, 2005. ACM.
[82] NetLogic. The NetLogic XLP processor family. http://www.netlogicmicro.com/Products/MultiCore/XLP.asp.
[83] Chris J. Newburn, Byoungro So, Zhenying Liu, Michael McCool, Anwar Ghuloum, Stefanus Du Toit, Zhi Gang Wang, Zhao Hui Du, Yongjian Chen, Gansha Wu, Peng Guo, Zhanglin Liu, and Dan Zhang. Intel’s Array Building Blocks:
A retargetable, dynamic compiler and embedded language. Code Generation
and Optimization, IEEE/ACM International Symposium on, 0:224–235, 2011.
[84] M. F. P. O’Boyle, R. W. Ford, and E. A. Stohr. Towards general and exact dis-
tributed invalidation. J. Parallel Distrib. Comput., 63:1123–1137, November
2003.
[85] Scott Owens, Susmit Sarkar, and Peter Sewell. A better x86 memory model:
x86-TSO. In Proceedings of the 22nd International Conference on Theorem
Proving in Higher Order Logics, TPHOLs ’09, pages 391–407, Berlin, Heidel-
berg, 2009. Springer-Verlag.
[86] S. Owicki and A. Agarwal. Evaluating the performance of software cache
coherence. SIGARCH Comput. Archit. News, 17(2):230–242, April 1989.
[87] Mark S. Papamarcos and Janak H. Patel. A low-overhead coherence solution
for multiprocessors with private cache memories. In ISCA ’84: Proceedings
of the 11th annual international symposium on Computer architecture, pages
348–354, New York, NY, USA, 1984. ACM.
[88] Davide Pasetto, Massimiliano Meneghin, Hubertus Franke, Fabrizio Petrini,
and Jimi Xenidis. Performance evaluation of interthread communication
mechanisms on multicore/multithreaded architectures. In Proceedings of the 21st
International Symposium on High-Performance Parallel and Distributed Com-
puting, HPDC ’12, pages 131–132, New York, NY, USA, 2012. ACM.
[89] F. Petrini. Communication Performance of Wormhole Interconnection Net-
works. PhD thesis, Department of Computer Science, University of Pisa, 1997.
[90] A. J. Bennett, A. J. Field, and P. G. Harrison. Modelling and validation of shared
memory coherency protocols. Technical report, Dept. of Computing, Imperial
College, 1996.
[91] Sabela Ramos and Torsten Hoefler. Modeling communication in
cache-coherent SMP systems: A case-study with Xeon Phi. In Proceedings of the
22nd International Symposium on High-Performance Parallel and Distributed
Computing, HPDC ’13, pages 97–108, New York, NY, USA, 2013. ACM.
[92] James Reinders. Intel Threading Building Blocks. O’Reilly & Associates, Inc.,
Sebastopol, CA, USA, first edition, 2007.
[93] S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon:
user-level shared memory. SIGARCH Comput. Archit. News, 22:325–336, April
1994.
[94] Yunseok Rhee and Joonwon Lee. Broadcast directory: A scalable cache co-
herent architecture for mesh-connected multiprocessors. Journal of Systems
Architecture, 46(10):903–918, 2000.
[95] Efraim Rotem, Alon Naveh, Avinash Ananthakrishnan, Eliezer Weissmann,
and Doron Rajwan. Power-management architecture of the Intel
microarchitecture code-named Sandy Bridge. IEEE Micro, 32(2):20–27, March 2012.
[96] Daniel Sanchez, George Michelogiannakis, and Christos Kozyrakis. An anal-
ysis of on-chip interconnection networks for large-scale chip multiprocessors.
ACM Trans. Archit. Code Optim., 7(1):4:1–4:28, May 2010.
[97] Daniel Sanchez, George Michelogiannakis, and Christos Kozyrakis. An anal-
ysis of on-chip interconnection networks for large-scale chip multiprocessors.
ACM Trans. Archit. Code Optim., 7:4:1–4:28, May 2010.
[98] Harjinder S. Sandhu, Benjamin Gamsa, and Songnian Zhou. The shared re-
gions approach to software cache coherence on multiprocessors. SIGPLAN
Not., 28:229–238, July 1993.
[99] Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams.
Understanding POWER multiprocessors. SIGPLAN Not., 46(6):175–186, June
2011.
[100] D. B. Skillicorn and W. Cai. A cost calculus for parallel functional program-
ming. J. Parallel Distrib. Comput., 28(1):65–83, July 1995.
[101] David B. Skillicorn and Domenico Talia. Models and languages for parallel
computation. ACM Comput. Surv., 30(2):123–169, June 1998.
[102] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Don-
garra. MPI-The Complete Reference, Volume 1: The MPI Core. MIT Press,
Cambridge, MA, USA, 2nd. (revised) edition, 1998.
[103] P. Sweazey and A. J. Smith. A class of compatible cache consistency protocols
and their support by the IEEE Futurebus. In ISCA ’86: Proceedings of the
13th annual international symposium on Computer architecture, pages 414–
423, Los Alamitos, CA, USA, 1986. IEEE Computer Society Press.
[104] Philippas Tsigas and Yi Zhang. A simple, fast and scalable non-blocking
concurrent FIFO queue for shared memory multiprocessor systems. In
Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and
Architectures, SPAA ’01, pages 134–143, New York, NY, USA, 2001. ACM.
[105] D.M. Tullsen, J.L. Lo, S.J. Eggers, and H.M. Levy. Supporting fine-
grained synchronization on a simultaneous multithreading processor. In High-
Performance Computer Architecture, 1999. Proceedings. Fifth International
Symposium On, pages 54–58, January 1999.
[106] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM,
33:103–111, August 1990.
[107] M. Vanneschi. High performance computing: parallel processing models and
architectures. Pisa University Press, 2014.
[108] Marco Vanneschi. Heterogeneous HPC environments. In David Pritchard and
Jeff Reeve, editors, Euro-Par’98 Parallel Processing, volume 1470 of Lecture
Notes in Computer Science, pages 21–34. Springer Berlin Heidelberg, 1998.
[109] Marco Vanneschi. The programming model of ASSIST, an environment for
parallel and distributed portable applications. Parallel Comput., 28:1709–1732,
December 2002.
[110] Mary K. Vernon and Mark A. Holliday. Performance analysis of multiprocessor
cache consistency protocols using generalized timed Petri nets. SIGMETRICS
Perform. Eval. Rev., 14(1):9–17, May 1986.
[111] Robert Virding, Claes Wikström, and Mike Williams. Concurrent
programming in ERLANG (2nd ed.). Prentice Hall International (UK) Ltd., 1996.
[112] Qing Yang, L.N. Bhuyan, and B.-C. Liu. Analysis and comparison of cache
coherence protocols for a packet-switched multiprocessor. Computers, IEEE
Transactions on, 38(8):1143–1153, Aug 1989.
