Multithreading opportunities for program optimizations by GIULIANI, SIMONE
University of Pisa
and
Scuola Superiore Sant’Anna
Master Degree in Computer Science and Networking
Master Thesis
Multithreading Opportunities
for Program Optimizations
Candidate Supervisor
Simone Giuliani Prof. Marco Vanneschi
Academic Year 2012/2013
In my walks, every man I meet is my superior in some way,
and in that I learn from him.
(Ralph Waldo Emerson)
Acknowledgements
In this part I would like to thank a lot of people who, in one way or another,
have contributed to my well-being.
I start with Professor Vanneschi, who has been my mentor and one of the most
important people for me in these last years. I can safely say that he changed
my way of reasoning and that he is the first person to have shown me my
personal “brain switch”, allowing me to organize my thoughts in a structured
and effective way. It has been an honor to work for him, both as a student and
as a student representative.
More generally, I would also like to thank those professors who work hard
every day to keep a heterogeneous reality such as MCSN running. I hope
that the atmosphere I experienced here will remain intact for a long time to
come, and I am sure it will.
I thank the List Group very much for allowing me to study with peace of mind
during the last years of my career as a student.
I would like to sincerely thank my parents for having always believed in
me, even in the darkest moments of my life. If I have come this far, it is above
all thanks to you. We are a great, united team and we will never stop
being one...
A giant kiss also to my grandparents, who are one of the greatest fortunes
I have. I cannot wait to celebrate together with all of you!
I also thank my uncles and aunts and my super cousins: my “de facto” sister
Ari, the mathematician Fili, the creative France and the very young athlete
Samu, whom I cannot wait to challenge over 100 m.
I thank Francesca, my beautiful girlfriend, because her eyes give me
incredible strength and her presence warms my heart, even when it is really
cold outside. I am very lucky.
I thank truly, truly from the heart, Francesco Venturini and Francesca
Martinelli, my lifelines, as well as keepers of a large part of
me. You are simply special and being with you fills my heart with joy.
A huge hug to my long-standing group: Tommy, l’Arga, il Mancio,
Miche, Seba, Matteo, Deb, Fra, Anna and their respective partners.
The more I think about it, the more I realize that we are a really
well-matched group of people.
I thank very much my friends of the 5MB, whom I love dearly.
In particular: Frala, Fili, Miche, Ventu, Gabrio, Luca and also Giulia,
Giovanna and Alessia. A big kiss to all of you and AIC.
I owe important thanks to the fantastic people I met during this
master's degree. All of you were able to give me something, and
studying together was very nice and, above all, very formative. In
particular I would like to thank Francesca Pacini, Daniele De Sensi and
Gianmarco Saba. Our extraordinary meetings for MCSN will not be easy to
forget. I take this opportunity to wish good luck to Alessandro,
Roberto, Marco and the new representatives. A special “architectural”
thank you goes to Paolo Giangrandi and Fabio Luporini: you have always been
overflowing with advice for me. I hope to see you both again soon and to be
able to return all your kindness. A big thumbs up for my friend Bob,
for his legendary carbonara and for his hospitality. Special thanks
also go to all the other MCSN people and “similar”: il Cica, Andreyu, Picci,
Sina, Dolcey, Ion, Tizi, Angela, Davide, Ema, Fili, Leta, Melat, Yonas, Tudor,
Haile, Virgi, Lotta, Rosario, Bande, Farru, Dean, Giacomo, Lorenzo and Gianluca.
I also thank Emilio, Hind, Nebbia, Ali, Anna, Samu and all the guys from the
bachelor's degree whom I have not had the chance to see much during these last
years.
I thank my fantasy football group, in particular Manuel and il Martino,
whom I love very much, despite the quarrels over the rules. Remember
that, wherever we go, we will be able to say with our heads held high that we
played in the Dippity Doo! And I thank you for having involved me, back then,
in this fantastic project that I always carry in my heart.
I thank all of Francesca's friends and relatives, in particular: her
dad, her mom, her grandmother, Ste, Ghigo, Irene, Danda, the two Stefanos,
Filippo, Lorenzo, Maria Claudia, Betta, Chiara, Totta, Linda, Gabriele,
Alberto, Luca, Sara, Francesca and Paola: you are truly wonderful people.
In particular I would like to thank Guia and her mother, who was an
exceptional teacher for my slightly rusty English.
Speaking of English... Let me also thank my Erasmus friends! Andre,
Giulia, Ele, Eli and Monique: smack! A reunion is urgently needed!
Finally, I thank the years devoted to studying the piano and the wonderful
years in which I practiced track and field, for having given me incredible
strength of spirit and tenacity. I do not know whether I would have made it
without these “gyms” of life.
Come on! Come on! Come on!
Contents
1 Introduction
1.1 Increase of Parallelism Degree
1.2 Thread Cooperation
1.2.1 Speculative Precomputation (or Helper Thread)
1.2.2 Threaded Multipath Execution
1.2.3 Communication Threads
1.2.4 Speculative Multithreading
1.3 Related Works
1.3.1 Speculative Precomputation
1.3.2 Multipath Execution
1.3.3 Speculative Multithreading
1.3.4 Communication delegation
1.4 Thesis Structure
2 Architecture
2.1 Introduction
2.2 Interleaved Multithreading (Fine grain multithreading)
2.3 Blocking Multithreading (Coarse Grain Multithreading)
2.4 Simultaneous Multithreading
2.5 Multithreading in parallel computing
2.6 Commercial platforms
2.6.1 Pentium 4 (HyperThreading)
2.6.2 SUN ULTRASPARC (NIAGARA) T1
2.6.3 SUN MAJC
2.6.4 IBM Power
3 Prefetching Opportunities
3.1 Introduction
3.2 Predictable access pattern
3.3 Unpredictable access pattern
3.4 Case of Study: Chasing Pointer
3.5 Natural Pointer
3.6 Jumper Pointer
3.7 Speculative Precomputation
4 Solving the Branch Problem
4.1 Case of Study: the Branch Problem
4.2 Compilation Time techniques
4.3 Branch Prediction
4.4 Branch Prediction for Pipelined CPU
4.5 MultiPath Execution
4.6 Threaded Multipath Execution
4.7 Threaded Multipath Execution for Pipelined CPU
5 Other opportunities
5.1 Communication Threads
5.1.1 Communications in Parallel Applications
5.1.2 Communication optimizations
5.1.3 Communication Processor
5.1.4 Communication Thread
5.2 Speculative Multithreading Architecture (SpMT)
5.2.1 Architectural Support
5.2.2 Run-time Support
5.2.3 Case of Study
6 Coop. Threads vs Incr. Parallelism Degree
6.1 Introduction
6.2 Case of study 1
6.2.1 Increase of Parallelism Degree
6.2.2 Cooperative Threads
6.3 Case of Study 2
6.3.1 Increase of Parallelism Degree
6.3.2 Cooperative Threads
6.4 Formalization
7 Conclusions
Bibliography
List of Figures
2.1 Interleaved Multithreading with Scalar CPU.
2.2 Interleaved Multithreading with Superscalar CPU.
2.3 Blocking Multithreading with Scalar CPU.
2.4 Blocking Multithreading with Superscalar CPU.
2.5 Simultaneous Multithreading (with Superscalar CPU).
2.6 Performance of Multithreading on Benchmark Suite Spec95 and Splash2
3.1 Linked List representation
3.2 Graphical Simulation of Chasing Pointer Algorithm - 1
3.3 Graphical Simulation of Chasing Pointer Algorithm - 2
3.4 Graphical Simulation of Chasing Pointer Algorithm - 3
3.5 Graphical Simulation of Chasing Pointer Algorithm - 4
3.6 Graphical Simulation of Chasing Pointer Algorithm - 5
3.7 Linked List representation for Jumper Pointer Technique
3.8 Linked List representation for Jumper Pointer Technique with transient phase optimization
3.9 Multithreaded CPU with replication of IU and EU Master
4.1 Array Sum Graphical Simulation 1
4.2 State Machines diagram for Branch Prediction with 1-bit
4.3 State Machines for Branch Prediction with 2-bit
4.4 Branch Prediction Schema for Pipelined CPU
4.5 A register mapping scheme for a simultaneous multithreading processor, with an added mapping synchronization bus to enable threaded multi-path execution
4.6 Threaded Multipath Execution Schema
5.1 Non Overlapping communications and calculations
5.2 Fully Overlapping communications and calculations
5.3 Partial Overlapping communications and calculations
5.4 Processing node with communication processor
5.5 Values of the arrays L and K
5.6 Speculative Multithreading Architecture Example
6.1 Graphical Simulation of a farm’s worker execution - Case of Study 1.
6.2 Graphical Simulation of a farm’s worker execution (with Threaded Multipath Execution) - Case of Study 1.
6.3 Graphical Simulation of a farm’s worker execution - Case of Study 2.
6.4 Graphical Simulation of a farm’s worker execution (with Threaded Multipath Execution).
6.5 Service Time Ratio. $T''_{iter} = 5$.
6.6 Service Time Ratio. $T''_{iter} = 7$.
6.7 Service Time Ratio. $T''_{iter} = 9$.
6.8 Service Time Ratio. $T''_{iter} = 11$.
Chapter 1
Introduction
The introduction of Chip MultiProcessors (CMP) led to a substantial
reformulation of Moore's law, which now states that the number of cores on a
single chip doubles every year and a half [1][3].
The technological boom related to CMP gave a strong impulse to parallel
program design, narrowing its “gap” with parallel architectures.
Nowadays a leading trend in high performance products is represented
by CMP with multithreading CPU nodes.
Basically, the CPU multithreading feature tries to overcome the
underutilization of superscalar processors, due to the lack of exploitable
instruction level parallelism (ILP), by allowing the simultaneous processing of
different programs during the same time slot.
In multithreading architectures a thread is a concurrent computational entity
supported directly at the firmware level (these threads are usually called
hardware threads).
Multithreading technology opens a broad range of possible optimizations that
can be applied to improve the performance of sequential and parallel
applications.
Usually multithreading architectures are characterized by units shared among
the threads in execution, and this can represent, in general, both an
opportunity and a limitation for the development of applications.
Cache fault unpredictability [11], the branch problem [6] and the cost
of communications [7] in parallel applications represent overheads that can be
mitigated with program optimizations targeted at multithreading
architectures.
Cache fault unpredictability can represent a serious problem in some
situations. For example, this problem can arise if an application presents
degrees of randomness in its access pattern.
The static unpredictability of data locality and reuse generally causes stalls
in the pipelined CPU, due to continuous cache faults.
This problem obviously cannot be faced by relying only on the on-demand
paging strategy [1][2]: it is necessary to anticipate the loading into cache of
the “difficult to predict” blocks (prefetching technique).
Such a technique is strictly application dependent, in the sense that
compilers usually operate on the following principle: as soon as the “faulty”
logical address is available, the related cache block is requested.
There is a large body of literature on ad-hoc prefetching techniques aimed at
solving specific problems, and the case study in Chapter 3 presents a few
examples of specific prefetching techniques used for algorithms based on linked
lists (chasing pointer algorithms [12][13]). These techniques can be extended,
in general, to applications presenting unpredictable access patterns.
As regards the unpredictability of cache faults, a possible opportunity given
by multithreading architectures is to implement an ad-hoc prefetching activity,
decoupling it from the main thread and executing it speculatively on the same
CPU but in anticipation. Such an idea tries to exploit the cache shared among
the threads in execution. This technique is called speculative precomputation
(or helper thread [5][14][15][16][17][18]).
The Branch Problem can be a source of degradation for sequential and
parallel applications, and it concerns the frequency with which (conditional)
branch instructions appear in the compiled code.
The problem with such instructions does not depend on the bubbles created
by the jump in the pipeline execution; it depends primarily on the
possible data dependencies on registers in which these instructions can be
involved [1][3].
Furthermore, different studies empirically show that “average” programs
contain a great number of branch instructions (statistics count approximately
one conditional branch instruction every five instructions [19][6]).
Compilation time techniques such as delayed branch, loop unfolding and macro
expansion successfully overcome the branch problem on pipelined scalar
CPUs [1][3].
Often powerful compilation time techniques are not supported for superscalar
architectures. In this case the instructions are simply grouped in the order in
which they appear in the compiled code, forming the so-called “long
instructions”.
In superscalar processors, the probability of having a branch instruction not
located in the last position of a “long instruction” can be an additional source
of service time degradation when compared with the case of scalar CPUs.
The branch prediction technique was invented mainly to mitigate the
problems of superscalar architectures. This is done by providing a support
entirely implemented at the firmware level, aiming to obtain a greater fluidity
of the pipelined execution of instructions, in spite of the presence of
conditional branches.
This technique requires invalidation and commit mechanisms to be
implemented [1][3].
Chapter 4 provides an exhaustive example of branch prediction support,
targeted at pipelined CPUs.
Multipath Execution is a technique studied in the literature to drastically
mitigate branch degradations.
When multipath execution is applied to a branch instruction, both paths
are executed speculatively by different threads.
In theory this technique works particularly well on multithreading
architectures, where it is called Threaded Multipath Execution [6].
The communication latency related to the modules of a parallel
application can have a great impact on performance, in particular in message
passing environments [7]. If the underlying architecture allows the execution
of communication primitives to be delegated to a communication processor, it is
possible to investigate the possibility of masking the communication overheads
fully or, at least, partially.
In [7] the authors show that it is possible to build an efficient run-time
support for communication primitives, targeted at multithreading architectures.
This is done using hardware threads as communication processors.
The most natural use of multithreading architectures for parallel
applications would be the increase of the parallelism degree.
However, different studies [57][58] show that this approach does not
automatically lead to significant advantages, and it can even be a source of
further degradations.
Basically, the use of multithreading architectures can be divided into two
execution models:
• Increase of Parallelism Degree.
• Cooperative Threads.
1.1 Increase of Parallelism Degree
Increasing the parallelism degree by allocating more threads on a single
multithreading CPU can be a valid execution model when the number of cores of
a multiprocessor machine is not sufficient for, or not even close to, the
optimal parallelism degree of the parallel application [1][3].
As already mentioned, different studies suggest that such a technique does not
offer great advantages, because it may cause increasing contention on those
resources that already limit the performance with a single thread per
core.
For example, if the single thread presents high locality and reuse, increasing
the parallelism degree could lead to a premature reallocation of the cache
blocks [38].
The reason is that the cache size may not be sufficient to hold many
working sets at the same time.
Suppose we have a parallel computation modelled as a farm, with big data
structures as input stream elements. If the workers show a lot of locality
and/or reuse on such data structures (so that the working sets barely fit in
the shared cache), increasing the parallelism degree would cause a continuous
reallocation of cache blocks, with the consequent worsening of the workers’
service time.
Some implementations of multithreaded CPUs are characterized by the sharing
of floating point units among the different threads in execution. In the case
of computationally intensive applications, the contention on such resources
could badly limit the performance of the processor, because a great number of
stalls would occur.
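The cache contention effect described in this section can be illustrated with a
small simulation. The following is only a sketch under simplifying assumptions
that are not taken from the cited studies: a fully associative LRU cache shared
by the threads of one CPU, threads interleaved round-robin, and each thread
cyclically scanning its own private working set. The function name and all
parameter values are hypothetical.

```python
from collections import OrderedDict


def miss_rate(num_threads, working_set_blocks, cache_blocks, sweeps=20):
    """Miss rate of a shared LRU cache when each thread reuses a private working set."""
    cache = OrderedDict()  # block id -> None, ordered from least to most recently used
    accesses = 0
    misses = 0
    for _ in range(sweeps):
        for block in range(working_set_blocks):
            for thread in range(num_threads):  # round-robin interleaving of the threads
                key = (thread, block)
                accesses += 1
                if key in cache:
                    cache.move_to_end(key)  # hit: mark as most recently used
                else:
                    misses += 1
                    cache[key] = None
                    if len(cache) > cache_blocks:
                        cache.popitem(last=False)  # evict the least recently used block
    return misses / accesses


# One working set of 100 blocks fits in a 128-block cache, two do not.
single = miss_rate(num_threads=1, working_set_blocks=100, cache_blocks=128)
double = miss_rate(num_threads=2, working_set_blocks=100, cache_blocks=128)
print(single, double)
```

With these illustrative parameters a single thread misses only during its first
sweep, while two threads evict each other's blocks and miss on every access.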
1.2 Thread Cooperation
The thread cooperation execution model includes techniques in which the
threads are coordinated with each other to speed up the execution of a single
thread.
The techniques treated in this thesis, some of which have already been
mentioned, are:
• Speculative Precomputation (or Helper Thread).
• Threaded Multipath Execution.
• Speculative Multithreading.
• Communication Threads.
1.2.1 Speculative Precomputation (or Helper Thread)
This technique is the most used program optimization targeted at
multithreading architectures.
The instructions of the main thread are executed together, on the
same CPU, with the instructions of another thread, built ad-hoc, called helper
thread. This is done in such a way that the helper thread anticipates the
execution of the same instructions that could cause cache faults in the main
thread.
Basically, the helper thread behaves as an ad-hoc activity dedicated to
prefetching data, and its instructions anticipate those of the main thread. It
has to be activated neither too early nor too late; indeed, it is fundamental
to establish the point at which the instructions of the helper thread start
to be executed. To determine such a point, the compiler calculates the so-called
prefetching distance, which quantifies the “anticipation gap” needed between the
two threads to obtain some positive effects. The use of the helper thread for
prefetching purposes can be decisive for applications presenting certain
properties, such as the already mentioned cache fault unpredictability.
Briefly, the speculative precomputation technique starts with the identification
of the “hypothetical faulty loads” (this kind of analysis can be done through
profiling or at compilation time).
Once such “dangerous loads” have been detected, the compiler generates the
helper thread instructions to be executed.
Studies regarding helper threads [5][14][16] report, on average, speedups
ranging from 15% to 20% with up to eight hardware contexts used.
These results are obtained taking into account a speculative precomputation
schema in which helper threads can be spawned only by the non-speculative
(main) thread (this solution is called basic trigger).
These studies show that if helper threads are spawned by other speculative
(helper) threads (chaining trigger schema) the improvement can be very high,
ranging from 76% to 169% with up to eight hardware contexts used.
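To give an intuition of why a helper thread can hide cache faults, the
following toy timing model compares a linked list traversal with and without a
helper thread. It is only a sketch under assumptions that are not part of the
cited studies: every node access is a cache fault costing `miss` cycles, the
main thread spends `work` cycles per node, the helper thread starts together
with the main thread, executes only the loads, and does not contend with the
main thread for any resource. All names and values are hypothetical.

```python
def traversal_time(n_nodes, work, miss, helper=False):
    """Cycles needed by the main thread to traverse a linked list of n_nodes."""
    finish = 0     # time at which the main thread finishes the current node
    available = 0  # time at which the current node's block is in the shared cache
    for _ in range(n_nodes):
        if helper:
            # The helper issues the load as soon as the previous node (and so
            # the pointer to this node) is available, without waiting for the
            # main thread to complete its work.
            available = available + miss
        else:
            # The main thread issues the load on demand, after its own work.
            available = finish + miss
        # The main thread processes the node once the block has arrived.
        finish = max(finish, available) + work
    return finish


print(traversal_time(1000, work=100, miss=100))               # on-demand loads
print(traversal_time(1000, work=100, miss=100, helper=True))  # helper thread
```

In this model, with `work = miss = 100` cycles and 1000 nodes, the traversal
takes 200000 cycles without the helper thread and 100100 cycles with it: the
cache fault latency is overlapped with the computation of the main thread.
Real implementations must also account for the helper's own instructions and
for the contention on shared units, which this sketch ignores.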
1.2.2 Threaded Multipath Execution
Another technique of the cooperative execution model is represented by Threaded
Multipath Execution.
This technique tries to overcome the above mentioned branch problem on
multithreading architectures, using the hardware contexts to speculatively
execute multiple paths of execution. Whenever a conditional branch has been
resolved, the thread executing the wrong path is cancelled and the context used
becomes free to be allocated for a new computation.
The paper [6] concentrates in particular on the efficiency of such a technique,
assuming the existence of some features implementing a fast way to spawn
threads and a mechanism to identify good candidates among the conditional
branch instructions in the program.
This opportunity will be discussed in depth in Chapter 4, including the needed
support at the firmware/assembler level.
The study [6] claims an average single-program speedup ranging from
14% to 23%, depending on the misprediction penalty, for programs with a high
misprediction rate.
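The allocation and release of hardware contexts described above can be sketched
with a very simple model. The following code is an illustration under several
assumptions that do not come from [6]: branches are given with the cycle at
which they are issued in a stall-free schedule, the alternate path is forked on
every branch for which a spare context is free (no candidate selection), every
branch is resolved after a fixed latency, and a forked branch never pays the
misprediction penalty. Function and parameter names are hypothetical.

```python
def branch_stall_cycles(branches, penalty, spare_contexts, resolve_latency):
    """Total misprediction stall cycles.

    branches: list of (issue_cycle, mispredicted) pairs in program order.
    """
    busy_until = []  # resolution times of the contexts running an alternate path
    stalls = 0
    for cycle, mispredicted in branches:
        # Contexts whose branch has been resolved are cancelled and become free.
        busy_until = [t for t in busy_until if t > cycle]
        if len(busy_until) < spare_contexts:
            # A free context executes the alternate path: whatever the outcome,
            # the correct path is already in execution when the branch resolves.
            busy_until.append(cycle + resolve_latency)
        elif mispredicted:
            # No spare context: a misprediction costs the full penalty.
            stalls += penalty
    return stalls


trace = [(0, True), (1, True), (10, False), (11, True)]
for contexts in (0, 1, 2):
    print(contexts, branch_stall_cycles(trace, penalty=8,
                                        spare_contexts=contexts,
                                        resolve_latency=5))
```

The sketch shows how the benefit depends on the number of spare contexts and on
how quickly branches are resolved: branches that arrive while all contexts are
busy still pay the misprediction penalty.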
1.2.3 Communication Threads
The basic idea of communication threads is to associate with each module of a
parallel or sequential computation one or more threads to which the
execution of communication primitives is delegated [7].
When a module is characterized by:
• Sequences of calculations and communications.
• Semantics of the application allowing communications to be overlapped.
• Asynchronous and non-blocking communications.
the communication threads technique can definitely be taken into account.
This technique is based on the delegation mechanism, that is, a way to assign a
computation to a different entity. The run-time support of the delegation
mechanism has to be as efficient as possible, with a cost much lower
than that of the communication itself. Only in this case is it possible to
obtain positive effects on parallel applications, overlapping the
communications with calculations.
When the communication primitives are invoked, the instructions related
to the calculation are combined dynamically with the run-time support
instructions of the communication primitives.
The work [7] shows that the use of message passing paradigms for parallel
programming on CMP can be a good alternative to shared memory paradigms.
Message passing paradigms are able to offer performance similar to that of
the shared memory models, given a well-designed run-time support.
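The delegation mechanism can be sketched at the software level as follows. This
is only an analogy, not the run-time support of [7]: Python software threads
stand in for hardware threads, a list stands in for the output channel, and
appending to it stands in for the send primitive. The names
`communication_thread`, `worker`, `outbox` and `channel` are hypothetical.

```python
import queue
import threading


def communication_thread(outbox, channel):
    """Executes the send primitive on behalf of the worker until a None sentinel arrives."""
    while True:
        message = outbox.get()
        if message is None:
            break
        channel.append(message)  # stands in for the real send primitive


def worker(inputs):
    channel = []            # hypothetical destination of the messages
    outbox = queue.Queue()  # delegation queue between worker and communication thread
    comm = threading.Thread(target=communication_thread, args=(outbox, channel))
    comm.start()
    for x in inputs:
        result = x * x      # calculation phase
        outbox.put(result)  # delegation: the worker does not wait for the send
    outbox.put(None)        # no more messages to send
    comm.join()
    return channel


print(worker([1, 2, 3]))
```

The worker pays only the cost of inserting the message into the delegation
queue, which is the cost that, as stated above, has to be much lower than that
of the communication itself.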
1.2.4 Speculative Multithreading
Speculative Multithreading [31] (also known in the literature as Thread Level
Speculation) consists in the speculative execution of sections of the code of
a serial application. The hardware contexts of a multithreading architecture
are used to allocate these sections.
The approach dictated by Speculative Multithreading is characterized by
a first step in which the serial application is split into sequential sections
without taking into account the dependencies of the original program.
The second step is represented simply by the parallel execution of the sections,
exploiting the hardware contexts of the underlying architecture. The parallel
execution is performed without worrying about the possible violations of the
semantics due to a wrong execution order of the sections.
Whenever the run-time support detects a violation, it invalidates the violating
sections without affecting the architectural state of the processor.
The same violating sections will be re-executed later when, hopefully, the
dependency has been resolved. If there are no violations in the execution, a
commit phase starts. The run-time support of speculative multithreading
has to implement commit/invalidation mechanisms.
Its design is much more complicated than the support of
multipath execution, because it additionally has to implement a heavier
mechanism able to track operations and to detect violations. Furthermore, it
implements tools for the extraction of the sections to be executed in parallel.
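The two steps and the commit/invalidation mechanism can be illustrated with the
following sketch. It is a software model under assumptions of this example, not
the architecture of [31]: each section is a function that reads variables
through a `load` callback and returns the values it wants to write, all pending
sections run against the same snapshot of the memory, commits happen in program
order, and the first violating section is squashed together with all the
sections that follow it. All names are hypothetical.

```python
def speculative_run(sections, memory):
    """Execute sections speculatively; return (final memory, number of squashed executions)."""
    memory = dict(memory)
    pending = list(sections)
    squashed = 0
    while pending:
        snapshot = dict(memory)
        # Speculative phase: every pending section runs against the same
        # snapshot, as if each one were allocated on its own hardware context.
        results = []
        for section in pending:
            reads = set()

            def load(name, reads=reads):
                reads.add(name)  # track the read set to detect violations
                return snapshot[name]

            writes = section(load)
            results.append((reads, writes))
        # Commit phase, in program order.
        written = set()
        committed = 0
        for reads, writes in results:
            if reads & written:
                break  # violation: the section has read a stale value
            memory.update(writes)
            written.update(writes)
            committed += 1
        squashed += len(pending) - committed
        pending = pending[committed:]  # squashed sections are re-executed
    return memory, squashed


sections = [
    lambda load: {"a": load("x") + 1},
    lambda load: {"b": load("a") * 2},   # depends on the first section
    lambda load: {"c": load("y") + 10},  # independent of the others
]
print(speculative_run(sections, {"x": 1, "y": 5, "a": 0, "b": 0, "c": 0}))
```

The first pending section can never violate, so at least one section commits at
each round and the final memory is the same as that of the sequential
execution. A real support would re-execute only the violating sections, at the
cost of the heavier tracking mechanism mentioned above.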
1.3 Related Works
This section presents the bibliography related to papers and publications, the
contents of which have contributed to the elaboration of this thesis.
1.3.1 Speculative Precomputation
The main motivation behind the use of helper threads is the data prefetching
activity, with the precomputation of logical addresses.
The speculative precomputation technique is also used for purposes other than
data prefetching, even if they are not very common in practice.
For example, among these purposes we have the precomputation of future branch
outcomes [20] and the precomputation of logical addresses of future
instructions [21].
The speculative precomputation technique is appealing from a practical point
of view for two main reasons: the ease with which helper threads can be
generated and the fact that they do not affect the correctness of the program.
There is a large body of scientific literature treating the methodology related
to the construction of helper threads. These can be produced by the
programmer/compiler [22], the firmware level [14] and other helper threads [17].
Speculative precomputation is also introduced in event-driven compilation [23],
simultaneous subordinate microthreading [20] and precomputation of
dependent live-in data [24].
1.3.2 Multipath Execution
Threaded multipath execution simply represents a particular application of
multipath execution targeted at multithreading architectures.
Studies on multipath execution date back to 1967, and it is worth mentioning at
least some of the results conceived for the support of such a technique.
The study of multipath execution started with the design of the IBM 360/91
machine, implementing the so-called Eager Execution [25].
Many studies regarding multipath execution take into account underlying
architectures based on superscalar processors, while other studies are based on
architectures characterized by “new components” [28][29][30].
Among the machines based on superscalar processors there are the already
mentioned IBM 360/91 [25], the PolyPath [26] and the PrincePath [27]
architectures.
The PolyPath architecture implements Selective Eager Execution (SEE), in
which multipath execution is not applied to every branch instruction:
a run-time evaluation decides whether to activate it or not. The
PolyPath shows an average speedup of up to 14% from the unipath to the
multipath model.
PrincePath is very similar to PolyPath. It basically has an additional buffer
supporting longer multipath executions. The PrincePath architecture shows
an average speedup of 15-16% from unipath to multipath.
1.3.3 Speculative Multithreading
There is a large body of literature on speculative multithreading. A
couple of interesting studies are [31] and [32], treating respectively the
speculative execution of loop iterations and the implementation of efficient
speculative multithreading exploiting the SMT shared registers.
There are also studies debating the convenience of using SMT or CMP
architectures for speculative multithreading techniques.
The work in [33] concludes that, assuming an equal area occupancy for both
SMT and CMP, the SMT solution is preferable from both a performance and an
energy point of view.
In [34] the authors concentrate only on the performance point of view,
finding that SMT architectural support for speculative multithreading
outperforms a classic CMP support.
1.3.4 Communication delegation
In a message passing environment, the idea of delegating communications to
another entity is already implemented by high performance interconnects such as
InfiniBand and Myrinet.
In the literature there are studies analyzing the use of cores and processors
(exploiting simple interconnection networks), such as [35], which concerns a
general purpose communication engine for MPICH.
The work in [36] tries to optimize the implementations of collective operations,
in particular the MPI broadcast operation optimized for InfiniBand networks.
1.4 Thesis Structure
This thesis deals with the mentioned topics and is structured in the following
way:
• Chapter 2 details the multithreading technology and also presents
some commercial implementations of multithreading architectures.
• Chapter 3 discusses the prefetching techniques for multithreading
and non multithreading CPUs, examining in depth the Speculative
Precomputation or Helper Thread cooperative execution model. All the
techniques addressed in this chapter are evaluated for the
pipelined CPU presented in [1][3].
• Chapter 4 investigates the branch problem and some techniques used
at compilation time and at run-time to mitigate its effects. In particular,
Threaded Multipath Execution is examined in depth and a
possible support of such a technique is presented. All the techniques
covered in this chapter are analyzed for the pipelined CPU presented
in [1][3].
• Chapter 5 discusses the Speculative Multithreading and Communication
Threads techniques.
• Chapter 6 investigates the choice between the thread cooperation and
the increase of parallelism degree execution models, making some
assumptions on the application and on the underlying architectures.
• Chapter 7 presents the conclusions of the work done and possible
ideas for future work.
Chapter 2
Architecture
2.1 Introduction
The main idea behind multithreading architectures is to create a processor able
to take advantage of both Instruction Level Parallelism (ILP) and Thread Level
Parallelism (TLP).
The idea of multithreading (MT) can be interpreted in two different ways:
• as a technological alternative to multiprocessor architectures, allowing a
multiple program/process execution.
• as a technological evolution of ILP architectures exploiting in a better
way the resources on-chip, to mitigate the degradations related to CPU
ILP.
At firmware level, the support of multithreading architecture must implement:
• m independent contexts : each context is represented by an Instruction
Counter register and a set of registers visible at the assembler level.
• a tagging mechanism to distinguish instructions belonging to different
threads in the CPU.
• a thread switching mechanism for the context switch at a thread granu-
larity.
As regards the issuing of instructions, multithreading architectures can be
divided into:
• Single Issue: in a time slot only instructions belonging to a single thread
are issued. This kind of multithreading support can be implemented by
a scalar or superscalar architecture.
Among the single-issue multithreading architectures it is possible to dis-
tinguish two different multithreading models that will be described in
the following subsections:
– Interleaved Multithreading (IMT) or Fine Grain Multithreading.
– Blocking Multithreading (BMT) or Coarse Grain Multithreading.
• Multiple Issue: in a time slot instructions belonging to multiple threads
are issued. This kind of multithreading support can be implemented only
by superscalar architectures and it is called Simultaneous Multithreading
(SMT).
2.2 Interleaved Multithreading (Fine grain mul-
tithreading)
In Interleaved Multithreading (IMT), the thread that issues instructions
changes at each time slot. If one of the active threads encounters a high
latency event (such as a cache fault or a “long” arithmetic operation), it is
not scheduled again until that event completes.
The advantage of Interleaved Multithreading is the lack of overhead related to
the context switch between threads. The main drawback is that it can slow
down the single thread performance.
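The selection rule described above can be sketched in C. This is only an illustrative model, not the firmware of any real processor: the function name imt_next_thread, the blocked array and the round-robin order are assumptions introduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of IMT thread selection (round-robin is an assumption).
 * blocked[i] is true while thread i waits for a high latency event.
 * last is the index of the thread that issued in the previous time slot.
 * Returns the thread that issues in the next slot, or -1 if every thread
 * is blocked and the slot stays empty. */
int imt_next_thread(const bool *blocked, size_t m, int last)
{
    for (size_t step = 1; step <= m; step++) {
        size_t candidate = ((size_t)last + step) % m;
        if (!blocked[candidate])
            return (int)candidate;
    }
    return -1;
}
```

With four threads and none blocked, the threads simply alternate; a blocked thread is skipped without any switch penalty, which is the property the text attributes to IMT.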
Figure 2.1 shows the IMT technique applied to a scalar CPU and Figure 2.2
shows the same technique applied to a superscalar one. A, B, C, D are active
threads and the gray boxes (unfilled) represent empty issue slots caused by
dependencies in the execution of threads’ instructions.
2.3 Blocking Multithreading (Coarse Grain Mul-
tithreading)
In Blocking Multithreading (BMT) a single thread is executed until it reaches
an event triggering a thread switch.
The advantage with respect to IMT is that the single thread execution is
performed at the maximum speed until a thread switch occurs.
Figure 2.1: Interleaved Multithreading with Scalar CPU.
Figure 2.2: Interleaved Multithreading with Superscalar CPU.
The drawback of this technique is the degradation caused when short stalls
occur. Every time a thread switch is triggered, the pipeline has to be flushed
and filled up with instructions of the new thread. This operation can cost a
few cycles of overhead.
The BMT technique is therefore usually exploited to mitigate high latency
operations, where the cost of refilling the pipeline is negligible compared to
the stall time.
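The trade-off between stall time and pipeline refill cost can be written down as a small C sketch. The function names and the simple "switch only if the stall is longer than the refill" rule are assumptions made for illustration; the sketch also assumes that another thread is ready to run.

```c
#include <stdbool.h>

/* Hypothetical decision rule: a thread switch pays off only when the stall
 * is longer than the cycles needed to flush and refill the pipeline. */
bool bmt_switch_pays_off(unsigned stall_cycles, unsigned refill_cycles)
{
    return stall_cycles > refill_cycles;
}

/* Cycles lost by the processor for one stall under that rule. */
unsigned bmt_lost_cycles(unsigned stall_cycles, unsigned refill_cycles)
{
    return bmt_switch_pays_off(stall_cycles, refill_cycles)
               ? refill_cycles   /* switch: only the refill is paid */
               : stall_cycles;   /* no switch: the short stall is paid */
}
```

For a 200-cycle cache fault and a 5-cycle refill the switch hides almost all the latency, while for a 2-cycle stall switching would cost more than waiting.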
Figure 2.3 shows the BMT technique applied to a scalar CPU and Figure 2.4
shows the same technique applied to the superscalar CPU. A,B,C are active
threads and the gray boxes (unfilled) represent empty issue slots caused by
dependencies in the execution of instructions.
Figure 2.3: Blocking Multithreading with Scalar CPU.
Figure 2.4: Blocking Multithreading with Superscalar CPU.
2.4 Simultaneous Multithreading
Simultaneous Multithreading (SMT) is characterized by the multiple issuing of
instructions belonging to different threads. If the architecture is n-way
superscalar with m hardware contexts, the n instructions of a stream element
belong to a number of threads at most equal to m (with 1 ≤ m ≤ n) and each
“long instruction” is characterized by its own value of m.
The key idea behind the SMT model is the composition of the “long instruc-
tion”; this is done with the aim to obtain both parallelism among threads and
latency overlapping.
Figure 2.5 shows an example of Simultaneous Multithreading for superscalar
CPUs. Instructions of multiple threads are issued in the same clock cycles.
A, B, C and D are active threads and gray boxes represent empty issue slots
caused by dependencies in the execution of thread instructions.
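The composition of a “long instruction” can be illustrated with a short C sketch. The policy used here (visit the threads in round-robin order and take one ready instruction per visit) is an assumption chosen for simplicity, and smt_compose is a hypothetical name; real SMT issue logic is more elaborate.

```c
#include <stddef.h>

/* Hypothetical model of one SMT issue cycle.
 * ready[t] is the number of instructions thread t could issue this cycle,
 * m is the number of hardware contexts, n the issue width.
 * slots[i] receives the id of the thread occupying issue slot i.
 * Returns the number of filled slots (the rest are empty issue slots). */
size_t smt_compose(const unsigned *ready, size_t m, size_t n, int *slots)
{
    size_t filled = 0;
    for (unsigned round = 0; filled < n; round++) {
        size_t before = filled;
        for (size_t t = 0; t < m && filled < n; t++) {
            if (ready[t] > round)
                slots[filled++] = (int)t;
        }
        if (filled == before)
            break;  /* no thread has anything left to issue */
    }
    return filled;
}
```

When one thread has few ready instructions, the slots it leaves free are taken by the other threads, which is the latency overlapping effect described above.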
Figure 2.5: Simultaneous Multithreading (with Superscalar CPU).
2.5 Multithreading in parallel computing
Parallelism at program/process grain (Thread Level Parallelism, TLP) comes
from multithreaded parallel applications or individual programs in a multipro-
grammed workload.
Experimental results show that SMT architectures work well on
multiprogrammed workloads, obtaining on average a scalability that goes from
1.4 to 2.4 as the number of threads goes from 2 to 8 (see Figure 2.6).
As already discussed in the Introduction (Chapter 1), increasing the
parallelism degree does not guarantee a performance improvement.
The reason is that threads with similar characteristics tend to stress the
same resources, which in turn become bottlenecks.
The point is that it is still unknown how to efficiently exploit multithreading
architectures in parallel applications.
Figure 2.6: Performance of Multithreading on Benchmark Suite Spec95 and
Splash2
2.6 Commercial platforms
There are several commercial and “less” commercial implementations of
multithreading architectures.
This section gives a brief overview of some multithreading platforms.
For more complete documentation see the related bibliography at the end of
this thesis.
2.6.1 Pentium 4 (HyperThreading)
It was the first commercial general purpose processor to implement a multi-
threading architecture (Hyper-Threading[8]).
• 1 core.
• Superscalar CPU with out-of-order execution.
• Simultaneous Multithreading with 2 hardware contexts.
• Shared Cache L2 and Shared Floating Point Units.
It operates in two modes:
• Single Threaded mode: the single thread has full use of all resources.
• Multi Threaded mode: each thread has access to half of the partitioned
resources.
The sharing of resources can impact positively on the workload of multithread-
ing applications because it guarantees a maximum flexibility in terms of re-
source allocation. Conversely, the partitioning schemes are more suitable for
multiprogrammed environments.
Intel claims 15-27% performance gains on multithreaded and multiprogrammed
commercial workloads when turning on HT[42].
2.6.2 SUN ULTRASPARC (NIAGARA) T1
Sun (now Oracle) has slightly different architectural requirements with
respect to AMD and Intel because it targets a different market (the server
market).
The application domain of the server market is characterized by applications
with a lower ILP and higher TLP, with respect to the personal computer
market.
The design of this architecture is characterized by many “modest” cores with
many hardware contexts per CPU, instead of few powerful cores with few
hardware contexts.
We take into account the Niagara T1 model[44].
• CMP with 8 cores
• Each core is a:
– Scalar CPU with in-order execution.
– Interleaved Multithreading with 4 hardware contexts.
– Shared Cache L2 and Shared Floating Point Units.
2.6.3 SUN MAJC
Sun MAJC[46] does not represent a commercial success but it is a very inter-
esting architecture and it is worth observing some of its characteristics.
• CMP with 4 cores.
• Each core is a:
– Superscalar CPU with in-order execution and Very Long Instruction
Word (VLIW).
– Blocking Multithreading with 4 hardware contexts.
– Shared Cache L1.
– Statically Partitioned Instruction Cache.
It is a processor specialized for the execution of Java code and it implements a
support for speculative multithreading (Chapter 5) allowing, specifically, the
parallelization of loops.
From an architectural point of view it has an interesting feature called virtual
channels, which allows the compiler to explicitly pass register values between
threads.
2.6.4 IBM Power
IBM Power5[48] is a dual-core CMP with SMT (two hardware contexts per
core).
The Power5 architecture is very interesting because its design is characterized
by a maximum compatibility with Power4 processor, therefore the comparison
between the two models is very direct.
The main difference between Power4 and Power5 processors is the multithread-
ing feature that the latter provides.
Like the Intel Pentium 4, the IBM Power5 can run both in Single Threaded mode
and in Multi Threaded mode; the main difference between the two processors is
that Power5 uses less static partitioning of resources.
When the IBM Power5 is used in Single Threaded mode, the firmware registers
holding the state related to the hardware contexts can be made available
to the single thread for register renaming.
IBM Power6 [9] is a dual-core CMP with SMT nodes (two hardware con-
texts per core).
Power6 pipeline “becomes” in-order and it eliminates the register renaming
and other run-time structures.
IBM Power7 [10] tries to maximize the performance/energy ratio trying
to get more ILP for a lower clock rate (with respect to its predecessor IBM
Power6).
Power7 increases its ILP by re-introducing out-of-order execution and it
increases the number of threads per core to 4, on a 6-way superscalar CPU.
There are three modes of execution for this model: Single Threaded (ST),
Dual Threaded (SMT2) or Four Threaded (SMT4).
In Table 2.1 there is a list of commercial/non commercial platforms sum-
marizing the basic multithreading properties of each product.
Processor               CMP   N. cores   MT type                       Threads/core
NBS DYSEAC              NO    1          Blocking Multithreading       2
MIT Lincoln Lab TX-2    NO    1          Blocking Multithreading       33
CDC 6600                NO    1          Interleaved Multithreading    10
Denelcor HEP            YES   16         Interleaved Multithreading    128
Horizon                 NO    1          Interleaved Multithreading    128
Tera MTA                YES   4          Interleaved Multithreading    128
Delco TIO               NO    2          Interleaved Multithreading    32
MIT Sparcle             NO    1          Blocking Multithreading       4
Alpha 21464             NO    1          Simultaneous Multithreading   4
Clearwater CNP810SP     NO    1          Simultaneous Multithreading   8
Cosentry LSP-1          YES   64         Interleaved Multithreading    2
Intel Pentium 4         NO    1          Simultaneous Multithreading   2
Sun Niagara T1          YES   8          Interleaved Multithreading    4
Sun Niagara T2          YES   8          Interleaved Multithreading    8
Sun MAJC 5200           YES   2          Blocking Multithreading       4
Sun Rock                YES   4          Interleaved Multithreading    8
IBM Power5              YES   2          Simultaneous Multithreading   2
IBM Power6              YES   2          Simultaneous Multithreading   2
IBM Power7              YES   8          Simultaneous Multithreading   4
Intel Nehalem i7        YES   4          Simultaneous Multithreading   2
Table 2.1: List of Multithreading Platforms
Chapter 3
Prefetching Opportunities
3.1 Introduction
The Introduction (Chapter 1) describes the cache fault unpredictability prob-
lem, giving a brief mention of the prefetching techniques used to mitigate it.
As already discussed, prefetching technique deals with the anticipation of the
blocks in cache, without relying only on the on-demand paging strategy[1][2].
Firstly, this chapter explains the relationship between the cache faults of an
application and its access pattern predictability.
This chapter also presents a case study on a class of algorithms operating
on linked lists, called chasing pointer.
The algorithm will be compiled in D-RISC and analyzed with a
graphical simulation method targeted at the pipelined CPU [1][3].
The same analysis will also be done for superscalar pipelined
CPUs in the two execution variants: in-order and out-of-order.
To mitigate the cache faults of this program, ad-hoc single thread tech-
niques based on compile-time optimizations will be presented, highlighting the
limitations and the advantages regarding the use of each of them.
Finally, this chapter will present a new execution model, not based on single
thread techniques, called speculative precomputation (or helper thread) that
will be applied to the program.
The speculative precomputation relies on multithreading architectural support
and it shows several advantages when compared to the single thread prefetch-
ing techniques (that anyway do not need such support).
3.2 Predictable access pattern
The on-demand paging strategy works quite well in average cases but if we have
specific programs, in which the memory access patterns do not have (only) the
properties of locality and reuse, it is possible to improve the cache efficiency,
and the processor performance, using the prefetching strategy.
The prefetching technique is straightforward for the class of problems showing
predictable access patterns. Thanks to this property, compilers and also the
firmware level are able to apply the prefetching strategy effectively and
thus mitigate the impact of the latency due to cache transfers.
The prefetching activity at firmware level (also called “hardware prefetching”)
is implemented by the cache unit that tracks the requested logical addresses
and when it recognizes a pattern, it starts the load of the next cache block.
It is worth restating that the prefetching activity for data, treated almost
“blindly” at firmware level, may be a source of further inefficiencies; indeed
many machines have a disable prefetching annotation in their instruction set
(ISA).
The prefetching activity for instructions implemented at firmware level is
instead statistically powerful and it is widely exploited by current
commercial machines.
In the class of programs showing predictable access patterns, an important
space is occupied by applications dealing with arrays, whose elements are ac-
cessed in a consecutive way (A[0],A[1],A[2]...) or by strides (A[0],A[10],A[20]...).
In the first case array locations are accessed one after the other and the con-
tiguity of array structures, in the virtual memory of the process, makes the
prefetching activity very easy to apply at both compilation time and run-time.
In this case the compiler is usually able to recognize the prefetching activity to
apply and it can annotate the appropriate LOAD instructions with a prefetching
flag (one additional bit in the LOAD instruction is sufficient).
When the cache receives the LOAD request from the Data Memory unit (DM),
together with the prefetching annotation, it loads the block it currently needs
and the contiguous one. Arrays accessed by strides represent a generalization
of arrays accessed in a consecutive way. The stride represents the index incre-
ment (or step size) between the accesses of the array.
Also in this case the compiler is able to recognize the prefetching activity to
apply, but this time, in addition to the main LOAD instruction, a further
“dedicated” instruction for prefetching may be necessary.
The prefetching annotation may not be enough this time, because additional
space is needed to communicate the stride value, or at least to
address a register containing such value.
There are two options:
• a “long load” instruction (if the ISA allows it). For example we could
have an instruction like LOAD RA, Ri, Ra, Rstride where the Rstride
represents the address of the general register containing the (constant)
value of the stride.
• alternatively the “long load” may be emulated by two independent LOAD
instructions: the LOAD related to the “actual” information needed and
an “empty LOAD” concerning the prefetched block (ex. LOAD RA, Ri,
Ra; LOAD RA, Ri, Rstride;).
These are only two examples of predictable memory access patterns, but
they are the most significant ones in programming practice. These
two types of prefetching are usually supported by the firmware level of the
majority of commercial machines, and compilers generally provide such
optimizations as well.
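At the source level, the role of the “empty LOAD” for strided accesses can be approximated with the GCC/Clang builtin __builtin_prefetch. The following is a minimal sketch under stated assumptions: the function name strided_sum and the parameter dist (how many accesses ahead to prefetch) are introduced here for illustration and do not come from D-RISC.

```c
#include <stddef.h>

/* Sums every stride-th element of a[0..n) and asks the cache for the
 * element that will be used dist accesses later.
 * __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin and is only
 * a hint; the bounds check keeps the prefetched address inside the array. */
long strided_sum(const int *a, size_t n, size_t stride, size_t dist)
{
    long sum = 0;
    if (stride == 0)
        return 0;  /* a zero stride would never advance */
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + dist * stride;
        if (ahead < n)
            __builtin_prefetch(&a[ahead], 0, 1);  /* read access, low reuse */
        sum += a[i];
    }
    return sum;
}
```

The prefetch does not change the result of the computation; it can only change when the cache block transfer starts.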
3.3 Unpredictable access pattern
In the previous section we have introduced the two most significant examples of
predictable access pattern. The prefetching technique is very useful, but there
are classes of problems in which it does not work very well. In this class of
problems we have, for example, algorithms dealing with data structures such
as linked lists, hash tables or indirect array references.
Linked lists suffer from the presence of pointers: the indirection mechanism
of this data structure is in general a source of many cache faults. This happens
because compilers, and especially the firmware level, cannot tell in advance
which block should be allocated in cache ahead of time.
More generally, the algorithms dealing with linked lists, hash tables and
indirect array references can be placed in the class of algorithms having
unpredictable access patterns. Ad-hoc prefetching strategies for algorithms
dealing with linked lists have been widely studied in the literature. Their use
requires particular attention from compilers or programmers.
3.4 Case Study: Chasing Pointer
This section introduces a class of algorithms dealing with linked list data struc-
tures. These algorithms are recognized in literature under the name of Chasing
Pointer.
The dynamics of such algorithms are straightforward: they scan a list
and, for each element, perform calculations that depend on the data
present in the current node. The linked list taken into account is represented
in Figure 3.1.
Figure 3.1: Linked List representation
The head of the list is characterized by:
• the pointer to the first element.
• a field indicating whether the list is empty or not.
The node elements are characterized by:
• a value part (DATA).
• a pointer part (POINTER) to the next element.
• a termination part (END) indicating whether the node is the last
element of the list.
The high level structure of the chasing pointer algorithm is represented by the
source code in C language shown in Listing 3.1.
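struct node {
    int data;              /* value part (the int type is an assumption) */
    struct node *next;     /* pointer to the next element */
};
struct node *ptr, *list_head;

ptr = list_head;
while (ptr) {
    ...
    ...
    ptr = ptr->next;
}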
Listing 3.1: Chasing Pointer Code (C language)
To analyse the structure of the problem from a finer grain perspective, the
program described in Listing 3.1 is compiled in D-RISC obtaining the code in
Listing 3.2.
LOAD Rhead ,1, Rend
IF <> 0 Rend , END
LOAD Rhead , 0, Rpoint
LOOP:
1) LOAD Rpoint , 1, Rend // CACHE FAULT
LOAD Rpoint , 2, Rvalue
LOAD Rpoint , 0, Rpoint
< ... computation ... >
EL) IF=0 Rend , LOOP
END
Listing 3.2: Chasing Pointer Code (D-RISC assembler)
The program has the structure of a do-while and it is characterized by an
initialization phase operating on the head of the list, retrieving the pointer to
the first element or, if the list is empty, causing the termination of the process.
The guard of the while depends on the end field value.
The general registers are characterized in this way:
• Rhead represents the address of general register containing the logical
memory address of the head of the list: it is initialized at compilation
time.
• Rend, Rpoint and Rvalue are temporary registers.
Thanks to the D-RISC compilation it is possible to proceed with an analysis
of the completion time, using a simple graphical simulation model.
Given that TC ≈ TLoop and given this specific loop, it is possible to state that,
in the worst case, a cache fault occurs at each iteration and so the total
number of faults is O(N).
Given a pipelined CPU with a strict in-order strategy, when the DM unit
detects a fault it blocks the execution of the program, and the latency
caused by the transfer of the cache block is fully paid in the service time.
Figure 3.2 shows the latency caused by the transfer of the cache block.
The computation block represents the set of the “other” instructions inside the
loop.
Figure 3.2: Graphical Simulation of Chasing Pointer Algorithm - 1
Let us take into account a pipelined CPU with some degree of out-of-order
execution: when a cache fault is detected by DM and there are no data
dependencies involving the target registers of the instruction causing the cache
fault, the pipeline does not block.
Assume that the target pipelined CPU has this facility and observe Figures 3.3
and 3.4 that show a graphical simulation with such architecture.
Figure 3.3: Graphical Simulation of Chasing Pointer Algorithm - 2
Figure 3.4: Graphical Simulation of Chasing Pointer Algorithm - 3
Let us characterize the computation block in two parts:
• comp1 is a set of instructions that do not suffer data dependencies with
the target register of the instruction causing the cache fault.
• comp2 is a set of instructions that suffer (also indirectly) these data
dependencies (computation = comp1 ∪ comp2).
With such an architecture we can fully or partially mask the latency caused by
cache faults together with data dependencies (Figures 3.3 and 3.4 respectively),
and the same considerations can be made for a pipelined superscalar CPU.
As already mentioned, for chasing pointer algorithms there are ad-hoc single
thread prefetching techniques that try to mitigate the impact of the latency
caused by cache faults.
In this chapter we present two well-known approaches that can be adopted by
compilers or programmers, recognized in literature as:
• Natural pointer
• Jumper pointer
3.5 Natural Pointer
To counter the latency due to cache faults and data dependencies in chasing
pointer algorithms, the Natural Pointer technique suggests inserting a
prefetching instruction in the loop to anticipate the cache block request by
one iteration.
The concept is simple: as soon as a process knows the logical address of the
cache block related to the next node, it starts the prefetching activity.
This optimization works very well if there is enough calculation to fully overlap
the cache fault latency. Listing 3.3 shows the prefetching instruction inside the
loop, executed as soon as the pointer value related to the next node is available.
The “empty load” loads the block related to the logical address of the next
node in cache, one iteration before its use.
LOOP:
LOAD Rpoint , 1, Rend
LOAD Rpoint , 2, Rvalue
LOAD Rpoint , 0, Rpoint
LOAD Rpoint , 1 // EMPTY LOAD (prefetch instruction)
< ... computation ... >
IF=0 Rend , LOOP
Listing 3.3: Chasing Pointer Code with prefetching in D-RISC
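The same idea can be mirrored at the C level. This is a hedged sketch rather than the compiler output discussed in the text: the function name sum_list_natural, the int payload and the use of a sum as the “computation” are assumptions, and __builtin_prefetch (a GCC/Clang builtin) stands in for the D-RISC empty load.

```c
#include <stddef.h>

struct node {
    int data;             /* value part (int is an assumption) */
    struct node *next;    /* pointer to the next element */
};

/* Scans the list and, as soon as the address of the next node is known,
 * issues a prefetch hint for it, one iteration before its use. */
long sum_list_natural(const struct node *ptr)
{
    long sum = 0;
    while (ptr) {
        const struct node *next = ptr->next;
        if (next)
            __builtin_prefetch(next, 0, 3);  /* plays the role of the empty load */
        sum += ptr->data;  /* stands in for < ... computation ... > */
        ptr = next;
    }
    return sum;
}
```

As in the D-RISC version, the prefetch can hide the fault only if the computation between the hint and the next iteration is long enough.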
Figure 3.5 shows that the prefetching instruction is invoked early enough to
retrieve the cache block needed for the next iteration without paying any addi-
tional latency due to the cache fault (full latency overlapping). The blue color
represents the non overlapped cache fault latency and the fault cache block
(yellow) represents the overall latency related to the cache block transfer.
Figure 3.5: Graphical Simulation of Chasing Pointer Algorithm - 4
The latency related to the transfer of the cache blocks from the main memory
(M) or secondary cache (C2) to the primary cache (C1) may not be fully
overlapped by the calculations. Figure 3.6 shows a case in which the latency
is partially overlapped by calculations.
Figure 3.6: Graphical Simulation of Chasing Pointer Algorithm - 5
3.6 Jumper Pointer
The Natural Pointer technique is a good option if there is enough calculation
to overlap the latency.
If the Natural Pointer does not allow a full overlapping of the latency, there is a
very elegant solution (called Jumper Pointer) that can be taken into account.
The idea behind the Jumper Pointer is to start the prefetching activity for
a certain node k iterations before using it (k is known in the literature as the
Prefetching Distance).
There are few studies indicating how to find the optimal Prefetching Distance,
but the value usually used is $k = \lceil T_{transf} / T_{iter} \rceil$, where
$T_{transf}$ is the latency related to the transfer of a block from the memory
to the primary cache and $T_{iter}$ is the time spent for an entire iteration.
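As a worked example of this rule, the ceiling of the ratio can be computed with integer arithmetic. The function name prefetch_distance and the choice of returning 0 for a zero iteration time are assumptions made for this sketch; both times are expected in the same unit (for instance clock cycles).

```c
/* Prefetching distance k = ceil(t_transf / t_iter).
 * Returns 0 when t_iter is 0, where the ratio is undefined
 * (a sentinel chosen for this sketch). */
unsigned prefetch_distance(unsigned t_transf, unsigned t_iter)
{
    if (t_iter == 0)
        return 0;
    return t_transf / t_iter + (t_transf % t_iter != 0);
}
```

For example, with a transfer latency of 100 cycles and 30 cycles per iteration the distance is 4 nodes, while with 90 cycles it is exactly 3.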
The application of this technique involves an initial transient phase and a
steady state phase. In the initial transient phase the cache fault latency is fully
paid while in the steady state phase a full latency overlapping state is reached.
To use this technique it is fundamental to have an idea of the transient phase
and steady state “amount” characterizing the specific program.
The Jumper Pointer technique is implemented adding a pointer to each
node of the linked list, pointing to some node ahead (Figure 3.7).
Figure 3.7: Linked List representation for Jumper Pointer Technique
The head of the list contains two fields:
• a pointer to the first element.
• a field indicating whether the list is empty or not.
The node elements of the list are characterized by:
• a value part (DATA).
• a pointer part (POINTER) to the next element.
• a termination part (END) indicating whether the node is the last element
of the list.
• a new field called (POINTER N) pointing to some node ahead in the
list.
Listing 3.4 shows the possible D-RISC code implementing such strategy.
This time the prefetching activity is not applied on the next node of the list,
but on k nodes ahead (where k is the calculated prefetching distance).
LOAD Rhead ,0, Rend
IF <> 0 Rend , END
LOAD Rhead , 1, Rpoint
LOOP:
LOAD Rpoint , 1, Rend
LOAD Rpoint , 2, Rvalue
LOAD Rpoint , 0, Rpoint
LOAD Rpoint , 3, Rpointer_n
LOAD Rpointer_n , 1 // EMPTY LOAD
< ... computation ... >
IF=0 Rend , LOOP
END
Listing 3.4: Chasing Pointer Code with Jumper Pointer solution in D-RISC
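A C-level sketch of the same strategy follows. It is a variant written for illustration, not a translation of Listing 3.4: the names jnode, install_jump_pointers and sum_list_jump are hypothetical, the jump pointer is read from the current node, and __builtin_prefetch (GCC/Clang) replaces the empty load.

```c
#include <stddef.h>

struct jnode {
    int data;             /* value part (int is an assumption) */
    struct jnode *next;   /* POINTER: next element */
    struct jnode *jump;   /* POINTER N: node k positions ahead, or NULL */
};

/* Sets the jump pointer of every node to the node k positions ahead. */
void install_jump_pointers(struct jnode *head, size_t k)
{
    struct jnode *lead = head;
    for (size_t i = 0; i < k && lead; i++)
        lead = lead->next;
    for (struct jnode *p = head; p; p = p->next) {
        p->jump = lead;
        if (lead)
            lead = lead->next;
    }
}

/* Scans the list, prefetching the node k iterations ahead of its use. */
long sum_list_jump(const struct jnode *ptr)
{
    long sum = 0;
    while (ptr) {
        if (ptr->jump)
            __builtin_prefetch(ptr->jump, 0, 3);
        sum += ptr->data;  /* stands in for < ... computation ... > */
        ptr = ptr->next;
    }
    return sum;
}
```

The first k nodes are never the target of a jump pointer, which corresponds to the transient phase in which the cache fault latency is fully paid.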
There are a few specific cases in which the Jumper Pointer technique does not
work well: the most significant is the case in which Ttransf >> Titer (the latency
caused by the cache fault is much bigger than the time spent for an entire
iteration). In fine-grain computations the Prefetching Distance can turn out
too large, and therefore it is possible to have a long transient phase
characterized by a sequence of fully paid cache faults.
To address this problem, a further optimization is the allocation, in the head
of the list, of a number of pointers equal to the calculated Prefetching
Distance (Figure 3.8). Listing 3.5 shows the D-RISC code implementing such
optimization. Note the sequence of empty loads before the loop
starts.
Figure 3.8: Linked List representation for Jumper Pointer Technique with transient
phase optimization
LOAD Rhead ,0, Rend
IF <> 0 Rend , END
LOAD Rhead , 1, Rpoint
LOAD Rhead , 2 // EMPTY LOAD
LOAD Rhead , 3 // EMPTY LOAD
...
LOAD Rhead , D-2 // EMPTY LOAD
LOOP:
LOAD Rpoint , 1, Rend
LOAD Rpoint , 2, Rvalue
LOAD Rpoint , 0, Rpoint
LOAD Rpoint , 3, Rpointer_n
LOAD Rpointer_n , 1 // EMPTY LOAD
< ... computation ... >
IF=0 Rend , LOOP
END
Listing 3.5: Chasing Pointer Code with Jumper Pointer solution with transient
optimization in D-RISC
3.7 Speculative Precomputation
As already introduced in Chapter 1, the Speculative Precomputation (or Helper
Thread) is a technique used for prefetching activity.
Basically, helper threads execute in advance the same instructions of the
main thread that cause cache faults.
As regards the chasing pointer case study, let us assume that the linked list
we have to deal with does not have many elements (the N value is small) and
that the computation is fine-grained, such that Ttransf >> Titer.
In this case, neither the Natural Pointer nor the Jumper Pointer solution is
a good option.
Now let us assume we have a superscalar pipelined CPU with SMT (Figure 3.9)
and observe the sharing of the primary cache between the two hardware
contexts.
Figure 3.9: Multithreaded CPU with replication of IU and EU Master
Given this architectural support, the helper thread schema could represent
a very interesting solution, allocating the helper thread and the main thread
to different hardware contexts of the same CPU.
The concept of Prefetching Distance is very important for the speculative
precomputation technique too because, to obtain any benefit, it is fundamental
to activate the helper thread early enough. The optimal Prefetching Distance in
time is represented by Ttransf (latency related to the transfer of a block from
memory to primary cache).
Provided a well-calculated prefetching distance and an underlying multithreaded
architecture, the speculative precomputation technique represents in general
the best possible solution to mitigate the cache fault unpredictability problem.
As regards the case study, it works well under all the assumptions considered.
The compiler (or programmer) should calculate the prefetching distance and
insert a SPAWN instruction in the code of the main thread (see Listing 3.6).
MAIN THREAD ::
...
...
SPAWN HELPER THREAD
...
...
LOAD Rhead ,0, Rend
IF <> 0 Rend , END
LOAD Rhead , 1, Rpoint
LOOP:
LOAD Rpoint , 1, Rend
LOAD Rpoint , 2, Rvalue
LOAD Rpoint , 0, Rpoint
< ... computation ... >
IF=0 Rend , LOOP
END
HELPER THREAD ::
LOOP:
LOAD Rpoint , 1, Rend , not_deallocate
LOAD Rpoint , 0, Rpoint
IF=0 Rend , LOOP
Listing 3.6: Chasing Pointer Code with helper thread in D-RISC
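The cooperation between the two threads can be imitated in C with POSIX threads. This is only a rough software analogy under strong assumptions: pthread_create is far heavier than the SPAWN instruction described in the text, the two threads benefit from each other only if they run on hardware contexts sharing a cache, and the names helper_walk and sum_list_with_helper are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

struct node {
    int data;             /* value part (int is an assumption) */
    struct node *next;    /* pointer to the next element */
};

/* Helper thread: follows the pointers only, so that the blocks of the
 * nodes are brought into the cache. It performs no computation and
 * never writes to the list. */
static void *helper_walk(void *arg)
{
    const struct node *p = arg;
    while (p) {
        __builtin_prefetch(p, 0, 3);  /* GCC/Clang hint */
        p = p->next;
    }
    return NULL;
}

/* Main thread: spawns the helper, then performs the real computation. */
long sum_list_with_helper(const struct node *head)
{
    pthread_t helper;
    int spawned = pthread_create(&helper, NULL, helper_walk, (void *)head) == 0;

    long sum = 0;
    for (const struct node *p = head; p; p = p->next)
        sum += p->data;  /* stands in for < ... computation ... > */

    if (spawned)
        pthread_join(helper, NULL);
    return sum;
}
```

Both threads only read the list, so the result of the main thread does not depend on how far the helper has advanced; if the helper cannot be created, the main thread simply runs without prefetching.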
The SPAWN instruction support should quickly copy the content of general
registers between different hardware contexts and change the thread switching
mode, starting to fetch instructions of the helper thread.
The firmware interpreter of the pipelined CPU[1][3] for a multithreading ar-
chitecture could support such instruction in this way:
• The Instruction Unit (IU) sends the logical address of the first instruction
of the helper thread to the Instruction Memory (IM) unit.
• The Instruction Memory (IM) unit, once it has received this address, starts
the thread switching in SMT mode.
A “fast” copy of the register contents between different contexts of a
multithreading architecture requires a new dedicated mechanism at firmware
level.
If the architecture does not provide such a feature, the copy can be
implemented by a software-based mechanism.
The work [16] claims that, if the overhead of the helper thread’s startup is very
small compared to the helper thread computation, the software-based solution
can be sufficient to accomplish this task in an efficient way.
There are many approaches in literature regarding the way in which helper
threads are built.
The most straightforward technique consists in identifying windows
composed of the instruction sequences preceding critical instructions in the main
thread: such a window is known in the literature as a Precomputational Slice.
A subset of this window is then selected. Usually this subset corresponds to the
minimal instruction subgraph, in terms of dependencies, needed to execute the
critical instructions correctly.
Helper thread instructions are generated at compilation time and they belong
to the run-time support for multithreading optimization.
Chapter 4
Solving the Branch Problem
A well-known problem in pipelined scalar/superscalar CPUs is represented by
the Branch Problem (see Chapter 1).
Pipelined CPUs suffer from branch instructions for two main reasons:
• “code jumps”.
• logical dependencies regarding registers on the content of which predi-
cates must be evaluated.
However, “code jumps” do not cause significant degradation of programs,
while logical dependencies on registers related to the conditional branches can
be primary causes of degradation.
This chapter firstly evaluates the branch problem through a graphical simula-
tion aid and then presents different solutions aiming to mitigate the degrada-
tions related to it.
Among these solutions there are:
• Compilation based techniques
• Branch Prediction
• Multipath Execution
• Threaded Multipath Execution
All these techniques will be discussed in the next sections of this chapter.
4.1 Case of Study: the Branch Problem
This section illustrates the branch problem with some examples. Listing 4.1 shows the array sum program.
int A[N], B[N], C[N];
for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
Listing 4.1: Array Sum
Suppose the compiler establishes that:
• the general register of address Ri contains the value of the index variable i, initialized at 0 at compilation time.
• the general registers of addresses RA, RB and RC contain the base addresses of the arrays A, B and C, initialized at compilation time.
• the general registers of addresses Ra and Rb act as temporary registers.
• the general register of address RN contains the constant N.
Listing 4.2 shows a possible D-RISC compilation of this program and Figure 4.1 shows graphically the bubbles created by the conditional branch.
LOOP:
LOAD RA , Ri , Ra
LOAD RB , Ri , Rb
ADD Ra , Rb , Ra
STORE RC, Ri, Ra
INCR Ri
IF< Ri, RN, LOOP
Listing 4.2: Array Sum D-RISC code
As regards the latter figure, we have two different kinds of bubbles:
• the “orange” bubble is the one related to the “code jump”. This degradation costs only one time slot.
• the “yellow” bubble is the one related to the logical dependency on register Ri between the INCR and the IF instructions. In this example the logical dependency costs only one time slot, but in general the degradation can be worse than this.
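The effect of the two bubbles on the loop can be quantified with a very small cost model. The sketch below assumes an ideal pipeline that otherwise completes one instruction per time slot t, and it counts only the two bubbles identified above; the function names are introduced here for illustration only.

```python
from fractions import Fraction

def iteration_time(n_instructions, jump_bubble, dependency_bubble):
    """Time slots per loop iteration: one per instruction plus the bubbles."""
    return n_instructions + jump_bubble + dependency_bubble

def service_time_per_instruction(n_instructions, jump_bubble, dependency_bubble):
    """Average time slots per instruction, kept as an exact fraction."""
    total = iteration_time(n_instructions, jump_bubble, dependency_bubble)
    return Fraction(total, n_instructions)

# Listing 4.2: six instructions, one slot for the jump, one for INCR -> IF.
t_iter = iteration_time(6, 1, 1)
t_instr = service_time_per_instruction(6, 1, 1)
```

Under these assumptions the iteration takes 8t, that is 8t/6 per instruction instead of the ideal t. A larger dependency bubble enters the same formula through `dependency_bubble`.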
Figure 4.1: Array Sum Graphical Simulation 1
To show the potential degradations caused by data dependencies combined with branch instructions, we slightly change the array sum program (Listing 4.3), replacing the increment instruction with a high-latency instruction (or group of instructions).
int A[N], B[N], C[N];
for (int i = 0; i < N; i = arithmetic_expression(i))
    C[i] = A[i] + B[i];
Listing 4.3: Array Sum with Branch Problem
The D-RISC implementation of this new program is described in Listing 4.4, where the ARITHMETIC_EXPRESSION(Ri) block represents a group of instructions calculating, at each iteration of the loop, the value of the index i.
LOOP:
LOAD RA , Ri , Ra
LOAD RB , Ri , Rb
ADD Ra , Rb , Ra
STORE RC, Ri, Ra
ARITHMETIC_EXPRESSION(Ri)
IF< Ri, RN, LOOP
Listing 4.4: Array Sum with Branch Problem D-RISC code
If we replace the ARITHMETIC_EXPRESSION(Ri) block with an expensive MUL operation costing about five time slots (see Listing 4.5), the bubble caused by the logical dependency becomes considerably larger.
LOOP:
...
MUL Ri , 3, Ri
IF< Ri, RN, LOOP
Listing 4.5: Array Sum with Branch Problem (Multiplication) D-RISC code
The pair of instructions highlighted in Listing 4.5 (MUL and IF) causes a stall of the pipeline, because the EU is still computing the new value of the general register Ri while the IF instruction is in its IU phase.
If the ARITHMETIC_EXPRESSION(Ri) block were replaced with a chain of multiple dependent MULs (or other “expensive” operations), the effect on the service time would get worse.
These degradations represent a cost in terms of performance: if the target machine of the application is a superscalar architecture, they may lead to wasting a great number of instructions.
The next sections of this chapter explain some techniques to mitigate the problems exemplified in this case study.
4.2 Compilation Time techniques
Compilation time techniques can mitigate the impact of the degradations caused by the branch problem[1][4].
These optimizations try either to reduce the branch probability or to exploit the bubbles introduced by the branch instructions to execute useful work.
Solutions that reduce the branch probability include:
• Macro Expansion
• Loop Unfolding
These techniques increase the virtual memory space and the working set of the processes, but they are simple and interesting forms of optimization that have no impact on the firmware level of the underlying architecture.
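As an illustration of how loop unfolding reduces the number of executed branches, the following sketch compares the array sum with the same loop unfolded by a factor of 2. It is written at source level rather than in D-RISC, the counter `branches` only models the loop guard (the IF instruction), and the unfolded version assumes an even number of elements.

```python
def array_sum(a, b):
    """Plain loop: one guard evaluation (conditional branch) per element."""
    c = [0] * len(a)
    branches = 0
    i = 0
    while i < len(a):
        c[i] = a[i] + b[i]
        i += 1
        branches += 1            # models IF< Ri, RN, LOOP
    return c, branches

def array_sum_unfolded(a, b):
    """Loop unfolded by 2: the body is replicated, the guard runs half as often."""
    c = [0] * len(a)
    branches = 0
    i = 0
    while i < len(a):            # assumes len(a) is even
        c[i] = a[i] + b[i]
        c[i + 1] = a[i + 1] + b[i + 1]
        i += 2
        branches += 1
    return c, branches

a, b = list(range(8)), [10] * 8
plain_result, plain_branches = array_sum(a, b)
unfolded_result, unfolded_branches = array_sum_unfolded(a, b)
```

Both versions compute the same array, but the unfolded one executes half of the conditional branches at the price of a longer loop body, which is the code size increase mentioned above.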
Another interesting technique, acting mainly at compilation time, is the delayed branch. It tries to exploit the bubbles created by the branch instructions to do useful work.
Let us take into account the array sum example described and compiled in Listings 4.1 and 4.2. Listing 4.6 shows the equivalent program compiled with a delayed branch annotation.
LOAD RA , Ri , Ra
LOOP:
LOAD RB , Ri , Rb
ADD Ra , Rb , Ra
STORE RC, Ri, Ra
INCR Ri
IF< Ri, RN, LOOP , delayed_branch
LOAD RA , Ri , Ra
Listing 4.6: D-RISC Array Sum 3
Thanks to the delayed branch annotation, the instruction LOAD RA, Ri, Ra has been moved after the IF instruction, so that the loop restarts from the LOAD RB, Ri, Rb instruction. In this way the bubble due to the branch instruction is completely eliminated. However, the delayed branch technique also affects the design of the firmware architecture of the CPU: the Instruction Unit (IU) must avoid discarding instructions coming from the Instruction Memory (IM) whose identifiers do not match.
There are cases in which compilers do not apply this technique, because the hypothetical movement of instructions would cause further logical dependencies on registers in the rest of the program.
4.3 Branch Prediction
Branch Prediction is a technique targeted at out-of-order architectures: it consists in predicting the path that a particular conditional branch instruction will take.
This technique can be very useful, in particular for conditional branches with predictable outcomes.
Branch prediction can be implemented with an annotation schema or entirely at firmware level.
The former solution requires the predictable branches to be annotated at compile time. In the literature these annotation flags are usually called likely and unlikely, and they suggest the outcome of the instruction to the firmware level.
For example, the loop of Listing 4.5 can be modified by adding the likely annotation to the conditional branch representing the guard of the loop (Listing 4.7).
LOOP:
LOAD RA , Ri , Ra
LOAD RB , Ri , Rb
ADD Ra , Rb , Ra
STORE RC, Ri, Ra
MUL Ri , 3, Ri
IF< Ri, RN, LOOP , likely
Listing 4.7: D-RISC Array Sum 4 (with branch prediction)
As regards Listing 4.7, when the IF instruction is decoded by the IU, the execution continues with the first instruction of the LOOP, following the “suggestion” of the compiler. The out-of-order facility of the processor is exploited, which can be really advantageous when the prediction is correct. The overhead in case of a wrong prediction, however, can be really high. The annotation mechanism could simplify the pipelined CPU architecture (in particular the complexity of the Instruction Unit), saving some useful on-chip space.
Branch prediction can also be implemented completely at firmware level (a common practice for commercial superscalar architectures).
As regards the pipelined CPU, when the Instruction Unit (IU) receives an IF instruction it tries to predict the path to take. This implementation requires an additional component called Branch Predictor, whose aim is to predict the right path the program execution will take.
Branch Predictor units usually follow very simple implementation schemes, as they are just counter based.
The firmware implementations of the most common branch predictors are shown in Figures 4.2 and 4.3. Figure 4.2 shows a one-bit counter: it records the outcome of the last execution of the branch and predicts the same path.
Figure 4.2: State Machines diagram for Branch Prediction with 1-bit
The state machine shown in Figure 4.3 is slightly more complex and requires two bits to identify all its states. Basically there are two macro states, Predict Taken and Predict Not Taken (referring to a certain path), and to switch from one to the other the same branch must be taken, or not taken, twice in succession. The latter is a real branch predictor implementation used for commercial purposes[56].
Figure 4.3: State Machines for Branch Prediction with 2-bit
These two predictors are based on counters of a few bits, but there is a wide scientific and industrial literature about predictors based on counters of three, four and more bits.
A Branch Prediction mechanism has to implement the commit phase in which, if the path prediction proves to be correct, the intermediate results produced during the predicted execution become part of the architectural state of the processor.
Branch Prediction also requires mechanisms implementing the recovery phase, which restores a previous consistent state. This phase is “activated” in case of prediction failure.
When the IU determines whether the previous branch prediction was correct, it starts the commit or the recovery phase accordingly.
The architectural support for branch prediction consists of additional copies of registers, used either to save data at the beginning of the computation “under condition” or to hold the temporary results of the branch executed “under condition”. The next section presents a possible implementation of branch prediction for the pipelined CPU[1][2].
4.4 Branch Prediction for Pipelined CPU
Let us see a possible implementation of Branch Prediction. Assume the compiler statically annotates some conditional branches with likely/unlikely annotations indicating whether and how to apply the branch prediction. As already explained in the previous section, the compiler may be able to estimate whether the probability of taking a branch is high or not.
At firmware level, when the IU receives an IF instruction annotated with the likely flag, it checks whether the instruction is involved in a logical dependence on registers with EU, that is, it checks the semaphores associated to the registers referred (Figure 4.4 (1)).
If the semaphores indicate that IU already has an updated version of the registers, it proceeds normally without any extra activity related to the branch prediction. Otherwise, if a logical dependency is in progress on at least one of the registers referenced by the conditional branch instruction, the branch prediction is applied.
The IU switches to branch prediction mode and sends messages to the EU and DM units, telling them to switch to branch prediction mode as well (Figure 4.4 (2)). The branch prediction mode slightly changes the firmware interpreter of the IU, DM and EU units.
Both IU and EU adopt the technique of Register Renaming, that is, the dynamic allocation of the registers visible at assembler level onto different registers at firmware level.
Suppose the architecture has 64 registers visible at assembler level and 64 additional firmware registers. Among the additional registers, assume that some are dedicated to the branch execution (to temporarily save the results of the execution “under condition”).
Suppose also that a replicated associative memory, both in IU and EU, implements a direct access table used to allocate the registers dynamically (MAPTABLE). When the IU and EU units enter the branch prediction mode, they use the registers dedicated to the branch execution to perform the operations: the MAPTABLE represents the register translation function for the Register Renaming technique. For IU and EU, a proper initialization of the firmware registers used for the branch execution is important. It can be done through the use of a single bit.
The DM unit is supposed to operate with a buffer (called Branch Buffer or, more generally, Speculative Buffer) for LOAD/STORE operations during the branch execution.
Once the IU-EU dependency causing the “activation” of the branch execution is resolved, IU compares the predicted outcome of the branch with the actual one. If they match, a commit phase starts (Figure 4.4 (3)). The IU switches from branch prediction mode to the “normal” mode and sends a commit message to DM and EU (Figure 4.4 (4)). Once the commit message has been received, DM transfers the content of the Branch Buffer to the primary cache. IU and EU copy the content of all the registers used in branch prediction mode into the registers for the “normal” execution.
If there is no match and the branch prediction fails (Figure 4.4 (5)), IU starts a “recovery” phase, which restores the last consistent architectural state of the processor. IU sends recovery messages to DM and EU (Figure 4.4 (6)) and both IU and EU get back to the “normal” mode.
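The commit and recovery phases described above can be summarized by a small software model. This is only a sketch under simplifying assumptions: a dictionary stands in for the registers dedicated to the branch execution, another one for the Branch Buffer, and the class and method names are hypothetical. It does not model the MAPTABLE or the messages exchanged among IU, EU and DM.

```python
class SpeculativeState:
    """Toy model of execution "under condition" with commit and recovery."""

    def __init__(self, registers, memory):
        self.registers = dict(registers)   # committed architectural registers
        self.memory = dict(memory)         # stands in for the primary cache
        self.shadow_registers = {}         # registers dedicated to branch execution
        self.branch_buffer = {}            # Branch Buffer used by DM
        self.speculating = False

    def enter_prediction_mode(self):
        self.speculating = True

    def read_register(self, name):
        if self.speculating and name in self.shadow_registers:
            return self.shadow_registers[name]
        return self.registers[name]

    def write_register(self, name, value):
        target = self.shadow_registers if self.speculating else self.registers
        target[name] = value

    def store(self, address, value):
        target = self.branch_buffer if self.speculating else self.memory
        target[address] = value

    def commit(self):
        """Prediction correct: speculative results become architectural state."""
        self.registers.update(self.shadow_registers)
        self.memory.update(self.branch_buffer)
        self._leave_prediction_mode()

    def recover(self):
        """Prediction wrong: drop the speculative results."""
        self._leave_prediction_mode()

    def _leave_prediction_mode(self):
        self.shadow_registers.clear()
        self.branch_buffer.clear()
        self.speculating = False

def run_example(prediction_correct):
    cpu = SpeculativeState({"Ri": 3}, {100: 0})
    cpu.enter_prediction_mode()
    cpu.write_register("Ri", cpu.read_register("Ri") * 3)   # executed under condition
    cpu.store(100, cpu.read_register("Ri"))
    if prediction_correct:
        cpu.commit()
    else:
        cpu.recover()
    return cpu.registers, cpu.memory
```

With a correct prediction the new value of Ri and the buffered STORE become visible. With a wrong prediction the state is the one that preceded the switch to branch prediction mode.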
Figure 4.4: Branch Prediction Schema for Pipelined CPU
4.5 MultiPath Execution
As discussed in Chapter 1, Branch Prediction may not be sufficient to mitigate the Branch Problem.
Another, much more aggressive, technique to solve it is called Multipath Execution.
Multipath Execution is the execution of code down both paths of one or more branches, discarding the incorrect results when a branch is resolved.
This technique can support three types of execution:
• Unipath[49].
• Balanced Multipath[50].
• Skewed Multipath.
Unipath execution is equivalent to branch prediction: when a branch is encountered, the execution proceeds only down the predicted path. This process can be repeated.
Balanced Multipath consists in forking the execution into two paths when a branch is encountered: the if and the else branches.
The major drawback of Balanced Multipath execution is the exponential growth of the required resources as the branch depth increases.
Skewed Multipath Execution represents a trade-off between the Unipath and Balanced approaches. It covers the different variants that can be obtained depending on how the execution tree evolves.
Skewed Multipath Execution is not treated in this chapter, but the list below points to the scientific literature related to this topic.
Skewed Execution can be categorized in:
• Theory Based[49][51][52][53]
• Heuristic: Static[49]
• Heuristic: Dynamic[54][26]
• Pseudo Random[55]
4.6 Threaded Multipath Execution
Threaded Multipath Execution consists in using the idle contexts of an SMT processor to implement multipath execution.
The basic idea is to use the SMT hardware contexts to increase the instruction level parallelism when facing a branch instruction. In this case, the execution continues on both paths but on different contexts.
When the branch is resolved, the thread executing the wrong path is canceled (or squashed) and its context becomes free for a new computation.
To support this technique, the authors of Threaded Multipath Execution proposed a new hardware component called Mapping Synchronization Bus (MSB) (see Figure 4.5), whose role is to copy the register maps among different threads and to keep the maps updated when alternate paths are spawned onto the idle hardware contexts. The authors assumed that the general registers are shared among the contexts; they therefore observed that there is no need to copy the content of all the registers to start a new execution path, and that copying the register map is sufficient.
Figure 4.5: A register mapping scheme for a simultaneous multithreading processor, with an added mapping synchronization bus to enable threaded multi-path execution
4.7 Threaded Multipath Execution for Pipelined CPU
This section presents a possible implementation of Threaded Multipath Execution for the pipelined CPU, assuming a compiler that statically annotates some conditional branches with a threaded multipath execution flag. We assume to have a mechanism such as the Mapping Synchronization Bus (MSB) both in IU and EU, whose purpose is to quickly copy the content of the general registers from one context to another.
At firmware level, when the IU receives an IF instruction annotated with threaded multipath execution, it checks whether the instruction is involved in a logical dependence with EU, that is, it checks the semaphores associated to the registers referred (Figure 4.6 (1)).
IU keeps track of the waiting threads and of the registers whose updates they are waiting for. It does so through a descriptor made of: <TAG, Instruction Address, Instruction COP, content of register 1, content of register 2, content of register 3 OR offset jump>.
If the IU already has an updated version of the registers referred by the IF instruction, it proceeds without taking the multipath execution into account.
If IU does not yet have an updated version of such registers, it first checks whether there are enough idle hardware contexts to start a multipath execution. IU tracks the threads currently active in the processor through a list of active threads organized as a tree, which records the relationship between the spawned branches.
If there are enough idle contexts to spawn the two branches in two different hardware threads (the if branch and the else branch), IU sends a multipath execution message to IM, DM and EU. In the message directed to IM it specifies the logical address of the if branch and the logical address of the next instruction (Figure 4.6 (2)).
DM and EU receive the multipath execution message with the context/tag involved.
DM starts to use the Speculative Buffer for LOAD and STORE operations, tracking the context/tag that issues each request.
Both IU and EU do a fast copy of the content of the general registers through the MSB mechanism.
IM starts to send instructions belonging to both paths of the conditional branch using different hardware contexts. This procedure can be executed recursively and, as we explained in the previous section, the structure of the unresolved conditional branches can evolve in different ways, depending in particular on the resources available at run-time.
If the IU receives an IF instruction with a threaded multipath execution annotation and there are no idle contexts, it does not apply the multipath execution.
When the logical dependence is resolved, the IU sends IM the “cancel” messages with the identifiers of the threads related to the wrong branches (Figure 4.6 (3)).
The IU sends DM the identifier of the right branch, so that DM selectively moves the data written in the Speculative Buffer to the primary cache.
IU destroys the subtree related to the wrong path in the list of the active threads, and also the node related to the original branch itself.
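The bookkeeping performed by the IU on the tree of active threads can be sketched as follows. The sketch makes assumptions that the text does not state: the path following the next instruction keeps the context of the parent, so only one idle context is needed per spawn, and the names `PathNode`, `ActiveThreads`, `spawn` and `resolve` are hypothetical. Messages to IM, DM and EU and the MSB copy are not modelled.

```python
class PathNode:
    """One execution path; children are the two sides of an unresolved IF."""
    def __init__(self, context):
        self.context = context
        self.children = {}      # {"taken": PathNode, "not_taken": PathNode}

class ActiveThreads:
    def __init__(self, n_contexts):
        self.idle = set(range(1, n_contexts))   # context 0 runs the main thread
        self.root = PathNode(0)

    def spawn(self, node):
        """Fork at an annotated IF; refuse when no idle context is available."""
        if not self.idle:
            return False
        new_context = min(self.idle)
        self.idle.remove(new_context)
        node.children = {"not_taken": PathNode(node.context),
                         "taken": PathNode(new_context)}
        return True

    def resolve(self, node, outcome):
        """Cancel the wrong subtree and drop the node of the resolved branch."""
        wrong_side = "not_taken" if outcome == "taken" else "taken"
        cancelled = self._contexts(node.children[wrong_side])
        self.idle |= cancelled                   # squashed contexts become idle
        survivor = node.children[outcome]
        node.context, node.children = survivor.context, survivor.children
        return cancelled

    def _contexts(self, node):
        found = {node.context}
        for child in node.children.values():
            found |= self._contexts(child)
        return found

tree = ActiveThreads(4)
tree.spawn(tree.root)                          # first annotated IF
tree.spawn(tree.root.children["taken"])        # nested IF on the taken path
tree.spawn(tree.root.children["not_taken"])    # nested IF on the other path
refused = not tree.spawn(tree.root.children["taken"].children["taken"])
cancelled = tree.resolve(tree.root, "not_taken")
```

In this example the fourth spawn is refused because no idle context is left, and the resolution of the first branch frees the two contexts that were executing the taken subtree.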
Figure 4.6: Threaded Multipath Execution Schema
Chapter 5
Other opportunities
5.1 Communication Threads
5.1.1 Communications in Parallel Applications
Parallel applications can be described by computational graphs, whose nodes represent computational modules cooperating with each other.
Let us assume a message passing environment in which the edges of the graph represent communication channels, used to exchange information between the modules.
Such computation graphs are usually structured according to well-known schemes of parallel computation. These schemes (also called paradigms) have their own specific cost models.
Parallel applications expressed as computational graphs are usually compiled into a set of processes, whose cooperation is defined by a concurrent programming language.
These languages usually provide primitives for interprocess communication, and their combination achieves the effect of transferring a message from a sender to a receiver through a communication channel.
5.1.2 Communication optimizations
Communications have a remarkable impact on the performance of parallel applications, therefore their negative effects need to be mitigated.
As regards the cost models of the parallel schemes, a fundamental metric is the interprocess communication latency Lcom (the average time needed to execute a communication).
The latency for transmitting a message of L words can be expressed as:
Lcom(L) = Tsend(L) + Treceive(L)
where Tsend and Treceive are respectively the latency of the send and receive
operations.
To reduce the metric Lcom(L) we can introduce the zero copy communication mechanism[1][2].
Zero copy communication copies the variables directly into the address space of the receiver, without any double copy of messages. With zero copy communication, the latency spent to transmit a message of L words can be expressed as:
Lcom(L) = Tsend(L).
The result is that with such a technique we have:
Lcom(L) = Tsend(L) + Treceive(L) ≈ Tsend(L) = Tsetup + L · Ttransm
where:
• Tsetup is the setup latency of the communication (independent of the message length).
• Ttransm is the latency needed to copy one word of the message.
To improve the performance of parallel applications by acting on the module’s service time, we can try to overlap calculation and communications.
Consider a typical stream parallel application whose modules alternate sequences of communications and calculations, and assume also that the module’s semantics allows the overlapping of communications (which must be asynchronous and non-blocking).
We can have three situations:
• Non-Overlapping. The communication latency is fully paid (Figure 5.1).
The service time is:
T = Tcalc + Lcom
• Full Overlapping. The communication latency is fully overlapped by the next calculation phase (Figure 5.2).
The service time is:
T = Tcalc
• Partial Overlapping. The communication latency is partially overlapped (Figure 5.3).
The service time is:
T = Tcalc + Tcom = max(Tcalc, Lcom),
where Tcom represents the average communication time not overlapped with the internal calculations.
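The cost model above can be turned into a few lines of code. The sketch below combines the zero copy latency Lcom(L) = Tsetup + L · Ttransm with the service time of the three situations; the numeric values are assumptions chosen only to exercise the formulas.

```python
def communication_latency(words, t_setup, t_transm):
    """Zero copy communication: Lcom(L) = Tsetup + L * Ttransm."""
    return t_setup + words * t_transm

def service_time(t_calc, l_com, overlapped):
    """Service time of a module alternating calculation and communication."""
    if not overlapped:
        return t_calc + l_com        # communication latency fully paid
    # Full overlapping when l_com <= t_calc, partial overlapping otherwise.
    return max(t_calc, l_com)

# Assumed values, expressed in time units t.
l_com = communication_latency(words=50, t_setup=100, t_transm=2)
t_no_overlap = service_time(500, l_com, overlapped=False)
t_full_overlap = service_time(500, l_com, overlapped=True)
t_partial_overlap = service_time(150, l_com, overlapped=True)
```

With these assumed values the communication is completely masked when the calculation lasts 500t, while with a calculation of 150t the residual Tcom = 50t remains visible in the service time.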
Figure 5.1: Non Overlapping communications and calculations
Figure 5.2: Fully Overlapping communications and calculations
Figure 5.3: Partial Overlapping communications and calculations
5.1.3 Communication Processor
Shared memory multiprocessors are characterized by a certain number of processing nodes.
To allow the overlapping of computations and communications in these architectures, an additional architectural support is needed: a communication processor[1][3].
Figure 5.4 shows an example of a processing node with a communication processor (KP) and a main processor (IP).
KP is dedicated to the execution of the run-time support of the communication primitives. The send operations are delegated to KP by passing it the information required to execute the primitives.
This information usually consists of:
• the channel identifier.
• the reference to the message.
5.1.4 Communication Thread
The work [7] investigates the possibility, for multithreading architectures, of emulating a communication processor such as KP.
A run-time support has been designed that associates a communication thread to each computational node of the parallel application (to delegate the execution of the communication primitives). In parallel, the “worker thread” executes the computation.
The run-time support has been implemented efficiently, with particular attention to the inter-thread synchronization mechanisms, realized using two different technologies:
• Classical POSIX mechanisms (mutexes and condition variables).
• MONITOR/MWAIT assembler instructions.
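The delegation of the send primitive to a communication thread can be illustrated with a small sketch. It is not the run-time support of [7]: it uses the Python standard library, where `queue.Queue` is built on a lock and condition variables and therefore stands in for the POSIX-based variant, and the channel dictionary and function names are assumptions. It shows only the structure of the delegation, not the achievable overlap.

```python
import queue
import threading

def communication_thread(requests, channels):
    """Executes the delegated send primitives until a None request arrives."""
    while True:
        request = requests.get()               # blocks until a request is queued
        if request is None:
            break
        channel_id, message = request          # channel identifier and message
        channels[channel_id].append(message)   # stands in for the real send

def worker(inputs, requests):
    """Worker thread: computes, delegates each send and keeps computing."""
    for x in inputs:
        result = x * x                         # the calculation phase
        requests.put((0, result))              # non-blocking delegation
    requests.put(None)                         # end of stream

def run(inputs):
    requests = queue.Queue()
    channels = {0: []}
    kp = threading.Thread(target=communication_thread, args=(requests, channels))
    kp.start()
    worker(inputs, requests)
    kp.join()
    return channels[0]

delivered = run([1, 2, 3, 4])
```

The worker never waits for the completion of a send: it only enqueues the channel identifier and the message, which is the information listed for KP in the previous section.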
Figure 5.4: Processing node with communication processor
5.2 Speculative Multithreading Architecture (SpMT)
Speculative Multithreading (SpMT) represents an important opportunity for multithreading architectures.
SpMT automatically discovers segments of code belonging to the original sequential application and executes them in parallel.
The order in which those segments are executed in parallel could violate the semantics of the original application (the Bernstein conditions could be violated).
To guarantee the correctness of the parallelization, SpMT must provide a support to “fix” the semantic violations at run-time through detection/commit/invalidation mechanisms.
Many studies related to SpMT concern heuristics that try to extract the best possible parallel solutions on the basis of some considerations about the parallel sections (thread size, data dependencies, workload balance...).
5.2.1 Architectural Support
The architectural support needed for speculative multithreading should provide:
• A buffer for the speculative state (the same speculative buffer already introduced in the implementation of Branch Prediction and Threaded Multipath Execution).
• A mechanism to communicate and synchronize current threads.
An SMT processor is an ideal solution to support speculative multithreading because the facilities provided at firmware/hardware level by such machines allow efficient speculative execution.
5.2.2 Run-time Support
The run-time support of SpMT must implement:
1. Commit/invalidation mechanisms.
2. Violation detection mechanisms.
As regards the first point, everything already explained about the run-time support of the Threaded Multipath Execution (as regards the commit/invalidation mechanisms) is valid also in this case.
The violation detection mechanism is the responsibility of the units of the pipelined CPU. In particular the Instruction Unit has to track:
• the relationship between the active threads.
• the logical addresses generated.
Through this information the system is able to detect violations and trigger the appropriate commit/invalidation mechanisms.
The overhead related to SpMT can be caused by:
• Violation detection.
• Load imbalance.
When the run-time support detects a violation, there is usually a flushing of the speculative buffers and a re-execution of part of the threads.
If violations are very frequent, the overhead of the parallelized application can become unsustainable, and this approach can be outperformed even by the serial legacy code itself.
Load imbalance is related to the time during which a hardware context remains idle after completing the execution of a speculative thread.
5.2.3 Case of Study
Consider the computation shown in Listing 5.1 and a target multithreading architecture with three hardware contexts.
int A[N], L[N], K[N];
int i=0;
for(i=0; i<N; i++){
...=A[L[i]];
A[K[i]]=...;
}
Listing 5.1: Workers Code
In speculative multithreading, a possible parallelization of a loop is to consider its iterations as if they were completely independent of each other.
Figure 5.5: Values of the arrays L and K
Figure 5.6: Speculative Multithreading Architecture Example
The SpMT compiler annotates the code with the information needed to spawn the iterations during the execution of the program.
In this example the run-time support of speculative multithreading assigns consecutive iterations to the hardware contexts in a circular way.
Figure 5.5 shows the values of the arrays L and K at run-time. Figure 5.6 shows the architectural state violations (iterations 2, 4 and 6) and a description of the program’s evolution in time.
The run-time support of this architecture implements invalidation/commit mechanisms assuming the re-execution of those iterations that violate dependencies.
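The detection of the violating iterations can be sketched with a simplified model. The assumptions are the following: iterations are executed in groups of as many consecutive iterations as there are hardware contexts, only read-after-write violations are considered (an iteration reads A[L[j]] while an older iteration of the same group writes the same element through K), and the values of L and K are hypothetical, since they are not the ones of Figure 5.5.

```python
def find_violations(L, K, n_contexts):
    """Return the iterations that read an element written by an older
    iteration running speculatively in the same group."""
    violations = []
    for start in range(0, len(L), n_contexts):
        group = range(start, min(start + n_contexts, len(L)))
        for j in group:
            # Iteration j read A[L[j]] speculatively: if an older iteration of
            # the group writes that element, the value read was stale.
            if any(K[i] == L[j] for i in group if i < j):
                violations.append(j)
    return violations

# Hypothetical index arrays (zero-based iterations), three hardware contexts.
L = [0, 5, 1, 7, 2, 9]
K = [5, 1, 3, 2, 8, 4]
to_reexecute = find_violations(L, K, n_contexts=3)
```

In this model the violating iterations are the ones the run-time support would invalidate and re-execute. A real implementation detects them from the logical addresses tracked at run-time rather than from the index arrays.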
Chapter 6
Cooperative Threads vs
Increase of Parallelism Degree
6.1 Introduction
As already introduced in Chapter 1, a leading trend in high performance products is represented by CMPs with multithreaded CPU nodes, and the use of such architectures can be divided into two execution models: increase of parallelism degree and cooperative threads.
If a parallel application presents a certain “optimization space”, the thread cooperation technique should be taken into account.
For example, if a program presents unpredictable cache faults or a great number of branch instructions, or if it is characterized by long communication latencies, the techniques shown in the previous chapters can contribute to optimizing the performance of the application (given a proper underlying architecture).
This can be done by exploiting the multithreading feature, allocating support functionalities to hardware contexts with the aim of a global performance improvement.
This chapter aims to show numerically that the choice of the execution model is not always automatic.
Let us consider a parallel application modelled as a farm, receiving as input stream elements with a certain interarrival time TA.
Let us suppose that the farm’s workers are characterized by a loop of N iterations over arrays/lists of N elements. The target architecture is a CMP with multithreading CPU nodes.
Two execution models are taken into account:
• Cooperative Threads (Version 1) - Use each core of the CMP to allocate worker threads together with other cooperative threads (speculative precomputation, multipath execution and communication threads).
• Increase of Parallelism Degree (Version 2) - Use each hardware context of the CMP to increase the parallelism degree of the application (if necessary).
Let us assume the following configuration of parameters:
• Number of CPUs: 8.
• CPU superscalar 2-way with SMT.
• Stream Length: m.
• Average Array/List Length: N = 10^2.
• Average Interarrival Time: TA = 60 time units t.
In the next two sections we analyse two programs characterized by the branch problem (Chapter 4), which are used as candidates to investigate the two execution models.
6.2 Case of study 1
The first program (Listing 6.1) consists of a sequence of instructions, the last of which (the IF instruction) causes an IU-EU dependency, and it contains three long dependent arithmetic instructions.
To evaluate the service time and the iteration time of the loop we use the graphical simulation model shown in Figure 6.1.
The evaluation of such parameters guides the choice of the proper execution strategy, in particular whether to apply the cooperative model (in this case Threaded Multipath Execution) or to increase the parallelism degree of the application.
Figure 6.1: Graphical Simulation of a farm’s worker execution - Case of Study 1.
LOOP:
LOAD RA, Ri, Ra
MUL Ra, Rx, Re
MUL Re, Ry, Rz
MUL Rz, Rp, Ry
INCR Ry
IF<= Ry, RN, LOOP
Listing 6.1: Farm’s worker D-RISC code - Case of Study 1.
6.2.1 Increase of Parallelism Degree
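Figure 6.1 shows an iteration time $T''_{iter} = 21t$ and a service time $T'' = \frac{21t}{6}$.
The optimal parallelism degree of this program is equal to:
$$n''_{opt} = \left\lceil \frac{T''_{calc}}{T_A} \right\rceil = \left\lceil \frac{N \cdot T''_{iter}}{T_A} \right\rceil = \left\lceil \frac{10^2 \cdot 21t}{60t} \right\rceil = 35.$$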
If we exploit all the hardware contexts, we can increase the parallelism degree from 6 (the number of cores minus 2, for the emitter and collector processes) to 14 (the number of cores times the number of hardware threads per core, minus 2).
We obtain the completion time:
T
′′(14)
c ≈ m ·
T
′′
loop
14
= m · N · T
′′
iter
14
= m · 10
2 · 21t
14
= m · 150t
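As a rough check of the worker counts and of the completion time, the following Python sketch reproduces the arithmetic. The helper names `farm_workers` and `completion_time_per_element` are hypothetical; two hardware contexts per core are assumed, as implied by the figure of 14 workers on 8 cores.

```python
from fractions import Fraction


def farm_workers(n_core: int, contexts_per_core: int = 1) -> int:
    """Workers left once the emitter and the collector each occupy one context."""
    return n_core * contexts_per_core - 2


def completion_time_per_element(n_elements: int, t_iter: int, workers: int) -> Fraction:
    """Approximate T_c / m = N * T_iter / n, in time units t (exact fraction)."""
    return Fraction(n_elements * t_iter, workers)


one_thread_per_core = farm_workers(8)    # 8 - 2 = 6 workers
all_contexts = farm_workers(8, 2)        # 8 * 2 - 2 = 14 workers

# Case of Study 1, Version 2: 10^2 * 21t / 14 = 150t per stream element
print(completion_time_per_element(100, 21, all_contexts))  # prints 150
```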
Figure 6.2: Graphical Simulation of a farm’s worker execution (with Threaded
Multipath Execution) - Case of Study 1.
6.2.2 Cooperative Threads
Let us consider the cooperative threads execution model (in particular the
Threaded Multipath Execution).
The program in Listing 6.1 has been rewritten in Listing 6.2, assuming an
annotation scheme applied at compilation time for Threaded Multipath Execution.
Listing 6.2 also shows the instructions that follow the guard of the loop,
and Figure 6.2 shows a graphical simulation of the execution of this
program. Different colors distinguish the instructions belonging to
different iterations.
In the graphical simulation, once the first annotated IF instruction
has been processed by the IU, the IM unit starts to issue the instructions of
the loop, interleaving them with the out-of-loop instructions that follow.
The (*) in Figure 6.2 marks the point at which the IU-EU dependency is
resolved; a commit/invalidation phase follows (as described in Chapter 4).
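This program shows an iteration time $T'_{iter} = 7t$ and a service time $T' = \frac{7t}{6}$.
The optimal parallelism degree of this program is equal to:
$$n'_{opt} = \left\lceil \frac{T'_{calc}}{T_A} \right\rceil = \left\lceil \frac{N \cdot T'_{iter}}{T_A} \right\rceil = \left\lceil \frac{10^2 \cdot 7t}{60t} \right\rceil = 12.$$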
The completion time is:
$$T'^{(6)}_{c} \approx m \cdot \frac{T'_{loop}}{6} = m \cdot \frac{N \cdot T'_{iter}}{6} = m \cdot \frac{10^2 \cdot 7t}{6} = m \cdot 116.67t.$$
For this program the cooperative threads execution model, in particular
the Threaded Multipath Execution, turns out to be better than the increase of
parallelism degree, with a global improvement of about 22%.
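The 22% figure follows directly from the two completion times per stream element. The short Python sketch below recomputes it with exact fractions; the name `relative_improvement` is hypothetical and the numeric values are those of Case of Study 1.

```python
from fractions import Fraction

N = 100                                  # average array/list length
t_version2 = Fraction(N * 21, 14)        # increase of parallelism degree: 150 t
t_version1 = Fraction(N * 7, 6)          # cooperative threads: 350/3 t (about 116.67 t)


def relative_improvement(baseline: Fraction, candidate: Fraction) -> Fraction:
    """Fraction of the baseline completion time saved by the candidate."""
    return (baseline - candidate) / baseline


gain = relative_improvement(t_version2, t_version1)
print(f"{float(gain):.1%}")  # prints 22.2%
```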
LOOP:
    LOAD RA, Ri, Ra
    MUL Ra, Rx, Re
    MUL Re, Ry, Rz
    MUL Rz, Rp, Ry
    INCR Ry
    IF <= Ry, RN, LOOP, threaded_multipath_ex
    LOAD ...
    ADD ...
    ADD ...
    SUB ...
    ADD ...
    ...
Listing 6.2: Farm’s worker D-RISC code with Threaded Multipath Execution
annotations - Case of Study 1.
6.3 Case of Study 2
The premises made for Case of Study 1 are also valid for this case.
In particular, the program analysed in this section (Listing 6.3)
shows a “lighter” branch problem than the previous case of study.
We add only one LOAD instruction and we shorten the sequence of multiplications
preceding the IU-EU dependency (from three multiplications to
only one).
6.3.1 Increase of Parallelism Degree
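Figure 6.3 shows an iteration time $T''_{iter} = 12t$ and a service time $T'' = \frac{12t}{5}$.
The optimal parallelism degree of this program is equal to:
$$n''_{opt} = \left\lceil \frac{T''_{calc}}{T_A} \right\rceil = \left\lceil \frac{N \cdot T''_{iter}}{T_A} \right\rceil = \left\lceil \frac{10^2 \cdot 12t}{60t} \right\rceil = 20.$$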
Figure 6.3: Graphical Simulation of a farm’s worker execution - Case of Study 2.
If we exploit all the hardware contexts, we can increase the parallelism degree
from 6 to 14 and we obtain the completion time:
$$T''^{(14)}_{c} \approx m \cdot \frac{T''_{loop}}{14} = m \cdot \frac{N \cdot T''_{iter}}{14} = m \cdot \frac{10^2 \cdot 12t}{14} = m \cdot 85.71t.$$
LOOP:
    LOAD RA, Ri, Ra
    LOAD RB, Ri, Rb
    MUL Ra, Rx, Ry
    INCR Ry
    IF <= Ry, RN, LOOP
Listing 6.3: Farm’s Worker D-RISC code - Case of Study 2.
6.3.2 Cooperative Threads
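Figure 6.4 shows an iteration time $T'_{iter} = 6t$ and a service time $T' = \frac{6t}{5}$.
The optimal parallelism degree of this program is equal to:
$$n'_{opt} = \left\lceil \frac{T'_{calc}}{T_A} \right\rceil = \left\lceil \frac{N \cdot T'_{iter}}{T_A} \right\rceil = \left\lceil \frac{10^2 \cdot 6t}{60t} \right\rceil = 10.$$
The completion time is equal to:
$$T'^{(6)}_{c} \approx m \cdot \frac{T'_{loop}}{6} = m \cdot \frac{N \cdot T'_{iter}}{6} = m \cdot \frac{10^2 \cdot 6t}{6} = m \cdot 100t.$$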
Figure 6.4: Graphical Simulation of a farm’s worker execution (with Threaded
Multipath Execution).
The analysis shows that, for this program, increasing the parallelism degree
($m \cdot 85.71t$) is a better solution than using cooperative threads ($m \cdot 100t$).
LOOP:
    LOAD RA, Ri, Ra
    LOAD RB, Ri, Rb
    MUL Ra, Rx, Ry
    INCR Ry
    IF <= Ry, RN, LOOP, threaded_multipath_ex
    LOAD ...
    ADD ...
    ADD ...
    SUB ...
    ADD ...
    ...
Listing 6.4: Farm’s Worker D-RISC code with Threaded Multipath Execution
annotation - Case of Study 2.
In the next section we try to draw some more general conclusions by
formalizing this problem.
6.4 Formalization
In this section we try to formalize the examples given in the previous sections.
The cooperative thread solution (Version 1) is better than the other one when
its completion time is the lowest, i.e. if $T'^{(n')}_{c} \le T''^{(n'')}_{c}$.
We can expand the latter relationship in this way:
$$m \cdot \frac{N \cdot T'_{iter}}{n'} \le m \cdot \frac{N \cdot T''_{iter}}{n''} \iff \frac{T'_{iter}}{T''_{iter}} \le \frac{n'}{n''}$$
where $n'$ is the actual parallelism degree of Version 1 and $n''$ is the actual
parallelism degree of Version 2.
They can be defined respectively as follows:
• $n' = \min\{n'_{opt}, n_{core} - 2\}$: all the cores of the multiprocessor machine are
available to schedule pairs of < W (worker), CT (cooperative threads) >.
If the optimal parallelism degree of Version 1 ($n'_{opt}$) is less than the
number of available cores, only as much of the architecture's parallelism as
needed is exploited.
• $n'' = \min\{n''_{opt}, n_{core} \cdot n_{contexts} - 2\}$: all the architectural parallelism is
available to increase the parallelism degree of the application. If the
available parallelism is “too much”, only as much of it as needed is
exploited (this is the role of the minimum in the definition of $n''$).
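The two definitions above can be sketched in Python as follows. The function name `actual_parallelism_degrees` is hypothetical; the only inputs are the two optimal degrees and the architecture parameters named in the text.

```python
def actual_parallelism_degrees(n_opt_v1: int, n_opt_v2: int,
                               n_core: int, n_contexts: int) -> tuple:
    """Return (n', n'') for Version 1 and Version 2 respectively."""
    available_v1 = n_core - 2                # one <worker, cooperative threads> pair per core
    available_v2 = n_core * n_contexts - 2   # one worker per hardware context
    return min(n_opt_v1, available_v1), min(n_opt_v2, available_v2)


# Case of Study 1: n'_opt = 12, n''_opt = 35 on 8 cores with 2 contexts each
print(actual_parallelism_degrees(12, 35, 8, 2))  # prints (6, 14)
```

In both case studies the optimal degrees exceed what the architecture offers, so both versions are capped by the hardware (6 and 14 workers).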
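$T'_{iter}$ can be written as $T'_{iter} = (1 - \alpha) \cdot T''_{iter}$,
where $\alpha$ is the gain of Version 1 over Version 2:
$$\begin{aligned}
\frac{T'_{iter}}{T''_{iter}} \le \frac{n'}{n''}
&\iff \frac{(1 - \alpha) \cdot T''_{iter}}{T''_{iter}} \le \frac{n'}{n''} \\
&\iff (1 - \alpha) \le \frac{\min\{n'_{opt},\ n_{core} - 2\}}{\min\{n''_{opt},\ n_{core} \cdot n_{contexts} - 2\}} \\
&\iff \alpha \ge 1 - \frac{\min\{n'_{opt},\ n_{core} - 2\}}{\min\{n''_{opt},\ n_{core} \cdot n_{contexts} - 2\}}.
\end{aligned}$$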
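By definition:
$$n'_{opt} = \left\lceil \frac{N \cdot T'_{iter}}{T_A} \right\rceil = \left\lceil \frac{N \cdot T''_{iter} \cdot (1 - \alpha)}{T_A} \right\rceil \quad \text{and} \quad n''_{opt} = \left\lceil \frac{N \cdot T''_{iter}}{T_A} \right\rceil$$
Version 1 (cooperative threads) is better than Version 2 (increase of parallelism
degree) when:
$$\alpha \ge 1 - \frac{\min\left\{\left\lceil \frac{N \cdot T''_{iter} \cdot (1 - \alpha)}{T_A} \right\rceil,\ n_{core} - 2\right\}}{\min\left\{\left\lceil \frac{N \cdot T''_{iter}}{T_A} \right\rceil,\ n_{core} \cdot n_{contexts} - 2\right\}}.$$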
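The condition above can be evaluated numerically. The following Python sketch is one possible transcription, under the assumption that the gain is passed as an exact fraction; the function name `cooperative_threads_win` and its parameter names are hypothetical.

```python
import math
from fractions import Fraction


def cooperative_threads_win(alpha, t_iter_v2, n_elements, t_arrival,
                            n_core, n_contexts) -> bool:
    """Evaluate alpha >= 1 - n'/n'' with n' and n'' defined as in the text."""
    alpha = Fraction(alpha)
    load_v2 = Fraction(n_elements * t_iter_v2, t_arrival)       # N * T''_iter / T_A
    n1 = min(math.ceil(load_v2 * (1 - alpha)), n_core - 2)      # n'
    n2 = min(math.ceil(load_v2), n_core * n_contexts - 2)       # n''
    return alpha >= 1 - Fraction(n1, n2)


# Case of Study 1: T''_iter = 21t, T'_iter = 7t, so alpha = 1 - 7/21 = 2/3
print(cooperative_threads_win(Fraction(2, 3), 21, 100, 60, 8, 2))  # prints True
# Case of Study 2: T''_iter = 12t, T'_iter = 6t, so alpha = 1/2
print(cooperative_threads_win(Fraction(1, 2), 12, 100, 60, 8, 2))  # prints False
```

Both outcomes agree with the conclusions reached for the two case studies: in both, the right-hand side is 1 - 6/14 = 4/7, which is below 2/3 but above 1/2.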
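A simple analytical consideration (without any graphical aid) can be made
just by observing the relationship:
$$\alpha \ge 1 - \frac{\min\{n'_{opt},\ n_{core} - 2\}}{\min\{n''_{opt},\ n_{core} \cdot n_{contexts} - 2\}}.$$
If $n'_{opt} \le n_{core} - 2$ and $n''_{opt} \le n_{core} - 2$ (and hence also
$n''_{opt} \le n_{core} \cdot n_{contexts} - 2$), both minima reduce to the optimal
parallelism degrees, therefore:
$$\alpha \ge 1 - \frac{n'_{opt}}{n''_{opt}} = 1 - \frac{\left\lceil \frac{N \cdot T''_{iter} \cdot (1 - \alpha)}{T_A} \right\rceil}{\left\lceil \frac{N \cdot T''_{iter}}{T_A} \right\rceil}.$$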
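In other words, if the cooperative threads bring some benefit to the worker
thread ($\alpha > 0$), then, neglecting the rounding introduced by the ceiling
operators,
$$\frac{\left\lceil \frac{N \cdot T''_{iter} \cdot (1 - \alpha)}{T_A} \right\rceil}{\left\lceil \frac{N \cdot T''_{iter}}{T_A} \right\rceil} \approx 1 - \alpha \le 1$$
and therefore
$$\alpha \ge 1 - \frac{\left\lceil \frac{N \cdot T''_{iter} \cdot (1 - \alpha)}{T_A} \right\rceil}{\left\lceil \frac{N \cdot T''_{iter}}{T_A} \right\rceil}$$
is satisfied (the cooperative threads approach is better than the other in the
sense of the criterion stated at the beginning of this section).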
The same observation can be made intuitively: if the optimal parallelism
degree of the application in both versions does not exceed $n_{core} - 2$, then
Version 1 (cooperative threads) is always better than Version 2, even if it
shows a minimal gain ($\alpha > 0$).
We obtain further information by plotting the service time ratio as a
function of $\alpha$ for different values of $T''_{iter}$ (Figures 6.5, 6.6, 6.7
and 6.8).
Let us summarize once again the parameters for the plot generation:
• Number of CPUs: ncore = 8.
• CPU Superscalar 2-way with SMT.
• Iteration Time of Version 2: $T''_{iter} = 5, 7, 9, 11$.
• Average Array/List Length: $N = 10^2$.
• Average Interarrival Time: TA = 60 time units t.
• 0 ≤ α ≤ 1.
Given that, by definition,
$$T'(n') = \frac{N \cdot T'_{iter}}{n'} \quad \text{and} \quad T''(n'') = \frac{N \cdot T''_{iter}}{n''},$$
if the service time ratio is greater than one, $\frac{T''(n'')}{T'(n')} > 1$, the cooperative threads
solution represents the best option.
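A minimal Python sketch of the service time ratio as a function of the gain is given below. The function name `service_time_ratio` and its default arguments are assumptions: the defaults mirror the parameter list above, with two hardware contexts per core taken from the 14-worker configuration used in the case studies. The sketch is checked only against the two case studies and makes no claim about how the plots were generated.

```python
import math
from fractions import Fraction


def service_time_ratio(alpha, t_iter_v2, n_elements=100, t_arrival=60,
                       n_core=8, n_contexts=2) -> Fraction:
    """Return T''(n'') / T'(n'); a value above 1 favours cooperative threads."""
    alpha = Fraction(alpha)
    if not 0 <= alpha < 1:
        raise ValueError("alpha must satisfy 0 <= alpha < 1 (T'_iter must be positive)")
    load_v2 = Fraction(n_elements * t_iter_v2, t_arrival)        # N * T''_iter / T_A
    n1 = min(math.ceil(load_v2 * (1 - alpha)), n_core - 2)       # n'
    n2 = min(math.ceil(load_v2), n_core * n_contexts - 2)        # n''
    t_v1 = n_elements * t_iter_v2 * (1 - alpha) / n1             # T'(n')
    t_v2 = Fraction(n_elements * t_iter_v2, n2)                  # T''(n'')
    return t_v2 / t_v1


print(service_time_ratio(Fraction(2, 3), 21))  # Case of Study 1: prints 9/7
print(service_time_ratio(Fraction(1, 2), 12))  # Case of Study 2: prints 6/7
```

The value $\alpha = 1$ is rejected because it would make $T'_{iter}$ zero and the ratio undefined.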
Two observations can be made from Figures 6.5, 6.6, 6.7 and 6.8:
• Each plot shows better performance with the use of cooperative threads
when the gain $\alpha$ is greater than a certain threshold.
This threshold gain increases together with the iteration time:
– The case with $T''_{iter} = 5$ is characterized by a gain threshold ≈ 0.40.
– The case with $T''_{iter} = 7$ is characterized by a gain threshold ≈ 0.57.
– The case with $T''_{iter} = 9$ is characterized by a gain threshold ≈ 0.67.
– The case with $T''_{iter} = 11$ is characterized by a gain threshold ≈ 0.73.
• For values of $\alpha$ greater than the threshold gain, the plots are characterized
by oscillatory behaviours.
The peaks of these oscillations grow as $\alpha$ increases, highlighting
the convenience of using the cooperative execution model over the other
solution.
The analysis carried out in this chapter shows that, for fine-grain computations,
a smaller gain is sufficient to make the cooperative thread model
convenient.
At the same time, it shows that if a specific program offers a large
optimization margin (as in Case of Study 1, where the gain with Threaded
Multipath Execution is about 67%, since $T'_{iter} = 7t$ against $T''_{iter} = 21t$),
the cooperative execution model is an opportunity that must be taken into account.
Figure 6.5: Service Time Ratio. $T''_{iter} = 5$.
Figure 6.6: Service Time Ratio. $T''_{iter} = 7$.
Figure 6.7: Service Time Ratio. $T''_{iter} = 9$.
Figure 6.8: Service Time Ratio. $T''_{iter} = 11$.
Chapter 7
Conclusions
The use of multithreading architectures for parallel programming is still
debated in computer science; indeed, it is not yet clear how to exploit them in
the most efficient way.
This thesis aims to be a self-contained work investigating different program
optimizations targeted at multithreading architectures.
These optimizations aim to increase the performance of applications by
mitigating the overheads due to the branch problem, cache fault
unpredictability and communication latency.
Chapter 3 provides a case of study for the analysis of a problem characterized
by cache fault unpredictability: the chasing pointer algorithm (based on
linked lists).
Different techniques are introduced to solve this problem, and speculative
precomputation proves to be the best solution from different points of view.
This technique makes it possible to fully mask the latencies caused by cache
faults by exploiting the hardware contexts of the architecture.
In the case of study, speculative precomputation works very well under all
the assumptions considered. In particular, it also proves effective
for fine-grain computations and for lists containing few elements.
Studies in the literature confirm that the speculative precomputation technique
reaches an average speed-up of 15%-20% on a set of irregular applications (the
related bibliography is given in the introduction).
It is the most widely used optimization technique exploiting multithreading
architectures.
Besides the chasing pointer program, it also works well for other problems
involving unpredictable access patterns, such as algorithms based on indirect
array references and hash tables.
CHAPTER 7. CONCLUSIONS 79
In Chapter 4 we analyse the branch problem, i.e. the degradation
of performance due to the frequency (and the particular context) with which
branch instructions appear in the program. First, this chapter presents
compile-time techniques [1][2] that statistically behave very well but are not
effective enough in certain contexts.
For this reason we introduce branch prediction and we propose a run-time
support for this technique, implemented at firmware level and targeted at the
pipelined CPU considered in this thesis [1][2].
There are application-specific cases in which branch prediction is also
difficult to apply, both at compile time (with annotations) and at run time.
In these cases it can be convenient to use the multipath execution technique
and go down both paths of a branch instruction: waiting for the resolution of
a “long” logical dependency, or taking the wrong branch, could be too costly
in the particular situation.
In Chapter 4 we also analyse the Threaded Multipath Execution technique, which
exploits the hardware contexts to support Multipath Execution. We also propose
a run-time support for this technique, implemented at firmware level and
targeted at multithreading architectures. This implementation assumes the
existence of mechanisms for the fast copy of register contents among
different hardware contexts.
In the case of study in Chapter 6 we assume that such a fast copy mechanism
is available.
Studies on Threaded Multipath Execution, conducted on programs showing
very bad branch problems, report an average speed-up on a single program of
about 20%, depending on the misprediction penalty.
Besides cache fault unpredictability and the branch problem, communication
latency can also be a source of degradation in parallel applications,
and it can be tackled with the communication thread technique (Chapter 5).
The authors of [7] realized a run-time support for communication threads;
as future work it would be interesting to extend it with the support
functionalities treated in this thesis, in particular Speculative Precomputation
and Threaded Multipath Execution.
Bibliography
[1] Marco Vanneschi. Architettura degli elaboratori. Pisa University Press,
2013.
[2] Marco Vanneschi. Structured Computer Architecture, 2013.
http://www.di.unipi.it/~vannesch/HPC%202013-14/0-2-Course%20Notes%202013-Part%200-Background.pdf.
Course Notes of High-performance Computing Systems and Enabling
Platforms, Master Program in Computer Science and Networking,
University of Pisa, 2013.
[3] Marco Vanneschi. Structuring and Design Methodology for Parallel
Computations, 2013.
http://www.di.unipi.it/~vannesch/HPC%202013-14/1-CourseNotes%202013-Part%201-Methodology.pdf.
Course Notes of High-performance Computing Systems and Enabling
Platforms, Master Program in Computer Science and Networking,
University of Pisa, 2013.
[4] Marco Vanneschi. Parallel Architectures, 2013.
http://www.di.unipi.it/~vannesch/HPC%202013-14/2-CourseNotes%202013-Part%202-Architectures.pdf.
Course Notes of High-performance Computing Systems and Enabling
Platforms, Master Program in Computer Science and Networking,
University of Pisa, 2013.
[5] Jamison D. Collins, Hong Wang, Dean M. Tullsen, Christopher
Hughes, Yong-Fong Lee, Dan Lavery, John P. Shen. Speculative
Precomputation: Long-range Prefetching of Delinquent Loads. In
Proceedings of the International Symposium on Computer Architecture,
pages 15-25, 2001.
[6] S. Wallace, B. Calder, and D. M. Tullsen. Threaded Multiple Path
Execution. SIGARCH Comput. Archit. News, vol. 26, no. 3, Apr. 1998.
BIBLIOGRAPHY 81
[7] D. Buono, T. De Matteis, G. Mencagli and M. Vanneschi. Optimizing
Message-Passing on Multicore Architectures using Hardware
Multi-Threading. Proceedings of the 22nd Euromicro International
Conference on Parallel, Distributed and Network-Based Processing,
Turin, Italy, 2014. To appear.
[8] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A.
Miller and M. Upton. Hyper-Threading Technology Architecture and
Microarchitecture. Intel Technology Journal, vol. 6, no. 1, pages 4-16,
2002.
[9] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O’Connell, D. Q. Nguyen, B.
J. Ronchetti, W. M. Sauer, E. M. Schwarz, and M. T. Vaden. 2007.
IBM POWER6 microarchitecture. IBM J. Res. Dev., 51, 6 (November
2007), 639-662. DOI=10.1147/rd.516.0639
http://dx.doi.org/10.1147/rd.516.0639
[10] R. X. Arroyo, R. J. Harrington, S. P. Hartman, and T. Nguyen. 2011.
IBM POWER7 systems. IBM J. Res. Dev., 55, 3 (May 2011), 220-232.
DOI=10.1147/JRD.2011.2131610
http://dx.doi.org/10.1147/JRD.2011.2131610
[11] Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. When Prefetching
Works, When It Doesn’t, and Why. ACM Trans. Archit. Code Optim.
9, 1, Article 2 (March 2012), 29 pages. DOI=10.1145/2133382.2133384
http://doi.acm.org/10.1145/2133382.2133384
[12] Chi-Keung Luk and Todd C. Mowry. Compiler-based prefetching for
recursive data structures. SIGPLAN Not. 31, 9 (September 1996),
222-233. DOI=10.1145/248209.237190
http://doi.acm.org/10.1145/248209.237190
[13] Amir Roth and Gurindar S. Sohi. Effective jump-pointer prefetching for
linked data structures. In Proceedings of the 26th annual international
symposium on Computer architecture (ISCA ’99). IEEE Computer
Society. Washington, DC, USA, 111-121. DOI=10.1145/300979.300989
http://dx.doi.org/10.1145/300979.300989.
[14] Jamison D. Collins, Dean M. Tullsen, Hong Wang, and John P. Shen.
Dynamic speculative precomputation. In Proceedings of the 34th annual
ACM/IEEE international symposium on Microarchitecture (MICRO
34), IEEE Computer Society. Washington, DC, USA, 306-317, 2001.
[15] Dongkeun Kim and Donald Yeung. Design and evaluation of compiler
algorithms for pre-execution. SIGOPS Oper. Syst. Rev 36, 5 (October
2002), 159-170. DOI=10.1145/635508.605415
http://doi.acm.org/10.1145/635508.605415.
[16] Chi-Keung Luk. Tolerating memory latency through
software-controlled pre-execution in simultaneous multithreading
processors. SIGARCH Comput. Archit. News 29, 2 (May 2001), 40-51.
DOI=10.1145/384285.379250
http://doi.acm.org/10.1145/384285.379250.
[17] Weifeng Zhang, Dean M. Tullsen, and Brad Calder. Accelerating and
Adapting Precomputation Threads for Efficient Prefetching. In
Proceedings of the 2007 IEEE 13th International Symposium on High
Performance Computer Architecture (HPCA ’07). IEEE Computer
Society, Washington, DC, USA, 85-95.
DOI=10.1109/HPCA.2007.346187
http://dx.doi.org/10.1109/HPCA.2007.346187.
[18] Craig Zilles and Gurindar Sohi. Execution-based prediction using
speculative slices. In Proceedings of the 28th annual international
symposium on Computer architecture (ISCA ’01). ACM, New York,
NY, USA, 2-13. DOI=10.1145/379240.379246
http://doi.acm.org/10.1145/379240.379246.
[19] George Radin. 1982. The 801 minicomputer. SIGARCH Comput.
Archit. News 10, 2 (March 1982), 39-47. DOI=10.1145/964750.801824
http://doi.acm.org/10.1145/964750.801824.
[20] Robert S. Chappell, Jared Stark, Sangwook P. Kim, Steven K.
Reinhardt, and Yale N. Patt. Simultaneous subordinate microthreading
(SSMT). SIGARCH Comput. Archit. News 27, 2 (May 1999), 186-195.
DOI=10.1145/307338.300995
http://doi.acm.org/10.1145/307338.300995.
[21] Tor M. Aamodt, Paul Chow, Per Hammarlund, Hong Wang, and John
P. Shen. Hardware Support for Prescient Instruction Prefetch. In
Proceedings of the 10th International Symposium on High Performance
Computer Architecture (HPCA ’04). IEEE Computer Society,
Washington, DC, USA, 84-. DOI=10.1109/HPCA.2004.10028
http://dx.doi.org/10.1109/HPCA.2004.10028.
[22] Dongkeun Kim and Donald Yeung. 2002. Design and evaluation of
compiler algorithms for pre-execution. SIGOPS Oper. Syst. Rev. 36, 5
(October 2002), 159-170. DOI=10.1145/635508.605415
http://doi.acm.org/10.1145/635508.605415.
[23] Weifeng Zhang, Brad Calder, and Dean M. Tullsen. 2005. An
Event-Driven Multithreaded Dynamic Optimization Framework. In
Proceedings of the 14th International Conference on Parallel
Architectures and Compilation Techniques (PACT ’05). IEEE
Computer Society, Washington, DC, USA, 87-98.
DOI=10.1109/PACT.2005.7 http://dx.doi.org/10.1109/PACT.2005.7.
[24] Carlos Madriles, Carlos García-Quiñones, Jesús Sánchez, Pedro
Marcuello, Antonio González, Dean M. Tullsen, Hong Wang, and John
P. Shen. Mitosis: A Speculative Multithreaded Processor Based on
Precomputation Slices. IEEE Trans. Parallel Distrib. Syst. 19, 7 (July
2008), 914-925. DOI=10.1109/TPDS.2007.70797
http://dx.doi.org/10.1109/TPDS.2007.70797.
[25] D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo. The IBM
system/360 model 91: machine philosophy and instruction-handling.
IBM J. Res. Dev. 11, 1 (January 1967), 8-24.
DOI=10.1147/rd.111.0008 http://dx.doi.org/10.1147/rd.111.0008
[26] Artur Klauser, Abhijit Paithankar, and Dirk Grunwald. Selective eager
execution on the PolyPath architecture. SIGARCH Comput. Archit.
News 26, 3 (April 1998), 250-259. DOI=10.1145/279361.279393
http://doi.acm.org/10.1145/279361.279393.
[27] Pritpal S. Ahuja, Kevin Skadron, Margaret Martonosi, and Douglas W.
Clark. Multipath execution: opportunities and limits. In Proceedings of
the 12th international conference on Supercomputing (ICS ’98). ACM,
New York, NY, USA, 101-108. DOI=10.1145/277830.277854
http://doi.acm.org/10.1145/277830.277854
[28] N. Magid, G. Tjaden and H. Messinger. Exploitation of Concurrency
by Virtual Elimination of Branch Instructions. In Proceedings of the
1981 International Conference on Parallel Processing, pages 164-165
Ohio State University and the IEEE Computer Society, August 1981.
[29] Augustus K. Uht, Vijay Sindagi, and Kelley Hall. 1995. Disjoint eager
execution: an optimal form of speculative execution. In Proceedings of
the 28th annual international symposium on Microarchitecture (MICRO
28). IEEE Computer Society Press, Los Alamitos, CA, USA, 313-325.
[30] T-F Chen. 1998. Supporting Highly-Speculative Execution via
Adaptive Branch Trees. In Proceedings of the 4th International
Symposium on High-Performance Computer Architecture (HPCA ’98).
IEEE Computer Society, Washington, DC, USA, 185-.
[31] Pedro Marcuello and Antonio González. Data Speculative
Multithreaded Architecture. In Proceedings of the 24th Conference on
EUROMICRO - Volume 1 (EUROMICRO ’98), Vol. 1. IEEE
Computer Society, Washington, DC, USA, 10321-.
[32] Il Park. 2003. Implicitly-Multithreaded Processors. Ph.D. Dissertation.
Purdue University, West Lafayette, IN, USA. AAI3113856.
[33] Venkatesan Packirisamy. Exploring Efficient Architecture Design for
Thread-Level Speculation—Power and Performance Perspectives. Ph.D.
Dissertation. University of Minnesota, Minneapolis, MN, USA.
Advisor(s) Pen-Chung Yew and Antonia Zhai. AAI3360381.
[34] Fredrik Warg and Per Stenstrom. Dual-Thread Speculation: Two
Threads in the Machine are Worth Eight in the Bush. In Proceedings of
the 18th International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD ’06). IEEE Computer Society,
Washington, DC, USA, 91-98. DOI=10.1109/SBAC-PAD.2006.17
http://dx.doi.org/10.1109/SBAC-PAD.2006.17
[35] P. Lai, P. Balaji, R. Thakur, and D. K. Panda. ProOnE: A
General-Purpose Protocol Onload Engine for Multi- and Many-Core
Architectures. Computer Science: Research and Development, vol. 23,
no. 3-4, 2009.
[36] T. Hoefler, C. Siebert, and W. Rehm. A practically constant-time MPI
Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast.
In Parallel and Distributed Processing Symposium, 2007. IPDPS ’07.
IEEE International, 2007, pages 1-8, 2007.
[37] G. Goumas, N. Anastopoulos, N. Ioannou and N. Koziris. Overlapping
Computation and Communication in SMT Clusters with Commodity
Interconnects. In Cluster Computing and Workshops, 2009. CLUSTER
’09. IEEE International Conference on, 2009, pp. 1-10.
[38] Ulrich Drepper. What Every Programmer should know about Memory.
RedHat Press.
[39] Marco Aldinucci. Smart Memory Parallel Architectures. PhD thesis,
Department of Computer Science, Università degli Studi di Pisa
(Italy), 2000.
[40] Keith Diefendorff. Compaq Chooses SMT for Alpha. Microprocessor
Report, 13(16):1, 6–11, December 1999.
[41] S. E. Perl and R. L. Sites. Studies of Windows NT Performance Using
Dynamic Execution Traces. Proceedings 2nd Symposium on Operating
Systems Design and Implementation, pages 169–183, USENIX Association,
Berkeley, CA, 1996.
[42] D. Koufaty and D.T. Marr. Hyperthreading technology in the netburst
microarchitecture. IEEE Micro, pages 56-65, April 2003.
[43] Nathan Tuck, Dean M. Tullsen. 12th International Conference on
Parallel Architectures and Compilation Techniques (PACT 2003), 27
September - 1 October 2003, New Orleans, LA, USA; 01/2003.
[44] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun.
Niagara: A 32-way multithreaded SPARC processor. IEEE MICRO,
March 2005.
[45] Shah M., Barren J., Brooks J., Golla R., Grohoski G., Gura N.,
Hetherington R., Jordan P., Luttrell M., Olson C., Sana B.,
Sheahan D., Spracklen L., Wynn A. UltraSPARC T2: A
highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits
Conference, 2007. ASSCC ’07. IEEE Asian, November 2007.
[46] Tremblay M., Chan J., Chaudhry S., Conigliam A.W., Tse S.S. The
MAJC architecture: a synthesis of parallelism and scalability. Micro,
IEEE, vol.20, no.6, pp.12,25, Nov/Dec 2000 doi: 10.1109/40.888700.
[47] Chaudhry S., Cypher R., Ekman M., Karlsson M.,
Landin A., Yip S., Zeffer Håkan and Tremblay M.
Rock: A High-Performance Sparc CMT Processor. Micro, IEEE, vol. 29, no. 2,
pp. 6-16, 2009.
[48] Ron Kalla, Balaram Sinharoy, and Joel M. Tendler. 2004. IBM Power5
Chip: A Dual-Core Multithreaded Processor. IEEE Micro 24, 2 (March
2004), 40-47. DOI=10.1109/MM.2004.1289290
http://dx.doi.org/10.1109/MM.2004.1289290
[49] Augustus K. Uht, Vijay Sindagi, and Kelley Hall. 1995. Disjoint eager
execution: an optimal form of speculative execution. In Proceedings of
the 28th annual international symposium on Microarchitecture (MICRO
28). IEEE Computer Society Press, Los Alamitos, CA, USA, 313-325.
[50] N. Magid, G. Tjaden, and H. Messinger. Exploitation of Concurrency
by Virtual Elimination of Branch Instructions. In Proc. of the 1981 Intl.
Conference on Parallel Processing (KPP). pages 164-165, Aug. 1981.
[51] Prabhakar Raghavan, Hadas Shachnai, and Mira Yaniv. Dynamic
schemes for speculative execution of code. Perform. Eval. 53, 2 (July
2003), pages 125-142. DOI=10.1016/S0166-5316(02)00229-8
http://dx.doi.org/10.1016/S0166-5316(02)00229-8
[52] Alexander Gaysinsky, Alon Itai, and Hadas Shachnai. Strongly
competitive algorithms for caching with pipelined prefetching. Inf.
Process. Lett. 91, 1 (July 2004), pages 19-27.
DOI=10.1016/j.ipl.2004.03.008
http://dx.doi.org/10.1016/j.ipl.2004.03.008
[53] T-F Chen. 1998. Supporting Highly-Speculative Execution via
Adaptive Branch Trees. In Proceedings of the 4th International
Symposium on High-Performance Computer Architecture (HPCA ’98).
IEEE Computer Society, Washington, DC, USA, 185-.
[54] T. Heil and J. Smith. Selective dual path execution. In Technical report,
University of Wisconsin - Madison, WI, USA, November 8, 1996.
[55] Uht, A. K., Morano, D., Khalafi, A., and Kaeli, D. R. Levo. A Scalable
Processor With High IPC. The Journal of Instruction-Level
Parallelism, 5, August 2003.
[56] J. Smith. Branch predictor using random access memory. U.S. Patent
#4370711, assigned to Control Data Corporation, Filed Oct. 21, 1980,
Issued Jan 25, 1983.
[57] M. Curtis-Maury and T. Wang. Integrating multiple forms of
multithreaded execution on multi-smt systems: A study with scientific
applications. In Proceedings of the Second International Conference on
the Quantitative Evaluation of Systems, ser. QEST ’05, Washington,
DC, USA: IEEE Computer Society, 2005, pp. 199–.
[58] S. Saini, H. Jin, R. Hood, D. Barker, P. Mehrotra, and R. Biswas. The
impact of hyper-threading on processor resource utilization in
production applications. In Proceedings of the 2011 18th International
Conference on High Performance Computing, ser. HIPC ’11,
Washington, DC, USA: IEEE Computer Society, 2011, pp. 1–10.
