Automatic Code-Generation Techniques for Micro-Threaded RISC Architectures by McGuiness, J.
Automatic Code-Generation
Techniques for Micro-Threaded
RISC Architectures
Jason McGuiness
Submitted to the University of Hertfordshire in partial fulfillment of
the requirements of the degree of Master of Science by Research.
Compiler Technology and Computer Architecture Group,
Department of Computer Science,
University of Hertfordshire,
England.
July 2006
Dediated to
the many enlightening disussions I had with Dr. Rihard Harris and Dr. Andres
Márquez.
Automati Code-Generation Tehniques for
Miro-Threaded RISC Arhitetures
Jason M

Guiness
Submitted to the University of Hertfordshire in partial fulllment of
the requirements of the degree of Master of Siene by Researh.
July 2006
Abstrat
There has been an ever-widening gap between proessor and memory speeds,
resulting in a `memory wall' where the time for memory aesses dominates per-
formane. To ounter this, arhitetures that use many very small threads that
allow multiple memory aesses to our in parallel have been under investigation.
Examples of these arhitetures are the CARE (Compiler Aided Reorder Engine) ar-
hiteture, miro-threading arhitetures and ellular arhitetures, suh as the IBM
Cylops family, implementing using proessors-in-memory (PIM), whih is the main
arhiteture disussed in this thesis. PIM arhitetures ahieve high performane by
inreasing the bandwidth of the proessor to memory ommuniation and reduing
that lateny, via the use of many proessors physially lose to the main memory.
These massively parallel arhitetures may have sophistiated memory models, and
I ontend that there is an open question regarding what may be the ideal approah
to implementing parallelism, via using many threads, from the programmer's per-
spetive. Should the implementation be at language-level suh as UPC, HPF or
iv
other language extensions, alternatively within the ompiler using trae-sheduling?
Or should it be at a library-level, for example OpenMP or POSIX-threads? Or per-
haps within the arhiteture, suh as designs derived from data-ow arhitetures?
In this thesis, DIMES (the Delaware Iterative Multiproessor Emulation System),
whih is being developed by CAPSL at the University of Delaware, was used as a
hardware evaluation tool for suh ellular arhitetures. As the programing example,
the author hose to use a threaded Mandelbrot-set generator with a work-stealing al-
gorithm to evaluate the DIMES thread programming model. This implementation
was used to identify potential problems and issues that may our when attempting
to implement massive number of very short-lived threads.
Delaration
The work in this thesis is based on researh arried out at the Compiler Tehnology
and Computer Arhiteture Group, University of Hertfordshire, England. No part
of this thesis has been submitted elsewhere for any other degree or qualiation and
it is all my own work unless referened to the ontrary in the text.
v
Aknowledgments
My sinere thanks to Dr. Colin Egan my nal supervisor, of the University of
Hertfordshire, for taking on an unusual student! But also my thanks to Professor
Alex Shafarenko, of the same University for his inspirational onversations, and
being instrumental in my ollaboration with CAPSL and the University of Delaware.
Finally, my thanks to Dr. Slava Muhnik, who initiated this program of researh.
The following dotors also deserve honorable mention: Dr. Andres Márquez and Dr.
Georg Munz. Both were inspirational onversationalists, eah in their own ways...
The researh presented in this thesis was, in part, supported by the Engineering
and Physial Researh Counil (EPSRC) grant number: GR/S58492/01.
vi
Contents
Abstract
Declaration
Acknowledgments
1 Introduction
2 Related Work
2.1 The VLIW Origins
2.2 Beyond VLIW: Super-scalar: the combination of branch predictors,
speculation and memory hierarchies
2.3 Parallel Architectures
2.3.1 EARTH, the EARTH compiler and CARE
2.3.1.1 The EARTH architecture
2.3.1.2 The EARTH Compiler
2.3.1.3 The CARE Architecture
2.3.2 The Micro-Threaded Architecture
2.3.3 IBM BlueGene/C and Cyclops
3 The limitations of super-scalar architectures: the memory wall
3.1 Multiple cores and massively parallel architectures
3.2 The programming models: from compilers to libraries
3.3 IBM BlueGene/C, Cyclops and DIMES/P: the implementation of a
cellular architecture
3.4 Programming Models on Cellular Architectures
3.5 Programming for Cyclops
4 Programming the Mandelbrot Set Algorithm for Cyclops
4.1 An Introduction to the Mandelbrot Set
4.2 Threading and the Mandelbrot Set
4.3 A Discussion of the Work-Stealing Algorithm 5
4.4 DIMES/P Implementation of the Mandelbrot-set application
4.4.1 The Memory Layout
4.4.2 The Host Interface
4.4.3 Execution details of the Mandelbrot-set application
5 List of Achievements
6 Summary
Bibliography
Appendix
A Implementing Applications on a Cellular Architecture - the Mandelbrot-set.
A.1 Abstract.
A.2 Introduction.
A.3 Programming Models on Cellular Architectures.
A.4 Conclusion/Discussion.
B Implementing Applications on a Cellular Architecture - the Mandelbrot-set.
B.1 Overview:
B.2 A recap on the memory wall. Part I: The processor viewpoint.
B.3 A recap on the memory wall. Part II: The memory viewpoint.
B.4 The memory wall and cellular architectures: a solution?
B.5 Programming models on Cellular Architectures.
B.6 A brief overview of Cyclops and DIMES/P-2.
B.7 An introduction to the Mandelbrot set.
B.8 The classic algorithm used to generate the Mandelbrot set:
B.9 Threading applied to the Mandelbrot set.
B.10 The Render-Thread Algorithm.
B.11 The Work-Stealing Algorithm.
B.12 A Discussion of the Work-Stealing Algorithm.
B.13 The static layout of the render and work-stealing threads within the
DIMES/P-2 system is shown below:
B.14 Execution Details of the Mandelbrot-set application.
B.15 Supercomputing Benchmarks: Global Updates Per Second (GUPS).
B.16 GUPS and DIMES.
B.17 Limitations of current GUPS & DIMES.
B.18 Conclusion & Future Work.
C The Challenges of Efficient Code-Generation for Massively Parallel
Architectures.
C.1 Abstract
C.2 Introduction
C.3 Related Work
C.3.1 The Programming Models: from Compiler to Libraries
C.3.2 Programming Models on Cellular Architectures
C.4 Programming for Cyclops - threads
C.4.1 Threading and the Mandelbrot Set
C.4.2 DIMES/P Implementation of the Mandelbrot-set application
C.5 Discussion
List of Figures
2.1 P(n) for conventional memory with L0 = 1/T0, taken from [13].
2.2 Schematic of a micro-threaded, RISC architecture.
4.1 The classic Mandelbrot set image generated by Fractint [119]. Points
coloured black are in M.
4.2 A false-colour image of the Mandelbrot set generated by Aleph One [71].
4.3 Simplified schematic overview of the DIMES/P implementation of CyclopsE.
4.4 Layout of the render and work-stealing threads within the DIMES/P system.
4.5 The image generated shortly after program start-up.
4.6 Image generation has progressed, shortly before a work-stealing event.
4.7 Just after the first work-stealing operation.
4.8 The second work-stealing operation.
4.9 The third work-stealing operation.
4.10 The completed Mandelbrot set.
B.1 The image generated shortly after program start-up.
B.2 Image generation has progressed, shortly before a work-stealing event.
B.3 Just after the first work-stealing operation.
B.4 The second work-stealing operation.
Chapter 1
Introduction
The memory-wall [121] is a limiting factor in CPU performance, which may be
countered by introducing extra levels in the memory hierarchy [15,121]. However, these
extra levels increase the penalty associated with a miss in the memory-subsystem,
due to memory-access times, limiting the ILP. Also, there may be an increase in
design complexity and power consumption of the overall system. An approach to
avoid this problem may be to fetch sets of instructions from different memory banks,
i.e. introduce threads, which would allow an increase in ILP, in proportion to the
number of executing threads. There are issues with introducing multiple threads of
execution: for example, the currently executing threads should not have data or
control flow that is inter-dependent between any of them. Another issue is that the
cost of creating, synchronising and destroying threads should be very low, which
constrains the architectural design. The reason for this latter constraint is that the
latencies to be mitigated would be pipeline stalls, usually very short periods
of time, potentially between a few and tens of clock cycles. Such short threads,
that are designed to mitigate pipeline stalls, this thesis shall term
micro-threads. This definition is in slight contrast to the definitions used within [13, 68],
where the motivation for the definition came from the differing size of the thread,
i.e. that they lacked a stack and context. Given that the threads to which this thesis
refers, and the threads of [13,68], are all used to maintain pipeline throughput,
this modified definition has some justification. Note that these micro-threads are
not operating-system level threads, which have large context, are potentially
pre-emptible and used for process-level parallelism. Micro-threads would be designed to
have very little context, making creation and destruction cheaper.
Various arhitetures have been proposed that ould support miro-threads:
• The arhiteture of [13, 68℄, whih is designed to support the smallest variant
of miro-threads.
• The CARE (Compiler Aided Reorder Engine) arhiteture desribed in [75℄,
that supports strands, a variant of miro-threads, that full the same goal,
thus ome under the denition of miro-threads used in this thesis.
• The integration of proessing logi and memory [21, 41, 105, 106℄ within the
same hip, termed PIM. Suh integration may also improve both data-proessing
and data-aess time.
In this thesis the appliation of miro-threading to the nal PIM arhiteture is what
will be examined in most detail, with oasional referenes to the other arhitetures.
A problem with integrating proessors and memory in the same spae is that the
proessor speed and the amount of memory are redued [21℄. This may be overome
by onneting multiple, independent PIM ells, where the resultant arhiteture is
desribed as ellular. In this multi-threaded organisation, every thread unit serves
as an independent single-issue, in-order proessor, thus able to potentially aess
memory independently, depending upon the exat details of the arhitetural design.
This gives rise to a number of ode-generation problems, some of whih are dis-
ussed in appendix C, entred around the fat that to provide omputational power,
these systems are massively parallel. It is ommon folklore in the programming
ommunity that writing orret and eient multi-threaded programs is hard. This
problem ould be ompounded for suh ellular arhitetures. Thus, onsiderable
researh eort has been targeted at ode generation, inluding thread generation,
to support suh hardware. There is likely to be muh researh to do: to develop
ompilers to generate multi-threaded ode, reate lower-level libraries that ease the
burden of reating suh ode, and write debuggers that allow the programmer to
eetively debug suh programs.
Chapter 1. Introdution 3
Thread-generating compilers exist; for example, HPF and UPC [45]. The
Fortran-based HPF is very useful for mathematical problems, but less so for other problem
domains. Both compilers specialise in parallelising loop-constructs. Other C and
C++ parallelising compilers exist, but are largely based upon the OpenMP library,
for example IBM XL Fortran and Visual Age C/C++, which also tend to focus upon
loops as a source of parallelism. Alternatively, higher-level approaches, such as a
compiler that may automatically create threads using the split-phase constraint,
exist for such architectures as EARTH [109]. The split-phase constraint may be loosely
defined as when the compiler may generate a synchronization variable, and a
destination thread, for a potentially long-latency load from remote memory. This EARTH
compiler attempts to fulfil the promise of thread-generation for the programmer: it
is automated, general-purpose - not limited to loops - and the schedules it creates
are provably fast and correct.
Moreover, the dierent memory hierarhies within ellular arhitetures add to
the multi-threaded ode-generation problem. Researh is in progress to address this
problem: for example by plaing hardware memory-banks that have dierent aess
and onsisteny models at dierent address ranges in the memory-map of the virtual
mahine, known as loation onsisteny [40℄ is one approah. The EARTH ompiler
and UPC both provide language, hene ompiler support, for suh features using the
split-phase onstraint, or the use of the strit and relaxed keywords, respetively.
The library-based approah to threading has often been made less eetive by a
lak of language support, that would aid the expressiveness and use of thread-related
onstruts (for example threads themselves and synhronization mehanisms). For
example, the use of pragmas in the various implementations of OpenMP, and the
fat that general-purpose languages have been very slow to adopt a suiently so-
phistiated abstration of the features of any mahine model. C/C++ has had the
volatile keyword for over a deade, but has made very limited use of it in supporting
shared data, that may be aessible by more than one thread, an obvious use of the
keyword. (Indeed this use is to be introdued into the next C++ standard, to be -
nalised not before 2009.) This limitation has been noted (at the Assoiation of C and
C++ Users Conferene, 2005, by B. Stroustrup, in one of his keynote presentations,
Chapter 1. Introdution 4
and by others) and has apparently hampered development of multi-threaded pro-
grams and the development of ompilers that might automatially generate threads.
The author contends that library-based solutions to threading are too dependent
upon the programmer to use correctly. For example, the explicit use of locks in
programs is prone to error: deadlocks and race-conditions that are hard to track
down are easily introduced.
The development of suitable tools to debug multi-threaded applications has been
slow. Some tools are available (strace, truss, pstack and various debuggers) but are
very limited in functionality, with regards to threading. More useful debuggers are
in development, for example for Cyclops [29]. But these are few, with currently
limited functionality. Further development in this area would be vital to allow the
programmer to debug their code on such systems. A more important aspect of these
tools would be to aid the programmer with regards to reasoning about the function
of their multi-threaded code, and thus avoid such bugs.
In the author's opinion, the leaders in this field are aiming at a language, not
library, based solution, which would be the appropriate level of abstraction for the
expression of parallelism within a program. The compiler support would allow
the development of more powerful multi-threading abstractions, such as various
algorithms, that would help to divorce the programmer from the complex details
of the underlying architectural support. But there are limitations in the direction
of such current compiler developments: for example, UPC apparently exposes only
loop-based parallelism, and HPF requires explicit statements within the code to make
the compiler generate multi-threaded code, which is also directed towards parallelising
loops. The author contends that this would be far too limited for application to
general-purpose programs.
As identifying parallelism both correctly and efficiently has been very hard for
the programmer to do, the author contends that they should not do it. When such
massively-parallel architectures are developed, this process should include time to
develop libraries that plug into the target compilers to allow them to generate
efficient code for that architecture. Thus the programmer would identify variables and
functions that they believe they may be able to parallelise, to guide the compiler.
The compiler, equipped via these libraries with a detailed machine-model, would be
able to refine and hone these gross indications in the program to generate efficient
code. The author has observed only limited effort devoted to investigating the
software aspect of the code generation problem for massively parallel architectures.
Unfortunately, if this case should continue, this shortcoming could adversely affect
the popularity of such systems and maintain the perception that massively parallel
architectures are too specialized and thus too expensive to be of more general use.
Given the popularity of multi-core processors, this position is set to become even
more untenable.
Chapter 2
Related Work
2.1 The VLIW Origins
The researh that has been done in the eld of multi-threaded arhitetures, may
be onsidered to have been heavily inuened by the work on VLIW arhitetures:
one an onsider them to have a limited number of live threads at any one instant,
limited not only by the number of slots in the instrution word, but also by the
ability of the ompiler to identify suh instrution-level parallelism. Some researh
work [83, 117℄ demonstrated that, in the SPEC95 benhmark suite, there has been
potential for a large number of independent threads, up to the order of thousands.
Unfortunately this motivating result was for VLIW mahine-models with ertain,
ideal parameters; a ommon limit has been the number of available registers, or
bypass buses, or an orale branh preditor within the ompiler. This gave impetus
to the arhiteture eld to researh these rih topis, and has provided very eetive
dynami, rather than ompile-time branh preditors. But the VLIW ompilers, the
trae ompilers of the time, required a ompile-time branh preditor to produe
ode that did not need expensive reovery mehanisms, and enable the ompiler to
perform the whole-program, ode-motion optimizations it needed to do to extrat
the ILP from the programs. Results for the register problem have been similarly
mixed: due to the multi-ported nature of the register banks, there is a physial and
tehnologial limit: having more registers sales the area of the hip linearly, but
more register ports (for bypass buses) sales the area geometrially. Tehnologial
6
2.2. Beyond VLIW: Super-salar: the ombination of branh preditors,
speulation and memory hierarhies 7
limitations in hip fabriation limit the yield of the hips: the larger the area, the
lower the yield in diret proportion. Thus adding a suient number of register ports
will always reah a limit in the urrent tehnologial ability to produe eonomi
quantities of suh hips.
The instrution density in VLIW ode dereased for various reasons:
• a lak of an eetive ompile-time branh preditor,
• ombined with limited register resoures,
• true data-dependenies,
• and strutural hazards
all of whih meant is was neessary to injet no-ops into the instrution stream.
These no-ops have been of vital signiane: they were a diret indiation of the
ineieny of the ompiler and tool-hain, hene arhiteture, to extrat ILP from
the instrution stream, and indiate an ineieny of both the software and the
ompiler. Consequently the eetiveness of the VLIW arhiteture as a tehnique
to inrease performane, via extrating ILP, by re-ompiling the soure ode had
been onstrained.
2.2 Beyond VLIW: Super-scalar: the combination
of branch predictors, speculation and memory
hierarchies
But the researh yielded very useful results: the development of dynami, as opposed
to the ompile-time branh preditors. These meant that speulative exeution
of ode was muh less likely to be wasted work. Thus the advent of super-salar
proessors, but these had their problems: performane was hindered by slow memory
speeds. So small ahes were implemented, based upon the assumption of data and
ontrol loality. The size of a hardware ahe has been hosen to be roughly 10%
of the average size of the exeuting program the related data. These ahes have
2.2. Beyond VLIW: Super-salar: the ombination of branh preditors,
speulation and memory hierarhies 8
been omposed of very high speed memory, whih has been ostly to implement.
They were also plaed diretly in line with the IF stage of the pipeline, allowing
very high-speed instrution-feth from the ahe, if there was a ahe hit [80℄. Also,
regarding instrution feth: the auray of the branh preditor and plaing it very
early in the pipeline has been vital. This is to allow the branh target addresses
to be obtained (potentially via the BTC, or via a default predition, or a dynami
preditor may be used) before the instrutions that would generate the result of that
branh ondition. This allowed the instrution ahe to pre-feth ahe-line sized
amounts of instrutions from slower hierarhies, using the pre-omputed, predited,
branh-target address, and deliver them with minimal pipeline stalls to the IF stage.
The data ahe has been more omplex, but the onept of implementing a small,
write-bak, high-speed amount of memory so that register writes would be direted
to this memory has been relatively simple: it would at as a buer to the lower-
level, slower memories, and allow memory reads to be potentially servied diretly
by the data ahe instead of from the lower-level memory hierarhies. Another
major fator has been out-of-order instrution exeution: if there were suient
proessor resoures, instrutions ould be exeuted in parallel, although they would
be fethed in-order and potentially retired out-of-order. Moreover instrutions that
ompleted faster need not be held up by slower instrutions that were ahead of
them in the instrution stream. The use of a soreboard or register le [59, 80℄
allowed the data-dependenies between registers to be dynamially omputed whilst
the instrutions were in ight in the pipeline. When these ahes were ombined
with branh predition and speulation even more ILP, and performane, ould be
extrated from the input instrution stream. In these arhitetures, the retirement
of instrutions was linked to an arhitetural state (potentially implemented via a
reorder buer) that, if a mis-predition ourred, would have to be rolled bak, and
the instrution feth re-started from the alternative branh. Also, if a proessor were
to implement preise interrupts, for example to implement proessor exeptions, then
a similar roll-bak, or ompletion, of in-ight instrutions would need to our to
ensure that the proessor would be in an arhitetural state that would be onsistent
with the sequential program state.
2.3. Parallel Arhitetures 9
2.3 Parallel Arhitetures
The roll-bak impliitly implemented within super-salar arhitetures has been
viewed as a problem: the inreased state due to deeper pipelines makes the hips
muh more omplex. This inreasing omplexity has been viewed as one of the lim-
its to the salability of the super-salar arhiteture. The impliit assumption in
the von Neumann arhiteture underlies this design, therefore more radial alterna-
tives would need to be researhed if inreased performane may be obtained under
suh onstraints, for example data-ow based ompilers [12, 99℄ and omputer sys-
tems [48, 57℄. But the data-ow arhiteture itself had problems: the arhitetural
state was reeted in inreasingly many registers, with inreasingly many ports, thus
ompliating hip design, in a similar manner to the VLIW register problems.
The implementation of large quantities of memory with mixed execution units
may be seen to have led to a few avenues of research. The ones that are pertinent
to this thesis are:
• EARTH and CARE,
• the micro-threaded architecture,
• and cellular architectures such as IBM BlueGene/C and Cyclops.
In general these arhitetures examine various tehniques by whih the exess per-
formane of the exeution units may be used to ameliorate the relatively limited
instrution and data throughput rate from the memory subsystems. Threading the
program attempts to divide the sequential program into data and ontrol dependent
threads. These dependenies imply a partial exeution order upon the threads that
must be satised to maintain the onsisteny of the original program, as expressed
by the programmer in the target language, whih has often been a sequential lan-
guage. By this tehnique the von Neumann arhitetural onept of strit instrution
feth-deode-exeute-writebak ould be avoided. Instead there ould be, eetively
multiple exeution units, eah exeuting as a von Neumann arhiteture, within a
whole arhiteture that would be applied to the program as a whole, thus attempting
to mine suh ILP as may be available within that program.
2.3. Parallel Arhitetures 10
2.3.1 EARTH, the EARTH ompiler and CARE
2.3.1.1 The EARTH arhiteture
The EARTH arhiteture [53℄, was omposed of: a synhronization proessor and
an exeution proessor, linked by two queues. The program would be written in
Threaded-C, suh that those threads within the program would be sheduled by an
synhronization unit to exeute on onneted exeution unit, but only if all of the
related dependenies had been satised. Due to the multi-proessor nature of the
arhiteture the thread size would be hosen to optimize exeution so that any redu-
tion in eieny due to long lateny delays aused by inter-proessor ommuniation
would be minimized. These delays ould be of many orders of magnitude longer than
latenies due to branh mis-preditions, or loal memory aesses. Threaded-C re-
quired the programmer to annotate their program with thread onstrutors to diret
the ompiler to generate multi-threaded ode.
2.3.1.2 The EARTH Compiler
To overome the neessity for the programmer to annotate the program, Tang in
his work [109℄, desribes a ompiler that was able to take a C program and suitably
annotate it with the appropriate threads. Most importantly this ould be done
without the programmer's intervention.
The tehnique desribed in [109℄ is as follows: the ompiler tried to identify, with
the potential aid of type modiers, those operations that may have aused long la-
tenies. Those memory aesses would be labeled using the loal or remote type
modier, and if no modier were used the ompiler had to assume that the aess
was remote, therefore the type modier would be remote. The remote type modier
indiated to the ompiler that the memory aess would be of long lateny. These
long-lateny operations, for example, memory aesses or funtion alls, would then
be split into two threads. The rst thread was the original thread and the seond
thread ontained the ode that was data-dependent upon the long-lateny opera-
tion. To ensure that the data dependene was satised a synhronization variable
was introdued, suh that the seond thread waited upon this synhronization ob-
2.3. Parallel Arhitetures 11
jet before it ould exeute, whih [109℄ terms as the split-phase onstraint. To
generate these threads the ompiler reated a data dependene graph of the input
program, with the edges in the graph being labeled as remote and loal. Those
remote edges would be split by the ompiler using the split-phase onstraint. The
ompiler also builds up a program dependene ow graph in whih the data and
ontrol dependenies of the program were hierarhially aptured. This graph in-
luded the threaded representation of the original program from whih the ompiler
then identied an optimal order that satised all of the onstraints. This graph also
allowed the ompiler to identify further optimizations:
• To redue thread swithing osts, ontrol and data independent threads should
be merged. This was done by omputing the remote level of eah node, and
merging those that have the same remote level.
• Within a thread, registers should be re-used and data shared with other in-
strutions within the thread, to enhane loality and sequential performane
of the instrution stream.
• Long lateny operations ould be overed by ontrol and data independent
loal operations, providing that the overall ontrol and data dependenies are
satised.
In [109], Tang showed that the optimization problem posed by combining the above
details and minimizing the total execution time was NP-hard. Thus an alternative
partitioning algorithm was required, to minimize compilation time. Tang showed
that the list-based scheduling algorithm selected was no worse than twice as slow as
an optimal schedule of the nodes. This bound may be improved upon by reducing
the cost of remote communication. Tang also examined the use of various heap-based
analyses to aid the thread partitioner so that it can create more threads, if
required.
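To make the idea of list-based scheduling concrete, the following is a minimal sketch of a generic greedy list scheduler for a dependence graph. It is an assumption-laden illustration rather than the algorithm of [109]: the function name `list_schedule`, the cost model (one fixed execution time per node, no communication cost) and the tie-breaking rules are hypothetical, and no claim is made that this particular variant achieves the bound quoted above.

```python
def list_schedule(cost, deps, n_units):
    """Greedily place the nodes of a dependence DAG onto n_units identical units.

    cost: {node: execution time}
    deps: {node: set of nodes that must finish first}
    Returns ({node: (start time, unit index)}, makespan).
    """
    unmet = {n: set(deps.get(n, ())) for n in cost}
    finish = {}                      # node -> time at which it completes
    placement = {}                   # node -> (start time, unit)
    free_at = [0] * n_units          # time at which each unit becomes idle
    ready = sorted(n for n in cost if not unmet[n])
    while ready:
        node = ready.pop(0)
        # Pick the unit that becomes idle first (lowest index on ties).
        unit = min(range(n_units), key=free_at.__getitem__)
        # A node cannot begin before all of its dependencies have finished.
        begin = max([free_at[unit]] + [finish[d] for d in deps.get(node, ())])
        finish[node] = begin + cost[node]
        free_at[unit] = finish[node]
        placement[node] = (begin, unit)
        # Release any node whose last unmet dependence was this one.
        for other in cost:
            if node in unmet[other]:
                unmet[other].discard(node)
                if not unmet[other]:
                    ready.append(other)
    if len(placement) != len(cost):
        raise ValueError("dependence graph contains a cycle")
    return placement, max(finish.values(), default=0)
```

The scheduler never looks ahead, which keeps it cheap to run at compile time; the price is that the order of the ready list can cost some run-time relative to an optimal schedule.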
The results presented in [109] showed that for randomly generated program
graphs, the list-based, thread-scheduling algorithm produced code that was within
7% of the ideal run-time, which was close to an optimal schedule. Also, for the
custom benchmarks used by the paper, the thread scheduler produced code that was
comparable in performance to optimized, hand-written code. Their results showed
that the heap analysis technique improved the performance of the scheduler, which
made use of the heap analysis to optimize the thread performance.
2.3.1.3 The CARE Architecture
In [75] the basis of the large threads implemented within EARTH was
re-examined. In this case the threads were much reduced in size. The concept behind
CARE was that the instruction fetcher within the pipeline required more guidance to
be able to fetch instruction pointers to single-entry single-exit basic blocks, termed
strands, that could be executed without stalls within the pipeline. Therefore during
execution, the instruction fetcher would have an opportunity to identify other such
strands for subsequent execution. Indeed each strand would have a set of associated
firing rules that, if satisfied, would allow that strand to be scheduled for subsequent,
stall-free, execution. These firing rules would represent the data and control
dependencies upon which the instructions within the strand depend. Thus the instruction
fetch unit would contain a set of strands that have all of their firing rules satisfied,
ready to be executed, and another set of strands, which are awaiting their firing
rules to be satisfied. The compiler, in this architecture, would create the strands,
and identify the firing rules and populate that data structure. Moreover, the initial
ordering of the strands within the instruction stream would be performed by the
compiler. But the architecture, at run-time, would be allowed to re-order strands, if
their firing rules were satisfied.
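The two sets of strands described above may be sketched as follows. This is a
software illustration only, not the CARE hardware: the Strand class, the strand
names, and the modelling of a firing rule as "the named strand has completed" are
all assumptions made for the example.

```python
from collections import deque


class Strand:
    """A basic block plus the firing rules it still waits upon."""

    def __init__(self, name, firing_rules):
        self.name = name
        self.pending = set(firing_rules)   # unsatisfied firing rules


def run(strands):
    """Execute strands in an order permitted by their firing rules."""
    ready = deque(s for s in strands if not s.pending)
    waiting = [s for s in strands if s.pending]
    order = []
    while ready:
        strand = ready.popleft()
        order.append(strand.name)          # "execute" the strand
        # Completion satisfies every rule that named this strand.
        still_waiting = []
        for w in waiting:
            w.pending.discard(strand.name)
            (still_waiting if w.pending else ready).append(w)
        waiting = still_waiting
    return order


# Compiler-chosen initial order; the run-time order may differ.
order = run([
    Strand("load_a", []),
    Strand("add", ["load_a", "load_b"]),
    Strand("load_b", []),
    Strand("store", ["add"]),
])
```

In this hypothetical example the strand "add" is placed second by the compiler,
but is executed third, once both of its firing rules have been satisfied.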
2.3.2 The Miro-Threaded Arhiteture
In [13] a mathematical model was presented that examined the latencies from a
generalized memory unit, modeled as a queue, to a generalized processing unit,
i.e. requests for data. Their results for a local memory system, as opposed to
networked, are reproduced in figure 2.1. They demonstrated that to obtain over
80% performance there need only be over 4 threads ready for execution at any one
instant in the program. This result was independent of the type of input program.
2.3. Parallel Arhitetures 13
0.0 2.0 4.0 6.0 8.0 10.0
Number of micro−threads.
0.0
0.2
0.4
0.6
0.8
1.0
N
or
m
al
iz
ed
 P
er
fo
rm
an
ce
.
R=0.5T
R=T
R=2T
T = Maximum throughput of memory sub−system.
R = Memory requests per unit time.
Figure 2.1: P (n) for onventional memory with L0 = 1/T0, taken from [13℄.
It was also independent of the exact memory sub-system implementation. Indeed
the only assumption that was made was the fact that the processor architecture can
support micro-threading, the exact implementation of the micro-threading being
abstracted out of the model. From the studies of available ILP within general
programs, it would seem that the implementation of the technique of micro-threading
in a processor would be extremely effective in maintaining processor throughput
during memory loads. An important property of the micro-threads described in [13]
was that the cost of thread creation, destruction and synchronization must be very
cheap, due to the number and frequent switching of micro-threads. This property
of micro-threads implied that there must be efficient hardware support for them.
To transform a generic program into a micro-threaded program implied that the
control constraints within the sequential program must be transformed into thread
creation and synchronization constraints. This task would be achieved by a
micro-threading stage within a suitable compiler. Further work [68] within this field has
demonstrated the feasibility of such an architecture. A simple schematic of their
2.3. Parallel Arhitetures 14
Fur
Next address ... deterministic
horizontal transfer of control.
Threads that are
ready for execution.
Next address ... non−deterministic
vertical transfer of control.
therpipeline
Stages
Continuation queue & transfer of control.
Schematic of a micro−threaded, RISC architecture:
Instruction
fetch logic
PC 1
PC 2
PC 3
PC 4
PC 5
Continuation
Queue
Figure 2.2: Shemati of a miro-threaded, RISC arhiteture.
implementation is provided in gure 2.2.
In this arhiteture, there are many very short threads, perhaps only 2-5 in-
strutions in length. They wait upon only one data item, that may be viewed as a
simplied version of the ring rules of CARE. The PCs of those threads that are
ready to exeute are stored in a ontinuation queue, for eventual exeution within
the pipeline. Beause of the arhitetural speed of thread reation, synhronization
and destrution, no speulation would be done: all of those features would be on-
verted into miro-threads, thus the exeution pipeline ould be a relatively simple
RISC-like pipeline.
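The role of the continuation queue may be illustrated with a toy, cycle-by-cycle
simulation. This is a sketch under stated assumptions and is not the model of [13]:
every micro-thread is assumed to execute a fixed burst of instructions and then to
wait a fixed latency for its single data item, and the burst length, latency and
cycle count are hypothetical values.

```python
from collections import deque


def utilisation(n_threads, burst=3, latency=9, cycles=1200):
    """Fraction of cycles in which the pipeline executes an instruction."""
    queue = deque(range(n_threads))   # continuation queue of ready threads
    wakeups = {}                      # cycle -> threads whose data arrives
    busy = 0
    current, left = None, 0
    for cycle in range(cycles):
        # Threads whose data item has arrived rejoin the queue.
        queue.extend(wakeups.pop(cycle, []))
        if current is None and queue:
            current, left = queue.popleft(), burst
        if current is not None:
            busy += 1
            left -= 1
            if left == 0:
                # The thread now waits `latency` cycles for its data item.
                wakeups.setdefault(cycle + 1 + latency, []).append(current)
                current = None
    return busy / cycles
```

With these hypothetical values one thread keeps the pipeline busy for a quarter
of the time, and four threads are enough to hide the latency completely; the
specific figures depend entirely on the assumed burst length and latency.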
2.3.3 IBM BlueGene/C and Cyclops
This arhiteture will be disussed in muh more detail in hapter 3.3 of this thesis,
but for the purposes of this setion, a brief summary will sue. This arhiteture
was a PIM-like arhiteture, termed ellular, that implements a number of exeution
units and memory units on one die. Thus it has the ability to exeute many threads,
has fast memory aess, and may be viewed, in some sense, as between the EARTH
2.3. Parallel Arhitetures 15
arhiteture and the miro-thread arhiteture, in terms of a threading model.
In order to overome the von Neumann-derived memory wall, some method of
overoming the impliit data-feth delay should be implemented within the arhite-
ture. Moreover, suh implementations usually imply multiple threads of exeution,
whih further implies data and ontrol dependenies that must be resolved, either
at ompile or run-time:
• Within EARTH and CARE this is at compile-time: the synchronization unit
has explicit dependencies upon which it must wait, which have been generated
at compile time.
• Within micro-threaded architectures, these dependencies may be left to be
resolved at run-time, as long as all potentially data-dependent instructions
are suitably annotated by the compiler.
• Within Cyclops, as will be presented in section 3.3, the control and data
dependencies are much more complex due to the increased complexity of the
architecture and the massive parallelism it makes available.
Ultimately this implies that some technique must be used, either explicitly or
implicitly by the programmer, to generate the required threads for the architecture.
During the research program, I chose to concentrate upon the Cyclops
architecture for the rest of the program, as an example of the problems with programming
for such sets of threaded architectures.
Chapter 3
The limitations of super-scalar
architectures: the memory wall
The combination of data and instruction caches effectively decouples the processor
from the speed of the main-memory, by simply introducing more layers of caches
in the memory hierarchy. This decoupling has been highly successful: the increase
in performance of processors of the past decades has been greatly influenced by
the dramatic increase in clock speed. The original 8086 was clocked at roughly
4MHz with no instruction cache; the latest Pentium 4s have been clocked at over
3.4GHz [56]. These latest Pentiums could retire instructions at a rate of roughly
ten-times the main memory speed by using two to three cache levels. But to get such
high speeds the pipeline depth has had to be increased. The Pentium 4 has over 20
stages; the AMD Opteron has 10-12 stages, and has been clocked at approximately
2.6GHz. With these processors, if a branch mis-prediction or processor exception
should occur and the state would have to be rolled back, then instruction fetch
and the pipeline must be restarted, so it would take increasingly long in a 20 stage
pipeline to begin retiring instructions after the restart. The accuracy of the branch
predictor has been paramount, to avoid such time-consuming re-starts. But if the
processor speed were to increase, then more stages may be needed, and
branch-mispredictions would become even more costly. Moreover, the latency of
instruction fetch from the mis-predicted branch would increase due to the divergent
relative speeds of the processor and main memory. This problem has been termed
the memory wall [117].
3.1 Multiple cores and massively parallel architectures
The problem of the memory wall may be viewed as an effect of the relative
performance difference of main memory to processor speed. If the instruction throughput
could be increased by reading instructions from different memory banks, then the
instruction issue rate is potentially limited by the number of available memory banks,
and IF stages attached to them.
Multi-ore proessors develop this idea. Let us suppose that the OS supports pre-
emptive multi-tasking, and these OS-level threads are guaranteed to have the inter-
thread, data-dependenies expliitly speied using kernel-level (thus arhitetural)
synhronization primitives. If the resoures used for developing higher lok speeds
were instead used in implementing another ore within the proessor pakage, this
extra ore would be viewed as an extra proessor by the OS for sheduling threads
upon. Moreover, if the program were suitably written, it ould take advantage of
any extra proessor resoures. But this requires extensive and potentially diult
modiations to the soure ode to allow it to take advantage of suh extra resoures.
Moreover, the use of OS-level threads is expensive: they have a lot of ontext,
beause eah thread must not only retain the proessor state, but the OS state, if it
were to be ontext-swithed o the proessor. Arhitetural-level threading would
seem to be a faster and more simple approah. Another limitation with multiple
proessor ores is that the proessor ores take die spae away from the ahes and
branh-preditors, that are proven, high-performane solutions.
Furthermore, there are costs associated with switching between OS-level
threads with considerable context. These costs include the memory access times
to flush the OS and architectural states to main memory, and the instruction- and
data-cache misses inherent in such a context switch. A technique to avoid these
latencies may be to reduce the thread context to a level such that any such context
could be maintained in the processor, without having to be flushed to a lower level
of the memory hierarchy. But this implies a dramatic reduction in context: for
micro-threading the context has been limited to only a program counter - an extreme
example. Moreover this reduction also implies that these threads would be unlikely
to be managed at the OS-level.
To counter the memory latencies inherent in the super-scalar designs, the
approach of placing the execution pipelines as close to the memory as possible may
be taken. In this context close means that the memory and the execution pipelines
are on the same die. Each instruction fetch stage and the data-bus of a pipeline
would be fed directly from an independent bank of memory. Thus the instruction
fetches and, more importantly, data reads and writes can occur independently of
other pipelines on the same die and other dies. This integration has the advantage
that the latency of memory access would be dramatically reduced. But to allow such
integration, the pipelines are usually much simpler than a super-scalar pipeline.
Often they have no branch prediction, thus no speculation, which allows the space
that a pipeline consumes on the die to be dramatically reduced, thus allowing more
memory and more execution units per die. For example, in the picoChip [33] design
there are approximately 308 VLIW cores and a similar number of DSP pipelines
on one die, with each VLIW core having direct, 1-clock cycle access to
approximately 64K of RAM. Alternatively in the IBM BlueGene/C design [4], described
in section 3.3, more sophisticated 64-bit cores are implemented with approximately
64K of software-controlled data cache, and another 4Gb of RAM on chip, but with
a reduced number of pipelines, in this case approximately 96. Such chips offer a
considerable instruction retirement throughput. To further increase the bandwidth,
the picoChip has 4 ports implemented on it for accessing other picoChips in a grid
arrangement, and a memory port for accessing off-chip memory. To date, arrays of
up to 16 picoChips have been built. The IBM BlueGene/C design has 6 inter-chip
connection ports, allowing a cubic array of chips. Such an arrangement of IBM
BlueGene/C chips has been termed a cellular architecture: each cell would be
an IBM BlueGene/C chip. The size of the entire IBM BlueGene/C array has been
envisaged to scale up to potentially 10,000,000 individual cells.
3.2 The programming models: from compilers to libraries
With suh ompute bandwidth, and parallelism, a number of problems for the pro-
grammer have been raised, primarily these are foused on the problems of memory
reads and writes. Super-salar hips have had mehanisms to hide these problems
from the programmer, but the ellular hips suh as pioChip and IBM BlueGene/C
do not. Thus the programmer needs to know how memory reads and writes interat
with:
• the software-ontrolled data-ahe attahed to that pipeline,
• the software-ontrolled data-ahe of other on-hip pipelines,
• any global on-hip memory,
• the software ontrolled data-ahes of other o-hip pipelines,
• the global on-hip memory that is on any other hips,
• any global memory that is not on any hip
• and nally, given the massive parallelism available, how to make eient use
of it.
These issues give rise to various programming models, but initially the last point
will be discussed. Given the evidence of ILP studies, general-purpose programs
such as SPEC2000 are highly unlikely to be parallelizable to the extent of using
even a fraction of the resources of the IBM BlueGene/C design, and similarly with
the smaller picoChip. The answer would be
that these architectures eschew the aspiration of being practical for general-purpose
use. Instead they target specific, embarrassingly-parallel problem domains.
For a programmer, the memory access models are important to understand, or to
have a library or compiler that hides the details from the applications programmer.
In the remainder of this thesis the author will focus on the IBM BlueGene/C
architecture, and a prototype implementation of it called Cyclops, that was implemented
at CAPSL at the University of Delaware in collaboration with the University of
Hertfordshire. In the following sections the memory access models will be discussed,
leading on to a presentation of the author's experience in developing a program
for such an architecture. The experience gained from this will allow the author to
discuss the major problems that were faced, how, if at all, they were overcome, and
the outstanding problem domains that, in the author's experience, would hinder the
acceptance of multi-core chips and, moreover, such massively parallel designs as IBM
BlueGene/C.
3.3 IBM BlueGene/C, Cyclops and DIMES/P: the
implementation of a cellular architecture
The IBM BlueGene/C arhiteture is desribed in detail in [4℄. Briey, this arhite-
ture onsists of a large number of thread units, an equal number of memory banks
and a large rossbar on one die. The exeution thread-units are linked to eah other
and the memory banks via the rossbar, whih also has at least 8 o-hip interon-
nets. These interonnets may be used to onnet more of these hips together in
a large 3-d mesh. Of the order of 160 thread units are on a single die, with the
order of 2-4 Gbytes of DRAM, on-hip. This means that per hip there is a large
amount of available parallelism, and onsidering that the 3-d mesh may ontain of
the order of 100,000 of suh hips. A further fator in this design is that there is
no data ahe: instead there is a speialized portion of eah DRAM bank that is
diretly aessible via a related thread unit. Suh a portion of the DRAM is termed
the srath-pad memory, and is eetively a software ontrolled data ahe. This
srath-pad memory is aessible from that related thread unit without having to a-
ess the rossbar. The other memory, not assoiated with any partiular thread-unit
is termed as on-hip memory . This gives rise to dierent memory aess models.
These memory aess models are related to the work on loation-onsisteny, de-
sribed in [40, 124℄. In brief, this is the onept that if a set of memory loations
are aessed from two dierent thread units, the thread units will experiene dif-
ferent memory aess models of those memory loations, upon simultaneous aess.
3.4. Programming Models on Cellular Arhitetures 21
For example: simultaneous aesses, by dierent thread units, to loation 1 might
provide program onsisteny as the memory aess model, whereas for loation 2,
with simultaneous aesses, by dierent thread units, this might provide sequential
onsisteny. With regards to IBM BlueGene/C the srath-pad memory only guar-
antees program onsisteny with regards to memory aesses. But for any memory
aessed via the rossbar, it guarantees sequential onsisteny.
At CAPSL muh work had been done in ollaboration with IBM with regards
to an implementation of the BlueGene/C arhiteture alled Cylops. Initially,
this work was implementing CylopsE [21℄, whih was developed into Cylops64,
[30℄. The CylopsE arhiteture was prototyped in hardware, alled DIMES/P,
[90,91℄. DIMES/P was used as the platform for exeuting the programming example,
desribed in setion 4.
With regards to any later discussions, it is very important to remember that
each of these architectures, IBM BlueGene/C, Cyclops64, CyclopsE and DIMES/P,
displays the same features: multiple thread units and multiple memory consistency
models. This is simply because they are all implementations of these same
underlying concepts.
3.4 Programming Models on Cellular Architectures
The hardware dierenes between ellular and super-salar arhitetures indiate
that dierent programming models, are required to make eetive use of the ellular
arhitetures [40, 41, 120℄. In the rst two of those three papers, their author pro-
pose the use of a ombination of exeution models and memory models, as already
desribed in setions 3.2 and 3.3.
The primary onerns when programming DIMES/P, and thus any Cylops-
based arhiteture, were:
• How to manage the potentially large numbers of threads.
• How to easily express any parallelism within the input soure-ode.
3.5. Programming for Cylops 22
• How to make orret, and most eetive use, of the memory onsisteny mod-
els.
Some researh has already been done regarding programming models for the thread-
ing, suh as using thread perolation as a tehnique to perform dynami load-
balaning [18, 53, 61℄. Another piee of researh [22℄ investigated using multi-level
sheduling-shemes: a work-stealing algorithm at the higher-level and a multi-
threading tehnique at the lower-level to hide ommuniation latenies. A further
piee of researh [37℄ investigated the use of laments as lightweight threads to
eiently implement thread ontrol.
3.5 Programming for Cyclops
Cylops has a set of partiular onerns assoiated with programming for it, some
of whih have been investigated, but for alternative arhitetures. For Cylops, a
reasonable tehnique for implementing memory onsisteny models, thread manage-
ment, and nally making use of any parallelism was investigated.
This started with investigating how to easily implement the memory-consistency
models. This was relatively simple: earlier, unpublished, work on the GCC-based
compiler had implemented a simple algorithm: all static variables were stored in
on-chip memory, and the function call stack, including all automatic variables, was
placed in the scratch-pad memory.
As there was no language-level support for thread management, a library had
to be implemented to support the thread management instructions in the Cyclops
ISA. An early version of TNT [29, 31], called threads was used as the basis for
creating a higher-level C++ abstraction. The author considered that the thread
implementation, that closely followed a POSIX-Threads API, was far too primitive
to be effectively used for programming Cyclops. The simple C++ API that was
developed also included thread-management, critical-sections, mutexes and event
objects to allow for easier management of the lower-level objects.
An abstration of the extration of parallelism from the range of possible ex-
ample programs was not implemented for this thesis, as this was onsidered to be
3.5. Programming for Cylops 23
potentially too losely oupled to the atual program in question. In the author's
opinion, not performing this abstration of parallelism was awed, beause it is
where the ruial, further generalisations take plae that allow a programmer to im-
plement an algorithm with far less regard for the underlying arhitetural features.
Thus the programmer would obtain muh greater benets from this more powerful
abstration.
To test these ideas, and the Cyclops architecture, a simple program was chosen.
It had the properties that it was a small problem and embarrassingly parallel, ideally
suited to CyclopsE. Thus an implementation of a program to generate Mandelbrot
sets was created, which will be described in chapter 4.
Chapter 4
Programming the Mandelbrot Set
Algorithm for Cyclops
In this chapter, which is a more detailed description of the work done in appendix
C, the salient details of the Mandelbrot set and an informal algorithm will be given
for generating the set. How this algorithm may be multi-threaded is presented, with
particular attention to the implementation used for DIMES/P [90]. This is a
prototype of the DIMES hardware that implements a reduced version of CyclopsE [21].
Alternative algorithms are also presented, but were not implemented. A description
of how the threaded algorithm was implemented on the DIMES/P platform will be
presented, followed by an example of the application running and the operation of
the work-stealing algorithm.
Further details and the various presentations which were based upon this work
are given in appendices B (this was a presentation given to the University of
Hertfordshire, upon my return from CAPSL, introducing DIMES and my work done there),
A (this was a draft paper prepared at CAPSL in collaboration with Dr. Egan, for
submission to various conferences) and C (this was a conference paper that has been
accepted for publication at ACSAC06).
4.1 An Introdution to the Mandelbrot Set
The Mandelbrot [10, 72] set is intimately related to the Julia set¹ [60], discovered
in the 1910s. They are both mathematical entities called fractals, relating to the
fact that they have a non-integer dimension. Fractals are part of the branch of
mathematics called Chaos Theory, which may be defined as the term for those
theories relating to pseudo-random mappings and functions. The applications of
Chaos Theory are widely varied and include compression [85],
cryptography [32], economics [82], seismology [114], the shape of naturally occurring
objects [10] such as clouds, trees [6] and landscapes, and medicine, such as the modelling
of fibrillation in the human heart [44], all of which is apart from the pure mathematical
or aesthetic nature of the objects.
Both the Mandelbrot and Julia sets may be created by iteration of a very simple
equation:
$z_{n+1} = z_n^2 + c$    (4.1)
In this equation, $z_n$ is a complex number, where $z_0 = 0$. c is also a complex
number, which is initialised to a value constant throughout the iterations. The
iteration of equation 4.1 terminates when:
1. Either n reaches the so-called maximum iteration value, m, a fixed constant,
greater than zero.
2. Or $|z_n|$ exceeds the so-called bailout value of 2, usually tested by comparing
$|z_n|^2$ with the real value 4, for efficiency reasons. It has been proven that
$|z_n| \to \infty$ once $|z_n| > 2$.
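The escape bound quoted above may be justified by a short, standard argument
using the reverse triangle inequality. The sketch below assumes $|c| \le 2$ and a
strict excess $\varepsilon > 0$ over the bailout value; both are assumptions of the
sketch rather than statements taken from the cited proofs.

```latex
% Assume |c| \le 2 and |z_n| = 2 + \varepsilon with \varepsilon > 0.
\begin{align*}
|z_{n+1}| = |z_n^2 + c|
  &\ge |z_n|^2 - |c| \\
  &\ge (2 + \varepsilon)^2 - 2 \\
  &= 2 + 4\varepsilon + \varepsilon^2 \\
  &\ge 2 + 4\varepsilon .
\end{align*}
% The excess over 2 at least quadruples at every step, so |z_n| \to \infty.
```

With $\varepsilon = 0$ the argument does not apply: for $c = -2$ the iterates are
$0, -2, 2, 2, \ldots$ and remain at modulus 2. If $|c| > 2$ then $z_1 = c$ already lies
outside the circle of radius 2, and a similar argument shows that the iteration
diverges.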
To generate the Mandelbrot set, algorithm 1 is used.
Usually the selection of c is not random, but a raster-scan of the complex
plane. It is not necessary to scan the whole of the complex plane, as a property
of the Mandelbrot set is that it is entirely contained within the circle of radius 2,
centred on the origin of the complex plane. Another important property of the
conversion to floating-point arithmetic is that the distance between the successively
selected points c is a finite number representable by a floating-point number, and
non-zero. In other terms, this distance is the resolution at which the set is created.
¹ Each point in the Mandelbrot set is an index into the Julia set for that point.
Figure 4.1: The classic Mandelbrot set image generated by Fractint [119]. Points
coloured black are in M.
Algorithm 1 The classic algorithm used to generate the Mandelbrot set.
1. Set the value of m, the maximum iterations, greater than zero.
2. Select a point from the complex plane, and set c to that value.
3. Initialise n = 0, $z_0 = 0$.
4. Execute equation 4.1.
5. Increment n.
6. If $|z_n|^2 > 4$ then that c is not in the set of points which comprise the
Mandelbrot set. Go to 2.
7. If n > m then that c is in the Mandelbrot set, i.e. c ∈ M. Go to 2.
8. Go to 4.
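Algorithm 1, combined with a raster-scan of the complex plane, may be sketched
as follows. This is an illustrative sketch only: the function names, the default
value of m and the text output are hypothetical, the loop stops when n reaches m,
and the escape test is the strict comparison $|z_n|^2 > 4$, chosen so that boundary
points such as c = -2 are retained.

```python
def classify(c, m=100):
    """Return (in_set, n): whether c is classified as being in M, and
    the iteration n at which the loop terminated."""
    z = 0j
    for n in range(1, m + 1):
        z = z * z + c                      # equation 4.1
        if z.real * z.real + z.imag * z.imag > 4.0:
            return False, n                # escaped: c is not in M
    return True, m                         # reached the maximum iteration


def raster(width, height, m=100):
    """Raster-scan the square [-2, 2] x [-2, 2]; return rows of text."""
    rows = []
    for j in range(height):
        im = 2.0 - 4.0 * j / (height - 1)
        row = ""
        for i in range(width):
            re = -2.0 + 4.0 * i / (width - 1)
            row += "#" if classify(complex(re, im), m)[0] else "."
        rows.append(row)
    return rows


rows = raster(9, 5)
```

The spacing between successive values of re and im is the resolution discussed
above; the returned n is the value from which a false-colour image could be
derived.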
Usually the set of points M is displayed as an image, with those points in the
set coloured to contrast with those that are not in the set. This gives the classic
image in figure 4.1. The black region in figure 4.1 is a basin of stability of algorithm
1. Those points of which it comprises remain within a finite distance of the origin,
i.e. $|z_n| < \infty$. Those outside this region are unstable, and eventually $|z_n| \to \infty$.
More commonly, the points c have a value assigned to them that is derived from
n, the iteration at which algorithm 1 terminated for that point. This gives a
false-colour image, as shown in figure 4.2, in which the points c of similar colour are
termed level-sets, basins of stability identified by the algorithm that enclose the
Mandelbrot set.
Figure 4.2: A false-colour image of the Mandelbrot set generated by Aleph One
[71].
4.2 Threading and the Mandelbrot Set
An important property of algorithm 1 to generate the Mandelbrot set is that the
classification of each c in the complex plane is independent of the classification of
any other c. Therefore the Mandelbrot set may be implemented as a massively
parallel application, thus potentially suited to cellular architectures. Studies of
alternative implementations for different architectures, such as fine-grain
threaded-architectures [37] and NUMA architectures [22] have already been done. For cellular
architectures, another important feature of this classification process is that the
floating point support required may be implemented in fixed-point arithmetic using
up to 32 bits for the digits, as DIMES/P [90] lacks floating point support.
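How the classification might be carried out in fixed-point arithmetic is sketched
below. The thesis states only that up to 32 bits were used for the digits; the
split into 4 integer bits and 28 fractional bits (Q4.28), the helper names and the
strict escape test are assumptions of this sketch. Python integers do not
overflow, so the 32-bit limit is not enforced here, and the products are formed in
a wider integer before being shifted back.

```python
FRAC_BITS = 28                 # assumed Q4.28 split of a 32-bit word
ONE = 1 << FRAC_BITS           # the fixed-point representation of 1.0


def to_fixed(x):
    """Convert a float to the assumed fixed-point representation."""
    return int(round(x * ONE))


def fmul(a, b):
    """Fixed-point multiply; the product is formed in a wider integer."""
    return (a * b) >> FRAC_BITS


def classify_fixed(c_re, c_im, m=100):
    """Fixed-point escape test for c = c_re + i*c_im (both fixed-point)."""
    z_re = z_im = 0
    for n in range(1, m + 1):
        re2, im2 = fmul(z_re, z_re), fmul(z_im, z_im)
        z_im = 2 * fmul(z_re, z_im) + c_im   # imaginary part of z^2 + c
        z_re = re2 - im2 + c_re              # real part of z^2 + c
        if fmul(z_re, z_re) + fmul(z_im, z_im) > 4 * ONE:
            return False, n                  # escaped: c is not in M
    return True, m
```

For $|c| \le 2$ the components of z stay below 6 in magnitude until the escape test
fires, since $|z_{n+1}| \le |z_n|^2 + |c| \le 4 + 2$, which is why four integer bits are
assumed to suffice for z itself; the squared terms need the wider intermediate.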
The Mandelbrot set may be implemented using one instance of the algorithm per
thread unit within the cellular-architecture machine. This approach would work well for massive
clusters of cellular computing nodes. (Remember that for an image of 100×100
points, c, 10,000 thread units would be required with this technique.) Moreover,
the classification of any randomly selected c may take between 1 and m iterations
of the algorithm. In general it is not possible to know in advance how long such a c
will take to classify. Therefore the computation time would be approximately m
times the time per iteration loop in algorithm 1.
Due to the properties of DIMES/P [90], this technique was not possible, as there
were only 8 thread units between two processors. The chosen implementation,
derived from the implementation used in [71], had the complex plane divided into a
series of horizontal strips. These strips may be rendered by separate render threads,
as the classification of the points
c within each strip is independent of such classification on other render threads.
Therefore each render thread implements a slightly modified version of algorithm 1,
which is provided in algorithm 2. Only the coordinates for the bounding rectangle
are inter-related between the render threads. However, each strip will, in general,
take a different amount of time to render, thus the render threads will complete
their assigned portion of work at different times. This led to the addition of a
load-balancing algorithm moving uncompleted work to threads that have already
completed their assigned work. Thus a work-stealing algorithm, algorithm 3, was added to
perform the load-balancing between the render threads. Alternative implementations
of the Mandelbrot set using a work-stealing algorithm [22] or fine-grain threaded
algorithm [37] exist.
The updates to the start, x, and nish points of the strips for the render threads
Tc and Tl are performed atomially - the threads are suspended whilst these up-
dates are done, either beause Tc is stopped or beause Tl is stopped by using a
mutex. (A mutex is required as the data to be updated is a two omplex numbers,
x and the nish point, these must both be updated as a pair, atomially. In this
implementation a omplex number onsists of two words - one for the real part,
one for the imaginary part.) This is a dynamic-programming solution to the
load-balancing problem of work distribution between the render threads. Moreover,
the algorithm is robust: if the estimated completion-time, t, has an error, which it
is very likely to have, the algorithm merely performs excessive work-stealing
operations, but automatically tunes to find a local minimum in the total
completion-time curve. Experiments with [71] have shown that the algorithm can
accommodate errors of over 100% in the estimated completion-times, and rapidly
corrects to the new local minimum.
Algorithm 2 The render-thread algorithm.
1. Set the value of m, the maximum iterations, greater than zero. Set the
estimated completion-time, t, to the largest, finite, representable time-period
possible.
2. Set c = x, where x is the top-left of the strip to be rendered.
3. Initialise n = 0, z_0 = 0.
(a) Execute equation 4.1.
(b) Increment n.
(c) If |z_n|^2 ≥ 4 then that c is not in the set of points which comprise the
Mandelbrot set. Go to 4.
(d) If n > m then that c is in the Mandelbrot set, i.e. c ∈ M. Go to 4.
(e) Go to 3a.
4. Increment the real part of c. If the real part of c is less than the width of the
strip to be rendered, go to 3.
5. Calculate the average of t and the time it took to render that line; this becomes
the new value of t.
6. Set the real part of c to the left-hand edge of the strip. Increment the imaginary
part of c. If the imaginary part of c is less than the height of the strip, go to 3.
7. Signal work completed, set t = 0 (thus this thread is guaranteed not to be
selected by the work-stealing algorithm 3).
8. Suspend.
Algorithm 3 The work-stealing algorithm.
1. Monitor render threads for a work-completed signal. The thread that completes
we shall denote as Tc.
2. Find the render thread with the longest estimated completion-time, t; note
that each render thread updates this time upon completion of a line. Call this
thread Tl.
3. Stop Tl when it completes the current line it is rendering.
4. Split the remaining work to be done in the strip equally between the two render
threads Tc and Tl.
5. Restart the render threads Tc and Tl.
6. Go to 1.
4.3 A Disussion of the Work-Stealing Algorithm 5
The algorithm 5 has some important features:
• The bandwidth of the single thread that implements that algorithm is the lim-
iting fator in its ability to sale. Conversely, this algorithm is able to tolerate
failures in render threads and is therefore robust. If a render thread stops
responding, eventually it will be the slowest, unnished render thread, and its
work will be stolen. It is possible to sale this work-stealing algorithm, if one
observes that the work-stealing algorithm operates upon a slie of the omplex
plane, demonstrating that the work-stealing algorithm is reursive. It is pos-
sible to assign strips s0...j of the plane to independent sets of render threads,
governed by their own work-stealing thread. These si strips are monitored by
a work-stealing thread in turn, those strips returning an aggregate estimated
ompletion time. But this has a limitation: One the number of render threads
beomes of the order of the vertial resolution of the image, the ompletion
time is bounded by the maximum time it takes a render thread to generate a
single line. This line for the Mandelbrot set in gures 4.1 and 4.2 is the line
(−2, 0) to (2, 0), whih has the most points within the set. These points take
m time to lassify. As the unit of work in the work-stealing algorithm is a line,
this is the slowest line, and thus the ultimate limit of this algorithm, unless
the resolution is inreased. This disussion leads to the following algorithm:
• If robustness is not required, then the image generated may be viewed as an
array of values, where each of these values is the classification of c. Consider
threads p_0, ..., p_q: each thread p_n initially classifies the point in the array
offset by n and, once completed, moves along the array using a stride of q. This
allows the use of a number of threads that is bounded by the number of points
within the image. An image of resolution 100×100 has 10,000 points, which maps
well on to the cellular architectures as described in [21]. For more thread units,
the image resolution would need to be increased. Unfortunately, this algorithm
does not have a natural ability to tolerate failures in thread units, unlike the
work-stealing algorithm 3.
Figure 4.3: Simplified schematic overview of the DIMES/P implementation of
CyclopsE. (Two processors, 0 and 1, connected by a network; each has 64K of
global memory and four thread units, 0 to 3, each with a 4K scratch pad.)
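The strided scheme in the last bullet may be sketched as follows. This is an
illustrative Python fragment, not code from the DIMES/P program; it assumes
q threads numbered 0 to q-1 and a row-major image array, and the names
indices_for_thread and to_pixel are hypothetical.

```python
def indices_for_thread(n, q, total_points):
    """Array indices classified by thread p_n: start at offset n, stride by q."""
    return range(n, total_points, q)


def to_pixel(index, width):
    """Map a flat array index to (row, column), assuming row-major storage."""
    return divmod(index, width)
```

Under these assumptions every point is visited by exactly one thread, with no
communication between threads, which is also why a failed thread unit leaves
its points unclassified.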
4.4 DIMES/P Implementation of the Mandelbrot-set application
A simplified schematic diagram of the DIMES/P implementation (from [90]) of the
CyclopsE processor is given in figure 4.3. A feature of this architecture is that
the memory models for the two types of memory, the scratch-pad and the global
memories, are different:
• Global memory obeys the Sequential Consistency Model for all thread units.
• Scratch-pad memory obeys the program consistency model for all thread units,
apart from the thread unit to which it is attached.
Such different consistency models affect the way that the data for the
Mandelbrot-set application is arranged in memory; this is discussed in more detail
in section 4.4.1.
Figure 4.4: Layout of the render and work-stealing threads within the DIMES/P
system. (Processor 0, thread units 0 to 3: CRTS - Debug, Main and two render
threads. Processor 1, thread units 0 to 3: three render threads and the work-steal
thread. The 128K global memory holds the 100x100 pixel image.)
The static layout of the render and work-stealing threads within the DIMES/P
system is shown in figure 4.4. The software threads that occupy the thread units
are:
• The CRTS - Debug thread is required for the debugger, if it is executed. As
threads are statically allocated at program start-up, this thread unit must be left
free for the debugger and Cellular Run-Time System (CRTS, not to be confused
with the ANSI/ISO C runtime) support.
• Main is the main loop of the Mandelbrot-set application.
• The Render Threads are the threads that execute algorithm 2.
• Work-Steal Th is the thread that executes algorithm 3. Only one work-stealing
thread was implemented, due to the limited number of thread units per processor.
In principle a render thread could also run on this thread unit, but the CRTS
does not support virtual threads; moreover, the work-stealing thread actually
has to spin in a busy wait, monitoring for completion of a render thread. Hence,
for this application, it was deemed an unnecessary complexity.
Further details regarding the implementation may be found in [70].
4.4.1 The Memory Layout
As far as the programmer is concerned, the two 64k global memory units comprise
a single, contiguous 128k block of memory which is for code and global data. The
programmer has no means of differentiating between them. Moreover, the CyclopsE
design is such that access times to them are the same, no matter which thread unit
of which processor accesses them. The programmer may ensure that data will be
placed in global memory by the compiler by ensuring that it is static. This may be
done by making it global, or by using the C/C++ keyword static. The compiler
places the stack frame into the scratch-pad memory, which means that function
call depth is limited, as there is only 4K of stack space per thread. The
Mandelbrot-set algorithm as described does not need this much space for each
thread unit, thus all thread-local data is placed into the corresponding
scratch-pad memory for performance. The 40,000 bytes of image data (100×100
words, 1 word = 4 bytes) are placed in global memory for implementation reasons.
DIMES/P has no console, thus the only way that communication with DIMES/P can
occur is via the global memory from a specially written program running on the
host computer.
4.4.2 The Host Interfae
The DIMES/P implementation is physially loated on an FPGA on a PCI board,
with speialized hardware and software support for it to ommuniate with the
host omputer for loading programs, and ommuniating results, of whih details
are given in [29, 90℄. A simple ommand-line program was written to periodially
4.4. DIMES/P Implementation of the Mandelbrot-set appliation 34
T2
T0
T1
{
{{
Figure 4.5: The image generated shortly after program start-up.
T2
T0
T1
{{
{
Figure 4.6: Image generation has progressed, shortly before a work-stealing event.
feth the image data from the DIMES/P memory, and save it to a le for subse-
quent display. This program also allows the user to enter image parameter data for
subsequent ontrol of the image rendering on DIMES/P.
4.4.3 Exeution details of the Mandelbrot-set appliation
In this example, there are three threads for simpliity:
1. On program start, the render threads start to exeute and perform their as-
signed work. The assigned work is initially equal
1
3
portions of the total image,
arranged in horizontal strips. The top render thread is denoted by T0, the mid-
dle by T1 and the lower by T2, although this relative position will hange later.
The operation of the render threads may be seen in gure 4.5. No work-stealing
has ourred, so there are just three strips, one per render thread, sanning
from left to right, top to bottom.
2. As the image generation proeeds, the T0 and T2 threads progress faster than
T1, as seen in gure 4.6. Note how T0 has alulated more than T2 - the lighter
areas take longer to alulate, and the strip generated by T0 is blak at the
top, and white at the bottom, but the onverse is true for T2.
4.4. DIMES/P Implementation of the Mandelbrot-set appliation 35
T
T
T2
1
0 {{
{
Figure 4.7: Just after the rst work-stealing operation.
T1
T0
T2 {
{
{
Figure 4.8: The seond work-stealing operation.
3. The rst work-stealing operation has just ourred. T0 nished and T1, the
slowest (mainly white) has had the remaining work divided between it and T0,
see gure 4.7. Note how the end-point of T1 was assigned to be the new end
point of T0 and the new end-point of T1 is the start-point of T0.
4. Shortly after this rst work-stealing operation ours, T2 ompletes its assigned
work. A seond work-stealing operation ours, see gure 4.8. In this ase work
was again stolen from T1, and assigned to T2.
5. After a pause T1 ompletes its assigned work, and another work-stealing op-
eration ours, this time with T0, whih may be seen in gure 4.9.
6. Finally the set is ompleted, see gure 4.10 , with no further work-stealing
operations, as the number of unompleted lines for any render thread is less
T2
T0
T1
{
{{
Figure 4.9: The third work-stealing operation.
4.4. DIMES/P Implementation of the Mandelbrot-set appliation 36
Figure 4.10: The ompleted Mandelbrot set.
than 2, and a line is the minimum unit of work for this algorithm.
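The re-assignment of start and end-points in the steps above can be sketched as
a single work-stealing step. This is a hypothetical Python model, not the
DIMES/P code: the Strip record and the steal function are illustrative names,
and halving the estimate after a split is an assumption, since the text only
says that each render thread updates its own estimate after each line.

```python
from dataclasses import dataclass


@dataclass
class Strip:
    next_line: int    # first line not yet rendered
    end_line: int     # one past the last line assigned to this thread
    estimate: float   # estimated completion-time t (0 once finished)

    def remaining(self):
        return self.end_line - self.next_line


def steal(strips, finished):
    """Perform one work-stealing step; return the victim's name, or None."""
    # Find the unfinished thread with the longest estimated completion-time.
    victim = max((name for name in strips if name != finished),
                 key=lambda name: strips[name].estimate)
    v = strips[victim]
    if v.remaining() < 2:
        return None                       # a line is the minimum unit of work
    mid = v.next_line + v.remaining() // 2
    # The finished thread takes the second half: its new end-point is the
    # victim's old end-point, and the victim's new end-point is its start-point.
    strips[finished] = Strip(mid, v.end_line, v.estimate / 2)  # assumed estimate
    v.end_line = mid
    v.estimate /= 2                       # assumed; the text leaves this open
    return victim
```

For example, with T0 finished and T1 holding lines 40 to 65, the sketch leaves T1
with lines 40 to 52 and gives lines 53 to 65 to T0.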
Chapter 5
List of Achievements
The following goals have been achieved by the author in the course of the MSc(Res)
program:
• A collaboration between the University of Hertfordshire and the CAPSL group,
under Professor Gao at the University of Delaware, was set up by the author.
As part of this collaboration the author worked at the CAPSL group for
approximately 18 months.
• Two departmental seminars regarding this work were presented at CAPSL.
• A poster of the Mandelbrot set implementation on the DIMES/P-2 platform
was shown at Super Computing '03, amongst other posters from the CAPSL
group regarding DIMES.
• Two departmental seminars regarding this work were presented at the
University of Hertfordshire, shortly after the author's return from CAPSL.
• A conference paper by the author has been accepted for the 11th Asia-Pacific
Computer Systems Architecture Conference, titled: The Challenges of Efficient
Code-Generation for Massively Parallel Architectures, also to be presented by
the author. It is included in appendix C.
Chapter 6
Summary
The limitations of DIMES/P prevented further study of the properties of this
program: scalability and timing studies were not done because of the limited
number of thread units (8) and memory capacity (128k). Despite this, the
development of the program was instructive: an initial contention of this thesis
was that the memory models and massive parallelism (i.e. large numbers of
micro-threads) inherent in cellular architectures would make programming for them
hard. This was experienced to differing degrees, relating to the memory model
support, the thread library and therefore the micro-thread support.
With regards to the memory model support, the fact that the compiler made
natural use of language-level syntax to map data into scratch-pad and on-chip
memory (using the C/C++ keyword static) made using these different memory
models easy. But this simplicity was at a price: Cyclops only has word-sized,
atomic memory-operations, and these operations were apparently unused for this
problem. The author contends that the multiple read-modify-write operations that
must be maintained as an atomic unit hampered the performance of the program on
Cyclops, because they could not make use of the hardware-level support for atomic
operations. So the more usual barriers such as mutexes and critical sections were
needed. This implies that the manual locking that had to be applied should really
have been implemented within the compiler-provided support via the static
keyword. If this were the case then it may have been possible for the compiler to
optimize the locking of access to the data, and so improve program performance,
with apparently no impact upon the developer. As already mentioned, certain
members of the C++ standards committee are proposing the extension of the
execution model within the C++ standard to support the concepts of memory
consistency. Whether this proposal will address the problem outlined above is not
yet clear. It is the author's contention that there should be support for such
locking (in some manner implemented within the compiler or a run-time library) if
programs more sophisticated than the one described in this thesis are to be
successfully written for these architectures.
With regards to the thread library: the complexity of POSIX-Threads has been a
hindrance to successful multi-threaded program creation. Indeed this opinion has
been voiced by some members of the C++ standards committee at the ACCU 2004, 2005
and 2006 conferences. The creation of a C++ wrapper to hide thread creation and
destruction, and to combine with that thread any local storage in an efficient
manner, was only partially successful: the concept that a thread is an object has
not been universally accepted, because this means that the data to be manipulated
becomes intimately intermingled with the thread-management code. This would be an
even greater problem when considering micro-threads in that they have little, or
indeed no, context, thus mingling threading constructs with data is potentially in
direct conflict with the design of micro-threads. Even for this simple example
program, this mingling was evident in the work-stealing algorithm, and the way it
interacted with the start and end-points of the worker threads. For more complex,
larger programs, such complexity would be likely to make writing them correctly,
and modifying them later, very hard. Subsequent updates to the thread model, which
became TNT, described in [31], are still largely POSIX-Thread based, which is a
low-level API. The concept that data and execution should be kept separate is
commonly and naturally embodied in programming via the syntax of main. It has
been contended by members of the C++ Standards committee that this pattern
should be duplicated for thread libraries: that there should exist a pool of
threads, to which work is passed. This work would be asynchronously executed on a
thread within the pool, with the results returned from that pool via a wait-able
object. This concept is similar to the data-flow designs that preceded VLIW;
indeed it has been argued that this concept is a software emulation of data-flow.
When considering the harder problem of creating an effective algorithm to
implement micro-threading and representing that in code, clearly the example
program described above was very limited in its achievements. The work-stealing
algorithm was intimately related to the program design. The ability to abstract
the work-stealing operation to other problems would be very limited using that
design. Alternative approaches have been examined, such as in [88], using OpenMP,
which was still used as a library to express the parallelism. But OpenMP maps
poorly to micro-threads: the primitives it implements have, arguably, been too
tied into the process-level parallelism for which it was originally designed.
Alternatively, if one is to consider the suggestion above, of a micro-thread pool
into which work is submitted, then the details of how the pool works become
separated from the work itself. The fact that the pool balances work between
threads using a master-slave, or work-stealing, algorithm should be independent of
the work: a natural division of concepts. If this were the case, then the
programmer would be free to add work to the pool as desired. The parallelism of
the algorithm would be more naturally expressed in terms of operations on data.
If one is to consider this further: the actual executable code (in terms of the
function pointer, in micro-threading terms a program counter) and the data are
passed to the pool together. It could then be possible for the pool to be designed
to make use of data locality and code locality: did a previous thread execute that
code before? If so, prefer to run that work on that thread. If there are cliques
of threads, related due to resource asymmetry, then one might create a pool to
represent the particular feature of that resource. For example a Cyclops chip
might be represented as a micro-thread pool, contained within a greater thread
pool that represents the machine, due to the fact that off-chip memory access
makes use of a message-passing protocol, rather than the crossbar network embedded
within the chip, that allows much more rapid memory access.
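The code-locality preference suggested above might be sketched as follows. This
is purely illustrative: LocalityPool is a hypothetical, single-threaded model of
the placement decision only, and the fall-back to the shortest queue is an
assumption, since the text does not specify what happens when no thread has run
the code before.

```python
class LocalityPool:
    """Model of a pool that prefers the worker that last ran the same code."""

    def __init__(self, workers):
        self.last_code = [None] * workers           # code last run by each worker
        self.queues = [[] for _ in range(workers)]  # pending (code, data) pairs

    def submit(self, func, data):
        """Queue the work and return the index of the chosen worker."""
        for i, code in enumerate(self.last_code):
            if code is func:                        # this worker ran that code
                chosen = i
                break
        else:
            # Assumed fall-back: the worker with the shortest queue.
            chosen = min(range(len(self.queues)),
                         key=lambda i: len(self.queues[i]))
        self.queues[chosen].append((func, data))
        self.last_code[chosen] = func
        return chosen
```

A nested arrangement, such as one pool per chip inside a pool for the whole
machine, could apply the same decision at each level.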
It is still an open question regarding what may be the ideal approach to
implementing parallelism via micro-threading: should it be language-level support
such as UPC, HPF or other language extensions, or within the compiler using
trace-scheduling, or should it be at a library-level using, for example, OpenMP or
POSIX-Threads, or should it be within the architecture, such as the micro-threaded
architectures [13] of Luo et al [68], CARE [75] or Cyclops [31]?
Bibliography
[1] Adam, T.L., Chandy, K.M. and Dickson, J.R., A Comparison of List Schedules
for Parallel Processing Systems, CACM, 17, 12, pp. 685-690, December, 1974.
[2] Ahuja, R.K., Magnanti, T.L. and Orlin, J.B., Network Flows: Theory,
Algorithms and Applications, Prentice-Hall, 1993.
[3] Allen, J.R., Kennedy, K., Porterfield, C. and Warren, J.D., Conversion of
Control Dependence to Data Dependence, Proceedings of the 10th ACM Symposium
on Principles of Programming Languages, pp. 177-89, January 1983.
[4] Almási, G., Cascaval, C., Castaños, J.G., Denneau, M., Lieber, D., Moreira, J.E.
and Warren, H.S., Dissecting Cyclops: Detailed Analysis of a Multithreaded
Architecture, ACM SIGARCH Computer Architecture News, Vol. 31, March 2003.
[5] Amaral, J.N., Gao, G.R., and Tang, X., An Implementation of a Hopfield
Network Kernel on EARTH, CAPSL Technical Paper, 1998.
http://www.capsl.udel.edu
[6] Aono, M., Kunii, T.L., Botanical Tree Image Generation, IEEE Computer
Graphics and Applications 4,5 (1984) 10-33.
[7] Architecture Simulation Framework, http://www.lri.fr/~osmose/ASF/, last
updated 28/06/2001, or http://www-rocq.inria.fr/a3/tools.html.en.
[8] Arvind, Nikhil, R.S. and Pingali, K.K., I-structures: Data Structures for
Parallel Computing, ACM TOPLAS, 11(4): pp. 598-632, October 1989.
[9℄ Bakus, J., Can programming be liberated from the von Neumann style? A
funtional style and its algebra of programs, Communiations of the ACM 21,
8, pp. 613-641, August 1978.
[10℄ Barnsley, M.F., Devaney, R.L., Mandelbrot, B.B., Peitgen, H.-O., Saupe, D.,
Voss, R.F., The Siene of Fratal Images., Springer-Verlag, 1988.
[11℄ G.E. Blelloh, P.B. Gibbons, Y. Matias and G.J. Narlikar, Spae-Eient
Sheduling of Parallelism with Synhronization Variables, Proeedings of the
9
th
ACM Symposium on Parallel Algorithms and Arhitetures (SPAA), June
1997.
[12℄ Bohm A.P.W. and Sargeant, J., Eient Dataow Code Generation for
SISAL, IEEE Transations on Computers, vol.C-38 no.1, pp. 4-14, January
1989.
[13℄ Bolyhevsky, A., Jesshope, C.R. and Muhnik, V.B., Dynami Sheduling in
RISC Arhitetures, IEE. Pro.-Comput. Digit. Teh., Vol. 143, No. 5, Sept.
1996.
[14℄ Bruening, U., Giloi, W.K. and Shroeder-Preikshat, W., Lateny Hiding in
Message Passing Arhitetures, Proeedings of the 8
th
International Parallel
Proessing Symposium [21℄, pp. 704-709.
[15℄ Burger, D., Memory Bandwidth Limitations of Future Miroproessors., ISCA
1996.
[16℄ Burks, A.W., Goldstine, H.H. and von Neumann, J., Preliminary disussion of
the logial design of an eletroni omputing instrument. A.H. In Taub, editor,
John von Neumann Colleted Works, The Mamillan Co., New York, Volume V,
pp. 34-79, 1963.
[17℄ Burtsher, M. and Zorn, B.G., Predition Outome History-based Condene
Estimation for Load Value Predition, The Journal of Instrution-Level Paral-
lelism, vol. 1, February 1999. http://www.jilp.org/vol1
[18] Cai, H., Dynamic Load-Balancing on the EARTH-SP System, Master's Thesis,
McGill University, Montréal, May 1997.
[19] Calder, B. and Grunwald, D., Reducing Indirect Function Call Overhead in
C++ Programs, Proceedings of the 21st Symposium on The Principles of
Programming Languages, pp. 397-408, January, 1994.
[20] Calder, B., Feller, P. and Eustace, A., Value Profiling and Optimization, The
Journal of Instruction-Level Parallelism, vol. 1, February 1999.
http://www.jilp.org/vol1
[21] Cascaval, C., Castaños, J.G., Ceze, L., Denneau, M., Gupta, M., Lieber, D.,
Moreira, J.E., Strauss, K. and Warren, H.S., Evaluation of a Multithreaded
Architecture for Cellular Computing, 8th International Symposium on
High-Performance Computer Architecture (HPCA), February 2002.
[22] Cavalheiro, G.G.H., Doreille, M., Galilée, F., Gautier, T., Roch, J-L.,
Scheduling Parallel Programs on Non-Uniform Memory Architectures, HPCA
Conference, Workshop on Parallel Computing for Irregular Applications WPCIA1,
Orlando, USA, January 1999.
[23] Chang, P.-Y., Hao, E., Yeh, T.-Y. and Patt, Y.N., Branch Classification: A
New Mechanism for Improving Branch Predictor Performance, Proceedings of
MICRO-27, pp. 22-31, Nov-Dec 1994.
[24] Chappell, R.S., Stark, J., Knott, S.P., Reinhardt, S.K. and Patt, Y.N.,
Simultaneous Subordinate Microthreading (sic), Proceedings of the 26th
International Symposium on Computer Architecture, IEEE, 1998.
[25] Cleary, J. and Witten, I., Data Compression using Adaptive Coding and Partial
String Matching, IEEE Transactions on Communications, vol. 32, pp. 396-402,
April 1984.
[26] Cmelik, B. and Keppel, D., Shade: A Fast Instruction-Set Simulator for
Execution Profiling, ACM SIGMETRICS Conference on Measurement and Modelling
of Computer Systems, 1994.
[27℄ Coman, J.R., ed., Computer and Job-Shop Sheduling Theory, John Wiley,
New York, 1976.
[28℄ Cohn, R. and Lowney, P.G., Design and Analysis of Prole based Optimization
in Compaq's (si) Compilation Tools for the Alpha, The Journal of Instrution-
Level Parallelism, vol. 2, January 2000. http://www.jilp.org/vol2
[29℄ del Cuvillo, J.B., Klosiewiz, R. and Zhang, Y., A Software Development Kit
for DIMES., CAPSL Tehnial Note 10, Department of Eletrial and Computer
Engineering, University of Delaware, Newark, Delaware, May 2003, ftp://ftp.
apsl.udel.edu/pub/do/notes/.
[30℄ del Cuvillo, J.B., Zhu, W., Hu, Z. and Gao, G.R., FAST: A Funtionally Au-
rate Simulation Toolset for the Cylops-64 Cellular Arhiteture., Workshop on
Modeling, Benhmarking and Simulation (MoBS), held in onjuntion with the
32nd Annual International Symposium on Computer Arhiteture (ISCA'05),
Madison, Wisonsin, June 4, 2005.
[31℄ del Cuvillo, J.B., Zhu, W., Hu, Z. and Gao, G.R., TiNy Threads: a Thread
Virtual Mahine for the Cylops64 Cellular Arhiteture., Fifth Workshop on
Massively Parallel Proessing (WMPP), held in onjuntion with the 19th Inter-
national Parallel and Distributed Proessing System, Denver, Colorado, April 3
- 8, 2005.
[32℄ Dahselt, F., Kelber, K. and Shwarz, W., Chaoti Coding and Cryptanalysis.,
Proeedings of 1997 IEEE International Symposium on Ciruits and Systems
Ciruits and Systems in the Information Age (New York, USA), vol. 4, May
1998, pp. 518-21.
[33℄ Duller, A., Towner, D., Panesar, G., Gray, A. and Robbins, W., pioArray
tehnology: the tool's story., Proeedings of the Design, Automation and Test
in Europe Conferene and Exhibition, IEEE, 2005.
[34℄ Egan, C., Steven, G. and Vintan, L., Cahed Two-level Adaptive Branh Pre-
ditors with Multiple Stages, In Trends in Network and Pervasive Computing
Bibliography 46
ARCS 2002 (Leture Notes in Computer Siene 2299), Springer-Verlag, pp. 179
191, 2002.
[35] Eggers, S.J., Emer, J.S., Levy, H.M., Lo, J.L., Stamm, R.L. and Tullsen, D.M.,
Simultaneous Multithreading (sic): A Platform for Next-Generation Processors,
IEEE Micro, Vol. 17, No. 5, September/October 1997.
[36] Emami, M., Ghiya, R. and Hendren, L.J., Context-sensitive Interprocedural
(sic) Points-to Analysis in the Presence of Function Pointers, Proceedings of
the ACM SIGPLAN '94 Conference on Programming Language Design and
Implementation, SIGPLAN Notices 29(6), pp. 242-56, June 1994.
[37] Engler, D.R., Andrews, G.R. and Lowenthal, D.K., Filaments: Efficient
Support for Fine-Grain Parallelism, TR 93-13a, Dept. of Computer Science,
University of Arizona, Tucson, 1993.
[38] Fisher, J.A., The Optimization of Horizontal Microcode within and beyond
Basic Blocks: An Application of Processor Scheduling with Resources, PhD
dissertation, University of New York, New York, 1979.
[39] Gao, G.R., Theobald, K.B., Márquez, A., and Sterling, T., The HTMT
Program Execution Model, CAPSL Technical Memo 9, July 1997.
http://www.capsl.udel.edu
[40] Gao, G.R. and Sarkar, V., Location Consistency - a New Memory Model and
Cache Consistency Protocol, IEEE Transactions on Computers, Vol. 49, No. 8,
August 2000.
[41] Gao, G.R., Theobald, K.B., Hu, Z., Wu, H., Lu, J., Sterling, T.L., Pingali, K.,
Stodghill, P., Stevens, R. and Hereld, M., Next Generation System Software for
Future High-End Computing Systems, International Parallel and Distributed
Processing Symposium: IPDPS 2002 Workshops, April 15 - 19, 2002, Fort
Lauderdale, Florida.
[42] Gao, G.R., Theobald, K.B., Govindarajan, R., Leung, C., Hu, Z., Wu, H.,
Lu, J., del Cuvillo, J., Jacquet, A., Janot, V. and Sterling, T.L., Programming
Models and System Software for Future High-End Computing Systems:
Work-in-Progress, International Parallel and Distributed Processing Symposium
(IPDPS'03), April 22 - 26, 2003, Nice, France.
[43] Garey, M.R. and Johnson, D.S., Computers and Intractability: A Guide to the
Theory of NP-Completeness, W.H. Freeman and Co., New York, 1979.
[44] Garfinkel, A., Chen, P.S., Walter, D.O., Karagueuzian, H.S., Kogan, B., Evans,
S.J., Karpoukhin, M., Hwang, C., Uchida, T., Gotoh, M., Nwasokwa, O., Sager,
P. and Weiss, J.N., Quasiperiodicity and chaos in cardiac fibrillation, Journal
of Clinical Investigation, 99(2), pp. 305-14, January 1997.
[45] El-Ghazawi, T.A., Carlson, W.W., Draper, J.M., UPC Language Specifications
V1.1.1, October 2003.
[46] Ghiya, R. and Hendren, L.J., Connection Analysis: A Practical Interprocedural
(sic) Heap Analysis for C, International Journal of Parallel Programming,
24(6), December 1996.
[47] Gottlieb, A., Lubachevsky, B.D., and Rudolph, L., Basic Techniques for the
Efficient Coordination of Very Large Numbers of Cooperating Sequential
Processors, ACM Transactions on Programming Languages and Systems, 5(2), April
1983.
[48] Gurd, J.R. and Snelling, D.F., Manchester Data-Flow: A Progress Report,
ACM Proceedings of the 6th International Conference on Supercomputing, pp.
216-225, 1992.
[49] Hennessy, J.L. and Patterson, D.A., Computer Architecture: A Quantitative
Approach, 2nd Edition, Morgan Kaufmann, 1996.
[50] Hoogerbrugge, J. and Augusteijn, L., Instruction Scheduling for TriMedia,
The Journal of Instruction-Level Parallelism, vol. 1, February 1999.
http://www.jilp.org/vol1
[51] Hsu, P.Y.T. and Davidson, E.S., Highly Concurrent Scalar Processing,
Proceedings of the 13th Annual International Symposium on Computer Architecture,
pp. 386-95, June 1986.
[52] Huang, J., and Lilja, D.J., An Efficient Strategy for Developing a Simulator
for a Novel Concurrent Multi-Threaded Processor Architecture, 1998.
[53] Hum, H.H.J., Maquelin, O., Theobald, K.B., Tian, X., Gao, G.R. and
Hendren, L.J., A Study of the EARTH-MANNA Multithreaded System, Intl. J.
of Parallel Programming, 24(4):319-347, August 1996.
[54] Hwu, W.W., Conte, T.M. and Chang, P.P., Comparing Software and Hardware
Schemes for Reducing the Cost of Branches, Proceedings of the 16th Annual
International Symposium on Computer Architecture, pp. 224-233, May 1989.
[55] Hwu, W.W., Mahlke, S.A., Chen, W.Y., Chang, P.P., Warter, N.J., Bringmann,
R.A., Ouellette, R.G., Hank, R.E., Kiyohara, T., Haab, G.E., Holm, J.G. and
Lavery, D.M., The Superblock (sic): An Effective Technique for VLIW and
Superscalar Compilation, The Journal of Supercomputing, Vol. 7, 1/2: pp.
229-248, 1993.
[56] Intel Pentium 4 Processor Product Brief,
http://www.intel.com/design/Pentium4/prodbref/.
[57] Jagannathan, R., Dataflow (sic) Models, E.Y. Zomaya, editor, Parallel and
Distributed Computing Handbook, McGraw-Hill, 1985.
[58] Jesshope, C.R. and Luo, B., Evaluation of Vector-Instruction Set
Micro-Threaded Pipelines, Institute of Information Sciences and Technology,
Massey University, New Zealand, private communications.
[59] Johnson, M., Superscalar Microprocessor Design, Prentice Hall, New Jersey,
1991.
[60] Julia, G., Sur l'iteration des Functions Rationelles, Journal de Math. Pure et
Appl. 8 (1918) 47-245.
[61] Kakulavarapu, K.P., Dynamic Load-Balancing Issues in the EARTH Runtime
System, Master's Thesis, McGill University, Montréal, April 2000.
[62] Kakulavarapu, P., Morrone, C.J., Theobald, K., Amaral, J.N. and Gao, G.R., A
Comparative Performance Study of Fine-Grain Multi-threading on Distributed
Memory Machines, 19th IEEE International Performance, Computing and
Communication Conference-IPCCC2000, Phoenix, Arizona, USA, Feb. 20-22,
2000.
[63] Kalamatiano, J. and Kaeli, D.R., Indirect Branch Prediction using Data
Compression Techniques, The Journal of Instruction-Level Parallelism, vol. 2,
December 1999. http://www.jilp.org/vol1
[64] Kogge, P.M., Sterling, T.L. and Gao, G., Processing in memory: Chips to
petaflops, In Workshop on Mixing Logic and DRAM: Chips that Compute and
Remember at ISCA '97. http://iram.cs.berkeley.edu/isca97-workshop/,
Denver, CO, June 1997.
[65] Lam, M.S. and Wilson, R.P., Limits of Control Flow on Parallelism,
Proceedings of the 19th Annual International Symposium on Computer Architecture,
ACM, May 1992, pp. 46-57.
[66] Lee, C., Potkonjak, M. and Mangione-Smith, W.H., Mediabench (sic): A Tool
for Evaluating and Synthesizing Multimedia and Communications Systems,
Proceedings of the 30th Annual International Symposium on Micro-architecture,
December 1998.
[67] Lo, J.L., Eggers, S.J., Emer, J.S., Levy, H.M., Stamm, R.L., Tullsen, D.M.,
Converting Thread-Level Parallelism to Instruction-Level Parallelism via
Simultaneous Multithreading, ACM Transactions on Computer Systems, Vol.
15, No. 3, August 1997, pp. 322-354.
[68] Luo, B., and Jesshope, C.R., Performance of a Micro-Threaded Pipeline,
ACSAC 2002, Melbourne, Australia, Vol. 6.
[69℄ MFarling, S., Combining Branh Preditors, Teh. Note TN-36, DEC WRL,
June 1993.
[70℄ M

Guiness, J.M., A DIMES Demonstration Appliation: Mandelbrot-Set Gen-
eration Using a Work-Stealing Algorithm., CAPSL Tehnial Note 11, Depart-
ment of Eletrial and Computer Engineering, University of Delaware, Newark,
Delaware, June 2003, ftp://ftp.apsl.udel.edu/pub/do/notes/.
[71℄ M

Guiness, J.M., Aleph One, http://aleph1.soureforge.net/.
[72℄ Mandelbrot, B.B., The Fratal Geometry of Nature., W.H.Freeman & Co.,
Sept., 1982.
[73℄ Márquez, A, Theobald, K.B., Tang, X. and Gao, G.R., A Superstrand (si)
Arhiteture, CAPSL Tehnial Memo 14, Deember 1997. http://www.apsl.
udel.edu
[74℄ Márquez, A, Theobald, K.B., Tang, X., Sterling, T. and Gao, G.R., A Su-
perstrand (si) Arhiteture and its Compilation, CAPSL Tehnial Memo 18,
Marh 1998. http://www.apsl.udel.edu
[75℄ Márquez, A., CARE Arhiteture, PhD dissertation, University of Delaware,
2004.http://www.apsl.udel.edu/publiations.shtml##5
[76℄ Moreira, J.E., On the Implementation and Eetiveness of Autosheduling for
Shared-Memory Multi-Proessors, PhD Thesis, Univerisity of Illinois, Urbana,
Illinois, USA, 1995.
[77℄ Morrone, C.J., Amaral, J.N., Tremblay, G. and Gao, G.R., A Multi-Threaded
Runtime System for a Multi-Proessor/Multi-Node Cluster., 15th Annual In-
ternational Symposium on High Performane Computing Systems and Applia-
tions, June 18-20, 2001, Windsor, ON, Canada.
[78℄ Moshovos, A. and Sohi, G.S., Memory Dependene Predition in Multimedia
Appliations, The Journal of Instrution-Level Parallelism, vol. 2, January 2000.
http://www.jilp.org/vol2
[79℄ Muhnik, S.S., Advaned Compiler Design and Implementation, Morgan
Kaufmann Publishers, San Franiso, 1997.
[80℄ Patterson, D.A. and Hennessy, L.J., Computer Arhiteture: A Quantitative
Approah., 2
nd
Edition, Morgan Kaufmann In., San Franiso, pp. 374, 1996.
[81℄ Perfet Developer, http://www.esherteh.om/index.html
[82℄ Peters, E.E., Fratal Market Analysis : Applying Chaos Theory to Investment
and Eonomis., John Wiley & Sons, 1994.
[83℄ Posti, M.A., Greene, D.A., Tyson, G.S. and Mudge, T.N., The limits of in-
strution level parallelism in SPEC95 appliations., The Third Workshop on the
Interation between Compilers and Computer Arhitetures (INTERACT), in
onjuntion with the Eighth International Conferene on Arhitetural Support
for Programming Languages and Operating Systems (ASPLOS-VIII), Otober
1998.
[84℄ Posti, M., Tyson, G. and Mudge, T., Performane Limits of Trae Cahes,
The Journal of Instrution-Level Parallelism, vol. 1, February 1999. http://
www.jilp.org/vol1
[85℄ Reghbati, H.K., An Overview of Data Compression Tehniques., Computer
14,4 (1981) 71-76.
[86℄ Reilly, J., SPEC Desribes SPEC95 Produts And Benhmarks, Intel
Corporation, September 1995. http://open.spebenh.org/osg/pu95/news/
pu95desr.html
[87℄ Norton Riley, H., The von Neumann Arhiteture of Computer Systems,
Computer Siene Department California State Polytehni University Pomona,
California, September 1987. http://www.supomona.edu/~hnriley/www/VonN.
html
[88] Rodenas, D., Martorell, X., Ayguade, E., Labarta, J., Almasi, G.,
Cascaval, C., Castanos, J. and Moreira, J., Optimizing NANOS OpenMP for the
IBM Cyclops Multithreaded Architecture., 19th IEEE International Parallel and
Distributed Processing Symposium, Vol. 1, pp. 110, 2005.
[89] Rotenberg, E. and Smith, J.E., Control Independence in Trace Processors,
The Journal of Instruction-Level Parallelism, vol. 2, January 2000.
http://www.jilp.org/vol2
[90] Sakane, H., Yakay, L., Karna, V., DIMES/P Hardware Technical Manual.,
CAPSL Technical Note 12, Department of Electrical and Computer Engineering,
University of Delaware, Newark, Delaware, June 2003,
ftp://ftp.capsl.udel.edu/pub/doc/notes/.
[91] Sakane, H., Yakay, L., Karna, V., Leung, C. and Gao, G.R., DIMES: An
Iterative Emulation Platform for Multiprocessor-System-on-Chip Designs., IEEE
International Conference on Field-Programmable Technology, December 15-17,
2003, Tokyo, Japan.
[92] Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Burger, D.,
Keckler, S.W., Moore, C.R., Exploiting ILP, TLP, and DLP with the Polymorphous
TRIPS Architecture, 30th Annual International Symposium on Computer
Architecture, 2003.
[93] Saulsbury, A., Pong, F., and Nowatzyk, A., Missing the Memory Wall: The
Case for Processor/Memory Integration., In Proceedings of the 23rd
International Symposium on Computer Architecture, pp. 90-101, May 1996.
[94] Savari, S. and Young, C., Comparing and Combining Profiles, The Journal of
Instruction-Level Parallelism, vol. 2, January 2000. http://www.jilp.org/vol2
[95] CAPSL Exhibit Booth 130, November 18th-20th,
http://www.capsl.udel.edu/dimes/sc2003_flyer.html, as part of Super Computing
2003, [96].
[96] Super Computing 2003, November 15th-21st, Phoenix, Arizona, U.S.A.
http://www.sc-conference.org/sc2003/
[97℄ Shnarr, E.C., Applying Programming Language Implementation Tehniques
to Proessor Simulation, PhD dissertation, University of Winsonsin, Madison,
2000.
[98℄ Sehrest, S., Lee, C.C. and Mudge, T., The Role of Adaptivity in Two-
level Adaptive Branh Predition, 28
th
International Symposium on Miro-
arhiteture, 1995.
[99℄ Sharp, J.A., Data Flow Computing, Ellis Horwood Limited, Chihester, Eng-
land, 1985.
[100℄ Skadron, K., Martonorst, M. and Clark, D.W., Speulative Updates of Loal
and Global Branh History, The Journal of Instrution-Level Parallelism, vol.
2, Deember 1999. http://www.jilp.org/vol2
[101℄ Smith, J.E., and Sohi, G.S., The Miroarhitetures of Supersalar Proes-
sors., In the Proeedings of the IEEE 1995.
[102℄ de Souza, A.F., and Roune, P., On the Eetiveness of the Sheduling Algo-
rithm of the Dynamially Trae Sheduled VLIW Arhiteture, 11
th
Symposium
on Computer Arhiteture and High Performane Computing, SBAC-PAD, 1999.
[103℄ Srinivasan, S.T. and Lebek, A.R., Load Lateny Tolerane in Dynamially
Sheduled Proessors, The Journal of Instrution-Level Parallelism, vol. 1,
February 1999. http://www.jilp.org/vol1
[104℄ Sterling, T., Beker, D.J., Savarese, D., Berry, M., and Res, C., Ahieving a
Balaned Low-Cost Arhiteture for Mass Storage Management through Multi-
ple Fast Ethernet Channels on the Beowulf Parallel Workstation, Proeedings of
the International Parallel Proessing Symposium, 1996. http://esdis.gsf.
nasa.gov/beowulf/papers/papers.html
[105℄ Sterling, T.L., "An Introdution to the Gilgamesh PIM Arhiteture." Pro.
European Conferene on Parallel Proessing, Manhester, UK, pp. 16-32, August
2001.
[106] Sterling, T.L., Zima, H.P., Gilgamesh: A Multithreaded
Processor-In-Memory Architecture for Petaflops Computing., Proc. SC2002,
Baltimore, November 2002.
[107] Steven, G., Exploiting Instruction-Level Parallelism in High Performance
Processors, Department of Computer Science, University of Hertfordshire,
Unpublished, 2001.
[108] Stoutchinin, A., Amaral, J.N., Gao, G.R., Dehnert, J. and Jain, S.,
Automatic Pre-fetching of Induction Pointers for Software Pipelining, CAPSL
Technical Memo 37, November 1999. http://www.capsl.udel.edu
[109] Tang, X., Compiling for Multithreaded (sic) Architectures, PhD
dissertation, University of Delaware, Autumn 1999.
[110] Tarlescu, M.D., Theobald, K.B. and Gao, G.R., Elastic History Buffer: A
Low-Cost Method to Improve Branch Prediction Accuracy, IEEE Conference
on Computer Design, October 1997.
[111] Tate, D., Superscalar Architectures and Statically Scheduled Programs,
PhD dissertation, University of Hertfordshire, Hatfield, Hertfordshire, U.K.,
2000.
[112] Theobald, K.B., Gao, G.R. and Hendren, L.J., Speculative Execution and
Branch Prediction on Parallel Machines, ICS-7/93, ACM, 1993.
[113] Tullsen, D.M., Eggers, S.J. and Levy, H.M., Simultaneous Multithreading
(sic): Maximising (sic) On-chip Parallelism, Proceedings of the 22nd Annual
International Symposium on Computer Architecture, pp 392-402, June 1995.
[114] Turcotte, D.L., Fractals and Chaos in Geology and Geophysics., Cambridge
University Press, 1992, pp. 35-50.
[115] Unger, A., Ungerer, Th. and Zehender, E., Simultaneous Speculation
Scheduling, 11th Symposium on Computer Architecture and High Performance
Computing, SBAC-PAD '99, 1999.
[116] Waingold, E., Taylor, M., Srikrishna, D., Sarkar, D., Lee, W., Lee, V.,
Kim, J., Frank, M., Finch, P., Barua, R., Babb, J., Amarasinghe, S. and
Agarwal, A., Baring It All to Software: RAW Machines., Computer September 1997
(Vol. 30, No. 9) pp 86-93.
[117] Wall, D.W., Limits of Instruction-Level Parallelism, Proceedings of the
4th International Conference on Architectural Support for Programming Languages
and Operating System, SIGPLAN Notices, vol. 26, no. 4, ACM Press, New York,
NY, pp. 176-189, 1991.
[118] Wang, Z., Pierce, K. and McFarling, S., BMAT - A Binary Matching Tool
for Stale Profile Propagation, The Journal of Instruction-Level Parallelism,
vol. 2, January 2000. http://www.jilp.org/vol2
[119] Wegner, T., Osuch, J., Martin, G., Bussell, B., Fractint,
http://www.fractint.org/
[120] Woo, S.C., Ohara, M., Torrie, E., Singh, J.P. and Gupta, A., The
SPLASH-2 programs: characterization and methodological considerations., In
Proceedings of the 22nd Annual International Symposium on Computer
Architecture, ACM/IEEE, Portofino, Italy, pp. 24-36, 1995.
[121] Wulf, W. and McKee, S., Hitting the memory wall: Implications of the
obvious., Computer Architecture News, 23(1), pp. 20-24, 1995.
[122] Yeh, T.-Y., and Patt, Y.N., Alternative Implementations of Two-Level
Adaptive Branch Prediction, Proceedings of the 19th Annual International
Symposium on Computer Architecture, pp. 124-34, May 1992.
[123] Yeh, T.-Y., and Patt, Y.N., A Comparison of Dynamic Branch Predictors
that use Two Levels of Branch History, Proceedings of the 20th Annual
International Symposium on Computer Architecture, pp. 257-66, May 1993.
[124] Zhang, Y., Zhu, W., Chen, F., Hu, Z. and Gao, G.R., Sequential
Consistency Revisited: The Sufficient Conditions and Method to Reason
Consistency Model of a Multiprocessor-on-a chip Architecture., The IASTED
International Conference on Parallel and Distributed Computing and Networks
(PDCN2005), February 15 - 17, 2005, Innsbruck, Austria.
Appendix A
Implementing Applications on a
Cellular Architecture - the
Mandelbrot-set.
This was a draft paper prepared at CAPSL, at the University of Delaware in
collaboration with Dr. Egan, for submission to various conferences, before the
author left Delaware.
Authors: Jason McGuiness (1,2), Colin Egan (2), Guang Gao (1).
(1) University of Delaware, Newark, DE.
(2) University of Hertfordshire, Hatfield, Hertfordshire, U.K. AL10 9AB.
mcguines@capsl.udel.edu
c.egan@herts.ac.uk
ggao@ee.udel.edu
A.1 Abstrat.
There is an ever widening gap between CPU speed and memory speed, resulting in
a 'memory wall' where the time for memory aesses dominate performane. Cellu-
lar arhitetures, suh as the Cylops family, have been developed to overome this
'memory wall' by implementing proessors-in-memory (PIM) on the same hip. PIM
57
A.2. Introdution. 58
arhitetures ahieve high performane by inreasing the bandwidth of proessor-
memory ommuniation and reduing lateny. In this paper we introdue DIMES
(the Delaware Iterative Multiproessor Emulation System) whih is being developed
by CAPSL at the University of Delaware, as a hardware validation tool for ellular
arhitetures. The version of DIMES used in this paper is a simplied hardware
implementation of the Cylops-64 ellular arhiteture developed at the IBM T. J.
Watson Researh Center. Sine DIMES is a hardware validation tool, its hardware
implementation is onstrained to a dual proessor where eah proessor has four
thread units. DIMES memory is restrited to 16K of loal srath-pad memory
per proessor and 64K global shared memory. Additionally DIMES is linked to a
host omputer for I/O. We have hosen to use a Mandelbrot-set generator (written
in C++) with a work-stealing algorithm as our metri to evaluate the program-
ming model on DIMES. The Mandelbrot-set generator has been threaded, and the
work-stealing algorithm ahieves load balaning between the DIMES' threads. The
Mandelbrot example demonstrates the eetive use of DIMES' threads, the eetive
use of DIMES srath-pad memory and the eetive use DIMES global memory in
its CRTS environment. The results of the study are highly promising and show that
DIMES is an ideal hardware tool for validating future Cylops enhanements.
A.2 Introdution.
High performane proessors, in partiular super-salars, exploit instrution level
parallelism (ILP) by overlapping instrution exeution (pipelining) and using multi-
ple instrution issue (MII) per lok yle [101℄. Although, this approah improves
proessor performane, it does not improve performane of the memory subsystem.
Researhers improve CPU speed by inreasing the number of instrutions issued in
eah lok yle or by inreasing the depth of the pipeline, whih an ause a bot-
tlenek in the memory-subsystem. This is termed as the memory-wall and impats
on overall system performane [121℄.
One approah to overome the memory-wall is to improve data throughput and
data storage between the memory subsystem and the CPU by introduing extra
A.2. Introdution. 59
levels in the memory hierarhy [15, 121℄. However, introduing extra levels in the
memory hierarhy inreases the penalty assoiated with a miss in the memory-
subsystem, whih limits the amount of ILP and impats on proessor performane.
Also, there is an inrease in design omplexity and an inrease in power onsumption
of the overall system. Furthermore, inreasing the number of levels in the memory
hierarhy does not improve memory aess times.
An alternative approah to overome the memory-wall is to improve both data-
proessing and data-aess time by the integration of proessing logi in memory
[21, 41, 105, 106, 116℄. The idea of integrating proessors-in-memory (PIMs) is to
simplify the memory hierarhy design, to ahieve higher bandwidth and to redue
lateny. There are several PIM arhitetures being developed, for example, the
Cylops family of PIM arhitetures by IBM [21℄, the Gilgamesh PIM arhiteture
by NASA [105, 106℄, the polymorphous TRIPS arhiteture at Austin, Texas [92℄
and the Shamrok PIM arhiteture at Notre Dame, Frane [64℄.
A problem with integrating a proessor and memory on in the same silion spae
is that the proessor speed is redued in omparison with a high performane pro-
essor and the amount of memory is also redued [21℄. To overome the redution in
proessing power and the redution in the amount of available memory and therefore
lateny, multiple PIM hips are onneted together forming a network of ells, where
a single PIM hip is onsidered to be a ell and the whole arhiteture is desribed
as ellular.
To overome the data aess problem, eah ell is threaded suh that eah thread
unit is independent from all other thread units. In this multi-threaded organisation,
every thread unit serves as an independent single-issue in-order proessor, whih
shares omputationally expensive hardware resoures suh as oating-point units
and ahes.
In this paper we introdue DIMES (the Delaware Iterative Multiproessor Em-
ulation System) whih is being developed by CAPSL at the University of Delaware
[29,90℄. DIMES is a hardware validation tool for ellular arhitetures, in partiular
the Cylops family [21℄. DIMES plaes the Cylops arhitetural design into a sin-
gle FPGA. The idea behind DIMES is to emulate Cylops yle by yle, to be far
A.3. Programming Models on Cellular Arhitetures. 60
faster than software based simulations, and to diret future Cylops enhanements.
A.3 Programming Models on Cellular Architectures.
Cellular architectures require different programming models to the
general-purpose code executed by super-scalar processors [40, 41, 120]. Gao
proposes the use of a combination of execution models and memory models,
because of the cellular architecture's multiple execution units within each
cell.
Gao's programming model evaluates multiple threads in each cell due to the large
number of execution units within Cyclops. For example, one programming model
uses thread percolation as a technique to perform dynamic load-balancing
[18,53,62].
Additionally, in cellular architectures, multiple threads perform memory
accesses independently. As a result of this, the memory subsystem requires some
form of access model that allows these memory references to be effectively
served. For example, the use of the location-consistency model was suggested as
a memory access model by [40].
A.4 Conlusion/Disussion.
The threaded algorithm shows that the Mandelbrot set is an ideal mehanism for
evaluating ellular arhitetures and programming models on the DIMES hardware.
Currently DIMES is targeted towards CylopsE, however DIMES ould be expanded
to the full IBM Cylops family and other ellular arhitetures, suh as those at
Gilgamesh at NASA and Shamrok at Notre Dame.
Future enhanements to DIMES may inorporate more hardware to allow benh-
marks, suh as Tabletoy and others. This will also allow us to evaluate further
enhanements to the ellular programming model.
Appendix B
Implementing Applications on a
Cellular Architecture - the
Mandelbrot-set.
This was a presentation given to the University of Hertfordshire, upon the
author's return from CAPSL, introducing DIMES and the work that was done.
Authors: Jason McGuiness (1,2), Colin Egan (2), Guang Gao (1).
(1) University of Delaware, Newark, DE.
(2) University of Hertfordshire, Hatfield, Hertfordshire, U.K. AL10 9AB.
mcguines@capsl.udel.edu
c.egan@herts.ac.uk
ggao@ee.udel.edu
B.1 Overview:
Recap from last week:
• The memory wall and cellular architectures: a solution?
• Programming models on Cellular Architectures, followed by a brief overview
of Cyclops and DIMES/P-2.
New this week:
• An introduction to the Mandelbrot set.
• Threading and Work-Stealing applied to the Mandelbrot set.
• The programming implementation with regard to DIMES/P-2 and the execution
details of the Mandelbrot-set application. PLUS A LIVE DEMONSTRATION OF THE
PROGRAM!!!
• Latest work: Global Updates Per Second (GUPS) benchmarks.
• Conclusions & Future Work.
B.2 A reap on the memory wall. Part I:
The proessor viewpoint.
Wall
Processor Memory
• Higher performane may be ahieved through ILP by MII and/or pipelining.
Various tehniques are used to implement these goals, e.g. Register Renaming,
Out-of-order instrution issue/exeution, Branh Predition, dynami instru-
tion sheduling, Value Predition, Instrution Reuse, et.
• But this auses a bottle-nek - upon a miss the reovery ost beomes in-
reasingly high, beause the memory annot keep up with the required feth
rate.
• This leads to attempts to improve the performane of the memory.
B.3. A reap on the memory wall. Part II:
The memory viewpoint. 63
B.3 A reap on the memory wall. Part II:
The memory viewpoint.
o Wall
L2L1
Processor
Main Memory
• Inreasing the levels of memory in the hierarhy, by plaing levels of ahes
between the main memory and the CPU (or on the CPU).
• This redues the memory wall, but on a ahe miss the penalty is more severe.
(Also this does not redue the memory sub-system lateny for an initial aess,
only upon subsequent aess.)
• In both ases:
 The hardware omplexity and ost is inreased.
 The rewards obtained are balaned against known disadvantages.
B.4 The memory wall and cellular architectures: a
solution?
• Why not place the processor in the memory, e.g. PIM architectures? Does
this remove the memory wall?
• In principle, due to the proximity of the execution units to the memory
cells, the latency should be reduced and the bandwidth increased.
• But due to the mixture of logic units on the silicon die, the gate density is
reduced.
• To maintain gate density, simpler execution cores are used, such as RISC
pipelines which may also omit branch prediction, for example.
• Thus the memory density and execution unit throughput are reduced. How
may this be countered?
  - With the addition of a network interface to interconnect between the PIM
  chips. Thus each chip becomes a cell.
  - Thus reduced individual performance may be countered by interconnecting
  many of these cells together to build up a cellular architecture, e.g.
  Cyclops developed at IBM, Gilgamesh at NASA and Shamrock at Notre Dame.
B.5 Programming models on Cellular Architectures.
Cellular architectures have particular features that mean that their programming
model is different to super-scalar processors:
• They have large numbers (millions) of execution units (or, in cellular
architectures, thread units) which are simple.
• Memory access is irregular: some memory is very close, thus fast; the rest is
off-chip, so much slower.
Research into appropriate programming models is on-going. The current model is
pthread, but future directions include:
• Thread percolation as a technique to perform dynamic load-balancing.
• In cellular architectures, multiple threads perform memory accesses
independently. As a result of this, the memory subsystem could have some form of
access model that allows these memory references to be effectively served. For
example, the location-consistency model could be used as a memory access model.
B.6 A brief overview of Cyclops and DIMES/P-2.
At the University of Delaware the first hardware simulation of a cellular
architecture has been built under Hiro Sakane's group:
• This is called DIMES/P-2.
• It is a simplified implementation of the 32-bit CyclopsE design, one of the
family of Cyclops architectures developed at the IBM T.J. Watson Research
Center.
[Figure: block diagram of DIMES/P-2. Processor 0 and Processor 1 are joined by
a network. Each processor has 64K of global memory and four thread units
(Thread Unit 0 to Thread Unit 3), each with a 4K scratch pad.]
B.7 An introdution to the Mandelbrot set.
The Mandelbrot set is a fratal named after Professor B.B. Mandelbrot, who dis-
overed the set in the 1960s. It is intimately related to the Julia set, also a fratal,
disovered in the 1910s.
Both the Mandelbrot and Julia sets may be reated by iteration of a very simple
equation:
z_{n+1} = z_n^2 + c    (B.7.1)
In this equation, z_n is a complex number, where z_0 = 0. c is also a complex
number, which is initialized to a value held constant throughout the
iterations. The iteration of the equation terminates when:
1. Either n reaches the so-called maximum iteration value, m, a fixed constant,
greater than zero.
2. Or |z_n| exceeds the so-called bailout value, a fixed constant, usually 2.
For efficiency reasons this is normally tested as |z_n|^2 against the real
value 4, which avoids computing a square root.
B.8 The classic algorithm used to generate the Mandelbrot
set:
1. Set the value of m, the maximum iterations, greater than zero.
2. Select a point from the complex plane, and set c to that value.
3. Initialize n = 0, z_0 = 0.
4. Execute equation B.7.1.
5. Increment n.
6. If |z_n| ≥ 2 then that c is not in the set of points which comprise the
Mandelbrot set. Go to 2.
7. If n > m then that c is in the Mandelbrot set, i.e. c ∈ M. Go to 2.
8. Go to 4.
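Steps 3 to 8 of the classic algorithm, for a single point c, may be sketched in
C++ as follows. This is an illustration only, not the code that was run on
DIMES: the function name in_mandelbrot_set and its signature are assumptions
made for this sketch, and the bailout test is written as |z_n|^2 ≥ 4 rather
than |z_n| ≥ 2.

```cpp
#include <cassert>
#include <complex>

// Classifies one point c of the complex plane (steps 3 to 8 of the classic
// algorithm). Returns true when c is taken to be in the Mandelbrot set.
// The name and signature are illustrative assumptions, not from the source.
bool in_mandelbrot_set(std::complex<double> c, unsigned max_iterations) {
    assert(max_iterations > 0);             // step 1: m is greater than zero
    std::complex<double> z(0.0, 0.0);       // step 3: z_0 = 0
    unsigned n = 0;                         // step 3: n = 0
    while (true) {
        z = z * z + c;                      // step 4: equation B.7.1
        ++n;                                // step 5
        if (std::norm(z) >= 4.0) {          // step 6: |z_n| >= 2, squared
            return false;
        }
        if (n > max_iterations) {           // step 7
            return true;
        }
    }                                       // step 8: go to 4
}
```

Step 2, the selection of c, is left to the caller, for example a loop over the
pixels of the image.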
B.9 Threading applied to the Mandelbrot set.
An overview of threading the Mandelbrot-set generation algorithm:
• An important property of the algorithm to generate the Mandelbrot set is that
the classification of each c in the complex plane is independent of the
classification of any other c. Thus the Mandelbrot set may be implemented as a
massively parallel application, potentially suited to cellular architectures.
Indeed the Mandelbrot set has been used as a benchmark for different
architectures, such as fine-grain threaded-architectures and NUMA architectures.
The complex plane is divided into a series of horizontal strips. These strips
may be calculated or rendered independently of each other, using separate
render threads, as the classification of the points c within each strip is
independent of such classification on other render threads. Therefore each
render thread implements a slightly modified version of the classic algorithm,
given next.
B.10 The Render-Thread Algorithm.
1. Set the value of m, the maximum iterations, greater than zero. Set the
estimated completion-time, t, to ∞.
2. Set c = x, where x is the top-left of the strip to be rendered.
3. Initialise n = 0, z_0 = 0.
   (a) Execute equation B.7.1.
   (b) Increment n.
   (c) If |z_n| ≥ 2 then that c is not in the set of points which comprise the
   Mandelbrot set. Go to 4.
   (d) If n > m then that c is in the Mandelbrot set, i.e. c ∈ M. Go to 4.
   (e) Go to 3a.
4. Increment the real part of c. If the real part of c is less than the width
of the strip to be rendered, go to 3.
5. Update t to the average of t and the time it took to render that line.
6. Set the real part of c to the left-hand of the strip. Increment the
imaginary part of c. If the imaginary part of c is less than the height of the
strip, go to 3.
7. Signal work completed, set t = 0 (thus this thread is guaranteed not to be
selected by the work-stealing algorithm).
8. Suspend.
A load-balancing algorithm was added to move uncompleted work to threads that
have completed their assigned work. This is because each strip will take a
different amount of time to render.
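The line-by-line structure of the render-thread algorithm may be sketched in
C++ as below. This is a single-threaded sketch under stated assumptions: the
View structure, the mapping from pixels to the complex plane and the names
classify and render_strip are all hypothetical, and the timing estimate t and
the signalling of the work-stealing thread are only marked by a comment.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical description of the imaged region of the complex plane.
struct View {
    double left;        // real part of the left-most pixel
    double top;         // imaginary part of the top line
    double step;        // distance between adjacent pixels
    std::size_t width;  // pixels per line
};

// Classifies one point c with at most m + 1 iterations of equation B.7.1.
bool classify(std::complex<double> c, unsigned m) {
    std::complex<double> z(0.0, 0.0);
    for (unsigned n = 0; n <= m; ++n) {
        z = z * z + c;
        if (std::norm(z) >= 4.0) return false;  // |z_n| >= 2, squared
    }
    return true;
}

// Renders lines [first_line, last_line) and returns one flag per pixel in
// row-major order. Each strip depends on no other strip.
std::vector<bool> render_strip(const View& v, std::size_t first_line,
                               std::size_t last_line, unsigned m) {
    assert(m > 0 && first_line <= last_line);
    std::vector<bool> in_set;
    in_set.reserve((last_line - first_line) * v.width);
    for (std::size_t y = first_line; y < last_line; ++y) {   // next line
        for (std::size_t x = 0; x < v.width; ++x) {          // next point
            const std::complex<double> c(v.left + x * v.step,
                                         v.top - y * v.step);
            in_set.push_back(classify(c, m));
        }
        // A real render thread would update its estimate t here and check
        // whether the work-stealing thread has asked it to stop.
    }
    return in_set;
}
```

Because render_strip touches only its own lines, two calls on adjacent strips
produce the same flags as one call on the combined strip, which is the
property that allows the strips to be handed to separate render threads.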
B.11 The Work-Stealing Algorithm.
1. Monitor render threads for a work-completed signal. The thread that
completes we shall denote as T_c.
2. Find the render thread with the longest estimated completion-time, t; note
that each render thread updates this time upon completion of a line. Call this
thread T_l.
3. Stop T_l when it completes the current line it is rendering.
4. Split the remaining work to be done in the strip equally between the two
render threads T_c and T_l.
5. Restart the render threads T_c and T_l.
6. Go to 1.
This is a dynamic-programming solution to the load-balancing problem of work
distribution between the render threads. Due to the selection of the slowest
render thread, this algorithm may be seen to be optimal. The author believes
that this is an original application of work-stealing to Mandelbrot-set
generation.
B.12. A Disussion of the Work-Stealing Algorithm. 69
B.12 A Disussion of the Work-Stealing Algorithm.
• The bandwidth of the single thread that implements that algorithm is the
limiting fator in it's ability to sale.
• It is possible to sale this work-stealing algorithm, if one observes that the
work-stealing algorithm operates upon a slie of the omplex plane. This lue
demonstrates that the work-stealing algorithm is reursive.
• Conversely this algorithm is able to tolerate failures in render threads. If a
render thread stops responding, eventually it will be the slowest, unnished
render thread, and it's work will be stolen.If robustness is not required, then
the image generated may be viewed as an array values. Eah of these values
is the lassiation of c. Thus if one has p0...q threads, eah pn thread initially
lassies a point in the array oset by n, and one ompleted, moves along the
array using a stride of q.
• This allows the use of a number of threads that is bounded by the number of
points within the image. As this may be for an image of resolution 100×100,
thus 10,000 points, this maps well on to ellular arhitetures.
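The strided traversal described above may be sketched in C++ as follows. The
function name points_for_thread is an assumption made for this sketch; it only
lists the array indices that thread n of q would classify, and leaves the
classification itself to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices of the image array visited by thread n of q: n, n + q, n + 2q, ...
// Every index below total_points is visited by exactly one thread.
std::vector<std::size_t> points_for_thread(std::size_t n, std::size_t q,
                                           std::size_t total_points) {
    assert(q > 0 && n < q);
    std::vector<std::size_t> indices;
    for (std::size_t i = n; i < total_points; i += q) {
        indices.push_back(i);
    }
    return indices;
}
```

No thread needs to communicate with any other in this scheme, at the cost of
losing the fault tolerance that the work-stealing thread provides.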
B.13 The static layout of the render and work-stealing
threads within the DIMES/P-2 system is shown
below:
[Figure: static thread layout on DIMES/P-2. Processor 0 (Thread Units 0 to 3)
hosts the CRTS debug thread, main and two render threads. Processor 1 (Thread
Units 0 to 3) hosts three render threads and the work-steal thread. The 128K
global memory holds the 100x100 pixel image.]
B.14 Exeution Details of the Mandelbrot-set ap-
pliation.
B.15 Superomputing Benhmarks: Global Updates
Per Seond (GUPS).
The GUPS benhmark is a very simple program that is eetively a ross-setion
bandwidth benhmark. It makes a large number of random updates to a large array:
B.16. GUPS and DIMES. 71
Figure B.1: The image generated
shortly after program start-up.
T2
T0
T1
{
{{
Figure B.2: Image generation has pro-
gressed, shortly before a work-stealing
event.
T2
T0
T1
{{
{
Figure B.3: Just after the rst work-
stealing operation.
T
T
T2
1
0 {{
{
Figure B.4: The seond work-stealing
operation.
T1
T0
T2 {
{
{
• for (i = 0; i < 30000000; ++i) table[random_integer] += random_value;
• This is complicated because for Cyclops we wish to perform this operation on
the table across multiple processors.
• The limiting factor for the program is the memory access time due to the
random reads & writes; this confounds architectural features that may attempt
to improve memory performance.
• But we are allowed to have 0.1% errors in the table at the end of the
benchmark.
This error rate is vital as it allows us to relax the locking used to access
the global table. This relaxation means that updates to the table may not be
done in sequential program order, thus introducing errors.
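A sequential form of the update loop may be sketched in C++ as below. This is
not the benchmark's reference code: the function name gups_updates, the use of
std::mt19937_64 as the random source and the returned checksum are assumptions
made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Applies `updates` random increments to random entries of `table` and
// returns the sum of all values added, which serves as a checksum.
std::uint64_t gups_updates(std::vector<std::uint64_t>& table,
                           std::size_t updates, std::uint64_t seed) {
    assert(!table.empty());
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, table.size() - 1);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < updates; ++i) {
        const std::uint64_t value = rng() & 0xFF;  // small random value
        table[pick(rng)] += value;                 // the random update
        total += value;
    }
    return total;
}
```

In this sequential form the entries of the table always sum to the returned
checksum. One possible way to estimate the error rate of an unlocked
multi-threaded version would be to compare the two sums after the run.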
B.16 GUPS and DIMES.
Currently there have been three simple, initial implementations of this program,
all run on DIMES/P-2:
1. A sequential implementation, with no threading.
2. Multi-threaded implementations:
(a) Full locking on the table access, thus giving a zero error rate.
(b) No locking at all on the table access, thus sacrificing accuracy for speed.
The error rate is as yet unmeasured, but this appears to run 10 times
faster than the fully locked version above.
The justification for implementing GUPS with no locking is a statistical one. As
the amount of memory on Cyclops is 1Gb/chip, with only 320 thread units/chip,
then the likelihood of any two thread units accessing any one memory location at
the same time is very low, much lower than 0.001 (our permissible error rate).
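The statistical argument can be checked with a short calculation. The sketch
below treats it as a birthday problem, assuming that every thread unit issues
one access at the same instant to an independently and uniformly chosen
location. Reading "1Gb" as 2^30 bytes of 8-byte words, giving 2^27 locations,
is a further assumption that the source does not confirm; the function name
collision_probability is also introduced here.

```cpp
#include <cassert>
#include <cstdint>

// Probability that at least two of `threads` simultaneous, independent,
// uniformly random accesses fall on the same one of `locations`.
double collision_probability(std::uint64_t threads, std::uint64_t locations) {
    assert(locations > 0);
    double all_distinct = 1.0;
    for (std::uint64_t i = 1; i < threads; ++i) {
        if (i >= locations) {
            return 1.0;  // more threads than locations: a clash is certain
        }
        all_distinct *= 1.0 - static_cast<double>(i) /
                                  static_cast<double>(locations);
    }
    return 1.0 - all_distinct;
}
```

Under these assumptions collision_probability(320, 1ULL << 27) should come out
at roughly 0.0004, below the permitted 0.001. Note that this is the chance of
any clash among one round of 320 simultaneous accesses, not the fraction of
table entries left in error, which would have to be measured.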
B.17 Limitations of current GUPS & DIMES.
• The pthread programming model is too simplistic:
  - It does not reflect the memory hierarchy. More sophisticated memory
  models (such as location consistency) will be needed to aid the programmer
  in effectively laying out the global table to make the memory accesses
  faster.
  - It does not directly support data or thread percolation.
• The DIMES/P-2 hardware has too few resources (only 8 thread units and
128Kb RAM) to be a realistic platform upon which to run these more
sophisticated benchmarks.
B.18 Conlusion & Future Work.
• The Mandelbrot set is an ideal program to demonstrate and test massively
parallel arhitetures, suh as ellular arhitetures.
B.18. Conlusion & Future Work. 73
• The urrent run-time system and pthread programming model, although sim-
ple, is suiently powerful only for a ertain sub-set of sophistiated applia-
tions.
• Future development of the Cylops arhiteture towards Cylops-64, with the
development of DIMES/P-8 with more hardware resoures (at least 32 thread
units and 512Kb RAM) will allow the development and testing of more sophis-
tiated programs, suh as superomputing benhmarks. These benhmarks
and the greater hardware resoures will allow further experimentation with the
sophistiated programming models that have been suggested, suh as thread
perolation and loation onsisteny.
Appendix C
The Challenges of Efficient
Code-Generation for Massively
Parallel Architectures.
This is a copy of a conference paper submitted and accepted for the 11th
Asia-Pacific Computer Systems Architecture Conference.
Jason M McGuiness (1), Colin Egan (1), Bruce Christianson (1) and Guang Gao (2).
(1) Department of Compiler Technology and Computer Architecture, University of
Hertfordshire, Hatfield, Hertfordshire, U.K. AL10 9AB. c.egan@herts.ac.uk
(2) CAPSL, University of Delaware, Delaware, U.S.A. g.gao@capsl.udel.edu
C.1 Abstrat
Overoming the memory wall [121℄ may be ahieved by inreasing the bandwidth
and reduing the lateny of the proessor to memory onnetion, for example by
implementing Cellular arhitetures, suh as the IBM Cylops. Suh massively
parallel arhitetures have sophistiated memory models. In this paper we used
DIMES (the Delaware Iterative Multiproessor Emulation System), developed by
CAPSL at the University of Delaware, as a hardware evaluation tool for ellular
arhitetures. The authors ontend that there is an open question regarding the
potential, ideal approah to parallelism from the programmer's perspetive. For
example, should parallelism be expressed at the language level, such as UPC or
HPF, within the compiler using trace-scheduling, or at the library level, for
example OpenMP or POSIX-threads? To investigate this, we have chosen to use a
threaded Mandelbrot-set generator with a work-stealing algorithm to evaluate the
DIMES thread programming model for writing a simple multi-threaded program.
C.2 Introdution
Integrating the proessing logi and memory [21℄, termed PIM, is an approah
to overome the memory wall [121℄. PIM arhitetures may improve both data-
proessing and data-aess times, but the ombined proessor speed and the amount
of memory may be redued [21℄. This may be overome by onneting multiple, inde-
pendent PIM ells, giving a ellular arhiteture. In this organisation, every thread
unit is an independent single-issue, in-order proessor, thus able to potentially a-
ess memory independently. Moreover, the dierent memory hierarhies may have
dierent aess timings and onsisteny models suh as loation onsisteny [40℄.
This gives rise to a number of ode-generation problems, entred around the fat
that to provide omputational power, these systems are not only massively parallel,
but have omplex memory hierarhies.
Researh also proeeded towards thread-generating ompilers, for example, HPF
and UPC [45℄, IBM XL Fortran and Visual Age C/C++, largely based upon OpenMP,
all of whih have their ompromises. Some of these also have support for the various
memory models.
Unfortunately, general-purpose languages have been slow to adopt a sophisticated
abstraction of the machine model, so library-based approaches have developed,
for example, the various implementations of OpenMP. But the authors contend that
library-based solutions to threading depend too heavily upon the programmer to
be used effectively. For example, the explicit use of locks in programs is prone
to error: deadlocks and race-conditions that are hard to track down are easily
introduced, even on systems with only a few processors. The development of
suitable tools to debug multi-threaded applications has also been slow.
Debuggers are in development, for
example for Cylops [42℄, but there have been too few, with limited funtionality.
As identifying parallelism both orretly and eiently is very hard for the pro-
grammer to do, the authors ontend that they should not do it. The ompiler,
equipped via these libraries with a detailed mahine-model, ould be able to use
the programmer-identied parallelize-able variables and funtions, to generate more
eient ode. The authors identied little work investigating the software aspet
of the ode-generation problem for massively-parallel arhitetures. Unfortunately,
if this ase would ontinue, this shortoming ould adversely aet the popularity
of suh systems and maintain the pereption that massively parallel arhitetures
are too speialised and thus too expensive to be of more general use. Given the
popularity of introduing multi-ore proessors, this position is set to beome even
more untenable.
C.3 Related Work
C.3.1 The Programming Models: from Compiler to Libraries
Such compute bandwidth and parallelism raise a number of problems for the
programmer, primarily focused on memory reads and writes. Super-scalar chips
have had mechanisms to hide these problems from the programmer, but the cellular
architectures of such chips as picoChip [33] and IBM BlueGene/C [4] do not. Thus
the programmer needs to know how memory reads and writes interact with:
• the software-ontrolled data-ahe attahed to that pipeline,
• the software-ontrolled data-ahe of other on-hip pipelines,
• any global on-hip memory,
• the software ontrolled data-ahes of other o-hip pipelines,
• the global on-hip memory that is on any other hips,
• any global memory that is not on any hip
• and finally, given the massive parallelism available, how to make efficient
use of it.
The programmer must therefore either understand the memory access models, or
have a library or compiler that hides their details from the applications
programmer. In the remainder of the paper the authors will focus on the IBM
BlueGene/C architecture, and a prototype implementation of it called Cyclops
[21, 30], that was implemented at CAPSL at the University of Delaware in
collaboration with the University of Hertfordshire. The Cyclops architecture was
prototyped in hardware, called DIMES/P [91], which was used as the platform for
executing the programming example described later in this paper. In the
following sections the memory access models will be discussed, leading on to a
presentation of the authors' experience in developing a program for such an
architecture. The experience gained from this will allow the authors to discuss
the major problems that were faced, how, if at all, they were overcome, and the
outstanding problem domains that, in the authors' experience, would hinder the
acceptance of multi-core chips and, moreover, such massively parallel designs as
IBM BlueGene/C.
C.3.2 Programming Models on Cellular Architectures
The hardware differences between cellular and super-scalar architectures
indicate that different programming models, to those used for super-scalar
architectures, are required to make effective use of the cellular architectures
[40, 42]. In the first two of those three papers, their authors propose the use
of a combination of execution models and memory models, as already noted in this
paper.
The primary onerns when programming DIMES/P, and thus any Cylops-
based arhiteture, were:
• How to manage the potentially large numbers of threads.
• How to easily express any parallelism within the input soure-ode.
• How to make orret, and most eetive use, of the memory onsisteny mod-
els.
C.4. Programming for Cylops - threads 78
Some researh has already been done regarding programming models for the thread-
ing, suh as using thread perolation as a tehnique to perform dynami load-
balaning [62℄. Another piee of researh [22℄ investigated using multi-level sheduling-
shemes: a work-stealing algorithm at the higher-level and a multi-threading teh-
nique at the lower-level to hide ommuniation latenies. Alternatively there is
researh [88℄ into how to implement OpenMP eiently on ellular arhitetures
suh as IBM BlueGene/C.
C.4 Programming for Cylops - threads
This setion will very briey desribe the thread programming model, whih is an
early version of TNT [31,42℄, then how it was used to implement the programming
example, followed by a disussion of the implementation.
The implementation of the memory consistency models was relatively simple:
earlier, unpublished, work on the GCC-based compiler had implemented a simple
algorithm in which all static variables were stored in on-chip memory, and the
function call stack, including all automatic variables, was placed in the
scratch-pad memory.
As there was no language-level support for thread management, a library had
to be implemented to support the thread management instructions in the Cyclops
ISA, which was used as the basis for creating a higher-level C++ abstraction.
This was because the thread implementation, that closely followed a
POSIX-Threads API, was considered far too primitive by the authors to be
effectively used for programming Cyclops. This C++ API also included
critical-section, mutex and event objects to allow for easier management of the
lower-level objects.
To test these ideas, and the Cyclops architecture, a small, simple and
embarrassingly parallel program to generate Mandelbrot sets [72] was created.
The following sections give a brief overview of how this program may be
implemented for DIMES/P.
C.4. Programming for Cylops - threads 79
Algorithm 4 The render-thread algorithm.
1. Set the value of m, the maximum iterations, greater than zero. Set the estimated completion-time, t, to ∞.
2. Set c = x, where x is the top-left of the strip to be rendered.
3. Initialise $n = 0$, $z_0 = 0$.
(a) Execute $z_{n+1} = z_n^2 + c$.
(b) Increment n.
(c) If $|z_n| \geq 2$ then that c is not in the set of points which comprise the Mandelbrot set. Go to 4.
(d) If n > m then that c is in the Mandelbrot set, i.e. $c \in M$. Go to 4.
(e) Go to 3a.
4. Increment the real part of c. If the real part of c is less than the width of the strip to be rendered, go to 3.
5. Calculate the average of t and the time it took to render that line.
6. Set the real part of c to the left-hand of the strip. Increment the imaginary part of c. If the imaginary part of
c is less than the height of the strip, go to 3.
7. Signal work completed, set t = 0 (thus this thread is guaranteed not to be selected by the work-stealing
algorithm 5).
8. Suspend.
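The core of algorithm 4 can be sketched in C++ as follows. This is an illustration, not the original DIMES/P source: the `Strip` structure and both function names are assumptions, and the timing estimate t (steps 1, 5 and 7) and the signalling (steps 7 and 8) are omitted.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Steps 3(a)-(e): iterate z_{n+1} = z_n^2 + c from z_0 = 0 and report
// whether c is classified as a member of the Mandelbrot set.
bool in_mandelbrot_set(std::complex<double> c, unsigned m) {
    assert(m > 0);                              // step 1: m is greater than zero
    std::complex<double> z(0.0, 0.0);           // step 3: z_0 = 0
    unsigned n = 0;                             // step 3: n = 0
    for (;;) {
        z = z * z + c;                          // (a)
        ++n;                                    // (b)
        if (std::norm(z) >= 4.0) return false;  // (c) |z_n| >= 2, tested as |z_n|^2 >= 4
        if (n > m) return true;                 // (d)
    }                                           // (e) repeat from (a)
}

// Hypothetical description of one horizontal strip of the complex plane.
struct Strip {
    double left;       // real part of the top-left point
    double top;        // imaginary part of the top-left point
    double step;       // spacing between adjacent points
    std::size_t cols;  // points per line
    std::size_t rows;  // lines in the strip
};

// Steps 2, 4 and 6: visit every point of the strip, line by line.
std::vector<bool> render_strip(const Strip& s, unsigned m) {
    std::vector<bool> result;
    result.reserve(s.cols * s.rows);
    for (std::size_t row = 0; row < s.rows; ++row) {
        for (std::size_t col = 0; col < s.cols; ++col) {
            std::complex<double> c(s.left + col * s.step, s.top + row * s.step);
            result.push_back(in_mandelbrot_set(c, m));
        }
    }
    return result;
}
```

The loop counts points with integers rather than comparing the real and imaginary parts of c against the strip bounds. This is a design choice of the sketch that avoids accumulating floating-point error.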
Algorithm 5 The work-stealing algorithm.
1. Monitor render threads for a work-completed signal. Denote the thread that completes as $T_c$.
2. Find the render thread with the longest estimated completion-time, t; note that each render thread updates
this time upon completion of a line. Call this thread $T_l$.
3. Stop $T_l$ when it completes the current line it is rendering.
4. Split the remaining work to be done in the strip equally between the two render threads $T_c$ and $T_l$.
5. Restart the render threads $T_c$ and $T_l$.
6. Go to 1.
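Steps 2 and 4 of algorithm 5 can be sketched as below. The paper does not give the data layout, so the `RenderState` record and the `steal_work` function are hypothetical. The sketch is single-threaded, so the monitoring, stopping and restarting of threads (steps 1, 3 and 5) are not shown. Resetting the estimate of the restarted thread to ∞ follows step 1 of algorithm 4 and is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-thread record; the real layout is not given in the paper.
struct RenderState {
    double estimated_time;  // t from algorithm 4 (0 once the thread has finished)
    std::size_t next_row;   // first line not yet rendered
    std::size_t end_row;    // one past the last line assigned
};

// Steps 2 and 4 for the finished thread `tc`. Returns the index of the thread
// whose work was split, or threads.size() if nothing could be stolen.
std::size_t steal_work(std::vector<RenderState>& threads, std::size_t tc) {
    assert(tc < threads.size());
    const std::size_t none = threads.size();
    std::size_t tl = none;
    double longest = 0.0;
    for (std::size_t i = 0; i < threads.size(); ++i) {
        // Finished threads have t = 0, so they are never selected.
        if (i != tc && threads[i].estimated_time > longest) {
            longest = threads[i].estimated_time;
            tl = i;
        }
    }
    if (tl == none) return none;

    RenderState& victim = threads[tl];
    const std::size_t remaining = victim.end_row - victim.next_row;
    if (remaining < 2) return none;  // a single line cannot be shared

    // The victim keeps the first half; the finished thread takes the rest.
    const std::size_t mid = victim.next_row + remaining / 2;
    threads[tc].next_row = mid;
    threads[tc].end_row = victim.end_row;
    threads[tc].estimated_time = std::numeric_limits<double>::infinity();
    victim.end_row = mid;
    return tl;
}
```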
C.4.1 Threading and the Mandelbrot Set
Due to the properties of DIMES/P, alternative techniques were not possible, as
there are only 8 thread units between two processors. In this implementation,
the complex plane was divided into a series of horizontal strips. Those strips
may be calculated independently of each other, using separate threads,
implemented as algorithm 4. However, each strip will, in general, take a
different amount of time to complete, thus the threads would have completed
their assigned portion of work at different times. Thus a work-stealing
algorithm 5 performed the load-balancing between the threads.
C.5. Disussion 80
The bandwidth of the work-stealing thread, algorithm 5, limited scaling to more
worker threads, algorithm 4. But algorithm 5 would be able to tolerate failures:
if a worker thread stopped responding, its work would eventually have been
stolen.
If robustness is not required, then the image generated may be viewed as an
array of values. Each of these values would be the classification of c. Thus if
one has threads $p_0 \ldots p_q$, each thread $p_n$ initially classifies the
point in the array offset by n, and once completed, would move along the array
using a stride of q. This would allow the use of a number of threads that is
bounded by the number of points within the image.
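The strided partitioning can be sketched as follows. The function names are hypothetical, and the sketch assumes q workers numbered 0 to q - 1 so that a stride of q visits every point exactly once. Each call to `points_for_worker` stands for the sequence of points one thread would classify; no threads are created here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices visited by worker n of q: n, n + q, n + 2q, ...
std::vector<std::size_t> points_for_worker(std::size_t n, std::size_t q,
                                           std::size_t points) {
    assert(q > 0 && n < q);
    std::vector<std::size_t> indices;
    for (std::size_t i = n; i < points; i += q) {
        indices.push_back(i);
    }
    return indices;
}

// Counts how many workers visit each point. Every count should be 1,
// showing that the workers never overlap and never miss a point.
std::vector<unsigned> coverage(std::size_t q, std::size_t points) {
    std::vector<unsigned> visits(points, 0);
    for (std::size_t n = 0; n < q; ++n) {
        for (std::size_t i : points_for_worker(n, q, points)) {
            ++visits[i];
        }
    }
    return visits;
}
```

Because the index sets are disjoint, the workers could write their classifications into a shared array without locking. A worker whose offset is beyond the end of the array receives no points, which reflects the bound on the useful number of threads.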
C.4.2 DIMES/P Implementation of the Mandelbrot-set application
In threads, eah software thread was statially alloated to one of the 8 hardware
thread-units in DIMES/P at program start-up. The software threads were:
1. The a thread was required for threads support and the debugger [42℄, if it
were to be run.
2. The main loop of the Mandelbrot-set appliation.
3. The thread that exeuted the work-stealing algorithm 5. In priniple, a worker
thread ould also run on this thread unit, but threads did not support virtual
threads.
4. The remaining 5 threads were worker threads that exeuted algorithm 4.
Further details regarding the implementation may be found in [70℄.
C.5 Disussion
The limitations of DIMES/P prevented further study of the properties of this pro-
gram: salability and timings were not done beause of the limited number of thread
units (8) and memory apaity.
C.5. Disussion 81
The compiler's memory-model support, using the C/C++ keyword static, made
natural use of language-level syntax to map data into scratch-pad and on-chip
memory, and thereby made use of these different memory models. The atomic,
word-sized, memory-operations on Cyclops were not used for this problem, because
of the multiple read-modify-write operations that had to be maintained as an
atomic unit. If the manual locking had been implemented within the compiler,
then it may have been possible for the compiler to perform optimization on the
locking of access to the data.
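The distinction between a single word-sized atomic operation and a multi-step update can be illustrated with standard C++ facilities. This is a sketch under assumptions: `lines_done` and `CompletionEstimate` are hypothetical names, `std::atomic` and `std::mutex` stand in for the Cyclops primitives and the C++ API described in this paper, and the running mean is only one possible way of maintaining an estimate such as t.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// A single word: one atomic read-modify-write operation is sufficient.
std::atomic<unsigned> lines_done{0};

// Two fields that must change together. The update consists of several
// read-modify-write steps, so it is guarded by a lock instead.
class CompletionEstimate {
public:
    void add_line_time(double seconds) {
        assert(seconds >= 0.0);
        std::lock_guard<std::mutex> lock(guard_);  // released at end of scope
        ++samples_;
        average_ += (seconds - average_) / samples_;  // running mean
    }
    double average() {
        std::lock_guard<std::mutex> lock(guard_);
        return average_;
    }
    unsigned samples() {
        std::lock_guard<std::mutex> lock(guard_);
        return samples_;
    }

private:
    std::mutex guard_;
    double average_ = 0.0;
    unsigned samples_ = 0;
};
```

If the sample count and the average were updated by two separate atomic operations, another thread could observe a new count with an old average. A compiler that knew which data the lock protects could, in principle, choose between these two forms itself.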
With regards to the thread library: in the opinion of the authors, the
complexity of POSIX-Threads has been a hindrance to successful multi-thread
program creation. Abstracting the algorithms that expressed the parallelism
within the Mandelbrot program, for example the work-stealing algorithm, was not
implemented for this paper, as this was considered to be potentially too closely
coupled to the actual program in question. Ultimately this decision, in the
authors' opinion, was flawed: extracting and abstracting the work-stealing
algorithm from both the program and Cyclops would have allowed a programmer to
reuse that algorithm with other programs, thus separating the design of the
parallelism from the details of the program that would wish to use it.
It is still an open question what the ideal approach to parallelism may be:
language-level support such as UPC, HPF or other language extensions; within the
compiler using trace-scheduling; at a library-level using, for example, OpenMP
or POSIX-Threads; or within the architecture, such as the data-flow design. If
programs more sophisticated than the one described in this paper are to be
successfully written for these cellular architectures, then based upon this
brief examination, it is the authors' contention that it would be highly
advantageous to have:
• Compiler support for making use of any available memory model of the
architecture.
• Compiler support for locking, which would aid the programmer with writing
code that avoids race-conditions.
C.5. Disussion 82
• Reusable abstrations of tehniques of implementing parallelism, suh as work-
stealing, or master-slave models. These abstrations ould make use of both
data and ode loality to ensure that a thread unit re-exeutes the same ode,
if desirable.
The researh presented in this paper is supported by the Engineering and Physial
Researh Counil (EPSRC) grant number: GR/S58492/01.
