Easy PRAM-based High-performance parallel Programming by Ghanim, Fady
ABSTRACT




Doctor of Philosophy, 2017
Dissertation directed by: Professor Rajeev Barua
Department of Electrical and Computer Engineering
Parallel machines have become more widely used. Unfortunately, parallel programming technologies have advanced at a much slower pace, except for regular programs. For irregular programs, this advancement is inhibited by high synchronization costs, non-loop parallelism, non-array data structures, recursively expressed parallelism, and parallelism that is too fine-grained to be exploitable.

This work introduces ICE, a new parallel programming language that is easy to program, since: (i) ICE is a synchronous, lock-step language; (ii) for a PRAM algorithm, its ICE program amounts to directly transcribing it; and (iii) the PRAM algorithmic theory offers a unique wealth of parallel algorithms and techniques. This work suggests that ICE be part of an ecosystem consisting of the XMT architecture, the PRAM algorithmic model, and ICE itself, which together deliver on the twin goals of easy programming and efficient parallelization of irregular programs. The XMT architecture, developed at UMD, can exploit fine-grained parallelism in irregular programs. This work also presents the ICE compiler, which translates the ICE language into the multithreaded XMTC language; the significance of this is that multi-threading is a feature shared by practically all current scalable parallel programming languages.

As one indication of ease of programming, a reduction in code size was observed in 11 out of 16 benchmarks vs. XMTC. For these programs, the average reduction in the number of lines of code was 35.53% when compared to hand-optimized XMTC. The remaining 4 benchmarks had the same code size. The ICE compiler achieved comparable run-time to XMTC with a 0.48%




Fady Ahmad Abdalrahim Ghanim
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment




Professor Rajeev K. Barua, Chair/Advisor




© Copyright by

I dedicate this work to my mother.
Acknowledgments
First and foremost, I start by thanking God for all the achievements discussed herein.

Second, I owe my gratitude to all the people who have made this dissertation possible and because of whom my graduate experience has been one that I will cherish forever.

I’d like to thank my advisor, Professor Rajeev Barua, for giving me an invaluable opportunity to work on challenging and extremely interesting projects over the past years. He has always made himself available for help and advice, and there has never been an occasion when I’ve knocked on his door and he didn’t give me time. Even during busy times, he’d always patiently listen to the long-winded (incoherent at times) explanation of my raw ideas, then help me summarize them and advise me on ways to improve and build upon them. It is a pleasure to work with and learn from such an extraordinary individual.

I would also like to thank my co-advisor, Professor Uzi Vishkin. Without his extraordinary theoretical ideas and expertise in parallel algorithms, this dissertation would have been impossible. Thanks are due to Professor Bruce Jacob, Professor Manoj Franklin and Professor Alan Sussman for agreeing to serve on my PhD defense committee and for sparing their invaluable time reviewing the manuscript.

I owe my deepest and undying gratitude to my mother, who has always stood by me and guided me through my entire life. Her wisdom and her sound advice were a light that pulled me through in the darkest times. Ever since my conception she’s been carrying me, whether physically or in her heart and thoughts. Words cannot express the amount of gratitude I owe her. Also, I want to thank my little nephews and nieces who cried their hearts out when I first left home to do my studies. I want to thank my brothers and the rest of my family for supporting me throughout the entire period.

Also, I’d like to express my gratitude to my friend Kelly Flanagan. Kelly has always been there for me at the time of need. While doing this work, I have gone through trying times and personal ordeals. Kelly, with kindness and support, helped me through it all. I am lucky to have such a great person in my life. Kelly’s a lifesaver.

I would like to acknowledge financial support from the Fulbright program and to thank it for providing me with this opportunity.

It is impossible to remember all, and I apologize to those I’ve inadvertently left out. Lastly, thank you all!
Contents
List of Abbreviations
1 Introduction
1.1 Contributions
2 Background
2.1 Overview
2.2 The PRAM Algorithmic Models
2.3 The Work-Depth Model for PRAM Algorithms
2.4 The many-core XMT Architecture
2.5 The XMTC language
2.6 The Programmer Workflow for Writing PRAM-Based Programs
2.7 Advantages of the XMT Platform
2.7.1 The Performance of the XMT Platform
2.7.2 The XMT Teachability and Ease of Use
3 The ICE Programming Language
3.1 Overview
3.2 The ICE language
3.3 The Syntax of the ICE Language
3.4 Example Showcasing the ICE Language
3.5 The ICE language Advantages
4 The ICE Language Compiler
4.1 Overview
4.2 Translating ICE to XMTC
4.2.1 Splitting a pardo Region into Multiple spawn blocks
4.2.2 Communication of Intermediate Information Among spawn Blocks
4.2.2.1 Handling Data Flow Across pardo Region Splits
4.2.2.2 Handling Control Flow Across pardo Region Splits
4.3 Optimizing The Translated ICE Code
4.3.1 Clustering of Memory Instructions
4.3.2 Reducing the Number of Temporaries
4.3.3 Fixing Control Flow after Clustering
4.4 The ICE Compiler Structure
4.4.1 Preliminary Code Optimization
4.4.2 Building the Dependency Graph
4.4.3 Maintaining Correctness of LLVM’s SSA Form
4.4.4 Transforming the LLVM IR into XMTC code
4.5 Support for Nested Parallelism
5 Evaluating The ICE Language: Results and Analysis
5.1 Environment and Methodology
5.2 Ease of use and Code size
5.3 Accuracy
5.4 Performance
6 Related work and Conclusion
6.1 Related work
6.2 Conclusion
A Dependence Tests and Analysis
A.1 Overview of Dependence Testing
A.2 Zero Index Variable Test
A.3 Single-Subscript Dependence Tests
A.3.1 Single Index Variable Tests
A.3.1.1 Strong SIV Test
A.3.1.2 Weak SIV Subscripts
A.3.1.3 Weak-Zero SIV Test
A.3.1.4 Weak-Crossing SIV Test
A.3.1.5 Exact SIV Test
A.4 Multiple Index Variable Test
A.4.1 The Greatest Common Denominator Test
A.4.2 Restricted Double Index Variable
A.5 Testing in Coupled Groups
A.5.1 Delta Test
A.6 Symbolic Tests
List of Tables
3.1 Comparison of the pardo and spawn constructs.
5.1 A list of the non-nested benchmarks.
5.2 A list of the nested benchmarks.
5.3 Number of spawn blocks and temporaries in both




List of Figures
2.1 Representation of the standard PRAM mode. Only p operations are executed at each time step t.
2.2 Representation of the Work-Depth mode. As many operations as needed are executed at each time step t.
2.3 XMT hardware. (a) Block diagram. (b) Memory hierarchy in parallel mode. The left side of (b) shows the estimated latency to each memory hierarchy level from the processing core for a 1024-TCU configuration (64 clusters × 16 TCUs). Some elements are omitted for simplicity, such as the Master TCU, which operates in serial mode, the global register file and the prefix-sum unit.
2.4 XMT Programming. (a) Array compaction example. Array A’s non-zero elements are copied into B. The order is not necessarily preserved. After executing ps(inc,base), the base variable is increased by inc and the inc variable gets the original value of base, as an atomic operation. (b) The XMT execution model: switching between serial and parallel modes.
2.5 XMTC language syntax. low and high represent the IDs of the first and last threads. The join is implicit in the closing bracket of the spawn block. Variables declared outside the spawn block are shared, while variables declared inside are thread private.
2.6 The program flow for translating a PRAM algorithm into an XMT algorithm.
3.1 ICE language syntax.
3.2 Pointer jumping example showing the simplicity of ICE code. (a) provides a description of the pointer jumping problem. This problem is then solved using the ICE programming language (b), the XMTC programming language (c), and OpenMP (d).
3.3 Threaded code with a race condition.
4.1 (a) A pardo with a conditional branch. (b) Its XMTC translation.
4.2 Rescheduling memory accesses. Statement A2 is dependent on statement A1, and statement B2 is dependent on statement B1. Statements A are independent from statements B.
4.3 The clustering algorithm.
4.4 Example showing placement of instructions into their respective clusters, before and after clustering, and the CFG replication process. The store depends on the load across multiple parallel contexts. The placement of instructions A1 and A2 into clusters 1 and 2, respectively, results in breaking the SSA dominance property. To resolve this problem, A1 is cloned to temp in cluster 1, and temp is retrieved to be used by A2 in cluster 2.
5.1 Code size for the entire program normalized to XMTC.
5.2 Code size of the algorithm’s parallel sections normalized to XMTC.
5.3 Code size for the entire program normalized to XMTC for nested benchmarks.
5.4 Code size of the algorithm’s parallel sections normalized to XMTC for nested benchmarks.
5.5 64 TCU XMT processor speedup comparison of non-nested ICE programs normalized to performance of hand-optimized XMTC.
5.6 64 TCU XMT processor speedup comparison for nested ICE programs normalized to performance of hand-optimized XMTC.
5.7 64 TCU XMT net speedup of non-nested ICE programs normalized to hand-optimized XMTC.
5.8 64 TCU XMT net speedup of ICE normalized to optimized XMTC for the nested benchmarks.
5.9 64 TCU XMT net speedup comparison between nested and non-nested ICE normalized to optimized XMTC.
List of Abbreviations
CFG Control Flow Graph
CRCW Concurrent Read Concurrent Write
DA Dependence Analysis pass
ERCW Exclusive Read Concurrent Write
EREW Exclusive Read Exclusive Write
ICE Immediate Concurrent Execution
IOS Independence of Order Semantics
MIV Multiple Index Variable
MTCU Master Thread Control Unit
PRAM Parallel Random Access Machine
PS Prefix-Sum
RDIV Restricted Double Index Variable
SIV Single Index Variable
SPMD Single Program Multiple Data
SSA Static Single Assignment
TCU Thread Control Unit
TID Thread ID
WD Work-Depth
XMT eXplicit Multi-Threading
XMTC eXplicit Multi-Threading C language
ZIV Zero Index Variable
Chapter 1: Introduction
Since 2005, practically all computers have become (multi-core) parallel machines. The field of parallel computing has made tremendous strides in exploiting parallelism for performance. However, it is also increasingly recognized that its trajectory is short of its general-purpose potential.
Parallel machines require partitioning the task at hand into subtasks (threads) to be run concurrently while minimizing: (i) memory accesses beyond local (cache) memories, and (ii) communication and synchronization among subtasks. Other programmer responsibilities include locking, which can be tricky for the fine-grained multi-threading needed for scaling, work distribution and scheduling, and handling concurrent access to data structures. While parallel programming languages and parallel machines differ on how much of the partitioning is the programmer’s responsibility, they all expect a significant effort from the programmer for producing an efficient multi-threaded program. Establishing correctness of these programs is yet another challenge, as asynchrony may increase the number of reachable states exponentially.
The theory of general-purpose parallel algorithms assumes an abstract computation model (known as PRAM, for parallel random-access machine, or model) that stands in sharp contrast to these hardships: each time step involves a plurality of operations, all performed synchronously in unit time, and may include access to a large shared memory. This PRAM computation model abstracts away opportunities for using local memories and for minimizing communication or synchronization, as well as locking, work distribution, scheduling and, in fact, any concept of threads. Also, for PRAM, practically every problem has a parallel algorithm. This makes it both desirable and much easier to specify PRAM parallel algorithms, and the question that started this thesis has been: but at what performance penalty? As explained next, our surprising result is that it is feasible to avoid any performance penalty.
Coupled with prior work, this thesis establishes the following result: (i) it is feasible to get competitive speedups while essentially using PRAM algorithms as-is for programming a parallel computer system; furthermore, (ii) these speedups are on par with multi-threaded code optimized to minimize non-local memory accesses, communication and synchronization. Establishing the feasibility of such abstract (and much simpler) PRAM programming, whose performance is on par with the best manually optimized programs, is a specific new contribution of this thesis.
The prior work of our research group anticipated the above hardships. To preempt as many of them as we deemed feasible, our starting point for the design of a many-core architecture framework called XMT was the rich theory of parallel algorithms, known as PRAM (for parallel random-access machine or model), developed in the 1980s and early 1990s. XMT made big strides toward overcoming claims by many that it would be impossible in practice to support PRAM algorithms effectively (e.g., [1]). Its premise (in prior work) has been that it must be the programmer who will produce a multi-threaded program: [2] outlines a programmer’s workflow for advancing from a PRAM algorithm to an XMT multi-threaded program. Namely, the programmer is still responsible for producing a multi-threaded program with improved locality and reduced communication and synchronization. The hardware support that XMT provides made this effort easier than for commercial machines, which paid off: this workflow allowed better speedups and demonstrated easier learning of parallel programming. Since our prior work remained wedded to programmer-provided multi-threading, it characterized XMT programming as PRAM-like, as opposed to just PRAM.
This new work is fundamentally different. It shows for the first time that the threading-free synchronous parallel algorithms taught in PRAM textbooks can be used as-is for programming without performance penalty. Namely, it is feasible to reduce multi-threading to a compiler target, altogether freeing the cognition of the programmer from multi-threading. In fact, this thesis shows that the programmer can essentially use the pseudo-code used in textbooks for describing synchronous parallel algorithms as-is; this elevates XMT from supporting PRAM-like programs to supporting PRAM programs. Note that the new result exceeds our research group’s expectations at the beginning of the XMT project: it was expected that the programmer would need to make an extra effort to explicate PRAM parallelism as multi-threaded parallelism; indeed, the name of XMT, explicit multi-threading, reflects the original expectation. As can be seen from the example, XMT gets only part of the way: to fine-grained multi-threading, but not to lock-step PRAM programming.
ICE allows the same intuitive abstraction that made it easy to reason about and program in serial. Namely, any instruction available for execution can execute immediately. In serial, a program provides the instruction to be executed in the next time step. This makes serial programs behave as rudimentary inductive steps from the start of the program to its final result. Similarly, ICE describes time-steps of serial or concurrent parallel instructions that execute immediately each time-step (inductively), while falling back to serial execution for the serial portion of the code. In unifying serial and parallel code, ICE can be thought of as the natural extension of the serial model.
1.1 Contributions
In this work we make the following contributions:
(1) A new programming language called ICE that allows the programmer to express the ICE abstraction easily and directly.
(2) To enable this much higher-level programming language, this thesis introduces a new compiler component that automatically translates the ICE program into an efficient XMTC program.
(3) The ICE compiler produces XMTC code that achieves performance comparable to hand-optimized XMTC programs for a given PRAM algorithm, while requiring much less effort to write and implement than the hand-written optimized XMTC program.
(4) An extension of the compiler that translates nested parallel programs written in the ICE language into optimized XMTC code.
The significance of these contributions is:
(1) The work in this thesis enables programmers to write lock-step PRAM algorithms as programs, implement them ’as-is’, and execute them on threaded machines without sacrificing performance.
(2) ICE requires programmers only to specify the parallelism available in an algorithm. This significantly reduces the effort required to write parallel code, and makes it much easier for programmers to learn how to write parallel programs.
(3) The ICE compiler is the first compiler that translates from a language based on the lock-step execution model into a programming language that follows the threaded execution model.
Chapter 2: Background
2.1 Overview
The ICE language is intended to make parallel programming easier by allowing programmers to write parallel programs based on PRAM algorithms. The ICE language compiler translates programs written in ICE into the XMTC high-level language, and the resulting code is executed on the XMT processor. ICE saves programmers the effort of converting PRAM algorithms into high-performance XMTC programs. This chapter covers the background necessary for understanding the basis of ICE and appreciating what is accomplished in this thesis.
2.2 The PRAM Algorithmic Models
Since ICE is grounded in the Parallel Random Access Machine (PRAM) model, understanding PRAM is crucial to understanding ICE. Developed in the 1980s and early 1990s, PRAM is the parallel equivalent of the standard random access machine model used for serial algorithmic theory. PRAM is used by algorithm designers to model the algorithmic performance of parallel algorithms.
PRAM is intended to abstract shared memory machines; PRAM assumes p parallel processors, each of which has symmetric access time to shared memory and has its own private memory. There is no limit on the number of processors or the amount of shared memory in the system. In each unit of time, a processor can read from or write to shared memory, or complete a computation, simultaneously with other processors, as shown in Figure 2.1. This can result in conflicts when several processors access the same shared memory location; for that reason, certain rules have been established to resolve these conflicts:

Figure 2.1: Representation of the standard PRAM mode. Only p operations are executed at each time step t.

• Exclusive-Read Exclusive-Write (EREW): Only one processor can access a memory location in a given time unit, for both reads and writes.
• Concurrent-Read Exclusive-Write (CREW): Processors can read the same memory location simultaneously in a given time unit; however, only one processor can write to it.
• Concurrent-Read Concurrent-Write (CRCW): Processors can read and write simultaneously in a given time unit. Reads always happen before writes. There are multiple rules for determining which write to memory succeeds. Some of these rules are:
(i) Priority CRCW: The processor with the smallest ID wins, and its write is the one that succeeds.
(ii) Common CRCW: Concurrent writes are not allowed unless all processors are attempting to write the same value to the same memory location.
(iii) Arbitrary CRCW: Any one of the processors trying to write to the same memory location succeeds. ICE follows this convention.
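The three CRCW rules above can be made concrete with a small simulation. The following Python sketch is only an illustration and not part of ICE or XMT; the function name resolve_crcw and the representation of a write as a (processor ID, value) pair are assumptions made for this example.

```python
import random

def resolve_crcw(writes, rule, rng=None):
    """Resolve concurrent writes to one memory cell in a single PRAM step.

    writes: list of (processor_id, value) pairs aimed at the same cell.
    rule:   "priority", "common", or "arbitrary".
    Returns the value that ends up in the cell.
    """
    if not writes:
        raise ValueError("no writes to resolve")
    if rule == "priority":
        # The processor with the smallest ID wins.
        return min(writes, key=lambda w: w[0])[1]
    if rule == "common":
        # Concurrent writes are legal only if all values agree.
        values = {value for _, value in writes}
        if len(values) != 1:
            raise ValueError("Common CRCW: processors wrote different values")
        return values.pop()
    if rule == "arbitrary":
        # Any one writer may succeed; the algorithm must not care which.
        rng = rng or random.Random()
        return rng.choice(writes)[1]
    raise ValueError(f"unknown rule: {rule}")

writes = [(3, "c"), (1, "a"), (2, "b")]
priority_result = resolve_crcw(writes, "priority")
arbitrary_result = resolve_crcw(writes, "arbitrary", random.Random(0))
common_result = resolve_crcw([(1, 7), (2, 7)], "common")
```

Under the arbitrary rule, a correct algorithm has to produce a valid result whichever writer is chosen, which is why the sketch makes no promise about the winner.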
The PRAM model presents algorithms as a sequence of time units, each of which contains p operations, executed as one instruction per processor. As such, PRAM only allows p operations to be executed concurrently, as shown in Figure 2.1. This presentation is called the standard PRAM mode. To express operations that are performed in parallel, the pardo (parallel do) programming construct is used.
for P_i, 1 <= i <= n pardo
    perform some operation
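A minimal Python sketch of what one synchronous pardo step means is given below. It is a sequential simulation under stated assumptions, not how ICE or XMT executes code: the helper pardo_step, the dictionary used as shared memory, and the example body A[i] = A[i] + A[i-1] are all hypothetical. The point it illustrates is that every processor reads the state from before the step, and all writes are committed afterwards.

```python
def pardo_step(n, body, memory):
    """Run body(i, snapshot) for i = 1..n as one synchronous time step.

    Every virtual processor reads from the same pre-step snapshot and
    returns a dict {address: value} of writes. Writes are committed only
    after all reads, mirroring "reads always happen before writes".
    If two processors write the same address, the one that survives here
    is simply the last one simulated (one arbitrary choice).
    """
    snapshot = dict(memory)        # state visible to all reads in this step
    pending = {}
    for i in range(1, n + 1):      # simulation order does not affect reads
        pending.update(body(i, snapshot))
    memory.update(pending)         # commit all writes together
    return memory

# for P_i, 1 <= i <= n pardo: A[i] = A[i] + A[i-1]
n = 4
A = {0: 0, 1: 10, 2: 20, 3: 30, 4: 40}
pardo_step(n, lambda i, mem: {i: mem[i] + mem[i - 1]}, A)

# For contrast: a serial in-place loop over the same statement.
S = {0: 0, 1: 10, 2: 20, 3: 30, 4: 40}
for i in range(1, n + 1):
    S[i] = S[i] + S[i - 1]
```

The lock-step version adds each element to its neighbour's old value, while the serial loop sees values already updated by earlier iterations, so the two results differ.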
The standard PRAM model has a few disadvantages, especially when compared to other forms of presenting PRAM algorithms:
(i) It does not reveal how changing the number of processing units will affect the speed of algorithm execution.
(ii) Writing a parallel algorithm requires specifying a level of detail that might be unnecessary, because the algorithm is expressed in terms of the number of processors available on the platform to be used.
2.3 The Work-Depth Model for PRAM Algorithms
The Work-Depth (WD) model is an alternative presentation model for PRAM algorithms. In this model, algorithms are represented as a sequence of time units, where each unit contains a group of operations to be executed concurrently, as shown in Figure 2.2. The main difference between this model and the standard PRAM model is that the number of concurrent operations to be executed at each time step in the WD model is not bounded by the number of available processing units. Rules for resolving conflicts, such as arbitrary, priority and common CRCW, are defined in the WD model as they are in standard PRAM.
Performance of a parallel algorithm of size n that is given in WD mode is measured in terms of worst case running time

Figure 2.2: Representation of the Work-Depth mode. As many operations as needed are executed at each time step t.
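To illustrate how a WD presentation is independent of the processor count, the Python sketch below profiles a balanced-tree summation, an example chosen here for illustration and not taken from the text. The helper names are hypothetical. It counts the operations in each WD time step, and then estimates the number of steps needed when each WD step is executed by p processors in rounds.

```python
import math

def wd_summation_profile(n):
    """Operations per time step for a balanced-tree summation of n values.

    In each step, adjacent pairs are added concurrently, so the number
    of operations per step is not tied to any processor count.
    """
    ops_per_step = []
    remaining = n
    while remaining > 1:
        pairs = remaining // 2
        ops_per_step.append(pairs)
        remaining -= pairs
    return ops_per_step

def steps_on_p_processors(ops_per_step, p):
    """Time steps needed if each WD step is executed by p processors."""
    return sum(math.ceil(ops / p) for ops in ops_per_step)

profile = wd_summation_profile(8)   # operations in each WD time step
work = sum(profile)                 # total number of operations
depth = len(profile)                # number of WD time steps
steps_p2 = steps_on_p_processors(profile, 2)
```

For 8 inputs the profile is 4, 2 and 1 operations over three steps; with 2 processors the first step takes two rounds, giving four steps in total. The algorithm description itself never mentions p.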
2.4 The many-core XMT Architecture
A brief review of some basic concepts of the eXplicit Multi-Threading (XMT) on-chip general-purpose computer architecture [3, 4, 5] follows.
The XMT architecture was designed to capitalize on the huge on-chip resources becoming available as a result of modern fabrication technologies. The primary goal of XMT has been to take advantage of parallelism to improve the performance of single tasks. Since taking advantage of the huge body of knowledge available within PRAM is the goal behind XMT, it was designed with PRAM algorithms in mind.
The XMT framework uses an arbitrary CRCW (concurrent read concurrent write) SPMD (single program multiple data) programming model. An arbitrary number of virtual threads, initiated by a spawn instruction and terminated by a join instruction, share the same code [6]. The arbitrary CRCW aspect dictates that concurrent writes to the same memory location result in an arbitrary one committing. No assumption needs to be made beforehand about which will succeed. An algorithm designed with this property in mind permits each thread to progress at its own speed from its initiating spawn to its terminating join, without ever having to wait for other threads; that is, it exhibits “independence of order semantics” (IOS) [5, 6], such that no thread busy-waits for another thread. See Figure 2.4(b).
The XMT processor, shown in Figure 2.3(a), implements the above programming model efficiently. It includes a master thread control unit (MTCU), processing clusters (C0...Cn) each comprising several thread control units (TCUs), a high-bandwidth low-latency interconnection network (an essential component presented in [7, 8]) between clusters and memory modules, a global register file (GRF), a prefix-sum unit explained below, and memory modules (MMs) each comprising on-chip cache and off-chip memory, with several MMs possibly sharing a memory controller. The MTCU has a standard private data cache, used only in serial mode, and a standard instruction cache, while sharing the MMs with all the TCUs.
Since the prefix-sum (ps) operation is a central component of XMT, it must be executed very efficiently. The hardware implementation of the prefix-sum unit [6, 9] can accept binary input from multiple TCUs, and its execution time does not depend on the number of TCUs that are sending requests to it. It enables constant-time, low-overhead coordination between tasks, a key requirement for implementing efficient fine-grained parallelism. As such, it can also be used for implementing efficient and scalable inter-thread synchronization, by arbitrating an ordering between the threads.
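The semantics of ps can be sketched in Python using the description given in the caption of Figure 2.4: after ps(inc, base), base is increased by inc and inc receives the original value of base, atomically. The sketch below is a single-threaded simulation, so atomicity is trivial; the class PsBase, the function compact, and the use of a shuffled thread order to mimic arbitrary scheduling are assumptions made for illustration, not XMTC code.

```python
import random

class PsBase:
    """Shared base variable for a simulated ps(inc, base) operation."""

    def __init__(self, value=0):
        self.value = value

    def ps(self, inc):
        """Add inc to base and return the original base value.

        On XMT this pair of effects is atomic; this single-threaded
        simulation is trivially atomic.
        """
        old = self.value
        self.value += inc
        return old

def compact(A, rng):
    """Copy the non-zero elements of A into B in arbitrary order."""
    B = [0] * len(A)
    base = PsBase(0)
    thread_ids = list(range(len(A)))
    rng.shuffle(thread_ids)          # threads may run in any order
    for tid in thread_ids:
        if A[tid] != 0:
            slot = base.ps(1)        # unique index for this thread
            B[slot] = A[tid]
    return B, base.value

A = [5, 0, 0, 7, 3, 0, 9]
B, count = compact(A, random.Random(1))
```

Each thread that holds a non-zero element obtains a distinct slot from ps, so the non-zero elements fill a prefix of B without locks, although their order depends on which thread reaches ps first.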
The XMT programming model allows programmers to specify an arbitrary degree of parallelism in their code. Clearly, real hardware has finite execution resources, so in general not all threads can execute simultaneously. A hardware scheduler, which extends the stored-program-plus-program-counter apparatus at the MTCU to the TCUs [9], allocates the individual virtual threads to the physical thread control units (TCUs). It relies heavily on hardware support and the prefix-sum unit, which allow for a very efficient implementation. Before the parallel execution starts, two global registers (grLO and grHI) are initialized with the thread IDs of the first and last thread. Then the spawn instruction signals the beginning of the parallel execution, which wakes up the TCUs in a way that each TCU immediately knows its thread ID (TID), and makes the MTCU broadcast the parallel code to the TCUs. When a TCU completes a thread and is ready to execute another thread, it performs a ps operation on grLO with an increment of 1 to get its new TID. The hardware combines these ps operations, and all requesting TCUs receive their TIDs in a few cycles. Then each TCU executes the powerful chkid instruction, which compares its TID against grHI: if TID ≤ grHI, the TID is valid and the TCU proceeds to execute the thread; otherwise it blocks until either the TID becomes valid (grHI got incremented) or all TCUs are blocked, signaling the end of the parallel execution. The end of the parallel section is marked by the join instruction.
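The thread allocation scheme described above can be approximated by the following Python simulation. It is a simplified sketch under several assumptions that are not in the text: TCUs are simulated in round-robin order, every TCU (including on its first thread) obtains its TID through ps on grLO, grHI is never incremented during the run, and the function name run_spawn is hypothetical.

```python
def run_spawn(low, high, num_tcus):
    """Simulate TCUs claiming thread IDs in [low, high] with ps and chkid.

    Returns a dict mapping each executed TID to the TCU that ran it.
    """
    gr_lo = low        # next unclaimed thread ID (global register grLO)
    gr_hi = high       # last valid thread ID (global register grHI)
    executed = {}
    blocked = [False] * num_tcus
    while not all(blocked):
        for tcu in range(num_tcus):
            if blocked[tcu]:
                continue
            tid = gr_lo            # ps with increment 1: take old value ...
            gr_lo += 1             # ... and advance grLO
            if tid <= gr_hi:       # chkid: the TID is valid
                executed[tid] = tcu
            else:
                blocked[tcu] = True    # no more threads for this TCU
    return executed

executed = run_spawn(low=0, high=9, num_tcus=4)
```

Once every TCU has drawn a TID above grHI, all TCUs are blocked, which corresponds to the end of the parallel section in the description above.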
Figure 2.3: XMT hardware. (a) Block diagram. (b) Memory hierarchy in parallel mode. The left side of (b) shows the estimated latency to each memory hierarchy level from the processing core for a 1024-TCU configuration (64 clusters × 16 TCUs). Some elements are omitted for simplicity, such as the Master TCU, which operates in serial mode, the global register file and the prefix-sum unit.
Figure 2.3(b) gives an overview of the XMT memory hierarchy while operating in parallel mode. The XMT TCUs/clusters have no private caches, since scalable cache coherence protocols are very complicated to implement in hardware and inefficient for certain types of memory access patterns, typically those of fine-grained parallelism. For fine-grained parallelism, cache-coherent private caches are also not efficient in terms of power, due to the large granularity of the data movement between caches, extra cache coherence message

Figure 2.4: XMT Programming. (a) Array compaction example. Array A’s non-zero elements are copied into B. The order is not necessarily preserved. After executing ps(inc,base), the base variable is increased by inc and the inc variable gets the original value of base, as an atomic operation. (b) The XMT execution model: switching between serial and parallel modes.

approach is the relatively long latency of memory accesses
that require a round trip to the shared parallel cache through
an interconnection network. Several techniques have been
designed to reduce this latency or overlap it with computation,
most notably prefetching customized for XMT [10].
A first commitment of XMT to silicon is reported in [3,
11]: a 64-processor, 75 MHz computer prototype based on
FPGA technology was built at the University of Maryland
(UMD)¹. This milestone contributes towards advancing the
perception of PRAM implementability from impossible to
available.
¹ Following an international naming contest with close to 6000
submissions, the name Paraleap was given to the prototype.
2.5 The XMTC language
To deepen the understanding of the XMT architecture,
an examination of how it is programmed is required.
Architectures that support parallel execution have programming
frameworks that encourage the programmer to express all
the parallelism available in the application, and XMT is no
different. In XMT, a scheduler allocates parallel threads to
the physical thread control units, and the execution consists
of alternating sequences of serial and parallel code. To
support this paradigm, XMT uses a programming language,
XMTC, which was designed to provide programmers with the
low-level operations supported by the hardware, as well as an
easy mapping of the structures of PRAM algorithms.
The XMTC high-level language is a modest extension of
the standard C language, detailed in [12]. Figure 2.5 shows the
general syntax of the XMTC language, while figure 2.4(a)
presents an example of XMTC code. The XMTC language
extends the C programming language by adding a few keywords
that allow programmers to write parallel code and access the
XMT processor's different features. The most notable are:
(i) The spawn (low, high) statement. Accepts the IDs of the
first and last threads to create as its parameters; a spawn
creates (high − low + 1) threads. Similar to the XMT spawn
instruction, it specifies a code region that is to be executed
in parallel by the created threads, and also serves as
a directive to XMT to "spawn" the threads. The number
of threads created is independent of the number of
TCUs available in an XMT processor, often exceeding it
significantly. The threads are usually short and the
execution switches frequently between serial and parallel
modes, as pictured in Figure 2.4(b). The threads
terminate at an implicit join at the closing bracket of the
spawn block. All threads must complete before execution
proceeds beyond the spawn block.
(ii) Thread ID $. A reserved identifier inside the parallel
region that evaluates to the executing thread's unique ID,
which enables the SPMD programming style of XMTC. $ takes
integer values within the range low ≤ $ ≤ high.
(iii) Prefix-sum statements ps/psm (increment, base). Define
an atomic prefix-sum operation similar in function to the
atomic Fetch-and-Add of the NYU Ultracomputer [13].
They give programmers access to XMT's powerful prefix-sum
unit. A prefix sum takes an increment and a base as
parameters, in the order used in figure 2.4(a): the value of
increment is atomically added to base, and the original
value of base is returned in the variable used for increment.
The increment has to be a thread-private variable, which
is usually allocated in a TCU's local register. There are
two versions of prefix sum available.
• ps (increment, base) takes only global XMT
registers as the base parameter, and the increment
variable can only have a value of either 0 or 1.
It uses XMT's prefix-sum hardware, which combines
multiple concurrent requests and executes all of them
in constant time.
• psm (increment, base) takes memory locations as
the base parameter, and the increment value can be
any integer value. It is more expensive than the ps
variant, since it requires a round trip to memory.
Multiple psm requests arriving simultaneously at the cache
will be queued.
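
The semantics of the prefix-sum statements can be approximated on a
conventional machine with an atomic fetch-and-add. The sketch below
uses C11 atomics and follows the argument order ps(inc,base) used in
figure 2.4; ps_emul is a hypothetical name, and the sketch does not
model the hardware combining of concurrent ps requests or the
restriction of ps to global registers and increments of 0 or 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Emulates ps(increment, base): atomically adds *increment to *base and
 * leaves the original value of *base in *increment. */
void ps_emul(int *increment, atomic_int *base) {
    assert(increment != NULL && base != NULL);
    *increment = atomic_fetch_add(base, *increment);
}
```

Two successive calls with an increment of 1 on a base holding 10
would leave 10 and then 11 in the callers' increment variables and 12
in the base, which is how threads obtain distinct values.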
    …
    serial code
    shared variables declaration
    …
    spawn (low, high) {
        …
        private variables declaration
        threaded parallel code
        …
    } //implicit join

Figure 2.5: XMTC language syntax. low and high represent the
IDs of the first and last threads. The join is implicit in the
closing bracket of the spawn block. Variables declared outside
the spawn block are shared, while variables declared inside are
thread private.
The XMTC language follows a fork-join model for the
creation and termination of threads. It also follows a
convention where shared variables are declared outside the spawn
block, while thread-private variables are declared inside the
spawn block.
To summarize how parallel programming in XMT/XMTC
works: a parallel region is delineated by spawn and join
statements, as shown in figure 2.4(a). Every thread executing
the parallel code is assigned a unique thread ID, designated
$. The threads proceed with independent control,
synchronizing at the implicit join, which terminates the threads. The
spawn statement takes as parameters the IDs of the first and
last threads to create. Synchronization is achieved through
the prefix-sum (ps) and join commands.
Figure 2.4(a) demonstrates XMTC's power by showing how
it can be used to assign a unique index in array B when
compacting an array A. The order is not necessarily
preserved. The ps operation is used to acquire the next available
location in array B, where a non-zero element of array
A will be stored. This example shows how each thread
progresses at its own pace, owing to XMT's Independence of
Order Semantics (IOS) property, which is accomplished by using
the ps operation to obtain the next available location in array B.
Since ps provides answers in constant time, threads can
execute at their own pace without having to wait for one
another.
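
The array compaction idea can be sketched in plain C. This is a
sequential emulation under stated assumptions, not the XMTC code of
figure 2.4(a): each loop iteration plays the role of one thread, and
ps_seq and compact are hypothetical names. Because the emulation runs
the "threads" in index order, it happens to preserve the order of the
elements, which XMT does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential stand-in for ps(inc, base): base += inc, inc = old base. */
static void ps_seq(int *inc, int *base) {
    int old = *base;
    *base += *inc;
    *inc = old;
}

/* Copies the non-zero elements of a[0..n-1] into b and returns how many
 * elements were copied. Iteration i stands in for the thread with $ == i. */
int compact(const int *a, int n, int *b) {
    assert(a != NULL && b != NULL && n >= 0);
    int base = 0;                  /* shared: next free slot in b */
    for (int i = 0; i < n; i++) {  /* stands in for spawn(0, n-1) */
        if (a[i] != 0) {
            int inc = 1;           /* thread-private increment */
            ps_seq(&inc, &base);   /* inc now holds a unique index into b */
            b[inc] = a[i];
        }
    }
    return base;
}
```

No thread waits for another to learn its destination index: the
value returned by the prefix sum is all it needs.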
Nested parallelism in XMTC. The XMTC language allows
programmers to nest spawn regions to create nested parallel
code. However, when writing nested parallel code using
the XMTC language, attention must be paid to how thread
IDs are handled. The XMTC language has
only one way to access a thread's ID (i.e., $), and has no
variation of it for situations where nesting is involved. In
those situations, $ acts as an identifier for the thread
IDs of the innermost spawn block. An XMTC programmer can
handle this by manually creating a local variable
to store the thread ID of a spawn block before starting a
new spawn block nested inside it.
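
The thread-ID workaround can be illustrated with ordinary C scoping.
In the sketch below, two nested loops stand in for two nested spawn
regions and the variable dollar plays the role of $; fill_nested,
dollar and outer_id are hypothetical names chosen for the
illustration. The inner declaration of dollar hides the outer one,
just as $ refers only to the innermost spawn block, so the outer ID
is saved in a local variable first.

```c
#include <assert.h>
#include <stddef.h>

/* Fills m[r * cols + c] = r * cols + c using two nested "spawn" regions
 * emulated by loops. */
void fill_nested(int rows, int cols, int *m) {
    assert(rows >= 0 && cols >= 0 && m != NULL);
    for (int dollar = 0; dollar < rows; dollar++) {      /* outer spawn(0, rows-1) */
        int outer_id = dollar;  /* save the outer thread ID before nesting */
        for (int dollar = 0; dollar < cols; dollar++) {  /* inner spawn(0, cols-1) */
            /* Here dollar names only the innermost region's thread ID. */
            m[outer_id * cols + dollar] = outer_id * cols + dollar;
        }
    }
}
```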
2.6 The Programmer Workflow for
Writing PRAM-Based Programs
The ICE language saves programmers effort in
translating a PRAM algorithm into an efficient
XMT parallel program. The steps programmers should
take to translate an algorithm into a parallel program are
known as The Programmer Workflow. This methodology links
a PRAM algorithm to the XMT platform toolchain (i.e.,
compiler + hardware implementation). A discussion of the effort
involved in this translation is crucial for understanding an
important benefit of ICE.
The XMT platform provides programmers with a workflow
for deriving efficient programs from PRAM algorithms. It also
allows them to reason about these algorithms' correctness
and execution time [14]. This programmer workflow provides
a "recipe" for converting PRAM algorithms
to high-performance XMTC programs. It also provides steps
programmers can take to incrementally optimize the
programs' performance without having to redesign the original
algorithm, which allows them to avoid many of the parallel
programming pitfalls.

Figure 2.6: The program flow for translating a PRAM algorithm
into an XMT algorithm.
The steps involved in the programmer workflow for
developing an XMT implementation of a PRAM algorithm are as
follows.
• Starting from a specific problem, a design stage for
an algorithm produces a sequence of steps, each of which
has a set of concurrent operations that should execute
in parallel, forming a high-level work-depth (WD) description
of the parallel algorithm of interest.
• This draft is further refined and optimized for work and
depth to form a sequence of rounds, each consisting
of concurrent operations. It also specifies the steps
required to advance in a given step. These rounds form
a formal work-depth description of the algorithm.
• The programmer writes the program by translating the
WD description into an SPMD program using the XMTC
programming language, and fine-tunes the program for
best performance.
• The program is then compiled into an XMT executable
binary optimized for the best performance.
Figure 2.6 gives an overview of the steps
involved in the programmer workflow. The figure starts from
a PRAM algorithm, which programmers use to write a
high-level work-depth description of the algorithm. Programmers
then translate the work-depth description into its equivalent
XMTC code, which is further fine-tuned for best
performance. This XMTC code is then compiled using the XMTC
compiler and executed on the XMT hardware.
2.7 Advantages of the XMT Platform
2.7.1 The Performance of the XMT Platform
ICE uses the XMT platform because of XMT's ability to
exploit parallelism in irregular programs, which is a result of
designing XMT with PRAM algorithms in mind. XMT retains
good performance for serial and regular parallel programs
as well. Below is a list of experiments that show XMT's
performance compared to commodity architectures. All
XMT speedups listed below were achieved over the best
serial implementation on a state-of-the-art vendor
platform; hence they represent real improvements in processing
time.
• Graph Connectivity. A 1024-core XMT processor achieves
a speedup of 99.8X, while the NVIDIA GTX480 had a
speedup of 27.1X for graph connectivity [15].
• Graph Biconnectivity. A 1024-core XMT achieves speedups
of up to 33X, while a GPU/CPU hybrid achieved only a 4X
speedup [15].
• Graph Triconnectivity. A 1024-core XMT achieved a speedup of
129X against serial code on a Core i7 920 processor [16].
• Finding maximum flow. The best speedup for this
algorithm on a hybrid NVIDIA Fermi GPU/CPU was 2.5X [17].
In contrast, a speedup of 108X was attained on a
1000-core XMT that uses the same silicon area as the
GPU [18].
• Burrows-Wheeler transform (BZIP2). XMT reaches up
to 13X/25X speedup for de/compression [19]. In
comparison, there was a slowdown of 2.8X for compression
and a speedup of 1.1X for decompression on GPU.
• 2-D FFT. XMT reached a 20.4X speedup, whereas a
16-core AMD Opteron achieved less than 4X [20].
• Gate-level Simulation Benchmark Suite. XMT obtained
100X speedups versus serial [21].
2.7.2 The XMT Teachability and Ease of Use
Programmer productivity and ease of programming are
central focal points for the XMT platform. Being easy to
learn is a necessary condition for a platform to be easy
to program; thus, demonstrating that XMT is easy
to teach and learn has been one of the central objectives
of the project.
• Since 2007, more than 300 high-school students have
been taught to program the XMT platform, including
two exemplary cases: Montgomery Blair High School,
Silver Spring MD, and Thomas Jefferson High School
for Science and Technology, Alexandria VA. At Thomas
Jefferson, Torbert [22] has incorporated XMT into their
classes, and advocates using it in the education of
Computer Science. Torbert reports that, when compared
to MPI, the XMT platform spurred students' creativity to
invent their own personal programs to solve a variety of
problems instead of chasing the same canonical solution,
as was the case with MPI.
• In a study supported by the DARPA HPCS program, it was
shown that the development time of XMTC is about half
that of MPI, under circumstances favoring MPI [23].
• A joint experiment between the University of Illinois
and the University of Maryland compared programming
in both OpenMP and XMTC [24]. This study included 42
students who were asked to achieve speedups for BFS
using OpenMP on an 8-core SMP, and using the XMT
platform. The students could not achieve a speedup
higher than 1 when they used OpenMP. However, they
were able to obtain 7X to 25X speedups on the 64-TCU
XMT FPGA. In addition, students in the PRAM/XMT part of
the joint course were able to write more advanced algorithms
than those in the OpenMP part.
Chapter 3: The ICE Programming Language
3.1 Overview
This thesis presents methods to write synchronous
parallel code based on PRAM algorithms 'as-is' and obtain
comparable performance when executed. This chapter
goes over the details of the ICE programming language and
its different features. It discusses the ICE
syntax and the lock-step model that ICE follows, as well as
the advantages that ICE gains by following that model.
3.2 The ICE language
The Immediate Concurrent Execution (ICE) language, a
modest extension of the C programming language, is a
parallel programming language intended for ease of
programming and for the development of high-performance
parallel programs based on the PRAM algorithmic model. ICE
is based on the WD model of PRAM, which describes a
sequence of time steps, each containing multiple concurrent
operations. ICE enables that by following the lock-step
execution model. ICE requires the programmer to simply
express all available parallelism and nothing else. ICE is
unique in being the only language that can take the PRAM
lock-step model and translate it to a threaded program.
A lock-step parallel programming model is one where,
for each statement in a parallel loop, all its iterations
appear to the programmer to execute in exactly the same cycle.
This appearance is enforced by the ICE compiler
(discussed in Chapter 4), usually without enforcing same-cycle
execution in hardware. Figure 2.2 shows the lock-step nature of
ICE. Lock-step execution contrasts with the threaded
execution followed by virtually all parallel programming languages,
where all the functional units execute independent threads
of control that proceed at their own unpredictable pace,
and where synchronization with other threads only happens
if the program explicitly requests it. Lock-step execution
is common in hardware, for example in VLIWs or GPUs.
However, we are not aware of a higher-level programming
language meant for general-purpose, irregular programming
that uses lock-step execution.
The advantages of lock-step programming are that many
explicit synchronizations become unnecessary, and in-place
concurrent updates to aggregate data structures become
possible without introducing non-determinism (these will be
explained in section 3.5).
ICE provides a shortcut to programmers following the
programmer workflow discussed in section 2.6, which starts from an
ICE abstraction of an algorithm and advances to a threaded
implementation. ICE allows programmers to immediately write
programs using the ICE abstraction of an algorithm 'as-is'.
The intention behind designing ICE is not to advocate a
lock-step model of hardware execution. Indeed, lock-step
parallel hardware has mostly been explored in the past in
the context of SIMD machines, which have not met wide
success. SIMD machines can only exploit vector and
dense-matrix parallel codes well, but so can MIMD machines. This
work is primarily motivated by parallelism in irregular
programs in graph-traversal and divide-and-conquer algorithms,
which do not parallelize well on any existing parallel
machine (either traditional MIMD multi-cores or SIMD). Hence
this work does not further consider translation of ICE to either
of those machine types; instead it focuses on translating ICE
to XMT code.
ICE is not meant to replace any of the current
programming models either; it is meant to work alongside them
instead. ICE is generally better suited than the threaded model
for applications based on PRAM algorithms. In
contrast, the threaded model is likely to be better suited to
task-parallel applications. ICE is orthogonal to the threaded
model, and can be included alongside a language like
XMTC, thus enabling programmers to choose and mix
lock-step (ICE) and threaded (XMTC) models, whichever is
more natural for each parallel section. Hence the same
program can have some parallel loops implemented in XMTC,
and others in ICE.
3.3 The Syntax of the ICE Language
The ICE language enables programmers to write parallel
programs using the lock-step execution model. The ICE language
extends the C programming language by introducing a new
keyword, pardo, that allows programmers to specify where
parallelism lies in the algorithm they intend to implement.
The keyword is inspired by the construct pardo, short for PARallel
DO, used in many PRAM texts [25, 26, 27]. A pardo creates
a number of concurrent virtual lock-step parallel contexts.
Figure 3.1 provides the ICE syntax and the generic
structure of the new pardo keyword. The pardo keyword
requires programmers to specify four parameters: a parallel
context control variable used to refer to a parallel context
ID; the ID of the first parallel context, low; the ID of the
last parallel context, high; and the stride, step. A pardo
creates (high − low)/step + 1 parallel contexts that execute the
instructions specified inside the pardo region according to the
lock-step model. Concurrent writes performed by multiple
parallel contexts to the same memory location are handled
using arbitrary CRCW. ICE follows the convention of having
parallel-context-local variables declared inside the parallel
region, while shared variables are declared in serial regions.
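
The context count given above can be checked with a few lines of C.
This sketch assumes that the division in (high − low)/step + 1 is an
integer division and that context IDs advance from low in increments
of step; both are assumptions made for the illustration, and the
function names are hypothetical.

```c
#include <assert.h>

/* Number of parallel contexts created by pardo (pid = low; high; step),
 * computed as (high - low)/step + 1 with integer division. */
int pardo_num_contexts(int low, int high, int step) {
    assert(step > 0 && high >= low);
    return (high - low) / step + 1;
}

/* ID of the k-th context, k = 0 .. N-1: low, low + step, low + 2*step, ... */
int pardo_context_id(int low, int step, int k) {
    assert(k >= 0);
    return low + k * step;
}
```

For instance, under these assumptions pardo (i = 0; 9; 3) would
create four contexts with IDs 0, 3, 6 and 9, so the last ID is
low + step × (N − 1).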
The ICE language allows programmers to specify nested
parallelism by using the pardo keyword from within a pardo
region. Each parallel context created by the outer pardo
creates multiple parallel contexts as specified by the inner
pardo. Lock-step execution extends to the newly created
parallel contexts. As such, they advance synchronously with
all parallel contexts of the same level created by any
'parent' parallel contexts.

    …
    serial code
    shared variables declaration
    …
    pardo (pid = low; high; step) {
        …
        private variables declaration
        …
    }

Figure 3.1: ICE language syntax.
Variable locality for nested ICE follows the same
principle that was discussed earlier; namely, variables declared
inside a pardo region are private to each individual context,
and variables declared outside a pardo region are shared
between all the parallel contexts created by that pardo.
The implication is that variables that are private to
a context are shared between all parallel contexts created
by that context, but not across all contexts of the same level
of nesting.
Table 3.1 provides a comparison between the syntax of
the lock-stepped pardo and the threaded spawn. ICE and
XMTC follow the same convention of how local and shared
variables are declared.

                    pardo (ICE)                  spawn (XMTC)
Num. N              (UB − LB)/ST + 1             UB − LB + 1
First to last IDs   LB to LB + ST×(N − 1)        LB to UB
Stride              ST                           1
…                   … all parallel contexts
                    before …
Synchronization     After every instruction      join or (ps)

Table 3.1
3.4 Example Showcasing the ICE
Language
To see the features and advantages of the ICE
programming language, consider the example in figure 3.2.
Figure 3.2(a) shows the problem specification for pointer
jumping, a well-known and useful task in tree and graph
algorithms. The input shown consists of array S containing the
specification of a forest of trees, and array W containing an
initial weight at each node. The output "flattens" the tree
by directly pointing each node to the root of its tree, and
computes the sum of weights (or distance) from the node to
the root in the input tree. The example shows a specific
assignment of weights which will compute the distance to
the root in the output; however, any input assignment of
weights can be chosen. The pointer jumping algorithm is
widely used, for example in the disjoint-set (union-find) data
structure for efficiently maintaining sets and supporting set
union and find operations. In the case that the input trees
degenerate to linked lists, computing the output becomes
the well-known list ranking problem, a key component of
many algorithms, so called because it calculates the
position (or rank) of every element in a linked list. List ranking
also computes a prefix sum operation for any input weights
in W, and stores the result in W. List ranking is also used
to solve many problems on trees and graphs via an Euler
tour technique (Tarjan and Vishkin [28]).
Problem
Given a linked list with n elements, find for every
element its distance from the last element.
Input
• Array S(1...n): S(i) contains the index of the
successor of element i. The successor of the last
element is the element itself.
• Array W(1...n): W(i) contains the weight of element i.
Initially W(i)=0 for the last element in the list and
W(i)=1 for all other elements.
Output
• S(i) is the index of the last element of the list.
• W(i) is the distance of element i from this last element.

(a) Problem specification

void pointer_jump (int S[n], int W[n], int n) {
    pardo (unsigned i = 0; n-1; 1) {
        while (S[i] != S[S[i]]) {
            W[i] = W[i] + W[S[i]];
            S[i] = S[S[i]];
        }
    }
}

(b) ICE program

psBaseReg flag;  // number of threads that require
                 // another loop iteration
void pointer_jump (int S[n], int W[n], int n) {
    int W_tmp[n];
    int S_tmp[n];
    do {
        spawn (0, n-1) {
            if (S[$] != S[S[$]]) {
                W_tmp[$] = W[$] + W[S[$]];
                S_tmp[$] = S[S[$]];
            } else {
                W_tmp[$] = W[$];
                S_tmp[$] = S[$];
            }
        }
        flag = 0;
        spawn (0, n-1) {
            if (S_tmp[$] != S_tmp[S_tmp[$]]) {
                int i = 1;
                ps(i, flag);
                W[$] = W_tmp[$] + W_tmp[S_tmp[$]];
                S[$] = S_tmp[S_tmp[$]];
            } else {
                W[$] = W_tmp[$];
                S[$] = S_tmp[$];
            }
        }
    } while (flag != 0);
}

(c) XMTC program

void pointer_jump (int S[n], int W[n], int n) {
    int W_tmp[n];
    int S_tmp[n];
    int *W_rd = W, *W_wt = W_tmp;
    int *S_rd = S, *S_wt = S_tmp;
    int *tmp_ptr;
    int crs_size = n/P + ((n%P) > 0);
    int flag = 1;
    while (flag != 0) {
        flag = 0;
        #pragma omp parallel num_threads(P)
        {
            #pragma omp for reduction(+:flag) schedule(static, crs_size)
            for (int i = 0; i < n; i++) {
                // read from the current source arrays (S_rd, W_rd)
                if (S_rd[i] != S_rd[S_rd[i]]) {
                    int x = 1;
                    flag += x;
                    W_wt[i] = W_rd[i] + W_rd[S_rd[i]];
                    S_wt[i] = S_rd[S_rd[i]];
                } else {
                    W_wt[i] = W_rd[i];
                    S_wt[i] = S_rd[i];
                }
            }
        }
        // swap the roles of source and destination arrays
        tmp_ptr = W_rd;  W_rd = W_wt;  W_wt = tmp_ptr;
        tmp_ptr = S_rd;  S_rd = S_wt;  S_wt = tmp_ptr;
    }
}

(d) OpenMP program

Figure 3.2: Pointer jumping example showing the simplicity of ICE
code. (a) provides a description of the pointer jumping problem.
This problem is then solved using the ICE programming language
(b), the XMTC programming language (c), and OpenMP (d).
Figure 3.2(b) shows the ICE code to solve the pointer
jumping problem defined in figure 3.2(a). The figure shows
how the ICE programming language is loosely based on
the PRAM algorithmic model. It has serial regions, and parallel
regions inside pardo constructs. As can be seen in the figure,
the ICE code for pointer jumping is indeed very simple. The
in-place updates of S and W are possible because of the
lock-step nature of execution, where, for example, the right-hand
side (RHS) of the first statement in the loop (W(i) + W(S(i)))
is read and computed on all the parallel units before the
left-hand side (LHS) writes the new value of W(i).
Although the code in figure 3.2 uses arrays to implement
trees, pointer jumping can be implemented in ICE with
structures and pointers just as easily. The code will be conceptually
similar.
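
The read-before-write behavior of the lock-step model can be emulated
sequentially. The C sketch below is only an illustration of the
semantics, not the translation performed by the ICE compiler: it
assumes 0-based arrays in which a root points to itself, evaluates
the loop condition for all contexts on the same snapshot, and
separates the reads and writes of each statement with hypothetical
helper buffers (active, rhs).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Sequential emulation of lock-step pointer jumping on s[] (successor
 * index; roots point to themselves) and w[] (weights).
 * Returns the number of lock-step rounds, or -1 if allocation fails. */
int pointer_jump_lockstep(int *s, int *w, int n) {
    assert(s != NULL && w != NULL && n >= 0);
    bool *active = malloc((size_t)n * sizeof *active);
    int *rhs = malloc((size_t)n * sizeof *rhs);
    if (n > 0 && (active == NULL || rhs == NULL)) {
        free(active);
        free(rhs);
        return -1;
    }
    int rounds = 0;
    for (;;) {
        bool any = false;
        /* Loop condition: every context tests the same snapshot of s. */
        for (int i = 0; i < n; i++) {
            active[i] = (s[i] != s[s[i]]);
            any = any || active[i];
        }
        if (!any) break;
        /* Statement 1: all reads of W[i] + W[S[i]] precede all writes. */
        for (int i = 0; i < n; i++) if (active[i]) rhs[i] = w[i] + w[s[i]];
        for (int i = 0; i < n; i++) if (active[i]) w[i] = rhs[i];
        /* Statement 2: all reads of S[S[i]] precede all writes. */
        for (int i = 0; i < n; i++) if (active[i]) rhs[i] = s[s[i]];
        for (int i = 0; i < n; i++) if (active[i]) s[i] = rhs[i];
        rounds++;
    }
    free(active);
    free(rhs);
    return rounds;
}
```

On a five-element list 0 → 1 → 2 → 3 → 4 with weights 1, 1, 1, 1, 0,
the emulation finishes in two rounds, leaving every element pointing
to element 4 and W holding the distances 4, 3, 2, 1, 0.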
Figure 3.2(c) shows the XMTC code to solve the same
pointer jumping problem. XMTC, as discussed in section 2.5, has
parallel constructs such as spawn(x, y), which starts y − x + 1
parallel threads numbered x to y. Since the parallel threads
are independent, they proceed at their own pace, synchronizing
only at the implicit join at the end of the spawn block, or at
the prefix-sum (ps) operation shown.
As can be seen in the figure, the XMTC code is longer
and much more complicated than the ICE code. The underlying
cause is the unpredictable pace of parallel threads, which
prohibits in-place updates, such as those of arrays S and
W. Instead the threaded code must use the temporaries S_tmp
and W_tmp. In the first part of the code, the program reads
from the original arrays and writes to the temporary arrays;
the roles are reversed in the second part. The first and second
parts alternate until the computation is completed. Additional
synchronization is needed to count the number of incomplete
threads remaining in the flag variable; counting is done using
the prefix-sum (ps) construct described in section 2.5.¹
¹ The ps operation could have been avoided by multiple writes of true
to a boolean variable called threads-remaining in the loop, but that would
create a hot spot in memory. The XMT ps operation uses registers, avoiding
the hot spot.

Figure 3.2(d) shows OpenMP code to solve the same
pointer jumping problem. The OpenMP code in figure 3.2(d)
executes essentially like the XMTC version, with
two main differences: (1) the ps operation in the XMTC
version is replaced by a reduction operation in OpenMP; (2)
unlike the XMTC version, the loop was not unrolled in the
OpenMP version. Instead, two sets of pointers were used to
alternate the source and destination of copying between the
original and temporary S and W arrays. It is important to
understand that the implementations in figures 3.2(c) and (d) are fully
interchangeable between XMTC and OpenMP. Namely, the
implementations will work very similarly regardless of the platform
used. However, when implemented on a similar platform, the
implementation in figure 3.2(c) will have a slight performance
advantage over the implementation in figure 3.2(d), while the
latter is slightly shorter and easier to write.
The ICE compiler translates lock-step ICE programs into
multi-threaded parallel software. Methods for this translation
will be discussed in section 4.3. This compiler translates ICE
programs to XMTC, thus leveraging XMT's ability to execute
irregular programs efficiently.
The code in figure 3.2(c) gives good speedup on XMT
despite the code being very fine grained (i.e., having short
parallel sections). Pointer jumping on XMT gives a speedup of
50X over the best serial algorithm on an Intel Core i7-920.
Despite the many more cores of XMT, the two are comparable
in area since XMT cores are extremely lightweight. This speedup
is obtained even though the parallel version performs O(n log n)
work while the serial algorithm is O(n), so the parallel code is
not work efficient compared to the serial algorithm. Parallel code on
traditional multi-cores is unlikely to get any speedup at all
because of the high cost of barriers, poor load balancing in
coarsened versions (where multiple parallel sections are combined
into longer-running parallel sections to reduce barrier costs),
and the work-inefficiency of the parallel algorithm.
The example in figure 3.2 shows many of the strengths of the
ICE programming model, which will be discussed in section 3.5.
3.5 The ICE language Advantages
The ICE language was designed to be easy to use, and
to leverage the scientific wealth in PRAM algorithms. The
ICE language follows the lock-step model, which gives it
many advantages over the threaded model
followed by many other parallel programming languages. These
advantages are:
• Easier translation from PRAM algorithms. The ICE
abstraction has been used as the first stage in the design
and analysis of PRAM algorithms. PRAM algorithms readily
fit into the ICE programming model, whereas extra effort is
needed to fit them into a threaded model. This is
illustrated by the great difference between figures 3.2(b) and
(c): manually translating the first to the second can be
a significant effort. Thus ICE makes parallel programming
easier, fulfilling one of our primary goals.
• No need to think about synchronization. Thinking about
synchronization is a major contributor to making parallel
programming difficult. ICE greatly reduces this problem
by assuming a maximum degree of synchronization: there
is an implied barrier between every two statements in a
parallel region. Thus in ICE, synchronization comes "for free"
in terms of programmer effort. This is unlike threaded
languages, where the programmer must decide when
synchronization is needed and when it is not, and explicitly
request it when needed. This is illustrated by figure 3.2(c),
where the programmer has to decide the location of
synchronization at the end of spawn blocks, and the locations
of any needed ps operations. In ICE the compiler
decides when synchronization is needed when translating to
a threaded model, relieving the programmer of that burden.
• No need to introduce intermediate variables. The lack
of assumed synchronization in threaded programming
models often results in the need for intermediate variables to
avoid race conditions. For example, the in-place parallel
update of a data structure often requires copying to an
intermediate data structure to avoid race conditions. The
code in figure 3.2(c) has S_tmp, W_tmp, and flag as
intermediate variables. This duplicating and copying of data
structures must be managed by the programmer, increasing
his or her burden. In contrast, the ICE programming model
makes intermediate variables unnecessary in most cases
because of its assumed lock-step synchronization (with a
convention such as 'all reads complete before writes'),
making parallel programming easier. Of course, such
intermediate variables may be re-introduced by the
compiler when it translates ICE to a threaded language. In
effect, the ICE compiler takes over the management of
intermediate variables from the programmer.
• Avoids unintended race conditions. Threaded programming
models allow the programmer to express unintended race
conditions. To eliminate them, the programmer must be
proactive. In particular, the programmer must know about and
rely on consistency models.
The threaded code in figure 3.3 illustrates race conditions
and consistency models. It shows a classic example from the
Hennessy and Patterson textbook (HP:AQA, 4th ed., p. 243),
where two parallel processes P1 and P2 running on different
cores are shown side by side. Assume both cores have
locations A and B cached with initial value 0. The question
here is: is it possible for the if statements in both threads
to evaluate to true? At first glance it seems impossible.
Hypothetically, if writes are seen on remote processors
immediately, then it is not possible for both to evaluate to
true. However, writes are often delayed on real machines;
hence real machines use consistency models to define allowed
behaviors. For example, on machines with sequential
consistency both if statements cannot evaluate to true.
However, some computers implement weaker consistency models
for efficiency, where the anomalous behavior is allowed.
Unfortunately, threaded programming models usually expose the
consistency model to the programmer, who must understand it
well to avoid introducing unintended race conditions.
P1:   A = 0;              P2:   B = 0;
      …                         …
      A = 1;                    B = 1;
L1:   if (B == 0) …       L1:   if (A == 0) …
Figure 3.3: Threaded code with race condition
In contrast, the lock-step parallel behavior of the ICE
programming model makes it impossible for the programmer to
express such unintended race conditions. With ICE, a
programmer never needs to consider race conditions or
consistency models. Instead, the compiler manages race
conditions when translating ICE code to threaded code in a
platform-specific way, thus relieving the programmer of this
burden.
• No need to think about scheduling or coarsening. Although
declarative (pre-scheduling) threaded models such as XMTC have
been proposed, several threaded models in common use, such as
MPI and pthreads, are post-scheduling, thus requiring the
programmer to manually schedule the available parallelism into
N threads, where N is the number of hardware contexts
available on the target hardware². The programmer is also
responsible for coarsening if the available parallelism
exceeds N. In contrast, ICE is a declarative programming model
where the programmer simply expresses all available
parallelism without regard to the number of hardware contexts,
or the scheduling of the code to those contexts. Scheduling
and coarsening are performed automatically by the compiler
and/or run-time system. This significantly reduces the burden
on the programmer, since he or she no longer needs to do
scheduling or coarsening, and it also makes the code more
portable across XMT computers with different numbers of
hardware contexts. This feature is already available in the
XMTC compiler, and the ICE compiler takes advantage of that.
² The number of hardware contexts is the number of threads that
the hardware can actually run at any one instant. This equals
the number of cores × the hyper-threading factor for
multi-cores, and equals the number of TCUs on XMT.
• No need to think about data decomposition or locality. The
uniform-memory access (UMA) design of XMT relies on a
high-bandwidth, low-latency interconnection network between
TCUs and shared memory. Thus all functional units see the same
latency to all regions of memory, except for registers local
to their TCUs, and prefetch buffers. This led to a situation
(supported by our experimental results) where the XMTC
programmer does not need to consider data decomposition or
locality as a first-order consideration. The XMTC programmer's
workflow [2, 29] instructs the programmer to first produce a
handwritten ICE-like algorithm with no data decomposition or
locality; the programmer is then expected to develop XMTC code
in which every spawn command comprises its own scoping (cf.
Figure 2.4(a)); this scoping allows designation of local
variables that, in turn, the XMTC compiler translates into
local registers. For ICE, even this designation of local
variables will be left to the (ICE) compiler. Since not
needing decomposition is inherited from XMT, both ICE and XMTC
have this advantage, but not threaded programming models for
NUMA machines.
Given the advantages above, ICE represents a significant leap
in the ease of programming compared to threaded programming
models. In addition, execution on XMT will deliver unmatched
speedups for irregular programs written in ICE.
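The consistency question raised by figure 3.3 can be checked by
brute force. The following is a minimal sketch, not part of the
ICE toolchain: it models only the last two operations of each
process (the write of 1 and the read inside the if), and all
function names are hypothetical. It enumerates every
interleaving permitted by sequential consistency and then
models one weaker machine in which both writes are still
buffered when the reads execute.

```python
# Hypothetical model of figure 3.3; the names below are illustrative only.
P1 = [("write", "A", 1), ("read", "B")]   # A = 1; if (B == 0) ...
P2 = [("write", "B", 1), ("read", "A")]   # B = 1; if (A == 0) ...


def interleavings(p, q):
    """Yield every merge of p and q that preserves each program's order."""
    if not p or not q:
        yield list(p) + list(q)
        return
    for rest in interleavings(p[1:], q):
        yield [p[0]] + rest
    for rest in interleavings(p, q[1:]):
        yield [q[0]] + rest


def run_sequentially_consistent(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"A": 0, "B": 0}
    seen = {}
    for op in schedule:
        if op[0] == "write":
            memory[op[1]] = op[2]
        else:
            seen[op[1]] = memory[op[1]]
    return seen["B"], seen["A"]   # (value P1 read, value P2 read)


def run_with_delayed_writes():
    """Both writes sit in local buffers until after both reads."""
    memory = {"A": 0, "B": 0}
    p1_reads_b = memory["B"]
    p2_reads_a = memory["A"]
    memory["A"], memory["B"] = 1, 1   # buffered writes drain afterwards
    return p1_reads_b, p2_reads_a


sc_outcomes = {run_sequentially_consistent(s) for s in interleavings(P1, P2)}
print(sorted(sc_outcomes))         # [(0, 1), (1, 0), (1, 1)]
print(run_with_delayed_writes())   # (0, 0)
```

Under sequential consistency the outcome (0, 0), in which both
if statements evaluate to true, never appears; once writes may
be delayed it does. This is the kind of reasoning that ICE
removes from the programmer's side.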
Chapter 4: The ICE Language Compiler
4.1 Overview
In this thesis, an ICE compiler was built to translate ICE
programs to threaded XMTC programs. The output XMTC code is
compiled, using the existing, relatively mature and
well-studied XMTC compiler, into an executable XMT binary.
This chapter covers the challenges in producing a correct
translation, the difficulties in optimizing the translation,
and the complete structure of the ICE compiler.
4.2 Translating ICE to XMTC
This section focuses on the main challenges and problems that
may arise while building a new compiler that translates from
ICE, a language following the lock-step model, into XMTC, a
threaded language.
4.2.1 Splitting a pardo Region into Multiple spawn Blocks
To translate ICE programs to XMTC programs, the pardo region
is split into multiple spawn regions. Replacing every pardo
with spawn will not work, since the former requires lock-step
execution but the latter (regular multi-threading) does not
ensure it. We saw this in figure 3.2. Splitting occurs at
points where a barrier is required. In XMT there is no way to
implement barriers except through the join instruction. A join
is introduced by terminating a spawn region and starting a new
one, effectively splitting the pardo. This solution ensures
that there will be no violation of the data dependencies (true
or anti-dependence) between the memory accesses within the
pardo region. The downside of this method is that the
parallelism granularity is reduced, although its degree is
maintained.
To ensure correctness, the order of reads and writes must be
maintained. Thus when translating ICE to XMTC, a pardo must be
split into multiple spawn blocks wherever the pardo contains
both a read and a write to a data object accessed by at least
two different parallel contexts. This ensures that a memory
access is completed by all parallel contexts before any
context starts the next memory access. The splitting is
performed by introducing a barrier between the read and the
write. Two cases are possible: anti-dependence, where a write
to a data object is done after a read (e.g. W and S in figure
3.2(b)), and true dependence, where a read is performed after
a write. Both cases require splitting the pardo region into
two successive spawn regions. However, in the anti-dependence
case, we also need to introduce a (compiler-inserted)
temporary, to which the writes are performed in the first
spawn region, and from which they are copied back in the
second.
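The anti-dependence case can be sketched in a few lines. The
sketch below is an illustration under stated assumptions, not
the compiler's output: it models a hypothetical pardo whose body
is a[i] = a[i+1], models a spawn block as a loop that visits
the contexts in an arbitrary order, and treats the return from
spawn() as the join. The helper names are invented for this
example.

```python
from itertools import permutations


def spawn(order, body):
    """Model one spawn block: run body(i) for each context in the given
    order; returning from this function plays the role of the join."""
    for i in order:
        body(i)


def naive_single_spawn(a, order):
    """pardo replaced by one spawn: the result depends on thread order."""
    a = list(a)
    spawn(order, lambda i: a.__setitem__(i, a[i + 1]))
    return a


def split_with_temporary(a, order):
    """pardo split into two spawns around a compiler-inserted temporary."""
    a = list(a)
    temp = [None] * (len(a) - 1)

    def first_block(i):
        temp[i] = a[i + 1]      # read a[i+1], write to the temporary instead

    def second_block(i):
        a[i] = temp[i]          # after the join: copy back

    spawn(order, first_block)
    spawn(order, second_block)
    return a


data = [10, 20, 30, 40, 50]
expected = [20, 30, 40, 50, 50]     # lock-step result of a[i] = a[i+1]
orders = list(permutations(range(4)))
assert all(split_with_temporary(data, o) == expected for o in orders)
print(naive_single_spawn(data, (3, 2, 1, 0)))   # [50, 50, 50, 50, 50]
```

The split version gives the lock-step result for every one of
the 24 thread orders, whereas the single-spawn version is only
correct for some of them.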
4.2.2 Communication of Intermediate Information Among spawn
Blocks
Splitting pardo regions is likely to introduce many challenges
for maintaining correct operation of the translated ICE code.
Since different spawn blocks cannot directly pass intermediate
values or branch decisions to one another, splitting can
prevent correct progression of the data and control flow of
the program.
4.2.2.1 Handling Data Flow Across pardo Region Splits
Splitting pardo regions is likely to disrupt the data flow
within a pardo block. Processors perform computations by
reading source data from memory, which is then processed to
produce the final results that are stored into memory again.
During computation, the intermediate data is operated on and
kept in a processor's registers. Adding barriers between the
loads of the sources and the stores of the results prevents
correct data flow, because a parallel context's intermediate
data cannot be communicated from one spawn block to the next.
This is resolved by 'demoting' the registers and recording
their contents in memory locations. Hence, for each
intermediate value that is still used in later spawn blocks,
the ICE compiler creates an array of n elements, where n is
the number of parallel contexts. Each element has the same
data type and size as the intermediate value being recorded,
and holds that value for one context. Each intermediate value
is then saved to memory before the split, and retrieved after
the split to be used in subsequent spawn blocks.
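Register demotion can be illustrated with a small simulation.
This is a sketch under assumptions: the pardo body modeled here
(t = c[i] * 2; A[i+1] = t; B[i] = A[i] + t) is a made-up
example, and t_mem is a hypothetical name for the
compiler-created array. A barrier is needed between the write
of A[i+1] and the read of A[i], so the intermediate value t
must survive the split.

```python
def spawn(n, body):
    """Model one spawn block; contexts run in reverse just to show that
    no particular order is relied upon."""
    for i in reversed(range(n)):
        body(i)


def translated(c, A, B):
    n = len(c)
    t_mem = [0] * n            # compiler-created array, one slot per context

    def block1(i):
        t = c[i] * 2           # intermediate value held in a register
        A[i + 1] = t
        t_mem[i] = t           # demote: save before the split

    def block2(i):
        t = t_mem[i]           # retrieve after the split
        B[i] = A[i] + t

    spawn(n, block1)
    spawn(n, block2)
    return B


print(translated([1, 2, 3], [0, 0, 0, 0], [0, 0, 0]))   # [2, 6, 10]
```

Each demoted value costs one store and at least one load, which
is why later sections try to avoid or aggregate them.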
4.2.2.2 Handling Control Flow Across pardo Region Splits
Splitting pardo regions may cause complications for the
program's control flow. There are two cases when this can
happen:
(1) When a pardo region contains a conditional branch where
one of its directions requires a barrier, as in figure 4.1.
(a) ICE code:
pardo (i = 0; n; 1) {
   if (i < 50) {
      A[i+1] = c[i];
      c[i] = A[i] + 1;
   }
}
(b) XMTC translation:
spawn(0,n) {
   unsigned i = $;
   cond[i] = i < 50;
   if (i < 50)
      A[i+1] = c[i];
}
spawn(0,n) {
   unsigned i = $;
   if (cond[i])
      c[i] = A[i] + 1;
}
Figure 4.1: (a) A pardo with a conditional branch. (b) Its XMTC
translation.
(2) When a pardo region contains a serial loop within which a
barrier is needed. This causes a problem when expressing the
continue and break statements, and the serial loop's back
edge, as in figure 3.2(b).
To maintain correctness, a parallel context must preserve its
intended control flow, which is not easily possible in these
cases since XMT disallows branching between spawn blocks. To
that end, branch decisions are communicated across splits by
recording the branch state for each context into memory, and
retrieving it when needed. Hence, for the first case, when a
branch condition is evaluated as in figure 4.1(b), we record
the result to memory (temporary array cond) and retrieve it in
any later spawn that is on either direction of the conditional
branch. A similar solution is used for the second case: the
serial loop is taken outside the parallel region and is
executed by the MTCU, the loop condition becomes a flag
indicating the existence of threads that have not completed
execution yet, and the original loop termination condition
becomes a normal branch and is treated as in the branch case.
An example of this is the do-while loop in figure 3.2(b) and
(c), where the serial loop is taken outside the spawn block
and the terminating condition is now (flag != 0) instead of
(S(i) == S(S(i))). flag is incremented by threads which still
have work to do, using the ps operation. Furthermore,
temporary arrays are used to record when a context executes a
continue or break. The spawn blocks resulting from this loop
split check whether the context has executed either, and act
accordingly.
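The hoisting of a serial loop can be sketched as follows. This
is an assumption-laden illustration modeled on the description
above, not the code of figure 3.2(c): it assumes the pardo body
repeats S[i] = S[S[i]] until S[i] equals S[S[i]], models each
spawn block as a plain loop over the contexts, and uses an
ordinary increment where XMTC would use the ps operation.

```python
def pointer_jump(S):
    """Hoisted serial loop for a pardo whose body repeats S[i] = S[S[i]]."""
    n = len(S)
    S_temp = list(S)
    while True:                        # serial loop, run by the master
        flag = 0                       # counts contexts that still have work
        for i in range(n):             # first spawn block
            if S[i] != S[S[i]]:        # original loop test, now a plain branch
                S_temp[i] = S[S[i]]
                flag += 1              # stands in for the ps operation
            else:
                S_temp[i] = S[i]
        for i in range(n):             # second spawn block: copy back
            S[i] = S_temp[i]
        if flag == 0:                  # no context had work left
            break
    return S


print(pointer_jump([0, 0, 1, 2, 3]))   # [0, 0, 0, 0, 0]
```

Contexts that have already finished still take part in every
spawn block; they simply fall through the branch, which is the
price of moving the back edge outside the parallel region.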
4.3 Optimizing the Translated ICE Code
Splitting a pardo region into multiple spawn blocks can
degrade performance due to the overhead of creating and
managing more threads, and due to using memory to communicate
information between the different spawn blocks, which
increases the degradation even further. This is exacerbated
when the number of splits is high. Hence it is crucial to
avoid splitting whenever possible, and to mitigate the effects
of the unavoidable splits.
(a) Code in ICE:
pardo (int i = 0; n; 1) {
   A[i+1] = c[i];      // A1
   c[i] = A[i] + 1;    // A2
   B[i-1] = d[i];      // B1
   d[i] = B[i] + i;    // B2
}
(b) Equivalent code in XMTC:
spawn(0,n) {
   unsigned i = $;
   A[i+1] = c[i];      // A1
}
spawn(0,n) {
   unsigned i = $;
   c[i] = A[i] + 1;    // A2
   B[i-1] = d[i];      // B1
}
spawn(0,n) {
   unsigned i = $;
   d[i] = B[i] + i;    // B2
}
(c) Optimized XMTC:
spawn(0,n) {
   unsigned i = $;
   A[i+1] = c[i];      // A1
   B[i-1] = d[i];      // B1
}
spawn(0,n) {
   unsigned i = $;
   c[i] = A[i] + 1;    // A2
   d[i] = B[i] + i;    // B2
}
Figure 4.2: Rescheduling memory accesses. Statement A2 is
dependent on statement A1, and statement B2 is dependent on
statement B1. Statements A are independent from statements B.
Splitting a pardo can be avoided if we can prove that each
memory location is accessed exclusively by one parallel
context. In this case, the splitting becomes unnecessary and a
direct conversion from a pardo to a spawn is possible. One
example of this is when a parallel context with ID 'i' always
reads and writes to A[i]; hence we know that no two contexts
access the same memory location. This means that no race
conditions are possible; hence no splitting is needed.
Optimization for the anti-dependence case within serial loops
in pardo. When the anti-dependence is within a loop in a pardo
region (as in the figure 3.2 example), we can get better
performance by unrolling the pardo once, and then transforming
the two loops that result so that the first loop updates
temporary data structures that are clones of the original data
structures, and the second loop does the opposite. An example
of this is seen in figure 3.2(c). Thereafter the pardo region
is split to place the two loops in different spawn blocks in
the XMTC output. Other elements in the figure, such as the ps
operation and 'flag', will be discussed in detail shortly.
4.3.1 Clustering of Memory Instructions
As an optimization for unavoidable splits, we rearrange memory
accesses within a pardo into clusters to minimize the number
of splits needed. Each cluster represents a spawn block. These
clusters consist of a group of memory accesses that are
independent from one another across the different parallel
contexts. When a pardo region is split into multiple spawn
blocks, often there are more splits than necessary. We see an
example of this in figure 4.2(a), where there is a dependence
between statements A1 and A2, and another between B1 and B2,
but none exists between the A and B statements. Without
optimization we end up with three spawns after the splitting,
as in figure 4.2(b). However, by rearranging and grouping
independent memory accesses as in figure 4.2(c), and only then
doing the splitting, we end up with two spawns. We call this
rescheduling scheme clustering.
The clustering algorithm is a list scheduling algorithm.
Figure 4.3 shows the algorithm used. The compiler builds a
dependence graph in which we capture all data (flow or
'loop-carried'¹) and control dependencies between all the
memory accesses. Then we start building one cluster at a time
by scheduling all 'ready-to-fire' nodes in the current cluster
(lines 28 - 34). A node is 'ready-to-fire' if it passes the
checks in lines 13 - 25, that is, if ConflictsWith returns
false for it. In simple terms, when the compiler considers a
memory access to be added to cluster i, that memory reference
and all the unscheduled data flow and control memory accesses
it depends on must not have a 'loop-carried' dependence with
any member of that cluster. The clustering algorithm has a
complexity of O(nl), where n is the number of instructions
that access memory, and l is the number of resulting clusters.
¹ Even though the execution order within a pardo is different
from that of a loop, the term 'loop-carried dependence' is
used to refer to a dependence that crosses parallel contexts
between different memory accesses in the pardo block.
1   M: set of all memory accesses
2   CL_i = {m ∈ M : m is a member of cluster i}
3   NM = {m ∈ M : m is not a member of any cluster}

    For an m ∈ NM:
4   L_m   = {m_L ∈ M : loop carried dependence between m_L and m}
5   F_m   = {m_F ∈ M : m is data flow dependent on m_F}
6   C_m   = {m_C ∈ M : m is control dependent on value of m_C}
7   LP_m  = {m_LP ∈ M : m exists in a different loop from m_LP}
8   NL_m  = L_m ∩ NM
9   NF_m  = F_m ∩ NM
10  NC_m  = C_m ∩ NM
11  NLP_m = LP_m ∩ NM

12  Define Procedure ConflictsWith(m, CL):
13      if NL_m ≠ ∅ then
14          return true
15      if L_m ∩ CL ≠ ∅ then
16          return true
17      if LP_m ∩ CL ≠ ∅ then
18          return true
19      for m_F ∈ NF_m do
20          if ConflictsWith(m_F, CL) then
21              return true
22      for m_C ∈ NC_m do
23          if ConflictsWith(m_C, CL) then
24              return true
25      return false

26  Define Procedure cluster:
27      Def: integer i = 0
28      While (NM ≠ ∅) do
29          define new cluster CL_i
30          for m ∈ NM do
31              if ConflictsWith(m, CL_i) then
32                  skip m
33              else
34                  Add m to CL_i
35          i = i + 1

Figure 4.3: The clustering algorithm.
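The greedy structure of the clustering procedure can be
sketched in Python. This is a simplified reading of figure 4.3
under stated assumptions rather than the ICE compiler's
implementation: the function and parameter names are
hypothetical, dependence sets are passed in as dictionaries,
accesses are listed in program order, and the loop-carried sets
are assumed to point from an access to the earlier accesses it
depends on.

```python
def cluster(accesses, loop_dep, flow_dep=None, ctrl_dep=None, other_loop=None):
    """Greedy list scheduling of memory accesses into clusters."""
    flow_dep = flow_dep or {}
    ctrl_dep = ctrl_dep or {}
    other_loop = other_loop or {}
    unscheduled = list(accesses)              # NM, kept in program order
    clusters = []

    def conflicts_with(m, cl):
        pending = set(unscheduled)
        if loop_dep.get(m, set()) & pending:      # line 13
            return True
        if loop_dep.get(m, set()) & cl:           # line 15
            return True
        if other_loop.get(m, set()) & cl:         # line 17
            return True
        preds = flow_dep.get(m, set()) | ctrl_dep.get(m, set())
        for dep in preds & pending:               # lines 19 - 24
            if conflicts_with(dep, cl):
                return True
        return False

    while unscheduled:                            # line 28
        cl = set()
        for m in list(unscheduled):
            if not conflicts_with(m, cl):
                cl.add(m)
                unscheduled.remove(m)
        if not cl:
            raise ValueError("no access could be scheduled")
        clusters.append(cl)
    return clusters


# Figure 4.2: A2 depends on A1 and B2 depends on B1 across contexts.
print(cluster(["A1", "A2", "B1", "B2"], {"A2": {"A1"}, "B2": {"B1"}}))
```

For the accesses of figure 4.2 this yields two clusters, one
holding A1 and B1 and one holding A2 and B2, which matches the
two spawn blocks of figure 4.2(c).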
4.3.2 Reducing the Number of Temporaries
The ICE compiler attempts to minimize the amount of
intermediate information communicated across the different
pardo region splits, such as branch directions, loop states,
and intermediate data. This information is stored to and
retrieved from memory, which can cause performance
degradation. So, to achieve maximum performance, avoidable
memory accesses must be eliminated or promoted to local
variables inside the spawn blocks that resulted from the
splitting where possible. Alternatively, communicated
information must be aggregated such that it can be stored and
retrieved in the least number of accesses possible. For that
reason:
(1) We take clustering a step further. Memory accesses
scheduled to an earlier cluster are moved to a later cluster
if that cluster contains members dependent on the memory
accesses and it is legal to do so. For a move to be legal, a
memory access must satisfy all the conditions in lines 13 - 25
of figure 4.3 for the target cluster, and for all clusters in
between the target cluster and the memory access's original
cluster.
(2) We use bit vectors to record the branch directions for
split pardo regions, where each branch decision along the tree
gets a single bit.
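The bit-vector idea in item (2) can be sketched as follows. The
layout is an assumption made for illustration only: one integer
word per parallel context, with bit 0 holding an outer branch
decision and bit 1 a nested one. The helper names are
hypothetical.

```python
def set_branch_bit(bits, position, taken):
    """Return bits with the decision for one branch stored at position."""
    mask = 1 << position
    return (bits | mask) if taken else (bits & ~mask)


def branch_taken(bits, position):
    """Read back the decision stored at position."""
    return ((bits >> position) & 1) == 1


n = 8
branch_bits = [0] * n                    # one word per parallel context

# First spawn block: evaluate and record two nested branch decisions.
for i in range(n):
    outer = i < 5
    inner = outer and i % 2 == 0
    branch_bits[i] = set_branch_bit(branch_bits[i], 0, outer)
    branch_bits[i] = set_branch_bit(branch_bits[i], 1, inner)

# Later spawn block: one load recovers both decisions.
both = [i for i in range(n)
        if branch_taken(branch_bits[i], 0) and branch_taken(branch_bits[i], 1)]
print(both)    # [0, 2, 4]
```

Compared with one temporary array per branch, a single word per
context is written once and read once per spawn block, however
many branch decisions it carries.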
4.3.3 Fixing Control Flow after Clustering
The clustering process results in a reordering of memory
accesses, which can potentially distribute the instructions of
a basic block across two or more clusters. This causes two
major problems:
(1) It complicates and disorganizes the control flow of a
pardo region. Instructions that have the same parent basic
block can be scattered across multiple (potentially
nonconsecutive) spawn blocks, and will likely be preceded or
followed by other instructions that belong to other basic
blocks. More on this in subsection 4.4.3.
(2) A bigger problem is that it prevents the transformation of
a serial loop within a pardo region, discussed in section 4.3
above, in which a split serial loop within a pardo block is
replaced by a serial loop outside the resulting spawn blocks.
This is because, after clustering, the instructions belonging
to that serial loop are likely to get mixed with instructions
from other basic blocks that are not part of the serial loop.
We solve this problem by creating an empty replica of the
Control Flow Graph (CFG) of the pardo region in all resultant
spawn blocks. As such, every basic block inside the pardo will
have an empty copy of it inside every resulting spawn block.
Then a copy of the branch terminating the original basic block
is placed in each of the replicated basic blocks. This allows
us to maintain the correctness of the control flow more
easily, and allows a direct and uncomplicated placement of the
memory accesses in their respective spawn blocks. Basically, a
memory access is simply moved from the original parent basic
block inside the pardo block to the parent block's replica
inside the spawn block where it belongs. Furthermore, we can
still use memory to communicate control direction as discussed
in section 4.2 above; however, it now must be performed in
every spawn block.
There are two exceptions where a basic block is not
replicated:
(1) If the basic block is a target of a conditional branch
whose condition cannot be calculated at a specific spawn block
yet, because it depends on one or more memory accesses that
have been placed at a later spawn block. While the condition
is not ready, the conditional branch is replaced with a direct
branch to the first common immediate post-dominator basic
block of the conditional branch's targets.
(2) If the basic block belongs to a serial loop inside a pardo
block. As discussed in section 4.2, we achieve the back edge
of the loop by creating a serial loop outside the spawn blocks
and replacing the loop with branches inside of it. The basic
blocks from the loop therefore cannot exist alongside basic
blocks from outside it, since that would mean that these other
basic blocks execute every time the loop is executed. Instead,
during clustering we make sure that a cluster is not shared
between multiple loops (lines 17 - 18 of figure 4.3). As such,
a split serial loop will be clustered into a set of
consecutive spawn blocks.
4.4 The ICE Compiler Structure
The ICE compiler uses a modified Clang frontend and the LLVM
compiler infrastructure to perform source-to-source
translation of an ICE program into an XMTC program. Thereafter
the XMTC code is compiled using the existing gcc-based XMTC
compiler [12]. Clang was modified by adding the pardo keyword,
and implementing the parsing of the pardo and the relevant IR
code generation. Although multiple new LLVM passes were
implemented to accomplish the various steps required to
convert the lock-step semantics into threaded code, native
LLVM passes were used wherever possible.
4.4.1 Preliminary Code Optimization
The LLVM compiler stack is designed for serial,
single-threaded code executed by a single processor, making it
incompatible with lock-stepped parallel code. Since the
available compiler transformations do not take into account
many of the properties of parallel code (e.g. differentiating
between shared vs. local variables or serial vs. parallel
contexts), some steps were required to maintain the
correctness of the ICE code when using native LLVM passes. For
example, the beginning and end of a pardo block are marked
when generating IR from source. Also, each parallel section is
outlined into its own function, giving it a different context
from its surrounding code. Furthermore, only the following
native LLVM transformations, which are guaranteed not to
modify the memory ordering, were used (listed in order of
usage):
(1) Control Flow Graph Simplification (CFGSimplify) pass,
which removes all empty and extraneous basic blocks. This
helps make many of the passes the compiler uses more efficient
in reasoning about the control and data flow of the program².
(2) Memory to register promotion (mem2reg) pass, which
transforms the code into SSA (Static Single Assignment) form,
making subsequent optimizations much easier.
(3) Instruction combine (InstCombine) pass, which combines
instructions into simpler forms whenever possible. This helps
remove extra instructions, thus making the code more
efficient. It also helps reduce the amount of information
communicated across pardo splits.
(4) Global Value Numbering (GVN) pass, which finds all
redundant instructions and removes them.
² There are many passes that are harmed by usage of this pass
as well. However, since none of these passes are used in the
ICE compiler, usage of CFGSimplify will only benefit the
compilation process.
At this stage, the clustering and scheduling of pardo block
instructions is performed, which involves multiple stages to
build the dependency graph and perform the clustering
algorithm. It also involves all the steps taken to reduce the
information communicated across splits. This is explored
below.
4.4.2 Building the Dependency Graph
The clustering algorithm relies on dependencies between memory
references to decide which spawn block each memory access will
be assigned to. For that purpose, the ICE compiler builds a
dependency graph that takes into consideration only the
dependencies that may affect the correctness or performance of
the translation from ICE, as discussed in section 4.2. The
compiler relies on the Dependence Analysis (DA) pass in
building that dependence graph.
The Dependence Analysis pass is a native LLVM analysis pass
that uses certain dependency tests to determine whether a
dependence exists between a pair of memory accesses, and if it
exists, attempts to provide as much information as possible
about the dependence. This analysis pass builds an internal
dependency graph based on mutual dependencies between memory
references, and can be queried about the dependency
relationship between two memory accesses, responding with one
of three states: dependent (flow, output, anti), independent,
or confused.
The dependence analysis pass checks for the conditions
necessary to apply the suitable dependence test, and then
applies that test. The pass performs each of the following
dependence tests: the Zero Index Variable (ZIV) test, the
Single Index Variable (SIV) strong and weak tests, the
Restricted Double Index Variable (RDIV) test, and one of the
Multiple Index Variable (MIV) tests. Readers interested in the
cases where each test applies, or the methodology each test
uses to disprove dependence between a memory reference pair,
should consult Appendix A.
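As a rough illustration of how such a test disproves
dependence, the following sketches the idea behind the strong
SIV test. It is a simplified textbook version written for this
discussion, not LLVM's implementation: both subscripts are
assumed to have the form a*i + c with the same nonzero
coefficient a, the index ranges over [low, high], and the
function name and signature are hypothetical.

```python
def strong_siv(a, c1, c2, low, high):
    """Strong SIV sketch for subscripts a*i + c1 and a*i + c2.

    Returns the dependence distance d (the second access touches, d
    iterations later, the location touched by the first), or None
    when independence is proved."""
    if a == 0:
        raise ValueError("strong SIV needs a nonzero coefficient")
    delta = c1 - c2
    if delta % a != 0:
        return None                  # distance is not an integer
    distance = delta // a
    if abs(distance) > high - low:
        return None                  # distance exceeds the iteration range
    return distance


print(strong_siv(1, 1, 0, 0, 99))    # A[i+1] vs A[i]  -> 1
print(strong_siv(2, 1, 0, 0, 99))    # A[2i+1] vs A[2i] -> None
print(strong_siv(1, 0, 0, 0, 99))    # A[i] vs A[i]    -> 0
```

A distance of 0 means the two accesses only meet within the
same context, which corresponds to the case where a pardo needs
no split; a nonzero distance signals a cross-context dependence
that clustering must respect.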
When the dependence pass is queried about a memory reference
pair, it first determines which dependence tests are suitable
to prove independence between the reference pair, and then
applies all the suitable tests. If independence cannot be
proved, the pass provides information about distance and
direction vectors, if the test used is capable of finding that
kind of information. This information can be very useful for
optimizing passes as well as for auto-parallelizing compilers,
especially in the case of nested serial loops.
In order to be able to use the dependence analysis pass, the
ICE compiler creates fake loops based on the pardo regions.
This is done for two reasons. First, by creating fake serial
loops out of parallel pardo regions, the compiler is
effectively checking for the problems bound to arise when a
pardo region is translated as is to a single spawn block.
Dependencies that are found between different iterations of
the serial loop translate to dependencies between the
different threads created by the spawn. Second, this pass is
intended for dependence analysis within serial loops and
structures, and is unable to recognize parallel loops,
including any lock-step loops. As a result, the dependence
analysis pass would not recognize the dependencies within a
pardo region correctly, and would not find any loop-carried
dependence. For that reason, making a parallel pardo region
resemble a serial loop allows the pass to provide all the
dependency information needed to build the dependence graph
used by the clustering algorithm. The fake serial loop is
created as follows:
(1) The compiler adds two new empty basic blocks: one for the
loop header, which is inserted right after the pardo region
header, and another basic block that replaces the pardo block
trailer and will be the last basic block to execute in every
iteration of the loop. For purposes of this explanation, this
basic block will be referred to as the "loop trailer". After
this is completed, the loop header should dominate all pardo
basic blocks, and the loop trailer should post-dominate all
the pardo basic blocks.
(2) The compiler adds the code required for checking the loop
condition and places it into the loop header basic block. This
check makes sure the loop index variable does not exceed the
high parallel context ID in the pardo statement. If the check
fails, the loop execution terminates, and execution is set to
continue at the pardo region's trailer basic block. The
compiler also adds instructions to the loop trailer for
incrementing the loop index by step, and an unconditional
branch to the loop header.
(3) The loop index variable is initialized to the low parallel
context ID outside the loop.
Once the dependency graph is completed, the fake serial loop
structure is removed, and all the changes are reversed. After
the dependency graph is complete, the compiler executes the
clustering pass. Once clustering has completed and clusters
are set, the compiler splits the pardo block, replicates the
CFG as discussed in 4.3.3, and moves all memory references and
intermediate instructions to their respective clusters.
4.4.3 Maintaining Correctness of LLVM's SSA Form
The LLVM compiler infrastructure's intermediate
representation (IR) uses Static Single Assignment (SSA)
form to represent the intermediate operations, and to
maintain the def-use and use-def chains within the compiled
module. One of the required properties of the SSA
representation is that a definition must dominate all its
uses, the only exception being the case where the user
instruction is a phi-node. As a result of clustering and the
subsequent replication of the CFG, followed by the placement
of instructions into their parent basic block's replica within
the cluster they belong to, it is often the case that the
dominance property of the SSA representation is broken.
Figure 4.4 shows an example of this problem.
This problem is resolved through cloning the offending
definition into memory in the cluster where it occurs, then
reading the clone in the user instruction's cluster, right
before the user instruction. Unfortunately, this solution may
potentially result in cloning too many intermediate definitions
into memory.
Each cloned definition costs at least two memory accesses:
a store instruction for making the clone copy, and a
load instruction every time the cloned value is used by an
instruction. Hence, in order to minimize the overhead of
memory accesses, a value must not be cloned unless it
is necessary. As such, the compiler checks for cheaper
options that enable recalculating the value instead of
creating a clone of it. The recalculation is

    Before clustering:
        A1 = Load b[i+1]
        A2 = add A1, 5

    After clustering:
        Cluster 1:
            A1 = Load b[i+1]
        Cluster 2:
            A11 = load temp
            A2 = add A11, 5
            Store A2, b[i]

Figure 4.4: Example showing placement of instructions into
their respective clusters, before and after clustering, and the
CFG replication process. The store depends on the load across
multiple parallel contexts. The placement of instructions A1 and
A2 into clusters 1 and 2 respectively results in breaking the
SSA dominance property. To resolve this problem, A1 is cloned
to temp in cluster 1, and temp is retrieved to be used by A2
in cluster 2.
store. In that sense, unless the value to be cloned is (i)
the result of a computation involving at least three loads
from memory, or (ii) the result of a computation of
two loads where the clone is going to be used in more than
a single cluster, or (iii) a value that uses the result of a
memory read that conflicts with other memory accesses within
the cluster (as is the case in figure 4.4), then the value is
not cloned and is recalculated instead. The compiler keeps
track of the values that it cloned, so that later on, after
optimizing the control flow graph, the compiler can check
which ones are no longer necessary and remove them.
Following this, the compiler executes the CFGSimplify pass
to remove all the unnecessary or empty basic blocks that
resulted from replicating the CFG. After that, the compiler
executes another round of the LLVM passes used for the
preliminary optimization stage to clean up and remove all
the extra variables or memory clones. The compiler then
checks that only the necessary clones are left. If the LLVM
passes have missed a clone that is no longer needed, the
compiler removes it.
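
The clone-or-recalculate rule stated above can be written as a
small predicate. This is a hedged Python sketch of conditions (i)
to (iii) only; the function name should_clone and its three
parameters are hypothetical, and the actual compiler works on LLVM
IR values rather than on counts passed in by hand.

```python
def should_clone(num_loads, num_user_clusters, has_conflicting_read):
    """Return True to clone the value into memory, False to recalculate it.

    num_loads: memory loads needed to recompute the value
    num_user_clusters: number of clusters that would read the clone
    has_conflicting_read: the value depends on a read that conflicts
        with other memory accesses in its cluster
    """
    if num_loads >= 3:                            # condition (i)
        return True
    if num_loads == 2 and num_user_clusters > 1:  # condition (ii)
        return True
    if has_conflicting_read:                      # condition (iii)
        return True
    return False                                  # cheaper to recalculate


# The situation of figure 4.4: one load, but the read of b[i+1]
# conflicts with the store to b[i], so the value is cloned.
print(should_clone(1, 1, True))
```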
4.4.4 Transforming the LLVM IR into XMTC code
Finally, the compiler translates the LLVM IR to high-level
XMTC code using our XMTC backend. The XMTC backend
is a modified version of LLVM's native C backend with
added support for generating high-level XMTC code. Here
the compiler splits pardo regions into spawn blocks based
on the results of the clustering pass. In this stage the
compiler also splits loops and conditionals as discussed
earlier, creates all arrays for communicating intermediate
data, and performs any other steps required for generating
correct XMTC code. After the XMTC code is produced, it is
compiled using the existing gcc-based XMTC compiler [12]
to produce binaries for the XMT FPGA and the XMT
cycle-accurate simulator.
4.5 Support for Nested Parallelism
Correct translation of nested ICE code into nested XMTC
code is similar to that of non-nested ICE code in that it
requires splitting the pardo region into multiple spawn
regions. However, splitting an inner pardo region requires
that all outer pardo regions containing it be split as well.
Translating nested ICE code by only splitting the inner pardo
region, without splitting any of the outer pardo regions, will
create multiple spawn blocks contained within one enclosing
spawn block. Each 'parent' thread created by the outer spawn
will in turn execute its instances of the inner spawn calls at
its own pace, creating multiple tasks that are synchronous only
with threads created by the same spawn call instance. So, a
'parent' thread may potentially complete the execution of
multiple inner spawn calls before any is executed by other
'parent' threads. Thus, the parallel contexts created by a
nested pardo will not synchronize with other nested parallel
contexts on the same level of nesting, consequently breaking
the lock-step execution semantics of ICE.
Translating nested ICE code faces many of the same
problems, and uses many of the same solutions, as
translating non-nested ICE code, with minor differences.
Similar to the non-nested case, splitting an inner nested pardo
will introduce many of the data flow and control flow
problems discussed in section 4.2.2. To resolve this,
temporaries need to be created to communicate the
intermediate data computations and control direction of the
nested parallel contexts. However, since a split within the
inner pardo requires that we split its enclosing pardo regions
as well, the temporaries created should account for all the
parallel contexts created across all enclosing parallel
contexts on all levels of nesting. Hence, the ICE compiler
creates multi-dimensional array temporaries with as many
dimensions as the levels of nesting.
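
To illustrate the shape of such a temporary, here is a minimal
Python sketch under stated assumptions: the number of parallel
contexts at each nesting level is known up front, and each
context owns exactly one slot. The helper make_temporary is a
hypothetical name; the generated XMTC code would declare a C
array instead.

```python
def make_temporary(context_counts, fill=0):
    """Build a nested-list temporary with one dimension per nesting level.

    context_counts[k] is the number of parallel contexts at level k
    (outermost first).
    """
    if not context_counts:
        return fill
    head, *rest = context_counts
    # A fresh sub-array per context, so no two contexts share a slot.
    return [make_temporary(rest, fill) for _ in range(head)]


# Two levels of nesting: 3 outer contexts, 4 inner contexts each.
temp = make_temporary([3, 4])
temp[2][1] = 42   # slot owned by outer context 2, inner context 1
print(temp)
```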
Optimizing nested ICE code is performed in a very similar
manner to the non-nested case as well. The clustering
algorithm is still used to minimize the number of splits
involved, as are the optimizations used to reduce the
number of temporaries. Control flow after clustering is
still replicated. However, two changes are added for
nesting:
(1) Building the Data Flow Graph: The change required
    was in extending the algorithm used to recognize the
    pardo nesting, which allowed the data dependency graph
    to capture the data flow across nested ICE code as
    well. Since the clustering algorithm relies solely on the
    dependency graph, the clustering algorithm does not
    require other changes.
(2) Handling serial loops occurring within a nested pardo
    block: Similar to how serial loops were handled in the
    non-nested ICE code, a serial loop is taken outside
    the pardo region containing it. Also, since all enclosing
    pardo blocks need to be split, the serial loop is
    taken outside all enclosing pardo blocks and is executed
    by the MTCU. The loop condition becomes a flag
    indicating the existence of threads that have not
    completed execution yet, and the original loop termination
    condition becomes a normal branch within the pardo where
    the loop used to reside and is treated as in the regular
    branch case.
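
The loop hoisting described in item (2) can be simulated in a few
lines. The Python sketch below is an illustration only: the inner
for loop stands in for one spawn block, any_active plays the role
of the flag, and trip_counts is a made-up input giving how many
iterations each parallel context would have run in the original
loop. None of these names come from the ICE compiler.

```python
def run_hoisted_loop(trip_counts):
    """Simulate a serial loop hoisted out of a pardo region.

    trip_counts[i] is the number of iterations context i would have
    executed in the original, per-context loop.
    """
    n = len(trip_counts)
    done = [False] * n
    iters = [0] * n
    rounds = 0
    any_active = n > 0
    while any_active:                 # hoisted loop, run serially
        any_active = False
        for i in range(n):            # stands in for one spawn block
            if done[i]:
                continue
            if iters[i] >= trip_counts[i]:
                done[i] = True        # original exit test, now a branch
                continue
            iters[i] += 1             # loop body for context i
            any_active = True         # flag: some thread has work left
        rounds += 1
    return iters, rounds


print(run_hoisted_loop([2, 0, 3]))
```

Every context advances one iteration per round, so contexts stay
in lock-step even though they finish at different times.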
Chapter 5: Evaluating the ICE Language: Results
and Analysis
This chapter presents the results of the experiments of
writing and compiling ICE programs, and compares the results
to programs written in the XMTC language. First, this chapter
examines the difference in ease of programming between
ICE and XMTC by comparing the number of lines of code
needed to implement the same algorithms. Then, it examines
the translation accuracy of the ICE compiler by comparing
the ICE-to-XMTC translation produced by the compiler to the
hand-optimized XMTC in terms of the number of required
spawn blocks and temporaries used. Finally, this chapter
lists and examines the performance of ICE programs for the
benchmarks used, and compares it to that of the manually
optimized XMTC.
5.1 Environment and Methodology
Since ICE is a new language with no standardized
benchmarks, a suite of 16 benchmarks based on common PRAM
algorithms was developed to be used for the experiments.
This benchmark suite contains benchmarks that can be
classified as nested and non-nested algorithms, or regular and
irregular programs. A list of the non-nested benchmarks
in the new benchmark suite is available in table 5.1, and
the nested benchmarks can be seen in table 5.2. Four of
the five nested benchmarks have a non-nested version.
With the exception of BFS, the non-nested versions of the
algorithms were essentially flattened versions of the nested
benchmarks. The fifth benchmark (i.e., topological sort) does
not have a non-nested counterpart in the benchmark suite.
A detailed description of each of the algorithms the
benchmarks were based on can be found in [25, 26, 27]. For
each benchmark, a lock-step pseudo-code was written based
on the PRAM algorithm for the benchmark; then, based on
that pseudo-code, two versions were implemented: an XMTC
version that is manually optimized for best performance, and
the ICE version. The ICE version was compiled using the
new ICE compiler, then the compiler's output XMTC code was
compiled using the XMTC compiler. The same XMTC compiler
is used for compiling both the hand-optimized XMTC
program and the automatically generated XMTC code that
was translated from ICE. After the XMTC compiler has
produced an XMTC executable, it is executed using the
64-TCU XMT FPGA.

Table 5.1: A list of the non-nested benchmarks.

Benchmark                       Problem Size                Abrv.
Sample Sort *                   131072                      SMP
Breadth First Search *          32768 nodes, 65536 edges    BFS
Tree Contraction                32768 nodes                 CTRC
Tree Rooting *                  32768 nodes, 65536 edges    RANK
2D Jacobi Stencil Computation   512x512                     JAC
LU Factorization                512x512                     LU
Cholesky Factorization          512x512                     CHO
Table 5.2: A list of the nested benchmarks.

Benchmark                       Nesting Depth   Problem Size                Abrv.
Topological Sort                2               32768 nodes, 65536 edges    TOBO
Breadth First Search *          2               32768 nodes, 65536 edges    NBFS
2D Jacobi Stencil Computation   2               512x512                     NJAC
LU Factorization                2               512x512                     NLU
Cholesky Factorization          2               512x512                     CHO
5.2 Ease of use and Code size
This section examines the differences in code sizes for
both ICE and XMTC implementations of all benchmarks. The
code size is used as a measure of ease of programming.
This is fair because ICE and XMTC are extensions of the C
language, each featuring an extra keyword to express
parallelism: pardo for expressing lock-step parallelism in
ICE, and spawn for expressing threads in XMTC. Both
languages are identical otherwise. This means that for the
same pseudo-code of an algorithm with the same inputs and
outputs, smaller code indicates simpler programs, and the
increase in code size

Figure 5.1: Code size for the entire program normalized to
XMTC.
and/or higher performance, as is the case in the example
in figure 3.2. Thus, we believe comparing lines of code to
approximate ease of programming is a valid approach to
demonstrate the ease of programming of ICE compared to
XMTC.
Two different measurements of code size are provided: a
measurement for the parallel algorithmic part only, to
examine the ability of the ICE language to help programmers
write simpler parallel code, and a measurement for the entire
program, provided for completeness. For both measures, the
number of lines of code naturally excludes white spaces and

Figure 5.2: Code size of the algorithm's parallel sections
normalized to XMTC.
line. For the algorithmic parallel portion of the code, we
measure the benchmark's code size for parallel sections
only, excluding all shared variable declarations and
non-recurring initializations, all serial algorithms used as
part of the main parallel algorithm (e.g., serial sorting or
summation), the main function, and all preprocessor
directives.
Figure 5.1 shows a comparison of the reduction in the
entire program code size for non-nested ICE normalized to
optimized XMTC. This graph shows that ICE has a smaller
code size than XMTC for seven out of our eleven
benchmarks. The other four benchmarks saw no
reduction in code size, since they contain none of the cases
that ICE can help programmers with. These benchmarks
were included only as a baseline case. ICE provides a
reduction in the size of code of 11.01% on average for
the entire benchmark suite, and 16.08% on average for the
benchmarks that showed an improvement.
Figure 5.2 shows the percentage of code size reduction
for the parallel algorithm part of the benchmarks for
non-nested ICE when normalized to the XMTC version. Here
as well, ICE reduces code size compared to XMTC, with a
reduction of up to 57.14% in some cases. ICE provides an
average reduction of 21.61% for the entire set, and 33.35%
for benchmarks that showed an improvement. This shows the
potential of ICE to reduce code size (and therefore
programming effort) compared to XMTC, which is a more
traditional threaded language.
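
For reference, the reduction percentages quoted in this section
follow the usual definition of a relative reduction. The Python
sketch below states that definition; the line counts 7 and 3 are
hypothetical, chosen only because they happen to reproduce the
57.14% figure, and are not taken from the benchmark sources.

```python
def percent_reduction(xmtc_lines, ice_lines):
    """Code size reduction of ICE relative to XMTC, in percent."""
    return 100.0 * (xmtc_lines - ice_lines) / xmtc_lines


def normalized_size(xmtc_lines, ice_lines):
    """ICE code size normalized to XMTC (XMTC = 1.0)."""
    return ice_lines / xmtc_lines


# Hypothetical counts: 7 lines of XMTC replaced by 3 lines of ICE.
r = percent_reduction(7, 3)
print(f"{r:.2f}%")
```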
Figures 5.3 and 5.4 show that the ease of programming
benefit of ICE extends to nested ICE as well. Figure 5.3
provides a comparison of the reduction in size of the code
for the entire program for nested ICE programs normalized
to optimized nested XMTC programs, while figure 5.4 shows
the code size reduction for the parallel algorithmic part of
the benchmarks for nested ICE when normalized to the nested
XMTC version. Across the two figures, ICE provides an
average reduction in code size of 13.28% for the entire
program, and 34.14% for the parallel algorithm portion of
the code. We also notice in figure 5.4 that the maximum
reduction in code size for the algorithmic part of the
program was 64.71%.

Figure 5.3: Code size for the entire program normalized to

Figure 5.4: Code size of the algorithm's parallel sections
normalized to XMTC for nested benchmarks.

Finally, when examining the reduction in code size for the
entire benchmark suite, ICE provides an average reduction
in the size of code of 11.72% for the entire program, and
25.53% for the parallel algorithmic part of the code.
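
The suite-wide averages are consistent with the per-group averages
reported earlier in this section, weighted by the number of
benchmarks in each group (eleven non-nested and five nested). The
short Python check below makes that arithmetic explicit; the
function name suite_average is illustrative.

```python
def suite_average(non_nested_avg, nested_avg, n_non_nested=11, n_nested=5):
    """Benchmark-count-weighted average of two group averages."""
    total = n_non_nested + n_nested
    return (n_non_nested * non_nested_avg + n_nested * nested_avg) / total


entire = suite_average(11.01, 13.28)     # entire program
parallel = suite_average(21.61, 34.14)   # parallel algorithmic part
print(f"{entire:.2f} {parallel:.2f}")
```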
5.3 Accuracy
In this section we take a look at the ICE compiler's
accuracy and effectiveness in translating to XMTC. We look
at the number of spawn blocks and temporaries¹ used to
implement our benchmarks. We believe that this will help
demonstrate the ICE compiler's effectiveness in producing
high-performance XMTC programs, due to the effect spawns
and temporaries have on the runtime performance of the
translated XMTC code, as discussed in section ??

¹ Each temporary was used to store only one value that may be
read multiple times.
We look at table 5.3 to see the number of spawns and
temporaries used by the programmer and by the ICE compiler.
This table shows that 15 of the 16 benchmarks had the same
number of spawns and temporaries in both XMTC versions.
The remaining benchmark had more spawns and temporaries
than the hand-written XMTC. This benchmark had multiple
independent indirect memory references whose independence
cannot be detected by compilers. However, the programmer of
the hand-written version was able to avoid the extra splits
and temporaries.
Table 5.3: Spawns and temporaries used by the hand-optimized
XMTC and by the XMTC generated by the ICE compiler.

                              Hand-optimized XMTC   ICE-generated XMTC
Benchmark                     Spawns   Temp.        Spawns   Temp.
Integer Sort                  3        0            3        0
Merging                       4        0            4        0
Sample Sort                   8        0            8        0
Breadth First Search          3        0            3        0
Graph Connectivity            12       2            13       3
Maximum Finding               4        0            4        0
Tree Contraction              7        4            7        4
Tree Rooting                  5        2            5        2
Jacobi                        2        1            2        1
LU Factorization              1        0            1        0
Cholesky Factorization        2        0            2        0
Topological Sort              5        0            5        0
Nested Breadth First Search   3        0            3        0
Nested Jacobi                 4        1            4        1
Nested LU Factorization       2        0            2        0
Nested Cholesky Factorization 3        0            3        0

The ability of the ICE compiler to generate high-quality
code is highly reliant on the performance of the alias
analysis used to determine the dependencies between memory
accesses. These dependency relationships are used during
the clustering step to determine the number of required
splits and spawn blocks, as was discussed in 4.3.1.
Whenever uncertain about a dependency, the compiler
conservatively assumes a dependence exists. This means
that whenever alias analysis is queried about memory
references and provides a definitive no-alias answer, the
clustering algorithm makes better clustering decisions, and
ultimately produces more efficient code. Alias analysis is
a large field of compiler theory research, and any
advancements within the field will benefit the operation of
the ICE compiler. However, it is outside the scope of this
thesis and we will not discuss it any further.
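
The conservative rule described above can be summarized in a few
lines. This Python sketch is only a model of the decision: the
AliasResult names are illustrative stand-ins for the kinds of
answers an alias analysis can give, and assume_dependence is a
hypothetical helper, not a function of LLVM or of the ICE
compiler.

```python
from enum import Enum


class AliasResult(Enum):
    NO_ALIAS = 0     # the two references definitely do not overlap
    MAY_ALIAS = 1    # the analysis could not decide
    MUST_ALIAS = 2   # the two references definitely overlap


def assume_dependence(result):
    """Only a definitive no-alias answer removes the dependence edge."""
    return result is not AliasResult.NO_ALIAS


for answer in AliasResult:
    print(answer.name, assume_dependence(answer))
```

Under this rule, every undecided query adds an edge to the
dependence graph, which is how imprecise alias analysis can lead
to extra splits and temporaries such as those seen for the one
benchmark that differs in table 5.3.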
5.4 Performance
The XMT platform is excellent at exploiting parallelism in
irregular algorithms. A list of examples of published work
that shows XMT's speedups against commodity superscalar
architectures was discussed in section 2.7.1. This also
validates the choice of comparing the performance of the
ICE language to the XMTC language for binaries executed
on the XMT platform.
In this section, we focus on the performance comparison
between binaries compiled from ICE and XMTC.
The XMT FPGA, which has 64 TCUs, was used to obtain
the performance measurements listed below for both
hand-optimized XMTC and ICE versions of the same algorithm
pseudo-code. Figures 5.5 and 5.6 provide the speedup
of ICE normalized to hand-optimized XMTC for both nested
and non-nested cases. Figures 5.7 and 5.8 show the
net run-time improvement of ICE relative to hand-optimized
XMTC, normalized to hand-optimized XMTC, for both nested
and non-nested cases. At the end of executing a binary,
the XMT FPGA provides the number of cycles required to
execute the XMT binary. These numbers were collected for
all 16 benchmarks, and then used for performance
comparisons. The cycle count of hand-optimized XMTC is used
as the basis for the comparison, and these figures show the
performance results for the ICE code normalized to
hand-optimized XMTC programs.
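
A minimal Python sketch of how such normalized numbers can be
derived from cycle counts follows. It assumes that speedup is
the ratio of XMTC cycles to ICE cycles and that the net
improvement is that ratio minus one, expressed in percent; the
text does not spell out the exact formula, so both definitions
and the cycle counts used below are assumptions for illustration.

```python
def speedup(xmtc_cycles, ice_cycles):
    """Speedup of ICE normalized to hand-optimized XMTC (1.0 = equal)."""
    return xmtc_cycles / ice_cycles


def net_improvement_percent(xmtc_cycles, ice_cycles):
    """Net run-time improvement of ICE over XMTC, in percent.

    Positive values mean ICE is faster; negative values mean a slowdown.
    """
    return 100.0 * (speedup(xmtc_cycles, ice_cycles) - 1.0)


# Hypothetical cycle counts, not measurements from the XMT FPGA.
print(speedup(1000, 800), net_improvement_percent(1000, 800))
```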
To ensure that ICE is being compared to the fastest
hand-optimized XMTC, many steps were taken. Since memory
accesses are the biggest source of overhead in XMT,
temporaries were not used in XMTC programs unless
necessary. This can be seen in table 5.3, where thirteen
benchmarks used no temporaries and fifteen use two
temporaries or fewer. The other, lesser source of overhead
comes from the creation and termination of threads. This
overhead is very small in XMT and has a negligible effect on
the validity of the comparisons presented in this section.
ICE achieves comparable performance to hand-optimized
XMTC, which takes considerably more programming effort

Figure 5.5: 64 TCU XMT processor speedup comparison of

Figure 5.6: 64 TCU XMT processor speedup comparison for

Figure 5.7: 64 TCU XMT net speedup of non-nested ICE
programs normalized to hand-optimized XMTC
speedup on average for non-nested benchmarks, with a
maximum slowdown of 2.5% when compared to the performance
of optimized XMTC. Figure 5.8 shows that ICE nets no
slowdown on average when compared to hand-optimized
XMTC, with a maximum slowdown of 0.91%. Accepting such
minor performance penalties in exchange for a much easier
programming effort is an obviously good choice for
programmers. For programmers who are not performance
experts and cannot write highly optimized XMTC code, ICE
might even provide a speedup.
The figures also show that for some benchmarks, ICE
has achieved a speedup when compared to hand-optimized

Figure 5.8: 64 TCU XMT net speedup of ICE normalized to
optimized XMTC for the nested benchmarks
speedups over XMTC for expert programmers, since
intuitively hand-optimized parallel code should always be
faster.

Upon investigating, it was found that multiple factors
contribute to the observed speedups. For some
benchmarks (the Merging, Maximum Finding, and non-nested
Jacobi benchmarks), the ICE code was accurately translated
to its equivalent XMTC code (i.e., it has the same number
of spawn blocks and temporaries). However, the program
layout of the two versions is different. This suggests that
the performance gain is a result of factors unrelated to
the translation, such as data location in the read-only
cache, instruction scheduling, whether the relevant data was
pre-fetched, or which optimization opportunities were
recognized and performed by the XMTC compiler. For another
benchmark subset (the non-nested BFS and tree contraction
benchmarks), the performance gain was a result of the
native optimizations of the LLVM compiler, which is more
recent than the GCC compiler used in the XMTC
implementation, combined with the ICE-compiler-specific
optimizations that were implemented. When a PRAM algorithm
requires multiple synchronization points within a deeply
nested if-else block, the condition needs to be re-evaluated
after each point. However, the ICE compiler uses bit vectors
to record the evaluation results for multiple branches, so
that a single memory read per spawn block is sufficient to
retrieve the control flow information, as was discussed in
4.3.2. Since a programmer is very unlikely to use bit
vectors to record the results of multiple branches, the
hand-written code needs multiple reads per spawn block for
condition evaluation.
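
The bit-vector idea can be sketched as follows. This is a minimal
Python illustration under the assumption that each parallel
context packs its branch outcomes into one integer; the helper
names pack_branches and branch_taken are hypothetical, and the
encoding the ICE compiler actually emits is the one described in
4.3.2.

```python
def pack_branches(outcomes):
    """Pack a list of branch outcomes (bools) into one integer bit vector."""
    bits = 0
    for position, taken in enumerate(outcomes):
        if taken:
            bits |= 1 << position
    return bits


def branch_taken(bits, position):
    """Recover the outcome of the branch stored at `position`."""
    return (bits >> position) & 1 == 1


# One bit vector per parallel context. A later spawn block reads
# control[i] once and recovers every recorded branch direction.
control = [
    pack_branches([True, False, True]),
    pack_branches([False, False, True]),
]
print(control, branch_taken(control[0], 2))
```

Storing one boolean per branch in separate temporaries would
instead cost one memory read per branch in every later spawn
block, which is the hand-written pattern the paragraph contrasts
against.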
To compare the performance of ICE for both the nested
and non-nested versions of an algorithm, figure 5.9 is
provided. This figure shows a comparison of both the nested

Figure 5.9: 64 TCU XMT net speedup comparison between
nested and non-nested ICE normalized to optimized XMTC
optimized XMTC. It can be seen from the figure that for
three of the four benchmark pairs, the nested versions
achieved slightly better speedups compared to the
non-nested versions of the benchmarks, whereas for the
fourth benchmark, the nested version achieved significantly
lower performance when compared to its non-nested
counterpart. We believe that this was mainly due to the minor
changes made to the algorithm to be able to write a
non-nested version of it. We do not think that we can draw
conclusions on which method is better based on such a small
subset.
The ease of programming of ICE permitted the writing
of programs directly from a (PRAM) algorithm with less
effort than that required for writing hand-optimized XMTC,
while maintaining comparable performance through using a

Chapter 6: Related work and Conclusion
6.1 Related work
As mentioned in the introduction, over 225 parallel
languages have been proposed. It is not practical to discuss
all of them here. We will focus on languages that are most
closely related, either because they have an algorithmic
foundation, such as PRAM, or have an ICE-like lock-step
execution model, or are meant for hardware like XMT suited
to irregular programs. In summary, we have not found any
related work that has the full ecosystem that ICE offers
of an easy-to-program language, with a sound algorithmic
foundation in PRAM theory, a capable compiler mapping
to threaded programs, and hardware that is capable of
exploiting fine-grained irregular parallel programs.
The goal of this proposal is to allow programmers to
use as freely as possible an extended form of lock-step
programming similar to the way parallel algorithms are
expressed in the PRAM literature, whenever they prefer to do
so. We call this extended form ICE programming. The
general work-depth framework (due to [30]) is used in several
parallel algorithms texts [25, 26, 27] for describing PRAM
algorithms. ICE is the immediate concurrent execution
abstraction presented in [2] for popularizing this framework.
To facilitate this goal, the statement of work in the
proposal includes mapping the ICE lock-step semantics onto
the XMT multi-threaded semantics while achieving the best
performance we can. As we start from a lock-step
specification, the performance objective entails reducing
synchrony in an automatic way. So far, XMT programming of
PRAM algorithms was done using the modest XMTC extension to
C. In particular, [29] suggested a "programmer's workflow"
that guides the programmer on how to advance the ICE
abstraction of an algorithm, called there high-level
work-depth (HLWD), to an XMTC program and how to tune its
performance. The original XMT hardware allowed us to achieve
strong speedups over the best serial algorithm for many
parallel algorithms implemented using this workflow [29].
Follow-up architecture, compiler and run-time enhancements
further improve these speedups. The current proposal seeks to
significantly reduce the algorithm-to-computer-program effort
by the programmer. Instead of an XMTC program, the
programmer will hand off a specification of the algorithm
using ICE programming. The XMT implementation (e.g., runtime)
should be "on par" with hand-optimized XMTC code.
DARPA launched the High Productivity Computing Systems
(HPCS) program with the purpose of building systems
that can be programmed productively. It resulted in three
languages: Cray's CHAPEL [31], SUN's Fortress [32], and
IBM's X10 [33]. Although all these languages have ease of
programming and high productivity as a goal, none is suited
for the lock-step model of PRAM algorithms. Further, all
these languages require manual specification of synchrony
and concurrency, whereas the ICE compiler automates the
process. Finally, these languages are intended to be mapped
to traditional coarse-grained hardware; hence they perform
poorly on irregular programs when compared to XMT.
APL is an early example of high-level programming that
allows for lock-step parallelism. A series of papers that
appears to have culminated with [34] sought execution of
compiler-extracted parallelism from APL programs on the
IBM RP3. The IBM RP3 built on the NYU Ultracomputer
project, which also inspired XMT. However, APL did not
provide sufficient support for the PRAM parallel algorithms
literature. The V-RAM [35] appears to be the first lock-step
programming model aimed at implementing this literature.
However, it was a lock-step model targeting vector hardware.
NESL, which followed, was not lock-step, but still appears to
have targeted machine models for which synchronization
was relatively easy; see, e.g., [36]. In any case, we are
unaware of speedup results for these approaches (APL,
V-RAM, NESL, etc.) that approach XMT results, especially for
irregular applications.
The case for (lock-step, nested) ICE programming. It is
instructive to go back and read the motivation for NESL
in [37]. Parallel programming constructs were needed for
specifying parallel algorithms. Blelloch examined parallel
algorithms that are described in the literature and their
pseudo-code. He found that nearly all are described as
parallel operations over collections of values. For example,
"in parallel for each vertex in a graph, find its minimum
neighbor", or "in parallel for each row in a matrix, sum the
row". The algorithms usually consist of many such parallel
calls interleaved with operations that rearrange the order of
a collection, and can be called recursively in parallel. He
used Quicksort as an example. Hillis and Steele [38] called
this ability to operate in parallel over sets of data
data-parallelism. The languages based on it are often
referred to as data-parallel languages. Quite a few parallel
languages comprise data-parallel features as well as other
forms of parallelism [39, 40, 41, 42].
Following on his earlier work [35], Blelloch contrasts flat
data-parallel languages, where a function can be applied in
parallel over a set of values but the function itself must be
sequential, with nested data-parallel languages; in the
latter, any function can be applied over a set of values,
including parallel functions. For example, for multiplying a
matrix by a vector, the summation of each row of the matrix
could itself execute using parallel summation. We concur with
his claim that the ability to nest parallel calls is critical
for expressing algorithms in a way that matches our
high-level intuition of how they work; e.g., nested
parallelism can be used to implement nested loops and
divide-and-conquer algorithms in parallel.
As multi-threaded architectures gained popularity, the
need for nesting, encouraged by Blelloch's work, gained
momentum. Cilk [43, 44] would be a good example for
general multi-threaded programming. Per [35], flattening
nested parallelism was important for the implementation of
high-level languages since it allows a compiler to translate
the high-level description of nested operations onto its
low-level implementation on a flat real (vector-like)
machine. Multi-threaded architectures, on the other hand,
allowed greater implementation flexibility. Cilk contributed
important compiler and run-time techniques such as
work-stealing for implementation of nested parallelism. Our
prior work [45] has already built on this flexibility by
further optimizing work stealing to an improvement called
Lazy Binary Splitting (LBS). Cilk++ [46] has incorporated a
concept of reducers. They showed that their work-stealing
scheduler can support reducers without incurring significant
overhead. We will also try to extend our LBS enhancement of
work-stealing scheduling to reducers.
In contrast to Cilk programming, ICE stays clear of race
conditions and other complexities that led to the grave
productivity concerns with respect to general multi-threaded
programming. ICE also directly connects with the parallel
algorithms literature, the original problem that nested
data-parallelism addressed. Namely, while being inspired by the
compiler and run-time support in [44], the XMTC program
handoff point was already closer to the algorithm, thus
reducing the human effort. The proposed work will get it even
closer, further reducing the human effort. The gap between
Cilk and the parallel algorithms literature is made clear by
comparing the multi-threaded algorithms section in [47] with
parallel algorithms texts [25, 26, 27]. A key difference is
that while [47] also favors work-depth performance
analysis, it does not equip the programmer with the same level
of freedom for designing for work-depth performance. This
point was demonstrated with respect to parallel algorithms
for merging in [2]. Cilk, of course, has the important
advantage of being supported by today’s commodity hardware and
is, in fairness, more accommodating to programmers than
much of its immediate competition. However, commodity
hardware cannot exploit irregular parallelism as effectively
as XMT.
Our central question is: How should the programming
(handing off human input to machine processing) of
parallel machines be? The thinking on this central question has
been affected by changes in technology and parallel
architectures over time (e.g., the move from vector to multi-threaded
architectures). In contrast, the literature on basic parallel
algorithms appears to be more resilient to these changes,
in spite of vigorous attempts by numerous researchers. This
resilience suggests that there is considerable intellectual
and practical merit in advancing a programming specification
that will make it easier to unleash the wealth of this
knowledge base:
1. This specification should be as close to the original
parallel algorithm, and as simple to produce, as possible.
2. It should be efficiently implemented on some architecture
platform, as part of its proof of concept.
3. Once the proof of concept is established, a proper set of
algorithm specifications could create a benchmark for guiding
future parallel architectures; obviously different architectures
will need different adjustments to compiler and run-time
solutions.
4. However, the main lesson from the success of XMT on
ease-of-programming is that support of the theory of parallel
algorithms, and in particular its concept of parallel
algorithmic thinking, is no less important for the design of
parallel systems than any set of specific applications or
features. This is also the biggest departure from standard
computer architecture practice.
6.2 Conclusion
We present ICE, a new lock-step easy-to-program
parallel programming language based on the PRAM algorithmic
model. We present the ICE compiler that we developed,
which translates lock-step ICE programs into
traditional threaded XMTC programs. We demonstrate that the
ICE compiler can provide performance comparable to
highly-optimized XMTC programs while requiring much less effort
from the programmer. We show how ICE's ease of programming
works in synergy with XMT’s efficient parallelization of
irregular programs to strike the ever-sought balance between
the compiler and the programmer roles in producing
parallel programs, where the programmer needs only to specify
parallelism and rely on the compiler to do the rest. Finally,
given the relatively slow progress in parallel programming
language technologies for irregular programs, our work
suggests new opportunities for benchmarking parallel machines
by their efficient support of high-level parallel algorithmic
languages.
We conclude with a broader perspective on the
significance of our contribution. It should be clear that ICE (or
work-depth) parallelism exists in every serial algorithm. The
only effort needed when we wish to use parallelism inherent
in a serial algorithm is to express it, which in our
experience is just a matter of skill, with no creativity involved.
In contrast, practically all commercial approaches to parallel
programming are based on partitioning the work to be done
among processors or threads. There is no clear path for
deriving such a partitioning from a serial algorithm and, when
doable, it requires significant creativity; in fact, in many cases
it either cannot be done or cannot be done beyond very limited
levels of parallelism. This extra level of creativity raises the
bar on the skill and effort of programmers, and has greatly
limited the adoption of many-cores among programmers and
application software vendors. This work, along with prior
XMT work, establishes that there is a way to avert the
above practice, which arguably amounts to throwing the
parallel programmer under the bus, through proper hardware
and software design choices.
Chapter A: Dependence Tests and Analysis
Dependence testing is a method by which it is determined
whether a dependence exists between two subscripted memory
references to the same array in nested loops [48]. It is
difficult to calculate data dependencies for arrays, because
array references can touch many different memory locations. This
appendix gives an overview of the methods used for the
high-precision testing of data dependencies.
Dependence testing has two main goals: to prove that no
dependence exists, and to provide the best possible characterization
of a possible dependence, in the form of distance and
direction vectors. Dependence tests are conservative in
nature, so if a test cannot prove the independence of two
memory references, it must assume that a dependence
exists. All the dependence tests below assume that all the
index variables of loop nests have been identified, and that
all auxiliary induction variables have been detected and
replaced by functions of the loop indexes.
To build the dependence graph it uses for splitting pardo
regions, the clustering pass in the ICE compiler relies
heavily on information provided by LLVM’s native Dependence
Analysis (DA) pass. The DA pass implements all the tests
discussed within this appendix, and decides which test (or
group of tests) to use, based on the subscripted reference
pairs being queried. This appendix is intended to give a
better understanding of the kinds of dependencies the DA
pass, and by extension the ICE compiler, will be able to
handle. For that reason, this appendix only discusses the
dependence tests implemented within LLVM’s DA pass.
A.1 Overview of Dependence Testing
Because the various classes of array subscripts differ in
how complex they are to test, subscripts should be partitioned
according to their complexity and tested accordingly. This
allows the testing procedure for array reference pairs to be
based on the partitioning. The steps of this procedure
are:
1. Partition the subscripts into separable subscripts and
minimal coupled groups.
2. Classify subscripts according to how they are indexed
(i.e., ZIV, SIV, MIV).
3. For each subscript, apply the appropriate subscript test
(ZIV, SIV, or MIV tests). The test is chosen based on the
subscript’s complexity. The goal of these tests is either to
prove independence, or to attempt to calculate distance and
direction vectors between an array reference pair.
4. If the subscripts of two memory accesses are proven to be
independent by any of the tests used, the memory references
are independent and testing should be terminated.
5. For each subscript group, apply multiple tests to
produce distance and direction vectors for the indexes
occurring within that subscript group.
6. If independence was not proven, merge, for every memory
reference pair, all the distance/direction vectors calculated in
the previous steps into a single distance/direction vector.
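The classification step (step 2) can be sketched as follows. This is only an illustration, not LLVM's implementation: the `Subscript` representation (a map from index-variable name to coefficient, plus a constant) and the function name `classify_pair` are assumptions made for this example.

```python
from typing import Dict, Tuple

# Hypothetical model of an affine subscript: (coefficients, constant),
# where coefficients maps a loop index name to its integer coefficient.
Subscript = Tuple[Dict[str, int], int]


def classify_pair(src: Subscript, dst: Subscript) -> str:
    """Classify a subscript pair as 'ZIV', 'SIV' or 'MIV'.

    The class depends on how many distinct loop index variables
    appear (with a non-zero coefficient) in either subscript.
    """
    indices = {
        name
        for coeffs, _ in (src, dst)
        for name, coeff in coeffs.items()
        if coeff != 0
    }
    if not indices:
        return "ZIV"  # no index variable: loop invariant
    if len(indices) == 1:
        return "SIV"  # exactly one index variable
    return "MIV"      # two or more index variables


# A[3] vs A[4] is ZIV; A[i] vs A[i + 2] is SIV; A[i + j] vs A[i] is MIV.
print(classify_pair(({}, 3), ({}, 4)))
print(classify_pair(({"i": 1}, 0), ({"i": 1}, 2)))
print(classify_pair(({"i": 1, "j": 1}, 0), ({"i": 1}, 0)))
```

Under this model a pair such as A[i] and A[j] is classified as MIV; the RDIV special case of section A.4.2 would be recognized in a later refinement.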
A.2 Zero Index Variable Test
The Zero Index Variable (ZIV) test is a loop-carried
dependence test that looks at array subscripts that have no index
variables (i.e., subscripts that are loop invariant). Examples
of subscripts where the ZIV test can be used are A[3] and A[4],
or A[t] and A[t + 1] where t is not an induction variable.
In this test, if two subscripts are proved to be unequal,
then the corresponding memory references are independent.
However, if the test fails, calculating distance/direction
vectors can be skipped: since they contain no induction
variables, the subscripts make no contribution to any
distance/direction vectors. An example of subscripts
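As a minimal sketch of the ZIV test, assume each loop-invariant subscript is modelled as an optional symbol plus an integer offset, so that A[3] is `(None, 3)` and A[t + 1] is `("t", 1)`. The representation and the name `ziv_test` are hypothetical and chosen only for illustration.

```python
from typing import Optional, Tuple

# Hypothetical model: (symbol name or None, constant offset).
Invariant = Tuple[Optional[str], int]


def ziv_test(s1: Invariant, s2: Invariant) -> str:
    """Compare two loop-invariant subscripts.

    Returns 'independent' when the subscripts are provably unequal,
    'dependent' when they are provably equal, and 'unknown' when the
    symbolic parts differ (a conservative client assumes dependence).
    """
    sym1, off1 = s1
    sym2, off2 = s2
    if sym1 == sym2:
        # Same symbolic part: the difference is the constant off1 - off2.
        return "independent" if off1 != off2 else "dependent"
    return "unknown"


print(ziv_test((None, 3), (None, 4)))  # A[3] vs A[4]
print(ziv_test(("t", 0), ("t", 1)))    # A[t] vs A[t + 1]
```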
A.3 Single-Subscript Dependence Tests
After the partitioning and classification of subscripts have
been completed, the tests used to check for the existence
of dependences between memory references are applied.
When a dependence exists, the tests will attempt to provide
as much information as possible about the nature of the
dependence in the form of the distance/direction vectors.
This section looks at tests that are applied over single
subscripts.
A.3.1 Single Index Variable Tests
The Single Index Variable (SIV) tests are a group of loop-carried
dependence tests that look at array subscript pairs that use
a single loop induction variable. These subscripts take the
form $a_1 i + c_1$ and $a_2 i + c_2$, where $a_1, a_2$ are
constants, $c_1, c_2$ are loop-invariant variables, and $i$ is
the loop induction variable. SIV subscripts are the most
commonly occurring form of array subscripts.
An exact SIV test for linear SIV subscripts is suggested
in [49, 50, 51]. These methods rely on finding all possible
solutions for a two-variable linear Diophantine equation.
Other, simpler methods exist where the SIV subscripts are
categorized into weak SIV and strong SIV. All these methods
are discussed below.
A.3.1.1 Strong SIV Test
A pair of SIV subscripts is said to be strong SIV if it
takes the form $a i + c_1$ and $a i + c_2$, where $i$ is an induction
variable, $a$ is a constant, and $c_1$ and $c_2$ are loop
invariants; i.e., when it is linear, and the coefficients of the
index $i$ are constant and equal. This test is capable of proving
independence, and if it fails to do that, it can calculate the
distance/direction vectors.
A geometric representation of the subscript pair is two
parallel lines, so common elements are always separated by
a constant distance across the different loop iterations.
To find the distance separating the subscript pair, start
from the general dependence equation:
$$a_1 i + c_1 = a_2 i' + c_2 \tag{A.1}$$
Since $a_1 = a_2 = a$, the equation simplifies to:
$$a i + c_1 = a i' + c_2 \tag{A.2}$$
Solving this equation for the dependence distance $d$ gives:
$$d = i' - i = \frac{c_1 - c_2}{a} \tag{A.3}$$
A dependence exists between a memory reference pair if
and only if both references access a common element within
the loop bounds. The subscript pair is therefore independent
when $d$ does not have an integer value, or when $|d| > (U - L)$,
where $U$ is the upper bound of the loop, and $L$ is the lower
bound of the loop. However, if $d$ has an integer value, and
$|d| \le (U - L)$, then a dependence exists and the direction of
the dependence is defined as:
$$\text{direction} = \begin{cases} < & d > 0 \\ = & d = 0 \\ > & d < 0 \end{cases}$$
A special case of the dependence rule is $d = 0$: the
subscript pair is then dependent only when both references use
the same array index in the same iteration, which in turn means
that the memory reference pair are loop-independent references
(footnote 1).
One of the main advantages of this test is its ability to
be easily extended to handle loop-invariant symbolic
expressions. This is accomplished by first evaluating the distance
symbolically; if the distance evaluates to a constant,
the test proceeds as discussed above. If that is
not the case, the difference between the loop bounds is
calculated and compared with the distance symbolically. An
example of this case is the subscript pair $(i + 2N)$, $(i + N)$
referenced within a loop bounded by $[1, N]$. When the distance
is calculated, it is found to be $d = N$. This is compared to the
difference between the loop bounds, $(N - 1)$. Since $N > (N - 1)$,
the pair is proven independent.
Footnote 1: However, that does not mean that the reference pair are
independent from one another within the same iteration of the loop.
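The strong SIV test above can be sketched for integer constants as follows. The function name `strong_siv` and its return convention (`None` for proven independence, otherwise a distance/direction pair) are assumptions made for this example; the symbolic extension is not modelled.

```python
from fractions import Fraction
from typing import Optional, Tuple


def strong_siv(a: int, c1: int, c2: int,
               lower: int, upper: int) -> Optional[Tuple[int, str]]:
    """Strong SIV test for subscripts a*i + c1 and a*i' + c2.

    Returns None when independence is proven, otherwise the pair
    (distance, direction) with distance d = (c1 - c2) / a as in (A.3).
    """
    if a == 0:
        raise ValueError("strong SIV requires a non-zero coefficient")
    d = Fraction(c1 - c2, a)
    if d.denominator != 1:
        return None                 # non-integer distance
    distance = int(d)
    if abs(distance) > upper - lower:
        return None                 # distance exceeds the iteration range
    if distance > 0:
        direction = "<"
    elif distance == 0:
        direction = "="
    else:
        direction = ">"
    return distance, direction


print(strong_siv(1, 2, 0, 1, 10))   # A[i + 2] vs A[i], i in [1, 10]
print(strong_siv(2, 1, 0, 1, 10))   # A[2i + 1] vs A[2i]
# The symbolic example above with N = 5: (i + 2N) vs (i + N) on [1, N].
print(strong_siv(1, 10, 5, 1, 5))
```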
A.3.1.2 Weak SIV Subscripts
A pair of SIV subscripts is said to be weak SIV if they
take the form $a_1 i + c_1$ and $a_2 i + c_2$, where $i$ is an induction
variable, $a_1, a_2$ are constants, and $c_1$ and $c_2$ are loop
invariants. This can be solved using the exact SIV test. However,
there are special cases that are easier to test using
their specific properties. A geometric representation of the
subscript pair is two lines intersecting at a specific point where:
$$a_1 i + c_1 = a_2 i' + c_2 \tag{A.4}$$
A weak SIV test can be formulated to check whether the point
where the lines intersect has an integer value within the
loop bounds. There are two special cases based on this:
the Weak-Zero SIV and the Weak-Crossing SIV. Finding either
of those two cases allows testing without requiring the use
of the exact SIV test.
A.3.1.3 Weak-Zero SIV Test
A pair of SIV subscripts is said to be weak-zero SIV if they
take the form $a i + c_1$ and $c_2$, where $i$ is an induction
variable, $a$ is a constant, and $c_1$ and $c_2$ are loop invariants.
This test is capable of determining independence. If it fails
to prove that, it can sometimes refine for direction.
This test relies on the fact that one of the coefficients
is zero. The dependence equation then reduces to
$a i + c_1 = c_2$, which gives:
$$i = \frac{c_2 - c_1}{a} \tag{A.5}$$
This means that one reference of the subscript pair accesses
only one specific array element. So this test amounts to
checking at what integer loop index the subscript
pair accesses the same array element, and whether that index is
within the loop bounds.
The weak-zero SIV test checks for dependences caused
by a specific loop iteration. If this is the first or last loop
iteration, the dependence can be eliminated by the loop peeling
optimization (footnote 2).
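A minimal sketch of the weak-zero SIV test for integer constants follows. The name `weak_zero_siv` and the convention of returning the offending iteration (or `None` when independence is proven) are assumptions for illustration only.

```python
from typing import Optional


def weak_zero_siv(a: int, c1: int, c2: int,
                  lower: int, upper: int) -> Optional[int]:
    """Weak-zero SIV test for subscripts a*i + c1 and c2.

    Solves a*i + c1 = c2 for i. Returns the single loop iteration
    that causes the dependence, or None when the solution is not an
    integer or lies outside [lower, upper] (independence proven).
    """
    if a == 0:
        raise ValueError("weak-zero SIV requires a non-zero coefficient")
    numerator = c2 - c1
    if numerator % a != 0:
        return None                 # no integer solution
    i = numerator // a
    if lower <= i <= upper:
        return i
    return None                     # solution outside the loop bounds


print(weak_zero_siv(2, 1, 7, 1, 10))   # A[2i + 1] vs A[7]
print(weak_zero_siv(1, 0, 1, 1, 10))   # A[i] vs A[1]
```

In the second call the returned iteration equals the lower bound, which is the situation where peeling the first iteration would remove the dependence from the remaining loop.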
A.3.1.4 Weak-Crossing SIV Test
A pair of SIV subscripts is said to be weak-crossing SIV
if they take the form $a i + c_1$ and $-a i + c_2$, where $i$ is an
induction variable, $a$ is a constant, and $c_1$ and $c_2$ are loop
invariants. This test is capable of proving independence;
failing that, it can sometimes be refined for direction.
This test relies on the fact that $a_2 = -a_1$, since
symmetry helps simplify the analysis. As the loop index
progresses, the two subscripts move away from a specific point
at the same rate. This point is the crossing point between the
subscript pair. To locate the crossing point, $i'$ is set to $i$,
and $a_2$ is substituted with $-a_1$, which results in:
$$a_1 i + c_1 = -a_1 i + c_2 \tag{A.6}$$
Solving for $i$ gives the crossing point:
$$i = \frac{c_2 - c_1}{2 a_1} \tag{A.7}$$
Determining whether a dependence exists amounts to
checking whether the crossing point is within the loop bounds,
and has a value that is either an integer or a non-integer
multiple of $\frac{1}{2}$. Since the two subscripts move away from
the crossing point at the same rate, a crossing point that is
neither an integer nor exactly in the middle between two integers
means the pair cannot intersect at integer iterations, thus
proving independence.
Weak-crossing SIV dependences can be eliminated by the
loop splitting optimization (footnote 3).
Footnote 2: Loop peeling is an optimization where the first or last
loop iteration is split off and executed separately from the rest of
the loop. Loop bounds are adjusted accordingly.
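The weak-crossing check can be sketched as below for integer constants. The crossing point is taken as $(c_2 - c_1)/(2a)$, obtained by solving the crossing equation for the subscripts $a i + c_1$ and $-a i + c_2$; the function name `weak_crossing_siv` and its return convention are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def weak_crossing_siv(a: int, c1: int, c2: int,
                      lower: int, upper: int) -> Optional[Fraction]:
    """Weak-crossing SIV test for subscripts a*i + c1 and -a*i' + c2.

    Returns the crossing point when a dependence is possible, or None
    when independence is proven. A dependence needs the crossing point
    to lie within [lower, upper] and to be an integer or a half-integer.
    """
    if a == 0:
        raise ValueError("weak-crossing SIV requires a non-zero coefficient")
    crossing = Fraction(c2 - c1, 2 * a)
    if not (lower <= crossing <= upper):
        return None                       # crossing outside the loop bounds
    if (2 * crossing).denominator != 1:
        return None                       # neither integer nor half-integer
    return crossing


# A[i] vs A[-i' + 5] on [1, 10]: i + i' = 5, e.g. i = 2 and i' = 3.
print(weak_crossing_siv(1, 0, 5, 1, 10))
# A[2i] vs A[-2i' + 5]: 2(i + i') = 5 has no integer solution.
print(weak_crossing_siv(2, 0, 5, 1, 10))
```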
A.3.1.5 Exact SIV Test
As discussed earlier in this section, the exact SIV
test is used for array subscript pairs that have the form
$a_1 i + c_1$ and $a_2 i' + c_2$, where $i$ is an induction variable,
$a_1, a_2$ are constants, and $c_1$ and $c_2$ are loop invariants. This
test is slower than any of the specialized SIV tests (i.e., strong,
weak-zero, weak-crossing), thus it is recommended to use
them instead where they apply. The specialized tests are also
better with symbolics, and the strong SIV test can even calculate
distances, as discussed in A.3.1.1. This test can prove
independence; failing that, it can refine for direction.
Starting from the intersection point
$$a_1 i + c_1 = a_2 i' + c_2 \tag{A.8}$$
this test requires looking for all possible solutions of
the equation:
$$a_1 i - a_2 i' = c_2 - c_1 \tag{A.9}$$
This equation has an integer solution if and only if
$\frac{c_2 - c_1}{\gcd(a_1, a_2)}$ is an integer value.
Footnote 3: Loop splitting is a compiler optimization where dependences
within a loop are eliminated by breaking the loop into multiple loops
with the same bodies and different bounds covering the entire index range
of the original loop.
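The following sketch illustrates the exact SIV idea for integer constants. It first applies the gcd divisibility condition and then, as a deliberate simplification, enumerates the iterations within the loop bounds instead of parameterizing the Diophantine solutions; the methods in [49, 50, 51] do not work this way. The name `exact_siv` is hypothetical.

```python
from math import gcd
from typing import List, Tuple


def exact_siv(a1: int, c1: int, a2: int, c2: int,
              lower: int, upper: int) -> List[Tuple[int, int]]:
    """Find all (i, i') in [lower, upper]^2 with a1*i + c1 == a2*i' + c2.

    An empty list proves independence. The gcd check rejects equations
    with no integer solution at all; the loop then filters solutions by
    the loop bounds (brute force, for illustration only).
    """
    if a1 == 0 or a2 == 0:
        raise ValueError("both coefficients must be non-zero")
    rhs = c2 - c1
    if rhs % gcd(a1, a2) != 0:
        return []                         # gcd(a1, a2) does not divide c2 - c1
    solutions = []
    for i in range(lower, upper + 1):
        numerator = a1 * i - rhs          # a2 * i' must equal this value
        if numerator % a2 == 0:
            i_prime = numerator // a2
            if lower <= i_prime <= upper:
                solutions.append((i, i_prime))
    return solutions


print(exact_siv(2, 0, 4, 1, 1, 10))   # 2i - 4i' = 1: gcd 2 does not divide 1
print(exact_siv(2, 0, 3, 1, 1, 6))    # 2i = 3i' + 1 within [1, 6]
```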
A.4 Multiple Index Variable Test
ZIV and SIV subscripts, when restricted to linear functions of
loop induction variables, are relatively simple linear
mappings from $\mathbb{Z}$ to $\mathbb{Z}$, where $\mathbb{Z}$ is the
set of integers. MIV subscripts, however, are linear mappings from
$\mathbb{Z}^m$ to $\mathbb{Z}$ ($m$ is the number of loop induction
variables in the subscript), and are therefore more complicated. As
such, in order to accurately determine dependences, MIV subscript pairs
require the use of more advanced mathematical methods. This
section explains the general dependence equations used in
the various MIV dependence tests.
MIV tests are useful for loops that take the general form:
for i1 = L1 to U1 do
  for i2 = L2 to U2 do
    ...
      for in = Ln to Un do
        A[f(i1, i2, ..., in)] = ...
        ... = A[g(i1, i2, ..., in)]
Determining whether a dependence exists with a direction
vector $D = (D_1, D_2, \ldots, D_n)$ is equivalent to determining whether
an integer solution to the following equation system exists:
$$f(v_1, v_2, \ldots, v_n) = g(u_1, u_2, \ldots, u_n) \tag{A.10}$$
where $v, u$ are bounded by
$$L_i \le v_i, u_i \le U_i, \quad \forall i, 1 \le i \le n \tag{A.11}$$
and the direction vector adds the restriction
$$v_i \; D_i \; u_i, \quad \forall i, 1 \le i \le n, \text{ and } D_i \in \{<, =, >\} \tag{A.12}$$
The equation has a solution if
$$h(v_1, \ldots, v_n, u_1, \ldots, u_n) = f(v_1, \ldots, v_n) - g(u_1, \ldots, u_n) = 0 \tag{A.13}$$
has an integer solution inside the region defined by A.11
and A.12. However, solving this equation exactly in the
restricted space is very difficult, so compilers settle for an
approximate solution with acceptable precision.
A common simplification is to drop the requirement that the
solution be an integer, which makes the solution space continuous. The
absence of a real-number solution indicates that the
equation cannot have an integer solution, which proves that no
dependence exists. This section will look only at affine
functions of the form:
$$f(v_1, v_2, \ldots, v_n) = a_0 + a_1 v_1 + a_2 v_2 + \ldots + a_n v_n \tag{A.14}$$
$$g(u_1, u_2, \ldots, u_n) = b_0 + b_1 u_1 + b_2 u_2 + \ldots + b_n u_n \tag{A.15}$$
Solving the dependence problem then means finding
solutions in the region defined by A.11 and A.12 to the linear
equation:
$$a_0 - b_0 + a_1 v_1 - b_1 u_1 + a_2 v_2 - b_2 u_2 + \ldots + a_n v_n - b_n u_n = 0 \tag{A.16}$$
There are two important dependence tests used for MIV
subscripts, the GCD test and the Banerjee Inequality test.
The Banerjee Inequality test will not be discussed further.
A.4.1 The Greatest Common Divisor Test
When the terms of equation A.16 are rearranged, it yields:
$$a_1 v_1 - b_1 u_1 + a_2 v_2 - b_2 u_2 + \ldots + a_n v_n - b_n u_n = b_0 - a_0 \tag{A.17}$$
This equation has the standard form of a linear Diophantine
equation. A theorem about these equations is:
Theorem A.4.1 A linear Diophantine equation of the form:
$$a_1 x_1 - b_1 y_1 + a_2 x_2 - b_2 y_2 + \ldots + a_n x_n - b_n y_n = b_0 - a_0$$
has a solution if and only if $\gcd(a_1, a_2, \ldots, a_n, b_1, b_2, \ldots, b_n)$
divides $b_0 - a_0$.
When this theorem is applied to equation A.17,
independence is proved if the GCD of all coefficients $a, b$ does not
divide the constant term $b_0 - a_0$.
Testing for a specific direction vector $D = (D_1, D_2, \ldots, D_n)$
with at least one direction $D_i$ being '=' tightens the condition
that the solution lie within the region defined by A.11 and
A.12, since $v_i = u_i$. Substituting in equation A.17 yields:
$$a_1 v_1 - b_1 u_1 + a_2 v_2 - b_2 u_2 + \ldots + (a_i - b_i) v_i + \ldots + a_n v_n - b_n u_n = b_0 - a_0 \tag{A.18}$$
In this case the GCD should include the term $(a_i - b_i)$
and exclude the terms $a_i, b_i$, making the process more
precise. This also indicates that the strong SIV test discussed
in subsection A.3.1.1 is a special case of this test.
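The GCD test, including the refinement for '=' directions, can be sketched as follows. The function name `gcd_test_independent` and the `equal_dims` parameter (the set of loop positions whose direction is '=') are assumptions introduced for this example.

```python
from functools import reduce
from math import gcd
from typing import Iterable, Sequence


def gcd_test_independent(a: Sequence[int], b: Sequence[int],
                         a0: int, b0: int,
                         equal_dims: Iterable[int] = ()) -> bool:
    """GCD test for f = a0 + sum(a[k]*v[k]) and g = b0 + sum(b[k]*u[k]).

    Returns True when independence is proven, i.e. when the GCD of the
    coefficients does not divide b0 - a0. For every position k listed
    in equal_dims (direction '='), the pair a[k], b[k] is replaced by
    the single coefficient a[k] - b[k].
    """
    equal = set(equal_dims)
    coeffs = []
    for k, (ak, bk) in enumerate(zip(a, b)):
        if k in equal:
            coeffs.append(ak - bk)
        else:
            coeffs.extend((ak, bk))
    g = reduce(gcd, coeffs, 0)
    rhs = b0 - a0
    if g == 0:
        return rhs != 0           # all coefficients vanish: 0 = rhs
    return rhs % g != 0


# A[2i] vs A[2j + 1]: gcd(2, 2) = 2 does not divide 1.
print(gcd_test_independent([2], [2], 0, 1))
# A[i] vs A[i' + 1]: dependence is possible in general ...
print(gcd_test_independent([1], [1], 0, 1))
# ... but not under the '=' direction, where the coefficient becomes 0.
print(gcd_test_independent([1], [1], 0, 1, equal_dims={0}))
```

The last two calls show the gain in precision from testing a specific direction vector.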
A.4.2 Restricted Double Index Variable
RDIV is a special case of MIV subscripts, in which a
pair of subscripts take the form $a_1 i + c_1$ and $a_2 j + c_2$, where
$a_1, a_2$ are constants, $c_1$ and $c_2$ are loop invariants, and $i, j$
are induction variables. This test is an easy adaptation of
the exact SIV test.
A.5 Testing in Coupled Groups
Coupled subscripts are any two subscripts containing
the same index variable. Recognition of coupling is
important since, in multidimensional array references, coupled
subscripts can cause imprecision during dependence
testing. When the tests used for separable subscripts are applied
to each subscript of a coupled group and a test proves
independence, then no dependence exists. However, testing
subscript-by-subscript can also indicate false dependences.
A better approach is to test each subscript separately and
intersect the resulting sets of direction vectors [52]. This
permits efficient testing, but in some cases it conservatively
approximates the set of directions within a coupled
group, yielding non-existent direction vectors. A very
effective strategy for addressing this is to intersect distance
rather than direction vectors. There are many other methods
for multiple subscript tests suggested in research, such as
[53, 54, 55].
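The direction-vector intersection described above can be sketched as follows. Each per-subscript test is assumed to report, for every loop index it involves, the set of directions it could not rule out; the data layout and the names `intersect_directions` and `proves_independence` are hypothetical.

```python
from typing import Dict, Iterable, Set

ALL_DIRECTIONS = frozenset({"<", "=", ">"})


def intersect_directions(
        per_subscript: Iterable[Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Intersect, per loop index, the direction sets of each subscript."""
    merged: Dict[str, Set[str]] = {}
    for result in per_subscript:
        for index, directions in result.items():
            current = merged.get(index, set(ALL_DIRECTIONS))
            merged[index] = current & set(directions)
    return merged


def proves_independence(merged: Dict[str, Set[str]]) -> bool:
    """An empty direction set for any index means no dependence exists."""
    return any(not directions for directions in merged.values())


# A[i + 1][i] vs A[i][i + 1]: the first subscript only allows '<',
# the second only allows '>', so the intersection is empty.
merged = intersect_directions([{"i": {"<"}}, {"i": {">"}}])
print(merged, proves_independence(merged))
```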
A.5.1 Delta Test
This is an intuitive, exact and efficient test used for
testing common coupled subscripts. The main idea behind it is
based on the intuition of distance vector intersection. Given
that SIV subscripts are the most common in practice and
that the SIV tests are easy to perform and give exact
results in most cases, the information obtained by using them
may be used to make it easier to test other subscripts
within the same group.
In the Delta test, each SIV subscript in the coupled group
is examined in order to produce constraints used with other
subscripts of the same group, which usually results in a
simplification that produces a precise set of direction vectors.
The Delta test gets its name from representing the
distance between memory references informally with a $\Delta I$. As
such, the index variable in the source memory reference
is assumed to have a specific value $I$, and the sink memory
reference is assumed to have the same value as the source
with an added distance, $I + \Delta I$.
The Delta test can detect independence if any of the
SIV tests involved detects independence. If not, then it
converts all SIV subscripts into constraints, which in turn are
used with MIV subscripts whenever possible. This process
is repeated until no more constraints can be found. These
constraints are used to simplify RDIV subscripts. All
remaining MIV subscripts are then tested, and the results are
intersected with the other existing constraints.
A.6 Symbolic Tests
This test is intended for cases where the subscript pair
takes the form $a_1 i + c_1$ and $a_2 j + c_2$, where $a_1, a_2$ are constants,
$c_1$ and $c_2$ are loop invariants, and $i, j$ are induction
variables, with bounds $L_1 \le i \le U_1$ and $L_2 \le j \le U_2$.
For a dependence to exist, the subscript pair must point
to the same memory location, within the bounds of both
index variables $i, j$. As such
$$a_1 i + c_1 = a_2 j + c_2 \tag{A.19}$$
$$a_1 i - a_2 j = c_2 - c_1 \tag{A.20}$$
Testing for dependence requires computing $c_2 - c_1$ and
making sure it is in the range between the minimum and maximum
possible values of $a_1 i - a_2 j$. Based on the signs of $a_1, a_2$
there are four possibilities:
• if $a_1 \ge 0$ and $a_2 \ge 0$
$$a_1 L_1 - a_2 U_2 \le c_2 - c_1 \le a_1 U_1 - a_2 L_2 \tag{A.21}$$
• if $a_1 \ge 0$ and $a_2 \le 0$
$$a_1 L_1 - a_2 L_2 \le c_2 - c_1 \le a_1 U_1 - a_2 U_2 \tag{A.22}$$
• if $a_1 \le 0$ and $a_2 \ge 0$
$$a_1 U_1 - a_2 U_2 \le c_2 - c_1 \le a_1 L_1 - a_2 L_2 \tag{A.23}$$
• if $a_1 \le 0$ and $a_2 \le 0$
$$a_1 U_1 - a_2 L_2 \le c_2 - c_1 \le a_1 L_1 - a_2 U_2 \tag{A.24}$$
If $c_2 - c_1$ does not satisfy the inequality from above that
applies according to the signs of $a_1, a_2$, then a dependence
cannot exist.
This test can handle some RDIV cases. It is only capable
of disproving dependence, and it cannot calculate any
distance or direction information. It can be useful in the case
of memory reference pairs that are in two different nested
loops of the same level. Furthermore, these inequalities can
be used to determine dependence in the presence of
symbolic values for $c_1, c_2, L_1, L_2, U_1, U_2$. It serves best as a
backup for the RDIV test.
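The four sign cases above all compute the minimum and maximum of $a_1 i - a_2 j$ over the loop bounds, so for integer values they can be folded into a single min/max computation. The sketch below takes that shortcut; the function name `symbolic_rdiv_independent` is hypothetical, and truly symbolic bounds are not modelled.

```python
def symbolic_rdiv_independent(a1: int, c1: int, a2: int, c2: int,
                              l1: int, u1: int, l2: int, u2: int) -> bool:
    """Bounds test for subscripts a1*i + c1 and a2*j + c2.

    With l1 <= i <= u1 and l2 <= j <= u2, a dependence requires
    c2 - c1 to lie between the minimum and maximum of a1*i - a2*j.
    Returns True when independence is proven. The test can only
    disprove dependence; False means "dependence not ruled out".
    """
    # Extremes of each term occur at a loop bound, whatever the sign.
    term1_min, term1_max = sorted((a1 * l1, a1 * u1))
    term2_min, term2_max = sorted((a2 * l2, a2 * u2))
    lowest = term1_min - term2_max
    highest = term1_max - term2_min
    return not (lowest <= c2 - c1 <= highest)


# A[i] with i in [1, 10] vs A[j + 100] with j in [1, 10]: no overlap.
print(symbolic_rdiv_independent(1, 0, 1, 100, 1, 10, 1, 10))
# A[i] vs A[j + 5] over the same bounds: dependence cannot be ruled out.
print(symbolic_rdiv_independent(1, 0, 1, 5, 1, 10, 1, 10))
```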
It can also be used for dependencies in the same loop
after setting $L_2 = L_1$ and $U_2 = U_1$. This test can handle some
SIV cases when $i, j$ are equal (i.e., the same variable).
Substituting in the above equations results in the following
two inequalities:
• if $a_1$ and $a_2$ have the same sign
$$a_1 L_1 - a_2 U_1 \le |c_2 - c_1| \le a_1 U_1 - a_2 L_1 \tag{A.25}$$
• if $a_1$ and $a_2$ have different signs
$$L_1 \le \left| \frac{c_2 - c_1}{a_1 - a_2} \right| \le U_1 \tag{A.26}$$
Bibliography
[1] D. Culler et al. “LogP: towards a realistic model of
parallel computation”. In: SIGPLAN Not. 28.7 (1993),
pp. 1–12.
[2] U. Vishkin. “Using simple abstraction to guide the
reinvention of computing for parallelism”. In: CACM 54.1
(2011), pp. 75–85.
[3] X. Wen and U. Vishkin. “FPGA-based prototype of a
PRAM-on-chip processor”. In: Proc. ACM Computing
Frontiers. 2008.
[4] D. Naishlos et al. “Towards a first vertical prototyping
of an extremely fine-grained parallel programming
approach”. In: Proc. 13th annu. ACM symp. on Parallel
algorithms and architectures (SPAA). 2001.
[5] U. Vishkin et al. “Explicit multi-threading (XMT) bridging
models for instruction parallelism (extended abstract)”.
In: Proc. 10th annu. ACM symp. on Parallel algorithms
and architectures (SPAA). 1998.
[6] U. Vishkin. “Prefix sums & an application thereof.” In:
U.S. Patent 6 542 918 (2003).
[7] A. Balkan et al. “Layout-Accurate Design and
Implementation of a High-Throughput Interconnection Network for
Single-Chip Parallel Processing”. In: Los Alamitos, CA,
USA: IEEE Computer Society, 2007.
[8] A. Balkan, G. Qu, and U. Vishkin. “An Area-Efficient
High-Throughput Hybrid Interconnection Network for
Single-chip Parallel Processing”. In: 45th Design Automation
Conference. Anaheim, CA, June 2008.
[9] U. Vishkin. “Spawn-join instruction set architecture for
providing explicit multi-threading (XMT)”. In: U.S. Patent
6 463 527 (2002).
[10] G. C. Caragea et al. “Resource-Aware Compiler
Prefetching for Many-Cores”. In: International Symposium on
Parallel and Distributed Computing. 2010.
[11] X. Wen and U. Vishkin. “PRAM-on-chip: first
commitment to silicon”. In: SPAA ’07: Proceedings of the
nineteenth annual ACM symposium on Parallel algorithms
and architectures. San Diego, California, USA: ACM
Press, 2007, pp. 301–302.
[12] A. O. Balkan and U. Vishkin. Programmer’s Manual
for XMTC Language, XMTC Compiler and XMT
Simulator. Tech. rep. UMIACS-TR-2005-45. University of
Maryland Institute for Advanced Computer Studies (UMIACS),
2006. URL: http://www.umiacs.umd.edu/users/vishkin/XMT/manual4xmtc1out-of2.pdf.
[13] A. Gottlieb et al. “The NYU Ultracomputer: designing
a MIMD, shared-memory parallel machine (Extended
Abstract)”. In: ISCA ’82: Proceedings of the 9th annual
symposium on Computer Architecture. Austin, Texas,
United States: IEEE Computer Society Press, 1982,
pp. 27–42.
[14] G. C. Caragea, B. C. Lee, and U. Vishkin. Handbook of
Parallel Computing: Models, Algorithms and Applications,
Chapter Models for Advancing PRAM and Other
Algorithms into Parallel Programs for a PRAM-On-Chip
Platform. 1st ed.
[15] J. A. Edwards and U. Vishkin. “Better Speedups
Using Simpler Parallel Programming for Graph Connectivity
and Biconnectivity”. In: Proceedings of the 2012
International Workshop on Programming Models and
Applications for Multicores and Manycores (PMAM). ACM, 2012,
pp. 103–114.
[16] J. A. Edwards and U. Vishkin. “Brief Announcement:
Truly Parallel Burrows-Wheeler Compression and
Decompression”. In: Proceedings of the Twenty-fifth Annual
ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA). ACM, 2013, pp. 93–96.
[17] Z. He and B. Hong. “Dynamically tuned push-relabel
algorithm for the maximum flow problem on
cpu-gpu-hybrid platforms”. In: Proc. 24th IEEE Int. Parallel and
Distributed Processing Symp. 2010.
[18] G. Caragea and U. Vishkin. “Better speedups for
parallel max-flow”. In: Proc. 23rd ACM Symp. on Parallel
Algorithms and Architectures (SPAA). 2011.
[19] J. A. Edwards and U. Vishkin. “Parallel algorithms for
Burrows-Wheeler compression and decompression”. In:
Theor. Comput. Sci. 525 (2014), pp. 10–22.
[20] A. Saybasili et al. “Highly parallel multi-dimensional
fast Fourier transform on fine- and coarse-grained
many-core approaches”. In: Proc. 21st Conference on
Parallel and Distributed Computing Systems (PDCS).
Cambridge, MA, Nov. 2009.
[21] P. Gu and U. Vishkin. “Case study of gate-level logic
simulation on an extremely fine-grained chip
multiprocessor”. In: J. Embedded Comp. 2 (2006), pp. 181–190.
[22] S. Torbert et al. “Is Teaching Parallel Algorithmic
Thinking to High School Students Possible?: One Teacher’s
Experience”. In: Proceedings of the 41st ACM Technical
Symposium on Computer Science Education. SIGCSE
’10. Milwaukee, Wisconsin, USA, 2010, pp. 290–294.
[23] L. Hochstein et al. “A Pilot Study to Compare
Programming Effort for Two Parallel Programming Models”.
In: J. Syst. Softw. 81.11 (Nov. 2008), pp. 1920–1930.
[24 ] D. Padua, U. V ishk in , and J. Car ver . “Jo in t UIUC/UMD
para l le l a lgor i thms/programming course” . In : Proc . 1s t
NSF/TCPP Workshop on Para l le l and Dis t r ibu ted Com-
put ing Educat ion (EduPar -11) . in con junc t ion wi th 25th
IEEE In t . Para l le l and Dis t r ibu ted Process ing Symp. ,
2011.
[25 ] J. JaJa. An In t roduc t ion to Para l le l A lgor i thms . Add ison-
Wes ley Publ ish ing Company, 1992.
[26 ] J. Ke l le r , C. Kess le r , and J. Trae ff . Prac t ica l PRAM
Programming . Wi ley- In te rsc ience, 2001.
[27 ] U. V ishk in . “Th ink ing in Para l le l : Some Bas ic Data-
Para l le l A lgor i thms and Techn iques - Course c lass notes” .
U R L : http://www.umiacs.umd.edu/users/vishkin/PUBLICATIONS/
classnotes.pdf .
[28 ] R. E. Tar jan and U. V ishk in . “An Effic ien t Para l le l B i -
connect iv i t y A lgor i thm” . In : SIAM J. Comput . 14.4 (1985) ,
pp. 862–874.
[29 ] U. V ishk in , G. Caragea, and B. Lee. “Mode ls fo r Ad-
vanc ing PRAM and Other A lgor i thms in to Para l le l Pro-
grams for a PRAM-On-Ch ip P la t fo r m. In Handbook on
Para l le l Comput ing (Ed i to rs : S. Ra jasekaran, J. Re i f ) ” .
In : Chapman and Hal l /CRC Press , 2008.
133
[30 ] Y. Sh i loach and U. V ishk in . “An O (n2 l og n) para l le l
max-flow algor i thm” . In : J. A lgor i thms 3 (1982) , pp. 128–
146.
[31 ] The Chape l Para l le l Programming Language . U R L : http:
//chapel.cray.com/ .
[32 ] Pro jec t For t ress . U R L : http : / / projectfortress . java .
net/ .
[33 ] X10: Per fo r mance and Product iv i t y a t Sca le . U R L : http:
//x10- lang.org/ .
[34 ] W. Ch ing and D. Ju . “Execut ion of au tomat ica l l y para l -
le l i zed API programs on RP3” . In : IBM J. o f research
and Deve lopment 35 (5 /6 1991) , pp. 767–778.
[35 ] G. E. B le l loch . Vector Mode ls fo r Data-Para l le l Comput -
ing . MIT Press , 1990.
[36 ] G. B le l loch and J. Gre iner . “A Provable T ime and
Space Effic ien t Imp lementa t ion of NESL” . In : ACM SIG-
PLAN In t . Conf. on Funct iona l Programming . 1996.
[37 ] G. E. B le l loch . “Programming para l le l a lgor i thms” . In :
Commun. ACM 39 (3 Mar. 1996) , pp. 85–97. I S S N : 0001-
0782.
[38 ] W. D. Hi l l i s and G. L . Stee le, J r. “Data Para l le l A l -
gor i thms” . In : Commun. ACM 29.12 (1986) , pp. 1170–
1183.
[39 ] K. M. Chandy and J. Mis ra . Para l le l Program Des ign :
A Foundat ion . Add ison Wes ley , 1988.
[40 ] Ar v ind , R. S. N ikh i l , and K. K. P inga l i . “ I -S t ruc tures :
da ta s t ruc tures for para l le l comput ing” . In : ACM Trans.
on Programming Languages and Syst . 11.4 (Oct . 1989) ,
pp. 598–632.
[41 ] J. T. Feo, D. C. Cann, and R. R. Oldehoef t . “A Repor t
on the Sisa l Language Pro jec t ” . In : J. o f Para l le l and
Dis t r ibu ted Comput ing 10.4 (Dec. 1990) , pp. 349–366.
[42 ] P. Mi l l s e t a l . “Pro to typ ing para l le l and d is t r ibu ted pro-
grams in Pro teus” . In : Symp. Para l le l and Dis t r ibu ted
Process ing 1991 . IEEE Comput . Soc .
134
[43 ] M. Fr igo , C. E. Le iserson, and K. H. Randa l l . “The
implementa t ion of the Ci lk -5 mul t i th readed language” .
In : PLDI ’98 : Proceed ings of the ACM SIGPLAN 1998
conference on Programming language des ign and im-
p lementa t ion . Mont rea l , Quebec, Canada: ACM Press ,
1998, pp. 212–223. I S B N : 0 -89791-987-4 . D O I : http : / /
doi.acm.org/10.1145/277650.277725 .
[44 ] The MIT Ci lk home page: h t tp : / /super tech .csa i l .m i t .edu/c i l k / .
[45 ] A . Tzannes et a l . “Lazy b inar y -sp l i t t ing : a run- t ime
adapt ive work-s tea l ing schedu ler ” . In : Proc . 15 th ACM
SIGPLAN symp. on Pr inc ip les and prac t ice o f para l le l
p rogramming (PPOPP) . 2010.
[46 ] M. Fr igo et a l . “Reducers and other C i lk++ hyperob-
jec ts ” . In : Proc . 21s t Annu. ACM Symp. on Para l le l i sm
in A lgor i thms and Arch i tec tu res (SPAA) . 2009.
[47 ] T. Cor men et a l . In t roduc t ion to A lgor i thms, 3rd Ed.
MIT Press , 2009.
[48 ] C. S. Michae l Joseph Wol fe and L. Or tega. High Per -
fo r mance Compi le rs fo r Para l le l Comput ing . Bos ton , MA,
USA: Add ison-Wes ley Longman Publ ish ing Co. , Inc . , 1995.
I S B N : 0805327304.
[49 ] U. K. Baner jee. Dependence Ana lys is fo r Supercom-
put ing . Norwe l l , MA, USA: K luwer Academic Publ ishers,
1988. I S B N : 0898382890.
[50 ] W. L . Cohagen. “Vec tor op t im iza t ion for the asc” . In :
1973.
[51 ] M. J. Wol fe . High Per fo r mance Compi le rs fo r Para l le l
Comput ing . Ed . by C. Shank l in and L. Or tega. Boston ,
MA, USA: Add ison-Wes ley Longman Publ ish ing Co. , Inc . ,
1995. I S B N : 0805327304.
[52 ] M. J. Wol fe . “Opt im iz ing Supercompi le rs fo r Supercom-
puters ” . AAI8303027. PhD thes is. Champaign , IL , USA,
1982.
[53 ] Z . L i , P. -C. Yew, and C. -Q. Zhu. “Data Dependence
Ana lys is on Mul t i -d imens iona l Ar ray References” . In :
Proceed ings of the 3rd In te r na t iona l Conference on
Supercomput ing . ICS ’89 . Cre te, Greece: ACM, 1989,
pp. 215–224.
135
[54 ] D. R. Wal lace. “Dependence of Mul t i -d imens iona l Ar -
ray References” . In : Proceed ings of the 2Nd In ter na-
t iona l Conference on Supercomput ing . ICS ’88 . St . Ma lo,
France: ACM, 1988, pp. 418–428.
[55 ] M. Wol fe and C. W. Tseng. “The Power Test fo r Data
Dependence” . In : IEEE Trans. Para l le l D is t r ib. Sys t . 3.5
(Sept . 1992) , pp. 591–601.
136
